TL;DR
- I read this because : background knowledge
- task : RL
- problem : q-learning is too unstable, trpo is relatively complicated. Is there a data efficient and sclable arch?
- idea : Use clipping instead of a KL-divergence term; alternatively, adapt the penalty coefficient $\beta$ each update instead of fixing it.
- input/output : state -> action
- architecture : MLP
- objective : $L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\hat{A}_t) \right]$
- baseline : the same objective without clipping / with a KL penalty, A2C, CEM, TRPO
- data : OpenAI Gym (MuJoCo), humanoid control tasks (Roboschool), Atari
- evaluation : don’t know
- result : good — PPO beats the baselines on most continuous-control tasks and is competitive on Atari, especially in sample efficiency
- contribution : Simple and intuitive loss… Honestly, this paper makes sense on its own, so I'm not sure I even needed the earlier background, haha
- etc. :
Details
Preliminary
- policy gradient method
The policy gradient is the gradient of the advantage-weighted log-probability of the policy network, where $\hat{\mathbb{E}}_t$ denotes an empirical average over a batch of samples:
$L^{PG}(\theta) = \hat{\mathbb{E}}_t\left[\log \pi_\theta(a_t | s_t)\hat{A}_t\right]$
The loss should maximize this expectation over trajectories collected under the current policy. However, running multiple optimization steps on the same trajectories produces destructively large policy updates, and performance collapses.
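As a minimal NumPy sketch of the vanilla policy-gradient surrogate (the function name and shapes are my own, not from the paper):

```python
import numpy as np

def pg_loss(log_probs, advantages):
    """Vanilla policy-gradient surrogate: maximize E[log pi(a|s) * A].

    Returned negated, so it can be fed to a minimizer.
    log_probs  -- log pi_theta(a_t | s_t) for the sampled actions, shape (T,)
    advantages -- advantage estimates A_hat_t, shape (T,)
    """
    return -np.mean(log_probs * advantages)
```

In a real implementation `log_probs` would come from the policy network and the gradient would flow through it; here it is just an array for illustration.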
- Trust Region methods
TRPO proved that optimizing a surrogate objective can guarantee monotonic policy improvement (https://github.com/long8v/PTIR/issues/154), and it enforces a trust-region constraint on how far the policy may move per update. The surrogate multiplies the importance weight between the new and old policy networks by the advantage: maximize $\hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat{A}_t\right]$ subject to $\hat{\mathbb{E}}_t\left[\text{KL}\left[\pi_{\theta_{old}}(\cdot|s_t), \pi_\theta(\cdot|s_t)\right]\right] \le \delta$.
Alternatively, one can add a penalty term instead of a hard constraint, $\hat{\mathbb{E}}_t\left[r_t(\theta)\hat{A}_t - \beta\,\text{KL}\left[\pi_{\theta_{old}}(\cdot|s_t), \pi_\theta(\cdot|s_t)\right]\right]$, where $\beta$ is a hyperparameter.
The problem with TRPO raised in this paper is that no single $\beta$ works well across problems, or even throughout a single training run.
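A minimal NumPy sketch of the penalized surrogate for categorical policies (function names and shapes are my own assumptions):

```python
import numpy as np

def kl_categorical(p_old, p_new):
    """KL(p_old || p_new) per timestep for batched categorical
    distributions of shape (T, num_actions)."""
    return np.sum(p_old * (np.log(p_old) - np.log(p_new)), axis=-1)

def penalized_surrogate(ratio, adv, kl, beta):
    """TRPO-style penalized objective: E[r_t * A_t - beta * KL_t]."""
    return np.mean(ratio * adv - beta * kl)
```

The larger `beta` is, the more the objective punishes moving the new policy away from the old one — which is exactly why picking one fixed value is hard.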
Clipped Surrogate Objective
In short, we clip the probability ratio $r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)}$ so that it cannot move too far from 1, and take the minimum of the clipped and unclipped terms, giving a pessimistic lower bound on the unclipped objective.
This results in less fluctuation in the policy network.
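The clipped objective from $L^{CLIP}$ above, as a minimal NumPy sketch (the function name is my own):

```python
import numpy as np

def clipped_surrogate(ratio, adv, eps=0.2):
    """L^CLIP: E[min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)].

    Taking the elementwise min with the unclipped term makes this a
    pessimistic lower bound, so a single overly-large ratio cannot
    dominate the update.
    """
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.mean(np.minimum(unclipped, clipped))
```

Note the asymmetry: with a positive advantage the ratio's contribution is capped at $1+\epsilon$, and with a negative advantage the `min` keeps the worse (more negative) of the two terms.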
Adaptive KL Penalty Coefficient
As an alternative to clipping, penalize the KL divergence and adapt $\beta$ after each policy update: if the measured KL exceeds 1.5× the target $d_{targ}$, double $\beta$; if it falls below $d_{targ}/1.5$, halve it. Wow, this is so intuitive and simple.
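The adaptive rule really is a few lines; a sketch using the paper's thresholds (the function name is my own):

```python
def adapt_beta(beta, measured_kl, d_targ):
    """Adaptive KL penalty coefficient update from the PPO paper:
    double beta if the measured KL overshoots 1.5x the target,
    halve it if it undershoots target / 1.5, otherwise leave it."""
    if measured_kl > 1.5 * d_targ:
        return beta * 2.0
    if measured_kl < d_targ / 1.5:
        return beta / 2.0
    return beta
```

The paper notes the constants 1.5 and 2 are heuristic and the algorithm is not sensitive to them.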
Algorithm
Above, we've decided how to update the policy; the rest of the algorithm is standard.
Introducing $V(s)$ reduces the variance of the reward signal, so the full objective combines the clipped policy surrogate with a value-function error term, plus an entropy bonus to incentivize more exploration:
$L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\left[L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t)\right]$
The advantage is the discounted sum of rewards minus the current value estimate, plus the discounted bootstrap value: $\hat{A}_t = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t+1} r_{T-1} + \gamma^{T-t} V(s_T)$.
The paper actually uses a truncated version of generalized advantage estimation (GAE), which trades off bias and variance over a fixed $T$-step segment: $\hat{A}_t = \delta_t + (\gamma\lambda)\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t+1}\delta_{T-1}$, where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$.
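The truncated GAE recursion above can be sketched in NumPy (the function name and the `(T+1,)` value layout are my own assumptions):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Truncated generalized advantage estimation over one T-step segment.

    rewards -- r_0 .. r_{T-1}, shape (T,)
    values  -- V(s_0) .. V(s_T), shape (T+1,); the last entry bootstraps
               the return beyond the segment.
    """
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    # Walk backwards so each A_hat_t accumulates (gamma * lam)-discounted TD errors.
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        last = delta + gamma * lam * last
        adv[t] = last
    return adv
```

With `lam=1` this recovers the plain discounted-return advantage; with `lam=0` it reduces to the one-step TD error, which is the bias/variance knob.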
Experiments
loss term : ablation over surrogate objectives — no clipping/penalty, clipping with several $\epsilon$ values, and fixed/adaptive KL penalty; clipping with $\epsilon = 0.2$ scored best
other RL algorithms : compared against A2C, CEM, vanilla PG variants, and TRPO, PPO comes out on top on almost every continuous-control environment. I don't even know what to say