TL;DR
- I read this because : background knowledge
- task : RL
- problem : q-learning is too unstable, trpo is relatively complicated. Is there a data efficient and sclable arch?
- idea : Use clipping instead of a KL-divergence term; alternatively, adapt the penalty coefficient $\beta$ each update instead of fixing it.
- input/output : state -> action
- architecture : MLP
- objective : $L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\hat{A}_t) \right]$
- baseline : the same objective without clipping / with a KL penalty, A2C, CEM, TRPO
- data : OpenAI Gym (MuJoCo), humanoid control tasks (Roboschool), Atari
- evaluation : don’t know
- result : good — PPO beats the baselines on most continuous-control tasks and is competitive on Atari, especially in sample efficiency
- contribution : Simple and intuitive loss… Honestly, this paper makes sense on its own, so I'm not sure I even needed the earlier background, haha
- etc. :
Details
Preliminary
- policy gradient method
The policy gradient is the gradient of the advantage-weighted log-probability of the policy network, where $\hat{\mathbb{E}}_t$ denotes an empirical average over a batch of samples:
$L^{PG}(\theta) = \hat{\mathbb{E}}_t\left[\log \pi_\theta(a_t | s_t)\hat{A}_t\right]$
The loss should maximize this expectation over trajectories collected under the current policy. However, running multiple optimization steps on the same trajectories produces destructively large policy updates, and performance collapses.
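As a minimal NumPy sketch of the vanilla policy-gradient surrogate (the function name and shapes are my own, not from the paper):

```python
import numpy as np

def pg_loss(log_probs, advantages):
    """Vanilla policy-gradient surrogate: maximize E[log pi(a|s) * A].

    Returned negated, so it can be fed to a minimizer.
    log_probs  -- log pi_theta(a_t | s_t) for the sampled actions, shape (T,)
    advantages -- advantage estimates A_hat_t, shape (T,)
    """
    return -np.mean(log_probs * advantages)
```

In a real implementation `log_probs` would come from the policy network and the gradient would flow through it; here it is just an array for illustration.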
- Trust Region methods
TRPO proved that optimizing a surrogate objective can guarantee monotonic policy improvement (https://github.com/long8v/PTIR/issues/154), and it enforces a trust-region constraint on how far the policy may move per update. The surrogate multiplies the importance weight between the new and old policy networks by the advantage: maximize $\hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\hat{A}_t\right]$ subject to $\hat{\mathbb{E}}_t\left[\text{KL}\left[\pi_{\theta_{old}}(\cdot|s_t), \pi_\theta(\cdot|s_t)\right]\right] \le \delta$.
Alternatively, one can add a penalty term instead of a hard constraint, $\hat{\mathbb{E}}_t\left[r_t(\theta)\hat{A}_t - \beta\,\text{KL}\left[\pi_{\theta_{old}}(\cdot|s_t), \pi_\theta(\cdot|s_t)\right]\right]$, where $\beta$ is a hyperparameter.
The problem with TRPO raised in this paper is that no single $\beta$ works well across problems, or even throughout a single training run.
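A minimal NumPy sketch of the penalized surrogate for categorical policies (function names and shapes are my own assumptions):

```python
import numpy as np

def kl_categorical(p_old, p_new):
    """KL(p_old || p_new) per timestep for batched categorical
    distributions of shape (T, num_actions)."""
    return np.sum(p_old * (np.log(p_old) - np.log(p_new)), axis=-1)

def penalized_surrogate(ratio, adv, kl, beta):
    """TRPO-style penalized objective: E[r_t * A_t - beta * KL_t]."""
    return np.mean(ratio * adv - beta * kl)
```

The larger `beta` is, the more the objective punishes moving the new policy away from the old one — which is exactly why picking one fixed value is hard.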
Clipped Surrogate Objective
In short, we clip the probability ratio $r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)}$ so that it cannot move too far from 1, and take the minimum of the clipped and unclipped terms, giving a pessimistic lower bound on the unclipped objective.
This results in less fluctuation in the policy network.
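The clipped objective from $L^{CLIP}$ above, as a minimal NumPy sketch (the function name is my own):

```python
import numpy as np

def clipped_surrogate(ratio, adv, eps=0.2):
    """L^CLIP: E[min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)].

    Taking the elementwise min with the unclipped term makes this a
    pessimistic lower bound, so a single overly-large ratio cannot
    dominate the update.
    """
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.mean(np.minimum(unclipped, clipped))
```

Note the asymmetry: with a positive advantage the ratio's contribution is capped at $1+\epsilon$, and with a negative advantage the `min` keeps the worse (more negative) of the two terms.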
Adaptive KL Penalty Coefficient
As an alternative to clipping, penalize the KL divergence and adapt $\beta$ after each policy update: if the measured KL exceeds 1.5× the target $d_{targ}$, double $\beta$; if it falls below $d_{targ}/1.5$, halve it. Wow, this is so intuitive and simple.
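The adaptive rule really is a few lines; a sketch using the paper's thresholds (the function name is my own):

```python
def adapt_beta(beta, measured_kl, d_targ):
    """Adaptive KL penalty coefficient update from the PPO paper:
    double beta if the measured KL overshoots 1.5x the target,
    halve it if it undershoots target / 1.5, otherwise leave it."""
    if measured_kl > 1.5 * d_targ:
        return beta * 2.0
    if measured_kl < d_targ / 1.5:
        return beta / 2.0
    return beta
```

The paper notes the constants 1.5 and 2 are heuristic and the algorithm is not sensitive to them.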
Algorithm
Above, we've decided how to update the policy; the rest of the algorithm is standard.
Introducing $V(s)$ reduces the variance of the reward signal, so the full objective combines the clipped policy surrogate with a value-function error term, plus an entropy bonus to incentivize more exploration:
$L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\left[L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t)\right]$
The advantage is the discounted sum of rewards minus the current value estimate, plus the discounted bootstrap value: $\hat{A}_t = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t+1} r_{T-1} + \gamma^{T-t} V(s_T)$.
The paper actually uses a truncated version of generalized advantage estimation (GAE), which trades off bias and variance over a fixed $T$-step segment: $\hat{A}_t = \delta_t + (\gamma\lambda)\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t+1}\delta_{T-1}$, where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$.
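The truncated GAE recursion above can be sketched in NumPy (the function name and the `(T+1,)` value layout are my own assumptions):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Truncated generalized advantage estimation over one T-step segment.

    rewards -- r_0 .. r_{T-1}, shape (T,)
    values  -- V(s_0) .. V(s_T), shape (T+1,); the last entry bootstraps
               the return beyond the segment.
    """
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    # Walk backwards so each A_hat_t accumulates (gamma * lam)-discounted TD errors.
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        last = delta + gamma * lam * last
        adv[t] = last
    return adv
```

With `lam=1` this recovers the plain discounted-return advantage; with `lam=0` it reduces to the one-step TD error, which is the bias/variance knob.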
Experiments
loss term : ablation over surrogate objectives — no clipping/penalty, clipping with several $\epsilon$ values, and fixed/adaptive KL penalty; clipping with $\epsilon = 0.2$ scored best
other RL algorithms : compared against A2C, CEM, vanilla PG variants, and TRPO, PPO comes out on top on almost every continuous-control environment. I don't even know what to say