image

paper

TL;DR

  • I read this because : background study
  • task : RL
  • problem : q-learning is too unstable, trpo is relatively complicated. Is there a data efficient and sclable arch?
  • idea : Replace the KL-divergence term with clipping; alternatively, vary the penalty coefficient $\beta$ adaptively.
  • input/output : state -> action
  • architecture : MLP
  • objective : $L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\hat{A}_t) \right]$
  • baseline : loss(no clipping, KL penalty), A2C, CEM, TRPO
  • data : OpenAI Gym(Mujoco), human control task(Roboschool), Atari
  • evaluation : don’t know
  • result : good (outperforms the other surrogate objectives and prior algorithms on most tasks)
  • contribution : Simple and intuitive loss… Actually, this paper makes sense on its own, so I probably didn’t even need to know the earlier work… haha
  • etc. :

Details

preliminary

  • policy gradient method image

The gradient is the derivative of the log-probability of the policy network, weighted by the advantage, where the expectation $\hat{\mathbb{E}}_t$ just means sampling trajectories and taking the average.
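Reconstructing the estimator the image above likely showed (the standard policy gradient from the paper):

$$\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right]$$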

image

The loss should maximize the expected advantage-weighted log-probability of trajectories sampled from the current policy. However, performance degrades badly when policy updates are too large.

  • Trust Region methods : TRPO proved that optimizing a surrogate objective updates the policy with a guaranteed (monotonic) performance improvement (https://github.com/long8v/PTIR/issues/154). This introduces a constraint on how far the policy can move per update, giving an objective that multiplies the importance ratio between the new and old policy networks by the advantage. image
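Reconstructing the constrained objective the image above likely showed (TRPO's surrogate with a KL trust-region constraint):

$$\max_\theta \; \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\hat{A}_t\right] \quad \text{s.t.} \quad \hat{\mathbb{E}}_t\left[\mathrm{KL}\left[\pi_{\theta_{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\right]\right] \le \delta$$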

Alternatively, the constraint can be replaced with a penalty term (not an actual constraint), where $\beta$ is a hyperparameter. image
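Reconstructing the penalized (unconstrained) form the image likely showed:

$$\max_\theta \; \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\hat{A}_t - \beta\, \mathrm{KL}\left[\pi_{\theta_{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\right]\right]$$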

The problem this paper raises with TRPO's penalized form is that no single fixed $\beta$ works well across different problems, or even throughout one training run.

Clipped Surrogate Objective

image image

In conclusion, we clip the ratio $r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)}$ so that it cannot become too large or too small.
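A minimal numpy sketch of this clipped objective; the function name and array shapes are my own, and $\epsilon = 0.2$ is the paper's default:

```python
import numpy as np

def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """L^CLIP: elementwise min of the unclipped and clipped
    ratio-weighted advantage, averaged over the batch."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return np.minimum(unclipped, clipped).mean()
```

With a positive advantage and ratio 1.5, the objective is capped at $1.2 \cdot \hat{A}$; with a negative advantage and ratio 0.5, the min picks the clipped (more pessimistic) value.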

image

This makes the clipped objective a pessimistic (lower) bound on the unclipped one, so the policy network fluctuates much less between updates.

image

Adaptive KL Penalty Coefficient

image

Wow, this is so intuitive and simple.
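A sketch of the paper's adaptive rule: after each update, compare the observed KL divergence $d$ to a target $d_{targ}$, halving $\beta$ when $d$ is well below target and doubling it when well above (1.5 and 2 are the paper's heuristic constants; the function name is my own):

```python
def update_beta(beta, kl, kl_target):
    """Adaptive KL penalty coefficient update:
    shrink beta when the policy moved too little,
    grow it when the policy moved too much."""
    if kl < kl_target / 1.5:
        beta /= 2.0
    elif kl > kl_target * 1.5:
        beta *= 2.0
    return beta
```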

Algorithm

Above, we decided how to compute the gradient update for the policy; the rest is standard. The variance of the reward is reduced by introducing $V(s)$, i.e. the final objective combines the policy surrogate with a value-function error term, plus an entropy bonus to incentivize exploration. image
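Reconstructing the combined objective the image likely showed (clipped surrogate, value-function loss, entropy bonus, with coefficients $c_1, c_2$):

$$L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\left[L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t)\right], \quad L_t^{VF}(\theta) = \left(V_\theta(s_t) - V_t^{targ}\right)^2$$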

The advantage is the discounted sum of rewards from the current step, minus the current $V$, plus the discounted future $V$: $\hat{A}_t = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t+1} r_{T-1} + \gamma^{T-t} V(s_T)$. image

If I remember correctly, this is truncated generalized advantage estimation (GAE), a technique that trades off bias and variance over a fixed-length segment, giving something like this: image
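Assuming this is the paper's truncated GAE, a sketch of the backward recursion $\hat{A}_t = \delta_t + (\gamma\lambda)\hat{A}_{t+1}$ with $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ (function name and argument layout are my own):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Truncated GAE over a fixed-length segment.
    `values` has length T+1: it includes the bootstrap V(s_T)."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        # one-step TD error at time t
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # accumulate discounted TD errors backwards in time
        last = delta + gamma * lam * last
        adv[t] = last
    return adv
```

Setting $\lambda = 1$ recovers the plain discounted-return advantage above, while $\lambda = 0$ reduces to the one-step TD error (lowest variance, highest bias).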

Experiments

  • loss term : comparison of the surrogate objectives (no clipping / clipping / KL penalty) image

  • other RL algorithms : comparison against A2C, CEM, TRPO, etc.; I don’t even know what to say image