TL;DR

I read this because.. : CS285 final project
task : reinforcement learning
problem : Is there a policy update method that is theoretically guaranteed to always improve performance?
idea : Take the lower bound proved for conservative policy iteration, derive it for general policy networks, and maximize this lower bound as a surrogate function.
input/output : {s, a, r, … } -> policy
architecture : conv + linear
baseline : deep Q-learning
result : Decent performance, but not much better than deep Q-learning.
contribution : The predecessor of PPO
etc. :
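
For reference, a sketch of the lower bound that the idea line refers to, assuming this note is about TRPO (the note does not name the paper, so this is inferred from "predecessor of PPO"). The symbols below follow that paper's notation and do not appear elsewhere in this note: \eta is the expected discounted return, A_\pi the advantage, \rho_\pi the discounted state visitation frequency, and \tilde\pi the candidate new policy.

```latex
% Surrogate objective: a local approximation of \eta(\tilde\pi) around the current policy \pi
L_\pi(\tilde\pi) = \eta(\pi) + \sum_s \rho_\pi(s) \sum_a \tilde\pi(a \mid s)\, A_\pi(s, a)

% Lower bound on the true return of the new policy
\eta(\tilde\pi) \ge L_\pi(\tilde\pi) - C\, D_{\mathrm{KL}}^{\max}(\pi, \tilde\pi),
\qquad C = \frac{4 \epsilon \gamma}{(1 - \gamma)^2},
\quad \epsilon = \max_{s, a} \lvert A_\pi(s, a) \rvert
```

At \tilde\pi = \pi the right-hand side equals \eta(\pi), so any \tilde\pi that maximizes it cannot have a lower true return than \pi. That is the sense in which the update is guaranteed not to hurt performance.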

Details