
paper

TL;DR

  • I read this because.. : CS285 Final Assignment
  • task : reinforcement learning
  • problem : Is there a policy update rule that is guaranteed, in theory, to improve performance at every step (monotonic improvement)?
  • idea : Extend the lower bound on policy improvement proved for conservative policy iteration to general stochastic policies (e.g. a policy network), then maximize that lower bound as a surrogate objective.
  • input/output : {s, a, r, … } -> policy
  • architecture : conv + linear
  • baseline : deep Q-learning
  • result : Reasonable performance, but not much better than deep Q-learning
  • contribution : Predecessor of PPO
  • etc. :
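
The idea above can be sketched in a few lines: estimate the surrogate with importance ratios, measure the average KL divergence between the old and the candidate policy, and accept the candidate only if the surrogate improves while the KL stays inside a trust region. This is a minimal illustration for tabular categorical policies, not the paper's implementation; the function names, the toy inputs, and the threshold `delta` are all assumptions made for this example.

```python
import math


def surrogate(old_probs, new_probs, actions, advantages):
    """Sample mean of (pi_new(a|s) / pi_old(a|s)) * A(s, a)."""
    total = 0.0
    for p_old, p_new, a, adv in zip(old_probs, new_probs, actions, advantages):
        total += (p_new[a] / p_old[a]) * adv
    return total / len(actions)


def mean_kl(old_probs, new_probs):
    """Average KL(pi_old(.|s) || pi_new(.|s)) over the sampled states."""
    total = 0.0
    for p_old, p_new in zip(old_probs, new_probs):
        total += sum(po * math.log(po / pn)
                     for po, pn in zip(p_old, p_new) if po > 0.0)
    return total / len(old_probs)


def accept_update(old_probs, new_probs, actions, advantages, delta=0.01):
    """Accept the candidate only if the surrogate improves over the old
    policy and the mean KL stays within the (hypothetical) threshold delta."""
    gain = (surrogate(old_probs, new_probs, actions, advantages)
            - surrogate(old_probs, old_probs, actions, advantages))
    return gain > 0.0 and mean_kl(old_probs, new_probs) <= delta


# Toy batch: two states, two actions, one sampled action per state.
old = [[0.5, 0.5], [0.5, 0.5]]
new = [[0.6, 0.4], [0.5, 0.5]]
acts = [0, 1]
advs = [1.0, -1.0]
```

With this toy batch, `accept_update(old, new, acts, advs, delta=0.05)` accepts the candidate, while a much tighter `delta=0.001` rejects it because the step leaves the trust region.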

Details

TRPO.pptx

  • objective : [Screenshot 2023-12-25, 8:53:45 PM]
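
For reference, the objective can be written out as follows. This is a sketch in the notation of the TRPO paper as I recall it (η is the expected discounted return, ρ the discounted state visitation frequency, A the advantage, δ the trust-region size); none of these symbols appear in the notes above, and the practical form is written with the advantage rather than the Q-value, which changes the objective only by a constant.

```latex
% Local (surrogate) approximation of the return around the current policy \pi
L_{\pi}(\tilde{\pi}) = \eta(\pi)
  + \sum_{s} \rho_{\pi}(s) \sum_{a} \tilde{\pi}(a \mid s)\, A_{\pi}(s, a)

% Lower bound on the return of the new policy \tilde{\pi}
\eta(\tilde{\pi}) \;\ge\; L_{\pi}(\tilde{\pi})
  - C \, D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}),
\qquad C = \frac{4 \epsilon \gamma}{(1 - \gamma)^{2}},
\quad \epsilon = \max_{s, a} \lvert A_{\pi}(s, a) \rvert

% Practical trust-region problem solved at each update
\max_{\theta} \;
  \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}},\; a \sim \pi_{\theta_{\mathrm{old}}}}
  \left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}
         \, A_{\theta_{\mathrm{old}}}(s, a) \right]
\quad \text{s.t.} \quad
  \mathbb{E}_{s \sim \rho_{\theta_{\mathrm{old}}}}
  \left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\mathrm{old}}}(\cdot \mid s)
         \,\middle\|\, \pi_{\theta}(\cdot \mid s) \right) \right] \le \delta
```

Maximizing the right-hand side of the bound at every iteration gives the monotonic improvement guarantee. The constrained form replaces the penalty coefficient C, which would force very small steps, with a hard limit on the average KL divergence.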