
paper: Proximal Policy Optimization Algorithms (Schulman et al., 2017)

TL;DR

  • I read this because.. : to catch up on background knowledge
  • task : RL
  • problem : Q-learning is too unstable, and TRPO is relatively complicated. Can we get a data-efficient, scalable architecture?
  • idea : use clipping instead of the KL divergence term; alternatively, vary the KL penalty coefficient $\beta$ adaptively
  • input/output : state -> action
  • architecture : MLP
  • objective : $L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\hat{A}_t) \right]$
  • baseline : surrogate-loss variants (no clipping, KL penalty), A2C, CEM, TRPO
  • data : OpenAI Gym (MuJoCo), humanoid control tasks (Roboschool), Atari
  • evaluation : didn't look into the details
  • result : good (it beats the baselines on most tasks)
  • contribution : a simple and intuitive loss.. honestly, this paper alone gives an intuitive enough understanding that I wonder whether the earlier background was even necessary.. haha
  • etc. :

Details

Preliminary

  • policy gradient method: $\hat{g} = \hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right]$

The gradient is the derivative of the log-probability under the policy network, weighted by the advantage. Here $\hat{\mathbb{E}}_t$ just means drawing samples and taking the empirical mean.

$$L^{PG}(\theta) = \hat{\mathbb{E}}_t\left[\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right]$$

The loss is set up so that the expected advantage-weighted log-likelihood over trajectories sampled from this policy is maximized (a minimal sketch below). However, performance suffers when this objective leads to destructively large policy updates.
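A minimal sketch of this vanilla policy gradient loss; the tensor names are my own, assuming log-probabilities and advantages have already been collected from rollouts:

```python
import torch

def pg_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    # hat{E}_t[ log pi_theta(a_t|s_t) * hat{A}_t ]: the mean over sampled
    # timesteps is the empirical estimate of the expectation.
    # Negated because optimizers minimize (we want to ascend the objective).
    return -(log_probs * advantages).mean()
```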

  • Trust Region methods: TRPO derived a surrogate objective with a monotonic-improvement guarantee for policy updates (https://github.com/long8v/PTIR/issues/154). With a constraint on how far the policy may move per update, the objective ends up as the importance weight between the new and old policies multiplied by the advantage:

$$\underset{\theta}{\text{maximize}}\;\; \hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\,\hat{A}_t\right] \quad \text{subject to} \quad \hat{\mathbb{E}}_t\!\left[\mathrm{KL}\!\left[\pi_{\theta_{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\right]\right] \le \delta$$

The theory actually suggests adding a penalty term rather than a hard constraint, where $\beta$ is a hyperparameter:

$$\underset{\theta}{\text{maximize}}\;\; \hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\,\hat{A}_t - \beta\,\mathrm{KL}\!\left[\pi_{\theta_{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\right]\right]$$

์ด ๋…ผ๋ฌธ์—์„œ ๋งํ•˜๋Š” TRPO์˜ ๋ฌธ์ œ์ ์€ $\beta$๊ฐ€ ํ•˜๋‚˜๋กœ ๊ณจ๋ผ์งˆ ์ˆ˜ ์—†๋‹ค ์ •๋„์ธ๋“ฏ.

Clipped Surrogate Objective

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min\!\left(r_t(\theta)\hat{A}_t,\ \text{clip}(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon)\,\hat{A}_t\right) \right]$$

(figure: $L^{CLIP}$ plotted as a function of the ratio $r$, for positive and negative advantage)

๊ฒฐ๋ก ์ ์œผ๋กœ $r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)}$๊ฐ€ ๋„ˆ๋ฌด ํฌ๊ฑฐ๋‚˜ ์ž‘์ง€ ์•Š๋„๋ก clipํ–ˆ๋‹ค๊ณ  ๋ณด๋ฉด ๋จ


They report that with this objective, the policy changes much less per update.


Adaptive KL Penalty Coefficient

$$L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\,\hat{A}_t - \beta\,\mathrm{KL}\!\left[\pi_{\theta_{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\right]\right]$$

After each policy update, compute $d = \hat{\mathbb{E}}_t\!\left[\mathrm{KL}\!\left[\pi_{\theta_{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\right]\right]$: if $d < d_{targ}/1.5$, halve $\beta$; if $d > d_{targ} \times 1.5$, double $\beta$.

Wow, this one is also really intuitive and simple..
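A sketch of that adaptive rule (the 1.5 and 2 factors are from the paper, which notes they are not sensitive; the default `d_targ` value here is my assumption):

```python
def update_kl_penalty(beta: float, kl: float, d_targ: float = 0.01) -> float:
    # KL well below target -> the penalty is too strong, so halve it.
    # KL well above target -> the penalty is too weak, so double it.
    if kl < d_targ / 1.5:
        beta /= 2.0
    elif kl > d_targ * 1.5:
        beta *= 2.0
    return beta
```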

Algorithm

์œ„์—๋Š” policy์— ๋Œ€ํ•œ gradient ์—…๋ฐ์ดํŠธ๋ฅผ ์–ด๋–ป๊ฒŒ ํ• ์ง€ ์ •ํ•œ๊ฑฐ๊ณ  ์™ธ์˜ ๊ฒƒ๋“ค์€ ๊ดœ์ฐฎ์€ ๊ฒƒ๋“ค ๊ฐ–๋‹ค ์”€. V(s)๋ฅผ ๋„์ž…ํ•˜์—ฌ reward์˜ variance ์ค„์ž„. ์ฆ‰ policy surrogate์™€ value function error term์„ ์ถ”๊ฐ€ํ•จ. ์—ฌ๊ธฐ์— exploration ๋” ๋งŽ์ด ํ•˜๋ผ๊ณ  entropy bonus๋ฅผ ์ถ”๊ฐ€ํ•ด์คŒ. image

The advantage subtracts the current $V$ from the discounted sum of rewards over the segment, plus the discounted future $V$:

$$\hat{A}_t = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t+1} r_{T-1} + \gamma^{T-t} V(s_T)$$

I don't remember the details well.. they use a truncated form of generalized advantage estimation, which trades off bias and variance over a fixed number of timesteps, computed as below (see the sketch after the equations):

$$\hat{A}_t = \delta_t + (\gamma\lambda)\,\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t+1}\,\delta_{T-1}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
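A minimal sketch of that truncated GAE computation; the $\gamma = 0.99$, $\lambda = 0.95$ defaults are just common values, an assumption rather than a universal setting:

```python
import numpy as np

def truncated_gae(rewards: np.ndarray, values: np.ndarray,
                  gamma: float = 0.99, lam: float = 0.95) -> np.ndarray:
    """values has length T+1: it includes the bootstrap value V(s_T)."""
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # delta_t
        running = delta + gamma * lam * running                 # GAE recursion
        advantages[t] = running
    return advantages
```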

Experiments

  • loss term: (table: ablation over the surrogate objectives listed above, i.e. no clipping or penalty, clipping, and fixed/adaptive KL penalties)

  • other RL algorithms: can't really tell much even when I look, heh (figure: learning curves comparing against the other algorithms)