
paper

TL;DR

  • I read this because.. : recommended reading in a fall-semester course at 남종대
  • task : Deep Reinforcement Learning
  • problem : online RL is unstable. Replay buffers (fetching s, a, s' from earlier transitions) were devised to address this, but they restrict learning to off-policy RL algorithms
  • idea : run multiple agents in parallel and update with their experience batched together!
  • input/output : trajectory / policy
  • architecture : one-step Q-learning, one-step Sarsa, n-step Q-learning, and the proposed A3C (actor-critic). The value or policy network is an FFN or an LSTM
  • objective : the advantage when following policy $\pi$ (value-based), or the expectation of the return when following the policy (policy-based); adding the policy's entropy as a loss term makes training more stable
  • baseline : one-step Q-learning, one-step Sarsa, multi-step Q-learning, advantage actor-critic
  • data : Atari 2600, TORCS, Mujoco, Labyrinth
  • evaluation : score, data efficiency, stability
  • result : higher scores, faster convergence, better performance with fewer training steps (data efficiency). Other methods rely on GPUs, while this one uses only a multi-core CPU
  • contribution : seen from today's vantage point, a simple idea that achieves strong performance
  • etc. : so this is the famous A3C… RL model names are all so full of personality.. Gorila, REINFORCE, A3C, …

Details

Introduction

  • Problems with the data an online RL agent encounters:
    • non-stationary: stationarity in the time-series sense; the data distribution shifts across time steps because the policy keeps changing during learning
    • strongly correlated: consecutive samples are correlated, since step t depends on step t-1. To break this correlation, prior work stores transitions in a replay buffer, batches the data, and samples randomly across time steps. But this naturally restricts the method to off-policy algorithms, because the stored transitions were generated by an older policy (a rough sketch of the replay-buffer idea is below)
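
A minimal sketch of the replay-buffer mechanism being contrasted here (class name, capacity, and batch size are my own choices, not from the paper):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions and hands back decorrelated random mini-batches."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # Random sampling breaks the correlation between consecutive time steps,
        # but the sampled transitions were produced by older policies,
        # which is why this trick is limited to off-policy methods.
        return random.sample(self.buffer, batch_size)
```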

Reinforcement Learning Background

์šฐ๋ฆฌ๊ฐ€ RL์—์„œ ํ•˜๋ ค๋Š” ๊ฑด environment $\epsilon$๊ณผ ์ƒํ˜ธ์ž‘์šฉํ•˜๋Š” agent๊ฐ€ ์žˆ์„ ๋•Œ, time step t์˜ $s_t$์ด ์ฃผ์–ด์กŒ์„ ๋•Œ action $a_t$๋ฅผ ๋„์ถœํ•˜๋Š” policy $\pi$๊ฐ€ ์žˆ์„ ๋•Œ discount factor $\gamma$๋กœ ํ• ์ธ๋œ $R_t=\sum_{k=0}^{\infty} \gamma^k r_{t+k}$์„ ์ตœ๋Œ€ํ™”ํ•˜๋Š” ๊ฒƒ!

์ด๋•Œ action value Q๋Š” ์•„๋ž˜์™€ ๊ฐ™์ด ํ‘œํ˜„๋˜๊ณ  $Q^\pi (s) = \mathbb{E}[R_t|s_t =s, a]$ ์ด๋Š” policy $\pi$๋ฅผ ๋”ฐ๋ž์„ ๋•Œ state s์—์„œ action a๋ฅผ ์ทจํ•  ๋•Œ sum of reward์˜ ๊ธฐ๋Œ€๊ฐ’์ด๋‹ค. value of state s๋„ ์œ ์‚ฌํ•˜๊ฒŒ ์•„๋ž˜์™€ ๊ฐ™์ด ํ‘œํ˜„๋œ๋‹ค. $V^\pi (s) = \mathbb{E}[R_t|s_t =s ] $์ด๋Š” policy $\pi$๋ฅผ ๋”ฐ๋ž์„ ๋•Œ state s์˜ sum of reward์˜ ๊ธฐ๋Œ€๊ฐ’์ด๋‹ค.

์—ฌ๊ธฐ๊นŒ์ง€๊ฐ€ RL์˜ ๊ธฐ๋ณธ ์…‹ํŒ…! ์—ฌ๊ธฐ์„œ value-based model-free method๋ฅผ ์“ฐ๋ฉด $Q(s,a;\theta)$๋ฅผ ๋ฐ”๋กœ NN๋กœ ๊ทผ์‚ฌํ•œ๋‹ค. ์ด๊ฒŒ Q-learning. ๊ทธ๋Ÿผ ์šฐ๋ฆฌ๋Š” optimal $Q^*(s,a)$๋ฅผ NN์˜ ํŒŒ๋ผ๋ฏธํ„ฐ $\theta$๋กœ ๋ฐ”๋กœ ๊ทผ์‚ฌํ•˜๋ฉด ๋œ๋‹ค. ์ด๋•Œ ์šฐ๋ฆฌ์˜ loss๋Š” image ์•„ ์ข€ ํ—ท๊ฐˆ๋ฆฌ๋Š”๋Ž….. state s, a์—์„œ ์ „์ด๋˜๋Š” s’๋ฅผ maximizeํ•˜๋Š” a’๋ฅผ ์ตœ๋Œ€ํ™”ํ•˜๋Š” $\theta$๋ฅผ ๊ตฌํ•˜๋‹ˆ๊นŒ policy๋„ ๊ตฌํ•  ์ˆ˜ ์žˆ๋Š”๊ฑด๊ฐ€ Q๋งŒ ๊ตฌํ•ด์„œ ๋ญ์— ์“ฐ๋Š”๊ฑฐ์ง€? Q๋ฅผ ๊ตฌํ•˜๋ฉด policy๋„ ์ž๋™์œผ๋กœ ๊ตฌํ•ด์ง€๋Š”๊ฑด๊ฐ• (Q๋ฅผ ์ตœ๋Œ€ํ™”ํ•˜๋Š” a๋ฅผ ๊ตฌํ•˜๋ฉด ๋˜๋‹ˆ๊นŒ?) RL์€ ์–ธ์ œ๋‚˜ policy network๊ฐ€ ์žˆ์Œ! Q๋Š” (t+1) ์‹œ์  ์ดํ›„๋กœ์˜ reward๋ฅผ ๊ทผ์‚ฌํ•˜๋Š” ๊ฐ’! q-learning์—์„œ๋Š” network๋Š” ์•”์‹œ์ ์œผ๋กœ ์„ค์ •๋˜๋Š”๋“ฏ. ๋ณ„๋„์˜ policy network๊ฐ€ ์žˆ๋Š” ๊ฒƒ์ด ์•„๋‹˜. ์›๋ž˜์˜ ์ดํ•ด๊ฐ€ ๋งž์Œ(24.08.21)

์ด๋•Œ Q-learning์˜ ๋‹จ์ ์€ reward๋ฅผ ์–ป๋Š” (s, a) pair๋งŒ ์ง์ ‘ ์˜ํ–ฅ์„ ๋ฐ›๊ณ  ๋‹ค๋ฅธ (s, a) pair๋“ค์€ ๋น„๊ฐ„์ ‘์ ์œผ๋กœ ์˜ํ–ฅ์„ ๋ฐ›์•„์„œ ํ•™์Šต์ด ๋А๋ฆฌ๋‹ค. ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋‚˜์˜จ ๊ฒƒ์ด n-step Q-learning. ์ด๊ฑด ํ• ์ธ ratio $\gamma$๋ฅผ ์ ์šฉํ•ด์„œ ๋‹ค๋ฅธ time step๊นŒ์ง€ ํ˜„์žฌ reward์— ์˜ํ–ฅ์„ ์ฃผ๊ฒŒ ํ•˜๋Š” ๊ฒƒ์ธ๋“ฏ? image

์ด๋ ‡๊ฒŒ ํ•˜๋ฉด single reward $r$์ด ์ด์ „์˜ state action pair์—๋„ ์ง์ ‘์ ์œผ๋กœ ์˜ํ–ฅ์„ ์ค„ ์ˆ˜ ์žˆ๊ฒŒ ๋œ๋‹ค.

๋ฐ˜๋Œ€๋กœ policy-based model๋“ค์€ policy $\pi(a|s, \theta)$๋ฅผ ๋ฐ”๋กœ parametrizeํ•œ๋‹ค. ์ด๋•Œ์˜ loss๋Š” $\mathbb{E}[R_t]$์ด๋‹ค(gradient ascent) REINFOCE ๋ฅ˜๋“ค์ด ์ด๋ ‡๊ฒŒ ํ•˜๋Š”๋ฐ $\theta$๋ฅผ ์•„๋ž˜์™€ ๊ฐ™์ด ๊ตฌํ•˜๊ณ  image

์ด๊ฒŒ variance๊ฐ€ ๋†’์•„์„œ ์ด๋ฅผ ๋‚ฎ์ถ”๋ ค๊ณ  bias term์„ ๋นผ๊ฒŒ ๋œ๋‹ค. image

์ด๋•Œ ์ด bias term์„ V๋กœ ๊ทผ์‚ฌํ•ด์„œ ๊ตฌํ•˜๋ฉด ๋” variance๊ฐ€ ๋‚ฎ์•„์ง€๋Š”๋ฐ ์ด๊ฒŒ ๋ฐ”๋กœ actor-critic architecture๋‹ค image

Asynchronous RL Framework

multi-thread๋ฅผ ์จ์„œ asynchronousํ•˜๊ฒŒ ํ•˜๋ฉด ๋œ๋‹ค.

  • one-step Q-learning์˜ pseudo-code๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค. image

๋ณ„๊ฑฐ ์—†๊ณ  ๊ทธ๋ƒฅ thread T๊ฐœ ์ผ ๋•Œ๊นŒ์ง€ grad accum ํ–ˆ๋‹ค๊ฐ€ ํ•œ๋ฒˆ์— ์—…๋ฐ์ดํŠธํ•˜๋Š”๊ฑฐ~

  • n-step Q-Learning: the pseudo-code is in the paper's supplementary material.

์œ„์˜ ๋ง ๋ฌด์Šจ ๋ง์ธ์ง€ ๋ชจ๋ฅด๊ฒ ์Œ ์›๋ž˜๋Š” ๊ณผ๊ฑฐ๋กœ ๊ฐ€์•ผ๋˜๋Š”๋ฐ ๋ฏธ๋ž˜๋กœ ๊ฐ„๋‹ค..? $t_max$์ผ ๋•Œ ๊นŒ์ง€ exploration ํ•œ ๋‹ค์Œ์— ํ•œ๋ฒˆ์— ์—…๋ฐ์ดํŠธํ•˜๋Š” ๋“ฏ ํ•˜๋‹ค.

  • Asynchronous advantage actor-critic (A3C): advantage actor-critic plus multi-threading, with the policy's entropy added to the loss (pseudo-code in the paper's supplementary material).

ํ•™์Šต์€ RMSProp ์‚ฌ์šฉ

Result

  • Data Efficiency

์ด๋ก ์ ์œผ๋กœ๋Š” ๊ฐ™์€ sample ๊ฐœ์ˆ˜๋ฅผ ๋ดค์„ ๋•Œ ๋™์ผ์„ฑ๋Šฅ์ด ๋‚˜์˜ค๋ฉด ์ข‹์Œ. ๊ทผ๋ฐ ์šฐ๋ฆฐ multi-thread ์“ฐ๋‹ˆ๊นŒ 4๊ฐœ thread์“ฐ๋ฉด wall-clock time์ด 4๋ฐฐ ๋‹จ์ถ•๋˜๋Š” ํšจ๊ณผ! ๊ทธ๋Ÿฐ๋ฐ ์ถ”๊ฐ€๋กœ ๋†€๋ž๊ฒŒ๋„ Q-learning๊ณผ sarsa ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๊ฒฝ์šฐ ๋™์ผ ์ƒ˜ํ”Œ ๊ฐœ์ˆ˜ ๋Œ€๋น„ ์„ฑ๋Šฅ์ด ๋” ์ข‹์•˜๋‹ค๊ณ . one-step method๋ณด๋‹ค bias๋ฅผ ์ค„์—ฌ์„œ?