paper : Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Rafailov et al., 2023)

TL;DR

  • I read this because.. : filling in background knowledge
  • task : RL
  • problem : RLHF (PPO/TRPO-style) requires training a separate reward model and then running an RL loop, which gets very costly and unstable as models grow
  • idea : can we skip the separate reward model and fold the reward signal directly into the training loss?
  • input/output : {prompt $x$, preference pair $(y_w, y_l)$} -> policy $\pi_\theta$
  • architecture : GPT2-Large
  • objective : the proposed DPO loss (derived below)
  • baseline : zero-shot GPT-J, SFT, Preferred-FT, Unlikelihood, PPO, PPO-GT, Best-of-N baseline (return the highest-reward response among N SFT samples)
  • data : IMDb, Reddit TL;DR
  • evaluation : GPT-4 Evaluator
  • result : ๋ฒ ์ด์Šค๋ผ์ธ ๋Œ€๋น„ ๋น„์Šทํ•˜๊ฑฐ๋‚˜ ๋‚˜์€ ์„ฑ๋Šฅ
  • contribution :
  • etc. : ํ•€ ๊ต์ˆ˜๋‹˜ ์—ฌ๊ธฐ์„œ ๋ต™๋Š”๊ตฐ์š” ..!

Details

Preliminaries

  • SFT ์†Œ๋Ÿ‰์˜ ์–‘์งˆ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•ด์„œ $\pi^{SFT}$๋ฅผ ๋งŒ๋“ฆ

  • Reward modeling (Bradley-Terry model):

    $p^*(y_1 \succ y_2 \mid x) = \frac{\exp(r^*(x, y_1))}{\exp(r^*(x, y_1)) + \exp(r^*(x, y_2))}$

Recasting this as a binary classification problem gives the reward model's negative log-likelihood loss:

$\mathcal{L}_R(r_\phi, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$

  • RL fine-tuning phase:

    $\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\left[r_\phi(x, y)\right] - \beta\, \mathbb{D}_{\mathrm{KL}}\left[\pi_\theta(y \mid x) \,\|\, \pi_{ref}(y \mid x)\right]$
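The Bradley-Terry reward-modeling loss above can be sketched numerically. This is a pure-Python toy, not the paper's implementation: the scalar rewards and the two example pairs are made-up illustrations.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def bt_preference_prob(r_chosen: float, r_rejected: float) -> float:
    # Bradley-Terry: p(y_w > y_l) = sigmoid(r(x, y_w) - r(x, y_l))
    return sigmoid(r_chosen - r_rejected)

def reward_model_loss(pairs):
    # Average negative log-likelihood of the chosen response winning each comparison
    return -sum(math.log(bt_preference_prob(rw, rl)) for rw, rl in pairs) / len(pairs)

# toy (reward of chosen, reward of rejected) pairs
pairs = [(2.0, 0.5), (1.2, -0.3)]
print(reward_model_loss(pairs))
```

When both rewards are equal the preference probability is 0.5, so the loss sits at $\log 2$; it drops toward 0 as the reward model separates chosen from rejected.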

DPO

์œ„์˜ ํ•จ์ˆ˜๋ฅผ ๋‹ค์‹œ ์“ฐ๋ฉด image

image

partition function์€ ํ™•๋ฅ ๋ถ„ํฌ๋กœ ๋งŒ๋“ค์–ด์ฃผ๋Š” ์—ญํ• ?

optimal policy์— ๋Œ€ํ•ด bradely-terry model์€ ์•„๋ž˜์™€ ๊ฐ™์€ preferenc๊ฐ€ ์„ฑ๋ฆฝ

image

policy์˜ ๊ด€์ ์—์„œ human preference data๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์œผ๋‹ˆ ์ด๋ฅผ mle objective๋กœ ํ‘œํ˜„ํ•˜๋ฉด image

What does the DPO update do? Its gradient is

$\nabla_\theta \mathcal{L}_{DPO}(\pi_\theta; \pi_{ref}) = -\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\sigma\!\left(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\right) \left[\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)\right]\right]$

where $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{ref}(y \mid x)}$ is the implicit reward. The update raises the likelihood of the preferred response and lowers that of the dispreferred one, weighted by how badly the implicit reward currently mis-ranks the pair.
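The per-example weight $\sigma(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w))$ in the gradient is easy to inspect on its own; the implicit-reward values below are illustrative assumptions.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def dpo_gradient_weight(r_hat_w: float, r_hat_l: float) -> float:
    # sigma(r_hat_l - r_hat_w): large when the implicit reward mis-ranks the pair,
    # small when the pair is already ranked correctly with a wide margin
    return sigmoid(r_hat_l - r_hat_w)

print(dpo_gradient_weight(2.0, 0.0))  # correctly ranked pair -> small weight
print(dpo_gradient_weight(0.0, 2.0))  # mis-ranked pair -> large weight
```

So DPO focuses its updates on the pairs the implicit reward model currently gets wrong, mirroring how a confidence-weighted classifier behaves.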