
paper

TL;DR

  • I read this because.. : background knowledge before reading the PRIME paper. I've been very interested in implicit dense rewards because of a previous project haha
  • task : reward modeling
  • problem : PRM์ด ๋” ์„ฑ๋Šฅ์€ ์ข‹์€๋ฐ ORM์— ๋น„ํ•ด ๋„ˆ๋ฌด ๋น„์‹ธ๋‹ค
  • idea : ORM๋งŒ ํ•™์Šตํ•ด์„œ PRM์ฒ˜๋Ÿผ sparse reward ๋ชป ์–ป๋‚˜?
  • input/output : prompt, y -> reward of y_t
  • architecture : Llama-3.1-8B-Instruct
  • objective : ๋ชจ๋“  q๊ฐ€ ๋“ค์–ด๊ฐ€๋Š” ๊ณณ์— $\frac{\pi_\theta(y_i|y_{<i})}{\pi_{ref}(y_i|y_{<i})}$๋ฅผ ๋„ฃ์ž. DPO, KTO, NCA, CE
  • baseline : Math-Shepherd, AutoPSV, RLHFlow, open ORM/PRM models
  • data : UltraInteract – 8 rollouts per instruction from Llama-3.1-8B-Instruct
  • evaluation : MATH-500 BoN / Mistral-Instruct-v0.3, Llama-3.1-8B-Instruct, Llama-3.1-70B-Instruct
  • result : better than Math-Shepherd and AutoPSV.
  • contribution : where the "DPO is secretly.. Q-learning" paper was confined to DPO, this applies to most loss terms
  • etc. :
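Since the list above names DPO as one instantiation, here's a minimal sketch of what "plugging the log-ratio in" looks like there. The `beta=0.05` value and the per-token log-prob inputs are my assumptions, not from the paper:

```python
import torch.nn.functional as F

def implicit_dpo_loss(logp_theta_w, logp_ref_w, logp_theta_l, logp_ref_l, beta=0.05):
    """DPO loss with the reward parameterized as beta * log(pi_theta / pi_ref).

    Each input is a (batch, seq_len) tensor of per-token log-probs for the
    chosen (w) and rejected (l) responses under the policy / reference models.
    """
    r_w = beta * (logp_theta_w - logp_ref_w).sum(dim=-1)  # implicit reward of chosen
    r_l = beta * (logp_theta_l - logp_ref_l).sum(dim=-1)  # implicit reward of rejected
    return -F.logsigmoid(r_w - r_l).mean()                # -log sigmoid(r_w - r_l)
```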

Details

  • advantage r์„ reference์™€์˜ ๋น„์œจ๋กœ ๋‘๋ฉด q๊ฐ€ ์ •ํ™•ํžˆ exponential average of $r_\theta$ at step t๊ฐ€ ๋จ Image
Image

์ฆ‰ ORM์„ ํ•™์Šตํ•  ๋•Œ r์„ ์ €๋ ‡๊ฒŒ ์ฃผ๋ฉด PRM์ฒ˜๋Ÿผ ๊ฐ๊ฐ์˜ step์— ๋Œ€ํ•œ $y_t$๊ฐ€ Q๊ฐ€ ๋˜์–ด์„œ ์ด๊ฑธ sparse reward์ฒ˜๋Ÿผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Œ

  • second proposition

์ดํ•ด ๋ชปํ•จ ใ…Ž

  • CE loss์—๋„ ์ด๊ฑธ ์ ์šฉ ๊ฐ€๋Šฅ
Image
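Since $\sigma$ followed by log-loss is just BCE-with-logits on the sequence-level implicit reward, the CE instantiation is a few lines. A sketch; shapes and `beta` are assumptions:

```python
import torch.nn.functional as F

def implicit_orm_ce_loss(logp_theta, logp_ref, labels, beta=0.05):
    """CE loss on outcome labels with r_theta(y) = beta * log[pi_theta(y) / pi_ref(y)].

    logp_theta / logp_ref: (batch, seq_len) per-token log-probs of each response.
    labels: (batch,) float tensor, 1.0 if the final answer is correct else 0.0.
    """
    r = beta * (logp_theta - logp_ref).sum(dim=-1)  # sequence-level implicit reward
    # BCE-with-logits == -[l*log(sigmoid(r)) + (1-l)*log(1-sigmoid(r))]
    return F.binary_cross_entropy_with_logits(r, labels)
```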
  • result (figure: BoN accuracy, beats Math-Shepherd / AutoPSV)
  • efficiency (figure omitted)
  • with majority vote (figure omitted)
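A sketch of how the implicit PRM plugs into BoN and weighted majority voting. How the paper aggregates step rewards into a single candidate score (sum, min, last step) isn't captured in these notes, so `sum` below is an assumption:

```python
from collections import defaultdict

def best_of_n(answers, step_rewards):
    """Pick the candidate whose aggregated implicit process reward is highest."""
    scores = [sum(r) for r in step_rewards]  # aggregation choice: sum over steps
    return answers[max(range(len(answers)), key=scores.__getitem__)]

def weighted_majority_vote(answers, step_rewards):
    """Sum candidate scores per distinct final answer and pick the heaviest."""
    totals = defaultdict(float)
    for ans, r in zip(answers, step_rewards):
        totals[ans] += sum(r)
    return max(totals, key=totals.get)
```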

c.f. huh, UltraInteract has math in it (figure omitted)