
Paper

TL;DR

  • I read this because : background reading before the paper called PRIME; interested in implicit dense rewards because of my previous research
  • task : reward modeling
  • Problem : PRMs perform better but are much more expensive to train than ORMs (they need step-level labels)
  • idea : Can we train only an ORM and still get dense, step-level rewards like a PRM?
  • input/output : (prompt, response y) -> reward for each step y_t
  • architecture : Llama-3.1-8B-Instruct
  • objective : substitute $\beta \log \frac{\pi_\theta(y_i|y_{<i})}{\pi_{ref}(y_i|y_{<i})}$ wherever $q$ appears; instantiated with the DPO, KTO, NCA, and CE losses
  • baseline : Math-Shepherd, AutoPSV, RLHFlow, open ORM/PRM models
  • data : UltraInteract – 8 rollouts per instruction sampled from Llama-3.1-8B-Instruct
  • evaluation : MATH-500 BoN with Mistral-Instruct-v0.3, Llama-3.1-8B-Instruct, Llama-3.1-70B-Instruct as policies
  • result : Better performance than Math-Shepherd, AutoPSV.
  • contribution : Extends the "your language model is secretly a Q-function" result; where that thesis was limited to the DPO loss, here the same implicit-reward view applies to most loss functions.
  • etc. :
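As a reading aid for the BoN evaluation above, best-of-N selection is just "score every sampled response and keep the top one". A minimal sketch, where `reward_fn` is a hypothetical stand-in for the implicit PRM's sequence-level score (all names here are my own):

```python
def best_of_n(candidates, reward_fn):
    """Best-of-N selection: score each sampled response with the reward
    model and return the highest-scoring one. reward_fn is a hypothetical
    stand-in for the implicit PRM's sequence-level reward."""
    return max(candidates, key=reward_fn)

# toy usage: with len() as a dummy reward, the longest response wins
best = best_of_n(["a", "bbb", "cc"], len)  # -> "bbb"
```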

Details

  • Parameterizing the reward as the log-ratio to the reference model, $r_\theta(y) = \beta \log \frac{\pi_\theta(y)}{\pi_{ref}(y)}$, makes $q_t$ exactly an exponential average of $r_\theta$ at step $t$:

$$q_t(y_{<t}, y_t) = \sum_{i=1}^{t} \beta \log \frac{\pi_\theta(y_i \mid y_{<i})}{\pi_{ref}(y_i \mid y_{<i})} = \beta \log \mathbb{E}_{\pi_{ref}(y \mid y_{\le t})}\!\left[ e^{\frac{1}{\beta} r_\theta(y)} \right]$$

This means that when training an ORM with the reward parameterized this way, each step's $q_t$ behaves like a PRM's Q-value, so the difference $r_t = q_t - q_{t-1}$ can be used as a dense, process-level reward.
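A minimal sketch of that extraction step, assuming we already have per-token log-probs from the trained model and the frozen reference (the value of `beta` and all names are my own choices):

```python
def implicit_process_rewards(logp_theta, logp_ref, beta=0.05):
    """Per-step process rewards from an implicit PRM.

    logp_theta / logp_ref: per-token log-probs of the trained model and
    the reference model on the same response tokens. The cumulative sum
    q_t = beta * sum_{i<=t} log(pi_theta/pi_ref) plays the role of a
    Q-value, so the step reward is the difference r_t = q_t - q_{t-1}.
    """
    q, rewards = 0.0, []
    for lt, lr in zip(logp_theta, logp_ref):
        step = beta * (lt - lr)  # r_t = q_t - q_{t-1}
        q += step
        rewards.append(step)
    return rewards, q  # per-step rewards and the sequence-level reward q_T
```

In practice a "step" is a span of tokens (e.g. one reasoning line), so the per-token terms would be summed within each span before differencing.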

  • second proposition
(screenshot of the paper's second proposition)

I don't understand this part yet.

  • This also applies to the CE loss, with the implicit reward scored by a sigmoid against the binary outcome label $l \in \{0, 1\}$:

$$\mathcal{L}_{CE} = -\big[\, l \cdot \log \sigma\big(r_\theta(y)\big) + (1-l) \cdot \log \big(1 - \sigma(r_\theta(y))\big) \,\big]$$
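A hedged sketch of this CE objective on the sequence-level implicit reward (function and argument names are my own; in practice this would be a batched tensor op):

```python
import math

def ce_loss(logp_theta_sum, logp_ref_sum, label, beta=0.05):
    """Cross-entropy ORM loss on the implicit reward (illustrative sketch).

    r = beta * log(pi_theta(y) / pi_ref(y)) over the whole response is
    squashed by a sigmoid and trained against the binary outcome label
    (1 = correct final answer, 0 = incorrect):
        L = -[ l * log sigma(r) + (1 - l) * log(1 - sigma(r)) ]
    """
    r = beta * (logp_theta_sum - logp_ref_sum)
    p = 1.0 / (1.0 + math.exp(-r))
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
```

Note that only outcome labels are needed; the process rewards fall out of the learned log-ratios for free.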
  • result
(screenshot: main BoN results table)
  • efficiency
(screenshot: efficiency comparison)
  • with majority vote
(screenshot: results combined with majority voting)
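A sketch of reward-weighted majority voting, one common way to combine BoN scoring with self-consistency (the paper's exact weighting scheme may differ; all names are my own):

```python
from collections import defaultdict

def weighted_majority_vote(answers, rewards):
    """Each candidate's final answer casts a vote weighted by its reward
    score; the answer with the highest total weight wins."""
    totals = defaultdict(float)
    for ans, r in zip(answers, rewards):
        totals[ans] += r
    return max(totals, key=totals.get)

# toy usage: two low-reward "4"s (0.2 + 0.3) lose to one high-reward "5"
winner = weighted_majority_vote(["4", "5", "4"], [0.2, 0.9, 0.3])  # -> "5"
```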

cf. UltraInteract math data (screenshot)