TL;DR
- I read this because : prior knowledge before reading the paper called PRIME; also interested in implicit dense rewards because of my previous research lol
- task : reward modeling
- Problem : PRMs perform better but are much more expensive than ORMs (they need step-level labels)
- idea : Can’t we just train an ORM and still get dense, step-level rewards like a PRM?
- input/output : prompt, y -> reward of y_t
- architecture : Llama-3.1-8B-Instruct
- objective : put $\beta\log\frac{\pi_\theta(y_i|y_{<i})}{\pi_{ref}(y_i|y_{<i})}$ everywhere a q would go; works with DPO, KTO, NCA, and CE losses
- baseline : Math-Shepherd, AutoPSV, RLHFlow, open-source ORM/PRM models
- data : UltraInteract – 8 rollouts per instruction from Llama-3.1-8B-instruct
- evaluation : best-of-N (BoN) on MATH-500; generators: Mistral-Instruct-v0.3, Llama-3.1-8B-Instruct, Llama-3.1-70B-Instruct
- result : Better performance than Math-Shepherd, AutoPSV.
- contribution : generalizes the "your language model is secretly a Q-function" result; where that Q-learning thesis was limited to DPO, this shows it applies to most reward-modeling losses.
- etc. :
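The BoN evaluation above can be sketched in a few lines; `best_of_n` and `score_fn` are hypothetical names I'm using, where `score_fn` stands in for however the implicit PRM scores a full solution (e.g. sequence-level reward, or the min over per-step rewards):

```python
def best_of_n(candidates, score_fn):
    # Best-of-N selection: score every sampled solution with the reward
    # model and keep the highest-scoring one as the final answer.
    return max(candidates, key=score_fn)

# Toy usage: with length as a dummy scorer, the longest candidate wins.
print(best_of_n(["a", "bb", "ccc"], len))
```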
Details
- Parameterizing the reward as the log-ratio to the reference, $r_\theta(y) := \beta\log\frac{\pi_\theta(y)}{\pi_{ref}(y)}$, ensures that $q_t := \beta\log\frac{\pi_\theta(y_{\le t})}{\pi_{ref}(y_{\le t})}$ is exactly the exponential average of $r_\theta$ at step t
This means that if we train an ORM with r parameterized like that, the $q_t$ at each step $y_t$ is a Q value, like in a PRM, so we get dense process rewards $r_t = q_t - q_{t-1}$ for free
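A minimal sketch of the step above, assuming per-token log-probs under the trained model and the reference are available (function name and the β value are mine for illustration):

```python
def implicit_process_rewards(logp_theta, logp_ref, beta=0.05):
    """q_t = beta * sum_{i<=t} log(pi_theta/pi_ref) at token i;
    per-step process reward r_t = q_t - q_{t-1}."""
    q, r, running = [], [], 0.0
    for lt, lr in zip(logp_theta, logp_ref):
        prev = running
        running += beta * (lt - lr)  # accumulate the scaled log-ratio
        q.append(running)
        r.append(running - prev)     # the marginal gain at this step
    return q, r
```

Note that r_t reduces to the per-token log-ratio itself, which is why no process labels are needed: the trained ORM's token probabilities already encode the dense reward.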
- second proposition
    I don't fully understand this yet
- This also applies to the CE loss (so pairwise preference data isn't required, just correctness labels)
- result
- efficiency
- with majority vote
cf. UltraInteract Math