TL;DR
- I read this because : prior knowledge before reading the paper called PRIME; also interested in implicit dense rewards because of my previous research lol
- task : reward modeling
- Problem : PRMs perform better but are much more expensive than ORMs (they need step-level labels)
- idea : Can’t we just train an ORM and still get dense, step-level rewards like a PRM?
- input/output : prompt, y -> reward of y_t
- architecture : Llama-3.1-8B-Instruct
- objective : put $\beta\log\frac{\pi_\theta(y_i|y_{<i})}{\pi_{ref}(y_i|y_{<i})}$ everywhere a q would go; works with DPO, KTO, NCA, and CE losses
- baseline : Math-Shepherd, AutoPSV, RLHFlow, open-source ORM/PRM models
- data : UltraInteract – 8 rollouts per instruction from Llama-3.1-8B-instruct
- evaluation : best-of-N (BoN) on MATH-500; generators: Mistral-Instruct-v0.3, Llama-3.1-8B-Instruct, Llama-3.1-70B-Instruct
- result : Better performance than Math-Shepherd, AutoPSV.
- contribution : generalizes the "your language model is secretly a Q-function" result; where that Q-learning thesis was limited to DPO, this shows it applies to most reward-modeling losses.
- etc. :
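The BoN evaluation above can be sketched in a few lines; `best_of_n` and `score_fn` are hypothetical names I'm using, where `score_fn` stands in for however the implicit PRM scores a full solution (e.g. sequence-level reward, or the min over per-step rewards):

```python
def best_of_n(candidates, score_fn):
    # Best-of-N selection: score every sampled solution with the reward
    # model and keep the highest-scoring one as the final answer.
    return max(candidates, key=score_fn)

# Toy usage: with length as a dummy scorer, the longest candidate wins.
print(best_of_n(["a", "bb", "ccc"], len))
```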
Details
- Parameterizing the reward as the log-ratio to the reference, $r_\theta(y) := \beta\log\frac{\pi_\theta(y)}{\pi_{ref}(y)}$, ensures that $q_t := \beta\log\frac{\pi_\theta(y_{\le t})}{\pi_{ref}(y_{\le t})}$ is exactly the exponential average of $r_\theta$ at step t
This means that if we train an ORM with r parameterized like that, the $q_t$ at each step $y_t$ is a Q value, like in a PRM, so we get dense process rewards $r_t = q_t - q_{t-1}$ for free
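A minimal sketch of the step above, assuming per-token log-probs under the trained model and the reference are available (function name and the β value are mine for illustration):

```python
def implicit_process_rewards(logp_theta, logp_ref, beta=0.05):
    """q_t = beta * sum_{i<=t} log(pi_theta/pi_ref) at token i;
    per-step process reward r_t = q_t - q_{t-1}."""
    q, r, running = [], [], 0.0
    for lt, lr in zip(logp_theta, logp_ref):
        prev = running
        running += beta * (lt - lr)  # accumulate the scaled log-ratio
        q.append(running)
        r.append(running - prev)     # the marginal gain at this step
    return q, r
```

Note that r_t reduces to the per-token log-ratio itself, which is why no process labels are needed: the trained ORM's token probabilities already encode the dense reward.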
- second proposition
    I don't fully understand this yet
- This also applies to the CE loss (so pairwise preference data isn't required, just correctness labels)
- result
- efficiency
- with majority vote
cf. UltraInteract Math