[197] Free Process Rewards without Process Labels

TL;DR

I read this because.. : Background reading before the PRIME paper. I have also been interested in implicit dense rewards because of my earlier research.
task : reward modeling
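problem : PRMs perform better than ORMs, but they are far more expensive to train because they need step-level labels.
idea : Can we train only an ORM and still get dense, PRM-like rewards out of it?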
input/output : prompt, y -> reward of y_t
architecture : Llama-3.1-8B-Instruct
objective : Parameterize the reward as a log-likelihood ratio: wherever q appears, plug in the token-level $\beta \log \frac{\pi_\theta(y_i|y_{<i})}{\pi_{ref}(y_i|y_{<i})}$, summed over $i \le t$. The loss can be DPO, KTO, NCA, or CE.
baseline : Math-Shepherd, AutoPSV, RLHFlow, open ORM/PRM models
data : UltraInteract, 8 rollouts per instruction sampled from Llama-3.1-8B-Instruct
evaluation : MATH-500 best-of-N (BoN), with generations from Mistral-Instruct-v0.3, Llama-3.1-8B-Instruct, Llama-3.1-70B-Instruct
result : Outperforms Math-Shepherd and AutoPSV.
contribution : The "DPO is secretly.. Q-learning" paper was limited to DPO, whereas this approach applies to most loss terms.
etc. :
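
To make the objective above concrete, here is a minimal sketch of how dense rewards could be read off an ORM trained with the log-ratio parameterization. It assumes the per-token log-probabilities under the trained model and the reference model are already available as plain lists. The function names, the `step_ends` argument, and the default `beta` are hypothetical placeholders for illustration, not taken from the paper's code.

```python
import math
from itertools import accumulate


def implicit_rewards(policy_logps, ref_logps, beta=0.05):
    """Return per-token rewards beta * log(pi_theta / pi_ref) and their running sum q_t.

    policy_logps / ref_logps: log-probabilities of each response token under
    the trained model and the reference model (assumed precomputed).
    beta is an arbitrary placeholder value.
    """
    token_rewards = [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]
    q_values = list(accumulate(token_rewards))
    return token_rewards, q_values


def step_rewards(q_values, step_ends):
    """Reward of a step = q at its last token minus q at the previous step's last token.

    step_ends: index of the last token of each reasoning step (hypothetical input).
    """
    rewards, prev = [], 0.0
    for end in step_ends:
        rewards.append(q_values[end] - prev)
        prev = q_values[end]
    return rewards


def ce_loss(outcome_reward, label):
    """Cross-entropy on the response-level reward with a 0/1 correctness label."""
    p = 1.0 / (1.0 + math.exp(-outcome_reward))
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


if __name__ == "__main__":
    policy = [-1.0, -2.0, -0.5, -1.5]
    ref = [-1.5, -1.0, -1.0, -1.0]
    tok, q = implicit_rewards(policy, ref, beta=1.0)
    print(tok, q, step_rewards(q, step_ends=[1, 3]))
    print(ce_loss(q[-1], label=1))
```

Only the response-level quantity `q[-1]` enters the training loss, so only outcome labels are needed. The per-token and per-step rewards come for free at inference time as differences of the running sum.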