TL;DR
- I read this because.. : Background Tea
- task : RL
- problem : RLHF (e.g., with PPO) needs to fit a separate reward model and then run RL fine-tuning, which becomes complex and unstable as the model grows.
- IDEA : Can we optimize the policy directly on preference data, treating the policy's log-ratios as an implicit reward, without training a separate reward model?
- input/output : {prompt $x$, preferred $y_w$, dispreferred $y_l$} -> policy $\pi_\theta$
- architecture : GPT2-Large
- objective : proposed (DPO, a binary cross-entropy loss on preference pairs).
- baseline : zero-shot GPT-J, SFT, Preferred-FT, Unlikelihood, PPO, PPO-GT, Best of N baseline (returns the most rewarding of N SFT samples)
- data : IMDb , Reddit TL;DR
- evaluation : GPT-4 Evaluator
- result : performance similar to or better than the baselines
- contribution :
- etc. : Professor Finn, I see you here ..!
Details
Preliminaries
SFT: create $\pi^{SFT}$ by supervised fine-tuning on a small amount of high-quality data
Reward modeling (Bradley-Terry model)
Framing this as a binary classification problem on preference pairs gives an MLE objective for the reward model
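The (missing) equations here are presumably the Bradley-Terry preference model and the resulting reward-modeling loss, reconstructed in the DPO paper's notation ($y_w$ = preferred, $y_l$ = dispreferred):

$$p^*(y_1 \succ y_2 \mid x) = \frac{\exp(r^*(x, y_1))}{\exp(r^*(x, y_1)) + \exp(r^*(x, y_2))} = \sigma\big(r^*(x, y_1) - r^*(x, y_2)\big)$$

$$\mathcal{L}_R(r_\phi, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]$$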
- RL fine-tuning phase
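The objective of this phase is presumably the standard KL-constrained RLHF objective, with $\pi_{\mathrm{ref}}$ the SFT policy and $\beta$ controlling deviation from it:

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\big[r_\phi(x, y)\big] - \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x)\big]$$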
DPO
Rewriting the above objective gives the optimal policy in closed form
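The closed-form optimum (with $Z(x)$ the partition function), and the reward re-expressed in terms of the policy, as reconstructed from the DPO derivation:

$$\pi_r(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x) \exp\!\Big(\tfrac{1}{\beta} r(x, y)\Big), \qquad Z(x) = \sum_y \pi_{\mathrm{ref}}(y \mid x) \exp\!\Big(\tfrac{1}{\beta} r(x, y)\Big)$$

$$r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)$$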
What does a partition function do for a probability distribution? It normalizes it: $Z(x)$ sums over all responses so the policy sums to 1, and it depends only on $x$ (not $y$), which is why it cancels below.
For the optimal policy, the Bradley-Terry model gives the following preference probability
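Substituting the reward above into Bradley-Terry, the intractable $\beta \log Z(x)$ terms cancel (reconstructed):

$$p^*(y_1 \succ y_2 \mid x) = \sigma\!\left(\beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\mathrm{ref}}(y_1 \mid x)} - \beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{\mathrm{ref}}(y_2 \mid x)}\right)$$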
From a policy perspective, we have human preference data, so we can express this as an MLE objective by using the negative log-likelihood of the preference model under the policy's implicit reward.
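The resulting MLE objective is the DPO loss (reconstructed from the paper's notation):

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

A minimal per-pair sketch in plain Python (function name and scalar inputs are my own; real implementations batch this over summed token log-probs):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: -log sigmoid of the beta-scaled log-ratio margin.

    logp_*     : summed token log-probs under the policy for the
                 preferred (w) / dispreferred (l) response.
    ref_logp_* : same quantities under the frozen reference policy.
    """
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Numerically stable -log(sigmoid(logits)).
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

# When the policy equals the reference, the margin is 0 and the loss is log 2.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # → 0.6931... (= log 2)
```

The loss drops below $\log 2$ exactly when the policy's log-ratio margin over the reference favors $y_w$, which is what pushes probability mass toward preferred responses.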