
paper

TL;DR

  • I read this because.. : Background Tea
  • task : RL
  • problem : RLHF (PPO-style fine-tuning) needs to fit a separate reward model and then run RL against it, which becomes costly and unstable as models grow.
  • IDEA : can we go from preference data to a policy directly, treating the LM itself as the reward model, without training a separate one?
  • input/output : preference pairs {x, y_w, y_l} -> policy $\pi_\theta$
  • architecture : GPT2-Large
  • objective : proposed (the DPO loss: binary cross-entropy on preference pairs).
  • baseline : zero-shot GPT-J, SFT, Preferred-FT, Unlikelihood, PPO, PPO-GT, Best of N baseline (returns the most rewarding of N SFT responses)
  • data : IMDb , Reddit TL;DR
  • evaluation : GPT-4 Evaluator
  • result : performance similar to or better than the baselines
  • contribution : a simple classification-style objective that matches RLHF without a reward model or an RL loop
  • etc. : Professor Finn, I see you here ..!

Details

Preliminaries

  • SFT : create $\pi^{SFT}$ by supervised fine-tuning on a small amount of good-quality data

  • Reward modeling (Bradley-Terry model)

$$p^*(y_1 \succ y_2 \mid x) = \frac{\exp(r^*(x, y_1))}{\exp(r^*(x, y_1)) + \exp(r^*(x, y_2))}$$

If we frame this as a binary classification problem, the reward model is trained with the negative log-likelihood:

$$\mathcal{L}_R(r_\phi, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\big]$$

  • RL fine-tuning phase

$$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] - \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y \mid x)\,\|\,\pi_{\mathrm{ref}}(y \mid x)\big]$$
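The reward-modeling step can be sketched in a few lines of plain Python. This is a toy single-pair version (scalar rewards `r_w`, `r_l` are hypothetical stand-ins for a model's scores on the chosen/rejected responses):

```python
import math

def sigmoid(z):
    # logistic function sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + math.exp(-z))

def reward_model_loss(r_w, r_l):
    # Bradley-Terry NLL for one pair: -log sigma(r(x, y_w) - r(x, y_l))
    return -math.log(sigmoid(r_w - r_l))
```

When the two rewards are equal the loss is log 2 ≈ 0.693, and it falls as the preferred response's reward rises above the rejected one's.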

DPO

If we rewrite the RL fine-tuning objective above, the optimal policy has a closed form:

$$\pi_r(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x) \exp\!\Big(\frac{1}{\beta} r(x, y)\Big), \qquad Z(x) = \sum_y \pi_{\mathrm{ref}}(y \mid x) \exp\!\Big(\frac{1}{\beta} r(x, y)\Big)$$

Rearranging for the reward:

$$r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)$$

What does a partition function do for a probability distribution? It normalizes the unnormalized weights so they sum to 1. Note that $Z(x)$ depends only on $x$, not on $y$, which is why it cancels in pairwise comparisons.

For the optimal policy, the Bradley-Terry model gives preferences that depend only on policy log-ratios, since $Z(x)$ cancels:

$$p^*(y_1 \succ y_2 \mid x) = \frac{1}{1 + \exp\!\Big(\beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{\mathrm{ref}}(y_2 \mid x)} - \beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\mathrm{ref}}(y_1 \mid x)}\Big)}$$
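A quick numeric check of the $Z(x)$ cancellation, using made-up toy numbers for three candidate responses: the preference computed from raw rewards equals the one computed from the optimal policy's log-ratios, with no partition function needed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

beta = 0.5
# toy reference probabilities and rewards for 3 candidate responses (made-up numbers)
pi_ref = [0.5, 0.3, 0.2]
r = [1.0, 0.2, -0.5]

# closed-form optimal policy: pi*(y|x) proportional to pi_ref(y|x) * exp(r/beta)
unnorm = [p * math.exp(ri / beta) for p, ri in zip(pi_ref, r)]
Z = sum(unnorm)
pi_star = [u / Z for u in unnorm]

# Bradley-Terry preference of y_0 over y_1 from raw rewards...
p_rewards = sigmoid(r[0] - r[1])
# ...and from beta * log-ratios of the optimal policy: Z(x) cancels out
p_ratios = sigmoid(beta * (math.log(pi_star[0] / pi_ref[0])
                           - math.log(pi_star[1] / pi_ref[1])))
print(abs(p_rewards - p_ratios) < 1e-9)  # True: same preference, no Z needed
```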

From a policy perspective, since we have human preference data, we can express this as an MLE objective over the policy itself, the DPO loss:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big]$$
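The MLE objective can be sketched for a single preference pair in plain Python. The log-probabilities here are hypothetical summed per-response values; a real implementation (e.g. in PyTorch) would compute them per token and batch over pairs:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # implicit reward margin: beta * [(log pi/pi_ref)_w - (log pi/pi_ref)_l]
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigma(margin)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At initialization, $\pi_\theta = \pi_{\mathrm{ref}}$, so the margin is 0 and the loss is log 2; raising the chosen response's log-prob relative to the reference lowers the loss.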

What does the DPO update do? The gradient of the loss is

$$\nabla_\theta \mathcal{L}_{\mathrm{DPO}} = -\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)\Big(\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)\Big)\Big]$$

where $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ is the implicit reward. The update raises the likelihood of $y_w$ and lowers that of $y_l$, weighted by how badly the implicit reward currently misranks the pair.