TL;DR
- I read this because.. : Background Tea
- task : RL
- problem : RLHF (e.g., with PPO) needs to fit a separate reward model and then run RL fine-tuning, which becomes complex and unstable as the model grows.
- IDEA : Can we optimize the policy directly on preference data, treating the policy's log-ratios as an implicit reward, without training a separate reward model?
- input/output : {prompt $x$, preferred $y_w$, dispreferred $y_l$} -> policy $\pi_\theta$
- architecture : GPT2-Large
- objective : proposed (DPO, a binary cross-entropy loss on preference pairs).
- baseline : zero-shot GPT-J, SFT, Preferred-FT, Unlikelihood, PPO, PPO-GT, Best of N baseline (returns the most rewarding of N SFT samples)
- data : IMDb , Reddit TL;DR
- evaluation : GPT-4 Evaluator
- result : performance similar to or better than the baselines
- contribution :
- etc. : Professor Finn, I see you here ..!
Details
Preliminaries
SFT: create $\pi^{SFT}$ by supervised fine-tuning on a small amount of high-quality data
Reward modeling (Bradley-Terry model)
Framing this as a binary classification problem on preference pairs gives an MLE objective for the reward model
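The (missing) equations here are presumably the Bradley-Terry preference model and the resulting reward-modeling loss, reconstructed in the DPO paper's notation ($y_w$ = preferred, $y_l$ = dispreferred):

$$p^*(y_1 \succ y_2 \mid x) = \frac{\exp(r^*(x, y_1))}{\exp(r^*(x, y_1)) + \exp(r^*(x, y_2))} = \sigma\big(r^*(x, y_1) - r^*(x, y_2)\big)$$

$$\mathcal{L}_R(r_\phi, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]$$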
- RL fine-tuning phase
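The objective of this phase is presumably the standard KL-constrained RLHF objective, with $\pi_{\mathrm{ref}}$ the SFT policy and $\beta$ controlling deviation from it:

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\big[r_\phi(x, y)\big] - \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x)\big]$$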
DPO
Rewriting the above objective gives the optimal policy in closed form
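The closed-form optimum (with $Z(x)$ the partition function), and the reward re-expressed in terms of the policy, as reconstructed from the DPO derivation:

$$\pi_r(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x) \exp\!\Big(\tfrac{1}{\beta} r(x, y)\Big), \qquad Z(x) = \sum_y \pi_{\mathrm{ref}}(y \mid x) \exp\!\Big(\tfrac{1}{\beta} r(x, y)\Big)$$

$$r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)$$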
What does a partition function do for a probability distribution? It normalizes it: $Z(x)$ sums over all responses so the policy sums to 1, and it depends only on $x$ (not $y$), which is why it cancels below.
For the optimal policy, the Bradley-Terry model gives the following preference probability
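Substituting the reward above into Bradley-Terry, the intractable $\beta \log Z(x)$ terms cancel (reconstructed):

$$p^*(y_1 \succ y_2 \mid x) = \sigma\!\left(\beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\mathrm{ref}}(y_1 \mid x)} - \beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{\mathrm{ref}}(y_2 \mid x)}\right)$$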
From a policy perspective, we have human preference data, so we can express this as an MLE objective by using the negative log-likelihood of the preference model under the policy's implicit reward.
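The resulting MLE objective is the DPO loss (reconstructed from the paper's notation):

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

A minimal per-pair sketch in plain Python (function name and scalar inputs are my own; real implementations batch this over summed token log-probs):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: -log sigmoid of the beta-scaled log-ratio margin.

    logp_*     : summed token log-probs under the policy for the
                 preferred (w) / dispreferred (l) response.
    ref_logp_* : same quantities under the frozen reference policy.
    """
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Numerically stable -log(sigmoid(logits)).
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

# When the policy equals the reference, the margin is 0 and the loss is log 2.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # → 0.6931... (= log 2)
```

The loss drops below $\log 2$ exactly when the policy's log-ratio margin over the reference favors $y_w$, which is what pushes probability mass toward preferred responses.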