image

paper

TL;DR

  • I read this because… : PPO / DPO Comparison Paper
  • task : RL
  • problem : Ablation of PPO, DPO, RM model size, RM data, prompt in PPO (what questions to ask and roll out), etc.
  • architecture : TULU 2 13B(LLama2 finetuned)
  • objective : PPO / DPO loss
  • baseline : TULU 2 SFT
  • data : preference data human-annotated(HH-RLHF, HelpSteer, Chatbot Arena 2023-4, AlpacaFarm human, PRM600k), Web-scraping(SHP-2, StackExchange), synthetic(Ultra-Feedback, Nectrar, Orca, Capybara, AlapacaFarm GPT-4)
  • evaluation : factuality(MMLU), reasoning(GSM8k, Big Bench Hard), truthfulness(TruthfulQA), coding(HumanEval+, MBPP+), safety(ToxiGen, XSTest), instruction folloiwng(AlpacaEval 1,2, IFEval)
  • result : 1) PPO is better than DPO 2) Bigger RM is better, but RM metrics are not necessarily good downstream 3) Good quality and quantity of synthetic preference data is good 4) Among them, Ultra-F with fine-grained grades (itemized scores) is good 5) RLHF increases Truthfulness, instruction following ability 6) PPO increases reasoning, coding, safety. 7) Prompts should be diversified to suit the downstream task, but it was not possible to generalize for small RMs.
  • contribution :
  • etc. :

Details

  • overall image

  • PPO vs DPO image

  • Preference data for DPO image

Coming out of the DPO as synthetic » human. Even when quantities are similar… Is synthetic more consistent than human? UltraFeedback (fine-grained, domain-specific scoring) worked best for us.

  • DPO vs PPO
image

Compared to DPO, the areas that stand out are reasoning, coding, safety Especially with crawled data like stackexchange, I didn’t increase my coding skills in DPO, but I did in PPO. PPOs seem to have better chain-of-thought skills, which may be due to increased comprehension reasoning.

  • reward model image

RM was run with datasets including UltraFeedback, where Mix performed best, but using more Reward datasets performed better on RM metrics. The evaluation of the reward model itself and the evaluation when going to the PPO did not correspond. Some metrics showed 13B Mix RM as the best, when in fact it was not. The 70B RM had significantly better RM metrics than the 13B model, but the performance in PPO was no better or about the same.

  • policy training prompt image

The prompt used for learning PPOs should be closer to downstream.