
paper

TL;DR

  • I read this because.. : a PPO / DPO comparison paper
  • task : RL
  • problem : ablations over PPO vs. DPO, reward-model size, RM data, and the PPO prompts (which prompts to roll out on), etc.
  • architecture : TULU 2 13B (Llama 2, fine-tuned)
  • objective : PPO / DPO loss
  • baseline : TULU 2 SFT
  • data : human-annotated preference data (HH-RLHF, HelpSteer, Chatbot Arena 2023-24, AlpacaFarm human, PRM800k), web-scraped (SHP-2, StackExchange), synthetic (UltraFeedback, Nectar, Orca, Capybara, AlpacaFarm GPT-4)
  • evaluation : factuality (MMLU), reasoning (GSM8k, BIG-Bench Hard), truthfulness (TruthfulQA), coding (HumanEval+, MBPP+), safety (ToxiGen, XSTest), instruction following (AlpacaEval 1 & 2, IFEval)
  • result : 1) PPO beats DPO 2) bigger RMs are better, but good RM metrics don't necessarily mean good downstream performance 3) large amounts of high-quality synthetic preference data work best 4) among these, UltraFeedback, which assigns fine-grained per-aspect scores, works best 5) what RLHF improves is truthfulness and instruction following 6) PPO additionally improved reasoning, coding, and safety 7) diversifying prompts toward the downstream tasks helps, but with a small RM the policy failed to generalize to them.
  • contribution :
  • etc. :
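The DPO objective being compared here is the standard one: push the policy's log-prob margin between the chosen and rejected response away from the frozen reference model. A minimal stdlib-only sketch (function name and the toy log-probs are my own, not from the paper):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss from summed token log-probs of the chosen (w)
    and rejected (l) responses under the policy and the frozen reference."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# At zero margin the loss is log 2; it drops as the policy prefers the
# chosen response more strongly than the reference does.
assert dpo_loss(-10.0, -12.0, -11.0, -11.0) < math.log(2.0)
```

Note this is entirely offline: no rollouts and no separate reward model, which is exactly the axis on which the paper contrasts it with PPO.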

Details

  • overall image

  • PPO vs DPO image

  • Preference data for DPO image

DPO์—์„œ synthetic » human ์œผ๋กœ ๋‚˜์˜ด. ์ˆ˜๋Ÿ‰์ด ๋น„์Šทํ•œ ๊ฒฝ์šฐ์—๋„ ๊ทธ๋ ‡๋„ค.. human๋ณด๋‹ค synthetic์ด ๋” ์ผ๊ด€์ ์ธ๊ฑด๊ฐ€? ๊ฐœ์ค‘์—๋Š” UltraFeedback (fine-grainedํ•˜๊ฒŒ ์˜์—ญ๋ณ„๋กœ ์ ์ˆ˜๋ฅผ ๋‚ธ ๊ฒƒ)์ด ๊ฐ€์žฅ ํšจ๊ณผ๊ฐ€ ์ข‹์•˜์Œ.

  • DPO vs PPO
image

DPO ๋Œ€๋น„ ๋‘๋“œ๋Ÿฌ์ง€๋Š” ๋ถ€๋ฌธ์€ reasoning, coding, safety ํŠนํžˆ stackexchange ๊ฐ™์€ crawled data๊ฐ€ DPO์—์„œ๋Š” coding ์‹ค๋ ฅ์„ ๋Š˜๋ฆฌ์ง€ ๋ชปํ–ˆ๋Š”๋ฐ PPO๋Š” ๋Š˜๋ ธ์Œ. PPO๊ฐ€ chain-of-thought ๋Šฅ๋ ฅ์ด ๋” ๋›ฐ์–ด๋‚œ ๊ฒƒ ๊ฐ™๊ณ  ์ด๋กœ ์ดํ•ด reasoning ๋Šฅ๋ ฅ์ด ๋Š˜์–ด๋‚œ๊ฒŒ ์•„๋‹๊นŒ ํ•˜๋Š” ๋ถ„์„

  • reward model image

Mix๊ฐ€ ๊ฐ€์žฅ ์„ฑ๋Šฅ์ด ์ข‹์•˜๋˜ UltraFeedback์„ ํฌํ•จํ•œ ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ RM์„ ํ•œ๊ฑด๋ฐ ๋” ๋งŽ์€ Reward dataset์„ ์“ฐ๋Š”๊ฒŒ RM ์ง€ํ‘œ ์ƒ ์„ฑ๋Šฅ์ด ์ข‹์•˜์Œ. reward model ์ž์ฒด์˜ ํ‰๊ฐ€๋ž‘ PPO ๊นŒ์ง€ ๊ฐ”์„ ๋•Œ ํ‰๊ฐ€๊ฐ€ ์ƒ์‘ํ•˜์ง€ ์•Š์•˜์Œ.
13B Mix RM์ด ๊ฐ€์žฅ ์ข‹๊ฒŒ ๋‚˜์˜จ ์ง€ํ‘œ๋„ ์žˆ์—ˆ๋Š”๋ฐ ์‹ค์ œ๋กœ ๊ทธ๋ ‡์ง€ ์•Š์•˜์Œ. 70B RM์ด 13B๋ชจ๋ธ ๋ณด๋‹ค rm ์ง€ํ‘œ๋Š” ์ƒ๋‹นํžˆ ์ข‹์•˜๋Š”๋ฐ, PPO์—์„œ์˜ ์„ฑ๋Šฅ์€ ๊ฐœ์„ ์ด ์—†๊ฑฐ๋‚˜ ๊ฑฐ์˜ ๋น„์Šทํ–ˆ์Œ.

  • policy training prompt image

PPO ํ•™์Šต ์‹œ ์‚ฌ์šฉ๋˜๋Š” prompt๋Š” downstream์— ๊ฐ€๊นŒ์šธ ์ˆ˜๋ก ์ข‹์•˜์Œ.