
paper

TL;DR

  • I read this because.. : curious about post-training
  • task : LLM
  • problem :
  • idea :
  • input/output :
  • architecture :
  • objective :
  • baseline :
  • data :
  • evaluation :
  • result :
  • contribution :
  • etc. :

Details

Post-training

  • SFT
    • 1.5M์˜ ๋‹ค์–‘ํ•œ ๋„๋ฉ”์ธ์— ๋Œ€ํ•œ instruction tuning data๋ฅผ ๋ชจ์Œ
    • Reasoning data
      • Generated with an internal DeepSeek-R1 model.
      • However, R1 outputs tend to overthink, have poor formatting, and be excessively long, so the goal is to balance R1's high accuracy with the conciseness of ordinary, well-formatted reasoning data.
      • To do this, they build Expert models for specific domains such as code, math, and general reasoning (trained with SFT + RL) and use them as data generators
        • ํ•™์Šต์€ ๋‘๊ฐœ์˜ ๋‹ค๋ฅธ SFT sample์„ ์ƒ์„ฑํ•˜๋Š”๋ฐ ๋ชฉํ‘œ. ํ•˜๋‚˜๋Š” <problem, original response> <system prompt, problem, R1 response>
        • ์ด๋•Œ system prompt๋Š” reflection๊ณผ verification์„ ํ•  ์ˆ˜ ์žˆ๋„๋ก ์„ฌ์„ธํ•˜๊ฒŒ ๋””์ž์ธํ•จ
      • RL phase์—์„œ๋Š” model์ด high temperature sampling์„ ํ•˜์—ฌ system prompt์—†์ด๋„r1-generated, original data ๋‘˜๋‹ค ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•จ.
      • RL์„ ํ•˜๊ณ  ๋‚˜์„œ rejection sampling์„ ํ•˜์—ฌ high quality sft๋งŒ ๋‚จ๊น€.
    • Non-reasoning data
      • Generated with DeepSeek-V2.5, with human annotators verifying correctness
    • SFT – two epochs
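The generate-then-filter step above (expert model → RL → rejection sampling) could be sketched roughly as follows. This is a minimal sketch of the mechanics as I understand them; `generate` and `score` are hypothetical stand-ins for the expert model and a quality/correctness checker, not code from the paper:

```python
# Rejection sampling for SFT data curation (sketch, assumptions labeled above):
# sample several candidate responses per prompt from the RL-tuned expert,
# score each candidate, and keep only the best one if it clears a threshold.

def rejection_sample(prompt, generate, score, n_candidates=4, threshold=0.8):
    candidates = [generate(prompt) for _ in range(n_candidates)]
    # Pick the highest-scoring candidate for this prompt.
    best_score, best = max((score(prompt, c), c) for c in candidates)
    # Discard the prompt entirely if even the best candidate is low quality.
    return best if best_score >= threshold else None
```

Run per prompt over the whole corpus, the surviving <prompt, best response> pairs form the final SFT set.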
  • Reinforcement Learning
    • Reward Model
      • Rule-based RM
      • math: the answer must follow a required format (final answer in a box), then checked with rules / code: a compiler runs test cases (LeetCode-style)
      • Model-based RM
        • used for free-form ground-truth answers
        • Trained from a DeepSeek-V3 SFT checkpoint. The RM generates a CoT before emitting the reward, which reportedly helps mitigate reward hacking
      • GRPO
        • critic model์ด ์—†์ด group์œผ๋กœ ๋ฌถ์—ฌ์„œ ๊ณ„์‚ฐ ํ•˜๋Š” GRPO๋กœ ํ•™์Šต
        • (figure: GRPO objective and group-relative advantage equations)
        • $o_i$ are the samples drawn from the old policy
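For reference, the GRPO objective as written in the DeepSeekMath/DeepSeek-V3 papers (reproduced from those papers, not derived here), with $A_i$ the group-normalized advantage computed from the rewards $r_1, \dots, r_G$ of the $G$ sampled outputs:

```latex
\mathcal{J}_{\mathrm{GRPO}}(\theta)
  = \mathbb{E}\!\left[ q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q) \right]
    \frac{1}{G} \sum_{i=1}^{G}
    \left(
      \min\!\left(
        \frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)} A_i,\
        \operatorname{clip}\!\left(
          \frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\
          1-\varepsilon,\ 1+\varepsilon
        \right) A_i
      \right)
      - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left(\pi_{\theta} \,\|\, \pi_{\mathrm{ref}}\right)
    \right),
\qquad
A_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}
```

Normalizing within the group replaces the critic's value estimate as the baseline, which is why no critic model is needed.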
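The rule-based math reward ("answer in a box", then rule checking) might look like this minimal sketch; this is my assumption of the mechanics, not the paper's actual checker:

```python
import re

# Rule-based reward for math (sketch, assumptions labeled above): the response
# must put its final answer in a \boxed{...} block; the reward is 1.0 only if
# that boxed answer matches the reference answer after trimming whitespace.

def math_rule_reward(response: str, reference: str) -> float:
    match = re.search(r"\\boxed\{([^{}]*)\}", response)
    if match is None:
        return 0.0  # wrong format -> no reward, which also enforces formatting
    answer = match.group(1).strip()
    return 1.0 if answer == reference.strip() else 0.0
```

A real checker would normalize mathematically equivalent answers (e.g. `1/2` vs `0.5`); exact string match is the simplest possible rule.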

Ablations

  • distillation from DeepSeek-R1

(figure: ablation results for R1 distillation)