
paper

TL;DR

  • I read this because : the world is going crazy over DeepSeek-R1
  • task : reasoning in LLMs
  • problem : MCTS-, PRM-, and ORM-based methods can’t match o1-level performance
  • idea : just do large-scale RL
  • architecture : DeepSeek-R1-Zero
  • objective : GRPO — a PPO variant that drops the critic and estimates the advantage from group-relative rewards (so GRPO is the objective itself; the group normalization is the trick)
  • baseline : OpenAI o1, OpenAI o1-mini, DeepSeek-V3
  • data : (RL) verifiable prompts / (cold-start SFT) thousands of long-CoT answers from DeepSeek-V3 prompting plus human cleanup / (SFT) QA data used for DeepSeek-V3, plus rejection-sampled prompts / (distillation) 800K samples from DeepSeek-R1
  • evaluation : AIME, Codeforces, GPQA Diamond, MATH-500, MMLU, SWE-bench
  • result : performance comparable to or better than o1
  • contribution : probably the first open model to match o1?

Details

  • benchmark thumbnail

DeepSeek-R1-Zero

  • A version trained with no SFT data at all — pure RL on the base model

  • GRPO
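As I understand it, the core of GRPO is the advantage estimate: sample a group of outputs per prompt, score them, and normalize each reward within its group, so no learned value network (critic) is needed. A minimal sketch in pure Python (function name is mine):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards):
    """Group-relative advantage used by GRPO: each sampled output's
    reward is normalized by the mean/std of the rewards within its
    group of completions for the same prompt."""
    m = mean(rewards)
    s = pstdev(rewards)  # population std over the group
    return [(r - m) / (s + 1e-8) for r in rewards]
```

This per-group z-score then plugs into the usual PPO-style clipped objective in place of a critic-based advantage.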

  • RM : rule-based rewards only (no neural reward model)

  • accuracy rewards: check that the final answer is in a specific, verifiable format; for LeetCode-style problems, a compiler runs predefined test cases

  • format rewards: reward for putting the thinking process between <think> and </think> tags
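A toy version of these two rule-based rewards (my own simplification — the real checks, e.g. running code against test cases, are heavier):

```python
import re

def format_reward(output: str) -> float:
    """1.0 if the reasoning is wrapped in <think>...</think>, else 0.0."""
    return 1.0 if re.search(r"<think>.*?</think>", output, re.DOTALL) else 0.0

def accuracy_reward(output: str, gold: str) -> float:
    """Rule-based accuracy check: extract a \\boxed{...} final answer
    and compare it to the reference string. (For coding tasks the
    paper compiles and runs test cases instead.)"""
    m = re.search(r"\\boxed\{(.+?)\}", output)
    return 1.0 if m and m.group(1).strip() == gold else 0.0
```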

  • Training template

  • performance : steady, incremental improvements with RL alone

As training progresses, response length grows, reflecting more reflection (revisiting and re-evaluating earlier steps).


An interesting finding is the “aha moment”: at some point during training, the model suddenly pauses, re-examines its initial approach, and changes strategy.


The funny part: they just ran RL, and reflection emerged on its own.

  • drawbacks : language mixing, poor readability

DeepSeek-R1: RL with cold start

Cold-start data comes from few-shot long-CoT prompting, DeepSeek-R1-Zero outputs post-processed by human annotators, and directly prompting for detailed answers with reflection and verification. The aim is both readability and performance improvements.

  • reasoning-oriented RL
    • focuses on coding, math, science, and logical reasoning
    • adds a language-consistency reward
  • rejection sampling and supervised finetuning
    • create SFT data with the model trained by the reasoning RL; not limited to reasoning — it also covers writing, role-play, and general-purpose tasks
    • reasoning SFT data : evaluated with a generative reward using DeepSeek-V3 as judge; language mixing and chaotic CoT are removed by filters
    • non-reasoning data : about 200K samples built by reusing DeepSeek-V3’s SFT data, generating CoT with DeepSeek-V3 where useful, and skipping CoT for queries that don’t need it
  • secondary RL
    • rule-based rewards for reasoning data; reward models (helpfulness, harmlessness, etc.) for general data
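The rejection-sampling step above boils down to: sample several completions per prompt from the RL checkpoint, keep only those that pass correctness/readability filters, and reuse them as SFT data. A sketch under my own naming (the real filters — language mixing, chaotic CoT — are stubbed out here):

```python
def rejection_sample(prompts, generate, is_correct, k=16):
    """For each prompt, draw k completions from the RL-tuned model and
    keep only those passing the verifier/filters; the survivors become
    supervised fine-tuning data."""
    sft_data = []
    for p in prompts:
        for completion in (generate(p) for _ in range(k)):
            if is_correct(p, completion):
                sft_data.append({"prompt": p, "completion": completion})
    return sft_data
```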

Distillation

Qwen and Llama models were distilled with 800K CoT samples from DeepSeek-R1. RL was not used — pure SFT.
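So "distillation" here is just supervised fine-tuning on teacher outputs: minimize the student's negative log-likelihood of DeepSeek-R1's tokens. A schematic of the loss (toy pure-Python version; real training would use a framework):

```python
import math

def distill_loss(student_logprobs):
    """Sequence-level distillation loss: average negative log-likelihood
    the student assigns to the teacher's tokens. No RL term involved."""
    return -sum(student_logprobs) / len(student_logprobs)
```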

Performance


distilled models


discussion

  • RL vs. distillation

Distillation performs better: distilling from R1 beats running large-scale RL directly on the smaller model.

  • unsuccessful attempts
    • PRM: automatically generating process labels is noisy, human labeling is hard to scale, and PRMs are open to reward hacking
    • MCTS: the token-level search space is too large, and restricting it risks getting stuck in local optima