TL;DR
- I read this because : the world is going crazy over DeepSeek-R1
- task : reasoning in LLMs
- problem : MCTS-, PRM-, and ORM-based methods can't match o1-level performance
- idea : let's just do large-scale RL.
- architecture : DeepSeek-R1-Zero
- objective : GRPO, a PPO-style objective that replaces the learned critic with a group-relative advantage (so GRPO is the objective itself, not just a trick)
- baseline : OpenAI o1, OpenAI o1-mini, DeepSeek-V3
- data : (RL) verifiable prompts; (cold-start SFT) thousands of long-CoT answers collected by prompting DeepSeek-V3; (SFT) non-reasoning QA data reused from DeepSeek-V3 plus rejection-sampled reasoning data; (distil) 800K CoT samples from DeepSeek-R1
- evaluation : AIME, Codeforces, GPQA diamond, MATH-500, MMLU, SWE-bench
- Result : on par with o1
- contribution : probably the first open model to match o1?
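To make the GRPO point above concrete: GRPO samples a group of G responses per prompt, scores each one, and uses the group-normalized reward as the advantage for every token of that response, with no learned value function. A minimal sketch; the function name and scalar-reward setup are illustrative, not the paper's code:

```python
def grpo_advantages(rewards):
    """Group-relative advantages: normalize each response's scalar reward
    by the mean/std of its sampling group (one group = G responses to
    the same prompt). This replaces PPO's learned critic."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    if std == 0:
        # all responses scored equally: no relative signal for this group
        return [0.0] * g
    return [(r - mean) / std for r in rewards]
```

Each advantage is then applied uniformly to all tokens of the corresponding response inside the clipped PPO-style surrogate loss.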
Details
- benchmark thumbnail
DeepSeek-R1-Zero
A version trained with no SFT data at all: pure RL from the base model
GRPO
RM
accuracy rewards: check the final answer, given in a specified format, against ground truth; for LeetCode-style problems a compiler runs the code against test cases
format rewards: putting the thinking process between <think> … </think> tags
Training template
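A toy version of these rule-based rewards, assuming an exact-match accuracy check and the <think> tag format (the reward weights and function name are made up for illustration):

```python
import re

def rule_based_reward(completion, gold_answer):
    """Combine the two rule-based signals: a format reward for wrapping
    reasoning in <think>...</think>, and an accuracy reward when the
    text after the tags exactly matches the verifiable gold answer."""
    reward = 0.0
    m = re.search(r"<think>(.*?)</think>\s*(.*)", completion, re.DOTALL)
    if m:
        reward += 0.5  # format reward: reasoning is inside the tags
        if m.group(2).strip() == gold_answer:
            reward += 1.0  # accuracy reward: exact-match check
    return reward
```

In practice the accuracy check is task-specific (math answer parsing, running code against test cases), but the rule-based spirit is the same.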
- performance: steady performance improvements with RL alone.
As training progresses, response length grows because reflection (revisiting and re-evaluating earlier steps) increases.
An interesting one is the "aha moment," where the model suddenly pauses mid-generation and changes its initial approach.
Funny that plain RL produced reflection on its own.
- drawbacks : language mixing, poor readability.
DeepSeek-R1: rl with cold start
Cold-start data: few-shot long-CoT prompting, DeepSeek-R1-Zero outputs post-processed by human annotators, and directly prompting the model to generate detailed answers with reflection and verification. The aim is readability and performance improvements.
- reasoning oriented rl
- Focus on coding, math, science, and logical thinking
- Add language consistency reward
- rejection sampling and supervised finetuning
- Create SFT data with the checkpoint from the reasoning RL stage. Not limited to reasoning; it is also adapted to writing, role-play, and general-purpose tasks.
- reasoning sft data :
- Evaluated with a generative reward, using DeepSeek-V3 as judge
- Language-mixed and chaotic CoT samples cleared by filtering
- non-reasoning data:
- Built 200K training samples by reusing DeepSeek-V3's SFT data, generating CoT with DeepSeek-V3, and filtering out CoT for queries that don't need it.
- secondary rl
- Rule-based rewards for reasoning data; for general data, RMs are used (helpfulness, harmlessness, etc.)
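The rejection-sampling step above can be sketched roughly as follows; every function name and filter here is an assumption for illustration, not the paper's actual pipeline:

```python
def build_sft_data(prompts, generate, is_correct, is_readable, n_samples=4):
    """Rejection sampling for the SFT set: sample several completions per
    prompt from the RL checkpoint, keep only verified-correct and readable
    ones (dropping language-mixed or chaotic CoT, as described above)."""
    sft = []
    for prompt in prompts:
        for _ in range(n_samples):
            completion = generate(prompt)
            if is_correct(prompt, completion) and is_readable(completion):
                sft.append((prompt, completion))
                break  # one accepted sample per prompt in this sketch
    return sft
```

For general (non-verifiable) data the `is_correct` check would be replaced by a generative judge such as DeepSeek-V3, per the notes above.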
Distillation
Qwen and Llama models were distilled with 800K CoT samples from DeepSeek-R1 (SFT only; RL was not used).
Performance
distil models
Discussion
- rl vs distil
distillation performs better: distilling R1 into a small model beats running large-scale RL on that small model directly
- unsuccessful attempts
- PRM: Automatically getting process labels is noisy, hard to scale up if done by humans, and open to reward hacking.
- MCTS: Too large a search space and risk of falling into local optima