TL;DR
- I read this because : the world is going crazy over DeepSeek-R1
- task : reasoning in LLMs
- problem : MCTS-, PRM-, and ORM-based methods can't match o1-level performance
- idea : let's just do large-scale RL.
- architecture : DeepSeek-R1-Zero
- objective : GRPO, a PPO-style objective that replaces the learned critic with a group-relative advantage (so GRPO is the objective itself, not just a trick)
- baseline : OpenAI o1, OpenAI o1-mini, DeepSeek-V3
- data : (RL) verifiable prompts; (cold-start SFT) thousands of long-CoT answers collected by prompting DeepSeek-V3; (SFT) non-reasoning QA data reused from DeepSeek-V3 plus rejection-sampled reasoning data; (distil) 800K CoT samples from DeepSeek-R1
- evaluation : AIME, Codeforces, GPQA diamond, MATH-500, MMLU, SWE-bench
- Result : on par with o1
- contribution : probably the first open model to match o1?
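To make the GRPO point above concrete: GRPO samples a group of G responses per prompt, scores each one, and uses the group-normalized reward as the advantage for every token of that response, with no learned value function. A minimal sketch; the function name and scalar-reward setup are illustrative, not the paper's code:

```python
def grpo_advantages(rewards):
    """Group-relative advantages: normalize each response's scalar reward
    by the mean/std of its sampling group (one group = G responses to
    the same prompt). This replaces PPO's learned critic."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    if std == 0:
        # all responses scored equally: no relative signal for this group
        return [0.0] * g
    return [(r - mean) / std for r in rewards]
```

Each advantage is then applied uniformly to all tokens of the corresponding response inside the clipped PPO-style surrogate loss.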
Details
- benchmark thumbnail
DeepSeek-R1-Zero
A version trained with no SFT data at all: pure RL from the base model
GRPO
RM
accuracy rewards: check the final answer, given in a specified format, against ground truth; for LeetCode-style problems a compiler runs the code against test cases
format rewards: putting the thinking process between <think> … </think> tags
Training template
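A toy version of these rule-based rewards, assuming an exact-match accuracy check and the <think> tag format (the reward weights and function name are made up for illustration):

```python
import re

def rule_based_reward(completion, gold_answer):
    """Combine the two rule-based signals: a format reward for wrapping
    reasoning in <think>...</think>, and an accuracy reward when the
    text after the tags exactly matches the verifiable gold answer."""
    reward = 0.0
    m = re.search(r"<think>(.*?)</think>\s*(.*)", completion, re.DOTALL)
    if m:
        reward += 0.5  # format reward: reasoning is inside the tags
        if m.group(2).strip() == gold_answer:
            reward += 1.0  # accuracy reward: exact-match check
    return reward
```

In practice the accuracy check is task-specific (math answer parsing, running code against test cases), but the rule-based spirit is the same.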
- performance: steady performance improvements with RL alone.
As training progresses, response length grows because reflection (revisiting and re-evaluating earlier steps) increases.
An interesting one is the "aha moment," where the model suddenly pauses mid-generation and changes its initial approach.
Funny that plain RL produced reflection on its own.
- drawbacks : language mixing, poor readability.
DeepSeek-R1: rl with cold start
Cold-start data: few-shot long-CoT prompting, DeepSeek-R1-Zero outputs post-processed by human annotators, and directly prompting the model to generate detailed answers with reflection and verification. The aim is readability and performance improvements.
- reasoning oriented rl
- Focus on coding, math, science, and logical thinking
- Add language consistency reward
- rejection sampling and supervised finetuning
- Create SFT data with the checkpoint from the reasoning RL stage. Not limited to reasoning; it is also adapted to writing, role-play, and general-purpose tasks.
- reasoning sft data :
- Evaluated with a generative reward, using DeepSeek-V3 as judge
- Language-mixed and chaotic CoT samples cleared by filtering
- non-reasoning data:
- Built 200K training samples by reusing DeepSeek-V3's SFT data, generating CoT with DeepSeek-V3, and filtering out CoT for queries that don't need it.
- secondary rl
- Rule-based rewards for reasoning data; for general data, RMs are used (helpfulness, harmlessness, etc.)
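The rejection-sampling step above can be sketched roughly as follows; every function name and filter here is an assumption for illustration, not the paper's actual pipeline:

```python
def build_sft_data(prompts, generate, is_correct, is_readable, n_samples=4):
    """Rejection sampling for the SFT set: sample several completions per
    prompt from the RL checkpoint, keep only verified-correct and readable
    ones (dropping language-mixed or chaotic CoT, as described above)."""
    sft = []
    for prompt in prompts:
        for _ in range(n_samples):
            completion = generate(prompt)
            if is_correct(prompt, completion) and is_readable(completion):
                sft.append((prompt, completion))
                break  # one accepted sample per prompt in this sketch
    return sft
```

For general (non-verifiable) data the `is_correct` check would be replaced by a generative judge such as DeepSeek-V3, per the notes above.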
Distillation
Qwen and Llama models were distilled with 800K CoT samples from DeepSeek-R1 (SFT only; RL was not used).
Performance
distil models
Discussion
- rl vs distil
distillation performs better: distilling R1 into a small model beats running large-scale RL on that small model directly
- unsuccessful attempts
- PRM: Automatically getting process labels is noisy, hard to scale up if done by humans, and open to reward hacking.
- MCTS: Too large a search space and risk of falling into local optima