[203] DeepSeek-V3 Technical Report

TL;DR

SFT
Collected 1.5M instruction-tuning instances spanning multiple domains
- Reasoning data
Generated with an internal DeepSeek-R1 model.
R1 data is highly accurate but suffers from overthinking, poor formatting, and excessive length, so the goal is to balance R1's accuracy with the clarity and conciseness of regularly formatted reasoning data.
To do this, an expert model is trained with SFT + RL for each domain (code, math, general reasoning) and used as a data generator for the final model.
Two types of SFT samples are generated for each instance: <problem, original response> and <system prompt, problem, R1 response>.
The system prompt is carefully designed to guide the model toward responses with reflection and verification.
In the RL phase, high-temperature sampling lets the model produce responses that combine patterns from both the R1-generated and original data, even without a system prompt.
After RL, rejection sampling with the expert models keeps only high-quality SFT data.
- Non-reasoning data
Generated with DeepSeek-V2.5 and verified for accuracy by human annotators.
- SFT – two epochs
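The reasoning-data recipe above (two sample types per instance, then rejection sampling) can be sketched roughly as follows. This is a minimal illustration, not the report's pipeline: `REFLECTION_SYSTEM_PROMPT`, `SFTSample`, `build_sft_pair`, `rejection_sample`, and the scoring function with its threshold are all hypothetical names and values, since the notes do not give the actual prompt text or filtering criteria.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical placeholder: the real system prompt is not given in these notes.
REFLECTION_SYSTEM_PROMPT = "Reflect on and verify your reasoning before answering."


@dataclass
class SFTSample:
    problem: str
    response: str
    system_prompt: Optional[str] = None


def build_sft_pair(problem: str, original_response: str, r1_response: str) -> List[SFTSample]:
    """Build the two sample types for one instance:
    <problem, original response> and <system prompt, problem, R1 response>."""
    return [
        SFTSample(problem=problem, response=original_response),
        SFTSample(problem=problem, response=r1_response,
                  system_prompt=REFLECTION_SYSTEM_PROMPT),
    ]


def rejection_sample(candidates: List[str],
                     score_fn: Callable[[str], float],
                     threshold: float) -> List[str]:
    """Keep only candidates whose score reaches the threshold.
    score_fn and threshold are assumptions standing in for the real quality check."""
    return [c for c in candidates if score_fn(c) >= threshold]
```

The first sample type carries no system prompt, which matches the note that the model should still produce R1-style patterns without one after RL.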
Reinforcement Learning
- Reward Model
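  - Rule-based RM
math: the final answer must be given in a designated format (in a box) so it can be checked by rules / code: a compiler runs test cases to produce feedback (e.g. LeetCode problems)
  - Model-based RM
For questions with free-form ground-truth answers, the reward model judges whether the response matches the ground truth.
Trained from DeepSeek-V3 SFT checkpoints. The preference data includes the chain-of-thought leading to the reward, not only the final reward, which helps mitigate reward hacking.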
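As an illustration of the rule-based reward for math, a boxed-answer check might look like the sketch below. The function names, the exact-match comparison, and the 0/1 reward values are assumptions for illustration; the notes only say that the answer is boxed and then verified by rules.

```python
import re
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the content of the last boxed answer in text, or None if absent.
    Simplification: nested braces inside the box are not handled."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def math_reward(response: str, ground_truth: str) -> float:
    """Assumed 0/1 reward: 1.0 only if a boxed answer exists and matches exactly."""
    answer = extract_boxed(response)
    if answer is None:
        return 0.0  # missing or malformed format earns no reward
    return 1.0 if answer == ground_truth.strip() else 0.0
```

A real checker would likely normalize equivalent forms (for example `0.5` versus `1/2`) before comparing; exact string matching is used here only to keep the sketch short.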
- GRPO
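Trained with GRPO (Group Relative Policy Optimization), which drops the critic model and instead estimates the baseline from the scores of a group of sampled outputs.
$o_i$ are the outputs sampled from the old policy $\pi_{\theta_{old}}$ for each question.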
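To make the group-based baseline concrete: for a group of $G$ outputs with rewards $r_1, \dots, r_G$, GRPO uses the advantage $A_i = (r_i - \mathrm{mean}(\{r_1, \dots, r_G\})) / \mathrm{std}(\{r_1, \dots, r_G\})$ inside a PPO-style clipped objective. The sketch below computes these advantages and the clipped per-sample term. The function names, the stability constant `eps`, the default `clip_eps = 0.2`, and the use of the population standard deviation are assumptions, and the KL penalty against the reference policy is omitted for brevity.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group: (r_i - mean) / std.
    eps is an assumed constant that avoids division by zero when all rewards are equal."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO-style clipped surrogate for one output o_i, where
    ratio = pi_theta(o_i | q) / pi_theta_old(o_i | q)."""
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped * advantage)


# Example: four sampled outputs for one question, two judged correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group mean, no separate value network is needed; outputs scoring above the group average get positive advantages and those below get negative ones.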