Collected 1.5M instruction-tuning samples spanning multiple domains
Reasoning data
Created with an internal DeepSeek-R1 model.
However, the goal is to balance the high accuracy of R1 with the conciseness of normal, well-formatted reasoning data, avoiding overthinking, poor formatting, and excessive length.
To do this, expert models for specific domains (code, math, general reasoning) are trained with SFT + RL and used as data generators.
Expert training uses two SFT sample formats: <problem, original response> and <system prompt, problem, R1 response>.
The system prompts are carefully designed to allow for reflection and verification.
In the RL phase, the expert model samples at high temperature, so its responses blend patterns from both the R1-generated and the original data even without a system prompt.
After RL, rejection sampling keeps only high-quality responses as SFT data for the final model.
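The expert-model pipeline above can be sketched as: sample several candidates per problem, score them, and keep only the best one that passes a quality bar. This is a minimal illustration; `sample_responses` and `quality_score` are hypothetical stand-ins (a toy scoring rule here), not DeepSeek's actual implementation.

```python
def sample_responses(problem: str, k: int = 4) -> list[str]:
    # Stand-in for high-temperature sampling from the domain expert model.
    return [f"{problem} -> candidate {i}" for i in range(k)]

def quality_score(response: str) -> float:
    # Stand-in for quality checks (correctness, formatting, length);
    # here a toy rule that prefers even-numbered candidates.
    return 1.0 if int(response[-1]) % 2 == 0 else 0.3

def rejection_sample(problems: list[str], threshold: float = 0.5) -> dict[str, str]:
    # Rejection sampling: discard candidates below the threshold,
    # keep the best surviving candidate per problem as an SFT sample.
    kept = {}
    for p in problems:
        scored = [(quality_score(r), r) for r in sample_responses(p)]
        good = [s for s in scored if s[0] >= threshold]
        if good:
            kept[p] = max(good)[1]
    return kept
```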
Non-reasoning data
Created with DeepSeek-V2.5 and validated for accuracy by human annotators.
SFT – two epochs
Reinforcement Learning
Reward Model
Rule-based RM
math: final answer must appear in a box (\boxed{}), then checked by rules / code: a compiler runs test cases (e.g., LeetCode problems)
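A minimal sketch of the math side of the rule-based RM: extract the final \boxed{} answer and compare it to the ground truth. The exact matching rules are an assumption; this only shows the boxed-answer convention.

```python
import re

def boxed_answer_reward(response: str, ground_truth: str) -> float:
    # Rule-based reward for math: take the last \boxed{...} in the
    # response and compare it to the ground-truth answer.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if not matches:
        return 0.0  # unformatted responses receive no reward
    return 1.0 if matches[-1].strip() == ground_truth.strip() else 0.0
```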
Model-based RM
for free-form ground-truth answers
Trained from DeepSeek-V3 SFT checkpoints. The RM generates a chain of thought before emitting the reward -> helped mitigate reward hacking.
GRPO
Trained with GRPO, which estimates the baseline from group scores instead of using a critic model.
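The group-relative part can be sketched as: sample a group of responses per prompt, then normalize each response's reward by the group's mean and standard deviation to get its advantage, replacing a learned critic baseline. This is a sketch of the advantage computation only, not the full GRPO objective.

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    # Advantage of each sampled response = its reward standardized
    # within the group; the group statistics play the critic's role.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]
```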