0: if the response is not properly formatted (e.g., missing the `</think>` tag) or the answer is not correct.
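The binary reward rule above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the `\boxed{...}` answer extraction and the exact formatting check are assumptions for the example.

```python
import re

def compute_reward(response: str, gold_answer: str) -> float:
    """Binary outcome reward: 1 if formatted and correct, else 0 (a sketch)."""
    # Reward 0 if the response lacks the required </think> formatting.
    if "</think>" not in response:
        return 0.0
    # Take the text after the final </think> tag as the answer section.
    answer_section = response.rsplit("</think>", 1)[-1]
    # Reward 1 only when the extracted boxed answer matches the gold answer.
    match = re.search(r"\\boxed\{([^{}]*)\}", answer_section)
    if match and match.group(1).strip() == gold_answer.strip():
        return 1.0
    return 0.0
```

Because the reward is all-or-nothing, a truncated or malformed response earns exactly as much as a wrong one: zero.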
Iterative Context Lengthening: Think Shorter, then Longer
First, we perform RL training with an 8K max context for more effective reasoning and more efficient training.
Our intuition for this: when we evaluated DeepSeek-R1-Distill-Qwen-1.5B on AIME, incorrect responses were three times longer than correct ones and tended to be highly repetitive. Simply training the model to produce longer responses would therefore waste most of the tokens.
This improves performance, and the average response length drops from roughly 5,500 to 3,500 tokens.
Next, we scale up training to 16K and 24K contexts so that the model can solve more challenging, previously unsolved problems.
Near the end of the 8K training stage, response length begins to rise sharply again. Responses that hit the 8K context limit are truncated and receive no reward, which causes the return to drop. Since the model now benefits from thinking longer, we increase the context window to 16K and continue training.
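The staged schedule above can be sketched as follows. This is an illustrative sketch only: the step counts are placeholders, not values from the original run, and `truncate_and_score` is a hypothetical helper showing why truncation at the context limit drops the return to zero.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    max_context: int   # max response tokens allowed in this stage
    steps: int         # RL training steps (placeholder values)

# Iterative context lengthening: think shorter first, then longer.
SCHEDULE = [
    Stage(max_context=8_000, steps=1000),
    Stage(max_context=16_000, steps=500),
    Stage(max_context=24_000, steps=250),
]

def truncate_and_score(response_tokens: int, correct: bool, max_context: int) -> float:
    # A response exceeding the context limit is truncated, so it cannot
    # present a final answer and earns zero reward regardless of correctness.
    if response_tokens > max_context:
        return 0.0
    return 1.0 if correct else 0.0
```

Starting each stage from the previous checkpoint means the model first learns to use tokens efficiently at 8K before being given room to reason longer at 16K and 24K.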