0: if the response is not properly formatted (e.g., missing the `</think>` tag) or the answer is not correct.
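The binary reward rule above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the `\boxed{...}` answer extraction and the exact formatting check are assumptions for the example.

```python
import re

def compute_reward(response: str, gold_answer: str) -> float:
    """Binary outcome reward: 1 if formatted and correct, else 0 (a sketch)."""
    # Reward 0 if the response lacks the required </think> formatting.
    if "</think>" not in response:
        return 0.0
    # Take the text after the final </think> tag as the answer section.
    answer_section = response.rsplit("</think>", 1)[-1]
    # Reward 1 only when the extracted boxed answer matches the gold answer.
    match = re.search(r"\\boxed\{([^{}]*)\}", answer_section)
    if match and match.group(1).strip() == gold_answer.strip():
        return 1.0
    return 0.0
```

Because the reward is all-or-nothing, a truncated or malformed response earns exactly as much as a wrong one: zero.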
Iterative Context Lengthening: Think Shorter, then Longer
First, we perform RL training with an 8K max context for more effective reasoning and more efficient training.
Our intuition for this: when we evaluated DeepSeek-R1-Distill-Qwen-1.5B on AIME, incorrect responses were three times longer than correct ones and tended to be highly repetitive. Simply training the model to produce longer responses would therefore waste most of the tokens.
This improves performance, and the average response length drops from roughly 5,500 to 3,500 tokens.
Next, we scale up training to 16K and 24K contexts so that the model can solve more challenging, previously unsolved problems.
Near the end of the 8K training stage, response length begins to rise sharply again. Responses that hit the 8K context limit are truncated and receive no reward, which causes the return to drop. Since the model now benefits from thinking longer, we increase the context window to 16K and continue training.
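The staged schedule above can be sketched as follows. This is an illustrative sketch only: the step counts are placeholders, not values from the original run, and `truncate_and_score` is a hypothetical helper showing why truncation at the context limit drops the return to zero.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    max_context: int   # max response tokens allowed in this stage
    steps: int         # RL training steps (placeholder values)

# Iterative context lengthening: think shorter first, then longer.
SCHEDULE = [
    Stage(max_context=8_000, steps=1000),
    Stage(max_context=16_000, steps=500),
    Stage(max_context=24_000, steps=250),
]

def truncate_and_score(response_tokens: int, correct: bool, max_context: int) -> float:
    # A response exceeding the context limit is truncated, so it cannot
    # present a final answer and earns zero reward regardless of correctness.
    if response_tokens > max_context:
        return 0.0
    return 1.0 if correct else 0.0
```

Starting each stage from the previous checkpoint means the model first learns to use tokens efficiently at 8K before being given room to reason longer at 16K and 24K.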