
technical report

TL;DR

  • I read this because: it was mentioned.
  • task : math reasoning in LLMs
  • problem : replicating o1 cheaply.
  • idea : start from a Qwen model already distilled by DeepSeek, run RL on this small model, and gradually increase the context length
  • input/output : question, answer
  • architecture : DeepSeek-R1-Distill-Qwen-1.5B
  • objective : GRPO
  • baseline : DeepSeek-R1-Distill-Qwen-1.5B, rStar-Math-7B, o1-preview, …
  • data : AIME, AMC, Omni-Math, Still
  • evaluation : AIME2024, AMC2023, MATH-500, Minerva Math, OlympiadBench
  • result : SOTA on all benchmarks except Minerva Math; on AIME it even beats o1-preview.
  • trick : gradually increase the context length during training.
  • etc. :
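The objective above is GRPO, whose core idea is to score a group of sampled responses per question and normalize rewards within the group instead of using a learned value model. A minimal sketch of that group-relative advantage (illustrative only, not the full clipped objective):

```python
# Sketch of GRPO's group-relative advantage: sample G responses for one
# question, score each with the binary reward, normalize within the group.
from statistics import mean, pstdev

def group_advantages(rewards):
    """Advantage of each sampled response relative to its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]  # all rewards equal: no learning signal
    return [(r - mu) / sigma for r in rewards]

# e.g. 4 rollouts for one question, only the first one correct:
# the correct rollout gets a positive advantage, the rest negative.
print(group_advantages([1.0, 0.0, 0.0, 0.0]))
```

Note the degenerate case: if every rollout in a group gets the same reward (all correct or all wrong), the group provides no gradient signal.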

Details

  • thumbnail (image)

dataset curation

  • source: AIME, AMC, Omni-Math, Still
  • Extracting the answer from the solution with gemini-1.5-pro-002
  • Use sentence-transformers/all-MiniLM-L6-v2 to remove duplicate questions
  • Problems whose answers can’t be checked with sympy are filtered out: verifying them would require an LLM judge, which slows down training and gives a noisy reward signal.
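The deduplication step above can be sketched as a greedy similarity filter. The embeddings would come from sentence-transformers/all-MiniLM-L6-v2; here they are passed in as plain vectors, and the 0.95 threshold is an assumption (the report does not state one):

```python
# Greedy embedding-based dedup: keep a question only if it is not too
# similar to any already-kept question. Threshold is an assumed value.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def dedup(questions, embeddings, threshold=0.95):
    kept, kept_embs = [], []
    for q, e in zip(questions, embeddings):
        # drop q if any kept question is closer than the threshold
        if all(cosine(e, k) < threshold for k in kept_embs):
            kept.append(q)
            kept_embs.append(e)
    return kept
```

In practice one would batch-encode questions with the MiniLM model and use a vectorized similarity search; the greedy loop is just the simplest correct version.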

reward

  • 1: the LaTeX/sympy answer check passes
  • 0: the output is missing the required format (e.g., the </think> tag), or the answer is incorrect.
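The binary reward above can be sketched as follows. The real pipeline checks LaTeX equivalence with sympy; as an assumption for brevity, a simple numeric/string comparison stands in for that check here, and the `\boxed{}` answer format is also an assumption:

```python
# Sketch of the binary format + correctness reward (stand-in equivalence
# check; the actual system uses a sympy-based LaTeX comparison).
import re

def reward(response: str, gold: str) -> float:
    # format check: reasoning must be closed with </think>
    if "</think>" not in response:
        return 0.0
    # extract the final boxed answer (assumed answer format)
    m = re.search(r"\\boxed\{([^}]*)\}", response)
    if not m:
        return 0.0
    try:
        return 1.0 if float(m.group(1)) == float(gold) else 0.0
    except ValueError:
        return 1.0 if m.group(1).strip() == gold.strip() else 0.0

print(reward("<think>...</think> \\boxed{42}", "42"))  # 1.0
```

All-or-nothing rewards like this are deliberately sparse: partial credit would invite reward hacking on formatting alone.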

Iterative Context Lengthening: Think Shorter, then Longer

  • First, we perform RL training with 8K max context for more effective reasoning and efficient training.
  • The intuition: when solving AIME with DeepSeek-R1-Distill-Qwen-1.5B, incorrect responses were about three times longer than correct ones and showed repetitive patterns, so simply encouraging longer responses would waste most of the tokens.
  • This improves performance, and the average response length drops from ~5,500 to ~3,500 tokens.
  • Next, we scale up training to 16K and 24K contexts so that the model can solve more challenging, previously unsolved problems.
  • During the 8K training there is a point where response length suddenly increases; responses then hit the context limit, get truncated, and the reward drops.
  • Since the model now benefits from thinking longer, the context window is increased to 16K and training continues.
  • After ~500 steps at 16K, the context is increased to 24K and training continues.
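The staged schedule above can be sketched as a simple loop over context stages. Only the ~500 steps at 16K are stated in the notes; the step counts for the 8K and 24K stages are placeholders:

```python
# Sketch of the iterative context-lengthening schedule: train for a number
# of steps at each max context, then move to the next, longer stage.
STAGES = [
    {"max_context": 8_000,  "steps": 1_000},  # placeholder step count
    {"max_context": 16_000, "steps": 500},    # per the notes above
    {"max_context": 24_000, "steps": 1_000},  # placeholder step count
]

def run_schedule(train_step, stages=STAGES):
    """Call train_step(max_context) once per step, stage by stage."""
    history = []
    for stage in stages:
        for _ in range(stage["steps"]):
            train_step(stage["max_context"])
            history.append(stage["max_context"])
    return history
```

The design choice here is curriculum-style: the model first learns to reason concisely within 8K, and the longer stages are only introduced once short-context reasoning has stabilized.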

evaluation

(evaluation result tables shown as images)