[204] DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL

TL;DR

source: AIME, AMC, Omni-Math, Still
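Answers are extracted from the worked solutions with gemini-1.5-pro-002.
Duplicate questions are removed using sentence-transformers/all-MiniLM-L6-v2 embeddings.
Problems whose answers cannot be graded with sympy are filtered out. Answers that would need an LLM judge slow down training and can give a noisy reward signal.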
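
A minimal sketch of the duplicate-removal step, assuming each question already has an embedding vector (in practice these would come from sentence-transformers/all-MiniLM-L6-v2). The function names, the greedy keep-first strategy, and the 0.9 cosine threshold are illustrative assumptions, not values from the source.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def dedup(questions, embeddings, threshold=0.9):
    """Greedy dedup: keep a question only if it is not too similar
    to any question that was already kept.

    threshold is a hypothetical value chosen for illustration.
    """
    kept = []  # indices of questions kept so far
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return [questions[i] for i in kept]

# Toy data: the first two vectors point in nearly the same direction.
questions = ["Find x if 2x = 4.", "Solve 2x = 4 for x.", "Compute 7!."]
embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
unique = dedup(questions, embeddings)
```

With the toy vectors, the second question is dropped as a near-duplicate of the first. The pairwise loop is quadratic, so a real dataset would likely need batched matrix operations or an approximate nearest-neighbor index.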

First, we perform RL training with 8K max context for more effective reasoning and efficient training.
- Intuition: when DeepSeek-R1-Distill-Qwen-1.5B was evaluated on AIME, incorrect responses were about 3x longer than correct ones. Training with a long context therefore wastes most of the tokens, and the longer responses showed repetitive patterns.
- Performance improves during this stage, and the response length drops from 5,500 to 3,500 tokens.
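
The length comparison behind this intuition can be reproduced with a few lines. This is a sketch under assumptions: `length_ratio` is a hypothetical helper, and the sample token counts are made up for illustration, not measurements from the source.

```python
from statistics import mean

def length_ratio(samples):
    """Ratio of mean incorrect-response length to mean correct-response length.

    samples: list of (num_tokens, is_correct) pairs.
    """
    correct = [n for n, ok in samples if ok]
    incorrect = [n for n, ok in samples if not ok]
    if not correct or not incorrect:
        raise ValueError("need at least one correct and one incorrect sample")
    return mean(incorrect) / mean(correct)

# Made-up token counts, for illustration only.
samples = [(2000, True), (4000, True), (9000, False)]
ratio = length_ratio(samples)
```

A ratio well above 1 suggests that long generations are mostly failed attempts, which is the argument for starting with a short context.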
Next, we scale up training to 16K and 24K contexts so that the model can solve more challenging, previously unsolved problems.
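- Partway through 8K training, the response length suddenly starts to increase. Responses then hit the context limit and get truncated, which lowers the return.
- Since the model now tends to think longer, the context window is extended to 16K for further training.
- After 500 steps at 16K, the context is extended to 24K and training continues.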
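
The staged schedule above can be sketched as a step-to-limit lookup, together with a simple truncation monitor. This is an illustration under assumptions: the notes do not say at which step the switch from 8K to 16K happens, so `FIRST_SWITCH` is a hypothetical placeholder, and treating 8K/16K/24K as 8192/16384/24576 tokens is also an assumption.

```python
def max_context(step, schedule):
    """Return the max context length (in tokens) active at a training step.

    schedule: list of (start_step, max_tokens), sorted by start_step.
    """
    limit = schedule[0][1]
    for start, tokens in schedule:
        if step >= start:
            limit = tokens
    return limit

def clip_ratio(lengths, limit):
    """Fraction of responses that reached the context limit (were truncated)."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n >= limit) / len(lengths)

FIRST_SWITCH = 1000  # hypothetical: the 8K -> 16K step is not given in the notes
SCHEDULE = [
    (0, 8192),                    # stage 1: 8K context
    (FIRST_SWITCH, 16384),        # stage 2: 16K context
    (FIRST_SWITCH + 500, 24576),  # stage 3: 24K, 500 steps after stage 2 starts
]

limit_now = max_context(1200, SCHEDULE)
truncated = clip_ratio([100, 8192, 9000, 50], 8192)
```

A rising `clip_ratio` during the 8K stage is the kind of signal that would motivate moving to the next stage, since truncated responses cannot earn the reward for a correct final answer.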