TL;DR
- I read this because : it mentions
- task : reasoning in LLM
- problem : How can we make test time scaling simple?
- idea : Filter the data well. At inference time, append "Wait" if the reasoning is shorter than the desired length, and force the end-of-thinking token if it is too long (budget forcing)
- architecture : Qwen2.5-32B-Instruct
- objective : ce loss (SFT only)
- baseline : OpenAI o1 series, DeepSeek-R1 series, QwQ-32B-Preview, Sky-T1-32B-Preview, Bespoke-32B, Google Gemini 2.0 Flash Thinking Experimental
- data : s1K (proposed) – NuminaMATH, AIME, OlympicArena, OmniMath, AGIEval + additionally crawled from the [Stanford Statistics Department PhD Qualifying Exam](https://statistics.stanford.edu/) and [PuzzledQuant](https://www.puzzledquant.com/) homepages.
- evaluation : AIME24, MATH500, GPQA diamond
- result : Strong performance relative to the number of training samples; best when the quality, difficulty, and diversity criteria are all used together.
- contribution : 1) confirmed that SFT alone is enough for test-time scaling 2) ablations on data filtering
- etc. :
Details
- thumbnail
reasoning data curation to create s1K
- initial collection of 59K questions
- NuminaMATH, AIME, OlympicArena, OmniMath, AGIEval + additionally crawled from the [Stanford Statistics PhD Qualifying Exam](https://statistics.stanford.edu/) and [PuzzledQuant](https://www.puzzledquant.com/) homepages
- deduplicate using 8-gram overlap
- final selection of 1K sample
- quality : drop samples with API errors or formatting issues (e.g. ASCII art diagrams, non-existent image references, inconsistent question numbering) –> 51K remaining
- difficulty : have Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct attempt each question, grade correctness with Claude 3.5 Sonnet, and drop questions either model solves; reasoning-trace length (Qwen2.5 tokenizer) serves as a difficulty proxy, treating longer as harder –> 25K left
- diversity : Claude 3.5 Sonnet classifies math and science questions (biology, physics, economics, …) into categories (geometry, dynamical systems, …) –> 24K left
- additionally, following the difficulty philosophy, sample one problem per domain at a time, favoring longer reasoning traces
- that leaves 50 domains
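The dedup and final-selection steps above can be sketched roughly as follows. This is a minimal illustration, not the paper's code: `dedup_8gram` keeps a question only if it shares no 8-gram with an earlier kept one, and `select_s1k` samples a domain uniformly, then a question within it weighted toward longer reasoning traces (the paper's exact length weighting may differ).

```python
import random
from collections import defaultdict

def ngrams(text, n=8):
    # Set of contiguous 8-grams over whitespace tokens.
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def dedup_8gram(questions):
    # Drop a question if it shares any 8-gram with an earlier kept one.
    seen, kept = set(), []
    for q in questions:
        grams = ngrams(q)
        if grams & seen:
            continue
        seen |= grams
        kept.append(q)
    return kept

def select_s1k(pool, k=1000):
    # pool: list of (domain, question, trace_len) tuples (hypothetical schema).
    # Pick a domain uniformly at random, then within it sample a question
    # with probability proportional to its reasoning-trace length.
    by_domain = defaultdict(list)
    for item in pool:
        by_domain[item[0]].append(item)
    selected = []
    while len(selected) < k and by_domain:
        d = random.choice(list(by_domain))
        items = by_domain[d]
        pick = random.choices(items, weights=[t for _, _, t in items])[0]
        items.remove(pick)
        if not items:
            del by_domain[d]
        selected.append(pick)
    return selected
```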
proposed budget forcing
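A minimal sketch of the budget-forcing decoding loop, using a hypothetical `generate_step` stub in place of the fine-tuned Qwen2.5-32B-Instruct and a word count as a token-count proxy; the real method counts tokenizer tokens and suppresses or forces the end-of-thinking delimiter during decoding.

```python
END_OF_THINKING = "<eot>"  # placeholder for the model's end-of-thinking delimiter

def generate_step(prompt):
    # Hypothetical stand-in for one decoding segment of a reasoning model.
    return " step" * 8

def budget_forcing(question, min_tokens, max_tokens, max_waits=2):
    trace, waits = "", 0
    while True:
        trace += generate_step(question + trace)
        n = len(trace.split())  # word count as a token-count proxy
        if n >= max_tokens:
            # Over budget: truncate and force the end-of-thinking delimiter
            # so the model moves on to its final answer.
            trace = " ".join(trace.split()[:max_tokens]) + END_OF_THINKING
            break
        if n >= min_tokens or waits >= max_waits:
            trace += END_OF_THINKING
            break
        # Under budget: append "Wait" to suppress termination and
        # nudge the model to keep reasoning.
        trace += " Wait"
        waits += 1
    return trace
```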
Result
- overall
With budget forcing, performance beats the no-BF baseline and is roughly on par with QwQ-32B-Preview overall. AIME24 is relatively weak, while MATH500 is near o1-level. On GPQA Diamond and AIME24 the picture is mixed: better than Sky-T1-32B-Preview but weaker than Bespoke-32B, which is strong on MATH as well. Overall, the method is sample-efficient, and that is the main contribution.
budget forcing
filtering ablation
w/ parallel scaling