TL;DR
- I read this because : it mentions
- task : reasoning in LLM
- problem : How can we make test time scaling simple?
- idea : Filter the data well. At inference time, append "Wait" if the reasoning is shorter than the desired length, and force the end-of-thinking token if it is too long (budget forcing)
- architecture : Qwen2.5-32B-Instruct
- objective : ce loss (SFT only)
- baseline : OpenAI o1 series, DeepSeek-R1 series, QwQ-32B-Preview, Sky-T1-32B-Preview, Bespoke-32B, Google Gemini 2.0 Flash Thinking Experimental
- data : s1K (proposed) – NuminaMATH, AIME, OlympicArena, OmniMath, AGIEval + additionally crawled from the [Stanford Statistics Department PhD Qualifying Exam](https://statistics.stanford.edu/) and [PuzzledQuant](https://www.puzzledquant.com/) homepages.
- evaluation : AIME24, MATH500, GPQA diamond
- result : Strong performance relative to the number of training samples; best when the quality, difficulty, and diversity criteria are all used together.
- contribution : 1) confirmed that SFT alone is enough for test-time scaling 2) ablations on data filtering
- etc. :
Details
- thumbnail
reasoning data curation to create s1K
- initial collection of 59K questions
- NuminaMATH, AIME, OlympicArena, OmniMath, AGIEval + additionally crawled from the [Stanford Statistics PhD Qualifying Exam](https://statistics.stanford.edu/) and [PuzzledQuant](https://www.puzzledquant.com/) homepages
- deduplicate using 8-gram overlap
- final selection of 1K sample
- quality : drop samples with API errors or formatting issues (e.g. ASCII art diagrams, non-existent image references, inconsistent question numbering) –> 51K remaining
- difficulty : have Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct attempt each question, grade correctness with Claude 3.5 Sonnet, and drop questions either model solves; reasoning-trace length (Qwen2.5 tokenizer) serves as a difficulty proxy, treating longer as harder –> 25K left
- diversity : Claude 3.5 Sonnet classifies math and science questions (biology, physics, economics, …) into categories (geometry, dynamical systems, …) –> 24K left
- additionally, following the difficulty philosophy, sample one problem per domain at a time, favoring longer reasoning traces
- that leaves 50 domains
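The dedup and final-selection steps above can be sketched roughly as follows. This is a minimal illustration, not the paper's code: `dedup_8gram` keeps a question only if it shares no 8-gram with an earlier kept one, and `select_s1k` samples a domain uniformly, then a question within it weighted toward longer reasoning traces (the paper's exact length weighting may differ).

```python
import random
from collections import defaultdict

def ngrams(text, n=8):
    # Set of contiguous 8-grams over whitespace tokens.
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def dedup_8gram(questions):
    # Drop a question if it shares any 8-gram with an earlier kept one.
    seen, kept = set(), []
    for q in questions:
        grams = ngrams(q)
        if grams & seen:
            continue
        seen |= grams
        kept.append(q)
    return kept

def select_s1k(pool, k=1000):
    # pool: list of (domain, question, trace_len) tuples (hypothetical schema).
    # Pick a domain uniformly at random, then within it sample a question
    # with probability proportional to its reasoning-trace length.
    by_domain = defaultdict(list)
    for item in pool:
        by_domain[item[0]].append(item)
    selected = []
    while len(selected) < k and by_domain:
        d = random.choice(list(by_domain))
        items = by_domain[d]
        pick = random.choices(items, weights=[t for _, _, t in items])[0]
        items.remove(pick)
        if not items:
            del by_domain[d]
        selected.append(pick)
    return selected
```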
proposed budget forcing
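A minimal sketch of the budget-forcing decoding loop, using a hypothetical `generate_step` stub in place of the fine-tuned Qwen2.5-32B-Instruct and a word count as a token-count proxy; the real method counts tokenizer tokens and suppresses or forces the end-of-thinking delimiter during decoding.

```python
END_OF_THINKING = "<eot>"  # placeholder for the model's end-of-thinking delimiter

def generate_step(prompt):
    # Hypothetical stand-in for one decoding segment of a reasoning model.
    return " step" * 8

def budget_forcing(question, min_tokens, max_tokens, max_waits=2):
    trace, waits = "", 0
    while True:
        trace += generate_step(question + trace)
        n = len(trace.split())  # word count as a token-count proxy
        if n >= max_tokens:
            # Over budget: truncate and force the end-of-thinking delimiter
            # so the model moves on to its final answer.
            trace = " ".join(trace.split()[:max_tokens]) + END_OF_THINKING
            break
        if n >= min_tokens or waits >= max_waits:
            trace += END_OF_THINKING
            break
        # Under budget: append "Wait" to suppress termination and
        # nudge the model to keep reasoning.
        trace += " Wait"
        waits += 1
    return trace
```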
Result
- overall
With budget forcing, performance beats the no-BF baseline and is roughly on par with QwQ-32B-Preview overall. AIME24 is relatively weak, while MATH500 is near o1-level. On GPQA Diamond and AIME24 the picture is mixed: better than Sky-T1-32B-Preview but weaker than Bespoke-32B, which is strong on MATH as well. Overall, the method is sample-efficient, and that is the main contribution.
budget forcing
filtering ablation
w/ parallel scaling