
paper, code

TL;DR

  • I read this because : it mentions …
  • task : reasoning in LLMs
  • problem : How can we make test-time scaling simple?
  • idea : Filter the data well. At inference time, append “Wait” if the reasoning is shorter than the desired length, and force the end-of-thinking token if it is too long (budget forcing)
  • architecture : Qwen2.5-32B-Instruct
  • objective : ce loss (SFT only)
  • baseline : OpenAI o1 series, DeepSeek r1 series, QwQ-32B-preview, Sky-T1-32B-Preview, Bespoke-32B, Google Gemini 2.0 Flash Thinking Experimental
  • data : s1K (proposed) – NuminaMATH, AIME, OlympicArena, OmniMath, AGIEval + additionally crawled from the [Stanford Statistics Department PhD Qualifying Exam](https://statistics.stanford.edu/) and [PuzzledQuant](https://www.puzzledquant.com/) homepages
  • evaluation : AIME24, MATH500, GPQA diamond
  • result : Strong performance relative to the number of training samples; best when the quality, difficulty, and diversity criteria are all applied together
  • contribution : 1) Showed that SFT alone yields test-time scaling 2) Ablations on data filtering
  • etc. :

Details

  • thumbnail (figure)

reasoning data curation to create s1K

  • initial collection of 59K samples
  • NuminaMATH, AIME, OlympicArena, OmniMath, AGIEval + additionally crawled from the [Stanford Statistics PhD Qualifying Exam](https://statistics.stanford.edu/) and [PuzzledQuant](https://www.puzzledquant.com/) homepages
  • deduplicated with 8-gram overlap
    • final selection of 1K samples
  • quality : removed samples with API errors and formatting issues (e.g. ASCII art diagrams, non-existent image references, inconsistent question numbering) → 51K remaining
  • difficulty : problems attempted with Qwen2.5-7B/32B-Instruct and graded by Claude 3.5 Sonnet; reasoning-trace length under the Qwen2.5 tokenizer used as a difficulty proxy, assuming longer is harder → 25K remaining
  • diversity : Claude 3.5 Sonnet classifies questions in math and science (biology, physics, economics) into domains (geometry, dynamical systems, …) → 24K remaining
  • Additionally, following the difficulty philosophy, one problem is picked per domain, favoring longer reasoning traces
  • That leaves 50 domains (figure)
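The curation steps above can be sketched in code. Here is a minimal, self-contained version of the 8-gram deduplication step; the whitespace tokenization and the `dedup_by_ngram_overlap` helper are illustrative assumptions, not the paper's actual implementation.

```python
def ngrams(tokens, n=8):
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def dedup_by_ngram_overlap(questions, n=8):
    """Keep a question only if it shares no n-gram with any kept question.

    Illustrative sketch: tokenization is a plain lowercase whitespace split.
    """
    seen = set()
    kept = []
    for q in questions:
        grams = ngrams(q.lower().split(), n)
        if grams & seen:
            continue  # overlaps an earlier question -> drop as a near-duplicate
        seen |= grams
        kept.append(q)
    return kept
```

Questions shorter than 8 tokens produce no 8-grams and are always kept under this sketch, which is one reason a real pipeline would pair this with the quality filter.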

proposed budget forcing

(figure)
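A minimal sketch of the budget-forcing decode loop: if the model's reasoning trace is still short of the desired length, suppress the end-of-thinking delimiter and append "Wait" to force more reasoning; if it exceeds the budget, append the delimiter to force the final answer. The `generate` callable, the delimiter string, and counting tokens by whitespace split are all placeholder assumptions, not the paper's actual implementation.

```python
END_OF_THINKING = "<|end_of_thinking|>"  # assumed delimiter, model-dependent

def budget_forced_decode(generate, prompt, min_tokens, max_tokens):
    """Steer the length of the reasoning trace at test time (sketch).

    `generate(prompt, stop, max_new_tokens)` is a stand-in for any LM
    sampling call that stops at the end-of-thinking delimiter.
    """
    trace = ""
    while True:
        trace += generate(prompt + trace, stop=END_OF_THINKING,
                          max_new_tokens=max_tokens - len(trace.split()))
        if len(trace.split()) >= max_tokens:
            trace += END_OF_THINKING  # over budget: cut thinking short
            break
        if len(trace.split()) >= min_tokens:
            trace += END_OF_THINKING  # long enough: let it answer
            break
        trace += " Wait"  # too short: suppress EOT and keep reasoning
    return trace
```

Appending "Wait" after a suppressed stop is what makes the model second-guess and extend its own reasoning, which is the lever behind the test-time-scaling curves.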

Result

  • overall (figure)

With budget forcing, performance beats the no-BF setting and is roughly comparable to QwQ-32B overall. AIME is relatively weak, while MATH500 is nearly at o1-level. On GPQA Diamond and AIME the picture is mixed: better than Sky-T1-32B-Preview but weaker than Bespoke-32B. Overall the model is very sample-efficient, which is the main contribution.

  • budget forcing (figure)

  • filtering ablation (figure)

  • w/ parallel scaling (figure)
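For the parallel-scaling comparison, one common aggregation is majority voting over independent samples; this is a generic sketch, not necessarily the exact scheme used in the paper.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among parallel samples."""
    return Counter(answers).most_common(1)[0][0]
```

Sequential scaling via budget forcing grows one trace; parallel scaling instead spends the budget on independent traces and reconciles them, e.g. `majority_vote(["42", "41", "42"])`.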
