Image

paper, code

TL;DR

  • I read this because.. : it was mentioned
  • task : reasoning in LLM
  • problem : how can test-time scaling be done simply?
  • idea : filter the data well. At inference, if the output stops before the desired length, append "Wait"; if it runs too long, force an eot token (Budget Forcing)
  • architecture : Qwen2.5-32B-Instruct
  • objective : ce loss (SFT only)
  • baseline : OpenAI o1 series, DeepSeek r1 series, QwQ-32B-preview, Sky-T1-32B-Preview, Bespoke-32B, Google Gemini 2.0 Flash Thinking Experimental
  • data : s1K(proposed) – NuminaMATH, AIME, OlympicArena, OmniMath, AGIEval + additionally crawled from Stanford statistics department PhD qualifying exams and a website called PuzzledQuant
  • evaluation : AIME24, MATH500, GPQA diamond
  • result : strong performance relative to the number of training samples. The quality, difficulty, and diversity criteria all have to be used together for good performance. The proposed …
  • contribution : 1) confirms that SFT alone is enough to get test-time scaling 2) ablations on the filtering criteria
  • etc. :

Details

  • thumbnail Image
Image

reasoning data curation to create s1K

  • initial collection of 59K samples
  • final selection of 1K samples
    • quality: drop samples with API errors or formatting issues (e.g. ASCII art diagrams, non-existent image references, inconsistent question numbering) –> 51K remain
    • difficulty: have Qwen2.5-7B/32B-Instruct attempt each problem and grade the answers with Claude 3.5 Sonnet. Longer reasoning traces (measured with the Qwen 2.5 tokenizer) are assumed to be harder, and the data is filtered on that basis. –> 25K remain
    • diversity : use Claude 3.5 Sonnet to classify problems into math and science (biology, physics, economics) domains (geometry, dynamic system, … ) –> 24K remain
      • additionally, following the difficulty heuristic, one problem is picked per domain, favoring longer reasoning traces
    • in the end, 50 domains remain
      • Image
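
The three filtering stages above can be sketched roughly as follows. This is a toy illustration, not the authors' pipeline: the `Sample` fields, the function names, the rule "drop a problem if either Qwen model already solves it", and the choice to weight by raw token count are all assumptions made for the example.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    # All fields are hypothetical stand-ins for metadata gathered upstream.
    question: str
    domain: str             # domain label, e.g. from an LLM classifier
    trace_tokens: int       # reasoning-trace length in tokens
    has_format_issue: bool  # API error, ASCII art, broken image reference, ...
    solved_by_7b: bool      # graded correct for the smaller model
    solved_by_32b: bool     # graded correct for the larger model


def quality_filter(samples: List[Sample]) -> List[Sample]:
    """Keep only samples without API or formatting problems."""
    return [s for s in samples if not s.has_format_issue]


def difficulty_filter(samples: List[Sample]) -> List[Sample]:
    """Assumed rule: a problem is too easy if either model solves it."""
    return [s for s in samples if not (s.solved_by_7b or s.solved_by_32b)]


def diversity_sample(samples: List[Sample], k: int, seed: int = 0) -> List[Sample]:
    """Pick a random domain, then one problem from it, favoring long traces."""
    rng = random.Random(seed)
    by_domain: Dict[str, List[Sample]] = {}
    for s in samples:
        by_domain.setdefault(s.domain, []).append(s)

    selected: List[Sample] = []
    while len(selected) < k and by_domain:
        domain = rng.choice(sorted(by_domain))
        pool = by_domain[domain]
        # Weight by trace length (assumption); clamp to 1 to avoid zero weights.
        weights = [max(s.trace_tokens, 1) for s in pool]
        pick = rng.choices(pool, weights=weights, k=1)[0]
        pool.remove(pick)
        selected.append(pick)
        if not pool:
            del by_domain[domain]
    return selected
```

Chaining the stages as `diversity_sample(difficulty_filter(quality_filter(samples)), k=1000)` mirrors the 59K → 1K funnel described in the notes, with each stage shrinking the pool before the next one runs.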

proposed budget forcing

Image
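
A minimal sketch of budget forcing as summarized in the TL;DR: if the model tries to stop thinking before a minimum length, the stop is suppressed and "Wait" is appended; once a maximum length is reached, an eot token is forced. The `next_token` callable, the `"<eot>"` string, and the `min_tokens` / `max_tokens` parameters are hypothetical names for this example, not the authors' code.

```python
from typing import Callable, List

EOT = "<eot>"   # hypothetical placeholder for the eot delimiter
WAIT = "Wait"   # string appended to make the model keep thinking


def budget_forced_thinking(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    min_tokens: int,
    max_tokens: int,
) -> List[str]:
    """Generate thinking tokens, steering their count toward [min_tokens, max_tokens]."""
    thinking: List[str] = []
    while len(thinking) < max_tokens:
        token = next_token(prompt + thinking)
        if token == EOT:
            if len(thinking) >= min_tokens:
                break  # long enough: accept the model's own stop
            # Stopped too early: drop the eot and append "Wait" instead.
            thinking.append(WAIT)
            continue
        thinking.append(token)
    # Either the model stopped by itself or the budget ran out: close with eot.
    thinking.append(EOT)
    return thinking
```

In a real setup `next_token` would wrap a decoding step of the language model; here it is left abstract so the control flow (suppress early stop, force late stop) is the only thing on display.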

Result

  • overall
Image

Performance improves over the w/o BF setting and looks roughly on par with QwQ-32B overall. AIME is relatively weak, while MATH500 is almost at o1 level. GPQA Diamond and AIME look middling, but they are better than Sky-T1, and MATH is weaker than Bespoke. Overall, the contribution is that the approach is sample efficient.

  • budget forcing Image

  • filtering ablation Image

  • w/ parallel scaling Image

Image
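
For contrast with budget forcing (which scales a single reasoning trace sequentially), one common form of parallel scaling is to sample several independent answers and take a majority vote. The sketch below assumes that setup; whether it matches the exact configuration in the figure above is an assumption, and `majority_vote` is a hypothetical helper name.

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent final answer among independent samples."""
    if not answers:
        raise ValueError("need at least one answer")
    # Normalize whitespace so " 42" and "42" count as the same answer.
    counts = Counter(a.strip() for a in answers)
    return counts.most_common(1)[0][0]
```

Here each element of `answers` would be the final answer extracted from one sampled generation; more samples means more test-time compute spent in parallel rather than on a longer single trace.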