
paper, code

TL;DR

  • I read this because : is it the content of a long CoT that matters, or its structure? Recommended to me; aka SkyThought
  • task : reasoning in LLM
  • problem : an ablation study on how to train long CoT
  • idea : run ablation experiments on the CoT itself
  • input/output : Q -> {reasoning(long CoT), A}
  • architecture : Qwen2.5-32B-Instruct
  • objective : CE loss (a minimal sketch follows this list)
  • baseline : Qwen2.5-32B-Instruct, QwQ
  • data : proposed 17K samples (prompts from {AMC/AIME, MATH, Olympiad subset of NuminaMath, APPS, TACO} + distilled from {DeepSeek-R1, QwQ-32B-Preview} + the open R1 17K reasoning dataset)
  • evaluation : MATH-500, OlympiadBench, AIME-2024, AMC23, LiveCodeBench
  • result : the structure of a long CoT matters more than the correctness of its content.
  • contribution : ablations
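
A minimal sketch of what this objective looks like in practice (my own illustration, not the authors' code): cross-entropy over the long CoT plus the final answer, with the prompt tokens masked out.

```python
# Minimal SFT-loss sketch for Q -> {long CoT, A}, assuming a HuggingFace causal LM.
# The prompt template below is a hypothetical formatting choice, not the paper's.
def sft_loss(model, tokenizer, question, long_cot, answer):
    prompt = f"Question: {question}\n"
    target = f"{long_cot}\nFinal answer: {answer}"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + target, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100   # ignore loss on the prompt tokens
    return model(input_ids=full_ids, labels=labels).loss  # CE with labels shifted internally
```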

Details

thumbnail

(paper thumbnail figure)

contributions

  • Shows that reasoning ability can be elicited even by LoRA tuning on as few as 17K samples
  • The structure of a long CoT is what matters; the correctness of each individual reasoning step is not critical
  • Runs a variety of ablations over model size, architecture, dataset size, and the data-generating model

Simple distillation is effective

  • distillation data curation -> 12K math / 5K coding (a sketch of the flow follows this list)
    • prompts : math – {AMC/AIME, MATH, Olympiad, NuminaMath} + code – {APPS, TACO}
    • distill models : {DeepSeek-R1, QwQ-32B-Preview}
    • GPT-4o-mini is used to classify prompt difficulty and to validate responses against the ground-truth solutions
    • +) the open R1 17K reasoning dataset (https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k)
  • training details
    • (code) llama-factory
    • (base model) Qwen2.5-32B-Instruct
    • lr 1e-5 (full fine-tuning) / lr 1e-4 (LoRA)
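
A rough sketch of what this curation flow could look like (the helper names are my own, purely illustrative):

```python
# Illustrative data-curation sketch: filter prompts by difficulty, distill a long-CoT
# response from a teacher model, and keep it only if the final answer is correct.
# `classify_difficulty`, `teacher_generate`, and `matches_ground_truth` are hypothetical helpers.
def curate(problems, classify_difficulty, teacher_generate, matches_ground_truth):
    curated = []
    for p in problems:
        if classify_difficulty(p["question"]) == "easy":     # e.g. GPT-4o-mini as difficulty filter
            continue
        response = teacher_generate(p["question"])           # long CoT + answer from R1 / QwQ-32B-Preview
        if matches_ground_truth(response, p["solution"]):    # drop samples with a wrong final answer
            curated.append({"question": p["question"], "response": response})
    return curated
```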

Result

  • a small amount of data is enough

16๋งŒ์œผ๋กœ๋„ ์ถฉ๋ถ„ํ•œ ์„ฑ๋Šฅ.

  • LoRA fine-tuning works without performance degradation (a minimal sketch follows)
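
As a concrete example of how such a LoRA setup could look with the peft library (the rank, alpha, and target modules below are my own illustrative choices, not the paper's reported settings):

```python
# Minimal LoRA fine-tuning setup sketch with transformers + peft.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-32B-Instruct")
lora_cfg = LoraConfig(
    r=64,                                                     # illustrative rank
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)   # only the adapter weights stay trainable
model.print_trainable_parameters()
# ...then run standard SFT on the 17K long-CoT samples (e.g. via llama-factory, LoRA lr 1e-4).
```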

long CoT: structure is key

  • ablation on which matters more in a CoT: local content or global structure
  • local content
    • the final answer, digits inside math derivations, reasoning keywords
  • global structure
    • reflection, self-validation, backtracking
  • setting: ablations are run on 4,618 correct responses generated with QwQ-32B-Preview

local content

  • wrong-answer samples
    • performance drops by only ~3.2%p
  • digit-corrupted samples (see the sketch below)
    • some of the intermediate digits are randomly corrupted
    • even corrupting ~70% of the digits drops performance by only ~4.3%
    • corrupting all of them does hurt performance
  • reasoning keyword removal
    • removing all keywords such as "wait", "let me think again", "but" drops performance by only ~3.3%
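
A rough sketch of these local-content perturbations (my illustration of the idea, not the paper's code):

```python
# Illustrative local-content perturbations on a long-CoT string.
import random
import re

def corrupt_digits(cot: str, ratio: float = 0.7) -> str:
    """Randomly replace a `ratio` fraction of the digits in the trace with random digits."""
    chars = list(cot)
    digit_positions = [i for i, c in enumerate(chars) if c.isdigit()]
    for i in random.sample(digit_positions, int(len(digit_positions) * ratio)):
        chars[i] = str(random.randint(0, 9))
    return "".join(chars)

def remove_reasoning_keywords(cot: str) -> str:
    """Strip keywords such as 'wait', 'let me think again', 'but' from the trace."""
    return re.sub(r"\b(wait|let me think again|but)\b", "", cot, flags=re.IGNORECASE)
```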

global structure

  • Llama-3.3-70B-Instruct is used to split the reasoning trace into multiple steps
  • then insert, delete, and shuffle operations are applied to the steps at a given ratio (sketched below)

degradation์ด ์—„์ฒญ ์‹ฌํ•จ. (์ž์„ธํžˆ ์•ˆ์ฝ์Œ)

more ablations

  • does long-CoT training cause performance degradation on non-reasoning tasks?

It does not; performance actually improves.

  • ablation over the student model

Performance improved for everything except Qwen2.5-32B-Instruct; not sure why that one doesn't work.

  • BoN๊ณผ์˜ ๋น„๊ต
Image
  • comparison to short-CoT fine-tuning

short cot ์„ฑ๋Šฅ์ด ์ข‹์ง€ ์•Š์•˜๋‹ค