paper, code

TL;DR

  • why I read this : is it the content or the structure of the CoT that matters? Recommended by SkyThought
  • task : reasoning in LLM
  • problem : how a model learns a long CoT, studied via ablations
  • idea : ablation experiments on long CoTs
  • input/output : Q -> {reasoning(long CoT), A}
  • architecture : Qwen2.5-32B-Instruct
  • objective : ce loss
  • baseline : Qwen2.5-32B-Instruct, QwQ
  • data : proposed 17K samples (prompts from {AMC/AIME, MATH, Olympiad subset of NuminaMATH, APPS, TACO}, long CoTs distilled from {DeepSeek-R1, QwQ-32B-Preview}) + R1-17K reasoning traces
  • evaluation : MATH-500, OlympiadBench, AIME-2024, AMC23, LiveCodeBench
  • result : the structure of a long CoT matters more than the correctness of its content.
  • contribution : ablations

Details

thumbnail

Image

contributions

  • Reveals that LoRA tuning with as few as 17K samples can elicit reasoning capabilities
  • The structure of the long CoT matters; the accuracy of each individual reasoning step doesn’t
  • Performed various ablations for model size, arch, dataset size, and data generation model

Simple distillation is effective

  • distillation data curation -> 12K math / 5K coding
    • prompt : math – {AMC/AIME, MATH, Olympiad, Numina-Math} + code – {APPS, TACO}
    • distill model : {DeepSeek-R1, QwQ-32B-Preview}
  • GPT-4o-mini is used to grade prompt difficulty and to validate responses against ground-truth solutions
  • training details
    • (code) llama-factory
    • (base model) Qwen2.5-32B-Instruct
    • lr 1e-5 / lora lr 1e-4

Result

  • a small amount of data is enough
Image

16K samples is already more than enough for good performance.

  • LoRA fine-tuning works without performance degradation
Image

long cot: structure is key

  • CoTs are ablated along two axes: local content vs. global structure
  • local content
    • final answer, numbers within the math derivation, reasoning keywords
  • global structure
    • reflection, self-validation, backtracking
  • setting : ablations on QwQ-32B-Preview, using its 4,618 correct responses as the baseline
Image

local content

  • wrong-answer samples
    • only 3.2 percentage points of performance degradation
  • digit-corrupted samples
    • numbers in the middle of derivations are intentionally corrupted at random
    • corrupting 70% of the digits degrades performance by only 4.3%
    • corrupting all of them is bad for performance
  • reasoning keyword removal
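The digit-corruption ablation can be sketched like this. The function name and the replacement policy (swap a sampled digit for a different random digit, leave everything else intact) are my assumptions about the setup, not code from the paper:

```python
import random

def corrupt_digits(cot: str, fraction: float, seed: int = 0) -> str:
    """Replace a given fraction of the digit characters in a CoT string
    with different random digits (sketch of the local-content ablation)."""
    rng = random.Random(seed)
    chars = list(cot)
    digit_positions = [i for i, c in enumerate(chars) if c.isdigit()]
    n_corrupt = int(len(digit_positions) * fraction)
    for i in rng.sample(digit_positions, n_corrupt):
        # pick a replacement digit guaranteed to differ from the original
        chars[i] = rng.choice([d for d in "0123456789" if d != chars[i]])
    return "".join(chars)

cot = "2 + 3 = 5, so doubling gives 10."
corrupted = corrupt_digits(cot, fraction=0.7)
```

Because only digits are touched, the surrounding reasoning structure (step order, keywords, final-answer position) is preserved, which is exactly what makes this a content-only ablation.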

global structure

  • Llama-3.3-70B-Instruct is used to split a reasoning trace into individual steps
  • then a percentage of steps are inserted, deleted, or shuffled
Image

Structural corruption degrades performance severely.
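Once an LLM has split the trace into steps, the three structural corruptions reduce to simple list operations. A minimal sketch (the function name and exact sampling choices are assumptions; in the paper, inserted steps come from other CoTs, which I stand in for here by re-sampling from the same trace):

```python
import random

def corrupt_structure(steps: list[str], fraction: float, mode: str, seed: int = 0) -> list[str]:
    """Delete, shuffle, or insert a fraction of reasoning steps
    (sketch of the global-structure ablation)."""
    rng = random.Random(seed)
    n = max(1, int(len(steps) * fraction))
    out = list(steps)
    if mode == "delete":
        # remove n randomly chosen steps
        for i in sorted(rng.sample(range(len(out)), n), reverse=True):
            del out[i]
    elif mode == "shuffle":
        # permute the contents of n randomly chosen positions
        idx = rng.sample(range(len(out)), n)
        vals = [out[i] for i in idx]
        rng.shuffle(vals)
        for i, v in zip(idx, vals):
            out[i] = v
    elif mode == "insert":
        # stand-in for inserting steps taken from another CoT
        for step in rng.choices(steps, k=n):
            out.insert(rng.randrange(len(out) + 1), step)
    return out

steps = ["Set x = 2.", "Then x^2 = 4.", "Check: 4 is even.", "Answer: 4."]
deleted = corrupt_structure(steps, fraction=0.5, mode="delete")
```

Unlike the digit corruption above, these operations break the logical flow between steps, which is why the paper finds them far more damaging.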

more ablations

  • does long-CoT learning degrade non-reasoning task performance?
Image

It didn’t, and performance actually went up.

  • ablation over the student model
Image

It went up for every model except Qwen2.5-32B-Instruct; I’m not sure why this one doesn’t.

  • Comparison with BoN
Image
  • comparison with short-CoT fine-tuning
Image

Short-CoT fine-tuning performed poorly by comparison.