TL;DR
- I read this because.. : Is it the content or the structure of a CoT that matters? Recommended via SkyThought
- task : reasoning in LLM
- problem : what it takes to learn a long CoT, studied via ablations
- idea : ablation experiments on CoTs
- input/output : Q -> {reasoning(long CoT), A}
- architecture : Qwen2.5-32B-Instruct
- objective : ce loss
- baseline : Qwen2.5-32B-Instruct, QwQ
- data : proposed 17K samples (prompts from {AMC/AIME, MATH, Olympiad subset of NuminaMATH, APPS, TACO} + distilled from {DeepSeek-R1, QwQ-32B-Preview} + R1-17K reasoning traces)
- evaluation : MATH-500, OlympiadBench, AIME-2024, AMC23, LiveCodeBench
- result : the global structure of a long CoT matters more than the correctness of the content inside it.
- contribution : ablations
Details
contributions
- Reveals that LoRA fine-tuning on as few as 17K samples can elicit strong reasoning capabilities
- The accuracy of each reasoning step in the long CoT doesn’t matter much; its global structure does
- Performed various ablations for model size, arch, dataset size, and data generation model
Simple distillation is effective
- distillation data curation -> 12K math / 5K coding
- prompt : math – {AMC/AIME, MATH, Olympiad, Numina-Math} + code – {APPS, TACO}
- distill model : {DeepSeek-R1, QwQ-32B-Preview}
- GPT-4o-mini rates prompt difficulty and validates responses against ground-truth solutions
- +) open R1-17K reasoning dataset (https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k )
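The validation step in the curation pipeline can be sketched as a filter that keeps only distilled traces whose final answer matches the ground truth. A minimal sketch (the paper uses GPT-4o-mini as the validator; the `\boxed{}` extraction rule and the function name here are my own illustrative assumptions):

```python
import re

def keep_verified(samples):
    """Keep distilled samples whose final \\boxed{} answer matches the
    ground truth (illustrative stand-in for GPT-4o-mini validation)."""
    kept = []
    for s in samples:
        # Take the last \boxed{...} in the reasoning trace as the final answer.
        boxed = re.findall(r"\\boxed\{([^}]*)\}", s["trace"])
        if boxed and boxed[-1].strip() == str(s["answer"]).strip():
            kept.append(s)
    return kept
```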
- training details
- (code) llama-factory
- (base model) Qwen2.5-32B-Instruct
- lr 1e-5 / lora lr 1e-4
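LLaMA-Factory is driven by a YAML config; a minimal sketch of the setup above might look like the following (key names follow LLaMA-Factory's SFT examples, but the rank, epoch count, and dataset name are assumptions, not the paper's released config):

```yaml
# Minimal LLaMA-Factory SFT sketch (illustrative values, not the paper's config).
model_name_or_path: Qwen/Qwen2.5-32B-Instruct
stage: sft
do_train: true
finetuning_type: lora        # full fine-tuning would use learning_rate: 1.0e-5
lora_target: all
dataset: r1_17k              # hypothetical dataset name for the 17K samples
template: qwen
learning_rate: 1.0e-4        # LoRA lr from the notes above
num_train_epochs: 3.0        # assumption; not stated in these notes
output_dir: saves/qwen2.5-32b-lora-sft
```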
Result
- a small amount of data is enough
16 is already more than enough for strong performance.
- LoRA finetuning works without performance degradation
long cot: structure is key
- Ablation of CoTs to local content / global structure
- local content
- Final answer, numbers within the math derivation, reasoning keywords
- global structure
- reflection, self-validation, backtracking
- setting: ablations start from 4,618 verified-correct QwQ-32B-Preview responses as the reference training set
local content
- wrong answer sample
- Only 3.2 percentage points of performance degradation
- digit-corrupted samples
- Intentionally corrupts random digits in the middle of the derivations.
- Corrupting 70% of the digits only degrades performance by 4.3%.
- Corrupting 100% of the digits, however, clearly hurts performance.
- reasoning keyword removal
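The digit-corruption ablation can be sketched roughly as below (the function name and the "replace with a different random digit" rule are my assumptions about the setup, not the paper's exact procedure):

```python
import random

def corrupt_digits(cot: str, fraction: float, seed: int = 0) -> str:
    """Randomly replace a fraction of the digit characters in a CoT trace.

    Illustrative sketch of the digit-corruption ablation; non-digit
    characters (text, operators) are left untouched."""
    rng = random.Random(seed)
    positions = [i for i, ch in enumerate(cot) if ch.isdigit()]
    k = int(len(positions) * fraction)
    chars = list(cot)
    for i in rng.sample(positions, k):
        # Replace with a *different* random digit so the value changes.
        chars[i] = rng.choice([d for d in "0123456789" if d != chars[i]])
    return "".join(chars)
```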
global structure
- Using Llama-3.3-70B-Instruct to split the reasoning into several steps
- Then insert, delete, and shuffle by a percentage
The degradation here is severe.
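The structural perturbations above (on the step list produced by Llama-3.3-70B-Instruct) can be sketched as follows; the operation names and exact rules are my assumptions about the setup:

```python
import random

def perturb_structure(steps, op="shuffle", fraction=0.3, seed=0):
    """Delete, shuffle, or insert a fraction of reasoning steps.

    Illustrative sketch of the global-structure ablation; the input
    list is not mutated."""
    rng = random.Random(seed)
    steps = list(steps)
    k = max(1, int(len(steps) * fraction))
    if op == "delete":
        # Remove k randomly chosen steps (highest index first).
        for i in sorted(rng.sample(range(len(steps)), k), reverse=True):
            del steps[i]
    elif op == "shuffle":
        # Permute the contents of k randomly chosen positions.
        idx = rng.sample(range(len(steps)), k)
        vals = [steps[i] for i in idx]
        rng.shuffle(vals)
        for i, v in zip(idx, vals):
            steps[i] = v
    elif op == "insert":
        # Insert k copies of randomly chosen steps at random positions.
        for _ in range(k):
            steps.insert(rng.randrange(len(steps) + 1), rng.choice(steps))
    return steps
```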
more ablations
- does long cot learning cause non-reasoning task performance degradation?
It didn’t, and performance actually went up.
- ablation for student model
It went up except for Qwen2.5-32B-Instruct. I’m not sure why this one doesn’t.
- Comparison with BoN
- comparison to short-CoT finetuning
short cot performance was poor