paper, code

TL;DR

  • why I read this : is it the content or the structure of the CoT that matters? Recommended by SkyThought
  • task : reasoning in LLM
  • problem : how a model learns a long CoT, studied via ablations
  • idea : ablation experiments on long CoTs
  • input/output : Q -> {reasoning(long CoT), A}
  • architecture : Qwen2.5-32B-Instruct
  • objective : ce loss
  • baseline : Qwen2.5-32B-Instruct, QwQ
  • data : proposed 17K samples (prompts from {AMC/AIME, MATH, Olympiad subset of NuminaMATH, APPS, TACO}, long CoTs distilled from {DeepSeek-R1, QwQ-32B-Preview}) + R1-17K reasoning traces
  • evaluation : MATH-500, OlympiadBench, AIME-2024, AMC23, LiveCodeBench
  • result : the structure of a long CoT matters more than the correctness of its content.
  • contribution : ablations

Details

thumbnail

Image

contributions

  • Reveals that LoRA tuning with as few as 17K samples can elicit reasoning capabilities
  • The structure of the long CoT matters; the accuracy of each individual reasoning step doesn’t
  • Performed various ablations for model size, arch, dataset size, and data generation model

Simple distillation is effective

  • distillation data curation -> 12K math / 5K coding
    • prompt : math – {AMC/AIME, MATH, Olympiad, Numina-Math} + code – {APPS, TACO}
    • distill model : {DeepSeek-R1, QwQ-32B-Preview}
  • GPT-4o-mini is used to grade prompt difficulty and to validate responses against ground-truth solutions
  • training details
    • (code) llama-factory
    • (base model) Qwen2.5-32B-Instruct
    • lr 1e-5 / lora lr 1e-4

Result

  • a small amount of data is enough
Image

16K samples is already more than enough for good performance.

  • LoRA fine-tuning works without performance degradation
Image

long cot: structure is key

  • CoTs are ablated along two axes: local content vs. global structure
  • local content
    • final answer, numbers within the math derivation, reasoning keywords
  • global structure
    • reflection, self-validation, backtracking
  • setting : ablations on QwQ-32B-Preview, using its 4,618 correct responses as the baseline
Image

local content

  • wrong-answer samples
    • only 3.2 percentage points of performance degradation
  • digit-corrupted samples
    • numbers in the middle of derivations are intentionally corrupted at random
    • corrupting 70% of the digits degrades performance by only 4.3%
    • corrupting all of them is bad for performance
  • reasoning keyword removal
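The digit-corruption ablation can be sketched like this. The function name and the replacement policy (swap a sampled digit for a different random digit, leave everything else intact) are my assumptions about the setup, not code from the paper:

```python
import random

def corrupt_digits(cot: str, fraction: float, seed: int = 0) -> str:
    """Replace a given fraction of the digit characters in a CoT string
    with different random digits (sketch of the local-content ablation)."""
    rng = random.Random(seed)
    chars = list(cot)
    digit_positions = [i for i, c in enumerate(chars) if c.isdigit()]
    n_corrupt = int(len(digit_positions) * fraction)
    for i in rng.sample(digit_positions, n_corrupt):
        # pick a replacement digit guaranteed to differ from the original
        chars[i] = rng.choice([d for d in "0123456789" if d != chars[i]])
    return "".join(chars)

cot = "2 + 3 = 5, so doubling gives 10."
corrupted = corrupt_digits(cot, fraction=0.7)
```

Because only digits are touched, the surrounding reasoning structure (step order, keywords, final-answer position) is preserved, which is exactly what makes this a content-only ablation.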

global structure

  • Llama-3.3-70B-Instruct is used to split a reasoning trace into individual steps
  • then a percentage of steps are inserted, deleted, or shuffled
Image

Structural corruption degrades performance severely.
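Once an LLM has split the trace into steps, the three structural corruptions reduce to simple list operations. A minimal sketch (the function name and exact sampling choices are assumptions; in the paper, inserted steps come from other CoTs, which I stand in for here by re-sampling from the same trace):

```python
import random

def corrupt_structure(steps: list[str], fraction: float, mode: str, seed: int = 0) -> list[str]:
    """Delete, shuffle, or insert a fraction of reasoning steps
    (sketch of the global-structure ablation)."""
    rng = random.Random(seed)
    n = max(1, int(len(steps) * fraction))
    out = list(steps)
    if mode == "delete":
        # remove n randomly chosen steps
        for i in sorted(rng.sample(range(len(out)), n), reverse=True):
            del out[i]
    elif mode == "shuffle":
        # permute the contents of n randomly chosen positions
        idx = rng.sample(range(len(out)), n)
        vals = [out[i] for i in idx]
        rng.shuffle(vals)
        for i, v in zip(idx, vals):
            out[i] = v
    elif mode == "insert":
        # stand-in for inserting steps taken from another CoT
        for step in rng.choices(steps, k=n):
            out.insert(rng.randrange(len(out) + 1), step)
    return out

steps = ["Set x = 2.", "Then x^2 = 4.", "Check: 4 is even.", "Answer: 4."]
deleted = corrupt_structure(steps, fraction=0.5, mode="delete")
```

Unlike the digit corruption above, these operations break the logical flow between steps, which is why the paper finds them far more damaging.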

more ablations

  • does long-CoT learning degrade non-reasoning task performance?
Image

It didn’t, and performance actually went up.

  • ablation over the student model
Image

It went up for every model except Qwen2.5-32B-Instruct; I’m not sure why this one doesn’t.

  • Comparison with BoN
Image
  • comparison with short-CoT fine-tuning
Image

Short-CoT fine-tuning performed poorly by comparison.