[205] LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!

TL;DR

I read this because.. : Does the content of the CoT matter, or its structure? Recommended to me. aka SkyThought
task : reasoning in LLM
problem : ablation on how long CoT should be learned
idea : run ablation experiments on the CoT
input/output : Q -> {reasoning(long CoT), A}
architecture : Qwen2.5-32B-Instruct
objective : ce loss
baseline : Qwen2.5-32B-Instruct, QwQ
data : proposed 17K samples (prompts from {AMC/AIME, MATH, Olympiad subset from NuminaMATH, APPS, TACO} + distilled from {DeepSeek-R1, QwQ-32B-Preview}) + R1-17K reasoning
evaluation : MATH-500, OlympiadBench, AIME-2024, AMC23, LiveCodeBench
result : the structure of the long CoT matters more than the correctness of its content.
contribution : ablations
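
As a rough illustration of the "objective : ce loss" line, here is a minimal sketch of a token-level cross-entropy over the {reasoning, A} part of a sample. The function name `response_cross_entropy` and the choice to mask out question tokens are assumptions for illustration; the notes only say that the loss is cross-entropy.

```python
import math

def response_cross_entropy(token_log_probs, is_response):
    """Mean negative log-likelihood over response tokens only.

    token_log_probs: log-probability the model gave to each target token.
    is_response: True for tokens in {reasoning, A}, False for question tokens.
    Masking the question is an assumption, not something the notes state.
    """
    losses = [-lp for lp, keep in zip(token_log_probs, is_response) if keep]
    if not losses:
        raise ValueError("no response tokens to train on")
    return sum(losses) / len(losses)

# One question token (ignored) followed by two response tokens.
loss = response_cross_entropy(
    [math.log(0.5), math.log(0.25), math.log(0.5)],
    [False, True, True],
)
```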

contributions

distillation data curation -> 12k math / 5k coding
- prompt : math – {AMC/AIME, MATH, Olympiad, Numina-Math} + code – {APPS, TACO}
- distill model : {DeepSeek-R1, QwQ-32B-Preview}
- GPT-4o-mini is used to classify prompts by difficulty / validate against the ground-truth solution
- +) open R1-17K reasoning dataset (https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k )
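
The validation step in the curation list above can be sketched roughly as a filter over distilled samples. This is a simplified stand-in: the notes say GPT-4o-mini does the validation, whereas this sketch uses a normalized string match, and the field names `answer` and `ground_truth` are hypothetical.

```python
def normalize(answer):
    """Lowercase and drop whitespace so trivially different answers compare equal."""
    return "".join(answer.lower().split())

def keep_verified(samples):
    """Keep only distilled samples whose final answer matches the reference.

    Each sample is a dict with hypothetical keys "answer" and "ground_truth".
    String matching replaces the LLM-based check described in the notes.
    """
    return [
        s for s in samples
        if normalize(s["answer"]) == normalize(s["ground_truth"])
    ]

samples = [
    {"answer": " 42 ", "ground_truth": "42"},
    {"answer": "x = 3", "ground_truth": "x=3"},
    {"answer": "7", "ground_truth": "8"},
]
verified = keep_verified(samples)
```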
training details
- (code) llama-factory
- (base model) Qwen2.5-32B-Instruct
- lr 1e-5 / lora lr 1e-4

Sufficient performance with just 16.

wrong answer sample
- performance drops by only about 3.2%p
digits corrupted samples
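- Digits in the intermediate steps are deliberately corrupted at random.
- Even with about 70% of the digits corrupted, performance drops by only 4.3%.
- Corrupting all of them does hurt performance.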
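
The digit corruption described above could be reproduced roughly as follows. This is a sketch under assumptions: the helper `corrupt_digits`, its per-digit independent sampling, and the `seed` parameter are illustrative and not taken from the paper's code.

```python
import random

def corrupt_digits(text, fraction, seed=0):
    """Replace each ASCII digit with a random digit with probability `fraction`.

    The replacement is drawn uniformly from 0-9, so it can coincide with the
    original digit; the effective corruption rate is slightly below `fraction`.
    """
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in "0123456789" and rng.random() < fraction:
            out.append(str(rng.randrange(10)))
        else:
            out.append(ch)
    return "".join(out)

# Corrupt roughly 70% of the digits in one reasoning step.
step = "So 12 * 7 = 84, and 84 + 16 = 100."
corrupted = corrupt_digits(step, 0.7, seed=1)
```

Only the digits change; the surrounding wording and step layout are left intact, which is the property this ablation relies on.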
reasoning keyword removal
- Removing all words such as "wait", "let me think again", and "but" drops performance by only about 3.3%.
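
A minimal sketch of the keyword-removal ablation, assuming a simple case-insensitive regex pass. The `KEYWORDS` list only contains the three examples from the note above; the full list used in the paper is not given here, and `remove_keywords` is a hypothetical helper.

```python
import re

# Only the examples mentioned in these notes; the real list may be longer.
KEYWORDS = ["wait", "let me think again", "but"]

def remove_keywords(text, keywords=KEYWORDS):
    """Delete reasoning keywords (and a directly following comma) from text."""
    # Longest phrases first so multi-word keywords win over shorter ones.
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = "|".join(re.escape(k) for k in ordered)
    stripped = re.sub(rf"\b(?:{pattern})\b,?", "", text, flags=re.IGNORECASE)
    # Collapse the whitespace runs left behind by the deletions.
    return re.sub(r"\s{2,}", " ", stripped).strip()

cleaned = remove_keywords(
    "Wait, the sum is 12. But let me think again, 5 + 7 = 12."
)
```

The word boundaries keep unrelated words such as "button" untouched.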

The degradation is very severe. (Did not read this part in detail.)

That was not the case; performance actually goes up.

Performance went up for every model except Qwen2.5-32B-Instruct. Not sure why it does not work for that one.

Short CoT performance was not good.