TL;DR
- why I read this : the answer-only reward approach that is all the rage these days
- task : RL reasoning
- problem : scale RL
- idea : 1) curate good CoT SFT data 2) let the model explore a lot (high sampling temperature, entropy bonus) 3) reward only correct answers, and penalize undesirable behaviors
- input/output : Q -> A
- architecture : Qwen2.5-32B
- objective : 1 or 0 reward + RLOO + entropy bonus
- baseline : QwQ-32B-preview
- data : MATH-train, NuminaMATH
- evaluation : MATH500, AIME2024, Omni-math-500
- result : Higher performance than QwQ-32B-preview
- contribution : various ablations; the methodology is shown similarly in #220
- etc. : looking at it again, the paper also seems to emphasize on-policy training a lot?
Details
overall pipeline
The model's final answer is compared against the ground truth; the reward is 1 if correct, 0 otherwise.
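A minimal sketch of this binary outcome reward, assuming a simple string-match verifier (the paper's actual answer extraction/comparison logic is not specified here):

```python
def outcome_reward(model_answer: str, ground_truth: str) -> float:
    """Binary outcome reward: 1.0 if the extracted final answer
    matches the ground truth, 0.0 otherwise."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0
```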
Initializing Policy with CoT for Reasoning
A collection of CoT attempts produced by prompting different LLMs in different ways.
scaling response sampling with high temperature
Use a temperature above 1 to get diverse responses.
Reward scaling using RLOO
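RLOO (REINFORCE Leave-One-Out) turns the K sampled responses per prompt into advantages by baselining each reward against the mean of the other K-1 rewards. A minimal sketch (my own implementation, not the paper's code):

```python
def rloo_advantages(rewards: list[float]) -> list[float]:
    """REINFORCE Leave-One-Out: for each of the K samples of one prompt,
    the baseline is the mean reward of the other K-1 samples."""
    k = len(rewards)
    total = sum(rewards)
    # r - mean(others) = r - (total - r) / (k - 1)
    return [r - (total - r) / (k - 1) for r in rewards]
```

Note the advantages always sum to zero across the K samples, so only relative quality within a prompt's group matters.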
auxiliary entropy bonus
- on-policy KL divergence
The coefficient of the KL divergence term is also scaled during training.
The reference model is updated as an EMA of the policy.
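A minimal sketch of the three auxiliary pieces above: the entropy bonus, the KL penalty against a reference model, and the EMA update of that reference. All coefficient values here are illustrative assumptions, not the paper's:

```python
import math

def entropy(probs: list[float]) -> float:
    """Shannon entropy of a categorical distribution (the entropy bonus term)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) between the policy (p) and reference (q) distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def ema_update(ref_params: list[float], policy_params: list[float],
               decay: float = 0.99) -> list[float]:
    """Move the reference model toward the policy with an exponential
    moving average (decay=0.99 is an assumed value)."""
    return [decay * r + (1 - decay) * p
            for r, p in zip(ref_params, policy_params)]
```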
- Penalizing Unexpected Patterns in RL Training
A reward of -1 is added for repeated or overlong answers, detected by rule-based checks (n-gram repetition, etc.).
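A sketch of such a rule-based check; the n-gram size, repetition count, and length cutoff below are illustrative thresholds, not the paper's values:

```python
def has_ngram_repetition(text: str, n: int = 4, max_repeats: int = 3) -> bool:
    """Return True if any word n-gram occurs more than max_repeats times."""
    words = text.split()
    counts: dict[tuple, int] = {}
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
        if counts[gram] > max_repeats:
            return True
    return False

def pattern_penalty(text: str, max_words: int = 2048) -> float:
    """-1 penalty for repeated or overlong responses, 0 otherwise."""
    if len(text.split()) > max_words or has_ngram_repetition(text):
        return -1.0
    return 0.0
```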
details
- data construction
- Breaking down MATH, NuminaMATH for SFT/RL
- Apply additional filtering to the SFT data: remove examples that are too easy or noisy.
- Generate 16 responses per question and keep the questions with a pass rate of 0.3 or lower.
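The difficulty filter above can be sketched as follows (function name and interface are my own, for illustration):

```python
def filter_hard_questions(questions: list[str], pass_rates: list[float],
                          threshold: float = 0.3) -> list[str]:
    """Keep only questions whose pass rate (fraction of the 16 sampled
    responses that were correct) is at or below the threshold,
    i.e. the harder problems."""
    return [q for q, rate in zip(questions, pass_rates) if rate <= threshold]
```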
result
overall results
ablation on sampling more
Increasing the sampling count K increases answer length and accuracy (a), (b). Also, at the same reward level, the KL divergence is smaller and rises more slowly (c): why is this good?
Ultimate performance
- exploration
A temperature of 1.2 is optimal; too high is bad.
- penalty reward