Image

paper, github

TL;DR

  • I read this because : the answer-only reward approach is all the rage these days
  • task : RL reasoning
  • problem : scale RL
  • idea : 1) build good CoT SFT data 2) let the model explore a lot (high sampling temperature, entropy bonus) 3) reward only correct answers, and penalize undesirable behavior
  • input/output : Q -> A
  • architecture : Qwen2.5-32B
  • objective : 1 or 0 reward + RLOO + entropy bonus
  • baseline : QwQ-32B-preview
  • data : MATH-train, NuminaMATH
  • evaluation : MATH500, AIME2024, Omni-math-500
  • result : Higher performance than QwQ-32B-preview
  • contribution : various ablations and methodological details are also shown, similar to #220
  • etc. : looking at it again, the paper also seems to put a lot of emphasis on on-policy training?

Details

overall pipeline

Image

The generated final answer is compared to the ground truth, and the reward is 1 if it matches, 0 otherwise.
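A minimal sketch of this binary, answer-only reward. The `\boxed{...}` extraction is an assumption for illustration; the paper's actual answer matcher is not specified here.

```python
import re


def extract_final_answer(response: str):
    # Illustrative assumption: take the content of the last \boxed{...}
    # in the model's response as its final answer.
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    return matches[-1].strip() if matches else None


def answer_reward(response: str, ground_truth: str) -> float:
    # Reward is 1.0 iff the extracted answer matches the ground truth.
    predicted = extract_final_answer(response)
    return 1.0 if predicted is not None and predicted == ground_truth.strip() else 0.0
```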

  • Initializing Policy with CoT for Reasoning : a collection of CoT responses gathered by prompting different LLMs in different ways, used to initialize (SFT) the policy.

  • scaling response sampling with high temperature : sample with temperature above 1 to get diverse responses; the binary rewards are turned into advantages with RLOO Image
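A sketch of the RLOO (REINFORCE Leave-One-Out) advantage: each of the K sampled responses uses the mean reward of the other K-1 samples for the same prompt as its baseline.

```python
def rloo_advantages(rewards):
    # rewards: list of K scalar rewards for K responses to one prompt.
    # Advantage of sample i = r_i - mean(r_j for j != i).
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]
```

Note that the advantages always sum to zero across the K samples, which is what makes the leave-one-out baseline variance-reducing but unbiased.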

  • auxiliary entropy bonus

Image
  • on-policy kl divergence
Image
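The entropy bonus and the on-policy KL term above can be sketched as follows; this is a schematic per-token version (the exact estimator and weighting in the paper may differ).

```python
import math


def entropy(probs):
    # Shannon entropy of one token's output distribution. Averaged over
    # positions, this is the bonus added to the objective to keep
    # exploration from collapsing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def k1_kl_estimate(logp_policy, logp_ref):
    # On-policy KL estimate over sampled tokens:
    # E_a~pi[log pi(a) - log pi_ref(a)] = KL(pi || pi_ref).
    n = len(logp_policy)
    return sum(lp - lr for lp, lr in zip(logp_policy, logp_ref)) / n
```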

A scaling factor is also applied to the KL divergence term.

Image

The reference model is updated as an exponential moving average (EMA) of the policy.
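The EMA reference-model update, sketched on flat parameter lists; the decay value 0.99 is an illustrative assumption, not the paper's setting.

```python
def ema_update(ref_params, policy_params, alpha=0.99):
    # theta_ref <- alpha * theta_ref + (1 - alpha) * theta_policy.
    # The reference slowly tracks the policy, so the KL penalty anchors
    # against a moving target instead of the frozen SFT checkpoint.
    return [alpha * r + (1.0 - alpha) * p for r, p in zip(ref_params, policy_params)]
```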

  • Penalizing Unexpected Patterns in RL Training Image

A penalty of -1 is added to the reward for repeated or overlong answers, detected with rule-based checks (n-gram repetition, etc.).
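A sketch of such a rule-based check and the shaped reward; the n-gram size, repeat threshold, and length cap here are illustrative assumptions.

```python
def has_ngram_repetition(tokens, n=4, max_repeats=3):
    # Flag a response if any n-gram occurs more than max_repeats times.
    counts = {}
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
        if counts[gram] > max_repeats:
            return True
    return False


def shaped_reward(base_reward, tokens, max_len=16384):
    # Add -1 for repeated or overlong answers, per the penalty above.
    if has_ngram_repetition(tokens) or len(tokens) > max_len:
        return base_reward - 1.0
    return base_reward
```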

details

  • data construction
  • Splitting MATH and NuminaMATH into SFT and RL subsets
  • Apply additional filtering to SFT data – remove data that is too easy or noisy.
  • Generate 16 responses per problem and keep the problems with a pass rate of 0.3 or lower.
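The difficulty filter above can be sketched as a one-liner; parameter names are mine, the 16 samples and 0.3 threshold are from the note.

```python
def keep_for_rl(pass_count, num_samples=16, max_pass_rate=0.3):
    # Keep only problems the current model rarely solves (pass rate <= 0.3),
    # filtering out ones that are already too easy to provide signal.
    return (pass_count / num_samples) <= max_pass_rate
```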

result

  • overall results Image

  • ablation on sampling more

Image

Increasing the sampling count K increases answer length and accuracy (a), (b). Also, for the same reward, the KL divergence is smaller and grows more slowly (c) – why is this good? Presumably because the policy reaches the same reward with less drift from the reference, i.e. more stable training.

Image

Final performance

  • exploration
Image Image

A temperature of 1.2 is optimal; going too high hurts.

  • penalty reward
Image