TL;DR
- why I read this : the answer-only reward approach that is all the rage these days
- task : RL reasoning
- problem : scale RL
- idea : 1) curate good CoT SFT data 2) let the model explore a lot (high sampling temperature, entropy bonus) 3) reward only correct answers, and penalize undesirable behaviors
- input/output : Q -> A
- architecture : Qwen2.5-32B
- objective : 1 or 0 reward + RLOO + entropy bonus
- baseline : QwQ-32B-preview
- data : MATH-train, NuminaMATH
- evaluation : MATH500, AIME2024, Omni-math-500
- result : Higher performance than QwQ-32B-preview
- contribution : various ablations; the methodology is shown similarly in #220
- etc. : looking at it again, the paper also seems to emphasize on-policy training a lot?
Details
overall pipeline
The model's final answer is compared against the ground truth; the reward is 1 if correct, 0 otherwise.
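A minimal sketch of this binary outcome reward, assuming a simple string-match verifier (the paper's actual answer extraction/comparison logic is not specified here):

```python
def outcome_reward(model_answer: str, ground_truth: str) -> float:
    """Binary outcome reward: 1.0 if the extracted final answer
    matches the ground truth, 0.0 otherwise."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0
```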
Initializing Policy with CoT for Reasoning
A collection of CoT attempts produced by prompting different LLMs in different ways.
scaling response sampling with high temperature
Use a temperature above 1 to get diverse responses.
Reward scaling using RLOO
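RLOO (REINFORCE Leave-One-Out) turns the K sampled responses per prompt into advantages by baselining each reward against the mean of the other K-1 rewards. A minimal sketch (my own implementation, not the paper's code):

```python
def rloo_advantages(rewards: list[float]) -> list[float]:
    """REINFORCE Leave-One-Out: for each of the K samples of one prompt,
    the baseline is the mean reward of the other K-1 samples."""
    k = len(rewards)
    total = sum(rewards)
    # r - mean(others) = r - (total - r) / (k - 1)
    return [r - (total - r) / (k - 1) for r in rewards]
```

Note the advantages always sum to zero across the K samples, so only relative quality within a prompt's group matters.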
auxiliary entropy bonus
- on-policy KL divergence
The coefficient of the KL divergence term is also scaled during training.
The reference model is updated as an EMA of the policy.
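A minimal sketch of the three auxiliary pieces above: the entropy bonus, the KL penalty against a reference model, and the EMA update of that reference. All coefficient values here are illustrative assumptions, not the paper's:

```python
import math

def entropy(probs: list[float]) -> float:
    """Shannon entropy of a categorical distribution (the entropy bonus term)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) between the policy (p) and reference (q) distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def ema_update(ref_params: list[float], policy_params: list[float],
               decay: float = 0.99) -> list[float]:
    """Move the reference model toward the policy with an exponential
    moving average (decay=0.99 is an assumed value)."""
    return [decay * r + (1 - decay) * p
            for r, p in zip(ref_params, policy_params)]
```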
- Penalizing Unexpected Patterns in RL Training
A reward of -1 is added for repeated or overlong answers, detected by rule-based checks (n-gram repetition, etc.).
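A sketch of such a rule-based check; the n-gram size, repetition count, and length cutoff below are illustrative thresholds, not the paper's values:

```python
def has_ngram_repetition(text: str, n: int = 4, max_repeats: int = 3) -> bool:
    """Return True if any word n-gram occurs more than max_repeats times."""
    words = text.split()
    counts: dict[tuple, int] = {}
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
        if counts[gram] > max_repeats:
            return True
    return False

def pattern_penalty(text: str, max_words: int = 2048) -> float:
    """-1 penalty for repeated or overlong responses, 0 otherwise."""
    if len(text.split()) > max_words or has_ngram_repetition(text):
        return -1.0
    return 0.0
```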
details
- data construction
- Breaking down MATH, NuminaMATH for SFT/RL
- Apply additional filtering to the SFT data: remove examples that are too easy or noisy.
- Generate 16 responses per question and keep the questions with a pass rate of 0.3 or lower.
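The difficulty filter above can be sketched as follows (function name and interface are my own, for illustration):

```python
def filter_hard_questions(questions: list[str], pass_rates: list[float],
                          threshold: float = 0.3) -> list[str]:
    """Keep only questions whose pass rate (fraction of the 16 sampled
    responses that were correct) is at or below the threshold,
    i.e. the harder problems."""
    return [q for q, rate in zip(questions, pass_rates) if rate <= threshold]
```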
result
overall results
ablation on sampling more
Increasing the sampling count K increases answer length and accuracy (a), (b). Also, at the same reward level, the KL divergence is smaller and rises more slowly (c): why is this good?
Ultimate performance
- exploration
A temperature of 1.2 is optimal; too high is bad.
- penalty reward