TL;DR
- I read this because: it was released as a GRPO alternative and went viral.
- Task: RL post-training of large reasoning models
- Problem: Training instability and model collapse due to per-token importance ratio in the existing GRPO algorithm.
- Idea: Use per-sequence rather than per-token importance ratio for reliable RL training
- Input/Output: query -> {reasoning, answer}
- Architecture: Qwen3-30B-A3B-Base
- Objective: GSPO(proposed)
- Baseline: GRPO
- Data: RL training on math (AIME'24), coding (LiveCodeBench, CodeForces) tasks
- Evaluation: Training stability, efficiency metrics, downstream task performance
- Result: Superior training stability, efficiency, stabilized MoE models, and significantly improved Qwen3 model performance
- Contribution: Stabilizes RL training with sequence-wise importance sampling, simplifies MoE RL training
- Etc: Developed by Alibaba Qwen team, applied to real Qwen3 models to achieve performance improvements
Details
Problem Analysis
- GRPO uses a per-token importance ratio:
$$w_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{old}}(y_{i,t} \mid x, y_{i,<t})}$$
where $w_{i,t}$ corrects the probability because we sampled from the behavior policy $\pi_{\theta_{old}}$ rather than the target distribution $\pi_{tar}$, in the standard importance-sampling form $\mathbb{E}_{z \sim \pi_{beh}}\left[\frac{\pi_{tar}(z)}{\pi_{beh}(z)} f(z)\right]$. In normal importance sampling it is common to take $N > 1$ samples and average.
However, in GRPO we get the next-token probability 1) from a single sample and 2) not from the entire probability distribution, so each per-token ratio is very noisy. This noise compounds multiplicatively over long sequences, making training hard to reverse once it converges incorrectly and very sensitive to hparams (clipping hparam, RL prompt, etc.). There is also a mismatch: the reward is given per sequence, but the optimization objective is applied per token.
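The noise-accumulation argument can be sketched numerically. This is a toy simulation (not the paper's code): assume each per-token log-ratio fluctuates around 0 with a small noise scale `sigma` (value chosen for illustration), then the product of per-token ratios drifts further from 1 as the sequence grows.

```python
# Toy: variance of the cumulative token-level importance correction
# grows with sequence length T (sigma is an assumed noise scale).
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.05  # assumed per-token log-ratio noise, for illustration only

stds = {}
for T in (10, 100, 1000):
    # simulate 10k sequences of per-token log-ratios log r_t ~ N(0, sigma^2)
    log_ratios = rng.normal(0.0, sigma, size=(10_000, T))
    # GRPO-style cumulative correction = product of per-token ratios
    seq_ratio = np.exp(log_ratios.sum(axis=1))
    stds[T] = float(seq_ratio.std())
    print(T, round(stds[T], 3))
```

The printed spread widens with `T`, which is the "noise gets louder in long sequences" point above.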
GSPO Algorithm
- Determine clipping for the entire sequence, not per token
- Apply equal weight to all tokens
- Length-normalize: raise the sequence ratio to the power $1/|y_i|$ (equivalently, divide the summed log-ratio by $|y_i|$), giving $s_i(\theta) = \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{old}}(y_i \mid x)} \right)^{1/|y_i|}$, so the clip range behaves similarly regardless of length
- Gradient: since the ratio is sequence-level, every token of $y_i$ is weighted by the same factor $s_i(\theta) \hat{A}_i$, instead of GRPO's noisy per-token weights
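The bullets above can be sketched as code. This is my reading of the objective, not official Qwen code; the padding convention and the `gspo_objective` helper are assumptions for the sketch.

```python
# Sketch of the GSPO sequence-level clipped objective (my reading of the
# paper, not official code).
import numpy as np

def gspo_objective(logp, logp_old, lengths, A, eps=0.2):
    """logp, logp_old: (G, T) per-token log-probs (0 on padding);
    lengths: (G,) response lengths; A: (G,) group-relative advantages."""
    # length-normalized sequence ratio s_i = (pi/pi_old)^(1/|y_i|)
    s = np.exp((logp - logp_old).sum(axis=1) / lengths)
    # clipping is decided once per sequence, not per token
    clipped = np.clip(s, 1.0 - eps, 1.0 + eps)
    # PPO-style pessimistic (min) objective, averaged over the group
    return float(np.minimum(s * A, clipped * A).mean())

# sanity check: identical policies give s_i = 1, so J equals mean(A)
G, T = 4, 8
rng = np.random.default_rng(1)
logp_old = rng.normal(-1.0, 0.3, size=(G, T))
lengths = np.full(G, T)
A = np.array([1.0, -0.5, 0.25, -0.75])
J = gspo_objective(logp_old, logp_old, lengths, A)  # J == A.mean() here
```

Note that clipping one sequence drops all of its tokens from the update at once, which is why GSPO's clipped-token fraction (below) looks much larger than GRPO's.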
Experimental Results
Training Efficiency:
- Achieve higher training rewards compared to GRPO
- Better performance for the same amount of computation
- More stable convergence curves
- Better bench performance on AIME'24, LiveCodeBench, and CodeForces
Clipping Analysis:
- GSPO: clips ~15% of tokens
- GRPO: clips ~0.13% of tokens
- Paradoxically, clipping two orders of magnitude more tokens still trains better, which suggests GRPO's per-token gradient estimates are noisy
MoE Training Benefits
When training the MoE Qwen3 model with GRPO, training tended to be unstable: after an update, the experts activated for the same token can differ between $\pi_{old}$ and $\pi$, which greatly inflates the variance of the per-token importance ratio. The workaround ("Routing Replay") caches the experts activated under $\pi_{old}$ and forces $\pi$ to reuse the same experts.
GSPO is stable without this trick, since its sequence-level ratio is far less sensitive to individual routing flips, so the training pipeline is simpler.
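A toy illustration of why routing flips blow up the token-level ratio (assumed setup, not from the paper: hard top-1 routing, two "experts" that just assign a fixed probability to the token):

```python
# Toy: a tiny router update flips the selected expert, so the token
# probability under pi vs pi_old jumps even though parameters barely
# moved; "Routing Replay" reuses the cached pi_old expert to avoid this.
import numpy as np

def token_prob(router_logits, expert_probs, replay_idx=None):
    # top-1 routing unless a cached expert index is replayed
    idx = int(np.argmax(router_logits)) if replay_idx is None else replay_idx
    return expert_probs[idx], idx

expert_probs = np.array([0.9, 0.2])   # each expert's prob for the token
old_logits = np.array([0.01, 0.0])    # router barely prefers expert 0
new_logits = np.array([-0.01, 0.0])   # tiny update flips to expert 1

p_old, idx_old = token_prob(old_logits, expert_probs)
p_new, _ = token_prob(new_logits, expert_probs)
p_replay, _ = token_prob(new_logits, expert_probs, replay_idx=idx_old)

print(p_new / p_old)     # large jump in the importance ratio
print(p_replay / p_old)  # ~1 when the old expert is replayed
```

GSPO sidesteps the problem instead of patching it: a single flipped token barely moves the length-normalized sequence ratio.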
Benefit of GSPO for RL Infrastructure
With sglang/vllm as the rollout engine and Megatron as the training engine, precision mismatches forced recomputing the old-policy likelihood in the training engine (even though $\pi_{old}$ is not a target for updating, so this should not be necessary in the first place). Sequence-level likelihood is far less sensitive to precision than token-level likelihood, so GSPO can use the rollout engine's likelihoods directly without recomputation. This makes partial rollout, multi-turn RL, and training-inference disaggregated frameworks slightly more efficient.
c.f. DAPO: different, because DAPO is about how the token-level loss is normalized (aggregated over the batch), while GSPO moves the importance ratio itself to the sequence level.