TL;DR
- I read this because: it was released as a GRPO alternative and went viral.
- Task: RL post-training of large reasoning models
- Problem: Training instability and model collapse due to per-token importance ratio in the existing GRPO algorithm.
- Idea: Use per-sequence rather than per-token importance ratio for reliable RL training
- Input/Output: query -> {reasoning, answer}
- Architecture: Qwen3-30B-A3B-Base
- Objective: GSPO(proposed)
- Baseline: GRPO
- Data: RL training on math (AIME'24), coding (LiveCodeBench, CodeForces) tasks
- Evaluation: Training stability, efficiency metrics, downstream task performance
- Result: Superior training stability, efficiency, stabilized MoE models, and significantly improved Qwen3 model performance
- Contribution: Stabilizes RL training with sequence-wise importance sampling, simplifies MoE RL training
- Etc: Developed by Alibaba Qwen team, applied to real Qwen3 models to achieve performance improvements
Details
Problem Analysis
- GRPO uses a per-token importance ratio:
$$w_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{old}}(y_{i,t} \mid x, y_{i,<t})}$$
where $w_{i,t}$ corrects the probability because we sampled from the behavior policy $\pi_{\theta_{old}}$ rather than the target distribution $\pi_{tar}$, in the standard importance-sampling form $\mathbb{E}_{z \sim \pi_{beh}}\left[\frac{\pi_{tar}(z)}{\pi_{beh}(z)} f(z)\right]$. In normal importance sampling it is common to take $N > 1$ samples and average.
However, in GRPO we get the next-token probability 1) from a single sample and 2) not from the entire probability distribution, so each per-token ratio is very noisy. This noise compounds multiplicatively over long sequences, making training hard to reverse once it converges incorrectly and very sensitive to hparams (clipping hparam, RL prompt, etc.). There is also a mismatch: the reward is given per sequence, but the optimization objective is applied per token.
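The noise-accumulation argument can be sketched numerically. This is a toy simulation (not the paper's code): assume each per-token log-ratio fluctuates around 0 with a small noise scale `sigma` (value chosen for illustration), then the product of per-token ratios drifts further from 1 as the sequence grows.

```python
# Toy: variance of the cumulative token-level importance correction
# grows with sequence length T (sigma is an assumed noise scale).
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.05  # assumed per-token log-ratio noise, for illustration only

stds = {}
for T in (10, 100, 1000):
    # simulate 10k sequences of per-token log-ratios log r_t ~ N(0, sigma^2)
    log_ratios = rng.normal(0.0, sigma, size=(10_000, T))
    # GRPO-style cumulative correction = product of per-token ratios
    seq_ratio = np.exp(log_ratios.sum(axis=1))
    stds[T] = float(seq_ratio.std())
    print(T, round(stds[T], 3))
```

The printed spread widens with `T`, which is the "noise gets louder in long sequences" point above.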
GSPO Algorithm
- Determine clipping for the entire sequence, not per token
- Apply equal weight to all tokens
- Length-normalize: raise the sequence ratio to the power $1/|y_i|$ (equivalently, divide the summed log-ratio by $|y_i|$), giving $s_i(\theta) = \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{old}}(y_i \mid x)} \right)^{1/|y_i|}$, so the clip range behaves similarly regardless of length
- Gradient: since the ratio is sequence-level, every token of $y_i$ is weighted by the same factor $s_i(\theta) \hat{A}_i$, instead of GRPO's noisy per-token weights
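The bullets above can be sketched as code. This is my reading of the objective, not official Qwen code; the padding convention and the `gspo_objective` helper are assumptions for the sketch.

```python
# Sketch of the GSPO sequence-level clipped objective (my reading of the
# paper, not official code).
import numpy as np

def gspo_objective(logp, logp_old, lengths, A, eps=0.2):
    """logp, logp_old: (G, T) per-token log-probs (0 on padding);
    lengths: (G,) response lengths; A: (G,) group-relative advantages."""
    # length-normalized sequence ratio s_i = (pi/pi_old)^(1/|y_i|)
    s = np.exp((logp - logp_old).sum(axis=1) / lengths)
    # clipping is decided once per sequence, not per token
    clipped = np.clip(s, 1.0 - eps, 1.0 + eps)
    # PPO-style pessimistic (min) objective, averaged over the group
    return float(np.minimum(s * A, clipped * A).mean())

# sanity check: identical policies give s_i = 1, so J equals mean(A)
G, T = 4, 8
rng = np.random.default_rng(1)
logp_old = rng.normal(-1.0, 0.3, size=(G, T))
lengths = np.full(G, T)
A = np.array([1.0, -0.5, 0.25, -0.75])
J = gspo_objective(logp_old, logp_old, lengths, A)  # J == A.mean() here
```

Note that clipping one sequence drops all of its tokens from the update at once, which is why GSPO's clipped-token fraction (below) looks much larger than GRPO's.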
Experimental Results
Training Efficiency:
- Achieve higher training rewards compared to GRPO
- Better performance for the same amount of computation
- More stable convergence curves
- Better bench performance on AIME'24, LiveCodeBench, and CodeForces
Clipping Analysis:
- GSPO: clips ~15% of tokens
- GRPO: clips ~0.13% of tokens
- Paradoxically, clipping two orders of magnitude more tokens still trains better, which suggests GRPO's per-token gradient estimates are noisy
MoE Training Benefits
When training the MoE Qwen3 model with GRPO, training tended to be unstable: after an update, the experts activated for the same token can differ between $\pi_{old}$ and $\pi$, which greatly inflates the variance of the per-token importance ratio. The workaround ("Routing Replay") caches the experts activated under $\pi_{old}$ and forces $\pi$ to reuse the same experts.
GSPO is stable without this trick, since its sequence-level ratio is far less sensitive to individual routing flips, so the training pipeline is simpler.
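A toy illustration of why routing flips blow up the token-level ratio (assumed setup, not from the paper: hard top-1 routing, two "experts" that just assign a fixed probability to the token):

```python
# Toy: a tiny router update flips the selected expert, so the token
# probability under pi vs pi_old jumps even though parameters barely
# moved; "Routing Replay" reuses the cached pi_old expert to avoid this.
import numpy as np

def token_prob(router_logits, expert_probs, replay_idx=None):
    # top-1 routing unless a cached expert index is replayed
    idx = int(np.argmax(router_logits)) if replay_idx is None else replay_idx
    return expert_probs[idx], idx

expert_probs = np.array([0.9, 0.2])   # each expert's prob for the token
old_logits = np.array([0.01, 0.0])    # router barely prefers expert 0
new_logits = np.array([-0.01, 0.0])   # tiny update flips to expert 1

p_old, idx_old = token_prob(old_logits, expert_probs)
p_new, _ = token_prob(new_logits, expert_probs)
p_replay, _ = token_prob(new_logits, expert_probs, replay_idx=idx_old)

print(p_new / p_old)     # large jump in the importance ratio
print(p_replay / p_old)  # ~1 when the old expert is replayed
```

GSPO sidesteps the problem instead of patching it: a single flipped token barely moves the length-normalized sequence ratio.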
Benefit of GSPO for RL Infrastructure
With sglang/vllm as the rollout engine and Megatron as the training engine, precision mismatches forced recomputing the old-policy likelihood in the training engine (even though $\pi_{old}$ is not a target for updating, so this should not be necessary in the first place). Sequence-level likelihood is far less sensitive to precision than token-level likelihood, so GSPO can use the rollout engine's likelihoods directly without recomputation. This makes partial rollout, multi-turn RL, and training-inference disaggregated frameworks slightly more efficient.
c.f. DAPO: different, because DAPO is about how the token-level loss is normalized (aggregated over the batch), while GSPO moves the importance ratio itself to the sequence level.