
paper

TL;DR

  • I read this because: it went viral as a GRPO alternative.
  • Task: RL post-training of large reasoning models
  • Problem: Training instability and model collapse due to per-token importance ratio in the existing GRPO algorithm.
  • Idea: Use per-sequence rather than per-token importance ratio for reliable RL training
  • Input/Output: query -> {reasoning, answer}
  • Architecture: Qwen3-30B-A3B-Base
  • Objective: GSPO(proposed)
  • Baseline: GRPO
  • Data: RL training on math (AIME'24) and coding (LiveCodeBench, CodeForces) tasks
  • Evaluation: Training stability, efficiency metrics, downstream task performance
  • Result: Superior training stability, efficiency, stabilized MoE models, and significantly improved Qwen3 model performance
  • Contribution: Stabilizes RL training with sequence-wise importance sampling, simplifies MoE RL training
  • Etc: Developed by Alibaba Qwen team, applied to real Qwen3 models to achieve performance improvements

Details

Problem Analysis

  • GRPO
$$\mathcal{J}_{\mathrm{GRPO}}(\theta)=\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\min\left(w_{i,t}(\theta)\,\hat{A}_i,\ \mathrm{clip}\left(w_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon\right)\hat{A}_i\right)\right],\quad w_{i,t}(\theta)=\frac{\pi_\theta(y_{i,t}\mid x,y_{i,<t})}{\pi_{\theta_{old}}(y_{i,t}\mid x,y_{i,<t})}$$

where $w_{i,t}$ is the importance ratio that corrects for the fact that we sampled from the old policy rather than the target distribution $\pi_{tar}$ (here, the current policy $\pi_\theta$). In standard importance sampling, it is common to draw N > 1 samples and take the mean:

$$\mathbb{E}_{z\sim\pi_{tar}}[f(z)]=\mathbb{E}_{z\sim\pi_{beh}}\!\left[\frac{\pi_{tar}(z)}{\pi_{beh}(z)}f(z)\right]\approx\frac{1}{N}\sum_{n=1}^{N}\frac{\pi_{tar}(z_n)}{\pi_{beh}(z_n)}f(z_n),\quad z_n\sim\pi_{beh}$$

However, GRPO applies this ratio 1) to a single sample and 2) at the level of individual next-token probabilities (not the full distribution), so it does not act as a valid distribution correction and is very sensitive to noise. Worse, this noise accumulates over long sequences: the longer the response, the louder the noise becomes, making bad convergence hard to reverse and training very sensitive to hyperparameters (the clipping range, the RL prompt set, etc.). There is also a unit mismatch: the reward is given per sequence, but the optimization objective is applied per token.
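The accumulation problem can be sketched numerically: even when every per-token ratio is nearly 1, the product of ratios over a long sequence drifts far from 1. A minimal sketch (the log-prob values below are made up for illustration):

```python
import math

def token_ratios(logp_new, logp_old):
    """Per-token importance ratios w_{i,t} = pi_theta / pi_old."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

# hypothetical log-probs: the new policy is only slightly off at every token
T = 500
logp_old = [-2.0] * T
logp_new = [lp + 0.01 for lp in logp_old]  # tiny per-token perturbation

w = token_ratios(logp_new, logp_old)
seq_ratio = math.exp(sum(n - o for n, o in zip(logp_new, logp_old)))

print(round(w[0], 4))       # ~1.0101: each token ratio is nearly 1
print(round(seq_ratio, 1))  # ~148.4: the product over 500 tokens explodes
```

A 1% per-token deviation compounds to a ~150x deviation over 500 tokens, which is why long sequences make GRPO's token-level ratios so noisy.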

GSPO Algorithm

$$\mathcal{J}_{\mathrm{GSPO}}(\theta)=\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\min\left(s_i(\theta)\,\hat{A}_i,\ \mathrm{clip}\left(s_i(\theta),\,1-\varepsilon,\,1+\varepsilon\right)\hat{A}_i\right)\right],\quad s_i(\theta)=\left(\frac{\pi_\theta(y_i\mid x)}{\pi_{\theta_{old}}(y_i\mid x)}\right)^{\frac{1}{|y_i|}}$$
  • Determine clipping for the entire sequence, not per token
  • Apply equal weight to all tokens
  • Length-normalize $s_i$: raise the sequence ratio to the power $1/|y_i|$ (equivalently, divide the summed log-ratio by $|y_i|$) so the clip range stays comparable regardless of sequence length
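The three bullets above can be sketched as follows (a minimal sketch; the function names and toy numbers are mine, not from the paper):

```python
import math

def gspo_seq_ratio(logp_new, logp_old):
    """s_i(theta): length-normalized sequence ratio (geometric mean of token ratios)."""
    T = len(logp_old)
    return math.exp(sum(n - o for n, o in zip(logp_new, logp_old)) / T)

def gspo_objective(seq_logps, advantages, eps=0.2):
    """Clipped sequence-level surrogate, averaged over a group of G responses."""
    total = 0.0
    for (lp_new, lp_old), A in zip(seq_logps, advantages):
        s = gspo_seq_ratio(lp_new, lp_old)
        clipped = min(max(s, 1 - eps), 1 + eps)
        total += min(s * A, clipped * A)  # one clip decision for the whole sequence
    return total / len(advantages)

# toy group of G=2 responses (hypothetical log-probs and advantages)
group = [([-1.0, -1.2], [-1.1, -1.3]),
         ([-2.0, -2.5, -2.2], [-2.0, -2.4, -2.3])]
advs = [1.0, -1.0]
print(round(gspo_objective(group, advs), 4))  # ≈ 0.0526
```

Note that clipping either keeps or drops an entire response from the gradient update, rather than dropping individual tokens.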

Gradient (clipping omitted for clarity)

$$\nabla_\theta\,\mathcal{J}_{\mathrm{GSPO}}(\theta)=\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}s_i(\theta)\,\hat{A}_i\cdot\frac{1}{|y_i|}\sum_{t=1}^{|y_i|}\nabla_\theta\log\pi_\theta(y_{i,t}\mid x,y_{i,<t})\right]$$

$$\nabla_\theta\,\mathcal{J}_{\mathrm{GRPO}}(\theta)=\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{\hat{A}_i}{|y_i|}\sum_{t=1}^{|y_i|}w_{i,t}(\theta)\,\nabla_\theta\log\pi_\theta(y_{i,t}\mid x,y_{i,<t})\right]$$

GSPO scales every token's log-prob gradient by the same sequence-level factor $s_i(\theta)$, whereas GRPO scales each token by its own noisy ratio $w_{i,t}(\theta)$.
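The contrast between the two gradients can be made concrete by comparing the per-token weights each algorithm assigns (a minimal sketch with made-up log-probs):

```python
import math

def grpo_token_weights(logp_new, logp_old):
    """GRPO: each token's log-prob gradient is scaled by its own ratio w_{i,t}."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def gspo_token_weights(logp_new, logp_old):
    """GSPO: every token shares the single sequence-level factor s_i(theta)."""
    s = math.exp(sum(n - o for n, o in zip(logp_new, logp_old)) / len(logp_old))
    return [s] * len(logp_old)

# hypothetical per-token log-probs for one response
logp_new = [-1.0, -3.0, -1.5]
logp_old = [-1.2, -2.5, -1.5]

print([round(w, 3) for w in grpo_token_weights(logp_new, logp_old)])  # uneven weights
print([round(w, 3) for w in gspo_token_weights(logp_new, logp_old)])  # one shared weight
```

Under GRPO a single noisy token can dominate or vanish from the update; under GSPO all tokens in a response move together.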

Experimental Results

Training Efficiency:

Image
  • Achieves higher training rewards than GRPO
  • Better performance for the same amount of compute
  • More stable convergence curves
  • Better benchmark performance on AIME'24, LiveCodeBench, and CodeForces

Clipping Analysis:

Image
  • GSPO: clips ~15% of tokens
  • GRPO: clips ~0.13% of tokens
  • Paradoxically, clipping two orders of magnitude more tokens yields better performance, suggesting GRPO's token-level gradient estimates are noisy and of low quality

MoE Training Benefits

When training the Qwen3 MoE model with GRPO, training tended to be unstable: the experts activated under the old policy and under the current policy differed, which made the importance ratio far more volatile. To work around this, the authors used a Routing Replay trick: cache the experts activated by $\pi_{\theta_{old}}$ and force $\pi_\theta$ to reuse the same experts.

Image Image

GSPO removes the need for this trick: since only the sequence-level likelihood matters, expert-routing fluctuations at individual tokens no longer destabilize training, and the resulting infrastructure is simpler.

Benefit of GSPO for RL Infrastructure

With SGLang/vLLM as the rollout engine and Megatron as the training engine, precision mismatches forced recomputing the old policy's likelihoods with the training engine (even though the old policy is not being updated, so this should not have been necessary in the first place). Sequence-level likelihoods, unlike token-level ones, are not sensitive to these precision differences and need no recomputation. This makes GSPO somewhat more efficient for partial rollouts, multi-turn RL, and training-inference disaggregated frameworks.
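The precision argument can be sketched numerically: a tiny per-token log-prob gap between engines blows up when compounded token by token, but stays negligible at the length-normalized sequence level. A minimal sketch (the numbers are made up for illustration):

```python
import math

# hypothetical per-token log-probs: the training engine vs. an inference
# engine whose values differ by a tiny numerical error at every token
T = 1000
train_logps = [-1.5] * T
infer_logps = [lp + 1e-3 for lp in train_logps]  # per-token precision gap

# token-level view: compounding T per-token gaps amplifies the mismatch
token_level_gap = math.exp(sum(i - t for i, t in zip(infer_logps, train_logps)))

# sequence-level (length-normalized) view: the gap stays tiny
seq_level_gap = math.exp(sum(i - t for i, t in zip(infer_logps, train_logps)) / T)

print(round(token_level_gap, 3))  # e^1 ≈ 2.718: a 1e-3 error compounds to ~2.7x
print(round(seq_level_gap, 6))    # e^0.001 ≈ 1.001: negligible at sequence level
```

This is why sequence-level likelihoods from the rollout engine can be used directly without recomputation.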

c.f. DAPO: related but different — DAPO changes how the token-level loss is normalized (aggregated), whereas GSPO changes the importance ratio itself from token level to sequence level.

Image