
paper

TL;DR

  • I read this because: it went viral as an alternative to GRPO
  • Task: large reasoning model training
  • Problem: training instability and model collapse caused by the token-level importance ratios in the original GRPO algorithm
  • Idea: use sequence-level rather than token-level importance ratios to achieve stable RL training
  • Input/Output: query -> {reasoning, answer}
  • Architecture: Qwen3-30B-A3B-Base
  • Objective: GSPO (proposed)
  • Baseline: GRPO
  • Data: RL training on math (AIME'24), coding (LiveCodeBench, CodeForces) tasks
  • Evaluation: Training stability, efficiency metrics, downstream task performance
  • Result: superior training stability and efficiency, stabilized MoE training, large performance gains for Qwen3 models
  • Contribution: stabilizes RL training via sequence-level importance sampling; simplifies MoE RL training
  • Etc: developed by the Alibaba Qwen team; applied to the production Qwen3 models with real performance gains

Details

Problem Analysis

  • GRPO
Image

Here $w_{i,t}$ is an importance weight that corrects for the fact that the samples were not drawn from the target distribution $\pi_{tar}$. Ordinary importance sampling takes $N \gg 1$ samples and averages over them.

Image

๊ทธ๋Ÿฐ๋ฐ GRPO์—์„  1) ํ•˜๋‚˜์˜ sample๋กœ 2) (์ „์ฒด ํ™•๋ฅ ๋ถ„ํฌ๊ฐ€ ์•„๋‹Œ) next token probability์— ๋Œ€ํ•ด์„œ๋งŒ ๊ตฌํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋ชจ๋ธ์ด noise์— ๋งค์šฐ ๋ฏผ๊ฐํ•ด์ง€๋Š” ๊ฒฐ๊ณผ๋ฅผ ๋ƒ„. ๋˜ํ•œ ์ด๋Ÿฌํ•œ noise๊ฐ€ ๊ธด ์‹œํ€€์Šค์— ๋ˆ„์ ๋˜๋ฉด์„œ noise๊ฐ€ ๋” ์ปค์ ธ์„œ ํ•œ๋ฒˆ ์ž˜๋ชป ์ˆ˜๋ ดํ•˜๋ฉด ๋Œ์ดํ‚ค๊ธฐ ์–ด๋ ต๊ณ  hparam(clipping hparam, rl prompt, .. ๋“ฑ)์— ๋งค์šฐ ๋ฏผ๊ฐํ•˜๊ฒŒ ๋จ. ๋˜ํ•œ reward๋Š” ํ•œ ์‹œํ€€์Šค์— ๋Œ€ํ•ด ๋‚˜์˜ค๋Š”๋ฐ optimization objective๋Š” token ๋‹จ์œ„๋กœ ์˜ค๋Š” ๋ถˆ์ผ์น˜๊ฐ€ ์žˆ์Œ.

GSPO Algorithm

Image
  • Clipping is decided for the whole sequence, not per token
  • Every token in a sequence receives the same weight
  • $s_i$ is length-normalized by $|y_i|$ (the per-token log-ratios are averaged, i.e. the sequence ratio is raised to the power $1/|y_i|$) so that the clip range stays comparable regardless of sequence length

Gradient

Image Image
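Reconstructed from the definitions above (a sketch, ignoring clipping — check against the paper's figures), the two gradients contrast as follows: GRPO weights each token's gradient by its own noisy single-sample ratio $w_{i,t}$, while GSPO weights all tokens of a sequence equally by the sequence-level $s_i$:

$$\nabla_\theta \mathcal{J}_{\mathrm{GRPO}} = \mathbb{E}\left[ \frac{1}{G}\sum_{i=1}^{G} \frac{\hat{A}_i}{|y_i|} \sum_{t=1}^{|y_i|} w_{i,t}(\theta)\, \nabla_\theta \log \pi_\theta(y_{i,t} \mid x, y_{i,<t}) \right]$$

$$\nabla_\theta \mathcal{J}_{\mathrm{GSPO}} = \mathbb{E}\left[ \frac{1}{G}\sum_{i=1}^{G} s_i(\theta)\, \hat{A}_i \cdot \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \nabla_\theta \log \pi_\theta(y_{i,t} \mid x, y_{i,<t}) \right]$$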

Experimental Results

Training Efficiency:

Image
  • GRPO ๋Œ€๋น„ ๋” ๋†’์€ training reward ๋‹ฌ์„ฑ
  • ๋™์ผ ๊ณ„์‚ฐ๋Ÿ‰์—์„œ ๋” ๋‚˜์€ ์„ฑ๋Šฅ
  • ๋” ์•ˆ์ •์ ์ธ ์ˆ˜๋ ด ๊ณก์„ 
  • AIME'24, LiveCodeBench, CodeForces ์—์„œ ๋” ๋‚˜์€ ๋ฒค์น˜ ์„ฑ๋Šฅ

Clipping Analysis:

Image
  • GSPO: clips about 15% of tokens
  • GRPO: clips about 0.13% of tokens
  • Paradoxically, far more clipping leads to better performance

MoE Training Benefits

Training MoE Qwen3 models with GRPO tended to be unstable: the experts activated under the previous policy differ from those activated under the current policy, which greatly inflates the variance of the importance ratios. To work around this, a trick was used: cache the experts activated under $\pi_{old}$ and force $\pi$ and $\pi_{old}$ to use the same experts.
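A toy numpy sketch of that caching trick (function names and shapes are hypothetical, not from the paper's code): the expert indices chosen at rollout time are cached and replayed in the current policy's forward pass, so both policies activate the same experts.

```python
import numpy as np

rng = np.random.default_rng(0)

def route(logits, top_k=2, replay_idx=None):
    """Pick top-k experts per token. If replay_idx is given (cached from the
    old policy's forward pass), reuse those expert indices instead of the
    current router's choice, so pi and pi_old activate the same experts."""
    if replay_idx is None:
        idx = np.argsort(logits, axis=-1)[:, -top_k:]
    else:
        idx = replay_idx
    # Renormalize gate weights over the chosen experts only.
    gate_logits = np.take_along_axis(logits, idx, axis=-1)
    gates = np.exp(gate_logits) / np.exp(gate_logits).sum(-1, keepdims=True)
    return idx, gates

T, n_experts = 4, 8
old_logits = rng.normal(size=(T, n_experts))
new_logits = old_logits + rng.normal(0.0, 0.5, size=(T, n_experts))

cached_idx, _ = route(old_logits)                       # cache at rollout time
idx_replay, gates = route(new_logits, replay_idx=cached_idx)
assert (idx_replay == cached_idx).all()                 # same experts activated
```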

Image Image

GSPO turned out to work better than that trick, and it also reduces system complexity.

Benefit of GSPO for RL Infrastructure

rollout์€ sglang, vllm์œผ๋กœ ํ•˜๊ณ  training engine์€ megatron์œผ๋กœ ํ•˜๋ฉด์„œ ์ •๋ฐ€๋„ ์ด์Šˆ ๋•Œ๋ฌธ์— old policy์— ๋Œ€ํ•œ likelihood๋ฅผ ๋‹ค์‹œ ๊ณ„์‚ฐํ–ˆ์–ด์•ผ ํ–ˆ์Œ. (old policy๋Š” ์—…๋ฐ์ดํŠธ ๋˜๋Š” ๋Œ€์ƒ์ด ์•„๋‹ˆ๋ผ์„œ ์›๋ž˜๋Š” ์•ˆํ•ด๋„ ๋จ) ๊ทธ๋Ÿฌ๋‚˜ token-level likelihood์— ๋น„ํ•ด sequence-level likelihood๋Š” ์ •๋ฐ€๋„์— ๋ฏผ๊ฐํ•˜์ง€ ์•Š์•„์„œ ์žฌ๊ณ„์‚ฐ ํ•˜์ง€ ์•Š์•„๋„ ๋จ ์ด๋กœ ์ธํ•ด partial rollout and multi-turn RL and in the training-inference disaggregated frameworks ์ƒํ™ฉ์—์„œ ์กฐ๊ธˆ ๋” ํšจ์œจ์„ฑ์ด ์ข‹์Œ

c.f. DAPO — that work concerns how normalization is done, so its content differs.

Image