Image

paper , github

TL;DR

  • I read this because.. : ์š”์ฆ˜ ๋Œ€์„ธ์ธ answer๋งŒ์œผ๋กœ reward๋ฅผ ์ฃผ๋Š” ์ ‘๊ทผ๋ก 
  • task : RL reasoning
  • problem : scale RL
  • idea : 1) cot sft ๋ฐ์ดํ„ฐ๋ฅผ ์ž˜ ๋งŒ๋“ค์ž 2) exploration ์„ ๋งŽ์ด ์‹œํ‚ค์ž(temperature when exploration, entropy bonus) 3) ์ •๋‹ต + undesirable ํ–‰๋™์— ๋Œ€ํ•ด์„œ๋งŒ reward๋ฅผ ์ฃผ์ž
  • input/output : Q -> A
  • architecture : Qwen2.5-32B
  • objective : 1 or 0 reward + RLOO + entropy bonus
  • baseline : QwQ-32B-preview
  • data : MATH-train, NuminaMATH
  • evaluation : MATH500, AIME2024, Omni-math-500
  • result : QwQ-32B-preview ๋ณด๋‹ค ๋†’์€ ์„ฑ๋Šฅ
  • contribution : ๋‹ค์–‘ํ•œ ablation๊ณผ ๋ฐฉ๋ฒ•๋ก ๋„ #220 ๊ณผ ๋Œ€๋™์†Œ์ด
  • etc. : ์ง€๊ธˆ ๋‹ค์‹œ ๋ณด๋‹ˆ on-policy๋„ ๋งŽ์ด ๊ฐ•์กฐํ•œ๋“ฏ?

Details

overall pipeline

Image

์ €๊ธฐ์„œ ์ •๋‹ต์€ ground truth์™€ ๋น„๊ตํ•˜์—ฌ ๋งž์œผ๋ฉด 1 ์•„๋‹ˆ๋ฉด 0

  • Initializing Policy with CoT for Reasoning ๋‹ค์–‘ํ•œ llm์„ ์‚ฌ์šฉํ•˜์—ฌ prompt x์— ๋Œ€ํ•œ ๋‹ค์–‘ํ•œ attempt๋ฅผ ๋ชจ์Œ.

  • scaling response sampling with high temperature temperature๋ฅผ 1 ์ด์ƒ์œผ๋กœ ์ฃผ์–ด ๋‹ค์–‘ํ•œ response๊ฐ€ ๋‚˜์˜ค๋„๋ก ํ•จ RLOO๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ reward scaling Image

  • auxiliary entropy bonus

Image
  • on-policy kl divergence
Image

kl divergence term์— ๋Œ€ํ•ด์„œ๋„ scaling์„ ์ ์šฉํ•จ

Image

reference model์— ema ์ ์šฉ

  • Penalizing Unexpected Patterns in RL Training Image

repeated / overlong answer์— ๋Œ€ํ•ด reward์— -1๋ฅผ ๋”ํ•ด์คŒ. ์ด๊ฑด rule based (n-gram ๋ฐ˜๋ณต๋“ฑ)์œผ๋กœ ํƒ์ง€ํ–ˆ์Œ

details

  • data construction
    • MATH, NuminaMATH๋ฅผ SFT / RL ์šฉ์œผ๋กœ ๋‚˜๋ˆ”
    • sft ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด์„œ ์ถ”๊ฐ€์ ์ธ ํ•„ํ„ฐ๋ง ์ ์šฉ – ๋„ˆ๋ฌด ์‰ฝ๊ฑฐ๋‚˜ noisy ๋ฐ์ดํ„ฐ ์ œ๊ฑฐ.
    • 16๊ฐœ์˜ response๋ฅผ ์ƒ์„ฑํ•œ ๋’ค ์ •๋‹ต๋ฅ  0.3 ์ดํ•˜์ธ ์• ๋“ค๋งŒ ๋‚จ๊น€

result

  • overall results Image

  • ablation on sampling more

Image

sampling K๋ฅผ ๋Š˜๋ฆฌ๋ฉด ๋‹ต๋ณ€๊ธธ์ด๋„ ๋Š˜์–ด๋‚˜๊ณ  ์ •ํ™•๋„๋„ ๋Š˜์–ด๋‚จ. (a), (b) ๋˜ํ•œ ๊ฐ™์€ reward์— ๋Œ€๋น„ํ•˜์—ฌ KL divergence๊ฐ€ ์ž‘๊ณ  ๋Š˜์–ด๋‚˜๋Š” ์†๋„๋„ ๋А๋ฆผ (c) – ์ด๊ฒŒ ์™œ ์ข‹์€๊ฑฐ์ง€?

Image

์ตœ์ข…์ ์ธ ์„ฑ๋Šฅ

  • exploration
Image Image

1.2๊ฐ€ ์ตœ์ . ๋„ˆ๋ฌด ํฌ๋ฉด ์•ˆ์ข‹์•˜์Œ.

  • penalty reward
Image