[200] Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling

TL;DR

I read this because.. : it follows the currently dominant approach of assigning reward from the final answer alone
task : RL reasoning
problem : scale RL
idea : 1) build good CoT SFT data 2) encourage a lot of exploration (high temperature during exploration, entropy bonus) 3) base the reward only on answer correctness + undesirable behavior
input/output : Q -> A
architecture : Qwen2.5-32B
objective : 1 or 0 reward + RLOO + entropy bonus
baseline : QwQ-32B-preview
data : MATH-train, NuminaMATH
evaluation : MATH500, AIME2024, Omni-math-500
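result : higher performance than QwQ-32B-preview
contribution : a variety of ablations; the methodology is also largely the same as #220
etc. : looking at it again now, it seems to put a lot of emphasis on on-policy training too?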

In the reward above, answer correctness is judged against the ground truth: 1 if it matches, 0 otherwise.
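The 1-or-0 reward can be sketched as follows. This is a minimal illustration, not the paper's grader: `normalize_answer` and `binary_reward` are hypothetical names, and the string normalization is an assumption (the notes only say the answer is compared with the ground truth).

```python
def normalize_answer(text: str) -> str:
    """Assumed normalization: trim whitespace, drop a trailing period, lowercase."""
    return text.strip().rstrip(".").strip().lower()


def binary_reward(predicted: str, ground_truth: str) -> float:
    """Return 1.0 if the final answer matches the ground truth, else 0.0."""
    if normalize_answer(predicted) == normalize_answer(ground_truth):
        return 1.0
    return 0.0
```

A real math grader would likely need symbolic equivalence checks (for example, treating `1/2` and `0.5` as equal), which this sketch does not attempt.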

Initializing Policy with CoT for Reasoning: use a variety of LLMs to collect diverse attempts for each prompt x.
scaling response sampling with high temperature: set the temperature to 1 or higher so that diverse responses are sampled. Use RLOO for reward scaling.
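Both parts of this step can be sketched in a few lines. The block below shows how a temperature above 1 flattens the sampling distribution, and how a standard leave-one-out (RLOO) baseline turns K rewards for the same prompt into advantages. The function names are hypothetical, and the notes do not spell out the paper's exact RLOO formulation, so the usual leave-one-out mean baseline is assumed here.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature); a higher temperature is flatter."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def rloo_advantages(rewards):
    """Leave-one-out advantages for K responses sampled from one prompt.

    Each response is compared with the mean reward of the other K - 1
    responses (assumed standard RLOO baseline).
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two responses per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]
```

With 1-or-0 rewards, a prompt whose sampled responses are all correct or all wrong yields zero advantage for every response, so only prompts with mixed outcomes contribute a learning signal.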
auxiliary entropy bonus
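As a rough illustration of an entropy bonus, the sketch below adds the mean per-token entropy of the policy's output distributions to the objective being maximized. The function names and the coefficient `entropy_coef=0.01` are assumptions for illustration; the notes do not give the paper's exact form or value.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def objective_with_entropy_bonus(policy_objective, token_distributions,
                                 entropy_coef=0.01):
    """Add entropy_coef * mean token entropy to the policy objective.

    token_distributions holds one probability list per generated token.
    entropy_coef is a hypothetical value, not taken from the paper.
    """
    mean_entropy = sum(token_entropy(p) for p in token_distributions)
    mean_entropy /= len(token_distributions)
    return policy_objective + entropy_coef * mean_entropy
```

The bonus rewards the policy for keeping its token distributions spread out, which works in the same direction as high-temperature sampling: both push toward more exploration.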