
paper

TL;DR

  • I read this because : it is mentioned alongside DeepSeek-R1, and it covers multimodal reasoning.
  • task : reasoning ability in LLMs
  • problem : we want to train an LLM that does long reasoning, but value functions / PRMs / MCTS are too complicated.
  • idea : apply RLOO, make the reward verifiable, and curate good prompts. Train with long CoT first, then distil into short CoT.
  • input/output : {q, (optional) image} -> a
  • architecture : (proposed) Kimi k1.5. The architecture and size are supposedly in a previous paper… I can’t find it.
  • objective : (pretraining, sft) ce loss -> (rl) RLOO loss with offline samples
  • baseline : OpenAI o1, OpenAI o1-mini, QVQ-72B Preview, QwQ-32B Preview
  • data : (all proposed, not open) (PT) ?? (SFT) 1M SFT, 1M VLM SFT (CoT SFT) ? (RL) diverse / non-hackable prompts
  • evaluation : AIME, MATH500, Codeforces, LiveCodeBench v5, Mathvista, MMMU
  • result : better performance than the OpenAI models for both long and short CoT
  • contribution : a more detailed account of the prompt-refinement process, etc., than in R1.
  • etc. : (comment by Sunghyun)

Details

thumbnail

  • long CoT : (figure)

  • short CoT : (figure)

RL prompt set curation

They say they curated high-quality prompts by focusing on three things:

  • diverse coverage : STEM, code, general reasoning -> each dataset was tagged with domain / discipline and a balanced selection was chosen.
  • balanced difficulty : a balanced mix of easy, moderate, and difficult prompts -> model-based: difficulty is measured by the pass rate over 10 samples.
  • accurate evaluability : the verifier must be able to evaluate objectively and reliably, without relying on superficial patterns or random guesses -> multiple-choice, true/false, and proof-based questions are removed to prevent reward hacking -> to eliminate ‘easy-to-hack’ prompts, questions that the model answered correctly within 8 attempts without any CoT reasoning step were removed.
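The two model-based checks above (difficulty via pass rate, easy-to-hack via no-CoT attempts) can be sketched roughly like this. `sample_fn(prompt, n, use_cot)` is an assumed interface returning `n` sampled answers; the field names are hypothetical.

```python
def pass_rate(model_answers, gold):
    """Fraction of sampled answers that match the gold answer."""
    return sum(a == gold for a in model_answers) / len(model_answers)

def filter_prompts(prompts, sample_fn, n_difficulty=10, n_nocot=8):
    """Sketch of the note's two model-based checks:
    - difficulty = pass rate over n_difficulty CoT samples,
    - drop 'easy-to-hack' prompts the model answers correctly
      without any CoT within n_nocot attempts."""
    kept = []
    for p in prompts:
        # difficulty bucket from the pass rate with CoT
        rate = pass_rate(sample_fn(p["q"], n_difficulty, use_cot=True), p["a"])
        # easy-to-hack: answered correctly with no CoT at all -> drop
        no_cot = sample_fn(p["q"], n_nocot, use_cot=False)
        if p["a"] in no_cot:
            continue
        kept.append({**p, "difficulty": rate})
    return kept
```

In practice the kept prompts would then be bucketed by `difficulty` to get the balanced easy/moderate/hard mix.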

Long-CoT SFT

  • Data that allows for planning / evaluation / reflection / exploration.
  • Said to be generated by prompting a model (rejection sampling). It is unclear which model was used or how much data there was.

Reinforcement Learning

problem setting

  • They liken it to a planning algorithm, but what is actually learned is a similar yet ultimately flattened sequence of reasoning steps (maybe it’s because I don’t have an RL background, but it just seems wordy…)
  • The objective is to maximize
    • $\max_\theta \; \mathbb{E}_{(x, y^*) \sim \mathcal{D},\, (y, z) \sim \pi_\theta} \left[ r(x, y, y^*) \right]$
  • where $z$ is the sequence of reasoning steps and $r$ is the verifiable reward in $\{0, 1\}$
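Since $r$ is a binary verifiable reward, the verifier can be as simple as extracting the final answer and comparing it to the gold answer. A minimal sketch; the `Final answer:` marker is an assumed output convention, not from the paper.

```python
def verifiable_reward(response, gold):
    """Binary verifiable reward r in {0, 1}: 1 iff the final answer
    extracted from the response matches the gold answer exactly.
    'Final answer:' is an assumed formatting convention."""
    marker = "Final answer:"
    if marker not in response:
        return 0
    final = response.split(marker)[-1].strip()
    return int(final == gold.strip())
```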

policy optimization

They borrow the online policy mirror descent algorithm (https://www.ijcai.org/proceedings/2019/0434.pdf , https://github.com/manantomar/Mirror-Descent-Policy-Optimization). I looked it up a bit: like PPO it is optimization under constraints, but the update rule seems to be something other than plain gradient descent.

(figure: the regularized policy-optimization objective)

I feel like I’ve seen this form a lot in TRPO; I’ll have to go through it carefully later.

(figure: the resulting policy gradient)

In conclusion, the gradient looks like the above. It is similar to the policy gradient, except that the baseline is the mean of the sampled rewards (https://github.com/long8v/PTIR/issues/215#issuecomment-2608698801). Another difference is that the original algorithm uses online rollouts, whereas here the rollouts come from the reference model. As for why deleting the value network itself was beneficial: with a value network, when learning a long reasoning path, the intermediate steps of a negative CoT immediately receive a negative advantage. Since exploration over diverse reasoning paths is also needed, it seems it was more important to roll out all the way to the end, RLOO-style (!!).
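The "baseline is the mean of sampled rewards" idea above can be sketched as a leave-one-out advantage over $k$ rollouts of the same prompt, with no value network involved:

```python
def rloo_advantages(rewards):
    """Leave-one-out advantage for each of k sampled rollouts:
    A_i = r_i - mean of the other k-1 rewards. With the plain
    mean-of-all-samples baseline mentioned above, drop the
    leave-one-out correction; the shape is the same."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]
```

Correct rollouts get a positive advantage, wrong ones a negative advantage, and the advantages sum to zero for each prompt.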

length penalty

To prevent overthinking

(figure: length-penalty formula)
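A minimal sketch of a length penalty of this shape: shorter correct answers get a bonus, longer ones are penalized, and wrong answers are never rewarded for being short. The 0.5 offset and linear scaling here are assumptions for illustration, not taken from the note.

```python
def length_reward(length, correct, min_len, max_len):
    """Illustrative length penalty: lambda interpolates linearly
    from +0.5 (shortest sampled answer) to -0.5 (longest).
    Correct answers get lambda; wrong answers get min(0, lambda),
    so a wrong answer can be penalized but never rewarded."""
    lam = 0.5 - (length - min_len) / (max_len - min_len)
    return lam if correct else min(0.0, lam)
```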

sampling

  • curriculum sampling : train from low to high difficulty.
  • prioritized sampling : problems where the model currently performs worst are sampled more often.
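The prioritized-sampling idea can be sketched by weighting each problem by its failure rate; the exact proportional weighting here is an assumption.

```python
import random

def prioritized_sample(problems, success_rates, k, rng=None):
    """Sample k problems with probability proportional to
    (1 - success rate), so the model revisits its weakest
    problems most often."""
    rng = rng or random.Random(0)
    weights = [1.0 - success_rates[p] for p in problems]
    return rng.choices(problems, weights=weights, k=k)
```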

Reward Modeling for math

They use an RM for math because answers like $a^2-4$ versus $(a+2)(a-2)$ are hard to match with exact comparison. They use what is called a CoT RM (#211), a reward model trained on 800K CoT-labeled examples. Accuracy improved from 84.4 to 98.5 compared to a classic RM, so the CoT RM was adopted.
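To see why exact string match fails on answers like these, here is a numeric spot-check that treats two answer strings as functions of one variable and evaluates them at a few points. This only illustrates the motivation for a learned verifier; it is not the paper's method.

```python
def answers_equivalent(expr_a, expr_b, var="a",
                       points=(-3.5, -1.0, 0.0, 2.0, 7.25)):
    """Numeric spot-check: do two answer strings denote the same
    function of `var`? Evaluates both at a few sample points.
    eval() is used on trusted strings only, for illustration."""
    for x in points:
        va = eval(expr_a, {var: x})
        vb = eval(expr_b, {var: x})
        if abs(va - vb) > 1e-9:
            return False
    return True
```

`a**2 - 4` and `(a+2)*(a-2)` agree at every point even though the strings differ, which is exactly the case a plain string-match verifier gets wrong.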

vision data

  • real world: chart understanding, science question, graphical comprehension
  • synthetic visual reasoning : clevr-like?
  • text rendered data : text / code / structured data

long2short

  • model merging : merge with the long-CoT model
  • shortest rejection sampling : the shortest of the correct answers from 8 rejection samples is used as the SFT dataset.
  • DPO : (chosen) short correct answer (rejected) long wrong answers and long correct answers
  • long2short RL : apply the length penalty above
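The shortest-rejection-sampling and DPO-pair steps above can be sketched together. The 1.5x length cutoff for "correct but too long" rejected answers is an assumption; the note does not give a threshold.

```python
def long2short_pairs(samples):
    """Given k (text, correct) rollouts for one prompt, build:
    - an SFT target: the shortest correct answer,
    - DPO pairs: chosen = shortest correct answer; rejected =
      wrong answers plus correct answers much longer than chosen.
    Returns (sft_target, [(chosen, rejected), ...])."""
    correct = sorted((t for t, ok in samples if ok), key=len)
    if not correct:
        return None, []
    chosen = correct[0]
    rejected = [t for t, ok in samples if not ok]
    # assumed cutoff: correct answers >1.5x the chosen length are rejected
    rejected += [t for t in correct[1:] if len(t) > 1.5 * len(chosen)]
    return chosen, [(chosen, r) for r in rejected]
```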

other training

  • pretraining : Vision PT up to singularity. I need to organize this later…
  • vanilla SFT: text 1M / Vision 1M
    • seq len 23K

RL infrastructure

  • Omit

Result

  • main results : (figures)
  • ablations : (figure)