
paper

TL;DR

  • I read this because : it is mentioned alongside DeepSeek-R1, and it covers multimodal reasoning.
  • task : reasoning ability in LLMs
  • problem : we want to train an LLM that does long reasoning, but value functions / PRMs / MCTS are too complicated.
  • idea : apply RLOO, make the reward verifiable, and curate good prompts. Train with long CoT first, then distil into short CoT.
  • input/output : {q, (optional) image} -> a
  • architecture : (proposed) Kimi k1.5. The architecture and size are supposedly in a previous paper… I can’t find it.
  • objective : (pretraining, sft) ce loss -> (rl) RLOO loss with offline samples
  • baseline : OpenAI o1, OpenAI o1-mini, QVQ-72B Preview, QwQ-32B Preview
  • data : (all proposed, not open) (PT) ?? (SFT) 1M SFT, 1M VLM SFT (CoT SFT) ? (RL) diverse / non-hackable prompts
  • evaluation : AIME, MATH500, Codeforces, LiveCodeBench v5, Mathvista, MMMU
  • result : better performance than the OpenAI models for both long and short CoT
  • contribution : a more detailed account of the prompt-refinement process, etc., than in R1.
  • etc. : (comment by Sunghyun)

Details

thumbnail

  • long CoT : (figure)

  • short CoT : (figure)

RL prompt set curation

They say they curated high-quality prompts by focusing on three things:

  • diverse coverage : STEM, code, general reasoning -> each dataset was tagged with domain / discipline and a balanced selection was chosen.
  • balanced difficulty : a balanced mix of easy, moderate, and difficult prompts -> model-based: difficulty is measured by the pass rate over 10 samples.
  • accurate evaluability : the verifier must be able to evaluate objectively and reliably, without relying on superficial patterns or random guesses -> multiple-choice, true/false, and proof-based questions are removed to prevent reward hacking -> to eliminate ‘easy-to-hack’ prompts, questions that the model answered correctly within 8 attempts without any CoT reasoning step were removed.
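The two model-based checks above (difficulty via pass rate, easy-to-hack via no-CoT attempts) can be sketched roughly like this. `sample_fn(prompt, n, use_cot)` is an assumed interface returning `n` sampled answers; the field names are hypothetical.

```python
def pass_rate(model_answers, gold):
    """Fraction of sampled answers that match the gold answer."""
    return sum(a == gold for a in model_answers) / len(model_answers)

def filter_prompts(prompts, sample_fn, n_difficulty=10, n_nocot=8):
    """Sketch of the note's two model-based checks:
    - difficulty = pass rate over n_difficulty CoT samples,
    - drop 'easy-to-hack' prompts the model answers correctly
      without any CoT within n_nocot attempts."""
    kept = []
    for p in prompts:
        # difficulty bucket from the pass rate with CoT
        rate = pass_rate(sample_fn(p["q"], n_difficulty, use_cot=True), p["a"])
        # easy-to-hack: answered correctly with no CoT at all -> drop
        no_cot = sample_fn(p["q"], n_nocot, use_cot=False)
        if p["a"] in no_cot:
            continue
        kept.append({**p, "difficulty": rate})
    return kept
```

In practice the kept prompts would then be bucketed by `difficulty` to get the balanced easy/moderate/hard mix.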

Long-CoT SFT

  • Data that allows for planning / evaluation / reflection / exploration.
  • Said to be generated by prompting a model (rejection sampling). It is unclear which model was used or how much data there was.

Reinforcement Learning

problem setting

  • They liken it to a planning algorithm, but what is actually learned is a similar yet ultimately flattened sequence of reasoning steps (maybe it’s because I don’t have an RL background, but it just seems wordy…)
  • The objective is to maximize
    • $\max_\theta \; \mathbb{E}_{(x, y^*) \sim \mathcal{D},\, (y, z) \sim \pi_\theta} \left[ r(x, y, y^*) \right]$
  • where $z$ is the sequence of reasoning steps and $r$ is the verifiable reward in $\{0, 1\}$
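Since $r$ is a binary verifiable reward, the verifier can be as simple as extracting the final answer and comparing it to the gold answer. A minimal sketch; the `Final answer:` marker is an assumed output convention, not from the paper.

```python
def verifiable_reward(response, gold):
    """Binary verifiable reward r in {0, 1}: 1 iff the final answer
    extracted from the response matches the gold answer exactly.
    'Final answer:' is an assumed formatting convention."""
    marker = "Final answer:"
    if marker not in response:
        return 0
    final = response.split(marker)[-1].strip()
    return int(final == gold.strip())
```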

policy optimization

They borrow the online policy mirror descent algorithm (https://www.ijcai.org/proceedings/2019/0434.pdf , https://github.com/manantomar/Mirror-Descent-Policy-Optimization). I looked it up a bit: like PPO it is optimization under constraints, but the update rule seems to be something other than plain gradient descent.

(figure: the regularized policy-optimization objective)

I feel like I’ve seen this form a lot in TRPO; I’ll have to go through it carefully later.

(figure: the resulting policy gradient)

In conclusion, the gradient looks like the above. It is similar to the policy gradient, except that the baseline is the mean of the sampled rewards (https://github.com/long8v/PTIR/issues/215#issuecomment-2608698801). Another difference is that the original algorithm uses online rollouts, whereas here the rollouts come from the reference model. As for why deleting the value network itself was beneficial: with a value network, when learning a long reasoning path, the intermediate steps of a negative CoT immediately receive a negative advantage. Since exploration over diverse reasoning paths is also needed, it seems it was more important to roll out all the way to the end, RLOO-style (!!).
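The "baseline is the mean of sampled rewards" idea above can be sketched as a leave-one-out advantage over $k$ rollouts of the same prompt, with no value network involved:

```python
def rloo_advantages(rewards):
    """Leave-one-out advantage for each of k sampled rollouts:
    A_i = r_i - mean of the other k-1 rewards. With the plain
    mean-of-all-samples baseline mentioned above, drop the
    leave-one-out correction; the shape is the same."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]
```

Correct rollouts get a positive advantage, wrong ones a negative advantage, and the advantages sum to zero for each prompt.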

length penalty

To prevent overthinking

(figure: length-penalty formula)
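A minimal sketch of a length penalty of this shape: shorter correct answers get a bonus, longer ones are penalized, and wrong answers are never rewarded for being short. The 0.5 offset and linear scaling here are assumptions for illustration, not taken from the note.

```python
def length_reward(length, correct, min_len, max_len):
    """Illustrative length penalty: lambda interpolates linearly
    from +0.5 (shortest sampled answer) to -0.5 (longest).
    Correct answers get lambda; wrong answers get min(0, lambda),
    so a wrong answer can be penalized but never rewarded."""
    lam = 0.5 - (length - min_len) / (max_len - min_len)
    return lam if correct else min(0.0, lam)
```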

sampling

  • curriculum sampling : train from low to high difficulty.
  • prioritized sampling : problems where the model currently performs worst are sampled more often.
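The prioritized-sampling idea can be sketched by weighting each problem by its failure rate; the exact proportional weighting here is an assumption.

```python
import random

def prioritized_sample(problems, success_rates, k, rng=None):
    """Sample k problems with probability proportional to
    (1 - success rate), so the model revisits its weakest
    problems most often."""
    rng = rng or random.Random(0)
    weights = [1.0 - success_rates[p] for p in problems]
    return rng.choices(problems, weights=weights, k=k)
```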

Reward Modeling for math

They use an RM for math because answers like $a^2-4$ versus $(a+2)(a-2)$ are hard to match with exact comparison. They use what is called a CoT RM (#211), a reward model trained on 800K CoT-labeled examples. Accuracy improved from 84.4 to 98.5 compared to a classic RM, so the CoT RM was adopted.
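To see why exact string match fails on answers like these, here is a numeric spot-check that treats two answer strings as functions of one variable and evaluates them at a few points. This only illustrates the motivation for a learned verifier; it is not the paper's method.

```python
def answers_equivalent(expr_a, expr_b, var="a",
                       points=(-3.5, -1.0, 0.0, 2.0, 7.25)):
    """Numeric spot-check: do two answer strings denote the same
    function of `var`? Evaluates both at a few sample points.
    eval() is used on trusted strings only, for illustration."""
    for x in points:
        va = eval(expr_a, {var: x})
        vb = eval(expr_b, {var: x})
        if abs(va - vb) > 1e-9:
            return False
    return True
```

`a**2 - 4` and `(a+2)*(a-2)` agree at every point even though the strings differ, which is exactly the case a plain string-match verifier gets wrong.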

vision data

  • real world: chart understanding, science question, graphical comprehension
  • synthetic visual reasoning : clevr-like?
  • text rendered data : text / code / structured data

long2short

  • model merging : merge with the long-CoT model
  • shortest rejection sampling : the shortest of the correct answers from 8 rejection samples is used as the SFT dataset.
  • DPO : (chosen) short correct answer (rejected) long wrong answers and long correct answers
  • long2short RL : apply the length penalty above
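The shortest-rejection-sampling and DPO-pair steps above can be sketched together. The 1.5x length cutoff for "correct but too long" rejected answers is an assumption; the note does not give a threshold.

```python
def long2short_pairs(samples):
    """Given k (text, correct) rollouts for one prompt, build:
    - an SFT target: the shortest correct answer,
    - DPO pairs: chosen = shortest correct answer; rejected =
      wrong answers plus correct answers much longer than chosen.
    Returns (sft_target, [(chosen, rejected), ...])."""
    correct = sorted((t for t, ok in samples if ok), key=len)
    if not correct:
        return None, []
    chosen = correct[0]
    rejected = [t for t, ok in samples if not ok]
    # assumed cutoff: correct answers >1.5x the chosen length are rejected
    rejected += [t for t in correct[1:] if len(t) > 1.5 * len(chosen)]
    return chosen, [(chosen, r) for r in rejected]
```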

other training

  • pretraining : Vision PT up to singularity. I need to organize this later…
  • vanilla SFT: text 1M / Vision 1M
    • seq len 23K

RL infrastructure

  • Omit

Result

  • main results : (figures)
  • ablations : (figure)