contribution : Especially in the RL phase, the paper does a good job of summarizing the various trials, errors, and lessons learned. Like Kimi, I like RL. The VLM-specific problems are also well summarized. But why does the RL section report no training step count?
data (eventually 50M)
image captions, interleaved data, OCR data, grounding data, video data, instruction-tuning data
Video data
corpus drawn from academic, web, and proprietary sources
Developed a pipeline to annotate complex actions or in-scene text with fine-grained human annotation, since standard captions are prone to hallucinations and omissions (sounds like they are not captioning, they are annotating something else?)
Deeper visual understanding: annotate cinematic elements such as camera motion or shot composition with a human-in-the-loop workflow.
Applied no matter how long the video is.
training
multimodal pre-training (seq len 8192, 120K steps) -> long-context continual pre-training (seq len 32K, 10K global steps, 1.5 bs)
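For reference, the two-stage schedule above can be written down as a config sketch. The stage names are my own; only the numbers come from the notes, and 32K is assumed to mean 32768 tokens:

```python
# Hypothetical config for the two-stage pre-training schedule described above.
# Stage names are illustrative; only seq lengths and step counts are from the paper.
PRETRAIN_STAGES = [
    {"stage": "multimodal_pretrain",    "seq_len": 8192,   "steps": 120_000},
    {"stage": "long_context_continual", "seq_len": 32_768, "steps": 10_000},
]

for prev, nxt in zip(PRETRAIN_STAGES, PRETRAIN_STAGES[1:]):
    # the long-context stage must extend, not shrink, the context window
    assert nxt["seq_len"] > prev["seq_len"]
```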
A combination of RLVR and model-based rewards (RLHF)
Final-answer extraction in RLVR: extracting with an LLM can be unreliable when the thinking is long, so the answer is parsed from <|begin_of_box|>{FINAL_ANSWER}<|end_of_box|>. \boxed{} was also difficult as the final answers became longer.
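With sentinel tokens like these, extraction reduces to a single regex instead of an LLM call. A minimal sketch (the helper name is mine; the token strings are from the notes):

```python
import re

# The final answer is wrapped in <|begin_of_box|> ... <|end_of_box|>, so a
# regex suffices even when the preceding chain-of-thought is very long.
BOX_RE = re.compile(r"<\|begin_of_box\|>(.*?)<\|end_of_box\|>", re.DOTALL)

def extract_final_answer(rollout: str):
    """Return the last boxed answer in a rollout, or None if absent."""
    matches = BOX_RE.findall(rollout)
    return matches[-1].strip() if matches else None
```

Taking the last match guards against the model emitting intermediate boxed spans during its thinking.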
Reward shaping is hard per domain…
algorithm
GRPO
no KL term (KL tended to rise faster than in text-only RL, but adding a KL term limited performance), no entropy bonus, clip-higher, larger batch size
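The resulting objective is roughly GRPO with group-normalized advantages, an asymmetric "clip higher" ratio, and no KL or entropy terms. A minimal numpy sketch under those assumptions (the epsilon values here are the DAPO-style defaults, not the paper's):

```python
import numpy as np

def grpo_loss(logp_new, logp_old, rewards, eps_low=0.2, eps_high=0.28):
    """Sketch of the GRPO variant described: group-relative advantages,
    asymmetric clipping, no KL penalty, no entropy bonus.

    logp_new, logp_old, rewards: arrays of shape (group_size,).
    eps_high > eps_low widens the upper clip so low-probability tokens
    can still be pushed up (the 'clip higher' trick)."""
    # advantage = reward normalized within the rollout group
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high) * adv
    # no KL term added here, per the notes above
    return -np.minimum(unclipped, clipped).mean()
```

When the policy has not moved (ratio = 1), the loss is the negative mean of the group-normalized advantages, i.e. zero, which is the expected GRPO baseline behavior.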
training recipe
RL with Curriculum Sampling (RLCS): a dynamic-sampling extension using a ratio EMA, with no KL or entropy loss
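The paper does not spell out the sampler, but the idea can be sketched as: track an EMA of each prompt's solve rate and sample toward intermediate difficulty, since all-correct or all-wrong rollout groups give zero GRPO advantage. Everything below (class name, weighting function) is my own illustration:

```python
import random

class CurriculumSampler:
    """Hypothetical sketch of curriculum sampling via a solve-rate EMA.
    Prompts the model always solves (or never solves) get down-weighted,
    keeping rollout groups informative for GRPO."""

    def __init__(self, prompts, alpha=0.1):
        self.prompts = list(prompts)
        self.alpha = alpha                                # EMA update rate
        self.solve_rate = {p: 0.5 for p in self.prompts}  # neutral init

    def update(self, prompt, group_rewards):
        # group_rewards: 0/1 outcomes from one rollout group for this prompt
        rate = sum(group_rewards) / len(group_rewards)
        ema = self.solve_rate[prompt]
        self.solve_rate[prompt] = (1 - self.alpha) * ema + self.alpha * rate

    def sample(self):
        # weight peaks at solve rate 0.5 and vanishes near 0 and 1
        weights = [self.solve_rate[p] * (1 - self.solve_rate[p]) + 1e-3
                   for p in self.prompts]
        return random.choices(self.prompts, weights=weights, k=1)[0]
```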
lesson learned
we discover that when training a unified VLM across diverse skills, any weakness in the reward signal for a single capability can derail the entire training (figure 5)
This is the funny part: when multiple domains are combined, if any single reward can be hacked, the model performs poorly across the board.
That's why they say: run RL for each domain separately first -> check rollouts to see whether rewards are being assigned correctly, and so on.
A coarse or incomplete reward design can lead the model to discover shortcuts for boosting its reward rather than truly improving its task performance.
For example, with an LLM-as-a-judge reward (the RLHF part here) on a counting task, the model sometimes rolls out responses like "The correct answer is a number between 1 and 10"… lol
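One way to close that loophole is to make the counting reward verifiable rather than judged: only a single bare integer earns reward. A sketch of such a stricter verifier (function name and exact rules are my own):

```python
import re

def counting_reward(answer: str, gold: int) -> float:
    """Hypothetical strict verifier for a counting task: reward 1.0 only for
    a single bare integer equal to the gold count. Hedged outputs like
    'a number between 1 and 10' fail to parse and score 0 instead of
    gaming an LLM judge."""
    m = re.fullmatch(r"\s*(\d+)\s*", answer)
    if m is None:  # prose, ranges, or multiple numbers do not parse
        return 0.0
    return 1.0 if int(m.group(1)) == gold else 0.0
```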
The peak performance in the RL phase does not perfectly correlate with a cold-start SFT model’s performance.
Domain interference in RL is less pronounced than in SFT.
In terms of evaluation: MMVU and VideoMMMU, which are closer to academic tasks, gain about 4 and 6 points respectively when thinking is enabled. For LVBench and MVBench, which are long-video tasks, direct evaluation performs better, and VideoMME shows little improvement. Among the image benchmarks, STEM-type tasks still do well, but general VQA not so much.
Mostly used vLLM for evaluation, but used sglang for video inference
Vision token max is 6K for images and 48K for video.
GPT-4o is used wherever an API is needed (e.g., answer parsing); other models are evaluated the same way.
Effect of RL on cross-domain performance - better when all domains are mixed together