
paper, code

TL;DR

  • I read this because.. : It is a relatively(?) new MLLM whose authors clearly put a lot of work into RL, and an acquaintance recommended it.
  • task : MLLM with thinking
  • problem : building a strong MLLM
  • idea : multi-modal pretraining -> cold start SFT -> RL (RLVR + RLHF). Collect data aggressively at every stage.
  • input/output : {image, video, question} -> answer
  • architecture : VE (AIMv2-Huge), LLM (GLM-4-9B-0414, GLM-4.5-Air)
  • objective : CE loss -> GRPO loss
  • baseline : Qwen2.5-VL, Kimi-VL, InternVL3, GPT-4o, Gemini…
  • data : pretraining data 50M -> SFT (??) -> RL (??)
  • evaluation : (image) general VQA, STEM, OCR-chart, long-doc, grounding, GUI agent, coding(Design2Code, Flame-React-Eval), (video) VideoMME, MMVU, LVBench, MotionBench
  • result : SOTA among models of comparable size
  • contribution : The RL phase in particular is well documented, with the various trials, errors, and lessons learned summarized clearly. As with Kimi, I like that the focus is on RL. The typical problems of VLMs are also well summarized. But why is the number of RL training steps not reported?
  • etc. :

Details

architecture

  • VE (AIMv2-Huge), LLM (GLM-4-9B-0414, GLM-4.5-Air)
  • 3D conv (w, h, temporal) with temporal stride 2; a single image is duplicated into two copies stacked along the temporal axis
  • 2D-RoPE in the ViT, plus the original absolute PE (resized with bicubic interpolation)
  • In the LLM, RoPE is replaced with 3D-RoPE for spatial understanding (w, h, nth token?)
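
The single-image handling above can be sketched as follows. This is only an illustration, assuming the temporal stride of 2 stated in the notes. The function names `pad_to_temporal_stride` and `group_temporal` are hypothetical, frames are plain Python objects rather than tensors, and padding an odd-length video by repeating its last frame is my assumption, not something the notes confirm.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")

TEMPORAL_STRIDE = 2  # from the notes: the 3D conv has temporal stride 2


def pad_to_temporal_stride(frames: Sequence[Frame],
                           stride: int = TEMPORAL_STRIDE) -> List[Frame]:
    """Repeat the last frame until the frame count is a multiple of `stride`.

    A single image (one frame) therefore becomes two identical frames.
    Repeating the last frame for odd-length videos is an assumption.
    """
    if not frames:
        raise ValueError("need at least one frame")
    padded = list(frames)
    while len(padded) % stride != 0:
        padded.append(padded[-1])
    return padded


def group_temporal(frames: Sequence[Frame],
                   stride: int = TEMPORAL_STRIDE) -> List[List[Frame]]:
    """Split frames into the consecutive chunks one temporal conv step consumes."""
    padded = pad_to_temporal_stride(frames, stride)
    return [padded[i:i + stride] for i in range(0, len(padded), stride)]


print(group_temporal(["img"]))             # [['img', 'img']]
print(group_temporal(["f0", "f1", "f2"]))  # [['f0', 'f1'], ['f2', 'f2']]
```

The point is that images and videos go through the same 3D patch embedding, so no separate 2D path is needed for still images.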

pretraining stage

  • data (eventually 50M): image captions, interleaved image-text, OCR data, grounding data, video data, instruction-tuning data.
    • Video data
      • corpus from academic, web, and proprietary sources
      • Developed a pipeline that annotates complex actions and in-scene text with fine-grained human annotation, since standard captions are prone to hallucinations and omissions (sounds like they are annotating something other than captions?)
      • For deeper visual understanding, cinematic elements such as camera motion and shot composition are annotated with a human-in-the-loop workflow.
      • No matter how long the video is
  • training
    • multimodal pre-training (seq len 8192, 120K steps) -> long-context continual pre-training (seq len 32K, 10K steps, global batch size 1.5K)

SFT stage

  • high-quality CoT reasoning examples
    • mostly Chinese and English
    • filtered with the pretrained model (samples that are too easy or excessively hard are dropped)
  • Strict data filtering and iterative data enhancement (run RL, then re-create the cold-start data with that model…)
  • training
    • full param tuning
    • seq len 32K
    • global bs 32
    • also includes high-quality text-only long-form SFT data.
  • GLM-4.5V is a hybrid think / no-think model: it is trained so that when /nothink is put in the user prompt, the thinking content is left empty.
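
The "too easy or excessively hard" filter in the data bullets above could look roughly like this. It is a sketch under assumptions: the notes only say the pretrained model is used for filtering, so the pass-rate criterion, the thresholds `low` / `high`, the number of tries, and the names `pass_rate`, `filter_by_difficulty`, and `toy_solver` are all hypothetical.

```python
from typing import Callable, Dict, List

Sample = Dict[str, str]


def pass_rate(sample: Sample, solve: Callable[[Sample], str], n_tries: int = 8) -> float:
    """Fraction of attempts in which the model's answer matches the reference."""
    correct = sum(solve(sample) == sample["answer"] for _ in range(n_tries))
    return correct / n_tries


def filter_by_difficulty(samples: List[Sample], solve: Callable[[Sample], str],
                         low: float = 0.1, high: float = 0.9,
                         n_tries: int = 8) -> List[Sample]:
    """Keep samples whose pass rate is neither near 0 (too hard) nor near 1 (too easy)."""
    return [s for s in samples if low <= pass_rate(s, solve, n_tries) <= high]


def toy_solver(success_every: Dict[str, int]) -> Callable[[Sample], str]:
    """Deterministic stand-in for a model: correct on every k-th call, never if k == 0."""
    calls: Dict[str, int] = {}

    def solve(sample: Sample) -> str:
        qid = sample["id"]
        calls[qid] = calls.get(qid, 0) + 1
        k = success_every[qid]
        return sample["answer"] if k and calls[qid] % k == 0 else "wrong"

    return solve


samples = [
    {"id": "easy", "answer": "4"},
    {"id": "medium", "answer": "12"},
    {"id": "hard", "answer": "7"},
]
solver = toy_solver({"easy": 1, "medium": 2, "hard": 0})
kept = filter_by_difficulty(samples, solver)
print([s["id"] for s in kept])  # ['medium']
```

In practice `solve` would be a sampled generation from the pretrained model followed by answer checking; the toy solver only makes the example deterministic.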

RL stage

  • reward
    • A combination of RLVR and model-based rewards (RLHF)
    • Final-answer extraction in RLVR : extracting the answer with an LLM becomes unreliable when the thinking is long, so the model wraps its answer as <|begin_of_box|>{FINAL_ANSWER}<|end_of_box|> and that span is parsed. \boxed{} was also hard to parse as the final answers became longer.
    • Rewards are painstakingly shaped per domain…
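
A minimal sketch of the box-token parsing described in the reward bullets above. The two special tokens come from the notes; everything else is an assumption for illustration: the function names, the rule that a response must contain exactly one box, and the plain string-equality check (a real verifier would be domain-specific).

```python
import re
from typing import Optional

# Special tokens as written in the notes; DOTALL lets an answer span several lines.
BOX_PATTERN = re.compile(r"<\|begin_of_box\|>(.*?)<\|end_of_box\|>", re.DOTALL)


def extract_boxed_answer(response: str) -> Optional[str]:
    """Return the boxed answer, or None unless exactly one box is present.

    Requiring exactly one box is an assumption made for this sketch.
    """
    matches = BOX_PATTERN.findall(response)
    if len(matches) != 1:
        return None
    return matches[0].strip()


def exact_match_reward(response: str, reference: str) -> float:
    """Hypothetical verifiable reward: 1.0 on exact string match, else 0.0."""
    answer = extract_boxed_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if answer == reference.strip() else 0.0


good = "<think>count the apples...</think> There are <|begin_of_box|>7<|end_of_box|> apples."
vague = "The correct answer is a number between 1 and 10"
print(exact_match_reward(good, "7"))   # 1.0
print(exact_match_reward(vague, "7"))  # 0.0
```

Because the reward only looks inside the box, a long thinking trace or a vague unboxed sentence cannot earn credit by itself.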
  • algorithm
    • GRPO
    • no KL loss (KL tended to rise faster than in text-only RL, but adding a KL term limited performance), no entropy bonus, clip-higher, larger batch size
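
To make the algorithm bullets concrete, here is a scalar sketch of GRPO-style group-normalized advantages and a clip-higher surrogate with no KL or entropy term. It follows the commonly published form of GRPO and asymmetric clipping, not the authors' code; the values `eps_low=0.2` and `eps_high=0.28` are illustrative assumptions, not numbers from the notes.

```python
import math
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: normalize each reward within its rollout group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_objective(ratio: float, advantage: float,
                      eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """Per-token surrogate with an asymmetric clip range ("clip-higher").

    No KL penalty and no entropy bonus are added, matching the bullets above.
    The epsilon values are illustrative assumptions.
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)


# Four rollouts for one prompt, two correct and two wrong.
print(group_advantages([1.0, 0.0, 0.0, 1.0]))
# A ratio of 1.25 passes through unchanged here, while a symmetric 0.2 clip would cap it at 1.2.
print(clipped_objective(1.25, 1.0))
```

The larger upper bound gives low-probability tokens with positive advantage more room to grow, which is the usual motivation for clip-higher.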
  • training recipe
    • RL with Curriculum Sampling (RLCS), dynamic sampling expansion with ratio EMA, no KL or entropy loss
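
The notes do not spell out how the ratio EMA is used, so the following is one plausible reading, not the authors' recipe: prompts whose rollouts all receive the same reward carry no GRPO signal, so the sampler tracks an EMA of that "uninformative" rate and oversamples prompts by `1 / (1 - rate)` to keep the effective batch full. The class name, `decay`, and `max_ratio` are hypothetical.

```python
from typing import List


def is_informative(group_rewards: List[float]) -> bool:
    """A rollout group gives gradient signal only if its rewards are not all equal."""
    return max(group_rewards) != min(group_rewards)


class ExpansionRatioEMA:
    """Tracks an EMA of the uninformative-group rate and derives an oversampling ratio."""

    def __init__(self, decay: float = 0.9, max_ratio: float = 4.0) -> None:
        self.decay = decay
        self.max_ratio = max_ratio
        self.invalid_rate = 0.0  # EMA of the fraction of uninformative groups

    def update(self, groups: List[List[float]]) -> None:
        """Fold the uninformative fraction of the latest batch into the EMA."""
        invalid = sum(not is_informative(g) for g in groups) / len(groups)
        self.invalid_rate = self.decay * self.invalid_rate + (1.0 - self.decay) * invalid

    def ratio(self) -> float:
        """How many times more prompts to sample than the target batch size."""
        if self.invalid_rate >= 1.0 - 1e-9:
            return self.max_ratio
        return min(self.max_ratio, 1.0 / (1.0 - self.invalid_rate))


tracker = ExpansionRatioEMA(decay=0.5)
# Two of four groups are all-correct or all-wrong, so the batch invalid rate is 0.5.
tracker.update([[1.0, 1.0], [0.0, 1.0], [0.0, 0.0], [1.0, 0.0]])
print(tracker.invalid_rate)  # 0.25
print(tracker.ratio())
```

With a target of, say, 256 informative prompts, the sampler would draw about `256 * tracker.ratio()` prompts for the next step and discard the uninformative groups.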
  • lesson learned
    • “We discover that when training a unified VLM across diverse skills, any weakness in the reward signal for a single capability can derail the entire training” (figure 5)
      • This is the interesting part: when multiple domains are trained together, a hackable reward in any one of them makes the model perform poorly across the board.
      • Hence the advice: run RL on each domain separately first, inspect the rollouts to check that the reward behaves as intended, and so on.
    • A coarse or incomplete reward design can lead the model to discover shortcuts for boosting its reward rather than truly improving its task performance.
      • For example, with an LLM-as-a-judge reward (the RLHF part here) on a counting task, a rollout sometimes says “The correct answer is a number between 1 and 10”… lol
    • The peak performance in the RL phase does not perfectly correlate with a cold-start SFT model’s performance.
    • Domain interference in RL is less pronounced than in SFT.

result

  • On the video benchmarks, MMVU and VideoMMMU, which are closer to academic reasoning, gain about 4 and 6 points respectively when thinking is enabled. For LVBench and MVBench, which are long-video tasks, direct (no-think) evaluation performs better, and VideoMME shows little improvement. Among the image benchmarks, STEM-type tasks still do well, but general VQA is not as good.
  • vLLM was used for most of the evaluation, with SGLang for video inference.
  • Max vision tokens: 6K for images, 48K for video.
  • GPT-4o is used wherever an API model is needed, e.g. for answer parsing. Other models are evaluated under the same settings.
  • Effect of RL on cross-domain performance - better when all domains are mixed together
  • Nearly SOTA among models under 10B.