Image

paper, code

TL;DR

  • I read this because: an LVLM reporting AIME results
  • task : multimodal understanding, reasoning, GUI grounding
  • problem : building a single model that is strong at both reasoning and understanding
  • idea : massive pretrain + rl
  • input/output : {(image, video), prompt} → answer or action
  • architecture : Qwen2.5-ViT + MLP projector + MiMo-7B
  • objective : CE loss(pretrain, SFT), on-policy GRPO(RL)
  • baseline : Qwen2.5-VL, InternVL3, GPT-4o, UI-TARS, etc.
  • data : 2.4T tokens for pretraining (caption, interleaved, OCR, grounding, GUI, synthetic reasoning)
  • evaluation : 50+ benchmarks (MMMU, OlympiadBench, GUI, etc.)
  • result : outperforms baselines on most benchmarks; highest Elo among open-source models
  • contribution :
  • etc. : raises inter-task interference issues; handles long context

Details

architecture

Image
  • "strong reasoning abilities inherent in MiMo-7B-Base (Xiaomi, 2025), enabling their seamless transfer and adaptation to multimodal contexts. Consequently, this allows our model to exhibit powerful and versatile multimodal reasoning capabilities across a broad array of domains." → the base is already MiMo, so its reasoning ability carries over to the multimodal model.

training

Image

Context length is raised to 32K in stage 4

pre-training

Trains on 2.4T tokens

  • implement dedicated data curation pipelines tailored to the characteristics of each data type
  • data
    • Image caption data : dedup (image perceptual hashing), text filtering, re-captioning (with linguistic-consistency and repetition filtering), imbalance filtering (with MetaCLIP)
    • interleaved data: webpages, books, academic papers, pdf parsing, knowledge density and readability filtering, …
    • OCR : to increase learning difficulty, includes handwritten, typographically deformed, blurry, and occluded text, with bounding-box annotations
    • grounding data: single / multiple objects
    • video : publicly available sources; re-captioning with dense, fine-grained descriptions; event-distribution balancing; challenging questions about the video with synthesized responses
    • GUI data : open source & synthetic. grounding, action
  • synthetic reasoning : images and prompts are collected from open sources, and MiMo-7B-Base is used for generation. Strict filtering evaluates not only answer correctness but also clarity of thought, removes redundancy, and ensures consistent formatting.
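The caption dedup step above can be sketched with a perceptual hash. The paper only says "image perceptual hashing"; the average-hash variant, the `dedup` helper, and the distance threshold below are my illustrative choices, not the paper's implementation.

```python
# Perceptual-hash dedup sketch for caption data (assumed details:
# average hash on pre-resized grayscale images, Hamming-distance threshold).

def average_hash(gray):
    """gray: 2D list of grayscale values, already resized to a small grid."""
    flat = [p for row in gray for p in row]
    mean = sum(flat) / len(flat)
    # 1 bit per pixel: brighter or darker than the image's mean
    return tuple(1 if p > mean else 0 for p in flat)

def hamming(h1, h2):
    """Number of differing bits between two hashes."""
    return sum(a != b for a, b in zip(h1, h2))

def dedup(images, max_dist=5):
    """Keep one representative per cluster of near-duplicate images."""
    kept, hashes = [], []
    for name, gray in images:
        h = average_hash(gray)
        if all(hamming(h, h2) > max_dist for h2 in hashes):
            kept.append(name)
            hashes.append(h)
    return kept
```

A near-duplicate (same pixels shifted by a few brightness levels) hashes identically and is dropped; a structurally different image survives.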

post-training

Image

  • RLVR
    • data
      • visual reasoning : open source & K-12 collections
        • LLM is prompted to filter proof-based problems and rewrite multiple-choice questions with numerical or symbolic answers into free-answer formats, alleviating potential reward hacking
      • text reasoning : includes more challenging queries requiring college- or competition-level intelligence (vs. the K-12 level of the visual reasoning data)
  • Image grounding : reward is the GIoU of the predicted box, or whether the predicted point falls inside the ground-truth box
    • Temporal video grounding : video moment retrieval [mm:ss, mm:ss], rewarded by IoU
  • RLHF
  • mixed on-policy RL
  • RLVR + RLHF combined to learn with GRPO
  • The difference from standard GRPO is that training is fully on-policy (weights are reloaded into vLLM after every update), so the clipping and importance-sampling terms drop out
    • Image
    • Image
    • c.f. GRPO
    • Image
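As I understand it, the objectives compare roughly as follows (a sketch with my own notation, not copied from the paper; $r_i$ are rule-based rewards for the $G$ rollouts of query $q$):

```latex
% Standard GRPO: group-normalized advantage with a clipped importance ratio
A_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)},
\qquad
\rho_{i,t} = \frac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q, o_{i,<t})}

J_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
\min\!\big(\rho_{i,t}A_i,\; \operatorname{clip}(\rho_{i,t}, 1-\varepsilon, 1+\varepsilon)\,A_i\big)\right]

% Fully on-policy: \theta_{old} = \theta, so \rho_{i,t} = 1 and clipping is a
% no-op, leaving a REINFORCE-style surrogate with group-normalized advantages
J_{\mathrm{on\text{-}policy}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
\log \pi_\theta(o_{i,t}\mid q, o_{i,<t})\,A_i\right]
```

With $\theta_{\mathrm{old}} = \theta$ every step, the ratio is identically 1, so the min/clip structure contributes nothing and only the plain policy-gradient term remains.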

performance

Image

Gains during the RL phase are modest

Image

ablation

Image

Is it because there is no KL term?

Image

challenges

  • Interference between RL tasks : grounding rollouts (e.g., grounding trajectories) are short while reasoning rollouts are long, and the two objectives pull in opposite directions, so joint RL could not match the performance of running RL on each task separately.
Image

MiMo-7B-RL reports AIME24 going from 68.0 to 80.1 (max len 48K): https://huggingface.co/XiaomiMiMo/MiMo-7B-RL

(this paper reports 67 points)