Image

paper, code

TL;DR

  • I read this because: an LVLM reporting AIME results
  • task : multimodal understanding, reasoning, GUI grounding
  • problem : building a single model that is strong at both reasoning and understanding
  • idea : massive pretrain + rl
  • input/output : {(image, video), prompt} → answer or action
  • architecture : Qwen2.5-ViT + MLP projector + MiMo-7B
  • objective : CE loss(pretrain, SFT), on-policy GRPO(RL)
  • baseline : Qwen2.5-VL, InternVL3, GPT-4o, UI-TARS, etc.
  • data : 2.4T tokens for pretraining (caption, interleaved, OCR, grounding, GUI, synthetic reasoning)
  • evaluation : 50+ benchmarks (MMMU, OlympiadBench, GUI, etc.)
  • result : outperforms baselines on most benchmarks; highest Elo among open-source models
  • contribution :
  • etc. : raises inter-task interference issues; handles long context

Details

architecture

Image
  • "strong reasoning abilities inherent in MiMo-7B-Base (Xiaomi, 2025), enabling their seamless transfer and adaptation to multimodal contexts. Consequently, this allows our model to exhibit powerful and versatile multimodal reasoning capabilities across a broad array of domains." → the base is already MiMo, so its reasoning ability carries over to the multimodal model.

training

Image

Context length is raised to 32K in stage 4

pre-training

Trains on 2.4T tokens

  • implement dedicated data curation pipelines tailored to the characteristics of each data type
  • data
    • Image caption data : dedup (image perceptual hashing), text filtering, re-captioning (with linguistic-consistency and repetition filtering), imbalance filtering (with MetaCLIP)
    • interleaved data: webpages, books, academic papers, pdf parsing, knowledge density and readability filtering, …
    • OCR : to increase learning difficulty, includes handwritten, typographically deformed, blurry, and occluded text, with bounding-box annotations
    • grounding data: single / multiple objects
    • video : publicly available sources; re-captioning with dense, fine-grained descriptions; event-distribution balancing; challenging questions about the video with synthesized responses
    • GUI data : open source & synthetic. grounding, action
  • synthetic reasoning : images and prompts are collected from open sources, and MiMo-7B-Base is used for generation. Strict filtering evaluates not only answer correctness but also clarity of thought, removes redundancy, and ensures consistent formatting.
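The caption dedup step above can be sketched with a perceptual hash. The paper only says "image perceptual hashing"; the average-hash variant, the `dedup` helper, and the distance threshold below are my illustrative choices, not the paper's implementation.

```python
# Perceptual-hash dedup sketch for caption data (assumed details:
# average hash on pre-resized grayscale images, Hamming-distance threshold).

def average_hash(gray):
    """gray: 2D list of grayscale values, already resized to a small grid."""
    flat = [p for row in gray for p in row]
    mean = sum(flat) / len(flat)
    # 1 bit per pixel: brighter or darker than the image's mean
    return tuple(1 if p > mean else 0 for p in flat)

def hamming(h1, h2):
    """Number of differing bits between two hashes."""
    return sum(a != b for a, b in zip(h1, h2))

def dedup(images, max_dist=5):
    """Keep one representative per cluster of near-duplicate images."""
    kept, hashes = [], []
    for name, gray in images:
        h = average_hash(gray)
        if all(hamming(h, h2) > max_dist for h2 in hashes):
            kept.append(name)
            hashes.append(h)
    return kept
```

A near-duplicate (same pixels shifted by a few brightness levels) hashes identically and is dropped; a structurally different image survives.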

post-training

Image

  • RLVR
    • data
      • visual reasoning : open source & K-12 collections
        • LLM is prompted to filter proof-based problems and rewrite multiple-choice questions with numerical or symbolic answers into free-answer formats, alleviating potential reward hacking
      • text reasoning : includes more challenging queries requiring college- or competition-level intelligence (vs. the K-12 level of the visual reasoning data)
  • Image grounding : reward is the GIoU of the predicted box, or whether the predicted point falls inside the ground-truth box
    • Temporal video grounding : video moment retrieval [mm:ss, mm:ss], rewarded by IoU
  • RLHF
  • mixed on-policy RL
  • RLVR + RLHF combined to learn with GRPO
  • The difference from standard GRPO is that training is fully on-policy (weights are reloaded into vLLM after every update), so the clipping and importance-sampling terms drop out
    • Image
    • Image
    • c.f. GRPO
    • Image
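As I understand it, the objectives compare roughly as follows (a sketch with my own notation, not copied from the paper; $r_i$ are rule-based rewards for the $G$ rollouts of query $q$):

```latex
% Standard GRPO: group-normalized advantage with a clipped importance ratio
A_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)},
\qquad
\rho_{i,t} = \frac{\pi_\theta(o_{i,t}\mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q, o_{i,<t})}

J_{\mathrm{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
\min\!\big(\rho_{i,t}A_i,\; \operatorname{clip}(\rho_{i,t}, 1-\varepsilon, 1+\varepsilon)\,A_i\big)\right]

% Fully on-policy: \theta_{old} = \theta, so \rho_{i,t} = 1 and clipping is a
% no-op, leaving a REINFORCE-style surrogate with group-normalized advantages
J_{\mathrm{on\text{-}policy}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
\log \pi_\theta(o_{i,t}\mid q, o_{i,<t})\,A_i\right]
```

With $\theta_{\mathrm{old}} = \theta$ every step, the ratio is identically 1, so the min/clip structure contributes nothing and only the plain policy-gradient term remains.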

performance

Image

Gains during the RL phase are modest

Image

ablation

Image

Is it because there is no KL term?

Image

challenges

  • Interference between RL tasks : grounding rollouts (e.g., grounding trajectories) are short while reasoning rollouts are long, and the two objectives pull in opposite directions, so joint RL could not match the performance of running RL on each task separately.
Image

MiMo-7B-RL reports AIME24 going from 68.0 to 80.1 (max len 48K): https://huggingface.co/XiaomiMiMo/MiMo-7B-RL

(this paper reports 67 points)