Image

paper, code

TL;DR

  • I read this because: an LVLM that reports AIME results
  • task : multimodal understanding, reasoning, GUI grounding
  • problem : building a model that is strong at both reasoning and understanding
  • idea : large-scale pretraining + RL
  • input/output : {(image, video), prompt}→ answer or action
  • architecture : Qwen2.5-ViT + MLP projector + MiMo-7B
  • objective : CE loss(pretrain, SFT), on-policy GRPO(RL)
  • baseline : Qwen2.5-VL, InternVL3, GPT-4o, UI-TARS, etc.
  • data : 2.4T tokens for pretraining (caption, interleaved, OCR, grounding, GUI, synthetic reasoning)
  • evaluation : 50+ benchmarks (MMMU, OlympiadBench, GUI, etc.)
  • result : outperforms other models on most benchmarks; highest Elo among open-source models
  • contribution :
  • etc. : raises the issue of interference between RL tasks; can handle long context

Details

architecture

Image
  • strong reasoning abilities inherent in MiMo-7B-Base (Xiaomi, 2025), enabling their seamless transfer and adaptation to multimodal contexts. Consequently, this allows our model to exhibit powerful and versatile multimodal reasoning capabilities across a broad array of domains. → i.e., even though MiMo-7B is only the base model, it already comes with reasoning ability.

training

Image

Context length is raised to 32K in stage 4.

pre-training

Trained on 2.4T tokens.

  • implement dedicated data curation pipelines tailored to the characteristics of each data type
  • data
    • Image Caption Data : dedup (image perceptual hashing), text filtering, re-captioning (with linguistic-consistency, repetition, and imbalance filtering (w/ MetaCLIP))
    • interleaved data : webpages, books, academic papers; PDF parsing, knowledge-density and readability filtering, …
    • OCR : to increase learning difficulty, includes handwritten, typographically deformed, and blurry/occluded text, with bounding-box annotations
    • grounding data : single / multiple objects
    • video : publicly available; video re-captioning (dense, fine-grained, covering the event distribution); synthesize challenging questions about the video together with responses
    • GUI data : open source & synthetic; grounding, action
    • synthetic reasoning : images and prompts are collected from open sources, and MiMo-7B-Base is used for generation. Strict filtering evaluates not only answer correctness but also the clarity of the thought, removes redundancy, and enforces consistent formatting.
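The perceptual-hash dedup mentioned for the caption data can be sketched roughly as below. This is an illustrative average-hash on raw grayscale pixel grids, not the paper's actual pipeline (production systems typically use pHash/dHash on decoded images; the hash size and distance threshold here are assumptions):

```python
# Sketch of perceptual-hash dedup for caption data. Assumes images are
# already decoded to small grayscale pixel grids (e.g. 8x8); hash size
# and the Hamming-distance threshold are illustrative, not from the paper.

def average_hash(pixels):
    """1 bit per pixel: set if the pixel is brighter than the image mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)

def hamming(h1, h2):
    """Number of differing bits between two hashes."""
    return sum(a != b for a, b in zip(h1, h2))

def dedup(images, max_dist=2):
    """Keep only images whose hash is not near any already-kept hash."""
    kept, hashes = [], []
    for img in images:
        h = average_hash(img)
        if all(hamming(h, prev) > max_dist for prev in hashes):
            kept.append(img)
            hashes.append(h)
    return kept
```

Near-duplicates (a few changed pixels) collapse to almost identical hashes and get dropped, while visually distinct images survive.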

post-training

Image

  • RLVR
    • data
      • visual reasoning: open source & K-12 collection
        • LLM is prompted to filter proof-based problems and rewrite multiple-choice questions with numerical or symbolic answers into free-answer formats, alleviating potential reward hacking
      • text reasoning : includes queries that are more challenging (college or competition level) than the visual reasoning data, which is K-12 level
      • Image Grounding : reward via GIoU, or whether the predicted point falls inside the gt box
      • Temporal Video Grounding : video moment retrieval, answers in [mm:ss, mm:ss] format – IoU reward
  • RLHF
  • mixed on-policy RL
    • RLVR and RLHF data are combined and trained together with GRPO
    • Difference from standard GRPO: fully on-policy (vLLM is relaunched for every update), so there is no clipping or importance-sampling term
      • Image
      • Image
      • c.f. GRPO
      • Image
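The verifiable rewards described above (GIoU for box grounding, a point-in-box check, and interval IoU for temporal grounding) can be sketched as below. The exact reward shaping in the paper may differ; these are textbook definitions:

```python
# Hedged sketch of the RLVR grounding rewards: GIoU for boxes,
# point-in-box for point predictions, interval IoU for temporal spans.
# Formulas are the standard definitions; the paper's shaping may differ.

def giou(box_a, box_b):
    """Generalized IoU for (x1, y1, x2, y2) boxes, in [-1, 1]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Smallest box enclosing both; penalizes distant, non-overlapping boxes.
    c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (c_area - union) / c_area

def point_in_box(point, box):
    """Binary reward: does the predicted point land inside the gt box?"""
    x, y = point
    x1, y1, x2, y2 = box
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0

def temporal_iou(pred, gt):
    """IoU of two [start_sec, end_sec] intervals (parsed from mm:ss)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0
```

Unlike plain IoU, GIoU stays informative (negative but nonzero gradient) even when the predicted box does not overlap the target at all.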

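Because rollouts are re-sampled from the current policy every step, the importance ratio is identically 1, so clipping drops out and the objective reduces to the group-normalized advantage times the log-probability. A minimal sketch (function and variable names are illustrative, not from the paper's code; no KL term, matching the description above):

```python
# Minimal sketch of fully on-policy GRPO: fresh rollouts each update mean
# the importance ratio is 1, so there is no clipping and no ratio term.
# `rewards` are per-response scalars for one prompt's group of G rollouts.

def group_advantages(rewards, eps=1e-6):
    """GRPO advantage: normalize rewards within the group,
    A_i = (r_i - mean(r)) / (std(r) + eps)."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

def grpo_onpolicy_loss(logprobs, rewards):
    """logprobs[i] = summed token log-probs of response i under the
    current policy. Gradient ascent on E[A * log pi], returned negated
    as a loss; no clipping, no importance sampling, no KL penalty."""
    advs = group_advantages(rewards)
    return -sum(a * lp for a, lp in zip(advs, logprobs)) / len(rewards)
```

In a real trainer `logprobs` would be autograd tensors; plain floats are used here only to make the arithmetic visible.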
performance

Image

The RL stage did not yield large gains.

Image

ablation

Image

Possibly because there is no KL term?

Image

challenges

  • Interference Between RL Tasks: grounding trajectories are short while reasoning trajectories are long, so the two tasks push the policy in opposite directions, making it hard to match the performance of RL runs trained on each task alone.
Image

MiMo-7B-RL's AIME24 score is 68.0–80.1 (max len 48K): https://huggingface.co/XiaomiMiMo/MiMo-7B-RL

(this paper reports 67)