TL;DR
- I read this because.. : an LVLM that reports AIME results
- task : multimodal understanding, reasoning, GUI grounding
- problem : building a model that is strong at both reasoning and understanding
- idea : large-scale pretraining + RL
- input/output : {(image, video), prompt} → answer or action
- architecture : Qwen2.5-ViT + MLP projector + MiMo-7B
- objective : CE loss(pretrain, SFT), on-policy GRPO(RL)
- baseline : Qwen2.5-VL, InternVL3, GPT-4o, UI-TARS 등
- data : 2.4T tokens for pretraining (caption, interleaved, OCR, grounding, GUI, synthetic reasoning)
- evaluation : 50+ benchmarks (MMMU, OlympiadBench, GUI, etc.)
- result : outperforms other models on most benchmarks; best Elo among open-source models
- contribution :
- etc. : raises the issue of interference between tasks; can handle long context
Details
architecture
- "strong reasoning abilities inherent in MiMo-7B-Base (Xiaomi, 2025), enabling their seamless transfer and adaptation to multimodal contexts. Consequently, this allows our model to exhibit powerful and versatile multimodal reasoning capabilities across a broad array of domains." → i.e., even though MiMo is only the base model, it apparently already carries reasoning ability.
training
In stage 4, the context length is raised to 32K.
pre-training
Trained on 2.4T tokens.
- implement dedicated data curation pipelines tailored to the characteristics of each data type
- data
- Image Caption Data : dedup (image perceptual hashing), text filtering, re-captioning (w/ linguistic consistency, repetition filtering), imbalance filtering (w/ MetaCLIP)
- interleaved data: webpages, books, academic papers, pdf parsing, knowledge density and readability filtering, …
- OCR : handwritten, typographically deformed, and blurry/occluded text to increase learning difficulty; bounding-box annotated
- grounding data: single / multiple objects
- video : publicly available videos; recaptioning into dense, fine-grained captions; balanced event distribution; challenging questions about the video with synthesized responses
- GUI data : open source & synthetic. grounding, action
- synthetic reasoning : images and prompts are collected from open sources; MiMo-7B-base is used for generation. Strict filtering evaluates not only answer correctness but also the clarity of the thought, removes redundancy, and enforces consistent formatting.
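The paper only says "image perceptual hashing" for caption dedup; the exact algorithm and threshold are not specified. A minimal sketch, assuming an average-hash (aHash) variant over pre-resized grayscale pixel grids:

```python
# Hypothetical aHash-style dedup sketch. The paper does not specify the
# hash algorithm or Hamming threshold; both are assumptions here.

def average_hash(gray):
    """gray: 2D list of grayscale pixel values (already resized small).
    Emits one bit per pixel: 1 where pixel > mean, else 0."""
    flat = [p for row in gray for p in row]
    mean = sum(flat) / len(flat)
    return "".join("1" if p > mean else "0" for p in flat)

def hamming(h1, h2):
    """Number of differing bits between two equal-length hash strings."""
    return sum(a != b for a, b in zip(h1, h2))

def dedup(images, threshold=0):
    """Keep an image only if its hash differs from every kept hash
    by more than `threshold` bits (near-duplicates are dropped)."""
    kept, hashes = [], []
    for img in images:
        h = average_hash(img)
        if all(hamming(h, kh) > threshold for kh in hashes):
            kept.append(img)
            hashes.append(h)
    return kept
```

In practice one would hash resized 8x8 or 16x16 thumbnails and use a small nonzero threshold to also catch near-duplicates.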
post-training
- RLVR
- data
- visual reasoning: open source & k-12 collection
- LLM is prompted to filter proof-based problems and rewrite multiple-choice questions with numerical or symbolic answers into free-answer formats, alleviating potential reward hacking
- text reasoning : include more challenging queries requiring college or competition-level intelligence (than visual reasoning, K12)
- Image Grounding : GIoU, or whether the predicted point falls inside the GT box
- Temporal Video Grounding : video moment retrieval; predict [mm:ss, mm:ss], rewarded by the IoU of the time intervals
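The verifiable grounding rewards above (GIoU / point-in-box for image grounding, interval IoU for temporal grounding) can be sketched as below; the box format `(x1, y1, x2, y2)` and interval format `(start, end)` are my assumptions, not the paper's spec:

```python
# Hedged sketch of the grounding reward functions. Coordinate conventions
# are assumptions; the paper only names the metrics.

def giou(box_a, box_b):
    """Generalized IoU for axis-aligned boxes (x1, y1, x2, y2); in [-1, 1]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    # area of the smallest enclosing box, used for the GIoU penalty term
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (enclose - union) / enclose

def point_in_box(point, box):
    """Binary reward: 1.0 iff the predicted point lies inside the GT box."""
    x, y = point
    x1, y1, x2, y2 = box
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0

def temporal_iou(seg_a, seg_b):
    """IoU of two (start, end) time intervals, for video moment retrieval."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = max(seg_a[1], seg_b[1]) - min(seg_a[0], seg_b[0])
    return inter / union if union > 0 else 0.0
```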
- RLHF
- mixed on-policy RL
- RLVR and RLHF data are mixed and trained jointly with GRPO
- Difference from vanilla GRPO: fully on-policy (vLLM is relaunched for every rollout batch), so there is no clipping or importance-sampling term
- c.f. GRPO
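The fully on-policy variant described above reduces to group-normalized advantages with a plain policy-gradient loss. A minimal sketch (names and shapes are my assumptions; the paper does not give this pseudocode):

```python
# Sketch of GRPO-style advantages without the PPO ratio/clip machinery:
# since every rollout comes from the current policy, the importance
# ratio is identically 1 and clipping is unnecessary.
import math

def group_advantages(rewards):
    """GRPO advantage: normalize rewards within one group of rollouts
    sampled for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

def on_policy_pg_loss(logprobs, advantages):
    """Loss = -mean(A_i * log pi(o_i)); no clipped surrogate needed
    when samples are strictly on-policy."""
    n = len(logprobs)
    return -sum(a * lp for lp, a in zip(logprobs, advantages)) / n
```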
performance
The RL stage did not yield much gain.
ablation
Possibly because there is no KL term?
challenges
- Interference Between RL Tasks: grounding trajectories are short while reasoning trajectories are long, so the two tasks pull the model in opposite directions; they report it was hard to match the performance of RL runs dedicated to each task individually.
MiMo-7B-RL's AIME24 score is 68.0–80.1 (max len 48K) https://huggingface.co/XiaomiMiMo/MiMo-7B-RL
(this paper reports 67)