Image

paper, code

TL;DR

  • I read this because: an LVLM that reports AIME results
  • task : multimodal understanding, reasoning, GUI grounding
  • problem : building a model that is strong at both reasoning and understanding
  • idea : large-scale pretraining + RL
  • input/output : {(image, video), prompt}→ answer or action
  • architecture : Qwen2.5-ViT + MLP projector + MiMo-7B
  • objective : CE loss(pretrain, SFT), on-policy GRPO(RL)
  • baseline : Qwen2.5-VL, InternVL3, GPT-4o, UI-TARS, etc.
  • data : 2.4T tokens for pretraining (caption, interleaved, OCR, grounding, GUI, synthetic reasoning)
  • evaluation : 50+ benchmarks (MMMU, OlympiadBench, GUI, etc.)
  • result : outperforms other models on most benchmarks; highest Elo among open-source models
  • contribution :
  • etc. : raises the issue of interference between RL tasks; can handle long context

Details

architecture

Image
  • strong reasoning abilities inherent in MiMo-7B-Base (Xiaomi, 2025), enabling their seamless transfer and adaptation to multimodal contexts. Consequently, this allows our model to exhibit powerful and versatile multimodal reasoning capabilities across a broad array of domains. → i.e., even though MiMo-7B is only the base model, it already comes with reasoning ability.

training

Image

Context length is raised to 32K in stage 4.

pre-training

Trained on 2.4T tokens.

  • implement dedicated data curation pipelines tailored to the characteristics of each data type
  • data
    • Image Caption Data : dedup (image perceptual hashing), text filtering, re-captioning (with linguistic-consistency, repetition, and imbalance filtering (w/ MetaCLIP))
    • interleaved data : webpages, books, academic papers; PDF parsing, knowledge-density and readability filtering, …
    • OCR : to increase learning difficulty, includes handwritten, typographically deformed, and blurry/occluded text, with bounding-box annotations
    • grounding data : single / multiple objects
    • video : publicly available; video re-captioning (dense, fine-grained, covering the event distribution); synthesize challenging questions about the video together with responses
    • GUI data : open source & synthetic; grounding, action
    • synthetic reasoning : images and prompts are collected from open sources, and MiMo-7B-Base is used for generation. Strict filtering evaluates not only answer correctness but also the clarity of the thought, removes redundancy, and enforces consistent formatting.
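The perceptual-hash dedup mentioned for the caption data can be sketched roughly as below. This is an illustrative average-hash on raw grayscale pixel grids, not the paper's actual pipeline (production systems typically use pHash/dHash on decoded images; the hash size and distance threshold here are assumptions):

```python
# Sketch of perceptual-hash dedup for caption data. Assumes images are
# already decoded to small grayscale pixel grids (e.g. 8x8); hash size
# and the Hamming-distance threshold are illustrative, not from the paper.

def average_hash(pixels):
    """1 bit per pixel: set if the pixel is brighter than the image mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)

def hamming(h1, h2):
    """Number of differing bits between two hashes."""
    return sum(a != b for a, b in zip(h1, h2))

def dedup(images, max_dist=2):
    """Keep only images whose hash is not near any already-kept hash."""
    kept, hashes = [], []
    for img in images:
        h = average_hash(img)
        if all(hamming(h, prev) > max_dist for prev in hashes):
            kept.append(img)
            hashes.append(h)
    return kept
```

Near-duplicates (a few changed pixels) collapse to almost identical hashes and get dropped, while visually distinct images survive.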

post-training

Image

  • RLVR
    • data
      • visual reasoning: open source & K-12 collection
        • LLM is prompted to filter proof-based problems and rewrite multiple-choice questions with numerical or symbolic answers into free-answer formats, alleviating potential reward hacking
      • text reasoning : includes queries that are more challenging (college or competition level) than the visual reasoning data, which is K-12 level
      • Image Grounding : reward via GIoU, or whether the predicted point falls inside the gt box
      • Temporal Video Grounding : video moment retrieval, answers in [mm:ss, mm:ss] format – IoU reward
  • RLHF
  • mixed on-policy RL
    • RLVR and RLHF data are combined and trained together with GRPO
    • Difference from standard GRPO: fully on-policy (vLLM is relaunched for every update), so there is no clipping or importance-sampling term
      • Image
      • Image
      • c.f. GRPO
      • Image
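The verifiable rewards described above (GIoU for box grounding, a point-in-box check, and interval IoU for temporal grounding) can be sketched as below. The exact reward shaping in the paper may differ; these are textbook definitions:

```python
# Hedged sketch of the RLVR grounding rewards: GIoU for boxes,
# point-in-box for point predictions, interval IoU for temporal spans.
# Formulas are the standard definitions; the paper's shaping may differ.

def giou(box_a, box_b):
    """Generalized IoU for (x1, y1, x2, y2) boxes, in [-1, 1]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Smallest box enclosing both; penalizes distant, non-overlapping boxes.
    c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (c_area - union) / c_area

def point_in_box(point, box):
    """Binary reward: does the predicted point land inside the gt box?"""
    x, y = point
    x1, y1, x2, y2 = box
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0

def temporal_iou(pred, gt):
    """IoU of two [start_sec, end_sec] intervals (parsed from mm:ss)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0
```

Unlike plain IoU, GIoU stays informative (negative but nonzero gradient) even when the predicted box does not overlap the target at all.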

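Because rollouts are re-sampled from the current policy every step, the importance ratio is identically 1, so clipping drops out and the objective reduces to the group-normalized advantage times the log-probability. A minimal sketch (function and variable names are illustrative, not from the paper's code; no KL term, matching the description above):

```python
# Minimal sketch of fully on-policy GRPO: fresh rollouts each update mean
# the importance ratio is 1, so there is no clipping and no ratio term.
# `rewards` are per-response scalars for one prompt's group of G rollouts.

def group_advantages(rewards, eps=1e-6):
    """GRPO advantage: normalize rewards within the group,
    A_i = (r_i - mean(r)) / (std(r) + eps)."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

def grpo_onpolicy_loss(logprobs, rewards):
    """logprobs[i] = summed token log-probs of response i under the
    current policy. Gradient ascent on E[A * log pi], returned negated
    as a loss; no clipping, no importance sampling, no KL penalty."""
    advs = group_advantages(rewards)
    return -sum(a * lp for a, lp in zip(advs, logprobs)) / len(rewards)
```

In a real trainer `logprobs` would be autograd tensors; plain floats are used here only to make the arithmetic visible.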
performance

Image

The RL stage did not yield large gains.

Image

ablation

Image

Possibly because there is no KL term?

Image

challenges

  • Interference Between RL Tasks: grounding trajectories are short while reasoning trajectories are long, so the two tasks push the policy in opposite directions, making it hard to match the performance of RL runs trained on each task alone.
Image

MiMo-7B-RL's AIME24 score is 68.0–80.1 (max len 48K): https://huggingface.co/XiaomiMiMo/MiMo-7B-RL

(this paper reports 67)