
paper

TL;DR

  • I read this because : MLLM + o1-style reasoning
  • task : VLM reasoning
  • problem : I want to train R1-style, but there is no multimodal long-CoT data.
  • idea : have the MLLM generate a pseudo-CoT first, use it to produce a detailed image description, then give that to R1 so R1 generates the long CoT.
  • input/output : {I, Q} -> {long CoT, A}
  • architecture : Qwen2.5-VL-7B-Instruct, Llama-3.2-V-Instruct
  • objective : CE loss -> GRPO loss
  • baseline : Qwen2.5-VL-7B-Instruct, Llama-3.2-V-Instruct, Math MLLMs, LLaVA-CoT-11B, Mulberry-7B
  • data : cold start {LLaVA-CoT, Mulberry} image/answer pairs – 200K -> GRPO {WeMath, PolyMATH, MathVision, SceMQA, Geometry3K} – 10K
  • evaluation : MM-Math, MathVista, MathVerse
  • Result : Significant improvement over the instruct model
  • contribution : generating a CoT and using it as a prompt (rather than just generating a detailed caption) seems reasonable. I'd also like to see the data published.
  • etc. : benchmark sets are also used as (training) datasets
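The idea bullet above can be sketched as a data-flow. This is a minimal sketch, assuming hypothetical stand-in functions `mllm_generate` / `r1_generate` for the actual model calls (an MLLM and text-only R1 in the paper); only the {I, Q} -> {long CoT, A} flow is from the note.

```python
def mllm_generate(image, prompt):
    # stub: a real MLLM would condition on the image pixels here
    return f"[MLLM output for: {prompt[:40]}]"

def r1_generate(prompt):
    # stub: text-only R1 returns a long chain-of-thought plus a final answer
    return "<think>step 1 ... step n</think>", "final answer"

def build_cold_start_example(image, question):
    # 1) pseudo-CoT from the MLLM, grounded in the image
    pseudo_cot = mllm_generate(image, f"Think step by step: {question}")
    # 2) turn image + pseudo-CoT into a text-only detailed description
    description = mllm_generate(
        image, f"Describe the image in detail, guided by: {pseudo_cot}"
    )
    # 3) text-only R1 produces the long CoT from description + question
    long_cot, answer = r1_generate(f"{description}\n{question}")
    # cold-start training pair: {I, Q} -> {long CoT, A}
    return {"image": image, "question": question,
            "cot": long_cot, "answer": answer}

example = build_cold_start_example("img_001.png", "What is the area of the triangle?")
```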

Details

  • thumbnail (figure: overall pipeline)

MLLM – Qwen2.5-VL-72B, LLM – R1

  • data distillation (figure)
  • data ablation (figure)
  • progressive training (figures)
  • main result (figure)
  • qualitative example (figure)
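The objective switch in the TL;DR (CE loss -> GRPO loss) hinges on GRPO's group-relative advantage: sample several rollouts per prompt, reward each, and normalize within the group instead of using a value network. A minimal sketch, assuming binary answer-correctness rewards (the function name is mine, not the paper's):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    # normalize each rollout's reward against its own group's
    # mean/std; no learned critic is needed
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# e.g. 4 sampled long-CoT rollouts, reward 1 if the final answer is correct
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct rollouts get positive advantage, incorrect ones negative, and the group sums to zero, which is what makes a purely outcome-based reward usable without a baseline model.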