TL;DR
- I read this because : MLLM + o1-style reasoning
- task : VLM reasoning
- problem : I want to train R1-style, but multimodal long-CoT data doesn’t exist.
- idea : have an MLLM first create a pseudo-CoT, use it to create a detailed description, then give that text to R1 and let it generate the long CoT.
- input/output : {I, Q} -> {long CoT, A}
- architecture : Qwen2.5-VL-7B-Instruct, Llama-3.2-Vision-Instruct
- objective : CE loss -> GRPO loss
- baseline : Qwen2.5-VL-7B-Instruct, Llama-3.2-Vision-Instruct, Math MLLMs, LLaVA-CoT-11B, Mulberry-7B
- data : cold start {LLaVA-CoT, Mulberry} images/answers – 200K -> GRPO {WeMath, PolyMATH, MathVision, SceMQA, Geometry3K} – 10K
- evaluation : MM-Math, MathVista, MathVerse
- Result : Significant improvement over the instruct model
- contribution : creating a pseudo-CoT and using it as the prompt, rather than just generating a detailed caption, seems reasonable. I’d also like to see the data published.
- etc. : note that the benchmark sets are written up as datasets
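The idea above ({I, Q} -> pseudo-CoT -> description -> long CoT via a text-only reasoner) can be sketched as a pipeline. This is a minimal illustration, not the paper’s code: the function names, prompts, and the canned model outputs are all hypothetical stand-ins for real Qwen2.5-VL-72B / DeepSeek-R1 calls.

```python
def mllm_generate(image: str, prompt: str) -> str:
    # Hypothetical stand-in for a real MLLM call (e.g. Qwen2.5-VL-72B).
    return f"[MLLM output for {image!r} given {prompt!r}]"

def llm_generate(prompt: str) -> str:
    # Hypothetical stand-in for a text-only reasoner call (e.g. DeepSeek-R1).
    return f"<think> ...long CoT over: {prompt}... </think> Answer: A"

def build_cold_start_sample(image: str, question: str) -> dict:
    # Step 1: pseudo-CoT grounded in the image.
    pseudo_cot = mllm_generate(image, f"Answer step by step: {question}")
    # Step 2: detailed description conditioned on the pseudo-CoT.
    description = mllm_generate(
        image, f"Describe the image in detail, following these steps: {pseudo_cot}"
    )
    # Step 3: text-only R1 expands description + question into a long CoT.
    long_cot = llm_generate(f"{description}\nQuestion: {question}")
    return {"image": image, "question": question, "long_cot": long_cot}

sample = build_cold_start_sample("geometry_001.png", "What is angle ABC?")
```

The point of the sketch: the image never reaches R1; only the pseudo-CoT-conditioned description does, which is the modality bridge.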
Details
- thumbnail
MLLM – Qwen2.5-VL-72B, LLM – DeepSeek-R1
- data distil
- data ablation
- progressive
- main result
- qualitative example
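The objective switch (CE loss -> GRPO loss) comes down to GRPO replacing a learned critic with group-normalized rewards over sampled completions. A minimal sketch of the advantage computation, assuming binary correctness rewards; this is not the paper’s exact formulation:

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    # GRPO: for a group of completions sampled from the same prompt,
    # normalize each reward by the group mean and std, so no value
    # model is needed. Guard against a zero-variance group.
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

# Four sampled answers to one question, two of them correct (reward 1.0).
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])  # -> [1.0, -1.0, -1.0, 1.0]
```

Correct completions get positive advantage, incorrect ones negative, and a group where every sample gets the same reward contributes zero gradient signal.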