
paper

TL;DR

  • I read this because : MLLM + o1-style reasoning
  • task : VLM reasoning
  • problem : I want to train R1-style, but there is no multimodal long-CoT data.
  • idea : have the MLLM generate a pseudo-CoT first, use it to produce a detailed image description, then give that to R1 so R1 generates the long CoT.
  • input/output : {I, Q} -> {long CoT, A}
  • architecture : Qwen2.5-VL-7B-Instruct, Llama-3.2-V-Instruct
  • objective : CE loss -> GRPO loss
  • baseline : Qwen2.5-VL-7B-Instruct, Llama-3.2-V-Instruct, Math MLLMs, LLaVA-CoT-11B, Mulberry-7B
  • data : cold start {LLaVA-CoT, Mulberry} image/answer pairs – 200K -> GRPO {WeMath, PolyMATH, MathVision, SceMQA, Geometry3K} – 10K
  • evaluation : MM-Math, MathVista, MathVerse
  • Result : Significant improvement over the instruct model
  • contribution : generating a CoT and using it as a prompt (rather than just generating a detailed caption) seems reasonable. I'd also like to see the data published.
  • etc. : benchmark sets are also used as (training) datasets
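The idea bullet above can be sketched as a data-flow. This is a minimal sketch, assuming hypothetical stand-in functions `mllm_generate` / `r1_generate` for the actual model calls (an MLLM and text-only R1 in the paper); only the {I, Q} -> {long CoT, A} flow is from the note.

```python
def mllm_generate(image, prompt):
    # stub: a real MLLM would condition on the image pixels here
    return f"[MLLM output for: {prompt[:40]}]"

def r1_generate(prompt):
    # stub: text-only R1 returns a long chain-of-thought plus a final answer
    return "<think>step 1 ... step n</think>", "final answer"

def build_cold_start_example(image, question):
    # 1) pseudo-CoT from the MLLM, grounded in the image
    pseudo_cot = mllm_generate(image, f"Think step by step: {question}")
    # 2) turn image + pseudo-CoT into a text-only detailed description
    description = mllm_generate(
        image, f"Describe the image in detail, guided by: {pseudo_cot}"
    )
    # 3) text-only R1 produces the long CoT from description + question
    long_cot, answer = r1_generate(f"{description}\n{question}")
    # cold-start training pair: {I, Q} -> {long CoT, A}
    return {"image": image, "question": question,
            "cot": long_cot, "answer": answer}

example = build_cold_start_example("img_001.png", "What is the area of the triangle?")
```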

Details

  • thumbnail (figure: overall pipeline)

MLLM – Qwen2.5-VL-72B, LLM – R1

  • data distillation (figure)
  • data ablation (figure)
  • progressive training (figures)
  • main result (figure)
  • qualitative example (figure)
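The objective switch in the TL;DR (CE loss -> GRPO loss) hinges on GRPO's group-relative advantage: sample several rollouts per prompt, reward each, and normalize within the group instead of using a value network. A minimal sketch, assuming binary answer-correctness rewards (the function name is mine, not the paper's):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    # normalize each rollout's reward against its own group's
    # mean/std; no learned critic is needed
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# e.g. 4 sampled long-CoT rollouts, reward 1 if the final answer is correct
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct rollouts get positive advantage, incorrect ones negative, and the group sums to zero, which is what makes a purely outcome-based reward usable without a baseline model.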