[206] Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

TL;DR

I read this because.. : mllm + o1
task : VLM reasoning
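problem : we want to train the way R1 was trained, but there is no vision data for it
idea : first generate a CoT with an MLLM, use that CoT to generate a description, then give the description to R1 and have it generate a long CoT.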
input/output : {I, Q} -> {long CoT, A}
architecture : Qwen2.5-VL-7B-Instruct, Llama-3.2-V-Instruct
objective : CE loss -> GRPO loss
baseline : Qwen2.5-VL-7B-Instruct, Llama-3.2-V-Instruct, Math MLLMs, LLaVA-CoT-11B, Mulberry-7B
data : cold start {LLaVA-CoT, Mulberry} image / answer – 200K -> GRPO {WeMath, PolyMATH, MathVision, SceMQA, Geometry3K} – 10K
evaluation : MM-Math, MathVista, MathVerse
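result : substantially improved over the instruct models
contribution : generating a CoT via prompting and then using it seems more reasonable than just asking for a detailed caption. I hope they release the data as well.
etc. : benchmark sets are used as training data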
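
For reference on the GRPO half of the objective line above: the core of GRPO is that each sampled response is scored relative to the other responses in its group for the same prompt, rather than by a learned value model. The snippet below is only a minimal sketch of that normalization step, not the paper's training code; the function name `group_relative_advantages`, the `eps` term, and the choice of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Assumption: population std and a small eps for numerical safety;
    real implementations may differ in both.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Each response's advantage is its reward relative to the group.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question, reward 1.0 if correct else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers end up with positive advantages and incorrect ones with negative advantages; if every answer in the group gets the same reward, all advantages are zero and that group contributes no learning signal.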

MLLM – Qwen2.5-VL-72B, LLM – R1
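
The idea line in the TL;DR can be sketched roughly as the three-step pipeline below, using the MLLM and LLM roles listed above. This is a sketch under stated assumptions, not the authors' code: `build_cold_start_sample`, the `mllm` and `llm` callables, and all prompt wording are hypothetical placeholders for whatever model clients and prompts are actually used.

```python
from typing import Callable, Dict


def build_cold_start_sample(
    image: str,
    question: str,
    mllm: Callable[[str, str], str],  # hypothetical: (image, prompt) -> text
    llm: Callable[[str], str],        # hypothetical: (prompt) -> text
) -> Dict[str, str]:
    """Produce one {I, Q} -> long CoT sample for cold-start training."""
    # Step 1: the MLLM writes a first CoT while looking at the image.
    pseudo_cot = mllm(image, f"Solve step by step: {question}")

    # Step 2: the MLLM turns image + CoT into a text-only description,
    # so the visual details the reasoning relied on are written out.
    description = mllm(
        image,
        f"Describe the image in detail, covering everything used in "
        f"this reasoning: {pseudo_cot}",
    )

    # Step 3: the text-only reasoning LLM sees the description instead
    # of the image and generates the long CoT.
    long_cot = llm(f"Description: {description}\nQuestion: {question}")

    return {"image": image, "question": question, "long_cot": long_cot}
```

The resulting long CoT is paired back with the original image and question, which is what makes it usable as supervised cold-start data for the MLLM.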