Image

paper

TL;DR

  • I read this because.. : mllm + o1
  • task : VLM reasoning
  • problem : want to train the way R1 was trained, but there is no vision data for it
  • idea : first generate a CoT with an MLLM, use it to generate a description, then hand that description to R1 and have it generate a long CoT.
  • input/output : {I, Q} -> {long CoT, A}
  • architecture : Qwen2.5-VL-7B-Instruct, Llama-3.2-V-Instruct
  • objective : CE loss -> GRPO loss
  • baseline : Qwen2.5-VL-7B-Instruct, Llama-3.2-V-Instruct, Math MLLM, LLaVA-CoT-11B, Mulberry-7B
  • data : cold start {LLaVA-CoT, Mulberry} image / answer – 200K -> GRPO {WeMath, PolyMATH, MathVision, SceMQA, Geometry3K} – 10K
  • evaluation : MM-Math, MathVista, MathVerse
  • result : substantially improved over the instruct models
  • contribution : generating a CoT via prompt and then using it seems more reasonable than just asking for a detailed caption. It would be nice if they released the data too.
  • etc. : benchmark sets are used as training datasets
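
The GRPO stage in the objective bullet scores a group of sampled answers per question and normalizes each reward against its own group. A minimal sketch of that group-relative advantage follows. The function name, the `eps` constant, and the use of the population standard deviation are assumptions for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).

    Assumption: population std and a small eps to avoid division by zero;
    actual implementations may differ on both.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question, reward 1.0 if correct, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct samples get a positive advantage and incorrect ones a negative advantage. If every sample in a group gets the same reward, all advantages are zero and that group contributes no learning signal.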

Details

  • thumbnail
Image

MLLM – Qwen2.5-VL-72B, LLM – R1
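
A rough sketch of the cold-start data flow described in the idea bullet: the MLLM writes a first CoT, that CoT conditions a description of the image, and the text-only reasoner turns the description into a long CoT. The `mllm` and `reasoner` callables, the prompt wording, and the output field names are all hypothetical placeholders, not the paper's actual prompts or API.

```python
def build_cold_start_sample(image, question, answer, mllm, reasoner):
    """Build one {I, Q} -> {long CoT, A} training sample.

    mllm(image, prompt) -> str and reasoner(prompt) -> str are
    hypothetical callables supplied by the caller.
    """
    # Step 1: the MLLM produces an initial chain of thought from image + question.
    pseudo_cot = mllm(image, f"Think step by step and answer: {question}")

    # Step 2: the MLLM describes the image, conditioned on that CoT, so the
    # description keeps the visual details the reasoning depends on.
    description = mllm(
        image,
        f"Question: {question}\nReasoning: {pseudo_cot}\n"
        "Describe the image in detail, covering everything the reasoning uses.",
    )

    # Step 3: the text-only reasoner sees only the description and question.
    long_cot = reasoner(f"{description}\n\nQuestion: {question}")

    return {"image": image, "question": question,
            "long_cot": long_cot, "answer": answer}
```

Passing the models in as callables keeps the sketch independent of any particular serving stack. Filtering samples whose final answer disagrees with the ground truth would be a natural extra step, but it is not shown here.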

Image Image Image
  • data distil
Image
  • data ablation
Image
  • progressive
Image Image
  • main result
Image
  • qualitative example
Image