Image

paper

TL;DR

  • I read this because.. : an LVLM that reports AIME performance
  • task : multimodal reasoning (math, vision, QA)
  • problem : VLMs are weak at complex reasoning, and vision-text alignment is also hard
  • idea : MLP-based adapter + hybrid SFT+GRPO + adaptive-length CoT distillation
  • input/output : {image, prompt} -> {step-by-step reasoning, boxed answer}
  • architecture : DeepSeek-R1-distill-Qwen2.5-32B (frozen), InternViT-6B-448px-V2_5 (frozen), MLP Adapter
  • objective : SFT, GRPO
  • baseline : GPT-4o, Claude 3.5, Kimi k1.5, InternVL2.5, QwenVL
  • data : 2M VL data → 200K (GPT-4 filtered) → 40K CoT (AL-CoTD) → prompt
  • evaluation : MATH500, AIME24, GPQA, MathVista, MMMU
  • result : competitive performance, e.g. MATH500 94.0 / AIME24 72.0 / MMMU 69.0
  • contribution : efficiently extends a reasoning LLM to vision; RL further boosts performance
  • etc. : training only the MLP is unusual and interesting, but what's notable(?) is that they report AIME performance. They also state clearly here that the LLM is frozen, whereas V2 describes this ambiguously, which makes it even more notable
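The three-stage adapter training spelled out in the Details section can be written down as plain data. A minimal sketch with my own names (`Stage`, `SCHEDULE`, `trainable_modules`); the key invariant is that only the MLP adapter is ever trainable, while InternViT and the LLM stay frozen throughout:

```python
from dataclasses import dataclass

@dataclass
class Stage:
    llm: str    # language model the adapter is aligned to in this stage
    data: str   # training subset for this stage
    lr: float   # learning rate (1 epoch per stage)

# Stage 1 initializes the adapter against the plain instruct model;
# stages 2-3 swap in the reasoning model and shrink/refine the data.
SCHEDULE = [
    Stage("Qwen2.5-32B-Instruct",            "2M full VL data",          2e-4),
    Stage("DeepSeek-R1-distill-Qwen2.5-32B", "200K GPT-4-filtered data", 4e-5),
    Stage("DeepSeek-R1-distill-Qwen2.5-32B", "40K AL-CoTD CoT data",     4e-5),
]

def trainable_modules(stage: Stage) -> set[str]:
    """Whatever the stage, only the MLP adapter receives gradients;
    the ViT and the LLM are frozen."""
    return {"mlp_adapter"}
```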

Details

thumbnail

Image

  • ์ด ๋…ผ๋ฌธ์˜ ๊ฐ€์žฅ ํŠน์ด์ ์€ MLP๋งŒ ํ•™์Šตํ•œ๋‹ค๋Š” ๊ฒƒ์ž„. ์ด๋•Œ MLP๋ฅผ ํ•™์Šต ์‹œํ‚ค๋Š” ๋ฐฉ์‹์— ๊ณต์„ ๋“ค์ž„
      1. MLP adapter๋ฅผ ์ฒ˜์Œ initializeํ•  ๋•Œ๋Š” reasoning lanugage model ๋Œ€์‹  ๊ทธ๋ƒฅ language model์„ ์‚ฌ์šฉํ•จ (Qwen2.5-32B-Instruct)
      • 2M full dataset์œผ๋กœ finetune
      1. ์ด ๋‹จ๊ณ„์—์„œ language model์„ DeepSeek-R1-distill-Qwen2.5-32B ๋กœ ๊ต์ฒด. tokenizer์™€ parameter๊ฐ€ ๋‹ค๋ฅด์ง€๋งŒ(์™œ ๋‹ค๋ฅด์ง€??) ์›๋ž˜ ์„ฑ๋Šฅ์„ ์ž˜ ๋ณต์›ํ•œ๋‹ค๊ณ  ํ•จ
      • GPT-4๋กœ ํ‰๊ฐ€๋œ high-quality ์˜ 200K ์‚ฌ์šฉ
      1. 40K์˜ high-quality CoT ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต (Adaptive-Length Chain-of-Thought Distilation ์‚ฌ์šฉ)
    • ๊ฐ 1 epoch์”ฉ lr์€ 2e-4 -> 4e-5 -> 4e-5
  • Hybrid Optimization Framework
  • Image
  • stage 1: train on all datasets without filtering
  • stage 2: filter by reward-model scores and intersect with the examples the previous stage's model failed to solve, using that intersection as training data
    • the score threshold is said to be raised progressively (2, 3, 4, 5)
    • context length 16K
  • stage 3: GRPO, reward=5, generation bs 8, temperature 1, lr 1e-6, max completion length 8k
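The stage-2 data selection above (reward-score filtering intersected with examples the previous model fails on) reduces to two set operations. A minimal sketch; the function name, ids, and toy scores are mine, and the real pipeline scores with a reward model while raising the threshold per iteration:

```python
def select_stage2(samples, reward_scores, solved_by_prev, threshold):
    """samples: list of sample ids; reward_scores: id -> reward-model score;
    solved_by_prev: set of ids the previous stage's model already solves.
    Keep only high-quality samples that are still unsolved."""
    high_quality = {s for s in samples if reward_scores[s] >= threshold}
    hard = {s for s in samples if s not in solved_by_prev}
    return high_quality & hard  # intersection of both filters

scores = {"a": 5, "b": 4, "c": 2, "d": 5}
solved = {"a"}  # previous model already solves "a"
# threshold is raised over iterations, e.g. 2 -> 3 -> 4 -> 5
picked = select_stage2(["a", "b", "c", "d"], scores, solved, threshold=4)
```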

Adaptive-Length Chain-of-Thought Distillation

Image
  • QDAM:
    • vision score: image clarity, image necessity (is the image required to answer the question)
    • text score: GPT-4o is used to rate question quality, difficulty level, reasoning demand, etc.
  • VTIA
    • rates whether scientific reasoning (why, how, etc.) is required
  • The two are combined to estimate P, how long an answer the query needs; when P is low, a higher repetition penalty is applied so the generation stays concise.
  • Finally, GPT-4o checks whether the answer is correct; if it is wrong, GPT-4o regenerates it.

performance

Image

ablation

Image