paper, page

TL;DR

  • I read this because.. :
  • task : video understanding / RL (auto-think)
  • problem : in video QA, always-think (CoT) only spends more tokens while accuracy stays the same or drops (Table 1). The goal is to let the model decide for itself when to think
  • idea : “thinking once, answering twice”: first emit an immediate answer \boxed{a_1}; if its confidence (length-normalized mean log prob) is at or above the threshold $\tau{=}0.97$, exit early. Otherwise run <think>...</think> and answer again with \boxed{a_2}
  • input/output : {video (≤256 frames), question} -> a_1 (+ optional <think> + a_2)
  • architecture : Qwen2.5-VL-7B-Instruct / Qwen3-VL-8B-Instruct. vision encoder frozen, projector + LLM trained. At inference, Qwen3-VL uses up to 128K context
  • objective : RL only (no cold-start SFT). GRPO + dual-answer reward $R = w_1 R_{task}(a_1) + w_2 R_{task}(a_2) + \lambda R_{fmt} + \alpha R_{fallback}$, $w_1{=}0.9, w_2{=}1.1, \lambda{=}1, \alpha{=}0.3$
  • baseline : Qwen2.5-VL / Qwen3-VL base, Video-R1, Time-R1, VideoChat-R1.5, Temporal-RLT, Video-RFT, Video-RTS, VITAL, LongVILA-R1, LOVE-R1, training-based auto-think (AdaptThink-style)
  • data : RL 83K (filtered from 137K). text 6.4K (DAPO-Math) / image 27.5K (ViRL, ThinkLite-Hard) / video 49.4K (Video-R1, TVBench, STI-Bench, temporal grounding)
  • evaluation : video QA: VideoMME, MVBench, LongVideoBench, MMVU, VideoMMMU, MVP / temporal grounding: Charades-STA, ActivityNet, NExT-GQA / image: MathVista, MathVision, MathVerse, MMMU, MMMU-Pro, MM-Vet
  • result : VideoMME 67.3 (Qwen2.5) / 71.7 (Qwen3), VideoMMMU 58.6 (+3.9 over Qwen baseline) / 65.0, MVP 39.4 (+2.9 over Video-R1), Charades-STA mIoU 60.0 / 63.7. avg response 44 tokens (vs 149-386). Gains are small on perception-type benchmarks (VideoMME +1.3)
  • contribution : (authors' claim) (1) shows quantitatively that always-think is inefficient on video, and (2) a simple method that splits think/no-think by inference-time confidence, trained with a dual-answer reward and no SFT. The real contribution seems to be sidestepping the collapse of training-based auto-think through inference-side early exit
  • etc. : a single hyperparameter, $\tau{=}0.97$, determines the entire think ratio. It is interesting that the ratio adapts per task automatically: perception (MVBench 25%, VideoMME 11-40%) vs reasoning (VideoMMMU 51-53%). The authors argue that training-based methods collapse because video has almost no “must-think” samples

Details

architecture

  • backbone: experiments on both Qwen2.5-VL-7B-Instruct and Qwen3-VL-8B-Instruct
  • vision encoder is frozen; only the projector + LLM are trained
  • video input: max 256 frames / 4096 video tokens at training time. At inference, Qwen3-VL uses up to 128K context

method: thinking once, answering twice

Three token sequences are generated in a single pass:

  1. \boxed{a_1}: immediate answer, without reasoning
  2. <think>...</think>: internal CoT
  3. \boxed{a_2}: reviewed answer

early exit at inference: right after $a_1$ is generated, the length-normalized mean log prob of its tokens is taken as the confidence; if it is $\geq \tau$ ($\tau{=}0.97$), generation stops there and $a_1$ is returned. Otherwise generation continues through think + $a_2$ to the end.

fallback: on hard problems the model cannot answer directly → it is trained to output "Let's analyze the problem step by step" in the $a_1$ slot. Once this is emitted, generation is naturally forced into the think stage (i.e. the fallback is also absorbed into the inference-time early-exit logic).
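The early-exit rule and the fallback can be sketched as a small decision function. This is a minimal illustration, not the authors' code: the function names are hypothetical, and it assumes the mean log prob is exponentiated (a geometric-mean token probability in (0, 1]) so that it is comparable to $\tau{=}0.97$, which the notes do not state explicitly.

```python
import math
from typing import Sequence

TAU = 0.97  # threshold reported in the notes
FALLBACK = "Let's analyze the problem step by step"  # fallback string from the notes


def answer_confidence(token_logprobs: Sequence[float]) -> float:
    """Length-normalized confidence of the first answer.

    Assumption: the mean token log prob is mapped through exp(), giving a
    value in (0, 1] that can be compared against tau.
    """
    if not token_logprobs:
        return 0.0  # no tokens means no evidence, so never exit early
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def should_exit_early(first_answer: str,
                      token_logprobs: Sequence[float],
                      tau: float = TAU) -> bool:
    """Return True if decoding can stop after a_1, False if think + a_2 is needed."""
    # A fallback string in the a_1 slot always forces the think stage.
    if FALLBACK.lower() in first_answer.lower():
        return False
    return answer_confidence(token_logprobs) >= tau
```

In a real decoder the per-token log probs would come from the generation API (for example the scores returned alongside sampled tokens); here they are simply passed in as a list of floats.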

training recipe

  • stage: RL only, no SFT. GRPO is applied directly, with no cold start
  • GRPO: G=16 rollouts, temperature 1.0, lr $1{\times}10^{-6}$, $\beta{=}0.01$ (KL), batch 256, 1 epoch
  • dual-answer reward: $R = w_1 R_{task}(a_1) + w_2 R_{task}(a_2) + \lambda R_{fmt} + \alpha R_{fallback}$
    • $w_1{=}0.9, w_2{=}1.1$ → slightly favors the reviewed answer
    • $\alpha{=}0.3$ → fallback bonus. Rewards handing a hard problem over to the think stage
  • compute: 32 H100, ~35h
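The reward formula above can be sketched for a single rollout, together with the group-relative normalization that GRPO applies over the $G$ rollouts of one prompt. This is a hedged sketch: the `Rollout` fields, the binary 0/1 reward terms, and the exact condition for $R_{fallback}$ (fallback in the $a_1$ slot followed by a correct $a_2$) are assumptions made for illustration; only the weights come from the notes.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List

# Weights as reported in the notes.
W1, W2, LAMBDA, ALPHA = 0.9, 1.1, 1.0, 0.3


@dataclass
class Rollout:
    """Hypothetical summary of one sampled response."""
    a1_correct: bool      # first (direct) answer matches the ground truth
    a2_correct: bool      # reviewed answer matches the ground truth
    format_ok: bool       # output follows boxed / think / boxed layout
    a1_is_fallback: bool  # a_1 slot holds the fallback string


def dual_answer_reward(r: Rollout) -> float:
    """R = w1*R_task(a1) + w2*R_task(a2) + lambda*R_fmt + alpha*R_fallback."""
    # Assumption: a fallback string earns no task reward for a_1.
    r_task_a1 = 1.0 if (r.a1_correct and not r.a1_is_fallback) else 0.0
    r_task_a2 = 1.0 if r.a2_correct else 0.0
    r_fmt = 1.0 if r.format_ok else 0.0
    # Assumption: the bonus is paid only when deferring led to a correct a_2.
    r_fallback = 1.0 if (r.a1_is_fallback and r.a2_correct) else 0.0
    return W1 * r_task_a1 + W2 * r_task_a2 + LAMBDA * r_fmt + ALPHA * r_fallback


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: (r - mean) / (std + eps) within one prompt's group.

    Assumption: population standard deviation; implementations differ on this.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(x - mu) / (sigma + eps) for x in rewards]
```

Under these assumptions a rollout with both answers correct scores 3.0, while a fallback followed by a correct $a_2$ scores 2.4, so answering directly and correctly is still preferred over deferring.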

data

  • 137K → filter → 83K used for training
  • text 6.4K (DAPO-Math) / image 27.5K (ViRL, ThinkLite-Hard) / video 49.4K (Video-R1, TVBench, STI-Bench, temporal grounding)
  • (not in the paper) what exactly the filtering criterion is (difficulty-based or confidence-based); need to re-read the main text

result

Based on video QA (Table 3):

  • VideoMME: 67.3 (Qwen2.5) / 71.7 (Qwen3); perception-heavy, so the gain is small (+1.3)
  • VideoMMMU: 58.6 (+3.9), 65.0 (Qwen3); the gain is large on reasoning-type benchmarks
  • MVP: 39.4 (+2.9 over Video-R1)
  • avg response length 44 tokens vs 149-386 tokens for prior reasoning models → ~3.3× shorter

temporal grounding (Table 4):

  • Charades-STA mIoU 60.0 / 63.7 (Qwen3)
  • ActivityNet and NExT-GQA show a similar trend

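image (Table 5): despite the video-focused training, the model also generalizes to image reasoning benchmarks (MathVista, MathVision, MMMU and similar). Note that the RL mix listed under data does include 27.5K image samples, so training is not strictly video-only.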

ablation

  • training strategy (Table 6): compares SFT only / RL no-think / RL CoT / VideoAuto-R1. RL CoT reaches 56.4 on reasoning while using 149 tokens, whereas VideoAuto-R1 reaches 58.6 with 44 tokens
  • training-based vs inference-based auto-think (Table 7): training-based collapses to a single mode (all think or all no-think). Inference-side early exit is stable. Authors' claim: “must-think” samples are rare in video, so the training signal is insufficient
  • dual-answer weight (Table 9): the asymmetric $w_1{:}w_2 = 0.9{:}1.1$ beats equal weighting
  • fallback bonus: increasing $\alpha$ improves reasoning-benchmark performance
  • threshold τ (Figure 3): $\tau{=}0.97$ is a robust default. Raising τ raises the think ratio, but the accuracy gain diminishes on perception benchmarks
  • think-ratio (per bench): MVBench 25/31%, VideoMME 40/11%, VideoMMMU 51/53%. The ratio adapts automatically: perception sits at roughly 30% or below (the 40% VideoMME figure being the exception), reasoning at 50% or above
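The link between τ and the think ratio can be made concrete with a small sweep: a sample enters the think stage exactly when its first-answer confidence falls below τ, so the think ratio is the fraction of confidences under the threshold. The sketch below is illustrative only; the function names are hypothetical and any confidence values passed in are assumed to lie in (0, 1].

```python
from typing import Dict, Sequence


def think_ratio(confidences: Sequence[float], tau: float) -> float:
    """Fraction of samples whose first-answer confidence is below tau."""
    if not confidences:
        raise ValueError("need at least one confidence value")
    n_think = sum(1 for c in confidences if c < tau)
    return n_think / len(confidences)


def sweep_tau(confidences: Sequence[float],
              taus: Sequence[float]) -> Dict[float, float]:
    """Think ratio for each candidate threshold."""
    return {tau: think_ratio(confidences, tau) for tau in taus}
```

Because the comparison is a fixed cut on a per-sample score, the think ratio can only stay flat or rise as τ increases, which matches the trend described for Figure 3; a benchmark whose first answers are mostly high-confidence (perception-style) ends up with a low ratio at the same τ.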