TL;DR
- I read this because.. :
- task : video understanding / RL (auto-think)
- problem : in video QA, always-think (CoT) only spends more tokens while accuracy is on par or worse (Table 1). The goal is to let the model decide for itself when to think
- idea : "thinking once, answering twice": first give an immediate answer as `\boxed{a_1}`; if its confidence (length-normalized mean log prob) is at or above the threshold $\tau{=}0.97$, early exit. Otherwise go through `<think>...</think>` and answer again with `\boxed{a_2}`
- input/output : {video (≤256 frames), question} -> a_1 (+ optional `<think>` + a_2)
- architecture : Qwen2.5-VL-7B-Instruct / Qwen3-VL-8B-Instruct. vision encoder frozen, projector + LLM trained. At inference, Qwen3-VL goes up to 128K context
- objective : RL only (no cold-start SFT). GRPO + dual-answer reward $R = w_1 R_{task}(a_1) + w_2 R_{task}(a_2) + \lambda R_{fmt} + \alpha R_{fallback}$, $w_1{=}0.9, w_2{=}1.1, \lambda{=}1, \alpha{=}0.3$
- baseline : Qwen2.5-VL / Qwen3-VL base, Video-R1, Time-R1, VideoChat-R1.5, Temporal-RLT, Video-RFT, Video-RTS, VITAL, LongVILA-R1, LOVE-R1, training-based auto-think (AdaptThink-style)
- data : RL 83K (filtered from 137K). text 6.4K (DAPO-Math) / image 27.5K (ViRL, ThinkLite-Hard) / video 49.4K (Video-R1, TVBench, STI-Bench, temporal grounding)
- evaluation : video QA: VideoMME, MVBench, LongVideoBench, MMVU, VideoMMMU, MVP / temporal grounding: Charades-STA, ActivityNet, NExT-GQA / image: MathVista, MathVision, MathVerse, MMMU, MMMU-Pro, MM-Vet
- result : VideoMME 67.3 (Qwen2.5) / 71.7 (Qwen3), VideoMMMU 58.6 (+3.9 over Qwen baseline) / 65.0, MVP 39.4 (+2.9 over Video-R1), Charades-STA mIoU 60.0 / 63.7. avg response 44 tokens (vs 149~386). Gains on perception-type benchmarks are small (VideoMME +1.3)
- contribution : (authors' claim) (1) shows quantitatively that always-think is inefficient on video, (2) a simple method that switches between think/no-think using inference-time confidence, plus a dual-answer reward that trains without SFT. The real contribution seems to be sidestepping the collapse of training-based auto-think with an inference-side early exit
- etc. : the think ratio is determined entirely by a single hyperparameter, $\tau{=}0.97$. Interesting that it splits cleanly by task on its own: perception (MVBench 25%, VideoMME 11-40%) vs reasoning (VideoMMMU 51-53%). The authors argue that training-based auto-think collapses because video has almost no "must-think" samples
Details
architecture
- backbone: experiments on both Qwen2.5-VL-7B-Instruct and Qwen3-VL-8B-Instruct
- vision encoder is frozen; only the projector + LLM are trained
- video input: max 256 frames / 4096 video tokens at training time. At inference, Qwen3-VL goes up to 128K context
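The freeze/train split above can be sketched as a filter over parameter names. This is a minimal illustration only: the prefixes `vision_encoder.`, `projector.`, and `llm.` are hypothetical placeholders, not the actual module names of the Qwen checkpoints.

```python
# Hypothetical parameter-name prefixes (assumption for illustration);
# real checkpoints may name these modules differently.
FROZEN_PREFIXES = ("vision_encoder.",)
TRAINABLE_PREFIXES = ("projector.", "llm.")


def is_trainable(param_name: str) -> bool:
    """Return True for projector/LLM weights, False for everything else."""
    if param_name.startswith(FROZEN_PREFIXES):
        return False
    return param_name.startswith(TRAINABLE_PREFIXES)


def split_params(param_names):
    """Partition names into (trainable, frozen) lists, preserving order."""
    trainable = [n for n in param_names if is_trainable(n)]
    frozen = [n for n in param_names if not is_trainable(n)]
    return trainable, frozen


names = [
    "vision_encoder.blocks.0.attn.qkv.weight",
    "projector.mlp.0.weight",
    "llm.layers.0.self_attn.q_proj.weight",
]
trainable, frozen = split_params(names)
```

With PyTorch, the same predicate could be applied as `param.requires_grad = is_trainable(name)` while iterating over `model.named_parameters()`.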
method: thinking once, answering twice
The model generates three segments in a single pass:
- `\boxed{a_1}`: immediate answer without reasoning
- `<think>...</think>`: internal CoT
- `\boxed{a_2}`: reviewed answer
inference-time early exit: right after $a_1$ is generated, the length-normalized mean log prob of its tokens is used as the confidence; if it is $\geq \tau$ ($\tau{=}0.97$), generation stops there and $a_1$ is returned. Otherwise generation continues through think + $a_2$.
fallback: on hard questions the model cannot answer directly, so it is trained to output "Let's analyze the problem step by step" in the $a_1$ slot. Once this is emitted, generation is naturally forced into the think stage (i.e. the fallback is absorbed into the inference-time early-exit logic).
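The early-exit rule can be sketched as follows. Two assumptions are made here and are not confirmed by the notes: since log probs are non-positive while the threshold is 0.97, the sketch exponentiates the mean token log prob so the confidence lives on a probability scale; and the fallback string is checked explicitly, whereas the paper may simply rely on its low confidence. Function names are hypothetical.

```python
import math

FALLBACK = "Let's analyze the problem step by step"


def confidence(token_logprobs):
    """Length-normalized confidence: exp of the mean token log-probability.

    Assumption: exponentiating makes the score comparable to tau = 0.97.
    """
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def should_exit_early(a1_text, a1_token_logprobs, tau=0.97):
    """Return True if a_1 can be returned without generating think + a_2."""
    if FALLBACK in a1_text:  # explicit check is an assumption of this sketch
        return False
    return confidence(a1_token_logprobs) >= tau


# Toy usage with made-up log-probabilities.
confident = should_exit_early("B", [math.log(0.99)] * 3)
unsure = should_exit_early("B", [math.log(0.90)] * 3)
```

In a real decoding loop the check would run once the closing brace of `\boxed{a_1}` is produced, and generation would be stopped or continued accordingly.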
training recipe
- stage: RL only, no SFT. GRPO is applied directly without a cold start
- GRPO: G=16 rollouts, temperature 1.0, lr $1{\times}10^{-6}$, $\beta{=}0.01$ (KL), batch 256, 1 epoch
- dual-answer reward: $R = w_1 R_{task}(a_1) + w_2 R_{task}(a_2) + \lambda R_{fmt} + \alpha R_{fallback}$
- $w_1{=}0.9, w_2{=}1.1$: weights the reviewed answer slightly higher
- $\alpha{=}0.3$: fallback bonus. Rewards handing hard questions over to the think stage
- compute: 32 H100, ~35h
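The recipe above can be sketched as a reward function plus the group-relative advantage used by GRPO. Several pieces are assumptions of this sketch, not details from the notes: $R_{task}$ is taken as exact match, $R_{fmt}$ as 1 when the full `\boxed{a_1} <think>...</think> \boxed{a_2}` layout parses, $R_{fallback}$ as 1 when $a_1$ is the fallback string and $a_2$ is correct, malformed outputs get 0, and the population standard deviation is used for normalization.

```python
import re
import statistics

W1, W2, LAM, ALPHA = 0.9, 1.1, 1.0, 0.3
FALLBACK = "Let's analyze the problem step by step"
# Expected layout: \boxed{a_1} <think>...</think> \boxed{a_2}
# Limitation of this sketch: answers containing '}' are not handled.
PATTERN = re.compile(
    r"^\s*\\boxed\{(?P<a1>.*?)\}\s*<think>.*?</think>\s*\\boxed\{(?P<a2>.*?)\}\s*$",
    re.DOTALL,
)


def dual_answer_reward(response: str, gold: str) -> float:
    """R = w1*R_task(a1) + w2*R_task(a2) + lambda*R_fmt + alpha*R_fallback."""
    m = PATTERN.match(response)
    if m is None:
        return 0.0  # assumption: malformed outputs earn nothing
    a1, a2 = m.group("a1").strip(), m.group("a2").strip()
    r1 = float(a1 == gold)  # assumption: exact-match task reward
    r2 = float(a2 == gold)
    r_fallback = float(a1 == FALLBACK and r2 == 1.0)  # assumed definition
    return W1 * r1 + W2 * r2 + LAM * 1.0 + ALPHA * r_fallback


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize rewards within one rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

For grounding tasks the exact-match check would presumably be replaced by an IoU-style score; the weighting structure stays the same.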
data
- 137K → filter → 83K used for training
- text 6.4K (DAPO-Math) / image 27.5K (ViRL, ThinkLite-Hard) / video 49.4K (Video-R1, TVBench, STI-Bench, temporal grounding)
- (open question) the exact filtering criterion is unclear (difficulty-based or confidence-based?); need to re-read the main text
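As a quick sanity check on the mix, the three subsets should add up to roughly the stated 83K. The snippet below only recomputes numbers already listed above; variable names are arbitrary.

```python
# Subset sizes in thousands of samples, as listed in the notes.
mix_k = {"text": 6.4, "image": 27.5, "video": 49.4}

total_k = sum(mix_k.values())  # should land near the stated 83K
shares = {k: round(100 * v / total_k, 1) for k, v in mix_k.items()}
retention = round(100 * 83 / 137, 1)  # percent of the 137K pool kept after filtering

print(f"total = {total_k:.1f}K, shares (%) = {shares}, kept = {retention}%")
```

Video makes up a bit under 60% of the mix, with image at about a third and text under 8%.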
result
On video QA (Table 3):
- VideoMME: 67.3 (Qwen2.5) / 71.7 (Qwen3); mostly perception, so the gain is small (+1.3)
- VideoMMMU: 58.6 (+3.9), 65.0 (Qwen3); large gain on reasoning-type benchmarks
- MVP: 39.4 (+2.9 over Video-R1)
- avg response length is 44 tokens vs 149~386 tokens for prior reasoning models, i.e. ~3.3× shorter
temporal grounding (Table 4):
- Charades-STA mIoU 60.0 / 63.7 (Qwen3)
- ActivityNet and NExT-GQA show a similar trend
image (Table 5): although the method targets video, it also generalizes to image reasoning benchmarks (MathVista, MathVision, MMMU, etc.). Note that the RL mix does include 27.5K image samples.
ablation
- training strategy (Table 6): compares SFT only / RL no-think / RL CoT / VideoAuto-R1. RL CoT reaches 56.4 on reasoning while spending 149 tokens, whereas VideoAuto-R1 reaches 58.6 with 44 tokens
- training-based vs inference-based auto-think (Table 7): training-based collapses to a single mode (all think or all no-think). Inference-side early exit is stable. Authors' claim: "must-think" samples are rare in video, so the training signal is insufficient
- dual-answer weight (Table 9): the asymmetric $w_1{:}w_2 = 0.9{:}1.1$ beats equal weighting
- fallback bonus: increasing $\alpha$ improves performance on reasoning benchmarks
- threshold $\tau$ (Figure 3): $\tau{=}0.97$ is a robust default. Raising $\tau$ raises the think ratio, but the accuracy gain on perception benchmarks diminishes
- think-ratio (per bench): MVBench 25/31%, VideoMME 40/11%, VideoMMMU 51/53% → perception mostly stays around ~30% or below (VideoMME's 40% is the exception), while reasoning lands at 50% or above, without any explicit supervision
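The relationship between $\tau$ and the think ratio can be illustrated with a small sweep. The confidence values below are made up for illustration and are not from the paper; `think_ratio` is a hypothetical helper.

```python
def think_ratio(confidences, tau):
    """Fraction of samples whose first-answer confidence falls below tau,
    i.e. samples that continue into the think stage."""
    if not confidences:
        return 0.0
    return sum(c < tau for c in confidences) / len(confidences)


# Made-up first-answer confidences for eight samples (illustrative only).
confs = [0.999, 0.99, 0.98, 0.96, 0.90, 0.75, 0.60, 0.995]

sweep = {tau: think_ratio(confs, tau) for tau in (0.90, 0.95, 0.97, 0.99)}
for tau, ratio in sweep.items():
    print(f"tau={tau:.2f} -> think ratio {ratio:.3f}")
```

Because the ratio is the empirical CDF of the confidences evaluated at $\tau$, it can only grow as $\tau$ increases; how much accuracy that extra thinking buys is the empirical question Figure 3 addresses.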