paper , page

TL;DR

  • I read this because.. :
  • task : video understanding / RL (auto-think)
  • Problem : In video QA, always-think (CoT) uses more tokens yet gives the same or worse accuracy (Table 1). The goal is to let the model decide when to think.
  • idea : “thinking once, answering twice” - first emit an immediate answer \boxed{a_1}; if its confidence (length-normalized mean log prob) is at or above the threshold $\tau{=}0.97$, exit early. Otherwise, generate <think>...</think> and answer again with \boxed{a_2}.
  • input/output : {video (≤256 frames), question} -> a_1 (+ optional <think> + a_2)
  • architecture : Qwen2.5-VL-7B-Instruct / Qwen3-VL-8B-Instruct. vision encoder frozen, projector + LLM trained. inference with Qwen3-VL uses up to 128K context.
  • objective : RL only (no cold-start SFT). GRPO + dual-answer reward $R = w_1 R_{task}(a_1) + w_2 R_{task}(a_2) + \lambda R_{fmt} + \alpha R_{fallback}$, $w_1{=}0.9, w_2{=}1.1, \lambda{=}1, \alpha{=}0.3$
  • baseline : Qwen2.5-VL / Qwen3-VL base, Video-R1, Time-R1, VideoChat-R1.5, Temporal-RLT, Video-RFT, Video-RTS, VITAL, LongVILA-R1, LOVE-R1, training-based auto-think (AdaptThink-style)
  • data : RL 83K (filtered from 137K). text 6.4K (DAPO-Math) / image 27.5K (ViRL, ThinkLite-Hard) / video 49.4K (Video-R1, TVBench, STI-Bench, temporal grounding)
  • evaluation : video QA - VideoMME, MVBench, LongVideoBench, MMVU, VideoMMMU, MVP / temporal grounding - Charades-STA, ActivityNet, NExT-GQA / image - MathVista, MathVision, MathVerse, MMMU, MMMU-Pro, MM-Vet
  • result : VideoMME 67.3 (Qwen2.5) / 71.7 (Qwen3), VideoMMMU 58.6 (+3.9 over Qwen baseline) / 65.0, MVP 39.4 (+2.9 over Video-R1), Charades-STA mIoU 60.0 / 63.7. avg response 44 tokens (vs 149-386). smaller gain on perception-heavy benchmarks (VideoMME +1.3).
  • Contribution : (Author claim) (1) Quantitatively demonstrates that always-think is inefficient in video, (2) Simple method for separating think/no-think with inference-time confidence + dual-answer reward, learned without SFT. The real contribution seems to be bypassing the collapse of training-based auto-think with inference-side early exit.
  • etc. : The hyperparameter $\tau{=}0.97$ determines the think ratio. Interestingly, the ratio adapts to each task automatically: perception (MVBench 25%, VideoMME 11-40%) vs reasoning (VideoMMMU 51-53%). The authors claim that training-based auto-think collapses because video data contains few “must-think” samples.

Details

architecture

  • backbone: Qwen2.5-VL-7B-Instruct and Qwen3-VL-8B-Instruct, both used in the experiments
  • vision encoder is frozen; only the projector + LLM are trained
  • video input: max 256 frames / 4096 video tokens during training. at inference, Qwen3-VL supports up to 128K context.

method - thinking once, answering twice

The model generates three segments in a single output sequence:

  1. \boxed{a_1} - instant answer without reasoning
  2. <think>...</think> - internal CoT
  3. \boxed{a_2} - reviewed answer

early-exit at inference: immediately after generating $a_1$, take the length-normalized mean log prob of its tokens as the confidence; if it is $\geq \tau$ ($\tau{=}0.97$), stop there and return $a_1$. Otherwise, continue generating through think + $a_2$.

Fallback: when the model cannot answer a hard problem directly, it learns to output "Let's analyze the problem step by step" in place of $a_1$, which naturally forces the think step (i.e., fallback is also absorbed into the early-exit logic at inference).
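
The early-exit rule above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation. A mean log prob is never positive, so comparing it against $\tau{=}0.97$ only makes sense on a probability scale; the sketch therefore exponentiates the mean (an assumption). The explicit fallback-phrase check, the function names, and the two callables `first_pass` and `continue_with_thinking` are all hypothetical stand-ins for the real decoding loop.

```python
import math
from typing import Callable, List, Tuple

TAU = 0.97  # threshold reported in the notes
FALLBACK_PREFIX = "Let's analyze the problem step by step"


def confidence(token_logprobs: List[float]) -> float:
    """Length-normalized mean log prob of the a_1 tokens, mapped to (0, 1].

    Assumption: the mean log prob is exponentiated so it can be compared
    against tau = 0.97 (a raw log prob is always <= 0).
    """
    if not token_logprobs:
        return 0.0
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_logprob)


def should_exit_early(a1_text: str, a1_logprobs: List[float], tau: float = TAU) -> bool:
    """Return True if a_1 can be returned without generating <think> + a_2."""
    # Assumption: a fallback phrase in place of a_1 always continues to thinking.
    if a1_text.strip().startswith(FALLBACK_PREFIX):
        return False
    return confidence(a1_logprobs) >= tau


def answer(
    first_pass: Callable[[], Tuple[str, List[float]]],
    continue_with_thinking: Callable[[str], str],
    tau: float = TAU,
) -> Tuple[str, bool]:
    """Run the two-stage decode; returns (final answer, whether thinking was used)."""
    a1_text, a1_logprobs = first_pass()
    if should_exit_early(a1_text, a1_logprobs, tau):
        return a1_text, False
    # Hypothetical callable: generates <think>...</think> and returns a_2.
    return continue_with_thinking(a1_text), True
```

With this split, the think ratio on a benchmark is simply the fraction of samples for which `should_exit_early` returns False.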

training recipe

  • stage: RL only. no SFT. no cold-start. straight GRPO.
  • GRPO: G=16 rollouts, temperature 1.0, lr $1{\times}10^{-6}$, $\beta{=}0.01$ (KL), batch 256, 1 epoch
  • dual-answer reward: $R = w_1 R_{task}(a_1) + w_2 R_{task}(a_2) + \lambda R_{fmt} + \alpha R_{fallback}$
  • $w_1{=}0.9, w_2{=}1.1$ → slightly favors the reviewed answer
  • $\alpha{=}0.3$ → fallback bonus; rewards jumping to the think phase on hard problems
  • compute: 32 H100, ~35h
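
To make the reward arithmetic concrete, here is a small sketch of the dual-answer reward with the weights listed above, followed by the standard group-relative advantage used in GRPO (reward minus group mean, divided by group standard deviation). The assumptions are mine: task and format rewards are treated as binary, $R_{fallback}$ is passed in as a 0/1 flag because the notes do not spell out its exact trigger condition, and the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List

# Weights as listed in the notes.
W1, W2, LAMBDA, ALPHA = 0.9, 1.1, 1.0, 0.3


def dual_answer_reward(
    a1_correct: bool,
    a2_correct: bool,
    format_ok: bool,
    fallback_bonus: bool,
) -> float:
    """R = w1*R_task(a1) + w2*R_task(a2) + lambda*R_fmt + alpha*R_fallback.

    Assumption: every component is binary (0 or 1); the exact condition
    under which the fallback bonus fires is left to the caller.
    """
    return (
        W1 * float(a1_correct)
        + W2 * float(a2_correct)
        + LAMBDA * float(format_ok)
        + ALPHA * float(fallback_bonus)
    )


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one rollout group (G = 16 in the notes)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Both answers right, well-formed output, no fallback.
    print(dual_answer_reward(True, True, True, False))
    # Fallback in place of a_1, reviewed answer right, bonus granted.
    print(dual_answer_reward(False, True, True, True))
```

Under these assumptions a rollout that falls back and then answers correctly earns 1.1 + 1.0 + 0.3 = 2.4, more than a rollout that answers directly but wrongly and then corrects itself (1.1 + 1.0 = 2.1), which is one way the bonus could push hard problems toward the think phase.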

data

  • training data: 137K → filter → 83K
  • text 6.4K (DAPO-Math) / image 27.5K (ViRL, ThinkLite-Hard) / video 49.4K (Video-R1, TVBench, STI-Bench, temporal grounding)
  • (not found in the paper so far) what exactly the filtering criterion is (difficulty-based or confidence-based) - need to re-read the text

result

video QA (Table 3):

  • VideoMME: 67.3 (Qwen2.5) / 71.7 (Qwen3); small gain (+1.3) because the benchmark is perception-driven
  • VideoMMMU: 58.6 (+3.9) (Qwen2.5) / 65.0 (Qwen3) - large gains in the reasoning family
  • MVP: 39.4 (+2.9 over Video-R1)
  • avg response length 44 tokens vs 149-386 tokens for prior reasoning models → at least ~3.3× shorter

temporal grounding (Table 4):

  • Charades-STA mIoU 60.0 / 63.7 (Qwen3)
  • ActivityNet and NExT-GQA show similar trends

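image (Table 5): the model also holds up on image reasoning benchmarks (MathVista, MathVision, MMMU, etc.). Note that the RL mix listed above includes 27.5K image samples, so this is not purely video-to-image transfer.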

ablation

  • training strategy (Table 6): compares SFT only / RL no-think / RL CoT / VideoAuto-R1. RL CoT scores 56.4 on reasoning with 149 tokens, while VideoAuto-R1 scores 58.6 with 44 tokens.
  • training-based vs inference-based auto-think (Table 7): training-based collapses to a single mode (all think or all no-think). inference-side early exit is stable. Author’s claim: lack of training signal due to sparse “must-think” samples in video.
  • dual-answer weight (Table 9): $w_1{:}w_2 = 0.9{:}1.1$; asymmetric weighting beats equal weighting
  • fallback bonus: increasing $\alpha$ improves reasoning-benchmark performance
  • threshold $\tau$ (Figure 3): $\tau{=}0.97$ is a robust default. $\tau$ ↑ → think ratio ↑, but the accuracy gain diminishes on perception benchmarks
  • think-ratio (per bench): MVBench 25/31%, VideoMME 40/11%, VideoMMMU 51/53% - auto-adapts to mostly ~30% or less for perception (VideoMME reaches 40% in one setting) and 50% or more for reasoning
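
The relationship between $\tau$ and the think ratio in the last two bullets can be illustrated with a small sweep. This is a sketch only: the records below are made-up toy values, not numbers from the paper, and the helper names are hypothetical. Each record holds the confidence of $a_1$ plus whether $a_1$ and $a_2$ are correct; a sample "thinks" when its confidence falls below $\tau$, in which case $a_2$ is the final answer.

```python
from typing import List, Sequence, Tuple

# (confidence of a_1, a_1 correct?, a_2 correct?) -- toy values for illustration only.
Record = Tuple[float, bool, bool]


def think_ratio(records: Sequence[Record], tau: float) -> float:
    """Fraction of samples whose a_1 confidence is below tau (they continue to think)."""
    if not records:
        raise ValueError("need at least one record")
    return sum(conf < tau for conf, _, _ in records) / len(records)


def accuracy_at(records: Sequence[Record], tau: float) -> float:
    """Accuracy when a_1 is returned on early exit and a_2 otherwise."""
    if not records:
        raise ValueError("need at least one record")
    hits = sum((a1 if conf >= tau else a2) for conf, a1, a2 in records)
    return hits / len(records)


def sweep(records: Sequence[Record], taus: Sequence[float]) -> List[Tuple[float, float, float]]:
    """Return (tau, think ratio, accuracy) for each threshold."""
    return [(tau, think_ratio(records, tau), accuracy_at(records, tau)) for tau in taus]


if __name__ == "__main__":
    toy = [
        (0.99, True, True),
        (0.98, True, True),
        (0.95, False, True),
        (0.60, False, True),
    ]
    for tau, ratio, acc in sweep(toy, [0.50, 0.97, 1.00]):
        print(f"tau={tau:.2f} think_ratio={ratio:.2f} accuracy={acc:.2f}")
```

By construction the think ratio is non-decreasing in $\tau$; whether accuracy keeps rising depends on how often $a_2$ actually fixes a low-confidence $a_1$, which matches the note that the gain flattens on perception benchmarks.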