Image

paper

TL;DR

  • I read this because.. : video + think
  • task : video reasoning
  • Problem: CoT isn’t always helpful in video QA—how can we train the model to achieve a good balance?
  • idea: During training, always have the model produce two responses: an immediate answer and an answer after thinking. At inference, compute a confidence score from the log probability of the initial answer tokens, and enable the “think” phase only when that confidence is low.
  • input/output : {video, question} -> {initial boxed answer, (optional reasoning), reviewed boxed answer}
  • Architecture: Qwen2.5-VL-7B-Instruct / Qwen3-VL-8B-Instruct. Visual encoder frozen; only the projector and LLM were trained. Maximum of 4096 video tokens and 256 frames.
  • Objective: GRPO. Start RL immediately without a cold-start SFT.
  • Baseline: Video-R1 (primarily spatial learning), Time-R1, VideoChat-R1, VideoChat-R1.5, VITAL, LongVILA-R1, LOVE-R1 / base Qwen2.5-VL-7B, Qwen3-VL-8B.
  • Data: 83K samples for RL (filtered from 137K by removing samples where all 8 rollouts were correct or all were incorrect). text 6.4K (DAPO-Math) / image 27.5K (ViRL, ThinkLite-Hard) / video 49.4K (Video-R1, TVBench, STI-Bench, MMR-VBench, Charades-STA, ActivityNet, Time-R1, NExT-GQA)
  • evaluation : VideoMME, MVBench, LongVideoBench, MMVU, VideoMMMU, MVP, Charades-STA, ActivityNet, NExT-GQA + image bench (MathVista, MathVision, MathVerse, MMMU, MMMU-Pro, MM-Vet).
  • Result: A clear win in terms of inference efficiency. Results are mixed in terms of accuracy. On reasoning benchmarks like VideoMMMU, the “think” activation rate is 51%, with a gain of +3.9. On LongVideoBench, MMVU, and VideoMME, performance is mostly flat or even slightly lower.
  • Contribution: The ablation study shows that “always-think” is not the answer. That said, auto mode is better understood as an efficiency gain than as an improvement in absolute accuracy. The framing as “confidence-based early-exit gating” is elegant.
  • etc.: I wonder whether this also makes training more efficient.
  • CVPR 2026. It’s a bit surprising that they didn’t use cold-start SFT—it seems they maintained instruction following by simply using the instruction-tuned model as-is. KAUST group.

Details

Image

motivation

Image
  • The motivation: when trained video LLMs are evaluated on benchmarks, the paper finds that answering directly can outperform CoT.
    • benchmarks
      • VideoMME
      • VideoMMMU: lecture videos. In practice it is very close to a text reasoning benchmark.
      • LongVideoBench: hmm, CoT underperforms on LongVideoBench too. => The benchmark seems to mainly test perception and relations. Also, looking at the models below, some of them have no long videos in their training data.
        • Image
      • MMVU: unlike VideoMMMU, the videos aren’t lectures, but the benchmark still requires knowledge. I’m not really sure why CoT scores so low on this one.
      • Charades-STA: temporal grounding task
    • models

method

Image
  • two-pass decoding, with the format explicitly defined as answer → think → answer
  • 1st pass: The system prompt is set to “FIRST: Output your initial answer inside the first \boxed{...} without any analysis or explanations.” If the model cannot answer directly, it is instructed to output \boxed{Let's analyze the problem step by step.}; that is, the model expresses its decision to defer in tokens.
  • confidence: The length-normalized mean log probability of the answer tokens within the first \boxed{}. Gating is performed by comparing this value to the threshold $\tau$.
  • If the confidence is high and it is not a fallback string → early exit (think omitted).
    • Image
  • If confidence is low or it is a fallback string → THEN: generate a trace and place the reviewed answer $a_2$ in the second \boxed{}.
  • No “think” or “no-think” labeling during training — gating is determined only at inference time. Existing approaches like AdaptThink explicitly mix “think” and “no-think” samples during on-policy training, but this is said to cause issues with data balancing and hyperparameter sensitivity.
  • reward
    • $R = w_1 R_{\text{task}}(a_1) + w_2 R_{\text{task}}(a_2) + \lambda R_{\text{fmt}} + \alpha R_{\text{fallback}}$
    • $w_1 = 0.9, w_2 = 1.1$: since $w_2 > w_1 \geq 0$, the reviewed answer carries more weight, which encourages refinement. The 0.9:1.1 ratio is specified in the main text.
    • $\lambda = 1.0$: reward for keeping the “answer → think → answer” format
    • $\alpha = 0.3$ (fallback bonus): an additional reward when $a_1$ is exactly “Let’s analyze the problem step by step” and $a_2$ is correct. In other words, it rewards the model for correctly deciding that “this requires reasoning.”
    • task reward
      • QA: binary {0, 1} (math-verify or string match)
      • temporal grounding: continuous [0, 1] (temporal IoU)
      • grounding QA: both rewards combined, range [0, 2]
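A minimal sketch of how the reward terms above combine, to make the weighting concrete. The weights come from these notes; everything else is an assumption for illustration: `R_fmt` and `R_fallback` are treated as 0/1 indicators, and the helper names (`temporal_iou`, `total_reward`, `RewardWeights`) are hypothetical, not the paper's code.

```python
from dataclasses import dataclass

# Fallback string as quoted in the notes.
FALLBACK = "Let's analyze the problem step by step."


@dataclass(frozen=True)
class RewardWeights:
    w1: float = 0.9     # weight on the initial answer a_1
    w2: float = 1.1     # weight on the reviewed answer a_2
    lam: float = 1.0    # weight on the format reward
    alpha: float = 0.3  # weight on the fallback bonus


def temporal_iou(pred, gt):
    """IoU of two (start, end) segments; the result lies in [0, 1]."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def total_reward(r_task_a1, r_task_a2, format_ok, a1_text, a2_correct,
                 w=RewardWeights()):
    """R = w1*R_task(a1) + w2*R_task(a2) + lam*R_fmt + alpha*R_fallback.

    Assumption: R_fmt and R_fallback are binary indicators.
    """
    is_fallback = a1_text.strip() == FALLBACK
    r_fallback = 1.0 if (is_fallback and a2_correct) else 0.0
    r_fmt = 1.0 if format_ok else 0.0
    return (w.w1 * r_task_a1 + w.w2 * r_task_a2
            + w.lam * r_fmt + w.alpha * r_fallback)
```

Under these assumptions, a QA sample with both answers correct scores about 3.0, a fallback first answer followed by a correct second answer scores about 2.4, and a wrong first answer followed by a correct second answer scores about 2.1. So deferring is rewarded more than guessing wrong, but less than answering correctly right away.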

If this training is successful, the model will learn to consistently produce a “concise first answer + reasoned second answer” pattern.
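The inference-time gating from the method notes can be sketched roughly as below. This is an illustration under assumptions, not the authors' implementation: the function names are hypothetical, the threshold value is made up (the notes don't give $\tau$), and treating confidence exactly equal to $\tau$ as "confident" is my choice.

```python
# Fallback string as quoted in the notes.
FALLBACK = "Let's analyze the problem step by step."


def answer_confidence(token_logprobs):
    """Length-normalized mean log probability of the first boxed answer's tokens."""
    if not token_logprobs:
        return float("-inf")  # no answer tokens: treat as not confident
    return sum(token_logprobs) / len(token_logprobs)


def should_think(first_answer, token_logprobs, tau):
    """Return True if the second pass (think + reviewed answer) should run."""
    if first_answer.strip() == FALLBACK:
        return True  # the model explicitly deferred
    return answer_confidence(token_logprobs) < tau


# Hypothetical threshold in log-probability space, for illustration only.
TAU = -0.1

if should_think("B", [-0.01, -0.03], TAU):
    print("continue decoding: THEN think, then second boxed answer")
else:
    print("early exit: return the first boxed answer")
```

Since gating happens only at inference time, $\tau$ can be tuned after training to trade accuracy against response length without retraining.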

data

  • 137K → 83K (removed cases where all 8 rollouts were either all correct or all incorrect)
  • text 6.4K — DAPO-Math
  • image 27.5K — ViRL, ThinkLite-Hard
  • video 49.4K — Video-R1, TVBench, STI-Bench, MMR-VBench, Charades-STA, ActivityNet, Time-R1, NExT-GQA
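The 137K → 83K filter can be sketched as follows. The sample layout (a dict with a `correct` list holding one boolean per rollout) and the function names are hypothetical; how "correct" is defined for continuous rewards such as temporal IoU isn't covered in these notes, so the sketch assumes binary correctness.

```python
def is_informative(rollout_correct):
    """True if the rollouts are neither all correct nor all incorrect."""
    n_correct = sum(1 for c in rollout_correct if c)
    return 0 < n_correct < len(rollout_correct)


def filter_informative(samples):
    """Drop samples whose rollouts (8 per sample in the notes) all agree."""
    return [s for s in samples if is_informative(s["correct"])]


# Toy usage with 8 rollouts per sample.
samples = [
    {"id": "too_easy", "correct": [True] * 8},
    {"id": "too_hard", "correct": [False] * 8},
    {"id": "mixed", "correct": [True, False] * 4},
]
kept = filter_informative(samples)
print([s["id"] for s in kept])
```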

training recipe

  • GRPO, 32× H100, 35 hours, 1 epoch, batch size 256
  • KL penalty coefficient $\beta = 0.01$ (no dropout)
  • max 4096 video tokens / max 256 frames
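For reference, a sketch of the group-relative advantage in the standard GRPO formulation: each rollout's reward is normalized by the mean and standard deviation of its own group. Whether the paper uses exactly this form or a variant isn't stated in these notes; the `eps` value and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and std of its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


# One correct rollout out of 8: it gets a positive advantage, the rest negative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

# All rollouts identical: every advantage is zero, so there is no gradient signal.
uniform = group_relative_advantages([1.0] * 8)
```

The uniform case is consistent with the data filtering in the data section: a sample whose 8 rollouts all receive the same reward contributes nothing to the policy update.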

result

Image
  • Performance on perception benchmarks is mostly flat or slightly lower. Compared to the Qwen3-VL-8B base model, VideoMME goes 72.5 → 71.7 and LongVideoBench goes 67.6 → 67.4. On perception- and relation-focused benchmarks like LongVideoBench, “thinking” doesn’t seem to help much. LongVideoBench is defined around referring reasoning, but frame-grounded perception probably ends up playing the larger role.
  • Improvements show up on VideoMMMU and Charades-STA (temporal grounding). In some cases “think” helps directly, e.g. Charades-STA reaches 59.8.
  • VideoAuto-R1’s own think ratio is 41% / average response length is 44 tokens — the efficiency gain is clear.
  • In terms of accuracy, though, the fair claim is not that it “performs better” than “always-think” but that it gives “much shorter responses with similar accuracy.”
Image Image