[225] VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice

paper

TL;DR

I read this because.. : video + think
task : video reasoning
Problem: CoT isn’t always helpful in video QA—how can we train the model to achieve a good balance?
idea: During training, always have the model produce two answers: an immediate answer and a reviewed answer after thinking. At inference, compute a confidence score from the log probability of the first answer's tokens, and run the “think” phase only when that confidence is low.
input/output : {video, question} -> {initial boxed answer, (optional reasoning), reviewed boxed answer}
Architecture: Qwen2.5-VL-7B-Instruct / Qwen3-VL-8B-Instruct. Visual encoder frozen; only the projector and LLM were trained. Maximum of 4096 video tokens and 256 frames.
Objective: GRPO. Start RL immediately without a cold-start SFT.
Baseline: Video-R1 (primarily spatial learning), Time-R1, VideoChat-R1, VideoChat-R1.5, VITAL, LongVILA-R1, LOVE-R1 / base Qwen2.5-VL-7B, Qwen3-VL-8B.
Data: RL 83K (filtered from 137K by removing samples where all 8 rollouts were correct or all were incorrect). text 6.4K (DAPO-Math) / image 27.5K (ViRL, ThinkLite-Hard) / video 49.4K (Video-R1, TVBench, STI-Bench, MMR-VBench, Charades-STA, ActivityNet, Time-R1, NExT-GQA)
evaluation : VideoMME, MVBench, LongVideoBench, MMVU, VideoMMMU, MVP, Charades-STA, ActivityNet, NExT-GQA + image bench (MathVista, MathVision, MathVerse, MMMU, MMMU-Pro, MM-Vet).
Result: A clear win in terms of inference efficiency. Results are mixed in terms of accuracy. On reasoning benchmarks like VideoMMMU, the “think” activation rate is 51%, with a gain of +3.9. On LongVideoBench, MMVU, and VideoMME, performance is mostly flat or even slightly lower.
Contribution: The ablation study demonstrates that “always-think” is not the answer. However, it is more accurate to view auto mode as an efficiency gain rather than an improvement in absolute accuracy. The framing of “confidence-based early-exit gating” is elegant.
etc.: I wonder whether it also makes training more efficient.
CVPR 2026. It’s a bit surprising that they didn’t use cold-start SFT—it seems they maintained instruction following by simply using the instruction-tuned model as-is. KAUST group.

Details

motivation

The motivation: when existing reasoning-trained video LLMs are evaluated on benchmarks, answering directly (without CoT) often matches or beats answering with CoT.
- benchmarks
  - VideoMME
  - VideoMMMU: Lecture videos. Essentially very similar to a text reasoning benchmark.
  - LongVideoBench: CoT underperforms here too. The benchmark seems to focus mainly on perception and relations. Also, some of the models below have no long videos in their training data.
  - MMVU: Unlike VideoMMMU, the videos aren’t lectures, but the benchmark still requires knowledge. I’m not sure why CoT scores so low on this one.
  - Charades-STA: temporal grounding task
- models
  - Video-R1 / Qwen2.5-VL-7B / Video-R1-CoT-165k (SFT / distil from Qwen2.5-VL-72B-Instruct) + Video-R1-260k (RL) / https://github.com/tulerfeng/Video-R1
  - Time-R1 / Qwen2.5-VL-7B / temporal Grounding
  - VideoChat-R1 / spatio-temporal perception
  - VideoChat-R1.5 / VTTS-80K (15K temporal + 30K spatial clues, 80K Think annotations, 50K QA), Iterative Perception + GRPO

method

Two-pass decoding, with the output format explicitly defined as answer → think → answer.
- 1st pass: the system prompt says “FIRST: Output your initial answer inside the first \boxed{...} without any analysis or explanations.” If the model cannot produce an answer, it is instructed to output \boxed{Let's analyze the problem step by step.} instead. That is, the model must express its intention to defer in tokens.
- confidence: the length-normalized mean log probability of the answer tokens inside the first \boxed{}. Gating compares this value against a threshold $\tau$.
  - If the confidence is high and the answer is not the fallback string → early exit (think omitted).
  - If the confidence is low or the answer is the fallback string → THEN: generate a reasoning trace and place the reviewed answer $a_2$ in the second \boxed{}.
- No “think” / “no-think” labels during training; gating is decided only at inference time. Existing approaches such as AdaptThink explicitly mix “think” and “no-think” samples during on-policy training, which is said to cause data-balancing issues and hyperparameter sensitivity.
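
The gating rule described above can be sketched in a few lines. This is only an illustration under assumptions, not the authors' code: the function names `answer_confidence` and `should_think`, the way per-token log probabilities are passed in, and the example threshold `tau=-0.1` are all hypothetical.

```python
# Fallback string from the system prompt (trailing period stripped for comparison).
FALLBACK = "Let's analyze the problem step by step"


def answer_confidence(token_logprobs):
    """Length-normalized mean log probability of the first-answer tokens."""
    if not token_logprobs:
        # No answer tokens at all: treat as minimally confident.
        return float("-inf")
    return sum(token_logprobs) / len(token_logprobs)


def should_think(first_answer, token_logprobs, tau):
    """Return True when the think phase should run (assumed gating rule)."""
    # The model explicitly deferred by emitting the fallback string.
    if first_answer.strip().rstrip(".") == FALLBACK:
        return True
    # Otherwise think only when the confidence falls below the threshold.
    return answer_confidence(token_logprobs) < tau


# Hypothetical usage: a confident "B" exits early, an unsure "B" triggers thinking.
print(should_think("B", [-0.01, -0.02], tau=-0.1))
print(should_think("B", [-1.2, -0.9], tau=-0.1))
```

Because the decision only needs the log probabilities of the short first answer, the gate adds almost no cost when the model exits early.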
reward
- $R = w_1 R_{\text{task}}(a_1) + w_2 R_{\text{task}}(a_2) + \lambda R_{\text{fmt}} + \alpha R_{\text{fallback}}$
  - $w_1 = 0.9, w_2 = 1.1$: since $w_2 > w_1 \geq 0$, the reviewed answer is weighted more heavily, which encourages refinement. The 0.9:1.1 ratio is specified in the main text.
  - $\lambda = 1.0$: reward for maintaining the “answer → think → answer” format.
  - $\alpha = 0.3$ (fallback bonus): an additional reward when $a_1$ is exactly “Let’s analyze the problem step by step” and $a_2$ is the correct answer. In other words, it rewards the model for correctly deciding that “this requires reasoning.”
- task reward
  - QA: binary {0, 1} (math-verify or string match)
  - temporal grounding: continuous [0, 1] (temporal IoU)
  - grounding QA: both rewards summed, range [0, 2]
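
As a rough sketch of how the reward terms above combine, the snippet below computes a temporal IoU and the weighted total. The weights come from the notes; everything else is an assumption made for illustration: the function names, the choice to model $R_{\text{fmt}}$ and $R_{\text{fallback}}$ as 0/1 indicators, and the convention that the caller passes a task reward of 0 for $a_1$ when it is the fallback string.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals in seconds; result lies in [0, 1]."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def total_reward(r_task_a1, r_task_a2, format_ok, a1_is_fallback, a2_correct,
                 w1=0.9, w2=1.1, lam=1.0, alpha=0.3):
    """R = w1*R_task(a1) + w2*R_task(a2) + lam*R_fmt + alpha*R_fallback."""
    # Assumption: format and fallback rewards are simple 0/1 indicators.
    r_fmt = 1.0 if format_ok else 0.0
    r_fallback = 1.0 if (a1_is_fallback and a2_correct) else 0.0
    return w1 * r_task_a1 + w2 * r_task_a2 + lam * r_fmt + alpha * r_fallback


# Hypothetical QA sample: the model defers on a1, then answers a2 correctly.
print(total_reward(0.0, 1.0, format_ok=True, a1_is_fallback=True, a2_correct=True))
```

Under these assumptions, deferring and then answering correctly earns less than answering correctly both times, which suggests the fallback bonus is a partial compensation rather than an incentive to always defer.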

If this training is successful, the model will learn to consistently produce a “concise first answer + reasoned second answer” pattern.

data

137K → 83K (removed cases where all 8 rollouts were either all correct or all incorrect)
text 6.4K — DAPO-Math
image 27.5K — ViRL, ThinkLite-Hard
video 49.4K — Video-R1, TVBench, STI-Bench, MMR-VBench, Charades-STA, ActivityNet, Time-R1, NExT-GQA
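
The 137K → 83K filter can be sketched as below. The notes do not state the rationale, so the explanation here is my reading, based on the standard GRPO group-normalized advantage: when every rollout in a group gets the same reward, all advantages are zero and the sample contributes no gradient. The function names and the binary-reward simplification are assumptions.

```python
from statistics import mean, pstdev


def keep_for_rl(rollout_rewards):
    """Keep a prompt only if its rollouts are neither all correct nor all incorrect.

    Assumes binary task rewards (1 = correct, 0 = incorrect).
    """
    n_correct = sum(1 for r in rollout_rewards if r >= 1.0)
    return 0 < n_correct < len(rollout_rewards)


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages in the style of GRPO: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A mixed group is kept; a uniformly correct group is dropped and has zero advantage.
mixed = [1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0]
solved = [1.0] * 8
print(keep_for_rl(mixed), keep_for_rl(solved))
print(group_advantages(solved))
```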

training recipe

GRPO, 32× H100, 35 hours, 1 epoch, batch size 256
KL penalty coefficient $\beta = 0.01$ (the KL term is kept, not dropped)
4096 video tokens / max 256 frames

result

Performance on perception benchmarks is mostly flat or even slightly lower. Compared to the Qwen3-VL-8B base model, VideoMME goes 72.5 → 71.7 and LongVideoBench goes 67.6 → 67.4. On perception- and relation-focused benchmarks like LongVideoBench, “thinking” doesn’t seem to help much. Although LongVideoBench is defined around referring reasoning, this is likely because frame-grounded perception ultimately plays the larger role.
Improvements were observed on VideoMMMU and Charades-STA (temporal grounding). In some cases “think” helps directly, e.g. Charades-STA at 59.8.
VideoAuto-R1’s own think ratio is 41% / average response length is 44 tokens — the efficiency gain is clear.
However, in terms of accuracy, rather than simply stating that it “performs better” than “always-think,” it would be more accurate to say that it provides “much shorter responses with similar accuracy.”
