TL;DR
- I read this because.. : video + think
- task : video reasoning
- Problem: CoT isn’t always helpful in video QA—how can we train the model to achieve a good balance?
- idea: When training, always have the model give two responses: an immediate answer and a response after thinking. During inference, compute a confidence score from the log probability of the first answer's tokens, and enable the “think” phase only when that confidence is low.
- input/output : {video, question} -> {initial boxed answer, (optional reasoning), reviewed boxed answer}
- Architecture: Qwen2.5-VL-7B-Instruct / Qwen3-VL-8B-Instruct. Visual encoder frozen; only the projector and LLM were trained. Maximum of 4096 video tokens and 256 frames.
- Objective: GRPO. Start RL immediately without a cold-start SFT.
- Baseline: Video-R1 (primarily spatial learning), Time-R1, VideoChat-R1, VideoChat-R1.5, VITAL, LongVILA-R1, LOVE-R1 / base Qwen2.5-VL-7B, Qwen3-VL-8B.
- Data: RL 83K (filtered from 137K by removing samples where all 8 rollouts were correct or all were incorrect). text 6.4K (DAPO-Math) / image 27.5K (ViRL, ThinkLite-Hard) / video 49.4K (Video-R1, TVBench, STI-Bench, MMR-VBench, Charades-STA, ActivityNet, Time-R1, NExT-GQA)
- evaluation : VideoMME, MVBench, LongVideoBench, MMVU, VideoMMMU, MVP, Charades-STA, ActivityNet, NExT-GQA + image bench (MathVista, MathVision, MathVerse, MMMU, MMMU-Pro, MM-Vet).
- Result: A clear win in terms of inference efficiency. Results are mixed in terms of accuracy. On reasoning benchmarks like VideoMMMU, the “think” activation rate is 51%, with a gain of +3.9. On LongVideoBench, MMVU, and VideoMME, performance is mostly flat or even slightly lower.
- Contribution: The ablation study demonstrates that “always-think” is not the answer. However, auto-mode is better viewed as an efficiency gain than as an improvement in absolute performance. The framing of “confidence-based early-exit gating” is elegant.
- etc.: I wonder if it also makes training more efficient.
- CVPR 2026. It’s a bit surprising that they didn’t use cold-start SFT—it seems they maintained instruction following by simply using the instruction-tuned model as-is. KAUST group.
Details
motivation
- The motivation comes from evaluating already-trained video LLMs on benchmarks: answering directly (without CoT) can perform better than answering with CoT.
- benchmarks
- VideoMME
- VideoMMMU: Lecture videos. It is essentially very similar to a text reasoning benchmark.
- LongVideoBench
: Hmm, CoT underperforms on LongVideoBench too. => It seems the benchmark mainly tests perception and relations. Also, looking at the models below, some of them don’t even have long videos in their training data.
- MMVU
: Unlike VideoMMMU, this is a benchmark that requires knowledge even though the videos aren’t lectures. I’m not really sure why the CoT score is so low for this one.
- Charades-STA: temporal grounding task
- models
- Video-R1 / Qwen2.5-VL-7B / Video-R1-CoT-165k (SFT / distil from Qwen2.5-VL-72B-Instruct) + Video-R1-260k (RL) / https://github.com/tulerfeng/Video-R1
- Time-R1 / Qwen2.5-VL-7B / temporal Grounding
- VideoChat-R1 / spatio-temporal perception
- VideoChat-R1.5 / VTTS-80K (15K temporal + 30K spatial clues, 80K Think annotations, 50K QA), Iterative Perception + GRPO
method
- two-pass decoding, with the format explicitly defined as answer → think → answer
- 1st pass: The system prompt is set to “FIRST: Output your initial answer inside the first `\boxed{...}` without any analysis or explanations.” If the model cannot produce an answer, it is instructed to output `\boxed{Let's analyze the problem step by step.}`; that is, the model must express its intention to defer using tokens.
- confidence: The length-normalized mean log probability of the answer tokens within the first `\boxed{}`. Gating is performed by comparing this value to the threshold $\tau$.
- If the confidence is high and it is not a fallback string → early exit (think omitted).
- If confidence is low or it is a fallback string → THEN: generate a trace and place the reviewed answer $a_2$ in the second `\boxed{}`.
- No “think” or “no-think” labeling during training: gating is determined only at inference time. Existing approaches like AdaptThink explicitly mix “think” and “no-think” samples during on-policy training, but this is said to cause issues with data balancing and hyperparameter sensitivity.
- reward
- $R = w_1 R_{\text{task}}(a_1) + w_2 R_{\text{task}}(a_2) + \lambda R_{\text{fmt}} + \alpha R_{\text{fallback}}$
- $w_1 = 0.9, w_2 = 1.1$: since $w_2 > w_1 \geq 0$, the reviewed answer carries more weight, which encourages the second pass to refine the first answer. The 0.9:1.1 ratio is specified in the main text.
- $\lambda = 1.0$: weight of the format reward $R_{\text{fmt}}$, given for maintaining the “answer → think → answer” format
- $\alpha = 0.3$ (fallback bonus): An additional reward when $a_1$ is exactly “Let’s analyze the problem step by step” and $a_2$ is the correct answer. In other words, it provides an incentive for the model’s decision to determine that “this requires reasoning.”
- task reward
- QA: binary {0, 1} (math-verify or string match)
- temporal grounding: continuous [0, 1] (temporal IoU)
- Grounding QA: sum of both (QA reward + temporal IoU), range [0, 2]
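The reward terms above can be put together in a few lines. This is only a minimal sketch, not the authors' code: the function names are hypothetical, and it assumes that $R_{\text{fmt}}$ and $R_{\text{fallback}}$ are 0/1 indicators, which the notes do not state explicitly. The weights are the ones listed above.

```python
from typing import Tuple

# Weights as listed in the notes.
W1, W2 = 0.9, 1.1
LAMBDA_FMT = 1.0
ALPHA = 0.3
FALLBACK = "Let's analyze the problem step by step."


def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """Temporal IoU of two (start, end) segments, in [0, 1]."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def total_reward(task_a1: float, task_a2: float, format_ok: bool,
                 a1_text: str, a2_correct: bool) -> float:
    """R = w1*R_task(a1) + w2*R_task(a2) + lambda*R_fmt + alpha*R_fallback.

    Assumption: R_fmt and R_fallback are treated as 0/1 indicators.
    """
    r_fmt = 1.0 if format_ok else 0.0
    # Fallback bonus: a1 is exactly the fallback string and a2 is correct.
    r_fallback = 1.0 if (a1_text.strip() == FALLBACK and a2_correct) else 0.0
    return W1 * task_a1 + W2 * task_a2 + LAMBDA_FMT * r_fmt + ALPHA * r_fallback
```

Under these assumptions, a rollout that defers with the fallback string and then answers correctly earns 1.1 + 1.0 + 0.3, while one that is correct in both passes earns 0.9 + 1.1 + 1.0, so deferring is rewarded but never beats a correct direct answer.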
If this training is successful, the model will learn to consistently produce a “concise first answer + reasoned second answer” pattern.
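The inference-time gating described in the method can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names and the example threshold are hypothetical, the per-token log probabilities are assumed to be available from the decoder, and treating a confidence exactly equal to $\tau$ as "high" is an arbitrary choice here.

```python
import math
from typing import List

FALLBACK = "Let's analyze the problem step by step."


def answer_confidence(token_logprobs: List[float]) -> float:
    """Length-normalized mean log probability of the first boxed answer's tokens."""
    if not token_logprobs:
        return float("-inf")  # no answer tokens: treat as minimally confident
    return sum(token_logprobs) / len(token_logprobs)


def should_think(first_answer: str, token_logprobs: List[float], tau: float) -> bool:
    """Return True if the think phase should run, False for early exit."""
    if first_answer.strip() == FALLBACK:
        return True  # the model explicitly deferred
    return answer_confidence(token_logprobs) < tau


# Hypothetical threshold: mean token probability of roughly 0.9.
tau = math.log(0.9)
print(should_think("B", [-0.01, -0.02], tau))   # confident first answer
print(should_think("B", [-1.2, -0.8], tau))     # uncertain first answer
print(should_think(FALLBACK, [-0.01], tau))     # fallback string
```

Because the gate only reads the first answer's log probabilities, the decision costs nothing beyond the first pass, which is where the efficiency gain comes from.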
data
- 137K → 83K (removed cases where all 8 rollouts were either all correct or all incorrect)
- text 6.4K — DAPO-Math
- image 27.5K — ViRL, ThinkLite-Hard
- video 49.4K — Video-R1, TVBench, STI-Bench, MMR-VBench, Charades-STA, ActivityNet, Time-R1, NExT-GQA
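The 137K → 83K filtering step can be sketched as below. The notes do not give the authors' reason for it; a plausible one is that GRPO normalizes rewards within each rollout group, so a group with identical rewards yields zero advantage and no learning signal. The helper names are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev
from typing import List


def keep_prompt(rollout_correct: List[bool]) -> bool:
    """Keep a sample only if its rollouts are mixed (not all correct, not all wrong)."""
    return 0 < sum(rollout_correct) < len(rollout_correct)


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages in the GRPO style: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std
    return [(r - mu) / (sigma + eps) for r in rewards]


# A sample where all 8 rollouts are correct is dropped, and its advantages are all zero.
print(keep_prompt([True] * 8))
print(group_advantages([1.0] * 8))
# A mixed sample is kept and gives non-zero advantages.
print(keep_prompt([True] * 3 + [False] * 5))
```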
training recipe
- GRPO, 32× H100, 35 hours, 1 epoch, batch size 256
- KL penalty coefficient $\beta = 0.01$ (no dropout)
- up to 4096 video tokens / max 256 frames
result
- Performance on perception benchmarks is mostly flat or even slightly lower. Compared to the Qwen3-VL-8B base model, VideoMME goes from 72.5 to 71.7 and LongVideoBench from 67.6 to 67.4. On perception- and relation-focused benchmarks like LongVideoBench, “thinking” doesn’t seem to help much. Although LongVideoBench is defined around referring reasoning, this is likely because frame-grounded perception ultimately plays a larger role.
- Improvements were observed on VideoMMMU and Charades-STA (temporal grounding). Charades-STA, at 59.8, is a case where “think” directly helps.
- VideoAuto-R1’s own think ratio is 41% and its average response length is 44 tokens, so the efficiency gain is clear.
- However, in terms of accuracy, rather than simply stating that it “performs better” than “always-think,” it would be more accurate to say that it provides “much shorter responses with similar accuracy.”