paper

TL;DR

  • I read this because.. : video + think
  • task : video reasoning
  • problem : CoT does not always help in video QA. How can training balance direct answering and thinking?
  • idea : during training, the model always answers twice: once immediately and once after thinking. At inference, confidence is scored from the log probability of the answer tokens, and think is enabled based on that score.
  • input/output : {video, question} -> {initial boxed answer, (optional reasoning), reviewed boxed answer}
  • architecture : Qwen2.5-VL-7B-Instruct / Qwen3-VL-8B-Instruct. Visual encoder frozen; only the projector + LLM are trained. Up to 4096 video tokens, 256 frames.
  • objective : GRPO. RL directly, with no cold-start SFT.
  • baseline : Video-R1 (trained mostly on spatial data), Time-R1, VideoChat-R1, VideoChat-R1.5, VITAL, LongVILA-R1, LOVE-R1 / base Qwen2.5-VL-7B, Qwen3-VL-8B.
  • data : RL 83K (from 137K, after removing prompts whose 8 rollouts were all correct or all wrong). text 6.4K (DAPO-Math) / image 27.5K (ViRL, ThinkLite-Hard) / video 49.4K (Video-R1, TVBench, STI-Bench, MMR-VBench, Charades-STA, ActivityNet, Time-R1, NExT-GQA)
  • evaluation : VideoMME, MVBench, LongVideoBench, MMVU, VideoMMMU, MVP, Charades-STA, ActivityNet, NExT-GQA + image bench (MathVista, MathVision, MathVerse, MMMU, MMMU-Pro, MM-Vet).
  • result : a clear win on inference efficiency. Accuracy is mixed. On a reasoning bench like VideoMMMU, think is triggered on 51% of samples, gain +3.9. LongVideoBench / MMVU / VideoMME are roughly flat or even slightly lower.
  • contribution : shows through ablation that "always-think" is not the answer. That said, it is more accurate to see auto-mode as efficient rather than as raising absolute performance. The framing as confidence-based early-exit gating is clean.
  • etc. : I wonder whether it is also more efficient at training time.
  • CVPR 2026. The lack of cold-start SFT is a bit surprising; instruction following seems to be preserved because the instruction-tuned model is used as-is. KAUST group.

Details

motivation

  • motivation: evaluating think-trained video LLMs on benchmarks showed that direct answering can outperform CoT
    • benchmarks
      • VideoMME
      • VideoMMMU : lecture videos. In practice almost the same as a text reasoning bench
      • LongVideoBench : hmm, CoT also drops on a long-video bench. => Probably because the ability this benchmark measures is mostly perception + relation. Also, judging from the models below, long video is missing from their training data
      • MMVU : unlike VideoMMMU the videos are not lectures, but it is a knowledge-intensive benchmark. Not sure why CoT scores lower here
      • Charades-STA: temporal grounding task
    • models

method

  • two-pass decoding; the format is explicitly answer → think → answer
    • 1st pass: the system prompt enforces “FIRST: Output your initial answer inside the first \boxed{...} without any analysis or explanations”. If the model does not expect to produce an answer, it is instructed to output \boxed{Let's analyze the problem step by step.}, so the model expresses its intent to defer as tokens.
    • confidence: length-normalized mean log probability of the answer tokens inside the first \boxed{}. Gating compares it against a threshold $\tau$.
    • high confidence and not the fallback string → early-exit (skip think).
    • low confidence or the fallback string → THEN: generate the think trace, then put the reviewed answer $a_2$ in the second \boxed{}.
    • no think / no-think labeling during training; gating is decided only at inference time. Prior approaches such as AdaptThink explicitly mix think/no-think samples during on-policy training, which reportedly brings data balancing and hyperparameter sensitivity issues.
  • reward
    • $R = w_1 R_{\text{task}}(a_1) + w_2 R_{\text{task}}(a_2) + \lambda R_{\text{fmt}} + \alpha R_{\text{fallback}}$
    • $w_1 = 0.9, w_2 = 1.1$: with $w_2 > w_1 \geq 0$ the reviewed answer gets the larger weight, which encourages refinement. The 0.9:1.1 ratio is stated in the main text of the paper.
    • $\lambda = 1.0$: reward for keeping the answer → think → answer format
    • $\alpha = 0.3$ (fallback bonus): extra reward when $a_1$ is exactly "Let's analyze the problem step by step" and $a_2$ is correct. In other words, the act of the model judging "this needs reasoning" is itself incentivized.
  • task reward
    • QA: binary {0, 1} (math-verify ๋˜๋Š” string match)
    • temporal grounding: continuous [0, 1] (temporal IoU)
    • grounding QA: sum of the two, [0, 2]
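
The reward above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the paper's implementation: the function names (`temporal_iou`, `is_fallback`, `total_reward`) are hypothetical, $R_{\text{fmt}}$ and $R_{\text{fallback}}$ are assumed to be 0/1 indicators, and the fallback comparison is assumed to ignore surrounding whitespace and a trailing period.

```python
from typing import Tuple

# Fallback string the model is told to emit when it cannot answer directly.
FALLBACK = "Let's analyze the problem step by step"


def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """IoU of two (start, end) segments; a continuous task reward in [0, 1]."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def is_fallback(answer: str) -> bool:
    # Assumption: ignore surrounding whitespace and a trailing period.
    return answer.strip().rstrip(".") == FALLBACK


def total_reward(
    r_task_a1: float,   # task reward of the initial answer a_1
    r_task_a2: float,   # task reward of the reviewed answer a_2
    format_ok: bool,    # output follows answer -> think -> answer
    a1_text: str,       # text inside the first \boxed{}
    a2_correct: bool,   # whether a_2 counts as correct
    w1: float = 0.9,
    w2: float = 1.1,
    lam: float = 1.0,
    alpha: float = 0.3,
) -> float:
    """R = w1*R_task(a1) + w2*R_task(a2) + lam*R_fmt + alpha*R_fallback."""
    r_fmt = 1.0 if format_ok else 0.0
    r_fallback = 1.0 if (is_fallback(a1_text) and a2_correct) else 0.0
    return w1 * r_task_a1 + w2 * r_task_a2 + lam * r_fmt + alpha * r_fallback
```

With these weights, a QA sample that is right on both passes scores 0.9 + 1.1 + 1.0 = 3.0, while a sample that defers on the first pass and is right on the second scores 1.1 + 1.0 + 0.3 = 2.4, so an honest deferral still beats a wrong first guess followed by a correct review (2.1).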

When this training works well, the model learns to reliably produce the "concise first answer + reasoned second answer" pattern.
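
The inference-time gating described in the method bullets could look roughly like the sketch below. Everything here is illustrative: `answer_confidence` and `should_think` are hypothetical names, the per-token log probabilities are assumed to be already extracted for the tokens inside the first \boxed{}, and the value of $\tau$ is not given in these notes, so the caller has to supply it.

```python
from typing import Sequence

# Fallback string the model emits when it wants to defer to reasoning.
FALLBACK = "Let's analyze the problem step by step"


def answer_confidence(token_logprobs: Sequence[float]) -> float:
    """Length-normalized mean log probability of the first-answer tokens."""
    if not token_logprobs:
        return float("-inf")  # no answer tokens: treat as zero confidence
    return sum(token_logprobs) / len(token_logprobs)


def should_think(first_answer: str, token_logprobs: Sequence[float], tau: float) -> bool:
    """Return True if decoding should continue into the think trace.

    Assumption: the fallback check ignores whitespace and a trailing period,
    and confidence exactly equal to tau counts as confident enough.
    """
    if first_answer.strip().rstrip(".") == FALLBACK:
        return True
    return answer_confidence(token_logprobs) < tau


# Usage with a made-up threshold (tau = -0.1 is NOT a value from the paper).
print(should_think("B", [-0.01, -0.02], tau=-0.1))  # confident: early-exit
print(should_think("B", [-1.2], tau=-0.1))          # unsure: keep decoding
```

Because the decision only reads the log probabilities of tokens that were already generated, early-exit costs nothing extra: decoding simply stops after the first \boxed{} when `should_think` returns False.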

data

  • 137K → 83K (prompts whose 8 rollouts are all correct or all wrong are removed)
  • text 6.4K: DAPO-Math
  • image 27.5K: ViRL, ThinkLite-Hard
  • video 49.4K: Video-R1, TVBench, STI-Bench, MMR-VBench, Charades-STA, ActivityNet, Time-R1, NExT-GQA
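
The 137K → 83K filtering step can be sketched as follows. This is an assumed reading of the notes, not the paper's code: the names `is_informative` and `filter_prompts` are hypothetical, and each rollout is assumed to have already been reduced to a correct/incorrect flag (how continuous rewards such as temporal IoU are thresholded is not stated here).

```python
from typing import Dict, List, Sequence


def is_informative(rollout_correct: Sequence[bool]) -> bool:
    """Keep a prompt only if its rollouts disagree (some right, some wrong)."""
    return any(rollout_correct) and not all(rollout_correct)


def filter_prompts(results: Dict[str, Sequence[bool]]) -> List[str]:
    """Return the ids of prompts that survive the all-correct/all-wrong filter."""
    return [pid for pid, flags in results.items() if is_informative(flags)]


# Toy example with 8 rollouts per prompt (ids are made up).
results = {
    "too_easy": [True] * 8,                 # dropped: all correct
    "too_hard": [False] * 8,                # dropped: all wrong
    "useful": [True] * 3 + [False] * 5,     # kept: mixed outcomes
}
print(filter_prompts(results))
```

Prompts with uniform outcomes carry no learning signal for a group-relative method like GRPO, since every rollout in the group would receive the same reward.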

training recipe

  • GRPO, 32× H100, 35 hours, 1 epoch, batch size 256
  • KL penalty coefficient $\beta = 0.01$ (the KL term is kept, not removed)
  • 4096 video token / max 256 frame
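
For reference, the group-relative advantage at the core of GRPO can be sketched as below. This follows the commonly used GRPO formulation (reward standardized within the group of rollouts for one prompt) and is an assumption about the general technique, not a statement of the exact variant used in this paper; the KL penalty with $\beta = 0.01$ is applied separately in the loss and is not shown. The name `group_advantages` and the `eps` value are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one prompt's group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group where every rollout gets the same reward yields zero advantage,
# which is why all-correct / all-wrong prompts contribute no gradient.
print(group_advantages([3.0] * 8))
print(group_advantages([3.0, 0.0]))
```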

result

  • perception benches are roughly flat or even slightly lower. With the Qwen3-VL-8B base: VideoMME 72.5 → 71.7, LongVideoBench 67.6 → 67.4. Thinking does not help much on perception + relation benches such as long-video benches. By definition LongVideoBench does include referred reasoning, but frame-grounded perception still dominates, which is probably why.
  • Improves on VideoMMMU and Charades-STA (temporal grounding). There are cases, such as Charades-STA at 59.8, where think helps directly.
  • VideoAuto-R1 itself has a think ratio of 41% / average response length of 44 tokens, so the efficiency gain is clear.
    • That said, on accuracy, rather than flatly asserting "better than always-think", it is more accurate to read the result as "similar accuracy with much shorter responses".