Image

paper

TL;DR

  • I read this because.. : open source VLM + RL
  • task : MLLM / RL (general visual reasoning)
  • problem : open source VLM + RL
  • idea : explore a range of training recipes
  • input/output : {image, question} -> <think>...</think><answer>...</answer>
  • architecture : RL applied directly on top of Qwen3-VL-8B-Instruct / Thinking, Qwen2.5-VL-7B-Instruct, MiMo-VL-7B-SFT. (whether any components are frozen is not stated in the paper)
  • objective : GSPO, asymmetric clipping
  • baseline : Qwen3-VL-8B-Instruct, Qwen3-VL-8B-Thinking, MiMo-VL-7B-RL
  • data : Vero-600K = 59 datasets, 6 categories × 100K
  • evaluation : VeroEval 30 benchmark (Chart&OCR 6 / STEM 4 / Spatial&Action 5 / Knowledge 4 / Grounding&Counting 8 / Captioning&IF 3)
  • result : Vero-Qwen3I-8B gains +5.3 on average over its base and even surpasses Qwen3-VL-8B-Thinking on 23/30 benchmarks. Vero-MiMo-7B beats MiMo-VL-7B-RL (closed recipe) in 3 of 6 categories (STEM +0.5, Knowledge +5.1, Captioning +4.0)
  • contribution : (1) open release of 600K RL data + a 30-benchmark eval suite (2) an ablation showing task-routed reward (10 types) is +5.4 over math-verify alone (3) the claim that “data diversity + task-aware reward mitigates negative transfer in visual reasoning RL”
  • etc. :

Details

data: Vero-600K

Image
  • category composition
    • Chart & OCR (9) : ChartQA, InfoVQA, CoSyn-Chart/Diagram/Table, ArxivQA, ECD-VQA, EvoChart, InfographicVQA, ReachQA
    • STEM (13) : CoSyn-Math, AI2D, Geo170K, GeomVerse, GeoQA+, MMK12, PathVQA, RAVEN, TQA, VisualWebInstruct, VQA-RAD, We-Math 2.0 (Pro & Std)
    • Spatial & Action (8) : GameQA, Magma-AITW, Magma-Mind2Web, Robo2VLM, Spatial-SSRL, ST-VQA, Visual Jigsaw 2D/3D
    • Knowledge & Recognition (12) : A-OKVQA, GQA, IconQA, Indoor-QA, KVG, KVQA, PopVQA, VCR, ViQuAE, Visual7W, VizWiz, VQAv2
    • Grounding, Counting & Search (11) : AerialVG, GroundUI, MultiHop, Objects365-QA, OOD-VQA, OS-ATLAS, Pixel Reasoner, PixMo, RefCOCOg, TallyQA, Visual Probe
    • Captioning & IF (6) : PixMo-AskAnything, PixMo-CapQA, PixMo-Cap, MM-RLVR-IFEval, MMIF-23K, Flickr30K
  • data filtering
    • Image
    • criteria were set by manually inspecting ~50 samples per category: correctness (<5% annotation error rate), unambiguity (does each question have a single verifiable answer), verifiability
    • automatic filter judge = Qwen3-VL-235B-A22B-Instruct. Removes ambiguous / image-irrelevant / unverifiable questions
    • result: pre→post filter average 61.3–64.1
  • data mixture
    • they report that simply splitting evenly across categories worked better
    • Image
  • (not in the paper): how datasets are weighted against each other within a category is not specified
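
A minimal sketch of the equal per-category mixture described above, assuming examples are already pooled per category. The function name `uniform_category_mixture`, the pooled-list input, and uniform sampling within a category are all assumptions for illustration; as noted, the within-category weighting is not specified.

```python
import random


def uniform_category_mixture(pools, per_category, seed=0):
    """Draw the same number of examples from every category.

    pools: dict mapping category name -> list of examples (hypothetical input).
    Within a category, examples are sampled uniformly without replacement;
    this is an assumption, not something the notes confirm.
    """
    rng = random.Random(seed)
    mixture = []
    for category, examples in pools.items():
        if len(examples) < per_category:
            raise ValueError(f"{category} has fewer than {per_category} examples")
        picked = rng.sample(examples, per_category)
        mixture.extend((category, ex) for ex in picked)
    rng.shuffle(mixture)  # interleave categories so batches are mixed
    return mixture


# Toy pools of different sizes; the real setting would be 6 categories x 100K.
toy_pools = {
    "Chart & OCR": list(range(10)),
    "STEM": list(range(50)),
    "Captioning & IF": list(range(20)),
}
toy_mixture = uniform_category_mixture(toy_pools, per_category=5)
```

Larger pools are down-sampled harder than smaller ones here, which is the practical effect of an equal split.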

training recipe

  • algorithm: GSPO (Group Sequence Policy Optimization, asymmetric clipping). In the ablation it preserves entropy better than GRPO / DAPO (0.58±0.11 vs 0.50±0.11 / 0.22±0.15) while also scoring slightly higher on average (54.7 vs 54.3 / 54.3)
    • Image
  • single stage RL, no warm-start SFT. ~600 steps ≈ 1 epoch
  • KL penalty 0
  • context length: soft overlong penalty buffer $[L_{max}-2048, L_{max}]$
  • SFT vs RL ablation: on the same Vero-600K, RL beats SFT by +4.4 points
    • Image
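
A rough sketch of the GSPO objective for one prompt group, assuming the standard formulation: a length-normalized sequence-level importance ratio, group-normalized advantages, and separate lower/upper clip ranges for the asymmetric clipping. The function names and the `eps_low` / `eps_high` values are placeholders, not the paper's settings. No KL term is included, matching the KL penalty of 0 above.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards):
    """Group-normalized advantages: (r - mean) / std within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        return [0.0 for _ in rewards]  # all responses tied: no learning signal
    return [(r - mu) / sigma for r in rewards]


def sequence_ratio(new_logps, old_logps):
    """Sequence-level ratio: (pi_new(y|x) / pi_old(y|x)) ** (1 / |y|).

    Computed as exp of the mean per-token log-ratio.
    """
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))


def gspo_objective(new_logps, old_logps, rewards, eps_low=0.2, eps_high=0.3):
    """Clipped surrogate averaged over the group (to be maximized).

    new_logps / old_logps: one list of per-token log-probs per response.
    eps_low != eps_high gives the asymmetric clip range (placeholder values).
    """
    advantages = group_advantages(rewards)
    terms = []
    for new, old, adv in zip(new_logps, old_logps, advantages):
        s = sequence_ratio(new, old)
        s_clipped = min(max(s, 1.0 - eps_low), 1.0 + eps_high)
        terms.append(min(s * adv, s_clipped * adv))
    return sum(terms) / len(terms)
```

Because the ratio is one scalar per response, clipping keeps or drops a whole sequence rather than individual tokens, which is the main difference from token-level GRPO.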

reward

Image
  • overall reward $$ R(y, y^{*}) = (1-\alpha) R_{acc}(y, y^{*}) + \alpha R_{fmt}(y) + R_{overlong}(y),\quad \alpha=0.2 $$
  • overlong penalty (Eq. 4): $$ R_{overlong}(y) = \min\!\Big(-\frac{|y|-(L_{max}-B)}{B}\lambda,\ 0\Big),\quad B=2048,\ \lambda=1.0 $$
  • format reward: 1 if the <think>...</think><answer>...</answer> structure is followed, 0 otherwise. Answer-format issues (e.g. \boxed{...} not used) get partial credit of 0.5
  • 10 accuracy rewards, routed by answer format
    1. string match (exact text equality)
    2. multiple choice (single-letter extraction)
    3. numeric → math-verify (symbolic parse + tolerance)
    4. list string match (any-match over synonyms etc.)
    5. ordering → full reward if the list order is exactly right; 0.2 discount if the set is right but the order is wrong
    6. web action (JSON field weighted match)
    7. grounding (Hungarian matching over bboxes, IoU/F1 threshold 0.5)
    8. clicking (point-in-box, coordinates normalized to [0,1000])
    9. instruction following (fraction of constraints satisfied)
    10. LLM-as-judge: Qwen3-32B (thinking disabled), score from 1 to 10, a variant of the OLMo3 judge setup
  • ablation:
    • Image
    • math-verify as the single reward 51.8 → multi-route 57.2 (Table 4b). Task-routed wins by +5.4 absolute points
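
The reward composition above can be sketched roughly as follows. Only three of the ten accuracy routes are shown, and the regexes, function names, route keys, and the choice to measure $|y|$ in tokens are assumptions for illustration. The 0.5 partial format credit is left out.

```python
import re

ALPHA = 0.2    # weight of the format reward (from the notes)
BUFFER = 2048  # B, soft overlong buffer (from the notes)
LAMBDA = 1.0   # overlong penalty scale (from the notes)

# Assumed format check: one think block followed by one answer block.
FORMAT_RE = re.compile(
    r"^\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)


def overlong_penalty(length, max_len, buffer=BUFFER, lam=LAMBDA):
    """0 up to max_len - buffer, then linearly down to -lam at max_len."""
    return min(-(length - (max_len - buffer)) / buffer * lam, 0.0)


def format_reward(response):
    """1 if the structure is followed, else 0 (partial credit omitted)."""
    return 1.0 if FORMAT_RE.match(response) else 0.0


def extract_answer(response):
    m = FORMAT_RE.match(response)
    return m.group(1).strip() if m else ""


def string_match(pred, gold):
    return 1.0 if pred.strip() == gold.strip() else 0.0


def multiple_choice(pred, gold):
    # Assumed extraction rule: first standalone capital letter.
    m = re.search(r"\b([A-Z])\b", pred)
    return 1.0 if m and m.group(1) == gold.strip().upper() else 0.0


def click(pred, gold_box):
    # Point-in-box on coordinates normalized to [0, 1000].
    nums = [float(v) for v in re.findall(r"-?\d+(?:\.\d+)?", pred)]
    if len(nums) < 2:
        return 0.0
    x, y = nums[:2]
    x1, y1, x2, y2 = gold_box
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0


# Hypothetical route keys; the remaining seven reward types would go here.
ROUTES = {
    "string_match": string_match,
    "multiple_choice": multiple_choice,
    "click": click,
}


def total_reward(response, gold, reward_type, n_tokens, max_len):
    """R = (1 - alpha) * R_acc + alpha * R_fmt + R_overlong."""
    r_acc = ROUTES[reward_type](extract_answer(response), gold)
    r_fmt = format_reward(response)
    return (1 - ALPHA) * r_acc + ALPHA * r_fmt + overlong_penalty(n_tokens, max_len)
```

With `max_len=8192` the penalty is 0 up to 6144 tokens and reaches -1.0 at 8192, so a correct but maximally long answer nets roughly 0.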

reward hacking & judge guideline

  • with only an LLM judge, the model stuffs in self-evaluative language (“This satisfies all requirements”, “exhaustively documents every… detail”) + fabricated measurements to inflate its score
  • mitigation: the judge prompt spells out Automatic Failure Conditions. If self-evaluative language / meta-commentary is detected, the score is automatically 1. Designed so that reward hacking becomes a losing move
  • (open question): could this failure condition also penalize legitimate reasoning? The false-positive rate does not appear to have been measured separately
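
A small sketch of how the failure override and the false-positive question above could be made concrete. In the notes the detection happens inside the judge prompt, so `failure_flagged` stands in for whatever the judge reports. Rescaling 1 to 10 onto [0, 1] and both function names are assumptions.

```python
def judge_reward(judge_score, failure_flagged):
    """Apply the automatic-failure override, then rescale 1..10 to 0..1.

    failure_flagged: True when the judge reports self-evaluative language
    or meta-commentary. The linear rescaling is an assumption.
    """
    if not 1 <= judge_score <= 10:
        raise ValueError("judge score must be in 1..10")
    score = 1 if failure_flagged else judge_score
    return (score - 1) / 9


def false_positive_rate(flagged, is_hacking):
    """Share of legitimate responses (hand-labeled) that were flagged anyway."""
    legit_flags = [f for f, h in zip(flagged, is_hacking) if not h]
    if not legit_flags:
        raise ValueError("need at least one legitimate response")
    return sum(legit_flags) / len(legit_flags)
```

Measuring the second function on a small hand-labeled sample would answer the open question directly.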

evaluation: VeroEval 30 bench

  • Chart & OCR (6): ChartQA-Pro, ChartQA, InfoVQA, CharXiv, ChartMuseum, EvoChart
  • STEM (4): MMMU-Pro Standard, MMMU-Pro Vision, MathVision, MathVista-testmini
  • Spatial & Action (5): Blink, ERQA, GameQA-Lite, EmbSpatial, CVBench
  • Knowledge & Recognition (4): RealWorldQA, SimpleVQA, FVQA, MM-Vet V2
  • Grounding, Counting & Search (8): CountBenchQA, CountQA, MME-RealWorld, VStarBench, AerialVG, VisualProbe, ScreenSpot, ScreenSpot-Pro
  • Captioning & IF (3): MM-MTBench, MIA-Bench, MMIFEval

result

  • Vero-Qwen3I-8B vs Qwen3-VL-8B-Instruct: +5.3 average
    • Chart&OCR +8.5 / STEM +6.4 / Spatial&Action +3.7 / Knowledge +1.0 / Grounding +5.3 / Captioning +5.6
    • only Knowledge shows a small gain; it looks like an area where the base was already strong
  • Vero-Qwen3I-8B vs Qwen3-VL-8B-Thinking: wins on 23 / 30 benchmarks (an Instruct-based model ending up stronger than the Thinking model)
  • Vero-Qwen3T-8B vs Qwen3-VL-8B-Thinking: 24 / 30 (Grounding +7.2, Chart&OCR +4.2)
  • Vero-MiMo-7B vs MiMo-VL-7B-RL (closed RL recipe): wins 3 of 6 categories, with STEM +0.5, Knowledge +5.1, Captioning +4.0
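
A quick arithmetic check on the Vero-Qwen3I-8B numbers above: weighting each category gain by its VeroEval benchmark count comes out close to the reported +5.3, while a plain mean over the six categories does not. This assumes each category gain is a mean over that category's benchmarks. The inputs are rounded, so the match is only approximate.

```python
# (number of benchmarks, reported gain) per category, copied from the notes
gains = {
    "Chart & OCR": (6, 8.5),
    "STEM": (4, 6.4),
    "Spatial & Action": (5, 3.7),
    "Knowledge & Recognition": (4, 1.0),
    "Grounding, Counting & Search": (8, 5.3),
    "Captioning & IF": (3, 5.6),
}

n_benchmarks = sum(n for n, _ in gains.values())
weighted = sum(n * g for n, g in gains.values()) / n_benchmarks
unweighted = sum(g for _, g in gains.values()) / len(gains)

print(f"benchmarks: {n_benchmarks}")
print(f"benchmark-weighted mean gain: {weighted:.2f}")  # about 5.28
print(f"category-level mean gain: {unweighted:.2f}")    # about 5.08
```

So the +5.3 headline is consistent with averaging over the 30 benchmarks rather than over the 6 categories.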

ablation: cross-category transfer

  • core claim: “data diversity + task-aware reward design mitigates negative transfer”
  • single-task RL often gives neutral or negative transfer to other categories. Example: running RL on Captioning alone drops Qwen2.5-VL's other categories by -4.4 to -35.5 points
  • mixing all 6 categories yields positive cross-category transfer, i.e. adding one category also helps on the others
  • reasoning length differs a lot by category: Spatial & Action averages 1983 words vs Grounding/Search at 125 words
    • Image
    • interestingly, Spatial & Action needs longer reasoning than STEM

etc.

  • it is not disentangled whether the real effect of task-routed reward comes from (a) a more accurate reward signal or (b) reward distributions differing by category, which would implicitly act as a curriculum / balancing effect
  • the +4.4 for RL in the SFT vs RL ablation is fair in that both use the same data, but whether hparam tuning on the SFT side was sufficient is (not in the paper)
  • only the Knowledge category gain is small at +1.0; probably because knowledge benchmarks are themselves an area (factual recall) where RL training has little to add