TL;DR
- I read this because.. :
- task : MLLM / RL (general visual reasoning)
- problem : with open-weight VLMs, raise a broad range of visual reasoning (chart / STEM / spatial / grounding, etc.) in one go with RL, without skewing toward a single domain
- idea : 6 categories × 100K uniform sampling (Vero-600K) + task-routed reward that branches by answer format + single-stage GSPO RL
- input/output : {image, question} -> <think>...</think><answer>...</answer>
- architecture : RL applied directly on top of Qwen3-VL-8B-Instruct / Thinking, Qwen2.5-VL-7B-Instruct, MiMo-VL-7B-SFT. (whether anything is frozen: not in the paper)
- objective : GSPO, asymmetric clipping ($\varepsilon_{high} > \varepsilon_{low}$), no KL. $R = (1-\alpha) R_{acc} + \alpha R_{fmt} + R_{overlong}$ ($\alpha=0.2$)
- baseline : Qwen3-VL-8B-Instruct, Qwen3-VL-8B-Thinking, MiMo-VL-7B-RL
- data : Vero-600K = 59 datasets, 6 categories × 100K. ~50 examples per category were inspected to filter on 3 criteria: correctness / unambiguity / verifiability (filter judge = Qwen3-VL-235B-A22B-Instruct)
- evaluation : VeroEval 30 benchmark (Chart&OCR 6 / STEM 4 / Spatial&Action 5 / Knowledge 4 / Grounding&Counting 8 / Captioning&IF 3)
- result : Vero-Qwen3I-8B gains +5.3 on average over its base and even surpasses Qwen3-VL-8B-Thinking on 23/30 benchmarks. Vero-MiMo-7B beats MiMo-VL-7B-RL (closed recipe) in 3 of 6 categories (STEM +0.5, Knowledge +5.1, Captioning +4.0)
- contribution : (1) open release of 600K RL data + a 30-benchmark eval suite (2) an ablation showing task-routed reward (10 types) is +5.4 over math-verify alone (3) the claim that “data diversity + task-aware reward mitigate negative transfer in visual reasoning RL”
- etc. : single stage, ~600 RL steps (≈1 epoch). KL penalty 0. reasoning length differs ~16x across categories (Spatial&Action 1983 words vs Grounding/Search 125 words).
Details
data - Vero-600K
- 59 datasets, 6 categories × 100K, uniform (uniform sampling scores about +0.6~+1.0 higher than difficulty / area / length weighting; Table 2)
- category composition
- Chart & OCR (9) : ChartQA, InfoVQA, CoSyn-Chart/Diagram/Table, ArxivQA, ECD-VQA, EvoChart, InfographicVQA, ReachQA
- STEM (13) : CoSyn-Math, AI2D, Geo170K, GeomVerse, GeoQA+, MMK12, PathVQA, RAVEN, TQA, VisualWebInstruct, VQA-RAD, We-Math 2.0 (Pro & Std)
- Spatial & Action (8) : GameQA, Magma-AITW, Magma-Mind2Web, Robo2VLM, Spatial-SSRL, ST-VQA, Visual Jigsaw 2D/3D
- Knowledge & Recognition (12) : A-OKVQA, GQA, IconQA, Indoor-QA, KVG, KVQA, PopVQA, VCR, ViQuAE, Visual7W, VizWiz, VQAv2
- Grounding, Counting & Search (11) : AerialVG, GroundUI, MultiHop, Objects365-QA, OOD-VQA, OS-ATLAS, Pixel Reasoner, PixMo, RefCOCOg, TallyQA, Visual Probe
- Captioning & IF (6) : PixMo-AskAnything, PixMo-CapQA, PixMo-Cap, MM-RLVR-IFEval, MMIF-23K, Flickr30K
- filtering pipeline
- manually inspected ~50 samples per category to set the criteria: correctness (<5% annotation error rate), unambiguity (whether each question has a single verifiable answer), verifiability
- automatic filter judge = Qwen3-VL-235B-A22B-Instruct. removes ambiguous / image-irrelevant / unverifiable questions
- result: pre→post filter average 61.3→64.1
- (not in the paper): how datasets within a category are weighted against each other is not specified
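A minimal sketch of the 6 × 100K uniform mix, for intuition only. Since within-category dataset weighting is not specified, this sketch simply pools each category and samples uniformly from the pool. `sample_uniform_by_category`, the `category` field, and the with-replacement fallback for small pools are all assumptions, not the paper's pipeline.

```python
import random
from typing import Dict, List


def sample_uniform_by_category(
    pools: Dict[str, List[dict]], per_category: int, seed: int = 0
) -> List[dict]:
    """Draw the same number of examples from every category pool.

    Assumption: each pool is the concatenation of that category's datasets.
    """
    rng = random.Random(seed)
    mixed: List[dict] = []
    for category in sorted(pools):  # sorted for reproducibility
        pool = pools[category]
        if not pool:
            raise ValueError(f"empty pool for category {category!r}")
        if len(pool) >= per_category:
            picked = rng.sample(pool, per_category)  # without replacement
        else:
            # Assumption: small pools are upsampled with replacement.
            picked = [rng.choice(pool) for _ in range(per_category)]
        mixed.extend(picked)
    rng.shuffle(mixed)  # interleave categories
    return mixed
```

With six pools and `per_category=100_000` this would give a 600K mix with equal category shares.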
training recipe
- algorithm: GSPO (Group Sequence Policy Optimization, asymmetric clipping). in the ablation it preserves entropy better than GRPO / DAPO (0.58±0.11 vs 0.50±0.11 / 0.22±0.15) while the average score is also slightly higher (54.7 vs 54.3 / 54.3)
- single stage RL, no warm-start SFT. ~600 steps ≈ 1 epoch
- KL penalty 0
- context length: soft overlong penalty buffer $[L_{max}-2048, L_{max}]$
- temperature and other sampling settings: T=0.7 for the Qwen3 family; for the Qwen2.5 family see appendix Table C2~C3
- SFT vs RL ablation: on the same Vero-600K, RL is +4.4 points over SFT → i.e. the gain is not just because the data is good; the RL recipe has to work together with it
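For reference, a sketch of the GSPO objective as I understand it from the GSPO paper, written with the asymmetric clip bounds these notes mention. The notation ($G$ responses $y_i$ per prompt $x$, rewards $r_i$) is mine, and the exact $\varepsilon$ values are not recorded in these notes.

$$ s_i(\theta) = \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{old}}(y_i \mid x)}\right)^{1/|y_i|},\quad \hat{A}_i = \frac{r_i - \mathrm{mean}(\{r_j\}_{j=1}^{G})}{\mathrm{std}(\{r_j\}_{j=1}^{G})} $$

$$ J(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\Big(s_i(\theta)\hat{A}_i,\ \mathrm{clip}\big(s_i(\theta),\ 1-\varepsilon_{low},\ 1+\varepsilon_{high}\big)\hat{A}_i\Big)\right],\quad \varepsilon_{high} > \varepsilon_{low} $$

The ratio $s_i$ is length-normalized at the sequence level, so clipping applies to whole responses rather than to individual tokens as in GRPO. With KL penalty 0 there is no extra regularization term.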
reward - task-routed (one of the core contributions)
- overall reward $$ R(y, y^*) = (1-\alpha) R_{acc}(y, y^*) + \alpha R_{fmt}(y) + R_{overlong}(y),\quad \alpha=0.2 $$
- overlong penalty (Eq. 4): $$ R_{overlong}(y) = \min\!\Big(-\frac{|y|-(L_{max}-B)}{B}\lambda,\ 0\Big),\quad B=2048,\ \lambda=1.0 $$
- format reward: 1 if the <think>...</think><answer>...</answer> structure is followed, 0 if not. an alternative answer format (e.g. the answer written in \boxed{...}) gets partial credit of 0.5
- 10 kinds of accuracy reward, routed by answer format:
- string match (exact text equality)
- multiple choice (single letter extraction)
- numeric: math-verify (symbolic parse + tolerance)
- list string match (any-match over synonyms etc.)
- ordering: full reward for the exact list order; if the set matches but the order is wrong, a 0.2 discount
- web action (JSON field weighted match)
- grounding (Hungarian matching over bboxes, IoU/F1 threshold 0.5)
- clicking (point-in-box, coordinates normalized to [0,1000])
- instruction following (fraction of constraints satisfied)
- LLM-as-judge: Qwen3-32B (thinking disabled), scored 1~10, a variant of the OLMo3 judge setup
- ablation: math-verify as the only reward 51.8 → multi-route 57.2 (Table 4b). task-routed wins by about +5.4 points
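The overall reward and the overlong penalty above can be sketched as follows. This is a minimal sketch assuming response length is measured in tokens; the function names and the `l_max=8192` value in the usage lines are hypothetical, not taken from the paper.

```python
def overlong_penalty(length: int, l_max: int, buffer: int = 2048, lam: float = 1.0) -> float:
    """Soft overlong penalty: 0 up to l_max - buffer, then linear down to -lam at l_max."""
    return min(-(length - (l_max - buffer)) / buffer * lam, 0.0)


def total_reward(r_acc: float, r_fmt: float, length: int, l_max: int, alpha: float = 0.2) -> float:
    """R = (1 - alpha) * R_acc + alpha * R_fmt + R_overlong."""
    return (1 - alpha) * r_acc + alpha * r_fmt + overlong_penalty(length, l_max)


# Hypothetical l_max, for illustration only.
print(overlong_penalty(1000, l_max=8192))  # 0.0, well inside the budget
print(overlong_penalty(7168, l_max=8192))  # -0.5, halfway through the buffer
print(overlong_penalty(8192, l_max=8192))  # -1.0, at the hard limit
```

Note that a correct, well-formatted answer that runs to the length limit nets roughly 0, so the penalty can cancel the full accuracy + format reward.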
user comment (p.5, next to entity recognition "A: Seagull"): “should entity recognition just use exact match?”
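To make the routing idea concrete, here is a hedged sketch of a dispatcher covering three of the ten routes. The route names, the `ROUTES` table, the case-folding in `string_match`, and the A-H option range are all my assumptions; the paper's actual extraction and matching rules may differ.

```python
import re
from typing import Any, Callable, Dict, Sequence


def string_match(pred: str, gold: str) -> float:
    # Exact equality after trimming and case-folding (normalization is an assumption).
    return float(pred.strip().casefold() == gold.strip().casefold())


def multiple_choice(pred: str, gold: str) -> float:
    # Take the first standalone option letter (assumed range A-H) as the choice.
    match = re.search(r"\b([A-H])\b", pred)
    return float(match is not None and match.group(1) == gold.strip().upper())


def clicking(pred: Sequence[float], gold: Sequence[float]) -> float:
    # pred = (x, y), gold = (x1, y1, x2, y2), all normalized to [0, 1000].
    x, y = pred
    x1, y1, x2, y2 = gold
    return float(x1 <= x <= x2 and y1 <= y <= y2)


ROUTES: Dict[str, Callable[[Any, Any], float]] = {
    "string_match": string_match,
    "multiple_choice": multiple_choice,
    "clicking": clicking,
}


def accuracy_reward(answer_type: str, pred: Any, gold: Any) -> float:
    """Dispatch to the scorer registered for this answer type."""
    if answer_type not in ROUTES:
        raise ValueError(f"no reward route for answer type {answer_type!r}")
    return ROUTES[answer_type](pred, gold)
```

Under this sketch, `accuracy_reward("string_match", "A: Seagull", "Seagull")` returns 0.0 while `accuracy_reward("multiple_choice", "A: Seagull", "A")` returns 1.0, which is one way to see why the choice of route for entity recognition matters.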
reward hacking & judge guideline
- with only an LLM judge, the model inflates its score by inserting self-evaluative language (“This satisfies all requirements”, “exhaustively documents every… detail”) + fabricated measurements
- mitigation: state Automatic Failure Conditions in the judge prompt → if self-evaluative language / meta-commentary is caught, the score is automatically 1. designed so that reward hacking ends up losing reward
- (open question): could this failure condition also penalize legitimate reasoning? the false-positive rate does not seem to be reported separately
evaluation - VeroEval 30 bench
- Chart & OCR (6): ChartQA-Pro, ChartQA, InfoVQA, CharXiv, ChartMuseum, EvoChart
- STEM (4): MMMU-Pro Standard, MMMU-Pro Vision, MathVision, MathVista-testmini
- Spatial & Action (5): Blink, ERQA, GameQA-Lite, EmbSpatial, CVBench
- Knowledge & Recognition (4): RealWorldQA, SimpleVQA, FVQA, MM-Vet V2
- Grounding, Counting & Search (8): CountBenchQA, CountQA, MME-RealWorld, VStarBench, AerialVG, VisualProbe, ScreenSpot, ScreenSpot-Pro
- Captioning & IF (3): MM-MTBench, MIA-Bench, MMIFEval
result
- Vero-Qwen3I-8B vs Qwen3-VL-8B-Instruct: +5.3 average
- Chart&OCR +8.5 / STEM +6.4 / Spatial&Action +3.7 / Knowledge +1.0 / Grounding +5.3 / Captioning +5.6
- only Knowledge has a small gain → looks like an area the base model was already good at
- Vero-Qwen3I-8B vs Qwen3-VL-8B-Thinking: wins on 23 / 30 benchmarks (a case where an Instruct-based model is stronger than the Thinking model)
- Vero-Qwen3T-8B vs Qwen3-VL-8B-Thinking: 24 / 30 (Grounding +7.2, Chart&OCR +4.2)
- Vero-MiMo-7B vs MiMo-VL-7B-RL (closed RL recipe): wins 3 of 6 categories (STEM +0.5, Knowledge +5.1, Captioning +4.0) → the open recipe is competitive with the closed recipe
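A quick arithmetic check on the +5.3 headline for Vero-Qwen3I-8B. It assumes each category gain is the mean over that category's VeroEval benchmarks (counts 6 / 4 / 5 / 4 / 8 / 3) and that the headline averages over all 30 benchmarks. Both are assumptions about the aggregation, and the inputs are already rounded, so this is only a consistency check.

```python
# (category gain over base, number of VeroEval benchmarks in that category)
gains = {
    "Chart&OCR": (8.5, 6),
    "STEM": (6.4, 4),
    "Spatial&Action": (3.7, 5),
    "Knowledge": (1.0, 4),
    "Grounding": (5.3, 8),
    "Captioning": (5.6, 3),
}

total_benchmarks = sum(n for _, n in gains.values())
benchmark_weighted = sum(g * n for g, n in gains.values()) / total_benchmarks
category_mean = sum(g for g, _ in gains.values()) / len(gains)

print(total_benchmarks)              # 30
print(round(benchmark_weighted, 1))  # 5.3
print(round(category_mean, 1))       # 5.1
```

The benchmark-weighted mean reproduces +5.3, while a plain mean over the six categories gives about +5.1, so the headline looks like a per-benchmark average.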
ablation - cross-category transfer
- core claim: “negative transfer is mitigated by data diversity + task-aware reward design”
- single-task RL often gives neutral or negative transfer to other categories. e.g. running RL on Captioning only drops Qwen2.5-VL's other categories by -4.4 ~ -35.5 points
- mixing all 6 categories, positive cross-category transfer is observed → i.e. adding one category also helps the other categories
- reasoning length differs a lot by category: Spatial & Action averages 1983 words vs Grounding/Search averages 125 words
user comment (p.13, next to “Spatial & Action”): “interestingly, Spatial & Action needs more sentences than STEM”
etc.
- whether the real effect of task-routed reward comes from (a) a more accurate reward signal or (b) reward distributions differing by category, which automatically produces a curriculum / balancing effect, has not been disentangled
- the +4.4 for RL in the SFT vs RL ablation is fair in that it uses the same data, but whether hparam tuning on the SFT side was sufficient is (not in the paper)
- only the Knowledge category has a small gain of +1.0 → probably because knowledge benchmarks are an area (factual recall) where RL training has little to offer
- in the comparison with MiMo-VL-7B-RL, winning 3 of 6 = slightly losing on average. still, “a fully open recipe + 600K data caught up with a closed recipe” is the core of the contribution