[N] Vero: An Open RL Recipe for General Visual Reasoning

paper

TL;DR

I read this because.. :
task : MLLM / RL (general visual reasoning)
Problem :** Bringing a wide range of visual reasoning from open weight VLM to RL at once, including charting / STEM / spatial / grounding, without focusing on a single domain.
Idea:** 6 categories × 100K uniform sampling (Vero-600K) + task-routed reward branching by answer format + GSPO single stage RL
input/output : {image, question} -> <think>...</think><answer>...</answer>
architecture: Qwen3-VL-8B-Instruct/Thinking, Qwen2.5-VL-7B-Instruct, MiMo-VL-7B-SFT on top of RL as-is (frozen or not is not in the paper)
objective : GSPO, asymmetric clipping ($\varepsilon_{high} > \varepsilon_{low}$), no KL. $R = (1-\alpha) R_{acc} + \alpha R_{fmt} + R_{overlong}$ ($\alpha=0.2$)
baseline : Qwen3-VL-8B-Instruct, Qwen3-VL-8B-Thinking, MiMo-VL-7B-RL
data:** Vero-600K = 59 datasets, 6 categories × 100K. filter by correctness/unambiguity/verifiability 3 criteria by ~50 examples each (filter judge = Qwen3-VL-235B-A22B-Instruct)
evaluation : VeroEval 30 benchmark (Chart&OCR 6 / STEM 4 / Spatial&Action 5 / Knowledge 4 / Grounding&Counting 8 / Captioning&IF 3)
result : Vero-Qwen3I-8B outperforms base by +5.3 average, Qwen3-VL-8B-Thinking by 23/30. Vero-MiMo-7B wins 3 out of 6 categories (STEM +0.5, Knowledge +5.1, Captioning +4.0) in MiMo-VL-7B-RL (closed recipe)
contribution : (1) 600K open RL data + 30 benchmark eval suite open release (2) ablation that task-routed reward (10 type) is +5.4 vs. math-verify single (3) claim that “data diversity + task-aware reward mitigates negative transfer in visual reasoning RL”
etc. : single stage, ~600 RL steps (≈1 epoch). KL penalty 0. 16x difference in reasoning length per category (Spatial&Action 1983 words vs Grounding/Search 125 words).

Details

data — Vero-600K

59 dataset, 6 categories × 100K uniform (uniform sampling scores +0.6 to +1.0 points higher than difficulty/area/length weighting - Table 2)
Organizing categories
- Chart & OCR (9) : ChartQA, InfoVQA, CoSyn-Chart/Diagram/Table, ArxivQA, ECD-VQA, EvoChart, InfographicVQA, ReachQA
- STEM (13) : CoSyn-Math, AI2D, Geo170K, GeomVerse, GeoQA+, MMK12, PathVQA, RAVEN, TQA, VisualWebInstruct, VQA-RAD, We-Math 2.0 (Pro & Std)
- Spatial & Action (8) : GameQA, Magma-AITW, Magma-Mind2Web, Robo2VLM, Spatial-SSRL, ST-VQA, Visual Jigsaw 2D/3D
- Knowledge & Recognition (12) : A-OKVQA, GQA, IconQA, Indoor-QA, KVG, KVQA, PopVQA, VCR, ViQuAE, Visual7W, VizWiz, VQAv2
- Grounding, Counting & Search (11) : AerialVG, GroundUI, MultiHop, Objects365-QA, OOD-VQA, OS-ATLAS, Pixel Reasoner, PixMo, RefCOCOg, TallyQA, Visual Probe
- Captioning & IF (6) : PixMo-AskAnything, PixMo-CapQA, PixMo-Cap, MM-RLVR-IFEval, MMIF-23K, Flickr30K
Filtering pipelines
~50 samples per category for direct reporting criteria: correctness (<5% annotation error rate), unambiguity (each question has a single verifiable answer), verifiability
automatic filter judge = Qwen3-VL-235B-A22B-Instruct. remove ambiguous/image-irrelevant/unverifiable question
Results: pre→post filter average 61.3-64.1
(not in the paper): Specify how datasets were weighted within categories X

training recipe

Algorithm: GSPO (Group Sequence Policy Optimization, asymmetric clipping). maintained entropy better than GRPO / DAPO in ablation (0.58±0.11 vs 0.50±0.11 / 0.22±0.15) with a slightly higher mean score (54.7 vs 54.3 / 54.3).
single stage RL, no warm-start SFT. ~600 steps ≈ 1 epoch
KL penalty 0
context length: soft overlong penalty buffer $[L_{max}-2048, L_{max}]$
temperature, etc. sampling: Qwen3 series T=0.7, Qwen2.5 series are shown in appendix Tables C2 through C3.
SFT vs RL ablation: For the same Vero-600K, RL scores +4.4 points over SFT - not because the data itself is better, but because the RL recipe needs to work with it

reward - task-routed (one of the core contributions)

Total reward $$ R(y, y^) = (1-\alpha) R_{acc}(y, y^) + \alpha R_{fmt}(y) + R_{overlong}(y),\quad \alpha=0.2 $$
overlong penalty (Eq. 4): $$ R_{overlong}(y) = \min!\Big(-\frac{|y|-(L_{max}-B)}{B}\lambda,\ 0\Big),\quad B=2048,\ \lambda=1.0 $$
format reward: <think>...</think><answer>...</answer> 1 for following the structure, 0 for not following the structure. Viewing format (such as not using \boxed{...}) is 0.5 for partial
10 accuracy rewards - branching by answer format
1. string match (exact text equality)

multiple choice (single letter extraction)
numeric → math-verify (symbolic parse + tolerance)
list string match (any-match, such as synonym)
ordering → full reward for correct list order, set is correct, 0.2 discount for incorrect order
web action (JSON field weighted match)
grounding (bboxes Hungarian matching, IoU/F1 threshold 0.5)
clicking (point-in-box, coordinates [0,1000] normalize)
instruction following (percentage of constraints met)
LLM-as-judge - Qwen3-32B (thinking disabled), 1-10 points, OLMo3 judge setup variant

ablation: math-verify single reward 51.8 → multi-route 57.2 (Table 4b). task-routed wins by +5.4 absolute score difference

User comment (p.5, next to entity recognition "A: Seagull"): “Maybe entity recognition should just be an exact match.”

reward hacking & judge guideline

All you need is an LLM judge and the model will inflate your score with self-evaluative language (“This satisfies all requirements”, “exhaustively documents every… detail”) + fabricated measurements
mitigation: specify Automatic Failure Conditions in judge prompt - automatic 1 point if self-evaluative / meta-commentary caught. Designing reward hacks to lose
(? What is it): How likely is this failure condition to cut through normal reasoning? I don’t think I’ve measured the false-positive rate.

evaluation — VeroEval 30 bench

Chart & OCR (6): ChartQA-Pro, ChartQA, InfoVQA, CharXiv, ChartMuseum, EvoChart
STEM (4): MMMU-Pro Standard, MMMU-Pro Vision, MathVision, MathVista-testmini
Spatial & Action (5): Blink, ERQA, GameQA-Lite, EmbSpatial, CVBench
Knowledge & Recognition (4): RealWorldQA, SimpleVQA, FVQA, MM-Vet V2
Grounding, Counting & Search (8): CountBenchQA, CountQA, MME-RealWorld, VStarBench, AerialVG, VisualProbe, ScreenSpot, ScreenSpot-Pro
Captioning & IF (3): MM-MTBench, MIA-Bench, MMIFEval

result

Vero-Qwen3I-8B vs Qwen3-VL-8B-Instruct: +5.3 average.
- Chart&OCR +8.5 / STEM +6.4 / Spatial&Action +3.7 / Knowledge +1.0 / Grounding +5.3 / Captioning +5.6
Small gain in knowledge only - seems to be an area where the original base was already good at
Vero-Qwen3I-8B vs Qwen3-VL-8B-Thinking: Wins 23 / 30 bench (Instruct base, but stronger case than Thinking base model)
Vero-Qwen3T-8B vs Qwen3-VL-8B-Thinking: 24 / 30 (Grounding +7.2, Chart&OCR +4.2)
Vero-MiMo-7B vs MiMo-VL-7B-RL (closed RL recipe): Wins 3 out of 6 categories with STEM +0.5, Knowledge +5.1, Captioning +4.0 - open recipe is tied with closed recipe

ablation — cross-category transfer

Key claim: “data diversity + task-aware reward design mitigates negative transfer”.
Single-task RL often has a neutral or negative transfer to other categories. Example: RL turns on captioning alone caused Qwen2.5-VL other categories to drop by -4.4 to -35.5 points
6 When mixing categories, positive cross-category transfer is observed - i.e., adding one category helps another category
Large difference in reasoning length by category: Spatial & Action average 1983 words vs Grounding/Search average 125 words

User comments (p.13, next to “Spatial & Action”): “Funny, Spatial & Action needs more sentences than STEM.”

etc.

No separation of whether the true effect of task-routed rewards is (a) due to the accuracy of the reward signal or (b) due to different reward distributions across categories, which automatically produces a curricular/balancing effect
The RL of +4.4 in SFT vs RL ablation is fair based on the same data, but whether the hparam tuning on the SFT side was sufficient (not in the paper) is not known.
Only the Knowledge category gain is small at +1.0 - as if the knowledge benchmark itself is an area where there is little to gain from learning with RL (factual recall).
MiMo-VL-7B-RL wins 3 out of 6 in gyrus comparison = a bit of a burden on average. Still, “fully open recipe + 600K data to catch up with closed recipe” is the core of contribution

TL;DR#

Details#

data — Vero-600K#

training recipe#

reward - task-routed (one of the core contributions)#

reward hacking & judge guideline#

evaluation — VeroEval 30 bench#

result#

ablation — cross-category transfer#

etc.#

TL;DR

Details

data — Vero-600K

training recipe

reward - task-routed (one of the core contributions)

reward hacking & judge guideline

evaluation — VeroEval 30 bench

result

ablation — cross-category transfer

etc.