TL;DR
- I read this because.. : open source VLM + RL
- task : MLLM / RL (general visual reasoning)
- problem : open source VLM + RL
- idea: Explore various recipes
- input/output :
{image, question} -> <think>...</think><answer>...</answer> - Architecture: RL trained directly on Qwen3-VL-8B-Instruct / Thinking, Qwen2.5-VL-7B-Instruct, and MiMo-VL-7B-SFT. (The paper does not specify whether these models were frozen.)
- objective : GSPO, asymmetric clipping
- baseline : Qwen3-VL-8B-Instruct, Qwen3-VL-8B-Thinking, MiMo-VL-7B-RL
- Data: Vero-600K = 59 datasets, 6 categories × 100,000
- evaluation : VeroEval 30 benchmark (Chart&OCR 6 / STEM 4 / Spatial&Action 5 / Knowledge 4 / Grounding&Counting 8 / Captioning&IF 3)
- Result: Vero-Qwen3I-8B achieved an average score of +5.3 compared to the baseline and outperformed Qwen3-VL-8B-Thinking in 23 out of 30 categories. Vero-MiMo-7B outperforms MiMo-VL-7B-RL (closed recipe) in 3 out of 6 categories (STEM +0.5, Knowledge +5.1, Captioning +4.0).
- Contribution: (1) Open release of 600K open RL data + 30 benchmark evaluation suites (2) Ablation study showing that task-routed rewards (10 types) yield a +5.4 improvement compared to a single math-verified reward (3) The claim that “negative transfer in visual reasoning RL is mitigated by data diversity and task-aware rewards”
- etc. :
Details
data — Vero-600K
- Category Structure
- Chart & OCR (9) : ChartQA, InfoVQA, CoSyn-Chart/Diagram/Table, ArxivQA, ECD-VQA, EvoChart, InfographicVQA, ReachQA
- STEM (13) : CoSyn-Math, AI2D, Geo170K, GeomVerse, GeoQA+, MMK12, PathVQA, RAVEN, TQA, VisualWebInstruct, VQA-RAD, We-Math 2.0 (Pro & Std)
- Spatial & Action (8) : GameQA, Magma-AITW, Magma-Mind2Web, Robo2VLM, Spatial-SSRL, ST-VQA, Visual Jigsaw 2D/3D
- Knowledge & Recognition (12) : A-OKVQA, GQA, IconQA, Indoor-QA, KVG, KVQA, PopVQA, VCR, ViQuAE, Visual7W, VizWiz, VQAv2
- Grounding, Counting & Search (11) : AerialVG, GroundUI, MultiHop, Objects365-QA, OOD-VQA, OS-ATLAS, Pixel Reasoner, PixMo, RefCOCOg, TallyQA, Visual Probe
- Captioning & IF (6) : PixMo-AskAnything, PixMo-CapQA, PixMo-Cap, MM-RLVR-IFEval, MMIF-23K, Flickr30K
- data filtering
- Established criteria by reviewing approximately 50 samples per category: correctness (<5% annotation error rate), unambiguity (whether each question has a single verifiable answer), and verifiability
- Automatic filter judge =
Qwen3-VL-235B-A22B-Instruct. Removes ambiguous, image-irrelevant, and unverifiable questions - Results: pre→post filter average 61.3–64.1
- data mixture
- They said it would have been better to just divide it equally
- (Not included in the paper): Does not specify how the datasets within a category were weighted
training recipe
- Algorithm: GSPO (Group Sequence Policy Optimization, asymmetric clipping). In the ablation study, it maintains entropy better than GRPO and DAPO (0.58±0.11 vs. 0.50±0.11 / 0.22±0.15) while also achieving a slightly higher average score (54.7 vs. 54.3 / 54.3).
- Single-stage RL, no warm-start SFT. ~600 steps ≈ 1 epoch
- KL penalty 0
- context length: soft overlong penalty buffer $[L_{max}-2048, L_{max}]$
- SFT vs. RL ablation: On the same Vero-600K dataset, RL outperformed SFT by 4.4 points
reward
- Total reward $$ R(y, y^) = (1-\alpha) R_{acc}(y, y^) + \alpha R_{fmt}(y) + R_{overlong}(y),\quad \alpha=0.2 $$
- overlong penalty (Eq. 4): $$ R_{overlong}(y) = \min!\Big(-\frac{|y|-(L_{max}-B)}{B}\lambda,\ 0\Big),\quad B=2048,\ \lambda=1.0 $$
- Reward format: 1 if the
<think>...</think><answer>...</answer>structure is followed; 0 if not. For answer formats that do not use\boxed{...}, a partial score of 0.5 is awarded. - 10 types of accuracy rewards — branched by answer format
- string match (exact text equality)
- Multiple choice (single-letter selection)
- numeric →
math-verify(symbolic parse + tolerance) - List string match (any-match, such as synonyms)
- Ordering → Full reward if the list is in the correct order; 0.2 discount if the set is correct but the order is wrong
- web action (JSON field weighted match)
- Grounding (Hungarian matching of bounding boxes, IoU/F1 threshold of 0.5)
- Clicking (point-in-box, coordinates [0,1000] normalized)
- Instruction following (Rate of compliance with constraints)
- LLM-as-judge — Qwen3-32B (disabled on thinking tasks), 1–10 points, modified OLMo3 judge setup
ablation:
math-verify: single reward 51.8 → multi-route 57.2 (Table 4b). task-routed wins by a margin of +5.4 absolute points
reward hacking & judge guideline
- If you use only an LLM as a judge, the model will inflate scores by using self-evaluative language (“This satisfies all requirements,” “exhaustively documents every… detail”) and fabricated metrics.
- Mitigation: Specify Automatic Failure Conditions in the judge prompt — automatically deduct 1 point if self-evaluative or meta-commentary is detected. Design the system so that reward hacking results in a loss.
- (? I wonder): Is there a chance that this failure condition could compromise normal reasoning? It doesn’t seem like the false-positive rate was measured separately.
evaluation — VeroEval 30 bench
- Chart & OCR (6): ChartQA-Pro, ChartQA, InfoVQA, CharXiv, ChartMuseum, EvoChart
- STEM (4): MMMU-Pro Standard, MMMU-Pro Vision, MathVision, MathVista-testmini
- Spatial & Action (5): Blink, ERQA, GameQA-Lite, EmbSpatial, CVBench
- Knowledge & Recognition (4): RealWorldQA, SimpleVQA, FVQA, MM-Vet V2
- Grounding, Counting & Search (8): CountBenchQA, CountQA, MME-RealWorld, VStarBench, AerialVG, VisualProbe, ScreenSpot, ScreenSpot-Pro
- Captioning & IF (3): MM-MTBench, MIA-Bench, MMIFEval
result
- Vero-Qwen3I-8B vs Qwen3-VL-8B-Instruct: +5.3 on average
- Chart&OCR +8.5 / STEM +6.4 / Spatial&Action +3.7 / Knowledge +1.0 / Grounding +5.3 / Captioning +5.6
- Limited knowledge gain — It appears to be an area where the original base was already performing well.
- Vero-Qwen3I-8B vs Qwen3-VL-8B-Thinking: Won on the 23/30 benchmark (an example where the Instruct-based model outperforms the Thinking-based model)
- Vero-Qwen3T-8B vs Qwen3-VL-8B-Thinking: 24 / 30 (Grounding +7.2, Chart&OCR +4.2)
- Vero-MiMo-7B vs MiMo-VL-7B-RL (closed RL recipe): Out of 6 categories, it outperformed the latter in 3—STEM +0.5, Knowledge +5.1, and Captioning +4.0
ablation — cross-category transfer
- Key claim: “Negative transfer is mitigated through data diversity and task-aware reward design”
- Single-task RL often results in neutral or negative transfer to other categories. For example, when training Qwen2.5-VL using only captioning, its scores on other categories drop by as much as -4.4 to -35.5 points.
- When all 6 categories are combined, positive cross-category transfer is observed—in other words, adding one category helps with the others as well
- Significant differences in reasoning length across categories: Spatial & Action average 1,983 words vs. Grounding/Search average 125 words
- Interestingly, “Spatial & Action” requires more sentences than “STEM.”
etc.
- It remains unclear whether the true effectiveness of task-routed rewards stems from (a) the accuracy of the reward signal or (b) the fact that the reward distribution varies by category, thereby automatically producing curriculum and balancing effects.
- In the SFT vs. RL ablation, the fact that RL achieved a score of +4.4 is fair since it was based on the same data, but it is unclear whether sufficient hyperparameter tuning was performed for SFT (not mentioned in the paper).
- The gain in the Knowledge category is only +1.0, which is small—this is likely because the Knowledge benchmark itself is a domain (factual recall) where there is little to be gained from RL training.