TL;DR
- I read this because : it's part of the Skywork series and it deals with entropy.
- task : multimodal reasoning model
- problem : the gap between open-source MLLMs and closed models is still large
- IDEA : continuation of the previous series (https://github.com/long8v/PTIR/issues/232 ) in which only the projector is trained. A paper offering several recipes and analyses.
- input/output : {image/text, prompt} -> {reasoning, answer}
- architecture : InternVL-38B
- objective : CE loss (SFT), GRPO loss (RL), entropy-guided checkpoint selection
- baseline : InternVL3-78B, Qwen2.5-VL-72B, GPT-4o, Claude 3.7, QVQ-72B
- data : cold-start STEM QA (12K), math RL data (15K), multi-domain connector tuning (10K)
- evaluation : 20+ benchmarks (MMMU, MathVista, LogicVista, PhyX, etc.) using VLMEvalKit
- result : SOTA among open-source (MMMU 76.0%), demonstrating reasoning transfer and generalization
- contribution : proposed the critical-token-entropy metric, highlighted the connector's role, provided RL analysis and ablations
- etc. : slow-thinking > fast-thinking; a reasoning-hallucination issue is identified; connector tuning alone is the effective component
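Since the RL objective is GRPO, its group-normalized advantage can be sketched as follows (a minimal sketch of the standard GRPO formula, not the paper's code):

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """GRPO advantage: sample G responses per prompt, score each with the
    reward function, then normalize within the group:
        A_i = (r_i - mean(r)) / (std(r) + eps)
    Unlike PPO, no learned value network is needed."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards)  # population std over the group
    return [(r - mu) / (sd + eps) for r in rewards]
```

Responses scoring above the group mean get positive advantages and are reinforced; `eps` guards against zero variance when all rollouts receive the same reward.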
details
- thumbnail
data preparation
- LongCoT : 20K Chinese high-school-difficulty questions, filtered by rejection sampling with Skywork-R1V2 (keep samples whose final answer is correct) –> 12K
- GRPO : 15K high-quality K-12-level math problems, all multiple-choice or fill-in-the-blank
- Data for connector-only tuning : 10K samples across 20 domains
Post-Training Recipes
- reward: format, accuracy reward
- cold start sft
- thousands of cold-start samples from an early internal version of Skywork-R1V2
- employed the Skywork-VL-Reward (Wang et al., 2025d) alongside GPT-4o to filter rambling and overly lengthy samples, resulting in a refined cold-start dataset
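The format + accuracy reward can be sketched as a simple rule-based function (a hedged sketch; the `<think>` tags, `\boxed{}` answer format, and weights are my assumptions, not the paper's exact implementation):

```python
import re

def format_reward(response: str) -> float:
    """1.0 if reasoning is wrapped in <think>...</think> and a boxed answer is given."""
    has_think = bool(re.search(r"<think>.*?</think>", response, re.DOTALL))
    has_answer = "\\boxed{" in response
    return 1.0 if (has_think and has_answer) else 0.0

def accuracy_reward(response: str, gold: str) -> float:
    """1.0 if the extracted final answer matches the gold answer (exact string match here)."""
    m = re.search(r"\\boxed\{([^}]*)\}", response)
    return 1.0 if (m and m.group(1).strip() == gold.strip()) else 0.0

def total_reward(response: str, gold: str, w_fmt: float = 0.1, w_acc: float = 1.0) -> float:
    # Weighted sum; accuracy dominates, format acts as a small shaping term.
    return w_fmt * format_reward(response) + w_acc * accuracy_reward(response, gold)
```

Rule-based rewards like this work for the GRPO data above precisely because it is restricted to verifiable formats (multiple-choice, fill-in-the-blank).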
vision-language benchmark performance
They use VLMEvalKit with minor per-task refinements, which they plan to open-source soon
Empirical Analysis on Reinforcement Learning
- Critical Token Entropy Indicates Reasoning Ability
Cold-start CoT SFT alone only imitates existing reasoning patterns rather than truly activating generalizable reasoning capabilities.
To measure this, they compute the entropy of critical tokens (wait, alternatively, etc.) and use it for checkpoint selection (it correlates well with MMMU performance).
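A minimal sketch of how critical-token entropy could be computed from per-position logits (the function and the choice of critical-token ids are my assumptions based on the description above):

```python
import math

def critical_token_entropy(token_ids, logits, critical_ids):
    """Average predictive entropy at positions where the model emitted a
    critical (deliberation) token such as 'wait' or 'alternatively'.

    token_ids    : list[int]          -- generated token ids
    logits       : list[list[float]]  -- per-position logits over the vocab
    critical_ids : set[int]           -- ids of deliberation tokens
    """
    entropies = []
    for tok, row in zip(token_ids, logits):
        if tok not in critical_ids:
            continue
        # numerically stable softmax, then Shannon entropy in nats
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        z = sum(exps)
        probs = [e / z for e in exps]
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0))
    return sum(entropies) / len(entropies) if entropies else 0.0
```

Checkpoint selection would then amount to scoring each SFT/RL checkpoint with this metric on a held-out set and preferring the one whose critical-token entropy best matches the regime that correlates with MMMU performance.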
The Connector Module Activation is Vital in RL
The Distribution Shift in Curriculum Learning Hinders Generalization
When training moves from K-12 to competition difficulty, performance on hard problems tends to rise while performance on normal difficulty drops; physics and logic stay roughly flat.
The complex skills, special patterns, and high-level strategies required for hard problems tend to conflict with what normal-difficulty problems need.
Component-freeze ablation while training on multiple domains after the RL stage
Discussion
In-domain (MathVista) vs. out-of-domain (MMMU) performance gap between math-only SFT and math-only RL
SFT does not generalize, RL does (https://github.com/long8v/PTIR/issues/230 )
thinking budget
Hallucination in Skywork-R1V3's Chain-of-Thought Impairs Reasoning Performance
Analysis of Token Entropy in Visual Reasoning Tasks
As training progresses, the overall token entropy decreases (generation becomes more deterministic), but the probability assigned to high-entropy tokens increases.
In other words, training pushes toward more deliberation tokens such as wait, …
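This trend can be tracked with a simple per-generation profile — mean token entropy vs. the share of high-entropy positions (a sketch; the threshold value is an assumption):

```python
import math

def entropy_profile(step_logits, threshold=1.0):
    """For one generation, return (mean token entropy, fraction of positions
    whose entropy exceeds `threshold`). Tracked across training steps, this
    would show the pattern above: mean entropy falls while the share of
    high-entropy 'deliberation' positions can grow."""
    def entropy(row):
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        z = sum(exps)
        return -sum((e / z) * math.log(e / z) for e in exps)
    ents = [entropy(r) for r in step_logits]
    high = sum(1 for h in ents if h > threshold)
    return sum(ents) / len(ents), high / len(ents)
```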
- The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models https://arxiv.org/pdf/2505.22617
- Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning https://arxiv.org/abs/2506.01939