Image

paper, model, code

TL;DR

  • I read this because: it is part of the Skywork series and it discusses entropy.
  • task : multimodal reasoning model
  • problem : open-source MLLMs still lag noticeably behind closed models in reasoning
  • IDEA : Continuation of the previous series (https://github.com/long8v/PTIR/issues/232 ), where only the projector (connector) is trained. A paper with several recipes and analyses.
  • input/output : {image/text, prompt} -> {reasoning, answer}
  • architecture : InternVL-38B
  • objective : CE loss (SFT), GRPO loss (RL), entropy-guided checkpoint selection
  • baseline : InternVL3-78B, Qwen2.5-VL-72B, GPT-4o, Claude 3.7, QVQ-72B
  • data : cold-start STEM QA (12K), math RL data (15K), multi-domain connector tuning (10K)
  • evaluation : 20+ benchmarks (MMMU, MathVista, LogicVista, PhyX, etc.) using VLMEvalKit
  • result : SOTA among open-source (MMMU 76.0%), demonstrating reasoning transfer and generalization
  • contribution : Proposed critical token entropy metric, highlighted connector role, provided RL analysis and ablation
  • etc. : slow thinking > fast thinking; a reasoning-hallucination issue is identified; tuning only the connector is effective

details

  • thumbnail
Image
  • data preparation

    • Image
    • LongCoT: 20K Chinese high-school-difficulty problems, rejection-sampled with Skywork-R1V2 (filtering on the final answer) –> 12K
    • GRPO : 15K high-quality K-12-level math problems, all multiple-choice or fill-in-the-blank
    • Connector-only tuning data : 10K samples across 20 domains
    • Image
  • Post-Training Recipes

    • reward: format, accuracy reward
    • cold start sft
      • thousands of cold-start samples from an early internal version of Skywork-R1V2
      • employed Skywork-VL-Reward (Wang et al., 2025d) alongside GPT-4o to filter out rambling and overly lengthy samples, yielding a refined cold-start dataset
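The format + accuracy reward mentioned above can be sketched as two rule-based checks. This is a hypothetical illustration: the `<think>` tag, the `\boxed{}` answer convention, and the equal weighting are assumptions, not details taken from the paper.

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the response wraps reasoning in <think>...</think>
    and then gives a final answer in \\boxed{...}, else 0.0.
    (Tag/answer conventions are assumed for illustration.)"""
    pattern = r"<think>.*?</think>.*?\\boxed\{.+?\}"
    return 1.0 if re.search(pattern, response, re.DOTALL) else 0.0

def accuracy_reward(response: str, gold: str) -> float:
    """1.0 if the boxed final answer string-matches the gold answer."""
    m = re.search(r"\\boxed\{(.+?)\}", response)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip() == gold.strip() else 0.0

def total_reward(response: str, gold: str) -> float:
    # Equal weighting is an assumption for illustration.
    return 0.5 * format_reward(response) + 0.5 * accuracy_reward(response, gold)
```

In practice the accuracy check would use answer normalization (or a verifier model) rather than exact string match.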
  • vision-language benchmark performance

  • They use VLMEvalKit, refined slightly per task, and plan to open-source the refinements soon

    • Image
  • Empirical Analysis on Reinforcement Learning

    • Critical Token Entropy Indicates Reasoning Ability
    • Image
  • With cold-start CoT SFT alone, the model only pretends to reason: it repeats existing patterns rather than truly activating generalizable reasoning capabilities.

  • To measure this, they compute the entropy of critical tokens ("wait", "alternatively", etc.) and use it for checkpoint selection; it correlates well with MMMU performance.
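The critical-token-entropy idea can be sketched as: average the model's predictive entropy at the positions where it emits reflection tokens. The exact token list and the use of raw per-step distributions are my assumptions for illustration, not the paper's implementation.

```python
import math

# Assumed set of "critical" deliberation tokens (illustrative only).
CRITICAL_TOKENS = {"wait", "alternatively", "however", "check"}

def entropy(dist):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def critical_token_entropy(tokens, dists):
    """Mean entropy over the steps whose sampled token is critical.

    tokens: list[str], the generated tokens
    dists:  list[list[float]], next-token distribution at each step
    """
    vals = [entropy(d) for t, d in zip(tokens, dists)
            if t.lower() in CRITICAL_TOKENS]
    return sum(vals) / len(vals) if vals else 0.0
```

Checkpoints whose rollouts score higher on this metric would then be preferred, per the correlation with MMMU reported above.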

  • The Connector Module Activation is Vital in RL

    • Image
  • The Distribution Shift in Curriculum Learning Hinders Generalization

    • Image
  • Moving from K-12 to competition difficulty in a single step, performance on hard problems tends to increase while performance at normal difficulty decreases; physics and logic stay roughly the same.

  • The complex skills, special patterns, and high-level strategies required for hard problems tend to conflict with what normal-difficulty problems need.

  • Component-freezing ablation during multi-domain training after the RL stage

    • Image
  • Discussion

  • Image
  • In-domain (MathVista) vs. out-of-domain (MMMU) performance when training on math only: SFT vs. RL

  • SFT does not generalize, RL does (https://github.com/long8v/PTIR/issues/230 )

  • thinking budget

    • Image
    • Image
  • Hallucination in Skywork-R1V3’s Chain-of-Thought Impairs Reasoning Performance

    • Image
  • Analysis on Entropy Token in Visual Reasoning Task

    • Image
  • As training progresses, the overall token entropy decreases (generation becomes more deterministic), but the probability of high-entropy tokens increases.

  • In other words, training pushes toward more deliberation tokens such as "wait", etc.

    • The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models https://arxiv.org/pdf/2505.22617
    • Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning https://arxiv.org/abs/2506.01939
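The two trends above (falling mean entropy, rising share of deliberation tokens) can be separated with a small sketch. The deliberation-token list is an assumption for illustration.

```python
import math

# Assumed deliberation tokens (illustrative only).
DELIBERATION_TOKENS = {"wait", "alternatively", "hmm", "however"}

def step_entropy(dist):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def rollout_stats(tokens, dists):
    """Return (mean per-step entropy, deliberation-token frequency)
    for one rollout. Tracking both over training would show entropy
    falling while the deliberation-token share rises."""
    mean_h = sum(step_entropy(d) for d in dists) / len(dists)
    delib = sum(t.lower() in DELIBERATION_TOKENS for t in tokens) / len(tokens)
    return mean_h, delib
```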