TL;DR
- I read this because : it is part of the Skywork series, and it discusses entropy.
- task : multimodal reasoning model
- problem : the gap between open MLLMs and closed models is still large
- idea : continues the previous entry in this series (https://github.com/long8v/PTIR/issues/232) on training only the projector. A paper packed with analyses of various recipes.
- input/output : {image/text, prompt} -> {reasoning, answer}
- architecture : InternVL-38B
- objective : CE loss (SFT), GRPO loss (RL), entropy-guided checkpoint selection
- baseline : InternVL3-78B, Qwen2.5-VL-72B, GPT-4o, Claude 3.7, QVQ-72B
- data : cold-start STEM QA (12K), math RL data (15K), multi-domain connector tuning (10K)
- evaluation : 20+ benchmarks (MMMU, MathVista, LogicVista, PhyX, etc.) using VLMEvalKit
- result : SOTA among open-source models (MMMU 76.0%), demonstrates reasoning transfer and generalization
- contribution : proposes the critical-token entropy metric, highlights the role of the connector, provides RL analysis and ablations
- etc. : slow-thinking > fast-thinking, discovers a reasoning-hallucination issue, connector tuning alone is effective
details
- thumbnail
data preparation
- LongCoT: 20K Chinese high-school-difficulty questions, rejection-sampled with Skywork-R1V2 (keep traces whose final answer is correct) -> 12K (see the sketch after this list)
- GRPO: 15K high-quality K-12-level math problems -> all multiple-choice or fill-in-the-blank
- Data for connector only: 10K samples across 20 domains
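A minimal sketch of the rejection-sampling step above; `generate` and `extract_final_answer` are hypothetical helpers of my own, not from the paper:

```python
from typing import Callable, List, Optional

def rejection_sample(
    prompts: List[str],
    gold_answers: List[str],
    generate: Callable[[str, int], List[str]],            # prompt, n -> n sampled CoT traces
    extract_final_answer: Callable[[str], Optional[str]],
    n_samples: int = 8,
) -> List[dict]:
    """Keep only CoT traces whose extracted final answer matches the gold label."""
    kept = []
    for prompt, gold in zip(prompts, gold_answers):
        for trace in generate(prompt, n_samples):
            pred = extract_final_answer(trace)
            if pred is not None and pred.strip() == gold.strip():
                kept.append({"prompt": prompt, "response": trace})
                break  # one verified trace per question for the SFT set
    return kept
```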
Post-Training Recipes
- reward: format reward + accuracy reward (see the sketch after this block)
- cold-start SFT
- thousands of cold-start samples from an early internal version of Skywork-R1V2
- employed the Skywork-VL-Reward (Wang et al., 2025d) alongside GPT-4o to filter rambling and overly lengthy samples, resulting in a refined cold-start dataset
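A minimal sketch of the format/accuracy rewards plus GRPO-style group-relative advantages; the `<think>`/`\boxed{}` conventions and the equal reward weights are my assumptions, not the paper's spec:

```python
import re
import torch

def format_reward(response: str) -> float:
    # Assumed convention: reasoning inside <think>...</think>, answer in \boxed{...}.
    has_think = bool(re.search(r"<think>.*</think>", response, re.DOTALL))
    return 1.0 if has_think and "\\boxed{" in response else 0.0

def accuracy_reward(response: str, gold: str) -> float:
    m = re.search(r"\\boxed\{([^}]*)\}", response)
    return 1.0 if m and m.group(1).strip() == gold.strip() else 0.0

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (G,) rewards of one group of rollouts for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# usage: score one group of rollouts, then normalize within the group
rollouts = ["<think>...</think> \\boxed{42}", "<think>...</think> \\boxed{7}"]
rewards = torch.tensor([format_reward(r) + accuracy_reward(r, "42") for r in rollouts])
advantages = grpo_advantages(rewards)  # positive for correct, negative for wrong
```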
vision language benchmark performance
- they evaluated with VLMEvalKit, lightly adapted per task; they say the adapted version will be open-sourced soon
Empirical Analysis on Reinforcement Learning
- Critical Token Entropy Indicates Reasoning Ability
- with cold-start CoT SFT alone, the model only appears to reason; it is "repeating existing patterns rather than truly activating generalizable reasoning capabilities"
- to measure this, they compute the entropy of critical tokens ("wait", "alternatively", etc.) and use it for checkpoint selection (it correlates strongly with MMMU performance)
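A minimal sketch of the critical-token entropy metric, assuming a Hugging Face-style causal LM and teacher-forced recomputation over a generated trace; the exact critical-token list and aggregation are my assumptions, not the paper's released code:

```python
import torch
import torch.nn.functional as F

CRITICAL_WORDS = {"wait", "alternatively", "however", "hmm"}  # assumed list

@torch.no_grad()
def critical_token_entropy(model, tokenizer, text: str) -> float:
    """Mean predictive entropy at positions whose target token is a critical word."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    logits = model(ids).logits[0, :-1]                    # position t predicts token t+1
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(-1)  # (T-1,) per-position entropy
    targets = ids[0, 1:].tolist()
    mask = torch.tensor([tokenizer.decode([t]).strip().lower() in CRITICAL_WORDS
                         for t in targets])
    return entropy[mask].mean().item() if mask.any() else float("nan")
```

Given the reported positive correlation with MMMU, checkpoint selection would prefer the checkpoint whose reasoning traces score highest on this metric.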
The Connector Module Activation is Vital in RL
The Distribution Shift in Curriculum Learning Hinders Generalization
- they tried a curriculum that shifts once from K-12 to competition difficulty; performance on hard problems rises, but normal-difficulty performance drops, while physics and logic tend to hold steady
- the complex skills, special patterns, and high-level strategies needed for hard problems seem to conflict with what the normal level requires
Component-freeze ablation for the stage that trains on multiple domains after RL (a freeze sketch follows)
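A minimal sketch of how such a freeze ablation could be wired up; the attribute names (`vision_encoder`, `connector`, `language_model`) are placeholders, not InternVL's actual module names:

```python
def freeze_components(model, train_vit=False, train_connector=True, train_llm=False):
    """Toggle trainability per component; attribute names are placeholders."""
    for p in model.vision_encoder.parameters():
        p.requires_grad_(train_vit)
    for p in model.connector.parameters():
        p.requires_grad_(train_connector)
    for p in model.language_model.parameters():
        p.requires_grad_(train_llm)

# e.g. the connector-only configuration for the multi-domain stage:
# freeze_components(model, train_vit=False, train_connector=True, train_llm=False)
```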
Discussion
in-domain (MathVista) vs. out-of-domain (MMMU) performance difference when SFT and RL are trained on math-only data
SFT does not generalize, while RL does (https://github.com/long8v/PTIR/issues/230)
thinking budget
Hallucination in Skywork-R1V3's Chain-of-Thought Impairs Reasoning Performance
Analysis on Entropy Token in Visual Reasoning Task
- as training progresses, the entropy of tokens overall decreases (they become more deterministic), while the probability of high-entropy tokens increases
- i.e., training pushes deliberation tokens such as "wait", … to appear more often (a minimal sketch follows after the references)
- The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models (https://arxiv.org/pdf/2505.22617)
- Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning (https://arxiv.org/abs/2506.01939)
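A minimal sketch of tracking these two quantities per checkpoint; the deliberation-token id list and the per-trace teacher-forced logits are my assumptions:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def entropy_stats(logits: torch.Tensor, deliberation_ids: torch.Tensor) -> dict:
    """logits: (T, V) per-position logits of one generated reasoning trace."""
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(-1)  # (T,) per-position entropy
    delib_mass = probs[:, deliberation_ids].sum(-1)       # prob. of "wait"-style tokens
    return {
        "mean_entropy": entropy.mean().item(),        # observed to fall as training proceeds
        "mean_delib_prob": delib_mass.mean().item(),  # observed to rise as training proceeds
    }
```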