TL;DR
- I read this because : it is part of the Skywork series, and it discusses entropy.
- task : multimodal reasoning model
- problem : the gap between open MLLMs and closed models is still large
- idea : continues the previous entry in this series (https://github.com/long8v/PTIR/issues/232) on training only the projector. A paper packed with analyses of various recipes.
- input/output : {image/text, prompt} -> {reasoning, answer}
- architecture : InternVL-38B
- objective : CE loss (SFT), GRPO loss (RL), entropy-guided checkpoint selection
- baseline : InternVL3-78B, Qwen2.5-VL-72B, GPT-4o, Claude 3.7, QVQ-72B
- data : cold-start STEM QA (12K), math RL data (15K), multi-domain connector tuning (10K)
- evaluation : 20+ benchmarks (MMMU, MathVista, LogicVista, PhyX, etc.) using VLMEvalKit
- result : SOTA among open-source models (MMMU 76.0%), demonstrates reasoning transfer and generalization
- contribution : proposes the critical-token entropy metric, highlights the role of the connector, provides RL analysis and ablations
- etc. : slow-thinking > fast-thinking, discovers a reasoning-hallucination issue, connector tuning alone is effective
details
- thumbnail
data preparation
- LongCoT: 20K Chinese high-school-difficulty questions, rejection-sampled with Skywork-R1V2 (keep traces whose final answer is correct) -> 12K (see the sketch after this list)
- GRPO: 15K high-quality K-12-level math problems -> all multiple-choice or fill-in-the-blank
- Data for connector only: 10K samples across 20 domains
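A minimal sketch of the rejection-sampling step above; `generate` and `extract_final_answer` are hypothetical helpers of my own, not from the paper:

```python
from typing import Callable, List, Optional

def rejection_sample(
    prompts: List[str],
    gold_answers: List[str],
    generate: Callable[[str, int], List[str]],            # prompt, n -> n sampled CoT traces
    extract_final_answer: Callable[[str], Optional[str]],
    n_samples: int = 8,
) -> List[dict]:
    """Keep only CoT traces whose extracted final answer matches the gold label."""
    kept = []
    for prompt, gold in zip(prompts, gold_answers):
        for trace in generate(prompt, n_samples):
            pred = extract_final_answer(trace)
            if pred is not None and pred.strip() == gold.strip():
                kept.append({"prompt": prompt, "response": trace})
                break  # one verified trace per question for the SFT set
    return kept
```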
Post-Training Recipes
- reward: format reward + accuracy reward (see the sketch after this block)
- cold-start SFT
- thousands of cold-start samples from an early internal version of Skywork-R1V2
- employed the Skywork-VL-Reward (Wang et al., 2025d) alongside GPT-4o to filter rambling and overly lengthy samples, resulting in a refined cold-start dataset
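A minimal sketch of the format/accuracy rewards plus GRPO-style group-relative advantages; the `<think>`/`\boxed{}` conventions and the equal reward weights are my assumptions, not the paper's spec:

```python
import re
import torch

def format_reward(response: str) -> float:
    # Assumed convention: reasoning inside <think>...</think>, answer in \boxed{...}.
    has_think = bool(re.search(r"<think>.*</think>", response, re.DOTALL))
    return 1.0 if has_think and "\\boxed{" in response else 0.0

def accuracy_reward(response: str, gold: str) -> float:
    m = re.search(r"\\boxed\{([^}]*)\}", response)
    return 1.0 if m and m.group(1).strip() == gold.strip() else 0.0

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (G,) rewards of one group of rollouts for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# usage: score one group of rollouts, then normalize within the group
rollouts = ["<think>...</think> \\boxed{42}", "<think>...</think> \\boxed{7}"]
rewards = torch.tensor([format_reward(r) + accuracy_reward(r, "42") for r in rollouts])
advantages = grpo_advantages(rewards)  # positive for correct, negative for wrong
```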
vision language benchmark performance
- they evaluated with VLMEvalKit, lightly adapted per task; they say the adapted version will be open-sourced soon
Empirical Analysis on Reinforcement Learning
- Critical Token Entropy Indicates Reasoning Ability
- with cold-start CoT SFT alone, the model only appears to reason; it is "repeating existing patterns rather than truly activating generalizable reasoning capabilities"
- to measure this, they compute the entropy of critical tokens ("wait", "alternatively", etc.) and use it for checkpoint selection (it correlates strongly with MMMU performance)
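A minimal sketch of the critical-token entropy metric, assuming a Hugging Face-style causal LM and teacher-forced recomputation over a generated trace; the exact critical-token list and aggregation are my assumptions, not the paper's released code:

```python
import torch
import torch.nn.functional as F

CRITICAL_WORDS = {"wait", "alternatively", "however", "hmm"}  # assumed list

@torch.no_grad()
def critical_token_entropy(model, tokenizer, text: str) -> float:
    """Mean predictive entropy at positions whose target token is a critical word."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    logits = model(ids).logits[0, :-1]                    # position t predicts token t+1
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(-1)  # (T-1,) per-position entropy
    targets = ids[0, 1:].tolist()
    mask = torch.tensor([tokenizer.decode([t]).strip().lower() in CRITICAL_WORDS
                         for t in targets])
    return entropy[mask].mean().item() if mask.any() else float("nan")
```

Given the reported positive correlation with MMMU, checkpoint selection would prefer the checkpoint whose reasoning traces score highest on this metric.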
The Connector Module Activation is Vital in RL
The Distribution Shift in Curriculum Learning Hinders Generalization
- they tried a curriculum that shifts once from K-12 to competition difficulty; performance on hard problems rises, but normal-difficulty performance drops, while physics and logic tend to hold steady
- the complex skills, special patterns, and high-level strategies needed for hard problems seem to conflict with what the normal level requires
Component-freeze ablation for the stage that trains on multiple domains after RL (a freeze sketch follows)
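A minimal sketch of how such a freeze ablation could be wired up; the attribute names (`vision_encoder`, `connector`, `language_model`) are placeholders, not InternVL's actual module names:

```python
def freeze_components(model, train_vit=False, train_connector=True, train_llm=False):
    """Toggle trainability per component; attribute names are placeholders."""
    for p in model.vision_encoder.parameters():
        p.requires_grad_(train_vit)
    for p in model.connector.parameters():
        p.requires_grad_(train_connector)
    for p in model.language_model.parameters():
        p.requires_grad_(train_llm)

# e.g. the connector-only configuration for the multi-domain stage:
# freeze_components(model, train_vit=False, train_connector=True, train_llm=False)
```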
Discussion
in-domain (MathVista) vs. out-of-domain (MMMU) performance difference when SFT and RL are trained on math-only data
SFT does not generalize, while RL does (https://github.com/long8v/PTIR/issues/230)
thinking budget
Hallucination in Skywork-R1V3's Chain-of-Thought Impairs Reasoning Performance
Analysis on Entropy Token in Visual Reasoning Task
- as training progresses, the entropy of tokens overall decreases (they become more deterministic), while the probability of high-entropy tokens increases
- i.e., training pushes deliberation tokens such as "wait", … to appear more often (a minimal sketch follows after the references)
- The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models (https://arxiv.org/pdf/2505.22617)
- Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning (https://arxiv.org/abs/2506.01939)
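A minimal sketch of tracking these two quantities per checkpoint; the deliberation-token id list and the per-trace teacher-forced logits are my assumptions:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def entropy_stats(logits: torch.Tensor, deliberation_ids: torch.Tensor) -> dict:
    """logits: (T, V) per-position logits of one generated reasoning trace."""
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(-1)  # (T,) per-position entropy
    delib_mass = probs[:, deliberation_ids].sum(-1)       # prob. of "wait"-style tokens
    return {
        "mean_entropy": entropy.mean().item(),        # observed to fall as training proceeds
        "mean_delib_prob": delib_mass.mean().item(),  # observed to rise as training proceeds
    }
```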