Image

paper, model, code

TL;DR

  • I read this because.. : It is part of the Skywork series and it discusses entropy.
  • task : multimodal reasoning model
  • problem : For MLLMs, the gap to closed models is even larger.
  • idea : Training only the projector continues the approach of the previous entry in the series (https://github.com/long8v/PTIR/issues/232 ). The paper bundles several training recipes and analyses.
  • input/output : {image/text, prompt} -> {reasoning, answer}
  • architecture : InternVL3-38B
  • objective : CE loss (SFT), GRPO loss (RL), entropy-guided checkpoint selection
  • baseline : InternVL3-78B, Qwen2.5-VL-72B, GPT-4o, Claude 3.7, QVQ-72B
  • data : cold-start STEM QA (12K), math RL data (15K), multi-domain connector tuning (10K)
  • evaluation : 20+ benchmarks (MMMU, MathVista, LogicVista, PhyX, etc.) using VLMEvalKit
  • result : SOTA among open-source models (MMMU 76.0%); demonstrates reasoning transfer and generalization
  • contribution : proposes a critical token entropy metric, emphasizes the role of the connector, provides RL analysis and ablations
  • etc. : slow-thinking > fast-thinking, a reasoning hallucination issue is identified, only connector tuning is effective

details

  • thumbnail
Image
  • data preparation

    • Image
    • LongCoT: 20K problems at Chinese high-school difficulty -> rejection sampling with Skywork-R1V2 (on the final answer) -> 12K
    • GRPO: 15K high-quality K-12 level math problems, all multiple-choice or fill-in-the-blank
    • Connector-only tuning data: 10K samples across 20 domains
    • Image
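The rejection sampling step in the LongCoT bullet above (keep a sampled reasoning trace only if its final answer matches the reference) can be sketched roughly as follows. This is a minimal sketch, not the authors' pipeline: the `Answer:` marker, the `generate` callable, the dict keys, and the "keep one trace per problem" policy are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional

ANSWER_MARKER = "Answer:"  # hypothetical marker; the paper's actual template is not given in these notes


def extract_final_answer(response: str) -> Optional[str]:
    """Return the text after the last answer marker, or None if the marker is absent."""
    idx = response.rfind(ANSWER_MARKER)
    if idx == -1:
        return None
    return response[idx + len(ANSWER_MARKER):].strip()


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not matter."""
    return " ".join(text.lower().split())


def rejection_sample(
    problems: List[Dict[str, str]],
    generate: Callable[[str], str],
    num_samples: int = 4,
) -> List[Dict[str, str]]:
    """Keep a sampled response only when its final answer matches the reference."""
    kept = []
    for problem in problems:
        for _ in range(num_samples):
            response = generate(problem["question"])
            final = extract_final_answer(response)
            if final is not None and normalize(final) == normalize(problem["answer"]):
                kept.append({"question": problem["question"], "response": response})
                break  # assumed policy: at most one accepted trace per problem
    return kept
```

Here `generate` stands in for sampling from the teacher model (Skywork-R1V2 in the notes); problems with no accepted trace are simply dropped, which is how 20K candidates could shrink to 12K.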
  • Post-Training Recipes

    • reward: format, accuracy reward
    • cold start sft
      • thousands of cold-start samples from an early internal version of Skywork-R1V2
      • employed the Skywork-VL-Reward (Wang et al., 2025d) alongside GPT-4o to filter rambling and overly lengthy samples, resulting in a refined cold-start dataset
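A rough sketch of how a format reward and an accuracy reward could be combined and turned into GRPO-style group-relative advantages. The `<think>...</think>` template, exact-match answer checking, the `format_weight` value, and the use of population standard deviation are assumptions for illustration; the notes only say that the reward has a format term and an accuracy term.

```python
import re
import statistics
from typing import List

# Assumed response template: a <think> block followed by the final answer.
THINK_PATTERN = re.compile(r"^<think>.+?</think>\s*(?P<answer>.+)$", re.DOTALL)


def format_reward(response: str) -> float:
    """1.0 if the response follows the assumed template, else 0.0."""
    return 1.0 if THINK_PATTERN.match(response.strip()) else 0.0


def accuracy_reward(response: str, reference: str) -> float:
    """1.0 if the answer after the <think> block exactly matches the reference."""
    match = THINK_PATTERN.match(response.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group("answer").strip() == reference.strip() else 0.0


def total_reward(response: str, reference: str, format_weight: float = 0.5) -> float:
    """Weighted sum of the two terms; the weight is a hypothetical choice."""
    return accuracy_reward(response, reference) + format_weight * format_reward(response)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and std of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

For a group of responses sampled for the same prompt, `group_relative_advantages([total_reward(r, ref) for r in group])` gives the per-response advantage; a group whose rewards are all equal yields all-zero advantages and therefore no learning signal.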
  • vision language benchmark performance

    • They use VLMEvalKit with small per-task adjustments, and say the adjusted version will be open-sourced soon
    • Image
  • Empirical Analysis on Reinforcement Learning

    • Critical Token Entropy Indicates Reasoning Ability
    • Image
    • With cold-start CoT SFT alone, the model only pretends to reason and generalizable reasoning ability does not actually emerge (the paper's wording: "repeating existing patterns rather than truly activating generalizable reasoning capabilities")
    • To measure this, they compute the entropy of critical tokens (wait, alternatively, etc.) and use it to evaluate checkpoints (it correlates strongly with MMMU performance)
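The critical token entropy idea above could be computed along these lines. This is a minimal sketch under assumptions: the critical-token set contains only the two examples named in these notes, entropy is taken over the next-token distribution at the position where a critical token was emitted, and the result is a plain mean. The paper's exact definition may differ.

```python
import math
from typing import Iterable, List, Optional, Sequence

# Only the two examples mentioned in the notes; the full list is not given here.
CRITICAL_TOKENS = frozenset({"wait", "alternatively"})


def token_entropy(probs: Iterable[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def critical_token_entropy(
    tokens: Sequence[str],
    distributions: Sequence[Sequence[float]],
    critical: frozenset = CRITICAL_TOKENS,
) -> Optional[float]:
    """Mean entropy over positions whose emitted token is a critical token.

    tokens[i] is the token emitted at step i and distributions[i] is the
    probability distribution it was sampled from (an assumed convention).
    Returns None when no critical token occurs in the sequence.
    """
    values: List[float] = [
        token_entropy(dist)
        for tok, dist in zip(tokens, distributions)
        if tok.strip().lower() in critical
    ]
    if not values:
        return None
    return sum(values) / len(values)
```

Averaging this value over a fixed prompt set for each checkpoint gives one number per checkpoint that can be compared against benchmark scores such as MMMU.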
  • The Connector Module Activation is Vital in RL

    • Image
  • The Distribution Shift in Curriculum Learning Hinder Generalization

    • Image
    • They tried shifting the training data once from K12 to competition difficulty: performance on high-difficulty problems goes up, normal-difficulty performance drops, and physics and logic tend to stay flat.
    • The complex skills, special patterns, and high-level strategies needed for hard problems seem to conflict with what works at the normal level
  • Component-freeze ablation for the multi-domain training stage that follows the RL stage

    • Image
  • Discussion

  • Image
  • Difference in in-domain (MathVista) and out-of-domain (MMMU) performance when SFT and RL are each run on math-only data

  • SFT does not generalize, while RL does (https://github.com/long8v/PTIR/issues/230 )

  • thinking budget

    • Image
    • Image
  • Hallucination in Skywork-R1V3's Chain-of-Thought Impairs Reasoning Performance

    • Image
  • Analysis on Entropy Token in Visual Reasoning Task

    • Image
    • ํ•™์Šต์ด ์ง„ํ–‰๋จ์— ๋”ฐ๋ผ ์ „๋ฐ˜์ ์ธ ํ† ํฐ์˜ ์—”ํŠธ๋กœํ”ผ๋Š” ๋‚ฎ์•„์ง€๋‚˜(determinisitic ํ•ด์ง€๋‚˜) ๋†’์€ ์—”ํŠธ๋กœํ”ผ๋ฅผ ๊ฐ€์ง„ ํ† ํฐ๋“ค์˜ ํ™•๋ฅ ์€ ๋†’์•„์ง.
    • ์ฆ‰, wait, … ๊ณผ ๊ฐ™์€ delibration token๋“ค์ด ๋” ๋งŽ์ด ๋‚˜์˜ค๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ํ•™์Šต๋จ
    • The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models https://arxiv.org/pdf/2505.22617
    • Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning https://arxiv.org/abs/2506.01939
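To check the trend noted in this section (lower overall entropy alongside more deliberation tokens), one could track two summary numbers per checkpoint. The sketch below is only illustrative: the function name, the deliberation-token set, and the input format (emitted tokens with precomputed per-token entropies) are assumptions, not the paper's analysis code.

```python
from typing import Dict, Sequence

# "wait" is the only deliberation token named explicitly in these notes.
DELIBERATION_TOKENS = frozenset({"wait"})


def entropy_summary(
    tokens: Sequence[str],
    entropies: Sequence[float],
    deliberation: frozenset = DELIBERATION_TOKENS,
) -> Dict[str, float]:
    """Mean per-token entropy and the share of emitted tokens that are deliberation tokens."""
    if not tokens or len(tokens) != len(entropies):
        raise ValueError("tokens and entropies must be non-empty and the same length")
    mean_entropy = sum(entropies) / len(entropies)
    hits = sum(1 for tok in tokens if tok.strip().lower() in deliberation)
    return {"mean_entropy": mean_entropy, "deliberation_rate": hits / len(tokens)}
```

Comparing `entropy_summary` for an early and a late checkpoint on the same prompts would show whether `mean_entropy` falls while `deliberation_rate` rises.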