
paper, code, dataset

TL;DR

  • I read this because : vision RL
  • task : replicating R1 in MLLMs
  • problem : R1 replication in MLLMs has been a letdown
  • idea : Let’s work hard to collect GRPO data
  • input/output : {image, Q} -> reasoning, A
  • architecture : InternVL2.5-8B-Instruct (R1 style), InternVL2.5-Pretrained-38B (R1-zero style)
  • objective : RLOO loss
  • baseline : SFT, CoT SFT (MAmmoTH-VL-8B), MPO (MMPR dataset)
  • data : GeoQA-Plus, K12, CLEVR, Geometry3K, MATH, IconQA, M3CoT, DVQA, ScienceQA, ChartQA, AI2D, UniGeo, InfoVQA, GeoS, MapQA
  • evaluation : MathVista, MathVerse, MathVision, OlympiadBench
  • result : improved math performance on average, with the least training data among the compared methods
  • contribution : worked hard and fast
  • etc. :

Details

Dataset

Image

  • chart comprehension: ChartQA, DVQA, …
  • general scientific reasoning: AI2D, ScienceQA, …
  • mathematical reasoning: K12 (proposed), GeoQA

training

  • reward : a format reward (the output must parse as <think>...</think><answer>...</answer>) plus an accuracy reward
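A minimal sketch of such a rule-based reward — the regex, the exact-match check, and the coefficients are my assumptions, not the paper's exact implementation:

```python
import re

def rule_based_reward(response: str, gold_answer: str,
                      format_coeff: float = 0.5) -> float:
    """Format reward: output must parse as <think>...</think><answer>...</answer>.
    Accuracy reward: the extracted answer must match the gold answer."""
    m = re.fullmatch(r"\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*",
                     response, flags=re.DOTALL)
    if m is None:
        return 0.0                      # no format reward, and no answer to check
    answer = m.group(2).strip()
    accuracy = 1.0 if answer == gold_answer.strip() else 0.0
    return format_coeff + accuracy      # format reward + accuracy reward

rule_based_reward("<think>2+2</think><answer>4</answer>", "4")   # 1.5
```

In practice the answer check would be a math-aware verifier rather than a string comparison; string equality is just the simplest stand-in.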

  • loss : the advantage is calculated with RLOO (a leave-one-out baseline over the k rollouts of each prompt)

$$A_i = R(q, o_i) - \frac{1}{k-1} \sum_{j \neq i} R(q, o_j)$$

The loss is the PPO-clip loss:

$$\mathcal{L}_{\text{clip}} = -\,\mathbb{E}_i\!\left[\min\!\left(r_i(\theta)\, A_i,\ \operatorname{clip}\!\left(r_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right) A_i\right)\right], \quad r_i(\theta) = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\text{old}}}(o_i \mid q)}$$

An ablation adds a KL divergence term to the loss:

$$\mathcal{L} = \mathcal{L}_{\text{clip}} + \beta\, \mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)$$
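The RLOO advantage and the PPO-clip term can be sketched as follows (function and variable names are mine):

```python
def rloo_advantages(rewards: list[float]) -> list[float]:
    """RLOO: each rollout's baseline is the mean reward of the *other* k-1
    rollouts for the same prompt."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

def ppo_clip_term(ratio: float, adv: float, eps: float = 0.2) -> float:
    """Per-token PPO-clip objective: the pessimistic min of the unclipped and
    ratio-clipped surrogate (negate and average to get the loss)."""
    clipped_ratio = max(min(ratio, 1 + eps), 1 - eps)   # clip(r, 1-eps, 1+eps)
    return min(ratio * adv, clipped_ratio * adv)

rloo_advantages([1.0, 0.0, 0.0, 1.0])   # [2/3, -2/3, -2/3, 2/3]
```

With a 0/1-style reward, the leave-one-out baseline gives positive advantage to correct rollouts and negative to incorrect ones within each group of 8.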

  • extra hparams
    • rollout bs 128 / training bs 64 (8 rollouts per sample)
    • temperature 1
  • The KL divergence term is excluded from the loss
  • format reward coefficient : 0.5 when starting from the instruct weights (which already follow the format well), 1.0 when starting from the pretrained weights
    • Image

key findings

  • data filtering is crucial : let InternVL2.5-8B-Instruct generate 8 rollouts per question, then remove questions whose accuracy is 0 or 1 (all wrong or all correct). Image

This made a big difference.
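The offline filter can be sketched as follows — `generate` stands in for sampling from InternVL2.5-8B-Instruct, and the exact-match check is a placeholder for the real verifier:

```python
def offline_filter(questions, generate, n_rollouts=8):
    """Keep only questions whose rollout accuracy is strictly between 0 and 1:
    all-wrong questions give no learning signal, all-correct ones no contrast."""
    kept = []
    for q in questions:
        correct = sum(generate(q) == q["answer"] for _ in range(n_rollouts))
        accuracy = correct / n_rollouts
        if 0.0 < accuracy < 1.0:
            kept.append(q)
    return kept
```

Because RLOO normalizes rewards within a group, questions with accuracy 0 or 1 produce all-zero advantages, so dropping them before training mainly saves compute and stabilizes the batch signal.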

  • KL divergence Image

Response length tended to decrease with the KL term, and accuracy differed noticeably between KL on and off, so it was turned off.

  • Visual Aha Moment Image

evaluation

  • K12 : 500 fill-in-the-blank math questions at the middle- to high-school level
  • greedy decoding with temperature 0

Result

  • The learning process Image
Image
  • First of all, it performs better than SFT and MPO, with the exception of MAmmoTH-VL-8B (https://mammoth-vl.github.io/).
  • Compared with SFT at a similar training-data scale, it is clearly better (SFT degrades across the board), and its math average beats MPO, which uses slightly more data. Most of the improvement comes from MathVerse and K12; OlympiadBench scores stay low.
Image
  • Evaluating each benchmark separately, the performance gap between the small and large models is most dramatic on OlympiadBench.
  • MM-Eureka does not do well on MathVista at either scale; I don’t know why.

discussion

What they tried that didn’t work

  • curriculum learning
  • They assigned a difficulty to each K12 sample and sorted the training data by difficulty.
    • Image
  • Curriculum learning tended to make training less stable.
  • I wonder if training got stuck on the easy early-to-middle portion without ever exploring the hard problems.
  • online data filtering
    • Image
  • Filtering out difficulty-{0,1} samples before training (which improved performance above) is offline data filtering; PRIME-style filtering during training is online data filtering.
  • Online filtering is attractive because the surviving data shifts dynamically as the model improves.
    • Image
  • However, online filtering did not perform as well as hoped: the number of samples surviving the filter varies per step, and the resulting variable batch size caused gradient instability.
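One way around the variable-batch-size problem is to keep sampling until a fixed number of questions survive the filter, so every optimizer step sees the same batch size. A sketch of that idea (my own fix-up, not what the paper ran):

```python
def fill_filtered_batch(sample_question, rollout_accuracy, batch_size=64):
    """Online filtering with a fixed effective batch: draw questions until
    `batch_size` of them pass the {0, 1}-accuracy filter."""
    batch = []
    while len(batch) < batch_size:
        q = sample_question()
        acc = rollout_accuracy(q)        # accuracy over the current rollouts
        if 0.0 < acc < 1.0:              # drop all-wrong / all-correct questions
            batch.append(q)
    return batch
```

The cost is extra rollouts on discarded questions, which grows as the model saturates the easy data.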
  • model size
  • R1-zero has been shown to work with small models in text-only settings, but it was not very stable in the multimodal setting.
    • Image