TL;DR
- I read this because : vision RL
- task : MLLM R1 replicate
- Problem : MLLM R1 replications have been underwhelming
- idea : Let’s work hard to collect GRPO data
- input/output : {image, Q} -> reasoning, A
- architecture : InternVL2.5-8B-Instruct (R1 style), InternVL2.5-Pretrained-38B (R1-zero style)
- objective : RLOO loss
- baseline : SFT, CoT SFT(MAmmoTH-VL-8B), MPO(MMPR dataset)
- data : GeoQA-Plus, K12, CLEVR, Geometry3K, MATH, IconQA, M3CoT, DVQA, ScienceQA, ChartQA, AI2D, UniGeo, InfoVQA, GeoS, MapQA
- evaluation : MathVista, MathVerse, MathVision, OlympiadBench
- result : improved math performance on average, with the smallest training-data scale among the compared methods
- contribution : worked hard and fast
- etc. :
Details
Dataset
- chart comprehension: ChartQA, DVQA, …
- General Scientific Reasoning: AI2D, ScienceQA, …
- Mathematical Reasoning: K12(proposed), GeoQA
training
- reward : format reward + accuracy reward
    - format : response must parse as <think>...</think><answer>...</answer>
    - accuracy : rule-based answer matching
- loss
    - advantage calculation is based on RLOO
    - loss is the PPO-clip loss
    - a KL divergence term added to the loss (ablated)
- extra hparams
- rollout bs 128 / training bs 64 (8 rollout per sample)
- temperature 1
- Exclude KL divergence from loss term
- format reward coefficient : 0.5 when starting from instruct weights (the model already follows the format well), 1.0 when starting from pretrained weights
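The reward and RLOO advantage above can be sketched in plain Python (a minimal sketch; the function names, the exact format regex, and exact-string answer matching are my assumptions, not details from the paper):

```python
import re

def reward(response, answer, fmt_coef=0.5):
    """Format reward (coefficient 0.5 from instruct weights, 1.0 from
    pretrained weights) plus a rule-based accuracy reward."""
    fmt_ok = re.fullmatch(
        r"(?s)<think>.*</think>\s*<answer>.*</answer>", response.strip()
    ) is not None
    m = re.search(r"(?s)<answer>(.*)</answer>", response)
    acc_ok = m is not None and m.group(1).strip() == answer.strip()
    return fmt_coef * float(fmt_ok) + float(acc_ok)

def rloo_advantages(rewards):
    """RLOO advantage: each rollout's baseline is the mean reward of the
    other k-1 rollouts sampled for the same prompt."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]
```

With 8 rollouts per sample as above, these advantages then plug into the PPO-clip objective; note that RLOO advantages for one prompt always sum to zero.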
key findings
- data filtering is crucial
Have InternVL2.5-8B-Instruct generate 8 rollouts per question, then drop questions with pass rate 0 or 1 (all wrong / all correct). This made a big difference.
- KL divergence
With the KL term, response length tended to decrease and accuracy differed from the no-KL run, so they turned it off.
- Visual Aha Moment
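The offline data filtering above can be sketched like this (hypothetical `sample_answers` stands in for the 8-rollout generation with InternVL2.5-8B-Instruct; exact-match answer checking is an assumption):

```python
def offline_filter(dataset, sample_answers, k=8):
    """Keep only questions whose empirical pass rate over k rollouts is
    strictly between 0 and 1, i.e. neither unsolvable nor trivially easy."""
    kept = []
    for item in dataset:
        answers = sample_answers(item["question"], k)   # k rollouts per question
        passes = sum(a == item["answer"] for a in answers)
        if 0 < passes < k:                              # drop pass rate 0 or 1
            kept.append(item)
    return kept
```

Run once before training, this yields a fixed filtered set (contrast with the online variant discussed later).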
evaluation
- K12
- 500 fill-in-the-blank math questions at the middle to high school level
- greedy decoding with temperature 0
Result
- The learning process
- First of all, it performs better than the SFT and MPO baselines, with the exception of MAmmoTH-VL-8B (https://mammoth-vl.github.io/).
- At comparable training-data scale it is clearly better than SFT (SFT degrades across the board), and its math average beats MPO, which uses slightly more data. Most of the improvement comes from MathVerse and K12; OlympiadBench stays low.
- Evaluating each benchmark separately, the gap between the small and large models is most dramatic on OlympiadBench.
- MathVista does not improve with MM-Eureka at either scale; the reason is unclear.
discussion
What they tried that didn't work
- curriculum learning
- They assigned difficulty labels to the K12 data and sorted the training data by difficulty.
- Curriculum learning tended to make training less stable.
- Possibly the model got stuck on easy-to-medium problems in the early-to-middle stages without ever exploring the hard ones.
- online data filtering
- Excluding pass-rate {0, 1} samples before training (as above) is offline data filtering; PRIME-style filtering during training is online data filtering.
- Online filtering should, in principle, keep serving informative data as the model improves.
- In practice, online filtering underperformed because the surviving batch size varies at every training step, causing gradient instability.
- model size
- There are examples of successful R1-zero training on small models in text-only settings, but in the multimodal setting it was not very stable.
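The online-filtering instability can be illustrated with a toy sketch (`online_filter` is a hypothetical helper, and pass counts are simulated with random numbers rather than real rollouts): because each step keeps only the prompts the *current* policy partially solves, the effective batch size changes from update to update, which makes gradient magnitudes inconsistent.

```python
import random

def online_filter(batch, passes_fn, k=8):
    """Keep prompts whose current-policy pass count is strictly in (0, k)."""
    return [x for x in batch if 0 < passes_fn(x) < k]

random.seed(0)
batch = list(range(128))            # 128 prompts, matching the rollout batch above
sizes = []
for step in range(5):
    # stand-in for rolling out the current policy 8 times per prompt
    passes = {x: random.randint(0, 8) for x in batch}
    kept = online_filter(batch, lambda x: passes[x])
    sizes.append(len(kept))         # effective batch size this step
# sizes differs across steps, so per-update gradient scale is inconsistent
```

The offline variant avoids this by fixing the kept set once before training, at the cost of the filter going stale as the policy improves.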