TL;DR
- I read this because : vision RL
- task : MLLM R1 replicate
- Problem : MLLM R1 replications have been underwhelming
- idea : Let’s work hard to collect GRPO data
- input/output : {image, Q} -> reasoning, A
- architecture : InternVL2.5-8B-Instruct (R1 style), InternVL2.5-Pretrained-38B (R1-zero style)
- objective : RLOO loss
- baseline : SFT, CoT SFT(MAmmoTH-VL-8B), MPO(MMPR dataset)
- data : GeoQA-Plus, K12, CLEVR, Geometry3K, MATH, IconQA, M3CoT, DVQA, ScienceQA, ChartQA, AI2D, UniGeo, InfoVQA, GeoS, MapQA
- evaluation : MathVista, MathVerse, MathVision, OlympiadBench
- result : improved math performance on average, with the smallest training-data scale among the compared methods
- contribution : worked hard and fast
- etc. :
Details
Dataset
- chart comprehension: ChartQA, DVQA, …
- General Scientific Reasoning: AI2D, ScienceQA, …
- Mathematical Reasoning: K12(proposed), GeoQA
training
- reward : format reward + accuracy reward
    - format : response must parse as <think>...</think><answer>...</answer>
    - accuracy : rule-based answer matching
- loss
    - advantage calculation is based on RLOO
    - loss is the PPO-clip loss
    - a KL divergence term added to the loss (ablated)
- extra hparams
- rollout bs 128 / training bs 64 (8 rollout per sample)
- temperature 1
- Exclude KL divergence from loss term
- format reward coefficient : 0.5 when starting from instruct weights (the model already follows the format well), 1.0 when starting from pretrained weights
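The reward and RLOO advantage above can be sketched in plain Python (a minimal sketch; the function names, the exact format regex, and exact-string answer matching are my assumptions, not details from the paper):

```python
import re

def reward(response, answer, fmt_coef=0.5):
    """Format reward (coefficient 0.5 from instruct weights, 1.0 from
    pretrained weights) plus a rule-based accuracy reward."""
    fmt_ok = re.fullmatch(
        r"(?s)<think>.*</think>\s*<answer>.*</answer>", response.strip()
    ) is not None
    m = re.search(r"(?s)<answer>(.*)</answer>", response)
    acc_ok = m is not None and m.group(1).strip() == answer.strip()
    return fmt_coef * float(fmt_ok) + float(acc_ok)

def rloo_advantages(rewards):
    """RLOO advantage: each rollout's baseline is the mean reward of the
    other k-1 rollouts sampled for the same prompt."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]
```

With 8 rollouts per sample as above, these advantages then plug into the PPO-clip objective; note that RLOO advantages for one prompt always sum to zero.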
key findings
- data filtering is crucial
Have InternVL2.5-8B-Instruct generate 8 rollouts per question, then drop questions with pass rate 0 or 1 (all wrong / all correct). This made a big difference.
- KL divergence
With the KL term, response length tended to decrease and accuracy differed from the no-KL run, so they turned it off.
- Visual Aha Moment
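The offline data filtering above can be sketched like this (hypothetical `sample_answers` stands in for the 8-rollout generation with InternVL2.5-8B-Instruct; exact-match answer checking is an assumption):

```python
def offline_filter(dataset, sample_answers, k=8):
    """Keep only questions whose empirical pass rate over k rollouts is
    strictly between 0 and 1, i.e. neither unsolvable nor trivially easy."""
    kept = []
    for item in dataset:
        answers = sample_answers(item["question"], k)   # k rollouts per question
        passes = sum(a == item["answer"] for a in answers)
        if 0 < passes < k:                              # drop pass rate 0 or 1
            kept.append(item)
    return kept
```

Run once before training, this yields a fixed filtered set (contrast with the online variant discussed later).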
evaluation
- K12
- 500 fill-in-the-blank math questions at the middle to high school level
- greedy decoding with temperature 0
Result
- The learning process
- First of all, it performs better than the SFT and MPO baselines, with the exception of MAmmoTH-VL-8B (https://mammoth-vl.github.io/).
- At comparable training-data scale it is clearly better than SFT (SFT degrades across the board), and its math average beats MPO, which uses slightly more data. Most of the improvement comes from MathVerse and K12; OlympiadBench stays low.
- Evaluating each benchmark separately, the gap between the small and large models is most dramatic on OlympiadBench.
- MathVista does not improve with MM-Eureka at either scale; the reason is unclear.
discussion
What they tried that didn't work
- curriculum learning
- They assigned difficulty labels to the K12 data and sorted the training data by difficulty.
- Curriculum learning tended to make training less stable.
- Possibly the model got stuck on easy-to-medium problems in the early-to-middle stages without ever exploring the hard ones.
- online data filtering
- Excluding pass-rate {0, 1} samples before training (as above) is offline data filtering; PRIME-style filtering during training is online data filtering.
- Online filtering should, in principle, keep serving informative data as the model improves.
- In practice, online filtering underperformed because the surviving batch size varies at every training step, causing gradient instability.
- model size
- There are examples of successful R1-zero training on small models in text-only settings, but in the multimodal setting it was not very stable.
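The online-filtering instability can be illustrated with a toy sketch (`online_filter` is a hypothetical helper, and pass counts are simulated with random numbers rather than real rollouts): because each step keeps only the prompts the *current* policy partially solves, the effective batch size changes from update to update, which makes gradient magnitudes inconsistent.

```python
import random

def online_filter(batch, passes_fn, k=8):
    """Keep prompts whose current-policy pass count is strictly in (0, k)."""
    return [x for x in batch if 0 < passes_fn(x) < k]

random.seed(0)
batch = list(range(128))            # 128 prompts, matching the rollout batch above
sizes = []
for step in range(5):
    # stand-in for rolling out the current policy 8 times per prompt
    passes = {x: random.randint(0, 8) for x in batch}
    kept = online_filter(batch, lambda x: passes[x])
    sizes.append(len(kept))         # effective batch size this step
# sizes differs across steps, so per-update gradient scale is inconsistent
```

The offline variant avoids this by fixing the kept set once before training, at the cost of the filter going stale as the policy improves.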