
paper, dataset, code

TL;DR

  • I read this because: reasoning in LVLMs
  • task: MLLM
  • problem: MLLMs’ CoT ability is poor
  • idea: create CoT preference data + train with DPO
  • architecture: InternVL2-8B
  • objective: DPO loss + CE loss + BCO loss
  • baselines: InternVL2-8B, InternVL2-8B-SFT, DPO variants, Gemini, GPT-4o, LLaVA-1.5-13B, Qwen2-VL-7B, …
  • data: proposed MMPR (3.2M)
  • evaluation: M3CoT, MathVista, MathVision, MM-Vet, LLaVA-Bench, POPE, CRPE, MMHal-Bench
  • result: significantly improved CoT ability and math performance (MathVista 67.0). Claims that preference optimization, rather than SFT alone, was critical for CoT performance.
  • contribution: dataset released; the proposed loss combination also performs well
  • etc.:

Details

  • thumbnail image

MMPR dataset

Two construction pipelines. When a ground-truth answer exists, sampled responses whose answer is correct become chosen and the rest become rejected (correctness-based selection). For questions without a clear answer, the generated responses serve as chosen; the rejected response is made by truncating the chosen one at the halfway point and asking the model to complete the rest without seeing the image, which reportedly induces heavy hallucination (?). They name this Dropout Next Token Prediction (DropoutNTP). 2.5M answered samples // 750K unanswered samples.

  • examples image

  • source image
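The pair construction above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `complete_without_image` is a hypothetical stand-in for a text-only model call, and `endswith` stands in for real answer matching.

```python
def complete_without_image(question, prefix):
    # stand-in for an LLM continuation call made WITHOUT the image input;
    # in practice this is where hallucinated content tends to appear
    return " ...(model continuation without the image)..."

def build_preference_pair(question, responses, answer=None):
    """Build a (chosen, rejected) pair in the spirit of MMPR.

    With a ground-truth answer: correctness-based selection.
    Without one: DropoutNTP-style truncation-and-completion.
    """
    if answer is not None:
        # correctness-based: correct samples are chosen, incorrect rejected
        # (endswith is a simplified stand-in for answer matching)
        correct = [r for r in responses if r.endswith(answer)]
        wrong = [r for r in responses if not r.endswith(answer)]
        if correct and wrong:
            return correct[0], wrong[0]
        return None
    # DropoutNTP: a generated response serves as chosen; the rejected one
    # is its first half completed by the model without the image
    chosen = responses[0]
    half = chosen[: len(chosen) // 2]
    rejected = half + complete_without_image(question, half)
    return chosen, rejected
```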

MPO Loss

A weighted combination: DPO loss (w=0.8) + BCO loss (w=0.2) + SFT loss (w=1.0). (Smaug also observed that DPO alone fails to generate rationales?)

image

  • BCO loss image

BCO jointly trains a binary classifier that labels each response as good or bad, and the shift term δ there is a moving average of past rewards.
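The combined objective can be sketched in plain Python. This is a minimal scalar sketch, assuming the inputs are summed log-probs of each response under the policy and the frozen reference model; the weights follow the note (DPO 0.8, BCO 0.2, SFT 1.0), while `beta` is an assumed value.

```python
import math

def logsigmoid(x):
    # numerically fine for the small values used here
    return -math.log1p(math.exp(-x))

def mpo_loss(pi_c, pi_r, ref_c, ref_r, sft_nll, delta,
             beta=0.1, w_dpo=0.8, w_bco=0.2, w_sft=1.0):
    """Sketch of the MPO objective: weighted DPO + BCO + SFT losses.

    pi_*/ref_*: log-probs of the chosen/rejected response under the
    policy and reference model; sft_nll: cross-entropy on the chosen
    response; delta: moving average of past rewards (the BCO shift).
    """
    r_c = beta * (pi_c - ref_c)  # implicit reward, chosen
    r_r = beta * (pi_r - ref_r)  # implicit reward, rejected

    # DPO: prefer chosen over rejected (pairwise)
    l_dpo = -logsigmoid(r_c - r_r)

    # BCO: classify each response as good/bad against the shift delta
    l_bco = -logsigmoid(r_c - delta) - logsigmoid(-(r_r - delta))

    # SFT: keep generation quality on the chosen responses
    return w_dpo * l_dpo + w_bco * l_bco + w_sft * sft_nll
```

In a real batched implementation the same terms would be computed over token-level log-prob sums per sequence and averaged across the batch.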

Result

image

Significantly improved on CoT and math benchmarks (performance similar to the 76B variant).

  • text benchmarks image

Improved performance, with significant gains on TheoremQA (complex science questions) and IFEval (an instruction-following benchmark). Isn’t there a text CoT bench, though…?

Ablations

  • SFT loss vs MPO image

Adding CoT data to SFT raises both direct and CoT scores overall. Training with MPO improves both significantly, and CoT scores exceed direct scores on all benchmarks.

  • DropNTP vs RLAIF image

The proposed method is simpler than RLAIF and better at reducing hallucination.

  • DPO variants image

All variants performed better than SFT, but using the DPO loss alone did not improve CoT performance over direct answering.