TL;DR
- I read this because.. : reasoning in LVLM
- task : MLLM
- Problem : MLLM’s CoT ability is poor
- Idea : Let’s create CoT preference data + train with DPO
- architecture : InternVL2-8B
- objective : DPO loss + CE loss + BCO loss
- baseline : InternVL2-8B, InternVL2-8B-SFT, DPO variants, Gemini, GPT4o, LLaVA-1.5-13B, Qwen2VL-7B, …
- data : proposed MMPR (3.2M)
- evaluation : M3CoT, MathVista, MathVision, MM-Vet, LLaVA-Bench, POPE, CRPE, MMHal-Bench
- Result : Significantly improved CoT ability and math performance (MathVista 67.0). The authors claim that preference optimization, rather than SFT alone, is critical for CoT performance.
- contribution : Dataset released. The proposed loss combination also performs well
- etc. :
Details
- thumbnail
MMPR dataset
For questions with a ground-truth answer, a generated response is chosen if its answer is correct and rejected otherwise.
For questions without a ground-truth answer, all generated responses are treated as chosen. To build the rejected side, half of each generated response is cut off and the model is asked to generate the rest. This is said to cause a lot of hallucinations. (?) The authors name this DropNTP.
2.5M answered data // 750K unanswered data
examples
source
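The pair construction described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: `build_pairs`, `sample`, `extract_answer`, and `complete` are hypothetical names, truncation is done on characters rather than tokens for simplicity, and pairing chosen with rejected responses via `zip` is an assumption since the notes do not say how pairs are formed.

```python
from typing import Callable, List, Optional, Tuple


def build_pairs(
    question: str,
    ground_truth: Optional[str],
    sample: Callable[[str], str],          # hypothetical: returns one model response
    extract_answer: Callable[[str], str],  # hypothetical: pulls the final answer out
    complete: Callable[[str, str], str],   # hypothetical: continues a truncated response
    n_samples: int = 4,
) -> List[Tuple[str, str]]:
    """Return (chosen, rejected) pairs for a single question."""
    responses = [sample(question) for _ in range(n_samples)]

    if ground_truth is not None:
        # Answered case: correctness decides chosen vs. rejected.
        chosen = [r for r in responses if extract_answer(r) == ground_truth]
        rejected = [r for r in responses if extract_answer(r) != ground_truth]
        # Assumption: pair them up one-to-one; leftovers are dropped.
        return list(zip(chosen, rejected))

    # Unanswered case (DropNTP-style): every response is chosen, and the
    # rejected response is the first half plus a regenerated second half.
    pairs = []
    for response in responses:
        prefix = response[: len(response) // 2]
        pairs.append((response, prefix + complete(question, prefix)))
    return pairs
```

In this sketch the unanswered branch needs no answer checker at all, which is what makes the approach cheap compared with pipelines that score responses with an external judge.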
MPO Loss
Weighted combination of DPO loss (0.8) + BCO loss (0.2) + SFT loss (1). (Smaug also showed that DPO alone doesn’t generate rationales?)
- BCO loss
A binary classifier that separates good from bad responses is trained jointly, and the delta in the loss is a moving average of past rewards.
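The weighted combination above can be sketched in plain Python for a single preference pair. This is an illustrative sketch under stated assumptions, not the authors' code: the function and class names are made up, `beta=0.1` and the EMA momentum of 0.9 are assumed values, and the inputs are summed sequence log-probabilities under the policy and a frozen reference model. Only the weights 0.8 / 0.2 / 1 come from the notes.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


class RewardShift:
    """Exponential moving average of past rewards, used as delta in the BCO term."""

    def __init__(self, momentum: float = 0.9):  # momentum is an assumed value
        self.momentum = momentum
        self.value = 0.0

    def update(self, reward: float) -> float:
        self.value = self.momentum * self.value + (1 - self.momentum) * reward
        return self.value


def mpo_loss(logp_c, logp_r, ref_logp_c, ref_logp_r, n_tokens_c, delta,
             beta=0.1, w_dpo=0.8, w_bco=0.2, w_sft=1.0):
    """Weighted DPO + BCO + SFT loss for one (chosen, rejected) pair."""
    # Implicit rewards: scaled log-ratio between policy and reference.
    r_c = beta * (logp_c - ref_logp_c)
    r_r = beta * (logp_r - ref_logp_r)

    # DPO: push the chosen reward above the rejected reward.
    dpo = -log_sigmoid(r_c - r_r)
    # BCO: classify chosen as good and rejected as bad, relative to delta.
    bco = -log_sigmoid(r_c - delta) - log_sigmoid(-(r_r - delta))
    # SFT: length-normalized negative log-likelihood of the chosen response.
    sft = -logp_c / n_tokens_c

    return w_dpo * dpo + w_bco * bco + w_sft * sft
```

The DPO term only sees the relative gap between the two responses, while the BCO term scores each response on its own against delta, and the SFT term keeps the likelihood of the chosen rationale from drifting down.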
Result
Significant improvements on the CoT and math benchmarks (performance comparable to the 76B variant)
- text benchmarks
Performance improves on text benchmarks too, with significant gains on TheoremQA (complex science questions) and IFEval (an instruction-following benchmark). Isn’t there a text-only CoT bench…?
Ablations
- SFT loss vs MPO
Training on the CoT data with SFT gives an overall increase in both direct and CoT performance. Training with MPO improves both direct and CoT significantly, and CoT scores higher than direct on all benchmarks.
- DropNTP vs RLAIF
The proposed method is simpler and does better on hallucination
- DPO variants
All variants performed better than SFT, but using DPO loss alone did not improve CoT performance over direct answering.