
paper, dataset, code

TL;DR

  • I read this because: reasoning in LVLMs
  • task: MLLM
  • problem: MLLMs’ CoT ability is poor
  • idea: create CoT preference data + train with DPO
  • architecture: InternVL2-8B
  • objective: DPO loss + CE loss + BCO loss
  • baselines: InternVL2-8B, InternVL2-8B-SFT, DPO variants, Gemini, GPT-4o, LLaVA-1.5-13B, Qwen2-VL-7B, …
  • data: proposed MMPR (3.2M)
  • evaluation: M3CoT, MathVista, MathVision, MM-Vet, LLaVA-Bench, POPE, CRPE, MMHal-Bench
  • result: significantly improved CoT ability and math performance (MathVista 67.0). Claims that preference optimization, rather than SFT alone, was critical for CoT performance.
  • contribution: dataset released; the proposed loss combination also performs well
  • etc.:

Details

  • thumbnail image

MMPR dataset

Two construction pipelines. When a ground-truth answer exists, sampled responses whose answer is correct become chosen and the rest become rejected (correctness-based selection). For questions without a clear answer, the generated responses serve as chosen; the rejected response is made by truncating the chosen one at the halfway point and asking the model to complete the rest without seeing the image, which reportedly induces heavy hallucination (?). They name this Dropout Next Token Prediction (DropoutNTP). 2.5M answered samples // 750K unanswered samples.

  • examples image

  • source image
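The pair construction above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `complete_without_image` is a hypothetical stand-in for a text-only model call, and `endswith` stands in for real answer matching.

```python
def complete_without_image(question, prefix):
    # stand-in for an LLM continuation call made WITHOUT the image input;
    # in practice this is where hallucinated content tends to appear
    return " ...(model continuation without the image)..."

def build_preference_pair(question, responses, answer=None):
    """Build a (chosen, rejected) pair in the spirit of MMPR.

    With a ground-truth answer: correctness-based selection.
    Without one: DropoutNTP-style truncation-and-completion.
    """
    if answer is not None:
        # correctness-based: correct samples are chosen, incorrect rejected
        # (endswith is a simplified stand-in for answer matching)
        correct = [r for r in responses if r.endswith(answer)]
        wrong = [r for r in responses if not r.endswith(answer)]
        if correct and wrong:
            return correct[0], wrong[0]
        return None
    # DropoutNTP: a generated response serves as chosen; the rejected one
    # is its first half completed by the model without the image
    chosen = responses[0]
    half = chosen[: len(chosen) // 2]
    rejected = half + complete_without_image(question, half)
    return chosen, rejected
```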

MPO Loss

A weighted combination: DPO loss (w=0.8) + BCO loss (w=0.2) + SFT loss (w=1.0). (Smaug also observed that DPO alone fails to generate rationales?)

image

  • BCO loss image

BCO jointly trains a binary classifier that labels each response as good or bad, and the shift term δ there is a moving average of past rewards.
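The combined objective can be sketched in plain Python. This is a minimal scalar sketch, assuming the inputs are summed log-probs of each response under the policy and the frozen reference model; the weights follow the note (DPO 0.8, BCO 0.2, SFT 1.0), while `beta` is an assumed value.

```python
import math

def logsigmoid(x):
    # numerically fine for the small values used here
    return -math.log1p(math.exp(-x))

def mpo_loss(pi_c, pi_r, ref_c, ref_r, sft_nll, delta,
             beta=0.1, w_dpo=0.8, w_bco=0.2, w_sft=1.0):
    """Sketch of the MPO objective: weighted DPO + BCO + SFT losses.

    pi_*/ref_*: log-probs of the chosen/rejected response under the
    policy and reference model; sft_nll: cross-entropy on the chosen
    response; delta: moving average of past rewards (the BCO shift).
    """
    r_c = beta * (pi_c - ref_c)  # implicit reward, chosen
    r_r = beta * (pi_r - ref_r)  # implicit reward, rejected

    # DPO: prefer chosen over rejected (pairwise)
    l_dpo = -logsigmoid(r_c - r_r)

    # BCO: classify each response as good/bad against the shift delta
    l_bco = -logsigmoid(r_c - delta) - logsigmoid(-(r_r - delta))

    # SFT: keep generation quality on the chosen responses
    return w_dpo * l_dpo + w_bco * l_bco + w_sft * sft_nll
```

In a real batched implementation the same terms would be computed over token-level log-prob sums per sequence and averaged across the batch.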

Result

image

Significantly improved on CoT and math benchmarks (performance similar to the 76B variant).

  • text benchmarks image

Improved performance, with significant gains on TheoremQA (complex science questions) and IFEval (an instruction-following benchmark). Isn’t there a text CoT bench, though…?

Ablations

  • SFT loss vs MPO image

Adding CoT data to SFT raises both direct and CoT scores overall. Training with MPO improves both significantly, and CoT scores exceed direct scores on all benchmarks.

  • DropNTP vs RLAIF image

The proposed method is simpler than RLAIF and better at reducing hallucination.

  • DPO variants image

All variants performed better than SFT, but using the DPO loss alone did not improve CoT performance over direct answering.