image

paper

TL;DR

  • I read this because.. : reasoning ability in VLM
  • task : VLM
  • Problem : Most of the VLM instruction data is short
  • Idea : Create CoT data with GPT4-o
  • architecture : LLaVA-NeXT
  • objective : CE loss -> DPO loss
  • baseline : LLaVA-NeXT, GPT4o, Cambrian, (data) RLAIF
  • data : ShareGPT4-o Reasoning (not yet public)
  • evaluation : A-OKVQA, DocVQA, ChartQA, AI2D, ScienceQA, …
  • result : Consistently high performance across all benchmarks.
  • contribution : Improved benchmark scores with a small dataset. Lots of reasoning-related analysis.

Details

  • motivation image

Data

  • reasoning data distillation image
image image image

Result

image

The training data is organized as above:

  • (1) format: Just enough data to teach the answer format. 50 samples from each of the 9 datasets; 2K for both CoT/direct + LLaVA-pretrain
  • (2) direct data: (1) + the full 193K with direct answers only
  • (3) CoT data: (1) + the 193K with CoT answers + additionally GLLaVA-align/QA
  • (4) CoT SFT: (1) + both the direct and CoT 193K + additionally GLLaVA-align/QA
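
The four configurations above can be written down as a small mixture table, which makes it easier to see what each comparison actually isolates. This is only a sketch of my reading of the list: the component names are labels invented for this note, and treating LLaVA-pretrain as part of (1) is an assumption.

```python
# Component names are illustrative labels for this note,
# not identifiers from the paper or its (unreleased) dataset.
FORMAT_ONLY = ["format_alignment_2k", "llava_pretrain"]  # (1), assumed to include LLaVA-pretrain

MIXTURES = {
    "format": FORMAT_ONLY,                                                   # (1)
    "direct": FORMAT_ONLY + ["direct_193k"],                                 # (2)
    "cot": FORMAT_ONLY + ["cot_193k", "gllava_align_qa"],                    # (3)
    "cot_sft": FORMAT_ONLY + ["direct_193k", "cot_193k", "gllava_align_qa"], # (4)
}


def added_components(base, other):
    """Return the components in mixture `other` that are missing from `base`."""
    base_set = set(MIXTURES[base])
    return [c for c in MIXTURES[other] if c not in base_set]
```

For example, `added_components("format", "direct")` shows that comparing (1) and (2) isolates the effect of the direct-answer data alone.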
image

CAN REASONING BE IMPLICITLY LEARNED FROM DIRECT PREDICTION? – Comparing (1) and (2): when trained only on direct answers, CoT inference showed little or no improvement (MathVista -1.7).
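
To make the direct vs. CoT inference distinction concrete, here is a minimal sketch of how the two modes could differ at evaluation time. The prompt suffixes and the `Answer:` marker are hypothetical choices for illustration; the paper's actual prompts and answer format may differ.

```python
import re

# Hypothetical prompt suffixes, invented for this note.
DIRECT_SUFFIX = "Answer the question with a short answer."
COT_SUFFIX = "Think step by step, then give the final answer after 'Answer:'."


def build_prompt(question, mode):
    """Append the suffix for the chosen inference mode ('direct' or 'cot')."""
    if mode not in ("direct", "cot"):
        raise ValueError(f"unknown mode: {mode}")
    suffix = DIRECT_SUFFIX if mode == "direct" else COT_SUFFIX
    return f"{question}\n{suffix}"


def extract_answer(response, mode):
    """Pull the final answer out of a model response.

    Direct mode: the whole response is the answer.
    CoT mode: take the text after the last 'Answer:' marker and
    fall back to the full response if the marker is missing.
    """
    text = response.strip()
    if mode == "direct":
        return text
    matches = re.findall(r"Answer:\s*(.+)", text)
    return matches[-1].strip() if matches else text
```

The same trained model can then be scored twice on one benchmark, once per mode, which is the kind of comparison the direct and CoT columns report.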

HOW EFFECTIVE IS COT REASONING DATA? – (3) improves performance on calculation-heavy benchmarks such as ChartQA and MathVista, and surprisingly also on text-heavy benchmarks such as TextVQA, DocVQA, and InfoVQA. – (4) Training on both CoT and direct data gave the best average performance. However, TextVQA, DocVQA, and AI2D scored higher with direct prediction, likely because these are fact-extraction-oriented benchmarks.

ABLATION TESTS ON DATA COMPOSITION image

Ablation on math-side data. Text-only SFT was removed because it did not help much.

image

Ablation for science data. Using both together worked well.

Comparison with GPT4o / Cambrian image

ScienceQA performs well on closed set. Could be a train data problem…

DPO Result

image image image
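
Since the objective moves from CE loss to DPO loss, a reminder of what the DPO term computes may help when reading these results. Below is a minimal sketch of the standard DPO loss for a single preference pair in plain Python. It assumes each log-probability is already summed over the response tokens, and `beta=0.1` is an illustrative default, not the paper's setting.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    loss = -log(sigmoid(beta * margin)), where
    margin = (policy - reference log-ratio of the chosen response)
           - (policy - reference log-ratio of the rejected response).
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # -log(sigmoid(x)) == log(1 + exp(-x)); written in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
```

When the policy equals the reference model the margin is zero and the loss is log 2; it falls below that as the policy raises the chosen response's likelihood relative to the rejected one.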

There are more things like BoN, but I’ll organize them later.