TL;DR
- why I read this : reasoning ability in VLMs
- task : VLM
- problem : most VLM instruction data is short (direct answers without reasoning)
- idea : create CoT data with GPT-4o
- architecture : LLaVA-NeXT
- objective : CE loss -> DPO loss
- baseline : LLaVA-NeXT, GPT-4o, Cambrian, (data) RLAIF
- data : ShareGPT-4o-Reasoning (not yet public)
- evaluation : A-OKVQA, DocVQA, ChartQA, AI2D, ScienceQA, …
- result : consistently strong performance across all benchmarks
- contribution : improved benchmark scores with a relatively small dataset; lots of reasoning-related analysis
Details
- motivation
Data
- reasoning data distillation
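The distillation step (turning a question + short answer into a CoT rationale via GPT-4o) could look roughly like this; the function name and prompt wording are my own illustration, not the paper's actual template:

```python
def build_distillation_prompt(question: str, short_answer: str) -> str:
    """Hypothetical prompt for distilling a CoT rationale from GPT-4o.

    The wording here is illustrative; the paper's real template may differ.
    """
    return (
        "You are given a question about an image and its correct short answer.\n"
        "Write the step-by-step reasoning that leads to the answer, "
        "ending with 'The answer is <answer>'.\n\n"
        f"Question: {question}\n"
        f"Short answer: {short_answer}"
    )
```

The generated rationale is then kept as the CoT-format training target, while the original short answer serves as the direct-format target.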
Result
The training data is organized as follows:
- (1) format: just enough data for the model to learn the answer format (not the content). 50 samples from each of the 9 datasets, ~2K examples in both CoT and direct formats, + LLaVA-pretrain
- (2) direct data: (1) + the full 193K examples with the answer given directly (no rationale)
- (3) CoT data: (1) + the 193K examples in CoT format + additionally GLLaVA-align/QA
- (4) CoT SFT: (1) + the 193K examples in both direct and CoT formats + additionally GLLaVA-align/QA
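A minimal sketch of how the same 193K examples could be rendered in the direct vs. CoT formats used by setups (2)-(4); field names and the answer template are assumptions, not the paper's exact format:

```python
def format_example(question: str, rationale: str, answer: str, mode: str) -> dict:
    """Render one VQA example as an SFT (prompt, target) pair.

    mode='direct' keeps only the short answer; mode='cot' prepends the
    distilled rationale. Templates are illustrative.
    """
    if mode == "direct":
        target = answer
    elif mode == "cot":
        target = f"{rationale} The answer is {answer}."
    else:
        raise ValueError(f"unknown mode: {mode}")
    return {"prompt": question, "target": target}
```

Setup (4) would simply include both renderings of each example in the SFT mixture.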
CAN REASONING BE IMPLICITLY LEARNED FROM DIRECT PREDICTION? – Comparing (1) and (2): when trained only on direct answers, CoT inference shows little or no improvement (MathVista -1.7).
HOW EFFECTIVE IS COT REASONING DATA? – (3) brings gains on computation-heavy benchmarks such as ChartQA and MathVista, and surprisingly also on text-heavy benchmarks such as TextVQA, DocVQA, and InfoVQA. – (4) Training on both CoT and direct gives the best average performance, though TextVQA, DocVQA, and AI2D do better with direct; likely because these benchmarks are extraction-oriented (the answer is read off rather than reasoned out).
ABLATION TESTS ON DATA COMPOSITION
Math data ablation: text-only SFT data was dropped because it did not help much.
Science data ablation: the two worked best together.
Comparison of GPT-4o / Cambrian
ScienceQA does well in the closed-set setting; possibly a training-data issue…
DPO Result
There are more results such as Best-of-N (BoN); I'll organize them later.
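For reference, the CE -> DPO switch from the TL;DR amounts to a pairwise preference loss over (chosen, rejected) responses. A minimal sketch for one pair, assuming summed token log-probs under the policy and the frozen SFT reference model:

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    pi_* / ref_* are summed log-probs of the chosen/rejected responses
    under the policy and the frozen reference model; beta scales how far
    the policy may drift from the reference.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)), written via log1p for numerical stability
    return math.log1p(math.exp(-margin))
```

At zero margin the loss is log 2; it decreases as the policy ranks the chosen response further above the rejected one relative to the reference.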