TL;DR
- why I read this : reasoning ability in VLMs
- task : VLM
- problem : most VLM instruction data is short (direct answers without reasoning)
- idea : create CoT data with GPT-4o
- architecture : LLaVA-NeXT
- objective : CE loss -> DPO loss
- baseline : LLaVA-NeXT, GPT-4o, Cambrian, (data) RLAIF
- data : ShareGPT-4o-Reasoning (not yet public)
- evaluation : A-OKVQA, DocVQA, ChartQA, AI2D, ScienceQA, …
- result : consistently strong performance across all benchmarks
- contribution : improved benchmark scores with a relatively small dataset; lots of reasoning-related analysis
Details
- motivation
Data
- reasoning data distillation
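The distillation step (turning a question + short answer into a CoT rationale via GPT-4o) could look roughly like this; the function name and prompt wording are my own illustration, not the paper's actual template:

```python
def build_distillation_prompt(question: str, short_answer: str) -> str:
    """Hypothetical prompt for distilling a CoT rationale from GPT-4o.

    The wording here is illustrative; the paper's real template may differ.
    """
    return (
        "You are given a question about an image and its correct short answer.\n"
        "Write the step-by-step reasoning that leads to the answer, "
        "ending with 'The answer is <answer>'.\n\n"
        f"Question: {question}\n"
        f"Short answer: {short_answer}"
    )
```

The generated rationale is then kept as the CoT-format training target, while the original short answer serves as the direct-format target.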
Result
The training data is organized as follows:
- (1) format: just enough data for the model to learn the answer format (not the content). 50 samples from each of the 9 datasets, ~2K examples in both CoT and direct formats, + LLaVA-pretrain
- (2) direct data: (1) + the full 193K examples with the answer given directly (no rationale)
- (3) CoT data: (1) + the 193K examples in CoT format + additionally GLLaVA-align/QA
- (4) CoT SFT: (1) + the 193K examples in both direct and CoT formats + additionally GLLaVA-align/QA
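A minimal sketch of how the same 193K examples could be rendered in the direct vs. CoT formats used by setups (2)-(4); field names and the answer template are assumptions, not the paper's exact format:

```python
def format_example(question: str, rationale: str, answer: str, mode: str) -> dict:
    """Render one VQA example as an SFT (prompt, target) pair.

    mode='direct' keeps only the short answer; mode='cot' prepends the
    distilled rationale. Templates are illustrative.
    """
    if mode == "direct":
        target = answer
    elif mode == "cot":
        target = f"{rationale} The answer is {answer}."
    else:
        raise ValueError(f"unknown mode: {mode}")
    return {"prompt": question, "target": target}
```

Setup (4) would simply include both renderings of each example in the SFT mixture.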
CAN REASONING BE IMPLICITLY LEARNED FROM DIRECT PREDICTION? – Comparing (1) and (2): when trained only on direct answers, CoT inference shows little or no improvement (MathVista -1.7).
HOW EFFECTIVE IS COT REASONING DATA? – (3) brings gains on computation-heavy benchmarks such as ChartQA and MathVista, and surprisingly also on text-heavy benchmarks such as TextVQA, DocVQA, and InfoVQA. – (4) Training on both CoT and direct gives the best average performance, though TextVQA, DocVQA, and AI2D do better with direct; likely because these benchmarks are extraction-oriented (the answer is read off rather than reasoned out).
ABLATION TESTS ON DATA COMPOSITION
Math data ablation: text-only SFT data was dropped because it did not help much.
Science data ablation: the two worked best together.
Comparison of GPT-4o / Cambrian
ScienceQA does well in the closed-set setting; possibly a training-data issue…
DPO Result
There are more results such as Best-of-N (BoN); I'll organize them later.
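For reference, the CE -> DPO switch from the TL;DR amounts to a pairwise preference loss over (chosen, rejected) responses. A minimal sketch for one pair, assuming summed token log-probs under the policy and the frozen SFT reference model:

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    pi_* / ref_* are summed log-probs of the chosen/rejected responses
    under the policy and the frozen reference model; beta scales how far
    the policy may drift from the reference.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)), written via log1p for numerical stability
    return math.log1p(math.exp(-margin))
```

At zero margin the loss is log 2; it decreases as the policy ranks the chosen response further above the rejected one relative to the reference.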