See the LLaVA note here: https://github.com/long8v/PTIR/issues/128#issue-1749571159
TL;DR
- I read this because.. : this is LLaVA-1.5; ShareGPT4V followed the LLaVA-1.5 recipe
- task : LVLM
- problem : LLaVA is good at reasoning and real-world instruction following, but it performs poorly on academic benchmarks that expect short answers.
- idea : prompt explicitly for short answers, and scale up the data (add VQA) and resolution!
- input/output : image + question -> answer
- architecture : ViT-L/14(336 resolution) + LLaMA 13B
- objective : cross-entropy loss
- baseline : LLaVA, Qwen-VL, Shikra, BLIP-2, IDEFICS, InstructBLIP
- data : (alignment) LCS-558K(LAION-CC-SBU with BLIP caption) / (end-to-end finetuning) LLaVA instruction data + VQA(OKVQA, A-OKVQA), OCR(OCRVQA, TextCaps), region-level VQA(Visual Genome, RefCOCO)
- evaluation : GQA, MME, MM-Vet, VQAv2, VizWiz, SQA, POPE, …
- result : improved by adding VQA data to finetuning; improved by adding a response-format prompt; improved by replacing the linear projector with a 2-layer MLP; improved by increasing resolution; improved by adding varied data such as ShareGPT (which also brings multilingual ability)
- contribution : Remarkable performance with few resources and open data.
- etc. :
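The "2-layer MLP instead of linear" change in the results is just Linear → GELU → Linear between the vision encoder's output and the LLM's embedding space. A toy pure-Python sketch of one visual token passing through such a projector (not the LLaVA code; real projectors are framework linear layers operating on batched tensors):

```python
import math

def gelu(v: float) -> float:
    """tanh approximation of GELU."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def mlp_projector(x, w1, b1, w2, b2):
    """Project one visual feature vector x into the LLM embedding space
    via Linear -> GELU -> Linear (the 2-layer MLP design).
    w1/w2 are weight matrices given as lists of rows; b1/b2 are bias vectors."""
    h = [gelu(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(w1, b1)]
    return [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(w2, b2)]

# Toy dimensions: 2-dim visual feature -> 3-dim hidden -> 4-dim LLM embedding.
x = [1.0, -1.0]
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, 0.0]] * 4
b2 = [0.0] * 4
print(mlp_projector(x, w1, b1, w2, b2))
```

The nonlinearity is the whole point of the change: a single linear layer can only re-mix CLIP features, while the MLP can reshape them before they enter the LLM.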
Details
contribution
Good performance with minimal tuning (training finishes in about one day on 8 A100s, using 1.2M-scale public data)
Dataset
alignment learning : LCS-558K (LAION-CC-SBU with BLIP captions). In between came LLaVA-Lightning, which seems to be a variant for faster convergence. According to https://github.com/haotian-liu/LLaVA/issues/86#issuecomment-1533346022 , it converges faster because it has roughly the same quantity as CC but much broader concept coverage. Still, CC and the BLIP captions seem very different in text form… lol. Isn't that a subtle trick to boost the benchmarks a bit? It's a pity LLaVA-1.5 didn't report performance with CC for comparison; maybe it would have been much lower?
end-to-end finetuning : LLaVA instruction data + VQA (OKVQA, A-OKVQA), OCR (OCRVQA, TextCaps), region-level VQA (Visual Genome, RefCOCO). I didn't realize Visual Genome had VQA annotations…. https://paperswithcode.com/dataset/visual-genome
Improved baseline of LLaVA
- Why did LLaVA perform so poorly on these benchmarks?
VQA expects short answers of one or two words, and LLaVA was not trained that way, so the data needs to be adapted a bit
-> "response formatting prompt"
When adding data like VQAv2, instead of training on "Q: {Question} A: {Answer}", append the prompt "Answer the question using a single word or phrase." after the question. This doubled the benefit of simply putting VQAv2 into the training data, especially on the MME benchmark: 502 -> 1197.
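The response-format prompting above is just string construction at data-building time. A minimal sketch (the function name and dict layout are illustrative, not from the LLaVA codebase):

```python
# The short-answer instruction LLaVA-1.5 appends to VQA-style training samples.
SHORT_ANSWER_PROMPT = "Answer the question using a single word or phrase."

def format_vqa_sample(question: str, answer: str, short_form: bool = True) -> dict:
    """Build one instruction-tuning turn from a VQA (question, answer) pair."""
    user_turn = question
    if short_form:
        # Appending the format instruction tells the model *when* brevity is
        # wanted, so long-form chat ability on other data is preserved.
        user_turn = f"{question}\n{SHORT_ANSWER_PROMPT}"
    return {"user": user_turn, "assistant": answer}

sample = format_vqa_sample("What color is the bus?", "red")
print(sample["user"])
```

Because the instruction is part of the prompt rather than baked into all data, the same model can still answer at length when the prompt is absent.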
Result / Ability
Answers well for unrelated images
Can pull out JSON-formatted answers! (OCR capability)
zs multi-lingual : because the training uses ShareGPT data (https://sharegpt.com/), the model follows multilingual instructions. ShareGPT is a platform where users post their ChatGPT questions and answers, presumably language-only. Notably, on MMBench-CN it actually beats Qwen-VL-Chat, which uses Chinese instruction data (which is weird).
computational cost : 6 hours for pretraining / 20 hours for visual instruction tuning on 8×A100
limitation
- Image sequence length grows with resolution. A Q-Former could replace the raw patch tokens, but it seems slow to converge; how to train a Q-Former efficiently needs study.
- Cannot process multiple images; there is no such training data.
- Capability is still limited to certain target domains
- Hallucination still occurs
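The first limitation above follows directly from patch arithmetic: a ViT-L/14 splits a square image into (resolution / 14)² tokens, so raising the resolution grows the image sequence quadratically. A quick check (the function name is illustrative, and any CLS token is ignored):

```python
def num_image_tokens(resolution: int, patch_size: int = 14) -> int:
    """Number of ViT patch tokens for a square image of the given resolution."""
    assert resolution % patch_size == 0, "resolution must be a multiple of patch size"
    side = resolution // patch_size  # patches per side
    return side * side

print(num_image_tokens(336))  # 576 tokens at LLaVA-1.5's 336px resolution
print(num_image_tokens(672))  # 2304 — doubling the resolution quadruples the count
```

Those 576 tokens are all fed to the LLM per image, which is why a Q-Former (which compresses to a fixed number of queries) is attractive despite its slower convergence.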