See the LLaVA note here: https://github.com/long8v/PTIR/issues/128#issue-1749571159
TL;DR
- I read this because.. : this is LLaVA-1.5; ShareGPT4V followed the LLaVA-1.5 recipe
- task : LVLM
- problem : LLaVA is good at reasoning and real-world instruction following, but it performs poorly on academic benchmarks that expect short answers.
- idea : prompt explicitly for short answers, and scale up the data (add VQA) and resolution!
- input/output : image + question -> answer
- architecture : ViT-L/14(336 resolution) + LLaMA 13B
- objective : cross-entropy loss
- baseline : LLaVA, Qwen-VL, Shikra, BLIP-2, IDEFICS, InstructBLIP
- data : (alignment) LCS-558K(LAION-CC-SBU with BLIP caption) / (end-to-end finetuning) LLaVA instruction data + VQA(OKVQA, A-OKVQA), OCR(OCRVQA, TextCaps), region-level VQA(Visual Genome, RefCOCO)
- evaluation : GQA, MME, MM-Vet, VQAv2, VizWiz, SQA, POPE, …
- result : improved by adding VQA data to finetuning; improved by adding a response-format prompt; improved by replacing the linear projector with a 2-layer MLP; improved by increasing resolution; improved by adding varied data such as ShareGPT (which also brings multilingual ability)
- contribution : Remarkable performance with few resources and open data.
- etc. :
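The "2-layer MLP instead of linear" change in the results is just Linear → GELU → Linear between the vision encoder's output and the LLM's embedding space. A toy pure-Python sketch of one visual token passing through such a projector (not the LLaVA code; real projectors are framework linear layers operating on batched tensors):

```python
import math

def gelu(v: float) -> float:
    """tanh approximation of GELU."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def mlp_projector(x, w1, b1, w2, b2):
    """Project one visual feature vector x into the LLM embedding space
    via Linear -> GELU -> Linear (the 2-layer MLP design).
    w1/w2 are weight matrices given as lists of rows; b1/b2 are bias vectors."""
    h = [gelu(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(w1, b1)]
    return [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(w2, b2)]

# Toy dimensions: 2-dim visual feature -> 3-dim hidden -> 4-dim LLM embedding.
x = [1.0, -1.0]
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, 0.0]] * 4
b2 = [0.0] * 4
print(mlp_projector(x, w1, b1, w2, b2))
```

The nonlinearity is the whole point of the change: a single linear layer can only re-mix CLIP features, while the MLP can reshape them before they enter the LLM.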
Details
contribution
Good performance with minimal tuning (training finishes in about one day on 8 A100s, using 1.2M-scale public data)
Dataset
alignment learning : LCS-558K (LAION-CC-SBU with BLIP captions). In between came LLaVA-Lightning, which seems to be a variant for faster convergence. According to https://github.com/haotian-liu/LLaVA/issues/86#issuecomment-1533346022 , it converges faster because it has roughly the same quantity as CC but much broader concept coverage. Still, CC and the BLIP captions seem very different in text form… lol. Isn't that a subtle trick to boost the benchmarks a bit? It's a pity LLaVA-1.5 didn't report performance with CC for comparison; maybe it would have been much lower?
end-to-end finetuning : LLaVA instruction data + VQA (OKVQA, A-OKVQA), OCR (OCRVQA, TextCaps), region-level VQA (Visual Genome, RefCOCO). I didn't realize Visual Genome had VQA annotations…. https://paperswithcode.com/dataset/visual-genome
Improved baseline of LLaVA
- Why did LLaVA perform so poorly on these benchmarks?
VQA expects short answers of one or two words, and LLaVA was not trained that way, so the data needs to be adapted a bit
-> "response formatting prompt"
When adding data like VQAv2, instead of training on "Q: {Question} A: {Answer}", append the prompt "Answer the question using a single word or phrase." after the question. This doubled the benefit of simply putting VQAv2 into the training data, especially on the MME benchmark: 502 -> 1197.
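The response-format prompting above is just string construction at data-building time. A minimal sketch (the function name and dict layout are illustrative, not from the LLaVA codebase):

```python
# The short-answer instruction LLaVA-1.5 appends to VQA-style training samples.
SHORT_ANSWER_PROMPT = "Answer the question using a single word or phrase."

def format_vqa_sample(question: str, answer: str, short_form: bool = True) -> dict:
    """Build one instruction-tuning turn from a VQA (question, answer) pair."""
    user_turn = question
    if short_form:
        # Appending the format instruction tells the model *when* brevity is
        # wanted, so long-form chat ability on other data is preserved.
        user_turn = f"{question}\n{SHORT_ANSWER_PROMPT}"
    return {"user": user_turn, "assistant": answer}

sample = format_vqa_sample("What color is the bus?", "red")
print(sample["user"])
```

Because the instruction is part of the prompt rather than baked into all data, the same model can still answer at length when the prompt is absent.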
Result / Ability
Answers well for unrelated images
Can pull out JSON-formatted answers! (OCR capability)
zs multi-lingual : because the training uses ShareGPT data (https://sharegpt.com/), the model follows multilingual instructions. ShareGPT is a platform where users post their ChatGPT questions and answers, presumably language-only. Notably, on MMBench-CN it actually beats Qwen-VL-Chat, which uses Chinese instruction data (which is weird).
computational cost : 6 hours for pretraining / 20 hours for visual instruction tuning on 8×A100
limitation
- Image sequence length grows with resolution. A Q-Former could replace the raw patch tokens, but it seems slow to converge; how to train a Q-Former efficiently needs study.
- Cannot process multiple images; there is no such training data.
- Capability is still limited to certain target domains
- Hallucination still occurs
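The first limitation above follows directly from patch arithmetic: a ViT-L/14 splits a square image into (resolution / 14)² tokens, so raising the resolution grows the image sequence quadratically. A quick check (the function name is illustrative, and any CLS token is ignored):

```python
def num_image_tokens(resolution: int, patch_size: int = 14) -> int:
    """Number of ViT patch tokens for a square image of the given resolution."""
    assert resolution % patch_size == 0, "resolution must be a multiple of patch size"
    side = resolution // patch_size  # patches per side
    return side * side

print(num_image_tokens(336))  # 576 tokens at LLaVA-1.5's 336px resolution
print(num_image_tokens(672))  # 2304 — doubling the resolution quadruples the count
```

Those 576 tokens are all fed to the LLM per image, which is why a Q-Former (which compresses to a fixed number of queries) is attractive despite its slower convergence.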