image

paper

see the LLaVA note here: https://github.com/long8v/PTIR/issues/128#issue-1749571159

TL;DR

  • I read this because.. : it is LLaVA-1.5, and ShareGPT4V followed the LLaVA-1.5 recipe
  • task : LVLM
  • problem : LLaVA is good at reasoning and real-world instruction following, but performs poorly on academic benchmarks that require short answers (e.g., VQA).
  • idea : prompt properly for short answers, add VQA-style data, and scale up!
  • input/output : image + question -> answer
  • architecture : CLIP ViT-L/14 (336px resolution) + Vicuna 13B, connected by a 2-layer MLP projector
  • objective : cross-entropy (autoregressive LM) loss
  • baseline : LLaVA, Qwen-VL, Shikra, BLIP-2, IDEFICS, InstructBLIP
  • data : (alignment) LCS-558K(LAION-CC-SBU with BLIP caption) / (end-to-end finetuning) LLaVA instruction data + VQA(OKVQA, A-OKVQA), OCR(OCRVQA, TextCaps), region-level VQA(Visual Genome, RefCOCO)
  • evaluation : VQAv2, GQA, VizWiz, SQA, POPE, MME, MM-Vet, …
  • result : performance improved by adding VQA data during finetuning; by adding the response-format prompt; by using a 2-layer MLP instead of a linear projection; by increasing resolution; and by adding varied data such as ShareGPT (which also brings multilingual ability)
  • contribution : Remarkable performance with few resources and open data.
  • etc. :
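
The architecture above is simple: frozen-ish vision encoder, a small connector, and an LLM. A minimal sketch of the connector (the 2-layer MLP with GELU that replaced LLaVA's single linear projection) is below; the dimensions are the real ones (CLIP ViT-L/14 features are 1024-d, a 13B LLaMA-family model uses 5120-d embeddings), but the random weights and function names are purely illustrative.

```python
import numpy as np

# Illustrative dims: CLIP ViT-L/14 patch features (1024-d) -> LLM token space (5120-d).
D_VISION, D_LLM = 1024, 5120

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.02, size=(D_VISION, D_LLM))
b1 = np.zeros(D_LLM)
W2 = rng.normal(scale=0.02, size=(D_LLM, D_LLM))
b2 = np.zeros(D_LLM)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def project(patch_features):
    """2-layer MLP connector: Linear -> GELU -> Linear."""
    return gelu(patch_features @ W1 + b1) @ W2 + b2

# A 336px image with 14px patches gives (336 / 14)^2 = 576 patch features.
visual_tokens = project(rng.normal(size=(576, D_VISION)))
print(visual_tokens.shape)
```

The projected features are then concatenated with the text embeddings and fed to the LLM as ordinary tokens.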

Details

contribution

image

Good performance with minimal training (about 8 A100-days on 1.2M-scale public data)

Dataset

  • alignment learning: LCS-558K (LAION-CC-SBU with BLIP captions). In between came llava-lightning, which seems to be a variant aimed at faster convergence. According to https://github.com/haotian-liu/LLaVA/issues/86#issuecomment-1533346022 , it converges faster because it has roughly the same quantity as CC but much broader concept coverage. Still, CC and BLIP captions look quite different as text… lol. Isn't this a subtle trick to pick up a bit of benchmark score? It's a pity LLaVA-1.5 didn't report performance with CC for comparison; maybe it would have been much lower?

  • end-to-end finetuning: LLaVA instruction data + VQA (OKVQA, A-OKVQA), OCR (OCRVQA, TextCaps), region-level VQA (Visual Genome, RefCOCO). I didn't realize Visual Genome had VQA… https://paperswithcode.com/dataset/visual-genome image

Improved baseline of LLaVA

  • Why did LLaVA perform so poorly on these benchmarks? VQA requires short answers of one or two words, and LLaVA is not trained that way. The fix is to tweak the data slightly with a "response format prompt": when feeding in something like VQAv2, instead of just "Q: {Question} A: {Answer}", append the prompt "Answer the question using a single word or phrase." after the question. This roughly doubled the gain over simply putting VQAv2 into the training data, especially on the MME benchmark: 502 -> 1197
image
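
The response-format trick amounts to a one-line change in how training examples are templated. A minimal sketch (the function name and chat template are illustrative, not from the paper's code):

```python
# The exact instruction string used by LLaVA-1.5 for short-answer datasets.
FORMAT_PROMPT = "Answer the question using a single word or phrase."

def build_vqa_example(question: str, answer: str, short_answer: bool = True) -> str:
    """Template a VQA training example, optionally with the format prompt."""
    user_turn = question
    if short_answer:
        # This one appended line is the "response format prompt";
        # the paper's ablation credits it for e.g. MME 502 -> 1197.
        user_turn += "\n" + FORMAT_PROMPT
    return f"USER: <image>\n{user_turn}\nASSISTANT: {answer}"

example = build_vqa_example("What color is the bus?", "yellow")
print(example)
```

At inference time the same prompt is appended to any question where a short answer is wanted, so the model's output format can be steered per query instead of being baked in by the data.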

Result / Ability

image (the original LLaVA replied strangely here)
  • Answers well even when the question is unrelated to the image image

  • Can extract JSON from images! (OCR capability) image

  • zs multi-lingual: because it uses data from ShareGPT (https://sharegpt.com/), a platform where users post their ChatGPT conversations (presumably text-only), it follows multilingual instructions. Notably, on MMBench-CN it actually beat Qwen-VL-Chat, which uses Chinese instruction data (which is odd).

  • computational cost: 6 hours for pretraining / 20 hours for visual instruction tuning on 8 A100s image

  • limitation

  1. Image sequence length grows with resolution. A Q-Former would cap it, but seems slow to converge; how to train a Q-Former efficiently needs further study.
  2. Cannot process multiple images; there is no such training data.
  3. Still limited in generalizing beyond its training domains
  4. There is a hallucination
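
Limitation (1) is simple arithmetic: a ViT with patch size p on an r×r image produces (r // p)² visual tokens, so the token count (and thus attention cost) grows quadratically with resolution. A quick back-of-the-envelope check:

```python
def num_visual_tokens(resolution: int, patch_size: int = 14) -> int:
    """Patch-token count for a square image under a ViT with the given patch size."""
    return (resolution // patch_size) ** 2

print(num_visual_tokens(224))  # 256 tokens
print(num_visual_tokens(336))  # 576 tokens (LLaVA-1.5's setting)
print(num_visual_tokens(672))  # 2304 tokens: doubling resolution quadruples the cost
```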
  • details
image