image

paper, page

TL;DR

  • I read this because.. : models trained on data generated with GPT4-V
  • task : VLM
  • problem : instruction data is too noisy
  • idea : collect caption data with GPT4-V! Then train a captioner (ShareCaptioner) on the collected captions so it can generate more data of the same style.
  • input/output : image - (api call) -> GPT4V caption => Learn with LLaVA1.5 style
  • architecture : LLaVA-1.5
  • objective : ce loss
  • baseline : to see the effect of the data, they take the LLaVA-1.5 architecture as-is, change some training details, and do pretraining → finetuning while adding the data to LLaVA-7B / LLaVA-1.5-7B(13B) / Qwen-VL-Chat-7B; SOTA in all cases
  • data : image={LAION-400M, COCO, SBU, SAM, TextCaps}, text={GPT4-V call}
  • evaluation : SEED, VizWiz, VQA-v2, SQA, QBench, MM-Vet, MMBench-CN, MMBench, MME_cog, MME_per, LLaVA-Bench
  • result : sota~
  • contribution : data release. Model release. Emphasizes that data matters more than architecture!!!
  • etc. :
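The ce-loss objective above is the standard next-token cross-entropy over the caption tokens. A minimal pure-Python sketch with a toy vocabulary and made-up probabilities (not the actual LLaVA training code):

```python
import math

def cross_entropy(probs, target_ids):
    """Mean negative log-likelihood of the target (next) tokens.

    probs: per-step probability distributions over a toy vocabulary.
    target_ids: the gold next-token id at each step.
    """
    nll = [-math.log(p[t]) for p, t in zip(probs, target_ids)]
    return sum(nll) / len(nll)

# Toy example: 3 decoding steps, vocabulary of 4 tokens.
probs = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.8, 0.05, 0.05],
    [0.25, 0.25, 0.25, 0.25],
]
loss = cross_entropy(probs, [0, 1, 2])  # ≈ 0.655
```

In the real setup this loss is computed only over the answer/caption tokens, with image and prompt tokens masked out.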

Details

  • thumbnail (caption example / performance) image

  • caption style / error image

Data

  • dataset statistics image

etc: SAM, TextCaps, WikiArt + 1K images from web-crawled data (split evenly between landmark images and celebrity images). (scratched)

  • data collection image
image

Different types of data are prompted differently image

They collect 100K captions this way
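The per-source prompting can be sketched as a simple template lookup. The source keys and hint wording below are my own guesses for illustration, not the paper's exact prompts:

```python
# Hypothetical sketch of data-type-specific prompting for GPT4-V captioning.
# The paper varies the prompt by image source; the wording here is invented.
BASE = "Describe this image in detail."
SOURCE_HINTS = {
    "coco": "Focus on the objects and their spatial relations.",
    "textcaps": "Transcribe any text visible in the image.",
    "wikiart": "Mention the artistic style, medium, and likely period.",
    "landmark": "Name the landmark if you recognize it and add background knowledge.",
}

def build_prompt(source: str) -> str:
    """Combine the base instruction with a source-specific hint, if any."""
    hint = SOURCE_HINTS.get(source, "")
    return (BASE + " " + hint).strip()
```

Each prompt would then be sent along with the image in a GPT4-V API call to produce one detailed caption.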

  • ShareGPT4V-PT : train a separate model called ShareCaptioner and use it to build a 1.2M-caption dataset. Said to have taken 44 A100 GPU-days. There is no information about the model — I wonder if it is the same as the ShareGPT4V-7B model? No information about further refinement either.

The data we used image

Human evaluation of the 3 caption sources image

more analysis image

image
  • Performance improvement from adding this dataset image

For a fair comparison, they removed the 100K of "detailed caption" data from each baseline's original training recipe and put this data in instead

ShareGPT4V-7B model

  • LLaVA-1.5
  • ViT-L/14 336x336 / Vicuna-v1.5 7B
  • training
    • pretraining: w/ ShareGPT4V-PT
      • image encoder (only the latter half is trained) + projector + LLM are all finetuned
      • bs 256 / 4700 steps
    • supervised finetuning:
      • LLaVA's SFT data contains 23K detailed captions, which are replaced with samples from ShareGPT4V
      • vision encoder frozen / projector and LLM finetuned
image
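The "only the latter half" setup amounts to unfreezing just the last half of the ViT blocks during pretraining. A toy sketch of that index selection (in the real code this would set `requires_grad` on the CLIP ViT-L/14 blocks; the function names here are mine):

```python
def trainable_vit_blocks(n_blocks: int = 24):
    """Block indices that stay trainable: the latter half only."""
    return list(range(n_blocks // 2, n_blocks))

def requires_grad_map(n_blocks: int = 24):
    """Toy map: block index -> whether it is trained during pretraining."""
    unfrozen = set(trainable_vit_blocks(n_blocks))
    return {i: (i in unfrozen) for i in range(n_blocks)}
```

For ViT-L/14 (24 blocks) this trains blocks 12–23 and keeps 0–11 frozen, which is what the latter-half ablation below varies.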

Ablations

The effect of training with each piece of data

image

The effect of training only the latter half of the vision encoder image

image