TL;DR
- I read this because : models trained on data generated with GPT4-V
- task : VLM
- problem : instruction data is too noisy
- idea : collect caption data with GPT4-V; later, train a captioner (ShareCaptioner) on the resulting data to scale it up
- input/output : image -> (API call) -> GPT4-V caption => train LLaVA-1.5 style
- architecture : LLaVA-1.5
- objective : cross-entropy loss
- baseline : to isolate the effect of the data, they add it to LLaVA-7B / LLaVA-1.5-7B(13B) / Qwen-VL-Chat-7B; their own model keeps the LLaVA-1.5 architecture as-is, changing only some pretraining/finetuning details. SOTA in all cases
- data : image={LAION-400M, COCO, SBU, SAM, TextCaps}, text={GPT4-V call}
- evaluation : SEED, VizWiz, VQA-v2, SQA, QBench, MM-Vet, MMBench-CN, MMBench, MME_cog, MME_per, LLaVA-Bench
- result : SOTA across benchmarks
- contribution : releases the data and the model; stresses that data matters more than architecture!!!
- etc. :
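The image -> GPT4-V caption -> LLaVA-style training pipeline above can be sketched as follows. This is a minimal sketch: the "conversations"/"from"/"value" fields follow the public LLaVA data format, but the instruction text and IDs here are illustrative assumptions, not the paper's exact schema.

```python
# Sketch: wrap one GPT4-V caption into a LLaVA-1.5-style training record.
def to_llava_record(image_path: str, caption: str, sample_id: str) -> dict:
    """Turn one (image, GPT4-V caption) pair into a supervised sample."""
    return {
        "id": sample_id,
        "image": image_path,
        "conversations": [
            # Human turn: the <image> placeholder token plus an instruction.
            {"from": "human", "value": "<image>\nDescribe this image in detail."},
            # GPT turn: the GPT4-V caption; CE loss is computed on this text.
            {"from": "gpt", "value": caption},
        ],
    }

record = to_llava_record("coco/000000001.jpg",
                         "A dog chasing a frisbee in a park.",
                         "share-0001")
```

The record is then consumed by the standard LLaVA training code as if it were human-written instruction data.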
Details
thumbnail (caption example / performance)
caption style / error
Data
- dataset statistics
etc: SAM, TextCaps, WikiArt + 1K web-crawled images (split evenly between landmarks and celebrities)
- data collection
Different types of data are prompted differently
~100K captions collected this way
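The per-source prompting could look like the sketch below. The prompt strings and source keys are my paraphrase, not the paper's actual prompts, and the payload shape follows the OpenAI-style vision chat format as an assumption.

```python
# Sketch of data-type-specific prompting: each image source gets its own
# GPT4-V instruction. Prompts are illustrative, not the paper's wording.
PROMPTS = {
    "coco":     "Describe the image in detail, covering objects and their relations.",
    "textcaps": "Describe the image and transcribe any text that appears in it.",
    "wikiart":  "Describe the artwork, including style, medium, and likely artist.",
    "landmark": "Describe the scene and identify the landmark if possible.",
}

def build_request(source: str, image_url: str) -> dict:
    """Assemble one GPT4-V-style API payload for an image from a given source."""
    prompt = PROMPTS.get(source, PROMPTS["coco"])  # fall back to generic prompt
    return {
        "model": "gpt-4-vision-preview",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }
```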
- ShareGPT4V-PT : they train a separate model, ShareCaptioner, to generate a 1.2M-caption dataset; reportedly took 44 A100 GPU-days. There is no detail about this model, so I wonder if it is the same as the ShareGPT4V-7B model? No information about further refinement either.
The data we used
Human evaluation of 3
more analysis
- Model performance improvements from this dataset
For a fair comparison, they subtract the 100K "detailed caption" portion from each baseline's original data recipe and put this data in its place
ShareGPT4V-7B model
- LLaVA-1.5
- ViT-L/14 336x336 / Vicuna-v1.5 7B
- training
- pretraining:
- w/ ShareGPT4V-PT
- image encoder (only the latter half is trained) + projector + LLM are all finetuned
- bs 256 / 4700 steps
- supervised finetuning:
- LLaVA's SFT data contains 23K detailed captions; these are replaced with samples from ShareGPT4V
- vision encoder frozen / projector and LLM finetuned
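The partial-unfreeze scheme for pretraining (train only the latter half of the ViT blocks, plus the projector and LLM) can be sketched framework-agnostically. The module names below are hypothetical, not the actual LLaVA-1.5 parameter names; ViT-L has 24 blocks, so the latter half is blocks 12-23.

```python
# Sketch: decide which parameters receive gradients in ShareGPT4V-PT
# pretraining. Only the latter half of the vision-encoder blocks is
# unfrozen; projector and LLM are fully trained. Names are illustrative.
def trainable(name: str, num_vit_blocks: int = 24) -> bool:
    """Return True if a parameter with this (hypothetical) name is trained."""
    if name.startswith("vision_encoder.blocks."):
        block_idx = int(name.split(".")[2])
        return block_idx >= num_vit_blocks // 2  # latter half only: 12-23 for ViT-L
    # projector and LLM parameters are always trained in this stage;
    # everything else in the vision tower (e.g. patch embedding) stays frozen
    return name.startswith(("projector.", "llm."))
```

In an actual PyTorch setup this predicate would drive `param.requires_grad = trainable(name)` over `model.named_parameters()`.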
Ablations
The effect of training with each piece of data
The effect of training only the latter half of the vision encoder