TL;DR
- I read this because : models trained on data generated with GPT4-V
- task : VLM
- problem : instruction data is too noisy
- idea : collect caption data with GPT4-V; later, train a captioner (ShareCaptioner) on the resulting data to scale it up
- input/output : image -> (API call) -> GPT4-V caption => train LLaVA-1.5 style
- architecture : LLaVA-1.5
- objective : cross-entropy loss
- baseline : to isolate the effect of the data, they add it to LLaVA-7B / LLaVA-1.5-7B(13B) / Qwen-VL-Chat-7B; their own model keeps the LLaVA-1.5 architecture as-is, changing only some pretraining/finetuning details. SOTA in all cases
- data : image={LAION-400M, COCO, SBU, SAM, TextCaps}, text={GPT4-V call}
- evaluation : SEED, VizWiz, VQA-v2, SQA, QBench, MM-Vet, MMBench-CN, MMBench, MME_cog, MME_per, LLaVA-Bench
- result : SOTA across benchmarks
- contribution : releases the data and the model; stresses that data matters more than architecture!!!
- etc. :
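The image -> GPT4-V caption -> LLaVA-style training pipeline above can be sketched as follows. This is a minimal sketch: the "conversations"/"from"/"value" fields follow the public LLaVA data format, but the instruction text and IDs here are illustrative assumptions, not the paper's exact schema.

```python
# Sketch: wrap one GPT4-V caption into a LLaVA-1.5-style training record.
def to_llava_record(image_path: str, caption: str, sample_id: str) -> dict:
    """Turn one (image, GPT4-V caption) pair into a supervised sample."""
    return {
        "id": sample_id,
        "image": image_path,
        "conversations": [
            # Human turn: the <image> placeholder token plus an instruction.
            {"from": "human", "value": "<image>\nDescribe this image in detail."},
            # GPT turn: the GPT4-V caption; CE loss is computed on this text.
            {"from": "gpt", "value": caption},
        ],
    }

record = to_llava_record("coco/000000001.jpg",
                         "A dog chasing a frisbee in a park.",
                         "share-0001")
```

The record is then consumed by the standard LLaVA training code as if it were human-written instruction data.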
Details
thumbnail (caption example / performance)
caption style / error
Data
- dataset statistics
etc: SAM, TextCaps, WikiArt + 1K web-crawled images (split evenly between landmarks and celebrities)
- data collection
Different types of data are prompted differently
~100K captions collected this way
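The per-source prompting could look like the sketch below. The prompt strings and source keys are my paraphrase, not the paper's actual prompts, and the payload shape follows the OpenAI-style vision chat format as an assumption.

```python
# Sketch of data-type-specific prompting: each image source gets its own
# GPT4-V instruction. Prompts are illustrative, not the paper's wording.
PROMPTS = {
    "coco":     "Describe the image in detail, covering objects and their relations.",
    "textcaps": "Describe the image and transcribe any text that appears in it.",
    "wikiart":  "Describe the artwork, including style, medium, and likely artist.",
    "landmark": "Describe the scene and identify the landmark if possible.",
}

def build_request(source: str, image_url: str) -> dict:
    """Assemble one GPT4-V-style API payload for an image from a given source."""
    prompt = PROMPTS.get(source, PROMPTS["coco"])  # fall back to generic prompt
    return {
        "model": "gpt-4-vision-preview",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    }
```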
- ShareGPT4V-PT : they train a separate model, ShareCaptioner, to generate a 1.2M-caption dataset; reportedly took 44 A100 GPU-days. There is no detail about this model, so I wonder if it is the same as the ShareGPT4V-7B model? No information about further refinement either.
The data we used
Human evaluation of 3
more analysis
- Model performance improvements from this dataset
For a fair comparison, they subtract the 100K "detailed caption" portion from each baseline's original data recipe and put this data in its place
ShareGPT4V-7B model
- LLaVA-1.5
- ViT-L/14 336x336 / Vicuna-v1.5 7B
- training
- pretraining:
- w/ ShareGPT4V-PT
- image encoder (only the latter half is trained) + projector + LLM are all finetuned
- bs 256 / 4700 steps
- supervised finetuning:
- LLaVA's SFT data contains 23K detailed captions; these are replaced with samples from ShareGPT4V
- vision encoder frozen / projector and LLM finetuned
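The partial-unfreeze scheme for pretraining (train only the latter half of the ViT blocks, plus the projector and LLM) can be sketched framework-agnostically. The module names below are hypothetical, not the actual LLaVA-1.5 parameter names; ViT-L has 24 blocks, so the latter half is blocks 12-23.

```python
# Sketch: decide which parameters receive gradients in ShareGPT4V-PT
# pretraining. Only the latter half of the vision-encoder blocks is
# unfrozen; projector and LLM are fully trained. Names are illustrative.
def trainable(name: str, num_vit_blocks: int = 24) -> bool:
    """Return True if a parameter with this (hypothetical) name is trained."""
    if name.startswith("vision_encoder.blocks."):
        block_idx = int(name.split(".")[2])
        return block_idx >= num_vit_blocks // 2  # latter half only: 12-23 for ViT-L
    # projector and LLM parameters are always trained in this stage;
    # everything else in the vision tower (e.g. patch embedding) stays frozen
    return name.startswith(("projector.", "llm."))
```

In an actual PyTorch setup this predicate would drive `param.requires_grad = trainable(name)` over `model.named_parameters()`.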
Ablations
The effect of training with each piece of data
The effect of training only the latter half of the vision encoder