TL;DR
- I read this because : it's a very recent VLM
- task : VLM + LLM, i.e., do vision-language tasks without giving up text-only LLM ability
- problem : most multi-modal work freezes the LLM and only aims to do V+L tasks well; here the goal is for both the V/L side and the pure-text side to do well.
- idea : Overall BLIP-2 style, with the difference that the LLM has different $W_K$, $W_V$, and Norm for each modality, and the LLM is tuned accordingly.
- input/output : text + image -> text
- architecture : CLIP ViT-L/14 + vision abstractor(=Q-former) + LLaMA-2 w/ Modality-Adaptive Module(MAM)
- objective : ce loss
- baseline : models built on 7B LLMs: BLIP-2, MiniGPT-4, LLaVA, mPLUG-Owl, InstructBLIP, Otter, Qwen-VL-Chat, LLaVA-1.5
- data : 400M samples from {CC3/12M, COCO, COYO, LAION-en, DataComp} for pretraining / {captioning(TextCaps, COCO), VQA(VQAv2, OKVQA, OCR-VQA, GQA, A-OKVQA), region-aware(RefCOCO, VisualGenome), multi-modal instruction(LLaVA-Instruct-150K), text-only instruction data(ShareGPT-80K, SlimOrca)}
- evaluation : caption / vqa / multimodal benchmark(MME, MMBench, MM-Vet, SEED-Bench, Q-Bench) / text benchmark(MMLU, BBH, AGIEval, ARC-c, ARC-e)
- Result : SOTA on almost all benchmarks among 7B models. Thanks to the added text instruction data + MAM, performance improves over LLaMA-2 even on pure-text benchmarks.
- contribution : a VLM that doesn't just preserve but actually improves text performance?
- etc. : alibaba money is good….
Details
Architecture
- The vision abstractor is, in the end, just a Q-Former
- The Modality-Adaptive Module uses different key/value weights and norms depending on the modality of the input, while the query weight is shared across modalities. The image-side $W$ here is newly initialized, so it gets learned in stage-1 pretraining.
- There are two learning phases
- For pre-training, {CC3/12M, COCO, COYO, LAION-en, DataComp} are used to train the vision encoder / Q-Former / the newly initialized parts of the language decoder. An interesting contrast with BLIP-2: BLIP-2 takes a CLIP ViT and freezes the vision encoder, and its images are re-captioned data (CapFilt) from a similar source. Here the vision encoder is not frozen and the relatively noisy alt-text is used as-is! In a way, the kind of data CLIP saw is being re-trained in a generative form.
- In joint instruction tuning, everything is unfrozen and trained only on instruction data. The difference from prior work is that text-only instruction data is added as well.
The two stages also differ in image resolution / LLM sequence length
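The modality-adaptive attention described above (shared query projection, per-modality key/value projections and norms) can be sketched roughly as below. This is a minimal single-head numpy illustration, not the paper's implementation: the function name, the dict-based weight layout, and the single un-parameterized LayerNorm (the paper keeps separate learned norms per modality) are all my own simplifications.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # per-token LayerNorm, no learned affine for brevity
    # (the paper uses a separate learned norm per modality)
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mam_attention(h, is_image, W_q, W_k, W_v):
    """Single-head modality-adaptive attention sketch.

    h        : (T, d) hidden states with text and image tokens interleaved
    is_image : (T,) bool mask, True where the token is an image token
    W_q      : (d, d) query projection, shared across modalities
    W_k, W_v : dicts {'text': (d, d), 'image': (d, d)}, per-modality weights
    """
    m = is_image[:, None]
    x = layer_norm(h)
    q = x @ W_q                                          # shared for both modalities
    k = np.where(m, x @ W_k['image'], x @ W_k['text'])   # modality-specific keys
    v = np.where(m, x @ W_v['image'], x @ W_v['text'])   # modality-specific values
    scores = q @ k.T / np.sqrt(h.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))   # stable softmax
    w /= w.sum(-1, keepdims=True)
    return w @ v
```

Since attention still mixes all tokens after the per-modality projections, text and image tokens interact as usual; only the projection parameters differ by modality.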
Result
caption, VQA / multi-modal benchmark
pure text benchmark
The paper attributes this to MAM
- Effect of using two modalities for instruction data + effect of MAM
Text-only instruction data hurts multi-modal performance, multi-modal instruction data hurts text performance, and using both leaves each slightly worse than either alone; adding MAM improves both
vision encoder freeze effect
num queries
Text VQA in particular requires many queries
- resolution
textVQA in particular improves overwhelmingly with higher resolution lol
Qualitative Result
The claim is that, thanks to MAM, the model attends to text in early layers and to images in later layers -> but what's the takeaway?
Given an unrelated image and text, the model with MAM is shown to focus on the text. I think both models are wrong either way, but the one with MAM at least says 7 lol