image

paper

TL;DR

  • I read this because : it's a very recent VLM model
  • task : VLM + LLM
  • problem : multi-modal work usually freezes the LLM and only tries to do V+L well, but I want both V and L to do well.
  • idea : Overall BLIP-2 style, with the difference that the LLM has different $W_K$, $W_V$, and Norm for each modality, and the LLM is tuned accordingly.
  • input/output : text + image -> text
  • architecture : CLIP ViT-L/14 + vision abstractor(=Q-former) + LLaMA-2 w/ Modality-Adaptive Module(MAM)
  • objective : ce loss
  • baseline : models based on 7B LLMs: BLIP-2, MiniGPT-4, LLaVA, mPLUG-Owl, InstructBLIP, Otter, Qwen-VL-Chat, LLaVA-1.5
  • data : 400M samples from {CC3/12M, COCO, COYO, LAION-en, DataComp} for pretraining / {captioning (TextCaps, COCO), VQA (VQAv2, OKVQA, OCR-VQA, GQA, A-OKVQA), region-aware (RefCOCO, VisualGenome), multi-modal instruction (LLaVA-Instruct-150K), text-only instruction data (ShareGPT-80K, SlimOrca)}
  • evaluation : caption / vqa / multimodal benchmark(MME, MMBench, MM-Vet, SEED-Bench, Q-Bench) / text benchmark(MMLU, BBH, AGIEval, ARC-c, ARC-e)
  • Result : SOTA on almost all benchmarks among 7B models. Text-only instructions are also used + MAM improves performance over LLaMA-2 even on pure text benchmarks.
  • contribution : a VLM that also improves text performance?
  • etc. : alibaba money is good….

Details

image

Architecture

image
  • The Visual Abstractor is essentially a Q-Former
  • The Modality-Adaptive Module uses different key/value weights and norms depending on the modality of each input token, while the query weight is shared. The image-side weights are newly initialized, so they are learned during stage-1 pretraining.
  • There are two training stages
  1. Pretraining: use {CC3/12M, COCO, COYO, LAION-en, DataComp} to train the vision encoder / Q-Former / the newly initialized parts of the language decoder. An interesting contrast with BLIP-2: BLIP-2 takes CLIP ViT and freezes the vision encoder, and trains on re-captioned data (CapFilt) from similar sources. Here the vision encoder is not frozen, and the relatively noisy alt-text is used as-is! In a sense, the kind of data CLIP saw is being retrained in a generative form.
  2. Joint instruction tuning: unfreeze everything and train only on instruction data. The difference from prior work is that text-only instruction data is added as well.
image
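The Visual Abstractor above can be sketched as a fixed set of learnable queries cross-attending to ViT patch features, compressing them into a short token sequence for the LLM. A minimal, hypothetical PyTorch sketch (dims, layer counts, and names are illustrative, not taken from the paper; the real module also has self-attention and FFN sublayers):

```python
import torch
import torch.nn as nn

class VisualAbstractor(nn.Module):
    """Hypothetical Q-Former-style abstractor sketch: learnable queries
    cross-attend to ViT patch features and get projected to the LLM dim."""

    def __init__(self, vit_dim=1024, llm_dim=4096, n_queries=64, n_layers=6, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, vit_dim) * 0.02)
        self.xattn = nn.ModuleList(
            nn.MultiheadAttention(vit_dim, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.proj = nn.Linear(vit_dim, llm_dim)  # map into the LLM embedding space

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, n_patches, vit_dim), e.g. from CLIP ViT-L/14
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        for attn in self.xattn:
            out, _ = attn(q, patch_feats, patch_feats)  # queries attend to patches
            q = q + out  # residual; self-attention and FFN omitted for brevity
        return self.proj(q)  # (B, n_queries, llm_dim)
```

The point of the compression: however many patches the ViT emits, the LLM only ever sees `n_queries` visual tokens.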

The differences between the two stages are the image resolution and the LLM sequence length
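The Modality-Adaptive Module itself can be sketched as an attention layer with a shared query projection but modality-specific key/value projections and LayerNorms. A minimal, hypothetical PyTorch sketch (all names and shapes are my own simplification, not the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MAMAttention(nn.Module):
    """Hypothetical sketch of one Modality-Adaptive attention layer:
    W_Q is shared, while W_K, W_V, and LayerNorm differ per modality."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.q_proj = nn.Linear(dim, dim)  # shared query projection
        # separate key/value projections and norms for text vs. image tokens
        self.k_proj = nn.ModuleDict({m: nn.Linear(dim, dim) for m in ("text", "image")})
        self.v_proj = nn.ModuleDict({m: nn.Linear(dim, dim) for m in ("text", "image")})
        self.norm = nn.ModuleDict({m: nn.LayerNorm(dim) for m in ("text", "image")})
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) mixed-modality sequence; image_mask: (B, T) bool,
        # True where the token came from the visual abstractor
        B, T, D = x.shape
        h = torch.empty_like(x)
        k = torch.empty_like(x)
        v = torch.empty_like(x)
        for name, sel in (("text", ~image_mask), ("image", image_mask)):
            hm = self.norm[name](x[sel])    # modality-specific norm
            h[sel] = hm
            k[sel] = self.k_proj[name](hm)  # modality-specific W_K
            v[sel] = self.v_proj[name](hm)  # modality-specific W_V
        q = self.q_proj(h)                  # shared W_Q across modalities

        def heads(t):  # (B, T, D) -> (B, n_heads, T, D // n_heads)
            return t.view(B, T, self.n_heads, -1).transpose(1, 2)

        out = F.scaled_dot_product_attention(heads(q), heads(k), heads(v), is_causal=True)
        return x + self.out_proj(out.transpose(1, 2).reshape(B, T, D))
```

The design point: tokens of each modality get their own K/V projections and norm (a small per-layer parameter cost), yet all tokens still attend to each other in one shared sequence via the common query space.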

Result

  • caption, VQA / multi-modal benchmark image

  • pure text benchmark image

They attribute this to MAM image

  • Effect of using two modalities for instruction data + effect of MAM
image

Using only text instruction data hurts multi-modal performance, using only multi-modal instruction data hurts text performance, and using both leaves each slightly below its single-modality best + adding MAM improves both

  • vision encoder freeze effect image

  • num queries image

TextVQA in particular seems to need a large number of queries

  • resolution image

TextVQA benefits overwhelmingly from higher resolution lol

Qualitative Result

image

They claim that, thanks to MAM, the model looks at text in early layers and at images in later layers -> so what?

image

Given an unrelated image and text, the model with MAM says it focused on the text. I think both answers are wrong, but at least the one with MAM says 7 lol