image

paper

TL;DR

  • I read this because : it's a very recent VLM model
  • task : VLM + LLM
  • problem : multi-modal work usually freezes the LLM and only tries to do V+L well, but I want both V and L to do well.
  • idea : Overall BLIP-2 style, with the difference that the LLM has different $W_K$, $W_V$, and Norm for each modality, and the LLM is tuned accordingly.
  • input/output : text + image -> text
  • architecture : CLIP ViT-L/14 + vision abstractor(=Q-former) + LLaMA-2 w/ Modality-Adaptive Module(MAM)
  • objective : ce loss
  • baseline : models based on 7B LLMs: BLIP-2, MiniGPT-4, LLaVA, mPLUG-Owl, InstructBLIP, Otter, Qwen-VL-Chat, LLaVA-1.5
  • data : 400M samples from {CC3/12M, COCO, COYO, LAION-en, DataComp} for pretraining / {captioning (TextCaps, COCO), VQA (VQAv2, OKVQA, OCR-VQA, GQA, A-OKVQA), region-aware (RefCOCO, VisualGenome), multi-modal instruction (LLaVA-Instruct-150K), text-only instruction data (ShareGPT-80K, SlimOrca)}
  • evaluation : caption / vqa / multimodal benchmark(MME, MMBench, MM-Vet, SEED-Bench, Q-Bench) / text benchmark(MMLU, BBH, AGIEval, ARC-c, ARC-e)
  • Result : SOTA on almost all benchmarks among 7B models. Text-only instructions are also used + MAM improves performance over LLaMA-2 even on pure text benchmarks.
  • contribution : a VLM that also improves text performance?
  • etc. : alibaba money is good….

Details

image

Architecture

image
  • The Visual Abstractor is essentially a Q-Former
  • The Modality-Adaptive Module uses different key/value weights and norms depending on the modality of each input token, while the query weight is shared. The image-side weights are newly initialized, so they are learned during stage-1 pretraining.
  • There are two training stages
  1. Pretraining: use {CC3/12M, COCO, COYO, LAION-en, DataComp} to train the vision encoder / Q-Former / the newly initialized parts of the language decoder. An interesting contrast with BLIP-2: BLIP-2 takes CLIP ViT and freezes the vision encoder, and trains on re-captioned data (CapFilt) from similar sources. Here the vision encoder is not frozen, and the relatively noisy alt-text is used as-is! In a sense, the kind of data CLIP saw is being retrained in a generative form.
  2. Joint instruction tuning: unfreeze everything and train only on instruction data. The difference from prior work is that text-only instruction data is added as well.
image
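The Visual Abstractor above can be sketched as a fixed set of learnable queries cross-attending to ViT patch features, compressing them into a short token sequence for the LLM. A minimal, hypothetical PyTorch sketch (dims, layer counts, and names are illustrative, not taken from the paper; the real module also has self-attention and FFN sublayers):

```python
import torch
import torch.nn as nn

class VisualAbstractor(nn.Module):
    """Hypothetical Q-Former-style abstractor sketch: learnable queries
    cross-attend to ViT patch features and get projected to the LLM dim."""

    def __init__(self, vit_dim=1024, llm_dim=4096, n_queries=64, n_layers=6, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, vit_dim) * 0.02)
        self.xattn = nn.ModuleList(
            nn.MultiheadAttention(vit_dim, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.proj = nn.Linear(vit_dim, llm_dim)  # map into the LLM embedding space

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, n_patches, vit_dim), e.g. from CLIP ViT-L/14
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        for attn in self.xattn:
            out, _ = attn(q, patch_feats, patch_feats)  # queries attend to patches
            q = q + out  # residual; self-attention and FFN omitted for brevity
        return self.proj(q)  # (B, n_queries, llm_dim)
```

The point of the compression: however many patches the ViT emits, the LLM only ever sees `n_queries` visual tokens.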

The differences between the two stages are the image resolution and the LLM sequence length
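The Modality-Adaptive Module itself can be sketched as an attention layer with a shared query projection but modality-specific key/value projections and LayerNorms. A minimal, hypothetical PyTorch sketch (all names and shapes are my own simplification, not the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MAMAttention(nn.Module):
    """Hypothetical sketch of one Modality-Adaptive attention layer:
    W_Q is shared, while W_K, W_V, and LayerNorm differ per modality."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.q_proj = nn.Linear(dim, dim)  # shared query projection
        # separate key/value projections and norms for text vs. image tokens
        self.k_proj = nn.ModuleDict({m: nn.Linear(dim, dim) for m in ("text", "image")})
        self.v_proj = nn.ModuleDict({m: nn.Linear(dim, dim) for m in ("text", "image")})
        self.norm = nn.ModuleDict({m: nn.LayerNorm(dim) for m in ("text", "image")})
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) mixed-modality sequence; image_mask: (B, T) bool,
        # True where the token came from the visual abstractor
        B, T, D = x.shape
        h = torch.empty_like(x)
        k = torch.empty_like(x)
        v = torch.empty_like(x)
        for name, sel in (("text", ~image_mask), ("image", image_mask)):
            hm = self.norm[name](x[sel])    # modality-specific norm
            h[sel] = hm
            k[sel] = self.k_proj[name](hm)  # modality-specific W_K
            v[sel] = self.v_proj[name](hm)  # modality-specific W_V
        q = self.q_proj(h)                  # shared W_Q across modalities

        def heads(t):  # (B, T, D) -> (B, n_heads, T, D // n_heads)
            return t.view(B, T, self.n_heads, -1).transpose(1, 2)

        out = F.scaled_dot_product_attention(heads(q), heads(k), heads(v), is_causal=True)
        return x + self.out_proj(out.transpose(1, 2).reshape(B, T, D))
```

The design point: tokens of each modality get their own K/V projections and norm (a small per-layer parameter cost), yet all tokens still attend to each other in one shared sequence via the common query space.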

Result

  • caption, VQA / multi-modal benchmark image

  • pure text benchmark image

They attribute this to MAM image

  • Effect of using two modalities for instruction data + effect of MAM
image

Using only text instruction data hurts multi-modal performance, using only multi-modal instruction data hurts text performance, and using both leaves each slightly below its single-modality best + adding MAM improves both

  • vision encoder freeze effect image

  • num queries image

TextVQA in particular seems to need a large number of queries

  • resolution image

TextVQA benefits overwhelmingly from higher resolution lol

Qualitative Result

image

They claim that, thanks to MAM, the model looks at text in early layers and at images in later layers -> so what?

image

Given an unrelated image and text, the model with MAM says it focused on the text. I think both answers are wrong, but at least the one with MAM says 7 lol