image

paper

TL;DR

  • I read this because.. : the paper does a thorough job of ablating the data recipe
  • task : MLLM
  • problem : a linear projector keeps the visual token sequence long, and the resampler structure is not locality-aware, both of which seem to hurt scores
  • idea : use convolution or deformable attention in the projector instead of a linear projection or resampler
  • input/output : image, text(query) -> text(answer)
  • architecture : CLIP ViT-L/14 + ResNet or DDETR (with minor changes) + LLM (Vicuna-7B / 13B)
  • objective : LM loss
  • baseline : LLaVA, MiniGPT-4, LLaMA-Adapter2, mPLUG-owl, InstructBLIP, IDEFICS, Shikra, Qwen-VL, LLaVA-1.5
  • data : (pretraining) COYO100M, BLIP-CapFilt (instruction) captioning(BlipCapFilt, COYO100M), VQA-open(VQAv2, GQA, OCRVQA, VSR), VQA-mc(ScienceQA, A-OKVQA), REC(RefCOCO, RefCOCO+, RefCOCOg, VG), Instruction(LLaVA150K, ShareGPT)
  • evaluation : SEED(limb), MME, MMB(binary), LLAVAW
  • result : sota
  • contribution : Identified and improved weaknesses in the resampler. Shared tips on various recipes (Hyeja’s paper..)
  • etc. : All the papers Junbeom participated in seem good.

Details

  • motivation image

Analysis of spatially-grounded benchmark challenges, comparing linear projection vs. resampler: resampler-style projectors are weak at spatial understanding, since finer details are lost during the resampling process. Linear projection, on the other hand, tends to convey local information well.

Honeybee

image
  • MLLM objective
image
  • architecture
  1) vision encoder 2) projector 3) large language model
  • efficiency of MLLM: most of the bottlenecks (memory consumption, throughput) are in the LLM, meaning that the number of visual tokens you pass to the LLM determines its efficiency.
image image

For example, linear projection has very few parameters, yet it takes about the same time as a resampler producing the same number of tokens. In other words, training time is proportional to the number of tokens: the resampler takes longer per step as the number of visual tokens increases. (This is a slightly different point from LLaVA's claim that it converges quickly thanks to fewer parameters; there "converge" is about parameter count, while here it is simply about per-step speed.)
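Since the LLM dominates the cost, the visual-token budget drives efficiency regardless of how cheap the projector itself is. A back-of-envelope sketch (all constants here are illustrative assumptions, not numbers from the paper):

```python
# Toy sketch: LLM-side prefill cost scales with the number of visual tokens.
# Token counts and dimensions are illustrative assumptions, not from the paper.

def visual_tokens(projector: str, grid: int = 24, reduced: int = 144) -> int:
    """Linear projection keeps all ViT patches; an abstractor/resampler
    can reduce them to a chosen budget."""
    if projector == "linear":
        return grid * grid          # e.g. a 24x24 patch grid -> 576 tokens
    return reduced                  # resampler / C-Abstractor / D-Abstractor

def prefill_flops(num_visual: int, num_text: int = 64, d: int = 4096) -> int:
    """Very rough per-layer transformer prefill cost: O(n^2 * d) attention
    plus O(n * d^2) MLP, for sequence length n = visual + text tokens."""
    n = num_visual + num_text
    return n * n * d + n * d * d

linear = prefill_flops(visual_tokens("linear"))
abstractor = prefill_flops(visual_tokens("abstractor"))
print(linear / abstractor)  # fewer visual tokens -> cheaper LLM prefill
```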

  • proposed

As mentioned in the motivation, the resampler structure doesn’t seem to reflect locality. So let’s design a visual projector that does reflect locality.

  • Abstractor image

The C-Abstractor is built from ResNet (convolution) blocks; the D-Abstractor is built from deformable attention.
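A minimal numpy sketch of the C-Abstractor idea, assuming a 3x3 mean filter as a stand-in for the paper's actual ResNet blocks and simple window averaging for the adaptive pooling (both are my simplifications):

```python
import numpy as np

def local_mix(feat: np.ndarray) -> np.ndarray:
    """3x3 mean filter as a stand-in for convolutional ResNet blocks:
    each position mixes only with its spatial neighbors (locality)."""
    H, W, C = feat.shape
    padded = np.pad(feat, ((1, 1), (1, 1), (0, 0)), mode="edge")
    out = np.zeros_like(feat)
    for i in range(H):
        for j in range(W):
            out[i, j] = padded[i:i + 3, j:j + 3].mean(axis=(0, 1))
    return out

def adaptive_avg_pool(feat: np.ndarray, out_hw: int) -> np.ndarray:
    """Reduce an HxW token grid to out_hw x out_hw by averaging
    non-overlapping windows (assumes H, W divisible by out_hw)."""
    H, W, C = feat.shape
    s = H // out_hw
    return feat.reshape(out_hw, s, out_hw, s, C).mean(axis=(1, 3))

vit_feat = np.random.rand(24, 24, 8)                 # 576 patch tokens, toy dim 8
tokens = adaptive_avg_pool(local_mix(vit_feat), 12)  # -> 144 visual tokens
print(tokens.shape)  # (12, 12, 8)
```

Each reduced token summarizes one spatial window; a resampler, by contrast, attends globally from learned queries, so nothing forces a reduced token to represent a local region.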

Result image

Training

Overall, a LLaVA-like training strategy

  • Pre-training for vision-language alignment: COYO and BlipCapFilt mixed 1:1 (this ratio was settled after a short manual look at learning curves); only the projector is trained.

  • Visual instruction tuning: the LLM is trained together with the projector. The data looks like this image

image

Hidden Recipe for Visual Instruction Tuning

image
  • Dataset Combination
  • It’s good to have a variety of datasets
  • Benchmark performance drops significantly, especially when open-ended VQAs are removed
  • MMB, SEED drops a lot when multiple-choice VQAs are removed -> important for aligning response patterns
  • LLAVAW drops significantly when captioning data is removed -> LLAVAW favors narrative and descriptive responses
  • LLaVAW (judged by GPT) drops when visual or text instruction-following datasets are removed.
image
  • Dataset Balancing
  • For pretraining, the 1:1 ratio (as above) was used
  • For instruction tuning, the ratios could only be tuned manually, sadly image

VSR, ShareGPT, ScienceQA, and OCRVQA have small absolute sizes, so their ratios were reduced; OCRVQA and VG were reduced further based on experiments. BlipCapFilt was left out of the captioning data because of cost, and the ablation showed no performance drop (!! so the alt-text was kept and the caption thrown away).
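One possible way to implement this kind of manual balancing is to cap each dataset's natural (size-proportional) ratio and renormalize; the sizes and cap below are made-up illustrations, not the paper's numbers:

```python
# Sketch: clip each dataset's size-proportional ratio to a manual cap,
# then renormalize into sampling probabilities. Sizes/caps are made up.
sizes = {"VQAv2": 440_000, "GQA": 940_000, "OCRVQA": 800_000,
         "VSR": 10_000, "ShareGPT": 90_000, "ScienceQA": 12_000}
caps = {"OCRVQA": 0.10}   # e.g. experimentally reduce OCRVQA's share

total = sum(sizes.values())
raw = {k: v / total for k, v in sizes.items()}            # natural ratios
clipped = {k: min(p, caps.get(k, 1.0)) for k, p in raw.items()}
norm = sum(clipped.values())
probs = {k: p / norm for k, p in clipped.items()}          # final mix
print(probs)
```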

  • Instruction vs Multi-task image

Giving the input as a natural-language instruction works better than giving a dataset or task name.

  • template
image

On granularity: using a different template per “task” worked well (!!). Using a single template was better than using multiple ones (!!). “Flip” reverses the order of QA, but it’s not very helpful.
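What per-task (rather than per-dataset) templating might look like; these template strings are hypothetical, not copied from the paper:

```python
# Hypothetical per-task templates: all datasets of the same task share one
# template, rather than one template per dataset or one global template.
TEMPLATES = {
    "vqa_open":   "Answer the question briefly. Question: {q}",
    "vqa_mc":     "Choose the correct option. Question: {q} Options: {opts}",
    "captioning": "Describe the image in detail.",
    "rec":        "Provide the bounding box for: {q}",
}

def build_prompt(task: str, q: str = "", opts: str = "") -> str:
    """Format one example with its task-level template."""
    return TEMPLATES[task].format(q=q, opts=opts)

print(build_prompt("vqa_mc",
                   q="What is the cat doing?",
                   opts="(a) sleeping (b) eating"))
```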

  • multi-turn image

It was good to have multi-turn VQA, especially since it deduplicates similar questions about the same image. image
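Merging single-turn QA pairs about the same image into one multi-turn example, dropping duplicate questions, could be sketched as:

```python
from collections import defaultdict

def to_multiturn(samples):
    """samples: list of (image_id, question, answer) single-turn examples.
    Groups by image and drops duplicate questions within an image."""
    by_image = defaultdict(list)
    seen = defaultdict(set)
    for img, q, a in samples:
        if q not in seen[img]:          # de-duplicate repeated questions
            seen[img].add(q)
            by_image[img].append((q, a))
    return dict(by_image)

data = [("img1", "What color is the car?", "red"),
        ("img1", "What color is the car?", "red"),   # duplicate, dropped
        ("img1", "How many people are there?", "two")]
convs = to_multiturn(data)
print(len(convs["img1"]))  # 2 turns after dedup
```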

  • Evaluation image

Details

  • examples of benchmarks image

SEED contains many fine-grained questions.

  • Designing the architecture of the resampler image

  • Templates image

Captioning, VQA, and REC tasks can be changed into a fine-grained form. For example, in [Visual Semantic Reasoning](https://paperswithcode.com/dataset/vsr), replace “The cat is inside the refrigerator, False” with “Is the cat inside the refrigerator?”. And for instruction data, use what’s already there without a template.
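A toy version of that VSR conversion (declarative statement + True/False label into a yes/no QA pair); this naive copula-fronting rule is my own illustration and only handles simple "... is ..." statements:

```python
def statement_to_question(statement: str, label: bool):
    """'The cat is inside the refrigerator.' + False
       -> ('Is the cat inside the refrigerator?', 'No')
    Toy rule: front the copula 'is' and lowercase the old sentence start."""
    words = statement.rstrip(".").split()
    assert "is" in words, "toy rule only handles simple 'is' statements"
    i = words.index("is")
    rest = [words[0].lower()] + words[1:i] + words[i + 1:]
    question = "Is " + " ".join(rest) + "?"
    return question, "Yes" if label else "No"

print(statement_to_question("The cat is inside the refrigerator.", False))
# ('Is the cat inside the refrigerator?', 'No')
```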