TL;DR
- I read this because : the paper does a thorough job of ablating the data recipe
- task : MLLM
- problem : a linear projector keeps the visual sequence long (inefficient), while a resampler shortens it but is not locality-aware, which seems to hurt scores on spatial tasks.
- idea : use convolution or deformable attention as the projector instead of linear projection or a resampler
- input/output : image, text(query) -> text(answer)
- architecture : CLIP ViT-L/14 + ResNet or DDETR (with minor changes) + LLM (Vicuna-7B / 13B)
- objective : LM loss
- baseline : LLaVA, MiniGPT-4, LLaMA-Adapter2, mPLUG-owl, InstructBLIP, IDEFICS, Shikra, Qwen-VL, LLaVA-1.5
- data : (pretraining) COYO100M, BLIP-CapFilt (instruction) captioning(BlipCapFilt, COYO100M), VQA-open(VQAv2, GQA, OCRVQA, VSR), VQA-mc(ScienceQA, A-OKVQA), REC(RefCOCO, RefCOCO+, RefCOCOg, VG), Instruction(LLaVA150K, ShareGPT)
- evaluation : SEED(limb), MME, MMB(binary), LLAVAW
- result : sota
- contribution : Identified and improved weaknesses in the resampler. Shared tips on various recipes (a very generous paper..)
- etc. : All the papers Junbeom has participated in seem good.
Details
- motivation
Analysis of spatially-demanding benchmark challenges, linear projection vs. resampler: resampler-based models are bad at spatial tasks, since finer details are lost in the resampling process. Linear-projection styles, on the other hand, tend to preserve local information well.
Honey-bee
- MLLM objective
- architecture
- 1) vision encoder 2) projector 3) large language model
- efficiency of MLLM: most of the bottleneck (memory consumption, throughput) is in the LLM, so the number of visual tokens passed to the LLM determines efficiency.
For example, linear projection has very few parameters but takes about as long as a resampler producing the same # of tokens. In other words, training time is proportional to # tokens, and the resampler's per-step time grows as the # of visual tokens increases. (A slightly different point from LLaVA's claim that it converges quickly thanks to fewer parameters: there "converge" is about parameter count, here it is simply about per-step speed.)
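The token-count argument above can be made concrete with a back-of-envelope calculation (the numbers below are my own illustration, not figures from the paper):

```python
# Visual-token budget sketch: per-step LLM cost scales roughly
# with the number of visual tokens, so a projector that shrinks
# the token count shrinks training/inference time accordingly.

def num_patches(image_size: int, patch_size: int) -> int:
    """Patch-token count of a ViT encoder, e.g. CLIP ViT-L/14."""
    return (image_size // patch_size) ** 2

linear_tokens = num_patches(224, 14)   # linear projection keeps all 256 tokens
pooled_tokens = 144                    # an abstractor can target any budget, e.g. 144

# If per-step time is ~proportional to visual tokens,
# the abstractor cuts the visual-token cost to 144/256.
speedup = linear_tokens / pooled_tokens
print(linear_tokens, pooled_tokens, round(speedup, 2))
```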
- proposed
As mentioned in the motivation, the resampler structure doesn't reflect locality. So: design a visual projector that does reflect locality.
- Abstractor
The C-Abstractor is ResNet-based (convolution + pooling); the D-Abstractor is based on deformable attention.
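The locality-preserving idea behind the C-Abstractor can be sketched as plain 2D average pooling (a stand-in of my own; the real module uses ResNet blocks around an adaptive pooling step). Each output token only mixes spatially neighbouring tokens, unlike a resampler whose learned queries attend to every token at once:

```python
# Locality-preserving token reduction: pool a grid x grid ViT token map
# down to out_grid x out_grid, window by contiguous spatial block,
# so local detail stays local.

def avg_pool_2d(tokens, grid, out_grid):
    """`tokens`: flat row-major list of feature vectors (lists of floats)."""
    assert grid % out_grid == 0
    k = grid // out_grid          # pooling window size
    dim = len(tokens[0])
    out = []
    for oy in range(out_grid):
        for ox in range(out_grid):
            acc = [0.0] * dim
            for dy in range(k):
                for dx in range(k):
                    t = tokens[(oy * k + dy) * grid + (ox * k + dx)]
                    for i, v in enumerate(t):
                        acc[i] += v
            out.append([v / (k * k) for v in acc])
    return out

# 16x16 ViT tokens -> 8x8 abstracted tokens (256 -> 64)
feats = [[float(i)] for i in range(16 * 16)]
pooled = avg_pool_2d(feats, grid=16, out_grid=8)
print(len(pooled))  # 64
```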
Result
Training
Overall, a LLaVA-like training strategy.
Pre-training for vision-language alignment: COYO to BlipCapFilt at 1:1 (this ratio was determined from a short manual sweep of learning curves). Only the projector is trained.
Visual instruction tuning: the LLM is trained together with the projector. The data looks like this:
Hidden Recipe for Visual Instruction Tuning
- Dataset Combination
- A diverse mix of datasets helps
- Benchmark performance drops significantly, especially when open-ended VQAs are removed
- MMB, SEED drops a lot when multiple-choice VQAs are removed -> important for aligning response patterns
- LLAVAW drops significantly when captioning data is removed -> LLAVAW favors narrative and descriptive responses
- LLaVAW (GPT-based evaluation) drops when visual or text instruction-following datasets are removed.
- Dataset Balancing
- For pretraining, the 1:1 ratio above
- For instruction tuning, the ratios can only be tuned manually, unfortunately
VSR, ShareGPT, ScienceQA, and OCRVQA have small absolute sizes, so their ratios were reduced. OCRVQA and VG were reduced experimentally. BlipCapFilt was left out of captioning for cost reasons, and the ablation showed no performance drop (!! they took the alt-text and threw away the caption)
- Instruction vs Multi-task
Phrasing the data as an instruction works better than identifying it by dataset or task name
- template
For granularity, having a different template per "task" worked well (!!). Using a single template beat using multiple templates (!!). flip reverses the order of Q and A, but it's not very helpful
- multi-turn
Converting data to multi-turn VQA helped, especially together with deduplicating similar questions.
- Evaluation
Details
- examples of benchmarks
SEED contains a lot of fine-grained questions
Designing the architecture of the resampler
Templates
Captions are used as-is; VQA and REC tasks are converted to a fine-grained form. For example, in [Visual Spatial Reasoning](https://paperswithcode.com/dataset/vsr), replace ("The cat is inside the refrigerator.", False) with "Is the cat inside the refrigerator?". For instruction data, use what's already there without a template.
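The VSR statement-to-question conversion above can be sketched as below (the rewrite rule is a naive heuristic of my own, not the paper's actual procedure):

```python
# Turn a VSR (statement, bool label) pair into a yes/no question:
# move the copula ("is"/"are") to the front and map the label to Yes/No.

def vsr_to_question(statement: str, label: bool):
    words = statement.rstrip(".").split()
    for i, w in enumerate(words):
        if w.lower() in ("is", "are"):
            q = [w.capitalize(), words[0].lower()] + words[1:i] + words[i + 1:]
            return " ".join(q) + "?", "Yes" if label else "No"
    # fallback: leave the statement untouched
    return statement, "Yes" if label else "No"

print(vsr_to_question("The cat is inside the refrigerator.", False))
# -> ('Is the cat inside the refrigerator?', 'No')
```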