image

paper

TL;DR

  • I read this because.. : the paper does a thorough job of ablating the data recipe
  • task : MLLM
  • problem : a linear projector keeps the visual token sequence long, and the resampler structure is not locality-aware, both of which seem to hurt scores
  • idea : use convolution or deformable attention in the projector instead of a linear projection or resampler
  • input/output : image, text(query) -> text(answer)
  • architecture : CLIP ViT-L/14 + ResNet or DDETR (with minor changes) + LLM (Vicuna-7B / 13B)
  • objective : LM loss
  • baseline : LLaVA, MiniGPT-4, LLaMA-Adapter2, mPLUG-owl, InstructBLIP, IDEFICS, Shikra, Qwen-VL, LLaVA-1.5
  • data : (pretraining) COYO100M, BLIP-CapFilt (instruction) captioning(BlipCapFilt, COYO100M), VQA-open(VQAv2, GQA, OCRVQA, VSR), VQA-mc(ScienceQA, A-OKVQA), REC(RefCOCO, RefCOCO+, RefCOCOg, VG), Instruction(LLaVA150K, ShareGPT)
  • evaluation : SEED(limb), MME, MMB(binary), LLAVAW
  • result : sota
  • contribution : Identified and improved weaknesses in the resampler. Shared tips on various recipes (Hyeja’s paper..)
  • etc. : All the papers Junbeom participated in seem good.

Details

  • motivation image

Analysis of spatially-grounded benchmark challenges, comparing linear projection vs. resampler: resampler-style projectors are weak at spatial understanding, since finer details are lost during the resampling process. Linear projection, on the other hand, tends to convey local information well.

Honeybee

image
  • MLLM objective
image
  • architecture
  1) vision encoder 2) projector 3) large language model
  • efficiency of MLLM: most of the bottlenecks (memory consumption, throughput) are in the LLM, meaning that the number of visual tokens you pass to the LLM determines its efficiency.
image image

For example, linear projection has very few parameters, yet it takes about the same time as a resampler producing the same number of tokens. In other words, training time is proportional to the number of tokens: the resampler takes longer per step as the number of visual tokens increases. (This is a slightly different point from LLaVA's claim that it converges quickly thanks to fewer parameters; there "converge" is about parameter count, while here it is simply about per-step speed.)
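Since the LLM dominates the cost, the visual-token budget drives efficiency regardless of how cheap the projector itself is. A back-of-envelope sketch (all constants here are illustrative assumptions, not numbers from the paper):

```python
# Toy sketch: LLM-side prefill cost scales with the number of visual tokens.
# Token counts and dimensions are illustrative assumptions, not from the paper.

def visual_tokens(projector: str, grid: int = 24, reduced: int = 144) -> int:
    """Linear projection keeps all ViT patches; an abstractor/resampler
    can reduce them to a chosen budget."""
    if projector == "linear":
        return grid * grid          # e.g. a 24x24 patch grid -> 576 tokens
    return reduced                  # resampler / C-Abstractor / D-Abstractor

def prefill_flops(num_visual: int, num_text: int = 64, d: int = 4096) -> int:
    """Very rough per-layer transformer prefill cost: O(n^2 * d) attention
    plus O(n * d^2) MLP, for sequence length n = visual + text tokens."""
    n = num_visual + num_text
    return n * n * d + n * d * d

linear = prefill_flops(visual_tokens("linear"))
abstractor = prefill_flops(visual_tokens("abstractor"))
print(linear / abstractor)  # fewer visual tokens -> cheaper LLM prefill
```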

  • proposed

As mentioned in the motivation, the resampler structure doesn’t seem to reflect locality. So let’s design a visual projector that does reflect locality.

  • Abstractor image

The C-Abstractor is built from ResNet (convolution) blocks; the D-Abstractor is built from deformable attention.
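A minimal numpy sketch of the C-Abstractor idea, assuming a 3x3 mean filter as a stand-in for the paper's actual ResNet blocks and simple window averaging for the adaptive pooling (both are my simplifications):

```python
import numpy as np

def local_mix(feat: np.ndarray) -> np.ndarray:
    """3x3 mean filter as a stand-in for convolutional ResNet blocks:
    each position mixes only with its spatial neighbors (locality)."""
    H, W, C = feat.shape
    padded = np.pad(feat, ((1, 1), (1, 1), (0, 0)), mode="edge")
    out = np.zeros_like(feat)
    for i in range(H):
        for j in range(W):
            out[i, j] = padded[i:i + 3, j:j + 3].mean(axis=(0, 1))
    return out

def adaptive_avg_pool(feat: np.ndarray, out_hw: int) -> np.ndarray:
    """Reduce an HxW token grid to out_hw x out_hw by averaging
    non-overlapping windows (assumes H, W divisible by out_hw)."""
    H, W, C = feat.shape
    s = H // out_hw
    return feat.reshape(out_hw, s, out_hw, s, C).mean(axis=(1, 3))

vit_feat = np.random.rand(24, 24, 8)                 # 576 patch tokens, toy dim 8
tokens = adaptive_avg_pool(local_mix(vit_feat), 12)  # -> 144 visual tokens
print(tokens.shape)  # (12, 12, 8)
```

Each reduced token summarizes one spatial window; a resampler, by contrast, attends globally from learned queries, so nothing forces a reduced token to represent a local region.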

Result image

Training

Overall, a LLaVA-like training strategy

  • Pre-training for vision-language alignment: COYO and BlipCapFilt mixed 1:1 (this ratio was settled after a short manual look at learning curves); only the projector is trained.

  • Visual instruction tuning: the LLM is trained together with the projector. The data looks like this image

image

Hidden Recipe for Visual Instruction Tuning

image
  • Dataset Combination
  • It’s good to have a variety of datasets
  • Benchmark performance drops significantly, especially when open-ended VQAs are removed
  • MMB, SEED drops a lot when multiple-choice VQAs are removed -> important for aligning response patterns
  • LLAVAW drops significantly when captioning data is removed -> LLAVAW favors narrative and descriptive responses
  • LLaVAW (judged by GPT) drops when visual or text instruction-following datasets are removed.
image
  • Dataset Balancing
  • For pretraining, the 1:1 ratio (as above) was used
  • For instruction tuning, the ratios could only be tuned manually, sadly image

VSR, ShareGPT, ScienceQA, and OCRVQA have small absolute sizes, so their ratios were reduced; OCRVQA and VG were reduced further based on experiments. BlipCapFilt was left out of the captioning data because of cost, and the ablation showed no performance drop (!! so the alt-text was kept and the caption thrown away).
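One possible way to implement this kind of manual balancing is to cap each dataset's natural (size-proportional) ratio and renormalize; the sizes and cap below are made-up illustrations, not the paper's numbers:

```python
# Sketch: clip each dataset's size-proportional ratio to a manual cap,
# then renormalize into sampling probabilities. Sizes/caps are made up.
sizes = {"VQAv2": 440_000, "GQA": 940_000, "OCRVQA": 800_000,
         "VSR": 10_000, "ShareGPT": 90_000, "ScienceQA": 12_000}
caps = {"OCRVQA": 0.10}   # e.g. experimentally reduce OCRVQA's share

total = sum(sizes.values())
raw = {k: v / total for k, v in sizes.items()}            # natural ratios
clipped = {k: min(p, caps.get(k, 1.0)) for k, p in raw.items()}
norm = sum(clipped.values())
probs = {k: p / norm for k, p in clipped.items()}          # final mix
print(probs)
```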

  • Instruction vs Multi-task image

Giving the input as a natural-language instruction works better than giving a dataset or task name.

  • template
image

On granularity: using a different template per “task” worked well (!!). Using a single template was better than using multiple ones (!!). “Flip” reverses the order of QA, but it’s not very helpful.
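What per-task (rather than per-dataset) templating might look like; these template strings are hypothetical, not copied from the paper:

```python
# Hypothetical per-task templates: all datasets of the same task share one
# template, rather than one template per dataset or one global template.
TEMPLATES = {
    "vqa_open":   "Answer the question briefly. Question: {q}",
    "vqa_mc":     "Choose the correct option. Question: {q} Options: {opts}",
    "captioning": "Describe the image in detail.",
    "rec":        "Provide the bounding box for: {q}",
}

def build_prompt(task: str, q: str = "", opts: str = "") -> str:
    """Format one example with its task-level template."""
    return TEMPLATES[task].format(q=q, opts=opts)

print(build_prompt("vqa_mc",
                   q="What is the cat doing?",
                   opts="(a) sleeping (b) eating"))
```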

  • multi-turn image

It was good to have multi-turn VQA, especially since it deduplicates similar questions about the same image. image
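Merging single-turn QA pairs about the same image into one multi-turn example, dropping duplicate questions, could be sketched as:

```python
from collections import defaultdict

def to_multiturn(samples):
    """samples: list of (image_id, question, answer) single-turn examples.
    Groups by image and drops duplicate questions within an image."""
    by_image = defaultdict(list)
    seen = defaultdict(set)
    for img, q, a in samples:
        if q not in seen[img]:          # de-duplicate repeated questions
            seen[img].add(q)
            by_image[img].append((q, a))
    return dict(by_image)

data = [("img1", "What color is the car?", "red"),
        ("img1", "What color is the car?", "red"),   # duplicate, dropped
        ("img1", "How many people are there?", "two")]
convs = to_multiturn(data)
print(len(convs["img1"]))  # 2 turns after dedup
```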

  • Evaluation image

Details

  • examples of benchmarks image

SEED contains many fine-grained questions.

  • Designing the architecture of the resampler image

  • Templates image

Captioning, VQA, and REC tasks can be changed into a fine-grained form. For example, in [Visual Semantic Reasoning](https://paperswithcode.com/dataset/vsr), replace “The cat is inside the refrigerator, False” with “Is the cat inside the refrigerator?”. And for instruction data, use what’s already there without a template.
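A toy version of that VSR conversion (declarative statement + True/False label into a yes/no QA pair); this naive copula-fronting rule is my own illustration and only handles simple "... is ..." statements:

```python
def statement_to_question(statement: str, label: bool):
    """'The cat is inside the refrigerator.' + False
       -> ('Is the cat inside the refrigerator?', 'No')
    Toy rule: front the copula 'is' and lowercase the old sentence start."""
    words = statement.rstrip(".").split()
    assert "is" in words, "toy rule only handles simple 'is' statements"
    i = words.index("is")
    rest = [words[0].lower()] + words[1:i] + words[i + 1:]
    question = "Is " + " ".join(rest) + "?"
    return question, "Yes" if label else "No"

print(statement_to_question("The cat is inside the refrigerator.", False))
# ('Is the cat inside the refrigerator?', 'No')
```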