[143] Honeybee: Locality-enhanced Projector for Multimodal LLM

TL;DR

I read this because.. : I heard it has good ablations on the data recipe
task : MLLM
problem : a linear projector makes the sequence long (one visual token per patch), while the resampler structure is not locality-aware, which seems to lower scores
idea : use conv or deformable attention instead of linear projection
input/output : image, text(query) -> text(answer)
architecture : CLIP ViT-L/14 + ResNet or DDETR (slightly modified) + LLM (Vicuna-7B / 13B)
objective : LM loss
baseline : LLaVA, MiniGPT-4, LLaMA-Adapter V2, mPLUG-Owl, InstructBLIP, IDEFICS, Shikra, Qwen-VL, LLaVA-1.5
data : (pretraining) COYO100M, BLIP-CapFilt (instruction) captioning(BlipCapFilt, COYO100M), VQA-open(VQAv2, GQA, OCRVQA, VSR), VQA-mc(ScienceQA, A-OKVQA), REC(RefCOCO, RefCOCO+, RefCOCOg, VG), Instruction(LLaVA150K, ShareGPT)
evaluation : SEED (four-way multiple choice), MME (binary yes/no), MMB (multiple choice), LLaVA-W
result : SOTA
contribution : identifies the weakness of the resampler and fixes it. Shares a lot of recipe tips (a generous paper..)
etc. : every paper Junbum has taken part in seems to be good..

Details

motivation

Analysis of linear projection vs. resampler on the spatial-related benchmarks: resampler-based models do poorly on spatial understanding. Finer details get lost in the resampling process, whereas linear-style projectors tend to pass local information through well.

Honey-bee

MLLM objective

architecture

1) vision encoder 2) projector 3) large language model

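efficiency of MLLM: most of the bottleneck (memory consumption, throughput) sits in the LLM. In other words, the number of visual tokens passed to the LLM determines efficiency.

For example, a linear projection has almost no parameters, yet its step time is similar to a resampler with the same # tokens. In other words, training time is proportional to # tokens: as the resampler's # visual tokens grows, each training step takes longer. (This is a slightly different point from LLaVA's claim that fewer parameters lead to quick convergence. There the argument is about "convergence" thanks to few parameters; here it is simply about raw training speed.)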
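To make the token-count argument concrete, here is a minimal arithmetic sketch. The patch size of 14 comes from ViT-L/14; the 224 input resolution, the text length, and the abstractor output size are assumptions chosen only for illustration, not values taken from these notes.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Patch tokens from a ViT on a square image (CLS token not counted)."""
    side = image_size // patch_size
    return side * side


def llm_sequence_length(num_visual_tokens: int, num_text_tokens: int) -> int:
    """Total tokens the LLM has to process for one sample."""
    return num_visual_tokens + num_text_tokens


# ViT-L/14 has patch size 14; the 224 input resolution is an assumption.
patch_tokens = num_patch_tokens(image_size=224, patch_size=14)

TEXT_TOKENS = 128        # hypothetical prompt + answer length
ABSTRACTOR_TOKENS = 144  # hypothetical output size of a resampler/abstractor

# A linear projector keeps one visual token per patch.
linear_len = llm_sequence_length(patch_tokens, TEXT_TOKENS)
# A resampler/abstractor emits a fixed, smaller number of visual tokens.
abstractor_len = llm_sequence_length(ABSTRACTOR_TOKENS, TEXT_TOKENS)

print(f"linear projector: {patch_tokens} visual tokens, sequence length {linear_len}")
print(f"abstractor:       {ABSTRACTOR_TOKENS} visual tokens, sequence length {abstractor_len}")
```

The projector's own parameter count does not appear anywhere in this calculation, which is the point of the note: the cost is driven by how many tokens reach the LLM.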

proposed

As discussed in the motivation, the resampler structure seems unable to capture locality. So let's add a visual projector that does capture locality.

Abstractor

C-Abstractor uses ResNet; D-Abstractor uses deformable attention.

Results
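The following is a rough sketch of the general idea of reducing the number of visual tokens while keeping 2D neighborhoods intact. It only averages non-overlapping local windows of the patch grid; using average pooling here is an assumption for illustration, the convolutional (ResNet) blocks are omitted entirely, and the function names are hypothetical. It is not the paper's module.

```python
def tokens_to_grid(tokens, side):
    """Reshape a flat, row-major list of token vectors into a side x side grid."""
    if len(tokens) != side * side:
        raise ValueError("token count must equal side * side")
    return [tokens[r * side:(r + 1) * side] for r in range(side)]


def average_pool_grid(grid, out_size):
    """Average non-overlapping windows of a square grid of feature vectors.

    Each output token summarizes one local window, so neighboring patches
    stay together (unlike a resampler, whose queries attend globally).
    Assumes the grid side is divisible by out_size.
    """
    in_size = len(grid)
    if in_size % out_size != 0:
        raise ValueError("grid side must be divisible by out_size")
    window = in_size // out_size
    dim = len(grid[0][0])
    pooled = []
    for out_r in range(out_size):
        row = []
        for out_c in range(out_size):
            acc = [0.0] * dim
            for r in range(out_r * window, (out_r + 1) * window):
                for c in range(out_c * window, (out_c + 1) * window):
                    for d in range(dim):
                        acc[d] += grid[r][c][d]
            row.append([v / (window * window) for v in acc])
        pooled.append(row)
    return pooled


# Toy example: 16 one-dimensional "patch features" on a 4 x 4 grid.
toy_tokens = [[float(i)] for i in range(16)]
toy_grid = tokens_to_grid(toy_tokens, side=4)
pooled = average_pool_grid(toy_grid, out_size=2)  # 16 tokens -> 4 tokens
print(pooled)
```

Here the number of output tokens is a free choice (`out_size * out_size`), which is the flexibility a resampler offers, while each output still corresponds to a specific image region.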

Training

Overall, a LLaVA-like training strategy

Pre-training for vision-language alignment: COYO and BlipCapFilt at 1:1 (they say this ratio was chosen manually after short training runs); only the projector is trained
Visual instruction tuning: the projector and the LLM are trained together; the data is as follows

Hidden Recipe for Visual Instruction Tuning

Dataset Combination
- Using a diverse mix of datasets is better
- In particular, removing the open-ended VQA data drops benchmark performance a lot
- Removing the multiple-choice VQA data drops MMB and SEED a lot -> important for aligning response patterns
- Removing the captioning data drops LLaVA-W a lot -> LLaVA-W prefers narrative and descriptive responses
- Removing the visual or text instruction-following datasets drops LLaVA-W (the one evaluated by GPT).

Dataset Balancing
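- 1:1 for pretraining
- For instruction tuning, there is no way around tuning the ratios manually :(

The ratios of VSR, ShareGPT, ScienceQA, and OCRVQA were reduced because they are small in absolute size. OCRVQA and VG were reduced empirically. BlipCapFilt was dropped from captioning for cost reasons, but in the ablation the performance did not drop (!! so they kept the alt-text and threw away the captions).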
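A minimal sketch of ratio-based dataset sampling, as one way to picture the balancing described above. The 1:1 pretraining ratio is from the notes; the instruction-tuning weights are made-up placeholders (the notes only say they were tuned by hand), and `build_sampler` is a hypothetical helper.

```python
import random


def normalized(ratios):
    """Turn relative ratios into sampling probabilities that sum to 1."""
    total = sum(ratios.values())
    return {name: weight / total for name, weight in ratios.items()}


def build_sampler(ratios, seed=0):
    """Return a function that picks a dataset name according to the ratios."""
    names = list(ratios)
    weights = [ratios[name] for name in names]
    rng = random.Random(seed)

    def sample():
        return rng.choices(names, weights=weights, k=1)[0]

    return sample


# Pretraining: 1:1, as stated in the notes.
pretrain_ratios = {"COYO100M": 1.0, "BlipCapFilt": 1.0}

# Instruction tuning: placeholder weights, NOT the paper's values.
instruction_ratios = {
    "captioning": 1.0,
    "vqa_open": 3.0,
    "vqa_mc": 1.0,
    "rec": 2.0,
    "instruction": 1.0,
}

pick_pretrain = build_sampler(pretrain_ratios)
draws = [pick_pretrain() for _ in range(10_000)]
coyo_fraction = draws.count("COYO100M") / len(draws)
print(normalized(pretrain_ratios), round(coyo_fraction, 3))
```

Sampling by ratio rather than concatenating datasets keeps a small dataset from being drowned out, or a large one from dominating, independently of their raw sizes.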

Instruction vs Multi-task

Giving a natural-language instruction vs. giving the dataset or task name as an identifier: the instruction style was better.

template

For granularity, using a different template per "task" worked well (!!). Using a single template was better than using several (!!). Flip means swapping the order of the QA, and it did not help much.
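The two formatting choices (instruction vs. task identifier, and one template per task) can be sketched as below. All template strings, the task names, and the dataset-to-task mapping are hypothetical examples written for this note, not the paper's actual templates.

```python
# One template per *task* (not per dataset), and exactly one per task.
# The wording of each template is a made-up placeholder.
TASK_TEMPLATES = {
    "vqa_open": "Answer the question using a single word or phrase. {question}",
    "vqa_mc": "Answer with the option's letter from the given choices. {question}",
    "rec": "Provide the bounding box coordinate of the region described: {expression}",
}

# Datasets that belong to the same task share the same template.
TASK_OF_DATASET = {
    "VQAv2": "vqa_open",
    "GQA": "vqa_open",
    "ScienceQA": "vqa_mc",
    "RefCOCO": "rec",
}


def format_instruction(dataset: str, **fields: str) -> str:
    """Instruction style: a natural-language instruction chosen by task."""
    task = TASK_OF_DATASET[dataset]
    return TASK_TEMPLATES[task].format(**fields)


def format_multitask(dataset: str, question: str) -> str:
    """Multi-task style: only a dataset identifier in front of the raw input."""
    return f"[{dataset}] {question}"


question = "What color is the cat?"
print(format_instruction("VQAv2", question=question))
print(format_instruction("GQA", question=question))
print(format_multitask("VQAv2", question))
```

Keying templates by task rather than by dataset means `VQAv2` and `GQA` produce identical prompts for the same question, which matches the "task-level, single template" setting the notes report as working best.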

multi-turn

For VQA-style data, converting to multi-turn was better. Deduplicating similar questions on top of that worked especially well.
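A minimal sketch of building multi-turn samples with de-duplication. It groups QA pairs by image and drops questions that are identical after simple normalization; the notes say "similar" questions were de-duplicated, so this exact-match rule is a simplifying assumption, and the function names are hypothetical.

```python
from collections import defaultdict


def normalize_question(question: str) -> str:
    """Lowercase, drop a trailing '?', and collapse whitespace."""
    return " ".join(question.lower().strip().rstrip("?").split())


def build_multi_turn(samples):
    """Merge (image_id, question, answer) triples into one dialogue per image.

    A question is skipped if the same normalized question was already
    added for that image.
    """
    dialogues = defaultdict(list)
    seen = defaultdict(set)
    for image_id, question, answer in samples:
        key = normalize_question(question)
        if key in seen[image_id]:
            continue
        seen[image_id].add(key)
        dialogues[image_id].append((question, answer))
    return dict(dialogues)


samples = [
    ("img1", "What color is the cat?", "black"),
    ("img1", "what color is the  cat ?", "black"),  # duplicate after normalization
    ("img1", "Where is the cat?", "on the sofa"),
    ("img2", "How many people are there?", "3"),
]
dialogues = build_multi_turn(samples)
for image_id, turns in dialogues.items():
    print(image_id, turns)
```

Each image then yields a single training sample whose turns are the surviving QA pairs, instead of several near-identical single-turn samples.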

Evaluation

D-etails

examples of benchmarks

Apparently SEED contains a lot of fine-grained questions.

Architecture design of the resampler
Templates

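Captioning data uses no separate prompt. VQA and REC tasks are rewritten in a fine-grained way. For example, in Visual Spatial Reasoning, "The cat is inside the refrigerator, False" is converted into "Is the cat inside the refrigerator?" with the answer "No". Data that was already released for instruction tuning is used as-is, without a template.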
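The statement-to-question conversion in the example can be sketched with a simple rule. This assumes every statement has the form "<subject> is <rest>" and moves the first "is" to the front; it is an illustrative rule with hypothetical function names, not the conversion actually used in the paper.

```python
def statement_to_question(statement: str) -> str:
    """Turn 'The cat is inside the refrigerator' into a yes/no question.

    Assumes the statement contains the word 'is' after a non-empty subject.
    """
    words = statement.strip().rstrip(".").split()
    if "is" not in words or words.index("is") == 0:
        raise ValueError("expected a statement of the form '<subject> is <rest>'")
    idx = words.index("is")
    subject = words[:idx]
    rest = words[idx + 1:]
    subject[0] = subject[0].lower()  # 'The' -> 'the' once it is no longer sentence-initial
    return "Is " + " ".join(subject + rest) + "?"


def label_to_answer(label: bool) -> str:
    """Map the True/False label to a Yes/No answer."""
    return "Yes" if label else "No"


def convert_vsr(statement: str, label: bool):
    """Convert one (statement, label) pair into a (question, answer) pair."""
    return statement_to_question(statement), label_to_answer(label)


print(convert_vsr("The cat is inside the refrigerator", False))
```

Statements that do not fit the assumed pattern raise an error rather than being silently mangled, so they can be handled separately.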
