TL;DR
- I read this because : it has been a while since I read an LVLM paper, and I was curious about the data curation
- task : Vision Language Model
- problem : the authors make no grand claims; the goal is a VLM with strong performance that can still be trained in an academic setting
- idea : 1) to handle resolution efficiently, cross-attend (CA) low-resolution and high-resolution features to mine visual information (sketched under "proposed patch info mining" below) 2) careful data curation 3) hook in Stable Diffusion so the model can perform image generation mid-conversation
- input/output : {image, Q} -> {A} (optionally calls SD depending on the answer; see the sketch right after this list)
- architecture : CLIP ViT-L (for low resolution) + ConvNeXt-L (for high resolution) + mining layer (projection and MLP) + LLM (Gemma-2B, Vicuna-7B/13B, Mixtral-8x7B, Hermes-2-Yi-34B)
- objective : CE loss
- baseline : (normal resolution) MobileVLM, InstructBLIP, Qwen-VL, Shikra, IDEFICS-80B, LLaMA-VID, LLaVA-1.5; (high resolution) OtterHD, CogVLM-chat, LLaVA-NeXT; (proprietary models) Gemini Pro, Qwen-VL-Plus, GPT-4V
- data : (alignment) 558K from CC3M filtered by LLaVA, 695K ALLaVA; (instruction) 643K LLaVA (excluding TextCaps), 100K from ShareGPT4V, 10K LAION-GPT4V, 700K ALLaVA, 5K text-only multi-turn from LIMA and OpenAssistant, 28K OCR-related (10K DocVQA, 4K ChartQA, 10K DVQA, 4K AI2D), plus 13K newly built generation-related instructions
- evaluation : TextVQA, MMB, MME, MM-Vet, MMMU, MathVista
- result : strong performance across the given benchmarks
- contribution : the information-merging scheme seems neat; solid data curation; lots of data ablations
- etc. :
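A rough sketch of the inference flow above: the VLM answers normally, and when the answer carries a generation prompt, that prompt is forwarded to Stable Diffusion. The `<gen>` delimiter, the `vlm.generate` call, and the checkpoint name are my assumptions for illustration, not the paper's actual interface.

```python
import re
from diffusers import StableDiffusionPipeline

GEN_TAG = re.compile(r"<gen>(.*?)</gen>", re.DOTALL)  # hypothetical delimiter

# hypothetical checkpoint; the note only says a Stable Diffusion model is called
sd_pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

def answer(vlm, image, question):
    """{image, Q} -> {A}; optionally calls SD when the answer embeds a prompt."""
    text = vlm.generate(image, question)  # hypothetical VLM API
    match = GEN_TAG.search(text)
    if match is None:
        return text, None  # plain QA answer
    # generation request: hand the embedded prompt to Stable Diffusion
    return text, sd_pipe(prompt=match.group(1)).images[0]
```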
Details
- thumbnail
- architecture
  - overall framework
  - proposed patch info mining
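The mining layer is essentially cross-attention where the low-resolution tokens are queries and the high-resolution features are keys/values, followed by a projection/MLP into the LLM embedding space. A minimal PyTorch sketch; the dimensions, residual, and global attention are my simplifications (as I understand it, the paper restricts each LR query to its own HR sub-patches, which keeps the cost linear in the number of LR tokens):

```python
import torch
import torch.nn as nn

class PatchInfoMining(nn.Module):
    """Sketch: each low-res token queries high-res features for detail,
    then an MLP projects the mined tokens into the LLM embedding space."""

    def __init__(self, lr_dim=1024, hr_dim=1536, llm_dim=4096):
        super().__init__()
        self.q_proj = nn.Linear(lr_dim, lr_dim)  # queries from CLIP ViT-L tokens
        self.k_proj = nn.Linear(hr_dim, lr_dim)  # keys from ConvNeXt-L features
        self.v_proj = nn.Linear(hr_dim, lr_dim)  # values from ConvNeXt-L features
        self.mlp = nn.Sequential(                # projection + MLP into LLM space
            nn.Linear(lr_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, lr_tokens, hr_tokens):
        # lr_tokens: (B, N, lr_dim); hr_tokens: (B, M, hr_dim)
        q = self.q_proj(lr_tokens)
        k = self.k_proj(hr_tokens)
        v = self.v_proj(hr_tokens)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        mined = lr_tokens + attn @ v  # residual: LR token + mined HR detail
        return self.mlp(mined)        # (B, N, llm_dim) visual tokens for the LLM
```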
- data
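The data recipe from the TL;DR as a config-style summary (names and counts copied from above; the dict layout and key names are mine):

```python
# Two-stage data recipe; values are sample counts.
DATA_MIXTURE = {
    "alignment": {
        "cc3m_llava_filtered": 558_000,
        "allava_caption": 695_000,
    },
    "instruction": {
        "llava_mix_wo_textcaps": 643_000,
        "sharegpt4v": 100_000,
        "laion_gpt4v": 10_000,
        "allava_instruct": 700_000,
        "text_only_lima_openassistant": 5_000,
        "ocr_docvqa_chartqa_dvqa_ai2d": 28_000,  # 10K + 4K + 10K + 4K
        "generation_related": 13_000,
    },
}
```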
- Result
- ablation
- qualitative examples
- play with demo