
paper, repo, demo

TL;DR

  • I read this because : an LVLM paper after a long time; I was curious about the data curation.
  • task : Vision Language Model
  • problem : I don’t have a strong sense of subject matter… my goal is to create a VLM that performs well enough to be taught in an academic setting.
  • problem : I don’t have a strong sense of a single sharply defined problem here… the goal is a VLM that performs well while still being trainable in an academic setting.
  • idea : 1) to handle resolution efficiently, cross-attend (CA) between high- and low-resolution features to mine information, 2) curate the training data carefully, 3) support generation by calling Stable Diffusion mid-pipeline
  • input/output : {image, Q} -> {A} (optionally calling Stable Diffusion depending on the answer)
  • architecture : CLIP ViT-L (low resolution) + ConvNeXt-L (high resolution) + patch info mining layer (projection + MLP) + LLM (Gemma-2B, Vicuna-7B/13B, Mixtral-8x7B, Hermes-2-Yi-34B)
  • objective : cross-entropy (CE) loss
  • baseline : (normal resolution) MobileVLM, InstructBLIP, Qwen-VL, Shikra, IDEFICS-80B, LLaMA-VID, LLaVA-1.5 (high resolution) OtterHD, CogVLM-chat, LLaVA-NeXT, (private models) Gemini Pro, Qwen-VL-Plus, GPT-4V
  • data : (alignment) 558K from CC3M filtered by LLaVA, 695K ALLaVA; (instruction) 643K LLaVA (excluding TextCaps), 100K from ShareGPT4V, 10K LAION-GPT4V, 700K ALLaVA, 5K text-only multi-turn from LIMA and OpenAssistant, 28K OCR-related (10K DocVQA, 4K ChartQA, 10K DVQA, 4K AI2D); (generation-related instruction) 13K constructed
  • evaluation : TextVQA, MMB, MME, MM-Vet, MMMU, MathVista
  • result : strong performance across the reported benchmarks
  • contribution : the information merging (patch info mining) is interesting; solid data curation; the data ablations carry a lot of weight.
  • etc. :
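The input/output above, where the model's answer may optionally trigger Stable Diffusion, can be sketched as a tiny router. The `<gen>` marker and the `route_answer` helper are hypothetical names for illustration only; the paper's actual mechanism (an LLM trained to emit generation prompts) is only approximated here:

```python
GEN_TAG = "<gen>"  # hypothetical marker; stands in for however the LLM flags a generation request

def route_answer(answer: str):
    """Split a model answer into display text and an optional generation prompt.

    If the answer carries the generation tag, the enclosed prompt would be
    forwarded to a Stable Diffusion pipeline; otherwise the text is returned
    as-is with no image generation.
    """
    if GEN_TAG in answer:
        text, prompt = answer.split(GEN_TAG, 1)
        return text.strip(), prompt.strip()  # prompt -> SD pipeline
    return answer.strip(), None

print(route_answer("Here you go. <gen> a cat wearing a red hat"))
# -> ('Here you go.', 'a cat wearing a red hat')
```

The point of the sketch is just the control flow: answering stays a plain text path, and generation is a side effect routed off the answer rather than a separate head.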

Details

  • thumbnail image

architecture

  • overall framework image

  • proposed patch info mining image

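The patch info mining above, where each low-resolution visual token cross-attends over the high-resolution feature map to pull in fine detail, can be sketched in numpy. The shapes, random projection weights, and residual connection here are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def patch_info_mining(low_res_tokens, high_res_tokens, d):
    """Cross-attention from low-res queries to high-res keys/values.

    low_res_tokens:  (N, d) visual tokens from the low-res encoder (CLIP ViT-L)
    high_res_tokens: (M, d) features from the high-res encoder (ConvNeXt-L)
    Returns (N, d): low-res tokens enriched with mined high-res detail.
    """
    # Hypothetical fixed projections; in the paper these would be learned.
    rng = np.random.default_rng(0)
    Wq = rng.standard_normal((d, d)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d)) / np.sqrt(d)

    Q = low_res_tokens @ Wq            # (N, d) queries
    K = high_res_tokens @ Wk           # (M, d) keys
    V = high_res_tokens @ Wv           # (M, d) values
    attn = softmax(Q @ K.T / np.sqrt(d))  # (N, M) attention weights
    mined = attn @ V                   # (N, d) detail pulled from high-res map
    # Residual: enrich the low-res tokens rather than replace them.
    return low_res_tokens + mined

out = patch_info_mining(np.zeros((4, 8)), np.ones((16, 8)), 8)
print(out.shape)  # (4, 8)
```

The enriched tokens would then be concatenated with the text embeddings and fed to the LLM; the mining step keeps the LLM's visual token count at the low-resolution budget while still exposing high-resolution information.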

data

image

Result

image

  • ablation image


  • qualitative examples image


  • play with demo image
