
paper, repo, demo

TL;DR

  • I read this because : an LVLM paper after a long time; I was curious about the data curation.
  • task : Vision Language Model
  • problem : I don’t have a strong sense of subject matter… my goal is to create a VLM that performs well enough to be taught in an academic setting.
  • problem : I don’t have a strong sense of a single sharply defined problem here… the goal is a VLM that performs well while still being trainable in an academic setting.
  • idea : 1) to handle resolution efficiently, cross-attend (CA) between high- and low-resolution features to mine information, 2) curate the training data carefully, 3) support generation by calling Stable Diffusion mid-pipeline
  • input/output : {image, Q} -> {A} (optionally calling Stable Diffusion depending on the answer)
  • architecture : CLIP ViT-L (low resolution) + ConvNeXt-L (high resolution) + patch info mining layer (projection + MLP) + LLM (Gemma-2B, Vicuna-7B/13B, Mixtral-8x7B, Hermes-2-Yi-34B)
  • objective : cross-entropy (CE) loss
  • baseline : (normal resolution) MobileVLM, InstructBLIP, Qwen-VL, Shikra, IDEFICS-80B, LLaMA-VID, LLaVA-1.5 (high resolution) OtterHD, CogVLM-chat, LLaVA-NeXT, (private models) Gemini Pro, Qwen-VL-Plus, GPT-4V
  • data : (alignment) 558K from CC3M filtered by LLaVA, 695K ALLaVA; (instruction) 643K LLaVA (excluding TextCaps), 100K from ShareGPT4V, 10K LAION-GPT4V, 700K ALLaVA, 5K text-only multi-turn from LIMA and OpenAssistant, 28K OCR-related (10K DocVQA, 4K ChartQA, 10K DVQA, 4K AI2D); (generation-related instruction) 13K constructed
  • evaluation : TextVQA, MMB, MME, MM-Vet, MMMU, MathVista
  • result : strong performance across the reported benchmarks
  • contribution : the information merging (patch info mining) is interesting; solid data curation; the data ablations carry a lot of weight.
  • etc. :
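The input/output above, where the model's answer may optionally trigger Stable Diffusion, can be sketched as a tiny router. The `<gen>` marker and the `route_answer` helper are hypothetical names for illustration only; the paper's actual mechanism (an LLM trained to emit generation prompts) is only approximated here:

```python
GEN_TAG = "<gen>"  # hypothetical marker; stands in for however the LLM flags a generation request

def route_answer(answer: str):
    """Split a model answer into display text and an optional generation prompt.

    If the answer carries the generation tag, the enclosed prompt would be
    forwarded to a Stable Diffusion pipeline; otherwise the text is returned
    as-is with no image generation.
    """
    if GEN_TAG in answer:
        text, prompt = answer.split(GEN_TAG, 1)
        return text.strip(), prompt.strip()  # prompt -> SD pipeline
    return answer.strip(), None

print(route_answer("Here you go. <gen> a cat wearing a red hat"))
# -> ('Here you go.', 'a cat wearing a red hat')
```

The point of the sketch is just the control flow: answering stays a plain text path, and generation is a side effect routed off the answer rather than a separate head.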

Details

  • thumbnail image

architecture

  • overall framework image

  • proposed patch info mining image

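The patch info mining above, where each low-resolution visual token cross-attends over the high-resolution feature map to pull in fine detail, can be sketched in numpy. The shapes, random projection weights, and residual connection here are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def patch_info_mining(low_res_tokens, high_res_tokens, d):
    """Cross-attention from low-res queries to high-res keys/values.

    low_res_tokens:  (N, d) visual tokens from the low-res encoder (CLIP ViT-L)
    high_res_tokens: (M, d) features from the high-res encoder (ConvNeXt-L)
    Returns (N, d): low-res tokens enriched with mined high-res detail.
    """
    # Hypothetical fixed projections; in the paper these would be learned.
    rng = np.random.default_rng(0)
    Wq = rng.standard_normal((d, d)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d)) / np.sqrt(d)

    Q = low_res_tokens @ Wq            # (N, d) queries
    K = high_res_tokens @ Wk           # (M, d) keys
    V = high_res_tokens @ Wv           # (M, d) values
    attn = softmax(Q @ K.T / np.sqrt(d))  # (N, M) attention weights
    mined = attn @ V                   # (N, d) detail pulled from high-res map
    # Residual: enrich the low-res tokens rather than replace them.
    return low_res_tokens + mined

out = patch_info_mining(np.zeros((4, 8)), np.ones((16, 8)), 8)
print(out.shape)  # (4, 8)
```

The enriched tokens would then be concatenated with the text embeddings and fed to the LLM; the mining step keeps the LLM's visual token count at the low-resolution budget while still exposing high-resolution information.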

data

image

Result

image

  • ablation image


  • qualitative examples image


  • play with demo image
