paper, repo, demo

TL;DR

  • I read this because : it has been a while since I looked at an LVLM, and I was curious about the data curation
  • task : Vision Language Model
  • problem : no single striking thesis; the goal is a VLM with strong performance that can still be trained in an academic setting
  • idea : 1) to handle resolution efficiently, cross-attend (CA) low-resolution and high-resolution features to mine information 2) careful data curation 3) hook in Stable Diffusion so the model can also handle generation requests
  • input/output : {image, Q} -> {A} (optionally calls Stable Diffusion depending on the answer)
  • architecture : CLIP ViT-L (for low resolution) + ConvNeXt-L (for high resolution) + patch info mining layer (projection and MLP) + LLM (Gemma-2B, Vicuna-7B/13B, Mixtral-8x7B, Hermes-2-Yi-34B)
  • objective : CE loss
  • baseline : (normal resolution) MobileVLM, InstructBLIP, Qwen-VL, Shikra, IDEFICS-80B, LLaMA-VID, LLaVA-1.5 (high resolution) OtterHD, CogVLM-chat, LLaVA-NeXT, (private models) Gemini Pro, Qwen-VL-Plus, GPT-4V
  • data : (alignment) 558K from CC3M filtered by LLaVA, 695K ALLaVA; (instruction) 643K LLaVA (except TextCaps), 100K from ShareGPT4V, 10K LAION-GPT4V, 700K ALLaVA, 5K text-only multiturn from LIMA and OpenAssistant, 28K OCR-related (10K DocVQA, 4K ChartQA, 10K DVQA, 4K AI2D), plus 13K newly constructed generation-related instructions
  • evaluation : TextVQA, MMB, MME, MM-Vet, MMMU, MathVista
  • result : strong performance across the evaluated benchmarks
  • contribution : the patch info mining (information merge) is interesting; good data curation; extensive data ablations
  • etc. :
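The data mixture above can be tallied quickly. The dataset names and counts come from the notes above; the grouping and the totals are my own arithmetic:

```python
# Tally of the training data mixture (counts in thousands).
# Dataset names are from the notes; grouping/totals are my own.
alignment = {
    "CC3M (LLaVA-filtered)": 558,
    "ALLaVA": 695,
}
instruction = {
    "LLaVA (except TextCaps)": 643,
    "ShareGPT4V": 100,
    "LAION-GPT4V": 10,
    "ALLaVA": 700,
    "LIMA + OpenAssistant (text-only multiturn)": 5,
    "OCR (DocVQA, ChartQA, DVQA, AI2D)": 28,
    "generation-related instructions": 13,
}

print(f"alignment total:   {sum(alignment.values())}K")    # 1253K, ~1.2M
print(f"instruction total: {sum(instruction.values())}K")  # 1499K, ~1.5M
```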

Details

  • thumbnail image

architecture

  • overall framework image

  • proposed patch info mining image

image
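As I read the figure, patch info mining is a cross-attention where each low-resolution visual token acts as the query and the high-resolution pixels of its corresponding sub-region act as keys/values. A minimal numpy sketch with toy dimensions and random weights; `patch_info_mining` and all shapes here are my assumptions, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def patch_info_mining(q_low, kv_high, w_q, w_k, w_v):
    """Each low-res token (query) attends only to the M high-res
    pixels of its own sub-region (keys/values)."""
    q = q_low @ w_q                                   # (N, d)
    k = kv_high @ w_k                                 # (N, M, d)
    v = kv_high @ w_v                                 # (N, M, d)
    scores = np.einsum("nd,nmd->nm", q, k) / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)                   # (N, M)
    return np.einsum("nm,nmd->nd", attn, v)           # (N, d) mined tokens

# Toy sizes; the real model uses ViT-L / ConvNeXt-L feature dims.
N, M, C, d = 576, 4, 64, 64
q_low = rng.standard_normal((N, C))        # low-res visual tokens
kv_high = rng.standard_normal((N, M, C))   # high-res sub-region per token
w_q, w_k, w_v = (rng.standard_normal((C, d)) * 0.1 for _ in range(3))

out = patch_info_mining(q_low, kv_high, w_q, w_k, w_v)
print(out.shape)  # (576, 64)
```

The mined tokens then go through the projection/MLP and are fed to the LLM together with the text tokens.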

data

image

Result

image

  • ablation image

image

  • qualitative examples image

image

  • play with demo image

image image

image image