TL;DR
- I read this because : it has been a while since I read an LVLM paper, and I was curious about the data curation
- task : Vision Language Model
- problem : the authors make no grand claims; the goal is a VLM with strong performance that can still be trained in an academic setting
- idea : 1) to handle resolution efficiently, cross-attend (CA) low-resolution and high-resolution features to mine visual information (sketched under "proposed patch info mining" below) 2) careful data curation 3) hook in Stable Diffusion so the model can perform image generation mid-conversation
- input/output : {image, Q} -> {A} (optionally calls SD depending on the answer; see the sketch right after this list)
- architecture : CLIP ViT-L (for low resolution) + ConvNeXt-L (for high resolution) + mining layer (projection and MLP) + LLM (Gemma-2B, Vicuna-7B/13B, Mixtral-8x7B, Hermes-2-Yi-34B)
- objective : CE loss
- baseline : (normal resolution) MobileVLM, InstructBLIP, Qwen-VL, Shikra, IDEFICS-80B, LLaMA-VID, LLaVA-1.5; (high resolution) OtterHD, CogVLM-chat, LLaVA-NeXT; (proprietary models) Gemini Pro, Qwen-VL-Plus, GPT-4V
- data : (alignment) 558K from CC3M filtered by LLaVA, 695K ALLaVA; (instruction) 643K LLaVA (excluding TextCaps), 100K from ShareGPT4V, 10K LAION-GPT4V, 700K ALLaVA, 5K text-only multi-turn from LIMA and OpenAssistant, 28K OCR-related (10K DocVQA, 4K ChartQA, 10K DVQA, 4K AI2D), plus 13K newly built generation-related instructions
- evaluation : TextVQA, MMB, MME, MM-Vet, MMMU, MathVista
- result : strong performance across the given benchmarks
- contribution : the information-merging scheme seems neat; solid data curation; lots of data ablations
- etc. :
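A rough sketch of the inference flow above: the VLM answers normally, and when the answer carries a generation prompt, that prompt is forwarded to Stable Diffusion. The `<gen>` delimiter, the `vlm.generate` call, and the checkpoint name are my assumptions for illustration, not the paper's actual interface.

```python
import re
from diffusers import StableDiffusionPipeline

GEN_TAG = re.compile(r"<gen>(.*?)</gen>", re.DOTALL)  # hypothetical delimiter

# hypothetical checkpoint; the note only says a Stable Diffusion model is called
sd_pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

def answer(vlm, image, question):
    """{image, Q} -> {A}; optionally calls SD when the answer embeds a prompt."""
    text = vlm.generate(image, question)  # hypothetical VLM API
    match = GEN_TAG.search(text)
    if match is None:
        return text, None  # plain QA answer
    # generation request: hand the embedded prompt to Stable Diffusion
    return text, sd_pipe(prompt=match.group(1)).images[0]
```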
Details
- thumbnail
- architecture
  - overall framework
  - proposed patch info mining
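The mining layer is essentially cross-attention where the low-resolution tokens are queries and the high-resolution features are keys/values, followed by a projection/MLP into the LLM embedding space. A minimal PyTorch sketch; the dimensions, residual, and global attention are my simplifications (as I understand it, the paper restricts each LR query to its own HR sub-patches, which keeps the cost linear in the number of LR tokens):

```python
import torch
import torch.nn as nn

class PatchInfoMining(nn.Module):
    """Sketch: each low-res token queries high-res features for detail,
    then an MLP projects the mined tokens into the LLM embedding space."""

    def __init__(self, lr_dim=1024, hr_dim=1536, llm_dim=4096):
        super().__init__()
        self.q_proj = nn.Linear(lr_dim, lr_dim)  # queries from CLIP ViT-L tokens
        self.k_proj = nn.Linear(hr_dim, lr_dim)  # keys from ConvNeXt-L features
        self.v_proj = nn.Linear(hr_dim, lr_dim)  # values from ConvNeXt-L features
        self.mlp = nn.Sequential(                # projection + MLP into LLM space
            nn.Linear(lr_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, lr_tokens, hr_tokens):
        # lr_tokens: (B, N, lr_dim); hr_tokens: (B, M, hr_dim)
        q = self.q_proj(lr_tokens)
        k = self.k_proj(hr_tokens)
        v = self.v_proj(hr_tokens)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        mined = lr_tokens + attn @ v  # residual: LR token + mined HR detail
        return self.mlp(mined)        # (B, N, llm_dim) visual tokens for the LLM
```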
- data
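The data recipe from the TL;DR as a config-style summary (names and counts copied from above; the dict layout and key names are mine):

```python
# Two-stage data recipe; values are sample counts.
DATA_MIXTURE = {
    "alignment": {
        "cc3m_llava_filtered": 558_000,
        "allava_caption": 695_000,
    },
    "instruction": {
        "llava_mix_wo_textcaps": 643_000,
        "sharegpt4v": 100_000,
        "laion_gpt4v": 10_000,
        "allava_instruct": 700_000,
        "text_only_lima_openassistant": 5_000,
        "ocr_docvqa_chartqa_dvqa_ai2d": 28_000,  # 10K + 4K + 10K + 4K
        "generation_related": 13_000,
    },
}
```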
- Result
- ablation
- qualitative examples
- play with demo