image

paper

TL;DR

  • I read this because.. : very recent VLM
  • task : VLM + LLM
  • problem : most multi-modal work freezes the LLM and effectively optimizes only V+L; the goal here is a model that is strong at both V and L
  • idea : overall BLIP-2 style. The difference is that the LLM uses modality-specific $W_K$, $W_V$, and LayerNorm, and the LLM itself is also tuned.
  • input/output : text + image -> text
  • architecture : CLIP ViT-L/14 + vision abstractor (= Q-former) + LLaMA-2 w/ Modality-Adaptive Module (MAM)
  • objective : CE loss
  • baseline : 7B-LLM-based models: BLIP-2, MiniGPT-4, LLaVA, mPLUG-Owl, InstructBLIP, Otter, Qwen-VL-Chat, LLaVA-1.5
  • data : 400M samples from {CC3M/12M, COCO, COYO, LAION-en, DataComp} for pretraining / {captioning (TextCaps, COCO), VQA (VQAv2, OKVQA, OCR-VQA, GQA, A-OKVQA), region-aware (RefCOCO, VisualGenome), multi-modal instruction (LLaVA-Instruct-150K), text-only instruction data (ShareGPT-80K, SlimOrca)}
  • evaluation : captioning / VQA / multi-modal benchmarks (MME, MMBench, MM-Vet, SEED-Bench, Q-Bench) / text benchmarks (MMLU, BBH, AGIEval, ARC-c, ARC-e)
  • result : near-SOTA on almost everything among 7B models. Also uses textual instruction data, and thanks to MAM it even beats LLaMA-2 on pure-text benchmarks
  • contribution : possibly the first VLM that also improves text-only performance?
  • etc. : Alibaba seems to have plenty of money..

Details

image

Architecture

image
  • The Vision Abstractor is essentially a Q-former.
  • The Modality-Adaptive Module makes the key/value weights and LayerNorm depend on the input modality, while the query weight is shared. The image-side weights are newly initialized, so they are what gets trained during step-1 pretraining.
  • Training has two stages:
  1. Pre-training uses {CC3M/12M, COCO, COYO, LAION-en, DataComp} to train the vision encoder, the Q-former, and the newly initialized parts of the language decoder. The comparison with BLIP-2 is interesting: BLIP-2 takes a CLIP ViT and keeps the vision encoder frozen, and trains on re-captioned data from similar sources (CapFilt). Here the vision encoder is NOT frozen, and the relatively noisy alt-text is used as-is! In a sense, this re-learns the kind of data CLIP saw, but in a generative form.
  2. Joint instruction tuning unfreezes everything and trains on instruction data only. The distinctive point is that text-only instruction data is mixed in as well.
image
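The MAM described above can be sketched as follows. This is a minimal, hypothetical simplification (single head, no causal mask, made-up class and argument names): the query projection is shared, while the key/value projections and LayerNorm are selected per token based on its modality.

```python
import torch
import torch.nn as nn

class ModalityAdaptiveAttention(nn.Module):
    """Sketch of a Modality-Adaptive attention layer: shared W_Q,
    modality-specific W_K / W_V / LayerNorm, as in the MAM idea."""
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim)  # shared across modalities
        self.w_k = nn.ModuleDict({"text": nn.Linear(dim, dim),
                                  "image": nn.Linear(dim, dim)})
        self.w_v = nn.ModuleDict({"text": nn.Linear(dim, dim),
                                  "image": nn.Linear(dim, dim)})
        self.norm = nn.ModuleDict({"text": nn.LayerNorm(dim),
                                   "image": nn.LayerNorm(dim)})

    def forward(self, x, image_mask):
        # x: (seq, dim); image_mask: (seq,) bool, True for image tokens
        h = torch.empty_like(x)
        k = torch.empty_like(x)
        v = torch.empty_like(x)
        for m, sel in (("image", image_mask), ("text", ~image_mask)):
            h[sel] = self.norm[m](x[sel])   # modality-specific norm
            k[sel] = self.w_k[m](h[sel])    # modality-specific key
            v[sel] = self.w_v[m](h[sel])    # modality-specific value
        q = self.w_q(h)                     # shared query
        attn = torch.softmax(q @ k.T / x.shape[-1] ** 0.5, dim=-1)
        return attn @ v
```

Only the image-side entries of the `ModuleDict`s are new parameters; the text side can be initialized from the pretrained LLM weights, which is why only the image branch needs training in stage 1.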

๋‘ ๋‹จ๊ณ„์—์„œ ๋‹ฌ๋ผ์ง€๋Š”๊ฑฐ resolution / LLM seq len

Result

  • captioning, VQA / multi-modal benchmarks image

  • pure-text benchmarks image

They claim this is thanks to MAM image

  • effect of using instruction data from both modalities + effect of MAM
image

Using only text instruction data hurts multi-modal performance, and using only multi-modal instruction data hurts text performance. Using both is slightly worse than each specialist on its own benchmark, but adding MAM improves both.

  • effect of freezing the vision encoder image

  • num queries image

TextVQA needs a lot of queries.

  • resolution image

Higher resolution helps TextVQA overwhelmingly lol

Qualitative Result

image

MAM ๋•๋ถ„์— ์ดˆ๊ธฐ ๋ ˆ์ด์–ด์—” ํ…์ŠคํŠธ, ํ›„๋ฐ˜ ๋ ˆ์ด์–ด์—” ์ด๋ฏธ์ง€๋ฅผ ๋ณธ๋‹ค๊ณ  ์ฃผ์žฅ -> ๋ญ๊ฐ€ ์ข‹์€๊ฑด์ง€ ์ž˜(?)

image

Given an unrelated image and text, they describe the model with MAM as focusing on the text. Both outputs look wrong to me, but at least the MAM one does list 7 items lol