
paper: Improved Baselines with Visual Instruction Tuning (LLaVA-1.5)

see the LLaVA note here: https://github.com/long8v/PTIR/issues/128#issue-1749571159

TL;DR

  • I read this because.. : a.k.a. LLaVA-1.5; I ended up here because ShareGPT4V said it followed the LLaVA-1.5 recipe
  • task : LVLM
  • problem : LLaVA is strong at reasoning and real-world instruction following, but underperforms on benchmarks; let's improve that
  • idea : scale up several components / for short-answer tasks like VQA, give the model a better prompt!
  • input/output : image + question -> answer
  • architecture : CLIP ViT-L/14 (336px resolution) + Vicuna-13B, connected by a 2-layer MLP projector (sketched after this list)
  • objective : cross-entropy (autoregressive LM) loss
  • baseline : LLaVA, Qwen-VL, Shikra, BLIP-2, IDEFICS, InstructBLIP
  • data : (alignment) LCS-558K (LAION-CC-SBU with BLIP captions) / (end-to-end finetuning) LLaVA instruction data + VQA (OKVQA, A-OKVQA), OCR (OCRVQA, TextCaps), region-level VQA (Visual Genome, RefCOCO)
  • evaluation : VQAv2, GQA, VizWiz, SQA, TextVQA, POPE, MME, MMBench, MM-Vet, …
  • result : adding VQA-style data to finetuning improves results; the format prompt improves results; a 2-layer MLP instead of a linear projector improves results; higher resolution improves results; adding diverse data such as ShareGPT improves results (ShareGPT also brings multilingual ability)
  • contribution : remarkable performance with modest resources and open data only.
  • etc. :
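A minimal sketch of the connector mentioned in the architecture bullet: frozen CLIP ViT-L/14 patch features are mapped into the LLM embedding space by a 2-layer MLP (`mlp2x_gelu` in the official repo), replacing LLaVA-1.0's single linear layer. The hidden sizes below (1024 for ViT-L, 5120 for a 13B LLaMA-family LLM) are standard values, not figures from this note.

```python
# Sketch of the LLaVA-1.5 vision-language connector (mlp2x_gelu):
# Linear -> GELU -> Linear, trained while the vision encoder stays frozen.
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 5120):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(patch_features)

# e.g. the 576 patch tokens from a 336px image become 576 LLM-space embeddings
tokens = VisionProjector()(torch.randn(1, 576, 1024))
print(tokens.shape)  # torch.Size([1, 576, 5120])
```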

Details

Contribution


์ตœ์†Œํ•œ์˜ tuning(1.2M scale์˜ public data๋กœ 8 A100 days๋กœ ๋๋‚˜๋Š”)์œผ๋กœ ์ข‹์€ ์„ฑ๋Šฅ

Dataset

  • alignment learning : LCS-558K (LAION-CC-SBU with BLIP captions). At some point there was a variant called llava-lightning, apparently meant to converge faster. Per https://github.com/haotian-liu/LLaVA/issues/86#issuecomment-1533346022, the subset roughly matches CC in sample count but has much larger concept coverage, which speeds up convergence. Though CC captions and BLIP captions should differ a lot in text style.. isn't this a subtle, easy-to-miss trick for boosting benchmarks? Too bad LLaVA-1.5 doesn't report conversation performance; wouldn't it have come out much lower?

  • end-to-end finetuning : LLaVA instruction data + VQA (OKVQA, A-OKVQA), OCR (OCRVQA, TextCaps), region-level VQA (Visual Genome, RefCOCO); sketched below. I didn't know Visual Genome had VQA annotations.. https://paperswithcode.com/dataset/visual-genome
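The finetuning mixture above, written out as a config dict for orientation; the dataset names come from this note, while the keys and grouping are my own illustrative labels.

```python
# Hypothetical grouping of the LLaVA-1.5 finetuning mixture; names are from
# the note above, structure is illustrative only.
FINETUNE_MIXTURE = {
    "conversation": ["LLaVA instruction data"],
    "short_vqa": ["VQAv2", "OKVQA", "A-OKVQA"],    # gets the response format prompt
    "ocr": ["OCRVQA", "TextCaps"],
    "region_level_vqa": ["Visual Genome", "RefCOCO"],
    "language_only": ["ShareGPT"],                  # text-only; brings multilingual ability
}
```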

Improved baseline of LLaVA

  • LLaVA๊ฐ€ ๋ฒค์น˜๋งˆํฌ์—์„œ ์„ฑ๋Šฅ์ด ์•ˆ์ข‹์•˜๋˜ ์ด์œ  VQA๋Š” ๋‹จ๋‹ต์œผ๋กœ ํ•œ ๋‘ ๋‹จ์–ด๋กœ ๋๋‚ด์•ผ ํ•˜๋Š”๋ฐ LLaVA๋Š” ๊ทธ๋Ÿฐ ์‹์œผ๋กœ ํ•™์Šต๋˜์ง€ ์•Š์Œ / ๋ฐ์ดํ„ฐ๋ฅผ ์กฐ๊ธˆ ๋ด„ -> “response formatting format” VQAv2 ๊ฐ™์€ ๊ฑธ ๋„ฃ์„ ๋•Œ Q: {Question} A: {Answer} ๋Œ€์‹  Answer the question using a single word or phrase๋ผ๊ณ  prompt๋ฅผ ์คŒ. ์ด๋ ‡๊ฒŒ ํ•ด์„œ ๋‹จ์ˆœํžˆ VQAv2๋ฅผ training data์— ๋„ฃ์œผ๋‹ˆ๊นŒ ํŠนํžˆ MME๋ผ๋Š” ๋ฒค์น˜๋งˆํฌ์—์„œ ์„ฑ๋Šฅ์ด 2๋ฐฐ๊ฐ€ ๋จ 502 -> 1197
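A minimal sketch of that response format prompting, assuming a chat-style training record; the prompt string is the one quoted above, while build_vqa_turn is a hypothetical helper, not code from the paper.

```python
# Response format prompting: suffix short-answer VQA questions with an
# explicit output-format instruction so the model only answers tersely
# when asked to.
FORMAT_PROMPT = "Answer the question using a single word or phrase."

def build_vqa_turn(question: str, answer: str) -> list[dict]:
    """Turn one VQAv2-style QA pair into a training example whose question
    explicitly asks for a short answer, instead of the bare
    'Q: {question} A: {answer}' template."""
    return [
        {"role": "user", "content": f"<image>\n{question} {FORMAT_PROMPT}"},
        {"role": "assistant", "content": answer},
    ]

print(build_vqa_turn("What color is the bus?", "yellow"))
```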

Result / Ability

  • (figure) the original LLaVA answers strangely
  • answers well even about unrelated images

  • can extract JSON! (OCR ability; sketch below)
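A hedged sketch of trying that JSON extraction with the community HF port of LLaVA-1.5; the llava-hf/llava-1.5-7b-hf checkpoint, the USER/ASSISTANT prompt template, and receipt.png are assumptions about that port, not details from this note.

```python
# Asking LLaVA-1.5 (HF port) to read an image and answer in JSON.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("receipt.png")
prompt = (
    "USER: <image>\nRead this receipt and answer with a JSON object "
    "with keys 'vendor', 'date', and 'total'. ASSISTANT:"
)
inputs = processor(images=image, text=prompt, return_tensors="pt")
inputs = inputs.to(model.device, torch.float16)  # cast float tensors to fp16

out = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(out[0], skip_special_tokens=True))
```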

  • zs multi-lingual : presumably because ShareGPT (https://sharegpt.com/) data was used, it follows multilingual instructions. ShareGPT is a platform where users upload their own ChatGPT conversations; it is probably language-only. Notably, on MMBench-CN it beat Qwen-VL-Chat, which actually used Chinese instruction data (surprising)

  • computational cost : 6 hours for pretraining / 20 hours for visual instruction tuning on 8×A100s

  • limitations

  1. the image sequence length grows with resolution (see the token-count sketch after this list). A Q-Former sidesteps this, but it seems to converge slowly; research on training Q-Formers efficiently is needed
  2. cannot handle multiple images; there is no such data
  3. still limited to its target domains
  4. hallucination remains
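Quick arithmetic behind limitation 1, as a sketch: a plain ViT-L/14 with no resampler emits one token per 14×14 patch, so the visual sequence grows quadratically with resolution (num_image_tokens is a hypothetical helper).

```python
# Visual sequence length for a plain ViT with square images and no resampler.
def num_image_tokens(resolution: int, patch_size: int = 14) -> int:
    grid = resolution // patch_size
    return grid * grid

print(num_image_tokens(224))  # 16 * 16 = 256 tokens
print(num_image_tokens(336))  # 24 * 24 = 576 tokens fed to the LLM
```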