
paper, page

TL;DR

  • I read this because.. : a model trained on data generated with GPT4-V
  • task : VLM
  • problem : instruction data is too noisy
  • idea : collect data with GPT4-V, then train a captioner on it and use the resulting captions for alignment
  • input/output : image -> (API call) -> GPT4-V caption => trained LLaVA-1.5 style
  • architecture : LLaVA-1.5
  • objective : CE loss
  • baseline : to isolate the effect of the data, trained LLaVA-7B / LLaVA-1.5-7B(13B) / Qwen-VL-Chat-7B with the data added; also took the LLaVA-1.5 architecture as-is, tweaked a few training details, and after pretraining then finetuning reached SOTA in every case
  • data : image={LAION-400M, COCO, SBU, SAM, TextCaps}, text={GPT4-V call}
  • evaluation : SEED, VizWiz, VQA-v2, SQA, QBench, MM-Vet, MMBench-CN, MMBench, MME_cog, MME_per, LLaVA-Bench
  • result : SOTA
  • contribution : data released, model released; emphasizes that data matters more than architecture
  • etc. :

Details

  • thumbnail (caption example / performance) (figure)

  • caption style / errors (figure)

Data

  • dataset statistics (figure)

etc: SAM, TextCaps, WikiArt + 1K images from web-crawled data (split evenly between images of landmarks and images of celebrities) (apparently crawled in addition)

  • data collection (figures)

๋ฐ์ดํ„ฐ ์ข…๋ฅ˜๋ณ„๋กœ prompt๋ฅผ ๋‹ค๋ฅด๊ฒŒ ์คฌ๋‹ค๊ณ  ํ•จ image

์ด๋ ‡๊ฒŒ 100K์ˆ˜์ง‘

  • ShareGPT4V-PT: they separately built a model called ShareCaptioner and used it to create a 1.2M-sample dataset.
    Reportedly took 44 A100 GPU-days. Since there is no information about this model, could it be the same as the ShareGPT4V-7B model? There is no information on any additional filtering either.

์ด๋•Œ ์‚ฌ์šฉํ•œ ๋ฐ์ดํ„ฐ image

3๊ฐœ์— ๋Œ€ํ•œ human evaluation image

more analysis (figures)

  • model performance gains from this dataset (figure)

๊ณต์ •ํ•œ ๋น„๊ต๋ฅผ ์œ„ํ•ด ์›๋ž˜ ์Ÿค๋„ค ํ•™์Šตํ•  ๋•Œ ์žˆ์—ˆ๋˜ data recipe ์ค‘์— ‘detailed caption’์— ํ•ด๋‹นํ•˜๋Š” 100K์˜ ๋ฐ์ดํ„ฐ๋ฅผ ๋นผ๊ณ  ์ด ๋ฐ์ดํ„ฐ๋ฅผ ๋„ฃ์Œ

ShareGPT4V-7B model

  • LLaVA-1.5
  • ViT-L/14 336x336 / Vicuna-v1.5 7B
  • training
    • pretraining:
      • w/ ShareGPT4V-PT
      • image encoder(latter half๋งŒ ํ•™์Šต) + projector + llm all finetune
      • bs 256 / 4700 steps
    • supervised finetuning:
      • LLaVA์—์„œ detailed caption 23k๊ฐ€ ๋“ค์–ด๊ฐ€๋Š”๋ฐ ์ด๊ฑธ ShareGPT4V์—์„œ ์ƒ˜ํ”Œ๋งํ•ด์„œ ์‚ฌ์šฉ
      • vision encoder freeze / projector์™€ llm finetune
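The two freezing schedules above can be sketched in PyTorch. The module handles here are placeholders (ViT-L/14 has 24 transformer blocks), not LLaVA's actual code:

```python
import torch.nn as nn

def set_requires_grad(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

def freeze_for_pretraining(vision_blocks, projector, llm) -> None:
    """ShareGPT4V-PT stage: train only the latter half of the vision
    encoder's transformer blocks, plus the projector and the LLM."""
    cut = len(vision_blocks) // 2
    for i, block in enumerate(vision_blocks):
        set_requires_grad(block, i >= cut)
    set_requires_grad(projector, True)
    set_requires_grad(llm, True)

def freeze_for_sft(vision_blocks, projector, llm) -> None:
    """SFT stage: vision encoder fully frozen; projector and LLM trained."""
    for block in vision_blocks:
        set_requires_grad(block, False)
    set_requires_grad(projector, True)
    set_requires_grad(llm, True)
```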

Ablations

๊ฐ ๋ฐ์ดํ„ฐ๋ฅผ ๋„ฃ์–ด์„œ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์˜ ํšจ๊ณผ

image

latter half๋งŒ ํ•™์Šตํ•œ ๊ฒƒ์˜ ํšจ๊ณผ image

image