
paper, code

TL;DR

  • I read this because.. : aka NoC. It looked like it would have a solid analysis of CLIP scores.
  • task : captioning with noisy image-text labels
  • problem : datasets like COCO and Visual Genome are not scalable. Using web-crawled pairs instead can be noisy, and filtering them by CLIP score throws away a large fraction of the data.
  • idea : bin the CLIP score, embed the bin index, and feed it to the captioning model during training; at inference time, supply the best-aligned bin so the model generates a well-aligned caption.
  • input/output : image + CLIP score of the {image, text} pair -> text
  • architecture : CLIP ViT-L/14 + 6-layer transformer (94.5M)
  • objective : cross-entropy loss
  • baseline : no filtering, filtering (CLIP score 0.3), loss reweighting (multiply the loss by the CLIP score), ZeroCap, Socratic Models, DeCap
  • data : CC3M (so it falls on the noisy side!), with COYO tried as an ablation
  • evaluation : BLEU, METEOR, CIDEr, SPICE, CLIPScore on COCO and nocaps // self-retrieval R@1 (when you retrieve with a caption generated from a given image, does that image come back?)
  • result : SOTA on everything except BLEU
  • contribution : simple and intuitive~
  • etc. : I didn't get what I was hoping for, but it was a fun read~ The closest prior work is BLIP, and thinking about it.. BLIP really was a pioneering piece of work.

Details

  • motivation : (figures)
  • architecture : (figure)
  • results : (figure)
  • ablations : (figure)
  • qualitative : (figure)