image

paper , code

TL;DR

  • I read this because.. : Time to work on VLMs.. Probably the first interleaved image-text dataset. Used in OpenFlamingo.
  • task : data
  • problem : open interleaved image-text data
  • idea : start from Common Crawl and collect the images.
  • input/output : sequence of images, sequence of texts -> text
  • architecture : OpenFlamingo(3B) https://github.com/long8v/PTIR/issues/118 ,
  • objective : CE loss
  • baseline : Flamingo trained only on LAION-2B
  • data : Multimodal C4(mmc4), Multimodal C4 fewer-faces(mmc4-ff), mmc4-core, mmc4-core-ff -> COCO caption
  • evaluation : zero-shot captioning, 4-, 8-shot captioning
  • result : far better performance than the LAION-2B pretrained model (surprising..)
  • contribution : first public open interleaved image-text data
  • etc. :

Details

mmc4

image

Data Curation Process

  • source : uses c4, the cleaned version of the April 2019 Common Crawl dump (said to be a widely used dump) (365M documents, 156B tokens)
  • images : download the original web pages listed in c4, then download their images. Keep only the png / jpeg / jpg extensions, and drop anything containing the strings logo, button, icon, plugin, or widget. Resize the long side to 800 pixels. -> 115M documents / 1.37B images
  • dedup + small resolution
    • dedup : https://gitlab.com/opennota/findimagedupes ์‚ฌ์šฉ
    • small resolution : remove images whose short side is 150 pixels or less.
    • remove images whose aspect ratio is 2 or more, or 0.5 or less (reportedly this helped remove banner-like ads)
    • a manual check of a 3.7K sample found that about 2.5% were ads
  • NSFW
    • uses an NSFW binary classifier via the dataset2metadata package
    • the classifier was trained on images labeled NSFW in LAION-2B and is then used for filtering.
  • Aligning images and sentences
    • C4 is a preprocessed version of the text, while the images were downloaded in full, so the text that belongs to an image may be missing
    • this could be done by looking at the HTML DOM, but they did not take that route
    • first, compute the pairwise similarity (correlation) between every image and every sentence.
    • remove any image whose similarity does not exceed 0.15 for even a single sentence
    • then run bipartite matching so that each sentence is assigned at most one image
      image

Assigning this way gives higher coverage than simply assigning each image to its max-similarity sentence.
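The thresholding and assignment steps above can be sketched on a toy similarity matrix. This is only an illustration, not the authors' code: the function names and the numbers are made up, the strict `>` at the 0.15 boundary is an assumption, and the matching is brute-forced with `itertools.permutations`, which only works for tiny documents. For realistic sizes a linear assignment solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice.

```python
from itertools import permutations

SIM_THRESHOLD = 0.15  # from the notes; strict ">" at the boundary is an assumption


def keep_images(sim, threshold=SIM_THRESHOLD):
    """Indices of images whose best sentence similarity exceeds the threshold."""
    return [i for i, row in enumerate(sim) if max(row) > threshold]


def argmax_assign(sim):
    """Baseline: each image goes to its most similar sentence (sentences can repeat)."""
    return [max(range(len(row)), key=row.__getitem__) for row in sim]


def bipartite_assign(sim):
    """Brute-force max-weight matching where every image gets a distinct sentence.

    Assumes no more images than sentences, and a document small enough to enumerate.
    """
    n_sent = len(sim[0])
    best = max(
        permutations(range(n_sent), len(sim)),
        key=lambda perm: sum(sim[i][j] for i, j in enumerate(perm)),
    )
    return list(best)


# Toy similarities: rows are images, columns are sentences (made-up numbers).
sim = [
    [0.30, 0.28, 0.10],
    [0.32, 0.20, 0.12],
    [0.05, 0.10, 0.12],  # never exceeds 0.15, so this image is dropped
]
kept = keep_images(sim)
kept_sim = [sim[i] for i in kept]
greedy = argmax_assign(kept_sim)      # both kept images pick sentence 0
matched = bipartite_assign(kept_sim)  # kept images are spread over two sentences
print(kept, greedy, matched)
print(len(set(greedy)), len(set(matched)))  # sentences covered by each strategy
```

In this toy case the max-similarity baseline puts both images on the same sentence, while the matching spreads them over two sentences, which is the coverage effect described above.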

After assignment, following Flamingo, each image is placed either before or after its sentence. image
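A minimal sketch of that placement step, assuming the assignment is available as a mapping from image index to sentence index. The `interleave` function and the `<image_i>` marker strings are hypothetical, chosen only for illustration; they are not the format used by mmc4 or OpenFlamingo.

```python
def interleave(sentences, image_to_sentence, place="before"):
    """Return a flat sequence mixing sentences and image placeholders.

    image_to_sentence maps image index -> index of its assigned sentence
    (at most one image per sentence, as produced by the matching step).
    place is "before" or "after": where the image goes relative to its sentence.
    """
    if place not in ("before", "after"):
        raise ValueError("place must be 'before' or 'after'")
    by_sentence = {s: i for i, s in image_to_sentence.items()}
    out = []
    for s_idx, sentence in enumerate(sentences):
        marker = f"<image_{by_sentence[s_idx]}>" if s_idx in by_sentence else None
        if marker and place == "before":
            out.append(marker)
        out.append(sentence)
        if marker and place == "after":
            out.append(marker)
    return out


sentences = ["A dog runs.", "It is sunny.", "The end."]
assignment = {0: 1, 1: 0}  # image 0 -> sentence 1, image 1 -> sentence 0
before = interleave(sentences, assignment, place="before")
after = interleave(sentences, assignment, place="after")
print(before)
print(after)
```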

Real example: image
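As a recap of the image filtering rules in the curation list above (extension, blocked strings, short side, aspect ratio), here is a rough sketch. The helper names are hypothetical, and checking the blocked strings against the whole URL is an assumption, since the notes do not say where the strings are matched.

```python
import os
from urllib.parse import urlparse

ALLOWED_EXT = {".png", ".jpeg", ".jpg"}
BLOCKED_WORDS = ("logo", "button", "icon", "plugin", "widget")


def keep_url(url):
    """Keep png/jpeg/jpg URLs that contain none of the blocked strings.

    Assumption: the blocked strings are matched as substrings of the full URL,
    so a word like "silicon" would also be rejected because it contains "icon".
    """
    lowered = url.lower()
    ext = os.path.splitext(urlparse(lowered).path)[1]
    return ext in ALLOWED_EXT and not any(w in lowered for w in BLOCKED_WORDS)


def keep_size(width, height, min_short=150, max_ratio=2.0):
    """Drop images with short side <= 150 or aspect ratio >= 2 (or <= 0.5)."""
    short, long_ = min(width, height), max(width, height)
    if short <= min_short:
        return False
    # long/short < 2 is equivalent to 0.5 < width/height < 2
    return long_ / short < max_ratio


print(keep_url("https://example.com/photos/cat.jpg"))
print(keep_url("https://example.com/assets/logo.png"))
print(keep_size(800, 600), keep_size(800, 400))
```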

Exploring mmc4

  • url source image

  • topics LDA -> top frequent words -> topic names via GPT4
    image

Result

Trained with OpenFlamingo and compared against the model trained on LAION-2B

  • Retrieval image

  • COCO caption image

Comparison against the model trained on 15M samples of LAION-2B, on MSCOCO captioning (zero-shot / 4- / 8-shot). Red is zero-shot performance. As for why it falls off at 4 and 8 shots, the suggested reason is that the LAION-2B model was trained only on short texts, so it struggles once longer text appears. Shouldn't the comparison be against the full 2B, though?

(from FLAMINGO, coco dev set, 4shot) image