image

paper , code

TL;DR

  • task : Vision-and-Language Pretraining(VLP)
  • problem : ๊ธฐ์กด VLP์—์„œ CNN backbone, object detector๋ฅผ ํ•„์ˆ˜์ ์œผ๋กœ ์‚ฌ์šฉํ•˜๊ณ  visual encoder๋ฅผ ํ—ค๋น„ํ•˜๊ฒŒ ๋งŒ๋“ค์–ด์„œ ์„ฑ๋Šฅ์„ ๋ฝ‘๊ธด ์ข‹์ง€๋งŒ ์‹ค์ œ application์— ์ ์šฉํ•˜๊ธฐ์—” ์ ํ•ฉํ•˜์ง€ ์•Š๋‹ค.
  • idea : CNN ์—†์ด ํ†ตํ•ฉ๋œ VLP ๋ชจ๋ธ์„ ๋งŒ๋“ค์ž.
  • architecture : visual ์ž„๋ฒ ๋”ฉ์€ ViT์ฒ˜๋Ÿผ, word embedding์€ BERT ๋ฐฉ์‹์œผ๋กœ. ๊ฐ๊ฐ์˜ ์ธ์ฝ”๋”์—์„œ ๋‚˜์˜จ ์ž„๋ฒ ๋”ฉ์„ ๊ฐ์ž modal-type ์ž„๋ฒ ๋”ฉ๊ณผ ํ•ฉํ•œ๋’ค ํ•˜๋‚˜์˜ ํŠธ๋žœ์Šคํฌ๋จธ ์ธ์ฝ”๋”์— ๋„ฃ๊ณ  ๋‚˜์˜จ output์œผ๋กœ ์•„๋ž˜ pretraining task๋กœ ํ•™์Šต.
  • objective : Image Text Matching(์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ํŽ˜์–ด์—์„œ ์ด๋ฏธ์ง€๋ฅผ 50% ํ™•๋ฅ ๋กœ ๋‹ค๋ฅธ ์ด๋ฏธ์ง€๋กœ ๋ฐ”๊พธ๊ณ  ์›๋ž˜์˜ pair๊ฐ€ ๋งž๋Š”์ง€ binary๋กœ ํ•™์Šต), MLM, whole word masking(ํ† ํฐ ๋‹จ์œ„๊ฐ€ ์•„๋‹ˆ๋ผ ์›๋ž˜ word ๋‹จ์–ด๋ฅผ ๋งˆ์Šคํ‚น. gi, ##raf, ##fe์—์„œ ๊ฐ€์šด๋ฐ๋งŒ ๋งˆ์Šคํ‚นํ•˜๋ฉด ๋น„์ฅฌ์–ผ ์ •๋ณด ์—†์ด ํ…์ŠคํŠธ ์ •๋ณด๋งŒ์œผ๋กœ ์˜ˆ์ธก์ด ๊ฐ€๋Šฅํ•จ.)
  • baseline : ViLBERT, UNITER, PixelBERT …
  • result : time(ms)๋ฅผ benchmark ๋Œ€๋น„ 4~60๋ฐฐ ๊ฐœ์„ ํ•˜๋ฉด์„œ ์„ฑ๋Šฅ๋„ ใ„ฑใ…Š
  • contribution : 1) deep visual encoder์—†์ด ๋งŒ๋“ค์–ด runtime / ํšจ์œจ์„ฑ ๊ฐœ์„  2) region feature๋‚˜ deep convolution์—†์ด ๋‹จ์ˆœํ•œ ์•„ํ‚คํ…์ณ๋กœ ๋น„์Šทํ•œ ์„ฑ๋Šฅ 3) word masking, image augmentation์ด VLP ์„ฑ๋Šฅ์„ ๊ฐœ์„ ํ•จ์„ ๋ณด์ž„
  • data : (pretraining) MSCOCO, Visual Genome, SBU captions , Google Conceptual Captions image (downstream) VQA v2, NLVR2(Natural Language for Visual Reasoning, ๋‘ ์ด๋ฏธ์ง€์™€ ๋‘ ์ด๋ฏธ์ง€๊ฐ„ ๊ด€๊ณ„(triplet)์ด ์ฃผ์–ด์ง€๊ณ  ์งˆ๋ฌธ์ด ์ฃผ์–ด์กŒ์„ ๋•Œ binary classification), Retrieval MSCOCO, Flickr30k for image-to-text, text-to-image retrieval

Details

image