Problem : Vision-Language Pretraining (VLP) requires annotating images with bounding boxes and labels, so annotation is expensive and the resulting models do not transfer easily to zero-shot settings.

Solution : The image is encoded with CoAtNet and placed, together with the encoded text, as the prefix; the model is then trained with an encoder-decoder architecture. The pretraining data are ALIGN (noisy image-text pairs) and C4 (text-only). Finetuning is done on image captioning, visual reasoning, VQA, and multimodal translation.

Result : SOTA on a variety of finetuning tasks, and decent performance in the zero-shot setting as well. On image captioning, the model without any finetuning (zero-shot) scores similarly to models trained without pretraining. Adding a text-only corpus when training a Vision-Language model was confirmed to be useful (it strengthens the decoder's generation ability).

etc :

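  • CIDEr is a caption-evaluation metric that can also be optimized directly as a separate training objective (it belongs to image captioning rather than VQA)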
  • For VQA, the image is fed to the encoder and the text to the decoder; a fully connected (FC) layer is attached to the output of the decoder's last token and trained
  • multimodal translation is the task of translating an image description into another language, given the image
  • The encoder-decoder architecture performed better than the decoder-only architecture
  • PrefixLM attends bidirectionally within the prefix and behaves as a standard left-to-right LM over everything after it (was PrefixLM first introduced in this paper?)
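
The PrefixLM attention pattern from the last bullet can be sketched as a boolean mask. This is only an illustration, not the paper's implementation: it assumes a single flat sequence in which the first `prefix_len` positions are the prefix, whereas in an encoder-decoder setup the encoder would handle the prefix and the decoder the causal part. The names `prefix_lm_mask` and `show` are hypothetical helpers.

```python
from typing import List


def prefix_lm_mask(seq_len: int, prefix_len: int) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Assumption: positions 0..prefix_len-1 form the prefix; the rest is the
    part that is generated left to right.
    """
    if not 0 <= prefix_len <= seq_len:
        raise ValueError("prefix_len must be between 0 and seq_len")
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            in_prefix = j < prefix_len  # prefix keys are visible to every position
            causal = j <= i             # otherwise only current and earlier positions
            row.append(in_prefix or causal)
        mask.append(row)
    return mask


def show(mask: List[List[bool]]) -> str:
    """Render the mask with 'x' for allowed attention and '.' for blocked."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


print(show(prefix_lm_mask(seq_len=5, prefix_len=2)))
# xx...
# xx...
# xxx..
# xxxx.
# xxxxx
```

The first two rows show the prefix positions seeing each other in both directions, while the remaining rows follow the usual lower-triangular causal pattern. With `prefix_len=0` the mask reduces to a plain causal LM mask.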