image

paper

TL;DR

problem : ์ข‹์€ vision backbone ๋งŒ๋“ค๊ธฐ. ๋ถ„๋ฅ˜ ๋ ˆ์ด๋ธ”์— ๋Œ€ํ•œ ์ด๋ฏธ์ง€ ํ”„๋ฆฌํŠธ๋ ˆ์ด๋‹, ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ pair๋ฅผ ๋ฐ›์•„ contrastive loss๋กœ ํ•™์Šต๋˜๋Š” dual-encoder model, image ์ธ์ฝ”๋”๊ฐ€ ์žˆ๊ณ  text decoder๊ฐ€ cross-attention์œผ๋กœ ์ด๋ฏธ์ง€ ํ”ผ์ณ๋ฅผ ๋ฐ›์•„ classification, VQA๋“ฑ์„ ํ•˜๋Š” encoder-decoder model ์„ธ๊ฐœ๋ฅผ ํ†ตํ•ฉํ•˜์—ฌ scratch ๋ถ€ํ„ฐ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๋Š” ๋ชจ๋ธ์„ ๋งŒ๋“ค๊ณ  ์‹ถ๋‹ค.
solution : ์ด๋ฏธ์ง€ ํ…์ŠคํŠธ ํŽ˜์–ด๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ, ์ด๋ฏธ์ง€ ์ธ์ฝ”๋” ํ…์ŠคํŠธ ๋””์ฝ”๋” ๋”ฐ๋กœ ์ธํ’‹์„ ๋ฐ›๊ณ  ์ด๋ฏธ์ง€ ์ธ์ฝ”๋”์—์„œ ๋‚˜์˜จ ๋งˆ์ง€๋ง‰ ํ† ํฐ๊ณผ ํ…์ŠคํŠธ ๋””์ฝ”๋”์˜ cls-token์œผ๋กœ contrastive loss, ํ…์ŠคํŠธ ๋””์ฝ”๋” ์œ„์— ์ด๋ฏธ์ง€ ์ธํ’‹๊ณผ ํฌ๋กœ์Šค ์–ดํ…์…˜์ด ์žˆ๋Š” multi-model text decoder๋ฅผ ์Œ“์€ ๋’ค captioning loss. ๋‘ loss์˜ ํ•ฉ์œผ๋กœ ํ”„๋ฆฌํŠธ๋ ˆ์ด๋‹ result : ๋‹ค์–‘ํ•œ task ์—์„œ SOTA image

Details

  • Architecture image

  • loss

captioning loss image

dual encoder contrastive loss image

  • Attentional Poolers : contrastive loss๋ฅผ ๊ณ„์‚ฐํ•  ๋•Œ, ์ด๋ฏธ์ง€์—์„œ ํ•˜๋‚˜์˜ ํ† ํฐ๋งŒ์„ ์‚ฌ์šฉํ•˜์ง€๋งŒ ์ธ์ฝ”๋”-๋””์ฝ”๋”์˜ ์บก์…”๋‹ ํƒœ์Šคํฌ๋ฅผ ํ• ๋•Œ๋Š” ์ „์ฒด ์ด๋ฏธ์ง€ ํ† ํฐ ์‹œํ€€์Šค๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค. ์ด๋Š” ์˜ˆ๋น„์‹คํ—˜์—์„œ visual recognition task๋ฅผ ํ•  ๋•Œ์—๋Š” ํ•˜๋‚˜์˜ pooled image๊ฐ€ ๋” ์„ฑ๋Šฅ์ด ์ข‹์•˜๊ณ , ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ์„ ํ•  ๋•Œ์—๋Š” region-level feature๋ฅผ ์ฐธ๊ณ ํ•˜๋ฉด ์ข‹์•„์„œ ๋” ๋งŽ์€ ํ† ํฐ์„ ๋ณด๋Š” ๊ฒƒ์ด ์œ ๋ฆฌํ–ˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ์ด ๋•Œ๋ฌธ์— task-specific attentional pooling์„ ์‚ฌ์šฉํ•˜์—ฌ downstream task๋งˆ๋‹ค ๋‹ค๋ฅธ visual representation์„ ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ–ˆ๋‹ค. pooler๋Š” n๊ฐœ์˜ learnable query๋ฅผ ๊ฐ€์ง„ single multi-head attention ๋ ˆ์ด์–ด์ด๋‹ค. (์ด๋•Œ key์™€ value๋Š” encoder output) ์ด๋ฅผ ํ†ตํ•ด ๋‘๊ฐœ์˜ ๋‹ค๋ฅธ loss์— ๋Œ€ํ•ด, ๋‹ค๋ฅธ ๊ธธ์ด์˜ ์ฟผ๋ฆฌ๋ฅผ ๊ฐ–๊ฒŒ ํ•™์Šต๋  ์ˆ˜ ์žˆ๋‹ค. ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ์ด learnable query๋Š” task adaptor ์—ญํ• ๋„ ํ•˜๊ฒŒ ๋œ๋‹ค.