image

paper

TL;DR

  • I read this because.. : AAAI CLIP
  • task : zs classification
  • problem : want to improve CLIP's zs classification ability without any training
  • idea : without training, exchange features between the image and text encoders partway through
  • input/output : {image, text} -> score
  • architecture : CLIP ResNet variant
  • objective : modified without training; there is also a few-shot fine-tuned version
  • baseline : CoOp, CLIP linear probing, CLIP-Adapter
  • data : ImageNet, Caltech101, OxfordPets, StanfordCars, Flowers102, … (CLIP zs)
  • evaluation : zs, few-shot accuracy
  • result : higher performance with no training at all!
  • contribution : many studies chase better fine-grained performance by inserting self-attention (SA) from the intermediate layers onward, or by attending over every sequence token at the end; it is nice that this work raises performance with an operation that does not look that expensive
  • etc. :

Details

motivation

image

architecture

image

Attention is computed on the un-projected features, and the result is then multiplied onto the features: image

image
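A minimal NumPy sketch of the attention step described above, to make the "no projection" point concrete. The function name, the tensor shapes, and the two temperature arguments are assumptions for illustration and are not taken from these notes: the similarity matrix is built directly from the raw features (no learned Q/K/V weights), and each modality is updated with a softmax-weighted mix of the other.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def parameter_free_attention(f_s, f_t, temp_s=1.0, temp_t=1.0):
    """Cross-modal attention with no learnable parameters (hypothetical helper).

    f_s: (HW, C) spatial image features (assumed shape)
    f_t: (K, C)  text features, one row per class prompt (assumed shape)
    temp_s, temp_t: softmax temperatures (assumed hyperparameters)
    """
    # Attention map computed directly on the features, without projection.
    attn = f_s @ f_t.T                                   # (HW, K)
    # Each spatial location gathers text features, weighted by similarity.
    f_s_upd = softmax(attn / temp_t, axis=-1) @ f_t      # (HW, C)
    # Each class prompt gathers spatial features, weighted by similarity.
    f_t_upd = softmax(attn.T / temp_s, axis=-1) @ f_s    # (K, C)
    return f_s_upd, f_t_upd


# Usage with random features: 49 spatial locations, 10 classes, 64 channels.
rng = np.random.default_rng(0)
f_s_new, f_t_new = parameter_free_attention(
    rng.normal(size=(49, 64)), rng.normal(size=(10, 64))
)
print(f_s_new.shape, f_t_new.shape)
```

Since nothing here is learned, the step can be dropped into a frozen CLIP forward pass; the updated spatial features would still need pooling into a single image vector before scoring.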

The final prediction is a weighted sum over these aggregations of the two modalities: image

image
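One way the weighted sum could look, as a hedged sketch. The three-term form (original pair, image with updated text, updated image with original text), the weight tuple, and all names below are assumptions for illustration; the notes only state that the final prediction is a weighted sum over the aggregated modalities.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def fused_logits(f_v, f_t, f_v_upd, f_t_upd, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of three similarity terms (hypothetical helper).

    f_v:     (C,)   global image feature
    f_t:     (K, C) text features, one row per class
    f_v_upd: (C,)   image feature after cross-modal update and pooling
    f_t_upd: (K, C) text features after cross-modal update
    weights: assumed mixing coefficients, one per term
    """
    f_v, f_t, f_v_upd, f_t_upd = (
        l2_normalize(a) for a in (f_v, f_t, f_v_upd, f_t_upd)
    )
    w_orig, w_text, w_image = weights
    return (
        w_orig * (f_t @ f_v)            # original image vs. original text
        + w_text * (f_t_upd @ f_v)      # original image vs. updated text
        + w_image * (f_t @ f_v_upd)     # updated image vs. original text
    )  # shape (K,): one score per class


# Usage: the predicted class is the argmax over the fused scores.
rng = np.random.default_rng(0)
scores = fused_logits(
    rng.normal(size=64), rng.normal(size=(10, 64)),
    rng.normal(size=64), rng.normal(size=(10, 64)),
    weights=(1.0, 0.5, 0.5),
)
print(scores.shape, int(scores.argmax()))
```

Setting the second and third weights to zero recovers the plain CLIP zero-shot score, which makes the contribution of the exchanged features easy to ablate.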

Result

image