image

paper, code

TL;DR

  • I read this because.. : someone I follow on GitHub starred the repo, which is how I came across it
  • task : CLIP with long context
  • problem : CLIP is trained with its text input capped at 77 tokens, and only 20 of those tokens are effectively used.
  • idea : train a long-context CLIP. Interpolate the positional embedding (PE), but leave the 20 effective token positions untouched and interpolate only the remaining ones.
  • input/output : {image, text} -> score
  • architecture : CLIP ViT-B/16, ViT-L/14
  • objective : infoNCE
  • baseline : CLIP
  • data : ShareGPT4V 1M
  • evaluation : ImageNet, COCO, Flickr retrieval, ShareGPT4V retrieval (long-context retrieval)
  • result : quantitatively good performance. It also seems to capture context much better.
  • contribution :
  • etc. :
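
For reference, the InfoNCE objective listed above can be sketched as a symmetric contrastive loss over a batch of paired image and text embeddings. This is a minimal NumPy illustration, not the authors' training code: the function name `info_nce` and the temperature value 0.07 are assumptions made for the example.

```python
import numpy as np

def info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over paired (N, D) embeddings; row i of each is a match."""
    # Cosine similarity: L2-normalize, then take all pairwise dot products.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N), matches on the diagonal

    def cross_entropy_diag(l):
        # Numerically stable log-softmax per row, then pick the diagonal targets.
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
loss_matched = info_nce(emb, emb)                        # correct pairs
loss_shuffled = info_nce(emb, np.roll(emb, 1, axis=0))   # every pair is wrong
```

With correctly paired embeddings the loss should come out much lower than with the shuffled pairing, which is the signal the fine-tuning relies on.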

Details

Problem

image

PE interpolation strategy

image
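
The strategy from the TL;DR (keep the first 20 positions, interpolate only the rest) can be sketched as follows. This is a hedged illustration, not the repository's code: the function name `stretch_positional_embedding`, the linear interpolation scheme, and the stretch ratio of 4 are assumptions for the example; these notes only state the 77-token limit and the 20 preserved positions.

```python
import numpy as np

def stretch_positional_embedding(pe, keep=20, ratio=4):
    """Keep the first `keep` rows of `pe` (L, D); linearly stretch the rest by `ratio`."""
    old_len, _ = pe.shape
    tail_len = old_len - keep
    # Fractional source position for each new tail slot, clamped to the last row.
    src = keep + np.arange(tail_len * ratio) / ratio
    src = np.minimum(src, old_len - 1)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, old_len - 1)
    alpha = (src - lo)[:, None]
    # Linear interpolation between the two neighbouring original embeddings.
    tail = (1.0 - alpha) * pe[lo] + alpha * pe[hi]
    # The well-trained leading positions are copied unchanged.
    return np.concatenate([pe[:keep], tail], axis=0)

rng = np.random.default_rng(0)
pe = rng.normal(size=(77, 8))      # stand-in for CLIP's 77 text positions
long_pe = stretch_positional_embedding(pe)
print(long_pe.shape)               # (248, 8): 20 kept + 57 * 4 interpolated
```

Under these assumptions the first 20 rows are bit-identical to the original embedding, so whatever the model learned for short captions is preserved, while the sparsely trained tail is what gets stretched.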

finetuning strategy

image

  • fine-grained alignment : appears to be done the usual way.
  • coarse-grained alignment : the PCE algorithm (run PCA, then keep the top 32 elements) is applied to the image. After the components that fall below the threshold are suppressed, the weighted sum of the selected eigenvectors, weighted by their eigenvalues, is aligned with the short caption.

image image
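
The coarse-grained step described above can be approximated with an ordinary PCA reconstruction over a batch of image features. This is a rough sketch under stated assumptions, not the paper's exact procedure: the name `primary_component_extraction` is hypothetical, PCA is computed across the batch, and each feature is rebuilt as a weighted sum of the retained eigenvectors using its projection coefficients as weights. The exact weighting and thresholding in the paper may differ.

```python
import numpy as np

def primary_component_extraction(features, k=32):
    """Reconstruct `features` (N, D) from their top-`k` principal components."""
    mean = features.mean(axis=0, keepdims=True)
    centered = features - mean
    # Eigendecomposition of the symmetric (D, D) scatter matrix.
    # np.linalg.eigh returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(centered.T @ centered)
    order = np.argsort(eigvals)[::-1][:k]   # indices of the k largest eigenvalues
    top = eigvecs[:, order]                 # (D, k) retained directions
    coeffs = centered @ top                 # (N, k) per-sample weights
    # Weighted sum of the retained eigenvectors, shifted back by the mean.
    return coeffs @ top.T + mean

rng = np.random.default_rng(0)
fine = rng.normal(size=(256, 64))   # stand-in for fine-grained image features
coarse = primary_component_extraction(fine, k=32)
```

In this sketch `coarse` would be the feature matched against the short caption, while the unmodified `fine` feature would be matched against the long caption.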

Result

image image image image