image

paper, code

TL;DR

  • I read this because.. : someone I follow on GitHub starred it
  • task : CLIP with long context
  • Problem : CLIP's text input is capped at 77 tokens, and only about 20 of those are used effectively.
  • Idea : Train a long-context CLIP. Interpolate the positional embedding (PE), but keep the first 20 well-trained positions as they are and interpolate only the rest.
  • input/output : {image, text} -> score
  • architecture : CLIP ViT-B/16, ViT-L/14
  • objective : infoNCE
  • baseline : CLIP
  • data : ShareGPT4V 1M
  • evaluation : ImageNet, COCO, Flickr retrieval, ShareGPT4V retrieval (long-context retrieval)
  • result : Quantitatively strong performance. Qualitatively, it seems to follow the context of long captions much better.
  • contribution :
  • etc. :

Details

Problem

image

PE interpolation strategy

image
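The "keep the first 20, interpolate the rest" idea can be sketched as below. This is a minimal NumPy sketch, not the authors' code: the function name `stretch_positional_embedding`, the stretch factor `ratio=4`, and plain linear interpolation with clamping at the end are all assumptions made for illustration.

```python
import numpy as np


def stretch_positional_embedding(pe, keep=20, ratio=4):
    """Lengthen a (L, D) positional embedding table.

    The first `keep` rows are copied unchanged; the remaining rows are
    linearly interpolated so that the tail becomes `ratio` times longer.
    `keep` and `ratio` are illustrative defaults (assumptions).
    """
    pe = np.asarray(pe, dtype=np.float64)
    head, tail = pe[:keep], pe[keep:]
    n_tail = tail.shape[0]

    # Fractional source position of every new tail row: 0, 1/ratio, 2/ratio, ...
    src = np.arange(n_tail * ratio) / ratio
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, n_tail - 1)  # clamp at the last original row
    frac = (src - lo)[:, None]

    stretched = (1.0 - frac) * tail[lo] + frac * tail[hi]
    return np.concatenate([head, stretched], axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pe = rng.normal(size=(77, 8))  # stand-in for CLIP's 77-position table
    long_pe = stretch_positional_embedding(pe)
    print(long_pe.shape)
```

With these defaults a 77-row table becomes 20 + 57 * 4 = 248 rows, and every original tail row is still present at every fourth position of the stretched tail.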

Fine-tuning strategy

image

  • Fine-grained alignment is unchanged from standard CLIP training: the image feature is aligned with the long caption.
  • Coarse-grained alignment: the image feature goes through the PCE step (implemented with PCA, keeping the top 32 components), the weaker components are filtered out with a threshold, and the eigenvalue-weighted sum of the selected eigenvectors is aligned with the short caption.

image image
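A rough sketch of the two alignment terms, assuming NumPy and a batch of precomputed embeddings. The names `coarse_features`, `info_nce`, `total_loss` and the weight `alpha` are hypothetical. The thresholding step is omitted, and the reconstruction weights here are plain PCA projection coefficients, which may differ from the paper's exact weighting.

```python
import numpy as np


def coarse_features(image_feats, k=32):
    """Rebuild (N, D) image features from their top-k principal components."""
    x = np.asarray(image_feats, dtype=np.float64)
    mean = x.mean(axis=0, keepdims=True)
    centered = x - mean
    # Rows of vt are the principal directions, sorted by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:k]                 # (k, D) selected eigenvectors
    coords = centered @ top.T    # (N, k) per-sample weights
    return coords @ top + mean   # weighted sum of the selected eigenvectors


def info_nce(img, txt, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (N, D) embeddings."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    def cross_entropy_diag(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))


def total_loss(img, long_txt, short_txt, k=32, alpha=1.0):
    """Fine-grained term plus coarse-grained term (alpha is an assumption)."""
    fine = info_nce(img, long_txt)
    coarse = info_nce(coarse_features(img, k=k), short_txt)
    return fine + alpha * coarse
```

Here the fine-grained term uses the full image feature with the long caption, while the coarse-grained term only sees what survives the top-k reconstruction, which is why it is paired with the short caption.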

Result

image image image image