[159] Long-CLIP: Unlocking the Long-Text Capability of CLIP

paper , code

TL;DR

I read this because.. : someone I follow on GitHub starred this repo
task : CLIP with long context
Problem : CLIP is trained with text capped at 77 tokens, and only about 20 of those positions are effectively used.
Idea : Fine-tune CLIP for long text. Interpolate the positional embedding (PE), but keep the 20 effective positions unchanged and interpolate only the rest.
input/output : {image, text} -> score
architecture : CLIP ViT-B/16, ViT-L/14
objective : infoNCE
baseline : CLIP
data : ShareGPT4V 1M
evaluation : ImageNet, COCO, Flickr retrieval, ShareGPT4V retrieval (long-context retrieval)
result : Quantitatively good performance. Feels much more contextual.
contribution :
etc. :
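
For reference, the infoNCE objective listed above can be sketched as a symmetric cross-entropy over an image-text similarity matrix. This is a minimal pure-Python illustration, not the paper's training code: the function name `info_nce` and the temperature of 0.07 are assumptions made for the example, and matched pairs are assumed to sit on the diagonal.

```python
import math

def info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE over an N x N similarity matrix.

    sim[i][j] is the similarity between image i and text j;
    the matched pair for row i is assumed to be column i.
    """
    n = len(sim)

    def cross_entropy_rows(m):
        total = 0.0
        for i in range(n):
            logits = [x / temperature for x in m[i]]
            mx = max(logits)  # subtract the max for numerical stability
            log_z = mx + math.log(sum(math.exp(l - mx) for l in logits))
            total += log_z - logits[i]
        return total / n

    # Image-to-text uses the rows, text-to-image uses the columns.
    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (cross_entropy_rows(sim) + cross_entropy_rows(transposed))
```

The loss is low when each image is most similar to its own caption and high when the pairing is scrambled.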

Details

Problem

PE interpolate strategy
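
A minimal sketch of the idea from the TL;DR: copy the first 20 positional embeddings unchanged and linearly interpolate only the remaining positions. The function name, the list-of-lists representation, and the stretch ratio of 4 are assumptions made for illustration (with 77 positions this gives 20 + 57 × 4 = 248); the clamping at the last position is also a simplification of this sketch.

```python
def stretch_positional_embedding(pe, keep=20, ratio=4):
    """Stretch a positional-embedding table, preserving the first `keep` rows.

    pe    : list of rows, one embedding vector per position
    keep  : number of leading positions copied as-is (assumed 20)
    ratio : interpolation factor for the remaining positions (assumed 4)
    """
    n = len(pe)
    new_len = keep + (n - keep) * ratio
    out = [list(row) for row in pe[:keep]]  # well-trained positions stay fixed
    for j in range(keep, new_len):
        pos = keep + (j - keep) / ratio     # fractional index into the old table
        lo = int(pos)
        hi = min(lo + 1, n - 1)             # clamp at the last original position
        frac = pos - lo
        out.append([(1 - frac) * a + frac * b for a, b in zip(pe[lo], pe[hi])])
    return out
```

Interpolating all 77 positions uniformly would also shift the first 20, which are the ones the notes describe as actually used; keeping them fixed is the point of the strategy.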

finetuning strategy

- Fine-grained alignment is the usual CLIP contrastive alignment.
- Coarse-grained alignment passes the image feature through the PCE algorithm (PCA, keeping the top 32 components), filters out the components below that cutoff, and aligns the result, an eigenvalue-weighted sum of the selected eigenvectors, with the short caption.
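
The coarse-grained branch described above can be sketched as a PCA filter on image features. This is a hedged illustration, not the authors' implementation: it assumes PCA is computed across a batch of image features, that NumPy is available, and that "keeping the top 32" means projecting onto the 32 leading principal directions and reconstructing. The function name `coarse_image_features` is hypothetical.

```python
import numpy as np

def coarse_image_features(feats, k=32):
    """Keep only the top-k principal components of a batch of image features.

    feats : array of shape (batch, dim)
    k     : number of principal components kept (32 in the notes above)
    """
    mean = feats.mean(axis=0, keepdims=True)
    centered = feats - mean
    cov = centered.T @ centered                  # (dim, dim), unnormalized covariance
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]        # indices of the k largest
    top = eigvecs[:, order]                      # (dim, k)
    # Project onto the kept directions, then map back to feature space.
    return centered @ top @ top.T + mean
```

The returned coarse features would then be paired with short-caption text features in the same contrastive loss, while the unfiltered features are paired with the long captions.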
