TL;DR
- I read this because : AAAI paper on CLIP
- task : zs classification
- problem : improve CLIP’s zs classification performance without any training
- idea : let the image / text encoder features interact mid-network, with no learned parameters
- input/output : {image, text} -> score
- architecture : CLIP ResNet variant
- objective : training-free improvement; plus a few-shot refined variant
- baseline : CoOp, CLIP linear probing, CLIP-Adapter
- data : ImageNet, Caltech101, OxfordPets, StanfordCars, Flowers102, … (the CLIP zs benchmark suite)
- evaluation : zs, few-shot accuracy
- Result : Higher performance with zero learning!
- contribution : Many prior studies try to improve fine-grained recognition by inserting self-attention (SA) into intermediate layers or by attending over all sequence tokens at the end; this study shows performance can be improved with a fairly small amount of extra computation and no training at all.
- etc. :
Details
motivation
architecture
Compute attention between the pre-projection features of the two modalities (i.e., before they are projected into the joint embedding space), then multiply the attention weights back onto the features.
The final prediction is a weighted sum that aggregates the scores from both modalities.
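A minimal NumPy sketch of the idea above: parameter-free cross-modal attention between pre-projection image tokens and class text features, followed by a weighted sum of the original and attention-updated similarity scores. All names, the residual blending weights (`alpha`, `beta`), and the aggregation weights `w` are my assumptions for illustration, not the paper’s exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def parameter_free_cross_attention(img_feats, txt_feats, alpha=0.5, beta=0.5):
    """Parameter-free cross-modal attention (sketch).

    img_feats: (M, D) pre-projection spatial image features (M spatial tokens)
    txt_feats: (K, D) text features, one per class prompt
    alpha/beta: assumed blending weights for the updated features

    No learned weights anywhere: "attention" is just raw dot-product
    similarity passed through a softmax.
    """
    sim = img_feats @ txt_feats.T                        # (M, K) similarities
    # Text attends to image tokens: each class gathers visual context
    txt_updated = softmax(sim.T, axis=-1) @ img_feats    # (K, D)
    # Image tokens attend to text: each token gathers class context
    img_updated = softmax(sim, axis=-1) @ txt_feats      # (M, D)
    # Blend original and attention-updated features (residual style)
    return img_feats + alpha * img_updated, txt_feats + beta * txt_updated

def zero_shot_logits(img_feats, txt_feats, w=(1.0, 0.5, 0.5)):
    """Final score = weighted sum of original and attended similarities."""
    img_u, txt_u = parameter_free_cross_attention(img_feats, txt_feats)
    img_global = img_feats.mean(axis=0)    # pooled global image feature (D,)
    img_global_u = img_u.mean(axis=0)      # pooled attention-updated feature
    s0 = img_global @ txt_feats.T          # original CLIP-style logits (K,)
    s1 = img_global_u @ txt_feats.T        # attended-image branch
    s2 = img_global @ txt_u.T              # attended-text branch
    return w[0] * s0 + w[1] * s1 + w[2] * s2
```

Usage: with `img_feats` of shape `(49, D)` (a 7×7 feature map flattened) and `txt_feats` of shape `(K, D)` for K classes, `zero_shot_logits` returns a `(K,)` score vector; the predicted class is its argmax.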