
paper

TL;DR

  • I read this because: AAAI paper on CLIP
  • task : zero-shot (zs) classification
  • problem : improve CLIP’s zs classification accuracy without any training
  • idea : exchange image/text encoder features mid-network, with no learned parameters
  • input/output : {image, text} -> score
  • architecture : CLIP ResNet variant
  • objective : training-free by default, with an optional few-shot refined version
  • baseline : CoOp, CLIP linear probing, CLIP-Adapter
  • data : ImageNet, Caltech101, OxfordPets, StanfordCars, Flowers102, … (the standard CLIP zs suite)
  • evaluation : zs, few-shot accuracy
  • Result : higher accuracy with zero training!
  • contribution : Many prior works improve fine-grained recognition by inserting self-attention (SA) into intermediate layers, or by attending over the full token sequence at the end; this study shows performance can be improved with a training-free mechanism whose extra computation is modest.
  • etc. :
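The {image, text} -> score interface above is standard CLIP zero-shot scoring: cosine similarity between the image embedding and each class's text embedding. A minimal sketch (the feature dimension, temperature, and random features are illustrative assumptions, not values from the paper):

```python
import numpy as np

def zero_shot_scores(img_feat, txt_feats, temperature=100.0):
    """Cosine-similarity scores between one image embedding and C class
    text embeddings, as in standard CLIP zero-shot classification."""
    img = img_feat / np.linalg.norm(img_feat)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    return temperature * txt @ img  # shape (C,)

# Toy example: 5 classes, 512-dim embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
img = rng.standard_normal(512)
txts = rng.standard_normal((5, 512))
scores = zero_shot_scores(img, txts)
pred = int(np.argmax(scores))  # predicted class index
```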

Details

motivation


architecture


Attention is applied to the pre-projection features, and the attention output is multiplied with the image features.

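As I read the figure, the mid-network feature exchange can be seen as parameter-free cross-attention between the image's spatial features and the text features, with no learned projections. A sketch under that assumption (shapes and the scaling factor are my guesses, not the paper's):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def parameter_free_cross_attention(Fv, Ft):
    """Training-free cross-modal attention: no learned Q/K/V projections.
    Fv: (HW, d) image spatial features, Ft: (C, d) text features,
    both assumed taken before CLIP's final projection."""
    d = Fv.shape[-1]
    A = Fv @ Ft.T / np.sqrt(d)           # (HW, C) raw cross-modal affinities
    Fv_att = softmax(A, axis=1) @ Ft     # text-aware image features, (HW, d)
    Ft_att = softmax(A.T, axis=1) @ Fv   # image-aware text features, (C, d)
    return Fv_att, Ft_att

# Toy shapes: 7x7 spatial grid, 5 classes, 8-dim features (all hypothetical).
rng = np.random.default_rng(0)
Fv = rng.standard_normal((49, 8))
Ft = rng.standard_normal((5, 8))
Fv_att, Ft_att = parameter_free_cross_attention(Fv, Ft)
```

Each attended output stays in the other modality's span of features, which is why no extra training is needed.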

The final prediction is a weighted sum that aggregates the scores from the two modalities.
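That weighted sum can be sketched as blending the original CLIP logits with the logits produced by each attended branch; the blend weights below are hypothetical hyperparameters, not values from the paper:

```python
import numpy as np

def fused_logits(logits_clip, logits_img, logits_txt,
                 beta_img=0.5, beta_txt=0.5):
    """Aggregate the base CLIP logits with the two attended branches'
    logits via a weighted sum (beta_* are hypothetical blend weights)."""
    return (np.asarray(logits_clip)
            + beta_img * np.asarray(logits_img)
            + beta_txt * np.asarray(logits_txt))

# Toy 3-class example with made-up branch logits.
base = np.array([1.0, 2.0, 3.5])
img_branch = np.array([0.0, 2.0, 0.0])
txt_branch = np.array([2.0, 0.0, 0.0])
fused = fused_logits(base, img_branch, txt_branch)  # → [2.0, 3.0, 3.5]
```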


Result
