TL;DR
- I read this because : AAAI paper on CLIP
- task : zs classification
- problem : improve CLIP’s zs classification performance without any training
- idea : let the image / text encoder features interact mid-network, with no learned parameters
- input/output : {image, text} -> score
- architecture : CLIP ResNet variant
- objective : training-free improvement; plus a few-shot refined variant
- baseline : CoOp, CLIP linear probing, CLIP-Adapter
- data : ImageNet, Caltech101, OxfordPets, StanfordCars, Flowers102, … (the CLIP zs benchmark suite)
- evaluation : zs, few-shot accuracy
- Result : Higher performance with zero learning!
- contribution : Many prior studies try to improve fine-grained recognition by inserting self-attention (SA) into intermediate layers or by attending over all sequence tokens at the end; this study shows performance can be improved with a fairly small amount of extra computation and no training at all.
- etc. :
Details
motivation
architecture
Compute attention between the pre-projection features of the two modalities (i.e., before they are projected into the joint embedding space), then multiply the attention weights back onto the features.
The final prediction is a weighted sum that aggregates the scores from both modalities.
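A minimal NumPy sketch of the idea above: parameter-free cross-modal attention between pre-projection image tokens and class text features, followed by a weighted sum of the original and attention-updated similarity scores. All names, the residual blending weights (`alpha`, `beta`), and the aggregation weights `w` are my assumptions for illustration, not the paper’s exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def parameter_free_cross_attention(img_feats, txt_feats, alpha=0.5, beta=0.5):
    """Parameter-free cross-modal attention (sketch).

    img_feats: (M, D) pre-projection spatial image features (M spatial tokens)
    txt_feats: (K, D) text features, one per class prompt
    alpha/beta: assumed blending weights for the updated features

    No learned weights anywhere: "attention" is just raw dot-product
    similarity passed through a softmax.
    """
    sim = img_feats @ txt_feats.T                        # (M, K) similarities
    # Text attends to image tokens: each class gathers visual context
    txt_updated = softmax(sim.T, axis=-1) @ img_feats    # (K, D)
    # Image tokens attend to text: each token gathers class context
    img_updated = softmax(sim, axis=-1) @ txt_feats      # (M, D)
    # Blend original and attention-updated features (residual style)
    return img_feats + alpha * img_updated, txt_feats + beta * txt_updated

def zero_shot_logits(img_feats, txt_feats, w=(1.0, 0.5, 0.5)):
    """Final score = weighted sum of original and attended similarities."""
    img_u, txt_u = parameter_free_cross_attention(img_feats, txt_feats)
    img_global = img_feats.mean(axis=0)    # pooled global image feature (D,)
    img_global_u = img_u.mean(axis=0)      # pooled attention-updated feature
    s0 = img_global @ txt_feats.T          # original CLIP-style logits (K,)
    s1 = img_global_u @ txt_feats.T        # attended-image branch
    s2 = img_global @ txt_u.T              # attended-text branch
    return w[0] * s0 + w[1] * s1 + w[2] * s2
```

Usage: with `img_feats` of shape `(49, D)` (a 7×7 feature map flattened) and `txt_feats` of shape `(K, D)` for K classes, `zero_shot_logits` returns a `(K,)` score vector; the predicted class is its argmax.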