image

paper

TL;DR

  • I read this because.. : a way to finetune CLIP conservatively without losing its pretrained abilities; found while searching for LiT-related articles
  • task : CLIP
  • problem : when CLIP is finetuned on a target domain, it may lose the general-domain knowledge it was originally trained on.
  • idea : ensemble CLIP's zero-shot capabilities with a model finetuned on the target domain -> ensemble via weight interpolation!
  • input/output : {image, text} -> score
  • architecture : CLIP, ViT, BASIC-L
  • objective : InfoNCE
  • baseline : zs-CLIP, finetuned CLIP.
  • data : WIT(clip), JFT-300M(vit) -> ImageNet, ImageNetV2, ImageNet-R, ImageNet sketch, ObjectNet, ImageNet-A
  • evaluation : Accuracy in the original domain and the shifted domain.
  • Result : improved performance on the distribution-shifted datasets while maintaining ImageNet performance.
  • contribution : simple idea + easy to implement, yet performs well
  • etc. :
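The InfoNCE objective listed above can be sketched in NumPy as CLIP's symmetric image-text contrastive loss. A minimal sketch, not the paper's code; the function and variable names are my own:

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    # L2-normalize embeddings, then compute pairwise cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature       # (N, N); matching pairs on the diagonal
    labels = np.arange(len(logits))

    def xent(l):
        # cross-entropy with the diagonal as the correct class, numerically stable
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # symmetric: image->text over rows, text->image over columns
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned pairs (e.g. identical orthonormal embeddings) the loss approaches 0.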

Details

image

Using a moving average of params has some sort of ensemble effect

domain shift data

image

Weight-space ensemble for finetuning

So simple…

  1. Take a pretrained CLIP and finetune it on the target domain: either end-to-end (full finetuning) or just the last linear classifier (LC)
  2. Average the zero-shot and finetuned weights element-wise with mixing coefficient alpha: theta_wise = (1 - alpha) * theta_zs + alpha * theta_ft

Here, alpha should ideally be searched for on held-out data, but I set it to 0.5 and it came out pretty close to the optimum.
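The two steps above boil down to a single element-wise interpolation over parameter dicts. A minimal sketch in pure Python (with PyTorch you would interpolate the `state_dict` tensors the same way):

```python
def wise_ft(theta_zs, theta_ft, alpha=0.5):
    """Element-wise interpolation of zero-shot (theta_zs) and finetuned (theta_ft) weights."""
    # both models must share the same architecture, hence the same parameter keys
    assert theta_zs.keys() == theta_ft.keys()
    return {k: (1 - alpha) * theta_zs[k] + alpha * theta_ft[k] for k in theta_zs}
```

alpha = 0 recovers the zero-shot model and alpha = 1 recovers the finetuned model; everything in between is the weight-space ensemble.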

Result

image

First figure: ImageNet (the reference distribution) on the x-axis and distribution-shift accuracy on the y-axis. Purple is zero-shot CLIP performance, blue is models trained on that data, orange is CLIP finetuned on that data. Second figure: WiSE-FT improves performance on the distribution-shifted datasets without reducing reference accuracy.

image

Looking at the finetuned models, the ones under distribution shift are not performing as well. The proposed WiSE-FT beats plain finetuning even in the reference domain (86.2 -> 87.1), and the distribution-shifted datasets also improve.

image

CLIP's performance tends to vary a lot depending on hyperparameters -> the weight-space ensemble forms a frontier over those runs!

image

Better performance than finetuning for each domain!

Analysis

image

The zero-shot and finetuned classifiers showed different prediction trends, while linear classifiers trained on the same data were similar to each other -> this diversity seems to produce a larger ensemble effect.

image

Ensembling weights gave a larger performance improvement than ensembling outputs!
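For comparison, the output-ensembling baseline combines predictions rather than parameters. A minimal sketch (names are my own):

```python
def output_ensemble(logits_zs, logits_ft, alpha=0.5):
    # combine per-class logits from the two models instead of their weights
    return [(1 - alpha) * a + alpha * b for a, b in zip(logits_zs, logits_ft)]
```

A practical upside of the weight-space version: it yields a single model and one forward pass at inference, whereas output ensembling requires running both models.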