TL;DR
- I read this because: CLIP-related — a way to fine-tune conservatively without losing pretrained abilities. Found while searching for LiT-related articles.
- task : CLIP
- problem : When CLIP is fine-tuned on a reference domain, it may lose the general-domain knowledge it was originally trained on.
- idea : Ensemble CLIP's zero-shot capability with a model fine-tuned on the target domain -> ensemble via weight interpolation!
- input/output : {image, text} -> score
- architecture : CLIP, ViT, BASIC-L
- objective : InfoNCE
- baseline : zs-CLIP, finetuned CLIP.
- data : WIT (CLIP), JFT-300M (ViT) -> ImageNet, ImageNet-V2, ImageNet-R, ImageNet-Sketch, ObjectNet, ImageNet-A
- evaluation : Accuracy in the original domain and the shifted domain.
- Result : Improved performance on the distribution-shifted datasets while maintaining ImageNet performance.
- contribution : simple idea + easy to implement, yet performs well
- etc. :
Details
Related work
- Stochastic Weight Averaging https://arxiv.org/pdf/1803.05407.pdf
Using a moving average of the parameters has a kind of ensemble effect
- Distribution-shift datasets
Weight-space ensemble for finetuning
So simple…
- Take a pretrained CLIP and fine-tune it: either full fine-tuning (end-to-end) on the target domain, or just the last classifier (LC)
- Average the two sets of weights element-wise with a mixing coefficient alpha
Here, alpha should ideally be found by search, but I set it to 0.5 and it came out pretty close to the optimum.
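The interpolation itself is a one-liner per parameter. A minimal sketch, using plain `{param_name: list-of-floats}` dicts to stand in for real checkpoints (with PyTorch `state_dict`s the same per-parameter interpolation applies):

```python
def wise_ft(theta_zs, theta_ft, alpha=0.5):
    """Weight-space ensemble (WiSE-FT): element-wise interpolation
    between the zero-shot weights theta_zs and the fine-tuned
    weights theta_ft, with mixing coefficient alpha.

    theta = (1 - alpha) * theta_zs + alpha * theta_ft
    """
    assert theta_zs.keys() == theta_ft.keys(), "checkpoints must match"
    return {
        name: [(1 - alpha) * z + alpha * f
               for z, f in zip(theta_zs[name], theta_ft[name])]
        for name in theta_zs
    }


# Example: alpha=0 recovers the zero-shot model, alpha=1 the fine-tuned one.
zs = {"w": [0.0, 2.0]}
ft = {"w": [1.0, 4.0]}
merged = wise_ft(zs, ft, alpha=0.5)  # {"w": [0.5, 3.0]}
```

Note that, unlike an output-space ensemble, the merged model is a single set of weights, so inference costs one forward pass.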
Result
First figure: datasets plotted with ImageNet (the reference distribution) accuracy on the x-axis and distribution-shift accuracy on the y-axis. Purple is zero-shot CLIP, blue is models trained only on that data, orange is CLIP fine-tuned on that data. Second figure: WiSE-FT improves accuracy on the distribution-shifted datasets without reducing reference accuracy.
Looking at the fine-tuned models, the distribution-shifted datasets do not perform as well. The proposed WiSE-FT beats plain fine-tuning even on the reference domain (86.2 -> 87.1), and the distribution-shifted datasets also improve.
CLIP's performance varies a lot with hyperparameters -> the weight-space ensemble traces out a frontier!
Better performance than finetuning for each domain!
Analysis
The zero-shot model and the end-to-end fine-tuned model showed different prediction trends, while the linear classifier's trends were similar to the zero-shot model's -> the more diverse pair seems to give a larger ensemble effect
Ensembling weights rather than ensembling outputs gave the bigger performance improvement!