TL;DR
- I read this because: CLIP-related — a way to fine-tune conservatively without losing pretrained abilities. Found while searching for LiT-related articles.
- task : CLIP
- problem : When CLIP is fine-tuned on a reference domain, it may lose the general-domain knowledge it was originally trained on.
- idea : Ensemble CLIP's zero-shot capability with a model fine-tuned on the target domain -> ensemble via weight interpolation!
- input/output : {image, text} -> score
- architecture : CLIP, ViT, BASIC-L
- objective : InfoNCE
- baseline : zs-CLIP, finetuned CLIP.
- data : WIT (CLIP), JFT-300M (ViT) -> ImageNet, ImageNet-V2, ImageNet-R, ImageNet-Sketch, ObjectNet, ImageNet-A
- evaluation : Accuracy in the original domain and the shifted domain.
- Result : Improved performance on the distribution-shifted datasets while maintaining ImageNet performance.
- contribution : simple idea + easy to implement, yet performs well
- etc. :
Details
Related work
- Stochastic Weight Averaging https://arxiv.org/pdf/1803.05407.pdf
Using a moving average of the parameters has a kind of ensemble effect
- Distribution-shift datasets
Weight-space ensemble for finetuning
So simple…
- Take a pretrained CLIP and fine-tune it: either full fine-tuning (end-to-end) on the target domain, or just the last classifier (LC)
- Average the two sets of weights element-wise with a mixing coefficient alpha
Here, alpha should ideally be found by search, but I set it to 0.5 and it came out pretty close to the optimum.
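The interpolation itself is a one-liner per parameter. A minimal sketch, using plain `{param_name: list-of-floats}` dicts to stand in for real checkpoints (with PyTorch `state_dict`s the same per-parameter interpolation applies):

```python
def wise_ft(theta_zs, theta_ft, alpha=0.5):
    """Weight-space ensemble (WiSE-FT): element-wise interpolation
    between the zero-shot weights theta_zs and the fine-tuned
    weights theta_ft, with mixing coefficient alpha.

    theta = (1 - alpha) * theta_zs + alpha * theta_ft
    """
    assert theta_zs.keys() == theta_ft.keys(), "checkpoints must match"
    return {
        name: [(1 - alpha) * z + alpha * f
               for z, f in zip(theta_zs[name], theta_ft[name])]
        for name in theta_zs
    }


# Example: alpha=0 recovers the zero-shot model, alpha=1 the fine-tuned one.
zs = {"w": [0.0, 2.0]}
ft = {"w": [1.0, 4.0]}
merged = wise_ft(zs, ft, alpha=0.5)  # {"w": [0.5, 3.0]}
```

Note that, unlike an output-space ensemble, the merged model is a single set of weights, so inference costs one forward pass.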
Result
First figure: datasets plotted with ImageNet (the reference distribution) accuracy on the x-axis and distribution-shift accuracy on the y-axis. Purple is zero-shot CLIP, blue is models trained only on that data, orange is CLIP fine-tuned on that data. Second figure: WiSE-FT improves accuracy on the distribution-shifted datasets without reducing reference accuracy.
Looking at the fine-tuned models, the distribution-shifted datasets do not perform as well. The proposed WiSE-FT beats plain fine-tuning even on the reference domain (86.2 -> 87.1), and the distribution-shifted datasets also improve.
CLIP's performance varies a lot with hyperparameters -> the weight-space ensemble traces out a frontier!
Better performance than finetuning for each domain!
Analysis
The zero-shot model and the end-to-end fine-tuned model showed different prediction trends, while the linear classifier's trends were similar to the zero-shot model's -> the more diverse pair seems to give a larger ensemble effect
Ensembling weights rather than ensembling outputs gave the bigger performance improvement!