image

paper

TL;DR

  • I read this because.. : What performance do you get when you import a pretrained vision backbone and train a CLIP on top of it?
  • task : Contrastive Learning
  • Problem : The {image, text} pairs CLIP trains on are quite noisy, but I want zero-shot transfer that benefits from supervised pretraining. If we import pretrained models into CLIP, which setup works best?
  • idea : experiment with 6 cases of vision encoder / text encoder combinations: pretrained + frozen (locked), pretrained + trainable, and randomly initialized
  • data : CC12M, YFCC100M-CLIP (15M), and a 4B-pair ALIGN-style dataset
  • input/output : image, text -> similarity score
  • architecture : ViT-g/14 + BERT
  • result : Better than CLIP trained from scratch or fine-tuned. Zero-shot performance is good, especially OOD. Locking the image encoder performs best; in that case the vision encoder's architecture doesn't matter, nor whether it was trained supervised or unsupervised. In other words, we reuse a vision encoder trained on relatively clean data, and the text encoder just learns to read out the information the vision encoder provides.
  • objective : InfoNCE
  • baseline : CLIP, from-scratch, fine-tuned, ALIGN
  • evaluation : zs OOD ImageNet classification, 7 VTAB-natural tasks
  • contribution : A paper exploring supervised pretraining + contrastive tuning.
  • etc. : There's a lot of ablation, which is great: text encoder type, local vs. global loss, multilingual training, text encoder size.
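As a reminder of what the InfoNCE objective above computes, here's a minimal NumPy sketch of the symmetric contrastive loss over a batch of paired embeddings (my own illustration, not the paper's code):

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: for each image the matching text is the positive,
    all other texts in the batch are negatives (and vice versa)."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) cosine similarities

    def xent_diag(l):
        # cross-entropy with the diagonal (matched pair) as the target
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

Correctly matched pairs drive the loss toward zero; mismatched pairs drive it up, which is the whole training signal.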

Details

The basic idea is this: in CLIP both the vision encoder and the text encoder are trained from scratch, and here we want to bring in pretrained ones instead. The notation:

  • L : load a pretrained model and lock (freeze) it
  • U : load a pretrained model and keep it trainable (unlocked)
  • u : randomly initialize and train from scratch
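The notation can be made concrete with a tiny helper (a sketch of my own; `build_setting` and the dict layout are not from the paper):

```python
def build_setting(code):
    """Map a two-letter setting like "LU" (image tower first, then text
    tower) to a per-tower config under the L/U/u notation above."""
    def tower(mode):
        if mode not in ("L", "U", "u"):
            raise ValueError(f"unknown mode: {mode}")
        return {
            "init": "pretrained" if mode in ("L", "U") else "random",
            "trainable": mode != "L",  # only "L" freezes the tower
        }
    image_mode, text_mode = code
    return {"image": tower(image_mode), "text": tower(text_mode)}
```

So "LU" means a frozen pretrained image tower with a trainable pretrained text tower, and "uu" is standard CLIP-style from-scratch training.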

This gives us a total of 6 cases, and the ablation over them is as follows (in order: vision tower, then text tower). image

By ImageNet zs

  • LU best -> proposed LiT
  • LU > UU : completely freezing the image encoder is, surprisingly, better than fine-tuning it.
  • LU ~ Lu : the text encoder performs similarly whether pretrained or from scratch.
  • UU > uu : importing pretrained weights for both towers beats training both from scratch.
  • UL, uL, LL all at the bottom : freezing the text encoder generally hurts, performing even worse than uu, which corresponds to standard CLIP training.

For retrieval, however, UU > LU.

image image

LU > UU: Why is it better to be locked?

image

The first row is the loss on the data LiT was trained on; the loss is high because the image encoder is locked. The second row evaluates on OOD data, and there the locked model's loss is the lowest -> i.e., locking makes the image encoder robust to OOD. They conclude that contrastive fine-tuning is bad for the visual representation. (At first glance this seems to contradict #134… that one argued that CLIP, having been trained contrastively, is robust to OOD and loses its OOD ability when fine-tuned on a supervised set. But that seems to be because those models were ImageNet-trained, while the supervised ViT used here was trained on JFT.) The last row is few-shot linear evaluation (logistic regression on the frozen features), where Lu performs best.
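That few-shot linear evaluation amounts to fitting a logistic-regression head on frozen features. A self-contained sketch (binary case with plain gradient descent; my simplification, not the paper's exact protocol):

```python
import numpy as np

def fit_linear_probe(feats, labels, lr=0.1, steps=500):
    """Fit a logistic-regression head on frozen features (binary labels)."""
    X = np.hstack([feats, np.ones((len(feats), 1))])  # append a bias column
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -30, 30)))  # sigmoid
        w -= lr * X.T @ (p - labels) / len(X)               # gradient step
    return w

def probe_accuracy(w, feats, labels):
    X = np.hstack([feats, np.ones((len(feats), 1))])
    return float(np.mean((X @ w > 0) == labels))
```

The point of the metric: only the linear head is trained, so accuracy directly reflects how linearly separable the frozen representation already is.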

They also tried locking the image encoder at the start and gradually unlocking it, but that didn't perform particularly well.

At first glance you might say ViT only wins because it was trained in a supervised setting, so they also tried backbones trained with other pretraining techniques. image

Backbones trained with other methods showed similar trends.

Ablations

  • text encoder

(YFCC) BERT > T5 ≈ the ViT-style transformer > mT5; (ours = their in-house data) ViT > BERT.

We consider four possible transformer-based text models [63]—the transformer from ViT-B [21] which also resembles that used in CLIP [46], T5-base [47], mT5-base [67], and the classic BERT-base.

  • text / image encoder scale

Performance improves when the text encoder is scaled up.

  • multi-lingual training : when they trained on all the data without filtering to English only, English performance didn't get worse, and the other languages' performance improved. image

I think it would have been even better to start from the mT5 tokenizer and a pretrained multilingual model for this training.

  • local loss vs global loss
image

Global loss works better, and the bigger the batch size the better, regardless. Since LiT keeps the image encoder frozen, it was more memory-efficient to precompute the image embeddings.
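That precomputation trick is easy to picture: the locked tower's outputs never change during training, so you can encode every image once up front and cache the result. A sketch (the `image_encoder` here is a hypothetical stand-in, not the actual ViT-g/14):

```python
import numpy as np

def image_encoder(batch):
    # Stand-in for the frozen pretrained image tower: any fixed,
    # deterministic map from pixels to an embedding works for the sketch.
    proj = np.ones((batch.shape[1], 8)) * 0.1
    return np.tanh(batch @ proj)

def precompute_embeddings(images, batch_size=4):
    """Run the frozen encoder once over the dataset and cache the outputs;
    the text tower then trains against this cache instead of re-encoding
    (and re-backpropagating through) the expensive image tower every epoch."""
    chunks = [image_encoder(images[i:i + batch_size])
              for i in range(0, len(images), batch_size)]
    return np.vstack(chunks)
```

This only works because the image tower is locked (L); in the U/u settings the embeddings change every step and must be recomputed.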