problem : Test whether self-supervised learning works well with the CLIP model structure.
solution : Combine two losses: a contrastive loss between images and text (as in CLIP) and a self-supervised loss on images alone.
result : Evaluated with linear probing (classification by attaching a linear layer to a frozen representation; only the linear layer is trained — the 'feature-based' approach mentioned in BERT), zero-shot transfer, and end-to-end fine-tuning, reaching SOTA.
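The combined objective in the solution above can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: it assumes the image-text term is CLIP's symmetric InfoNCE loss and the image-only term is a SimCLR-style NT-Xent loss over two augmented views; the function names, temperatures, and the `ssl_weight` balancing coefficient are all my own placeholders.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cross_entropy(logits, targets):
    # logits: (N, M) similarity scores; targets: (N,) index of the positive per row
    logits = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def clip_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE: matched image-text pairs (the diagonal) are positives.
    img, txt = l2_normalize(img_emb), l2_normalize(txt_emb)
    logits = img @ txt.T / temperature
    targets = np.arange(len(img))
    return 0.5 * (cross_entropy(logits, targets) + cross_entropy(logits.T, targets))

def simclr_loss(z1, z2, temperature=0.1):
    # NT-Xent: two augmented views of the same image are positives,
    # all other images in the batch are negatives.
    z = l2_normalize(np.concatenate([z1, z2]))
    n = len(z1)
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                               # exclude self-similarity
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    return cross_entropy(sim, targets)

def combined_loss(img_emb, txt_emb, view1_emb, view2_emb, ssl_weight=1.0):
    # Image-text contrastive loss plus a weighted image-only self-supervised loss.
    return clip_loss(img_emb, txt_emb) + ssl_weight * simclr_loss(view1_emb, view2_emb)
```

With aligned image and text embeddings the CLIP term drops toward zero, while mismatched pairs keep it high; the `ssl_weight` knob (a hypothetical name) controls how strongly the image-only term shapes the representation.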
