image

paper

TL;DR

  • I read this because : related to the CLIP loss
  • task : contrastive learning -> image classification, object detection, semantic segmentation
  • problem : how to combine the CLIP and MAE objectives in one model
  • idea : target the reconstruction not at the pixel level, but at the cosine similarity to the CLIP text features, i.e. reconstruct in the language semantic space!
  • input/output : {image, text} pair
  • architecture : ViT-B/16 and its equivalent text encoder (12 heads, 768 hid dim)
  • objective : InfoNCE(i2t, t2i), KL(cosine similarity of reconstructed patch and text feature, cosine similarity of original image patch and text feature)
  • baseline : CLIP, BEiT, MAE, MAE + CLIP, MAE -> CLIP, etc..
  • data : LAION-20M, LAION-50M -> COCO, LVIS, ADE20K
  • evaluation : ImageNet(zs, linear probing, finetuning), AP(COCO, LVIS), mIoU(ADE20K)
  • result : better performance than other objectives under the same conditions!
  • contribution : aligning two heterogeneous losses, contrastive + reconstruction (the latter previously specialized for the vision modality only); thorough experiments and well written.
  • etc. :
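The contrastive half of the objective is the standard symmetric InfoNCE over {image, text} pairs (i2t and t2i). A minimal numpy sketch, assuming L2-normalized embeddings and a hypothetical temperature value:

```python
import numpy as np

def info_nce_i2t_t2i(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of {image, text} pairs.

    img_emb, txt_emb: (B, D) L2-normalized embeddings; pair k is
    (img_emb[k], txt_emb[k]). The temperature value is an assumption.
    """
    logits = img_emb @ txt_emb.T / temperature      # (B, B) scaled cosine similarities
    labels = np.arange(len(logits))                 # matched pair k sits on the diagonal

    def xent(l):
        # cross-entropy with the diagonal (matched pairs) as targets
        l = l - l.max(axis=1, keepdims=True)        # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (xent(logits) + xent(logits.T))    # average of i2t and t2i
```

The transpose handles the t2i direction: each text classifies which image in the batch it belongs to.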

Details

Overview

image image

Masked Visual Reconstruction in Language Semantic Space

The vision encoder / text encoder follow CLIP, and the reconstruction branch follows MAE.

image image
image
  • $f_i^k$ : original image feature
  • $g_i^k$ : feature of image patch reconstructed with MAE
  • $\theta$ : projection in the vision encoder
image
  • $z_l^T$ : text feature in text embedding space (up to text projection)
  • the text features are used as a kind of “prototype”
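Concretely, each patch feature is turned into a distribution over the batch's text "prototypes" by softmaxing its cosine similarities to the text features. A sketch of that mapping (the temperature `tau` is an assumption):

```python
import numpy as np

def patch_text_distribution(patch_feat, text_feats, tau=0.01):
    """Softmax distribution over text 'prototypes' for one patch feature.

    patch_feat: (D,) projected patch feature; text_feats: (L, D) text
    embeddings in the shared space. tau is a hypothetical temperature.
    """
    def norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    sims = norm(text_feats) @ norm(patch_feat)   # (L,) cosine similarities
    sims = sims / tau
    sims -= sims.max()                           # numerical stability
    e = np.exp(sims)
    return e / e.sum()                           # probabilities over the L texts
```

Running this on the original patch feature $f_i^k$ gives the target distribution, and on the reconstructed feature $g_i^k$ gives the prediction.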
image

KL divergence; the target $p_i^k$ is under stop-gradient.

image

The final loss is a weighted sum of the two, e.g. at a 2:1 ratio.
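Putting it together, a minimal sketch of the KL reconstruction term and the weighted total (the exact 2:1 weights follow the note above and are otherwise assumptions):

```python
import numpy as np

def mvr_kl_loss(p, q, eps=1e-8):
    """KL(p || q): p is the distribution from the original patch feature
    (the stop-gradient target), q from the MAE-reconstructed feature.
    Both are probability distributions over the batch's texts."""
    p = p + eps
    q = q + eps
    return float((p * (np.log(p) - np.log(q))).sum())

def total_loss(l_contrastive, l_reconstruction, w_c=2.0, w_r=1.0):
    """Weighted sum of the two objectives; the 2:1 ratio is an assumption."""
    return w_c * l_contrastive + w_r * l_reconstruction
```

In a framework like PyTorch, the stop-gradient on $p_i^k$ would be a `.detach()` on the target branch so only the reconstruction side receives gradients from this term.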

Result

image
  • L-20M means the model saw only 20M LAION samples
image image image image image image

Ablation

image
  • Table 9: joint training was better than two-stage pipelines such as MIM -> LiT, MIM -> CLIP, or CLIP -> MIM
  • Table 10: reconstructing in the language semantic space was significantly better than pixel-level reconstruction, and also better than the same KL objective computed against random vectors (a high-level but non-language space) instead of text features
image