TL;DR
- why I read this : related to the CLIP loss
- task : contrastive learning -> image classification, object detection, semantic segmentation
- problem : how to combine the CLIP and MAE objectives in one pre-training
- idea : set the reconstruction target not at the pixel level, but at the cosine similarity to the CLIP text features! i.e. reconstruct in the language semantic space
- input/output : {image, text} pair
- architecture : ViT-B/16 and its equivalent text encoder (12 heads, 768 hid dim)
- objective : InfoNCE(i2t, t2i), KL(cosine similarity of reconstructed patch and text feature, cosine similarity of original image patch and text feature)
- baseline : CLIP, BEiT, MAE, MAE + CLIP, MAE -> CLIP, etc..
- data : LAION-20M, LAION-50M -> COCO, LVIS, ADE20K
- evaluation : ImageNet(zs, linear probing, finetuning), AP(COCO, LVIS), mIoU(ADE20K)
- result : better performance than the other objectives under the same conditions!
- contribution : combining a contrastive loss with a reconstruction loss, i.e. two heterogeneous losses (the latter specialized for the vision modality only), and aligning them well; thoroughly experimented and well written.
- etc. :
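The InfoNCE(i2t, t2i) part of the objective above can be sketched as follows (a minimal numpy sketch; the function name, temperature value, and variable names are mine, not from the paper):

```python
import numpy as np

def info_nce(img, txt, tau=0.07):
    """Symmetric InfoNCE over a batch of paired {image, text} embeddings.

    img, txt: (N, D) arrays; row i of img is paired with row i of txt.
    """
    # L2-normalize so the dot product is cosine similarity
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / tau            # (N, N) scaled similarities
    labels = np.arange(len(img))          # matched pair sits on the diagonal

    def ce(l):
        # cross-entropy of the diagonal (correct pair) per row
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the image->text and text->image directions
    return 0.5 * (ce(logits) + ce(logits.T))
```

With perfectly matched orthonormal embeddings the loss approaches 0; mismatched pairs push it toward log N.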
Details
Overview
Masked Visual Reconstruction in Language Semantic Space
The vision encoder / text encoder follow CLIP, and the reconstruction branch is MAE-like.
- $f_i^k$ : original image feature
- $g_i^k$ : feature of image patch reconstructed with MAE
- $\theta$ : projection in the vision encoder
- $z_l^T$ : text feature in text embedding space (up to text projection)
- text feature is used as a kind of “prototype”
KL divergence between the two similarity distributions; the target distribution $p_i^k$ is computed with a stop-gradient.
The final loss is a weighted sum of the two, e.g. contrastive : reconstruction = 2 : 1.
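The reconstruction-in-language-space loss described above can be sketched like this (a minimal numpy sketch under my reading of the section: $g$ are reconstructed patch features, $f$ the original patch features, and the text features act as prototypes; function names and the temperature are my assumptions):

```python
import numpy as np

def softmax(x, tau):
    x = x / tau
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def semantic_recon_loss(g, f, z_text, tau=0.1):
    """KL between similarity distributions over text prototypes.

    g: (K, D) reconstructed patch features (MAE-like decoder output)
    f: (K, D) original patch features (target; stop-gradient in the paper)
    z_text: (T, D) text features used as prototypes
    """
    norm = lambda a: a / np.linalg.norm(a, axis=-1, keepdims=True)
    g, f, z = norm(g), norm(f), norm(z_text)
    q = softmax(g @ z.T, tau)   # predicted distribution over prototypes
    p = softmax(f @ z.T, tau)   # target distribution (no gradient flows here)
    # KL(p || q), averaged over patches
    return (p * (np.log(p) - np.log(q))).sum(axis=-1).mean()
```

Per the 2:1 weighting above, the total objective would then be something like `2 * info_nce(...) + 1 * semantic_recon_loss(...)`.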
Result
- L-20M means the model saw only 20M LAION samples
Ablation
- Table 9: joint training was better than two-stage pipelines such as MIM -> LiT, MIM -> CLIP, and CLIP -> MIM
- Table 10: reconstruction in the language semantic space significantly outperformed pixel-level reconstruction, and also a variant that takes the KL divergence over similarities to random vectors (a high-level but non-language space) instead of text features.