TL;DR
- why I read this : related to the CLIP loss
- task : contrastive learning -> image classification, object detection, semantic segmentation
- problem : how to combine the CLIP and MAE objectives in one pre-training
- idea : set the reconstruction target not at the pixel level, but at the cosine similarity to the CLIP text features! i.e. reconstruct in the language semantic space
- input/output : {image, text} pair
- architecture : ViT-B/16 and its equivalent text encoder (12 heads, 768 hid dim)
- objective : InfoNCE(i2t, t2i), KL(cosine similarity of reconstructed patch and text feature, cosine similarity of original image patch and text feature)
- baseline : CLIP, BEiT, MAE, MAE + CLIP, MAE -> CLIP, etc..
- data : LAION-20M, LAION-50M -> COCO, LVIS, ADE20K
- evaluation : ImageNet(zs, linear probing, finetuning), AP(COCO, LVIS), mIoU(ADE20K)
- result : better performance than the other objectives under the same conditions!
- contribution : combining a contrastive loss with a reconstruction loss, i.e. two heterogeneous losses (the latter specialized for the vision modality only), and aligning them well; thoroughly experimented and well written.
- etc. :
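The InfoNCE(i2t, t2i) part of the objective above can be sketched as follows (a minimal numpy sketch; the function name, temperature value, and variable names are mine, not from the paper):

```python
import numpy as np

def info_nce(img, txt, tau=0.07):
    """Symmetric InfoNCE over a batch of paired {image, text} embeddings.

    img, txt: (N, D) arrays; row i of img is paired with row i of txt.
    """
    # L2-normalize so the dot product is cosine similarity
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / tau            # (N, N) scaled similarities
    labels = np.arange(len(img))          # matched pair sits on the diagonal

    def ce(l):
        # cross-entropy of the diagonal (correct pair) per row
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the image->text and text->image directions
    return 0.5 * (ce(logits) + ce(logits.T))
```

With perfectly matched orthonormal embeddings the loss approaches 0; mismatched pairs push it toward log N.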
Details
Overview
Masked Visual Reconstruction in Language Semantic Space
The vision encoder / text encoder follow CLIP, and the reconstruction branch is MAE-like.
- $f_i^k$ : original image feature
- $g_i^k$ : feature of image patch reconstructed with MAE
- $\theta$ : projection in the vision encoder
- $z_l^T$ : text feature in text embedding space (up to text projection)
- text feature is used as a kind of “prototype”
KL divergence between the two similarity distributions; the target distribution $p_i^k$ is computed with a stop-gradient.
The final loss is a weighted sum of the two, e.g. contrastive : reconstruction = 2 : 1.
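The reconstruction-in-language-space loss described above can be sketched like this (a minimal numpy sketch under my reading of the section: $g$ are reconstructed patch features, $f$ the original patch features, and the text features act as prototypes; function names and the temperature are my assumptions):

```python
import numpy as np

def softmax(x, tau):
    x = x / tau
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def semantic_recon_loss(g, f, z_text, tau=0.1):
    """KL between similarity distributions over text prototypes.

    g: (K, D) reconstructed patch features (MAE-like decoder output)
    f: (K, D) original patch features (target; stop-gradient in the paper)
    z_text: (T, D) text features used as prototypes
    """
    norm = lambda a: a / np.linalg.norm(a, axis=-1, keepdims=True)
    g, f, z = norm(g), norm(f), norm(z_text)
    q = softmax(g @ z.T, tau)   # predicted distribution over prototypes
    p = softmax(f @ z.T, tau)   # target distribution (no gradient flows here)
    # KL(p || q), averaged over patches
    return (p * (np.log(p) - np.log(q))).sum(axis=-1).mean()
```

Per the 2:1 weighting above, the total objective would then be something like `2 * info_nce(...) + 1 * semantic_recon_loss(...)`.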
Result
- L-20M means the model saw only 20M LAION samples
Ablation
- Table 9: joint training was better than two-stage pipelines such as MIM -> LiT, MIM -> CLIP, and CLIP -> MIM
- Table 10: reconstruction in the language semantic space significantly outperformed pixel-level reconstruction, and also a variant that takes the KL divergence over similarities to random vectors (a high-level but non-language space) instead of text features.