
paper

TL;DR

  • task : self-supervised pretraining -> image classification / object detection / segmentation
  • problem : we want a BERT-style pretraining objective for images, i.e. predicting masked inputs
  • idea : frame it as an autoencoder. Since an image token carries less information than a text token (images have heavy spatial redundancy), use a much higher mask ratio (75% in the paper).
  • architecture : asymmetric encoder-decoder. The encoder sees only the unmasked tokens; mask embeddings are then inserted at their original positions, and the decoder reconstructs the full image from that sequence. The encoder is a ViT (ViT-L in the main experiments); the decoder can be any architecture, but the paper uses a small one costing about 10% of the encoder’s computation.
  • objective : mean squared error (MSE) over the masked patches only
  • baseline : supervised pretraining, MoCo v3, BEiT
  • data : self-supervised pretraining on ImageNet-1K, followed by linear probing / fine-tuning. Transfer experiments on COCO, ADE20K, iNaturalist, and Places.
  • result : SOTA when transferred to downstream tasks
  • contribution : a simple architecture with strong results!
  • Limitations or things I don’t understand :
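The pipeline above can be sketched in a few lines. This is a toy numpy illustration of the masking/encode/decode/loss flow, not the paper's code: the embedding size, the random linear "encoder"/"decoder", and the zero mask token are all placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

num_patches, dim = 196, 64   # 14x14 patches; toy embedding size (assumption)
mask_ratio = 0.75            # the paper's default
num_keep = int(num_patches * (1 - mask_ratio))

tokens = rng.normal(size=(num_patches, dim))  # stand-in patch embeddings

# Random masking: shuffle patch indices, keep only the first 25% visible
perm = rng.permutation(num_patches)
keep_idx, mask_idx = perm[:num_keep], perm[num_keep:]

# "Encoder" runs on visible tokens only (here: a random linear map)
W_enc = rng.normal(size=(dim, dim)) / np.sqrt(dim)
latent = tokens[keep_idx] @ W_enc

# Re-insert a mask token at every masked position, then "decode" the full sequence
mask_token = np.zeros(dim)  # placeholder for a learned vector
full = np.empty((num_patches, dim))
full[keep_idx] = latent
full[mask_idx] = mask_token

W_dec = rng.normal(size=(dim, dim)) / np.sqrt(dim)
recon = full @ W_dec

# MSE computed over the masked patches only
loss = np.mean((recon[mask_idx] - tokens[mask_idx]) ** 2)
print(num_keep, float(loss))
```

Note the asymmetry that makes this cheap: the encoder only ever processes 49 of 196 tokens, and the (small) decoder is the only part that sees the full-length sequence.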

Details

Architecture

image

Result

image

It is better to normalize the target (normalize each patch’s pixels to zero mean and unit variance before computing the loss).

Comparison with other SSL methods

image

Pushing the mask ratio even higher still works. I can’t believe I’m a zebra.
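For intuition on how little the encoder actually sees at these ratios, a quick count (assuming the standard 14x14 = 196 patches from a 224px image with 16px patches):

```python
num_patches = 14 * 14  # ViT-style patching: 224x224 image, 16x16 patches
for ratio in (0.50, 0.75, 0.90):
    visible = int(num_patches * (1 - ratio))
    print(f"mask ratio {ratio:.0%}: encoder sees {visible} of {num_patches} patches")
```

At the paper's 75% default the encoder processes only 49 patches, which is where most of the pretraining speedup comes from.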

image

But it works 75% of the time