
TL;DR
- task : self-supervised learning -> image classification / object detection / segmentation
- problem : I want a pretraining scheme that makes masked predictions on images, the way BERT does on text
- idea : Treat it as an autoencoder; since an image patch carries less information per token than a word does (spatial redundancy), use a much higher mask ratio (75% in the paper).
- architecture : asymmetric encoder-decoder. The encoder (ViT-L) sees only the unmasked tokens; mask embeddings are then inserted into the encoder output at their original positions, and the decoder reconstructs the image from this full set. The decoder can be any design; the paper uses a small one that needs about 10% of the encoder's computation.
- objective : mean squared error (MSE) over masked tokens
- baseline : supervised learning, MoCo v3, BEiT
- data : self-supervised pretraining on ImageNet-1K, followed by linear probing / fine-tuning. Transfer experiments on COCO, ADE20K, iNaturalist, Places.
- result : state-of-the-art among self-supervised methods, including when transferred to downstream tasks
- contribution : simple architecture with strong result!
- Limitations or things I don’t understand :
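The encode-only-visible-tokens / decode-with-mask-embeddings flow above can be sketched in a few lines. This is a toy numpy sketch of the idea, not the paper's code: the "encoder" and "decoder" are stand-in linear maps, and the learned mask token is approximated by zeros. Shapes (196 patches of dim 768, as for a 224px image with 16px patches) are illustrative assumptions.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """Keep a random subset of patches; return visible patches plus
    the indices of kept and masked positions."""
    rng = rng or np.random.default_rng(0)
    n, _ = patches.shape
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])
    mask_idx = np.sort(perm[n_keep:])
    return patches[keep_idx], keep_idx, mask_idx

def mae_step(patches, encoder, decoder, mask_ratio=0.75):
    """One conceptual MAE step: encode visible patches only, insert
    mask embeddings at the masked positions, decode the full set,
    and compute MSE on the masked patches alone."""
    n, _ = patches.shape
    visible, keep_idx, mask_idx = random_masking(patches, mask_ratio)
    z = encoder(visible)                 # encoder never sees masked tokens
    full = np.zeros((n, z.shape[1]))
    full[keep_idx] = z
    full[mask_idx] = 0.0                 # stand-in for a learned mask token
    recon = decoder(full)
    # loss only over masked positions, as in the paper's objective
    loss = np.mean((recon[mask_idx] - patches[mask_idx]) ** 2)
    return loss

# Toy linear "networks" just to make the sketch runnable.
rng = np.random.default_rng(0)
patches = rng.standard_normal((196, 768))   # 14x14 patches, dim 768
enc_w = rng.standard_normal((768, 768)) * 0.01
dec_w = rng.standard_normal((768, 768)) * 0.01
loss = mae_step(patches, lambda x: x @ enc_w, lambda z: z @ dec_w)
print(round(float(loss), 4))
```

Note how the compute saving falls out for free: at a 75% mask ratio the encoder processes only 49 of 196 tokens, and only the small decoder ever touches the full sequence.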
Details
Architecture

Result

better to normalize the target (per-patch normalization: shift and scale each patch to zero mean and unit variance)
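The per-patch target normalization is a one-liner; a minimal sketch (shapes and the `eps` value are my assumptions, not from the paper):

```python
import numpy as np

def normalize_targets(patches, eps=1e-6):
    """Per-patch normalization: each patch becomes the reconstruction
    target after shifting/scaling to zero mean and unit variance."""
    mean = patches.mean(axis=-1, keepdims=True)
    var = patches.var(axis=-1, keepdims=True)
    return (patches - mean) / np.sqrt(var + eps)

patches = np.random.default_rng(0).standard_normal((196, 768)) * 3 + 5
norm = normalize_targets(patches)
print(norm.mean(), norm.std())  # approximately 0 and 1
```

Computing the loss against these normalized targets makes the objective focus on local structure rather than absolute brightness of each patch.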
Comparison with other SSL methods

Increasing the mask ratio to very high values (up to ~75%) still works.
I can’t believe I’m a zebra.

But it works 75% of the time