
TL;DR
- task : self-supervised learning -> image classification / object detection / segmentation
- problem : I want a pretraining scheme that makes masked predictions on images, the way BERT does on text
- idea : Treat it as an autoencoder; since an image patch carries less information per token than a word does (spatial redundancy), use a much higher mask ratio (75% in the paper).
- architecture : asymmetric encoder-decoder. The encoder (ViT-L) sees only the unmasked tokens; mask embeddings are then inserted into the encoder output at their original positions, and the decoder reconstructs the image from this full set. The decoder can be any design; the paper uses a small one that needs about 10% of the encoder's computation.
- objective : mean squared error (MSE) over masked tokens
- baseline : supervised learning, MoCo v3, BEiT
- data : self-supervised pretraining on ImageNet-1K, followed by linear probing / fine-tuning. Transfer experiments on COCO, ADE20K, iNaturalist, Places.
- result : state-of-the-art among self-supervised methods, including when transferred to downstream tasks
- contribution : simple architecture with strong result!
- Limitations or things I don’t understand :
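The encode-only-visible-tokens / decode-with-mask-embeddings flow above can be sketched in a few lines. This is a toy numpy sketch of the idea, not the paper's code: the "encoder" and "decoder" are stand-in linear maps, and the learned mask token is approximated by zeros. Shapes (196 patches of dim 768, as for a 224px image with 16px patches) are illustrative assumptions.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, rng=None):
    """Keep a random subset of patches; return visible patches plus
    the indices of kept and masked positions."""
    rng = rng or np.random.default_rng(0)
    n, _ = patches.shape
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])
    mask_idx = np.sort(perm[n_keep:])
    return patches[keep_idx], keep_idx, mask_idx

def mae_step(patches, encoder, decoder, mask_ratio=0.75):
    """One conceptual MAE step: encode visible patches only, insert
    mask embeddings at the masked positions, decode the full set,
    and compute MSE on the masked patches alone."""
    n, _ = patches.shape
    visible, keep_idx, mask_idx = random_masking(patches, mask_ratio)
    z = encoder(visible)                 # encoder never sees masked tokens
    full = np.zeros((n, z.shape[1]))
    full[keep_idx] = z
    full[mask_idx] = 0.0                 # stand-in for a learned mask token
    recon = decoder(full)
    # loss only over masked positions, as in the paper's objective
    loss = np.mean((recon[mask_idx] - patches[mask_idx]) ** 2)
    return loss

# Toy linear "networks" just to make the sketch runnable.
rng = np.random.default_rng(0)
patches = rng.standard_normal((196, 768))   # 14x14 patches, dim 768
enc_w = rng.standard_normal((768, 768)) * 0.01
dec_w = rng.standard_normal((768, 768)) * 0.01
loss = mae_step(patches, lambda x: x @ enc_w, lambda z: z @ dec_w)
print(round(float(loss), 4))
```

Note how the compute saving falls out for free: at a 75% mask ratio the encoder processes only 49 of 196 tokens, and only the small decoder ever touches the full sequence.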
Details
Architecture

Result

better to normalize the target (per-patch normalization: shift and scale each patch to zero mean and unit variance)
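The per-patch target normalization is a one-liner; a minimal sketch (shapes and the `eps` value are my assumptions, not from the paper):

```python
import numpy as np

def normalize_targets(patches, eps=1e-6):
    """Per-patch normalization: each patch becomes the reconstruction
    target after shifting/scaling to zero mean and unit variance."""
    mean = patches.mean(axis=-1, keepdims=True)
    var = patches.var(axis=-1, keepdims=True)
    return (patches - mean) / np.sqrt(var + eps)

patches = np.random.default_rng(0).standard_normal((196, 768)) * 3 + 5
norm = normalize_targets(patches)
print(norm.mean(), norm.std())  # approximately 0 and 1
```

Computing the loss against these normalized targets makes the objective focus on local structure rather than absolute brightness of each patch.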
Comparison with other SSL methods

Increasing the mask ratio to very high values (up to ~75%) still works.
I can’t believe I’m a zebra.

But it works 75% of the time