
TL;DR
- task : self-supervised learning -> image classification, object detection, image segmentation
- problem : How should we choose which tokens to mask in Masked Image Modeling (MIM) when using it for SSL?
- idea : Mask the tokens that receive high attention scores when the image is fed into a ViT!
- architecture : The teacher ViT receives all input tokens (the full, unmasked image) and produces attention maps; the tokens with the highest attention scores are then masked out for the student, which takes on the MIM task. The teacher's weights are updated as an exponential moving average (EMA) of the student's weights. The architecture is based on ViT-S/16
- objective : MIM loss (= reconstruction loss) + distillation loss (difference between the student's and teacher's outputs for the [CLS] token)
- baseline : iBOT, DINO, MST
- data : ImageNet-1k for pretraining; CIFAR-10, CIFAR-100, Oxford Flowers, COCO, ADE20K for downstream evaluation
- result : higher performance than random masking
- contribution : Exploring masking strategies in MIM
- Limitations or things I don’t understand :
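The masking strategy and EMA teacher update above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the mask ratio value, and the momentum value are my assumptions, and the [CLS] attention vector is assumed to come head-averaged from the teacher's last layer.

```python
import numpy as np

def attention_guided_mask(cls_attention, mask_ratio=0.4):
    """Pick which patch tokens to mask: those with the highest
    [CLS]-to-patch attention scores from the teacher.

    cls_attention: (num_tokens,) head-averaged attention scores.
    Returns a boolean mask where True = token is hidden from the student.
    (mask_ratio=0.4 is an assumed value, not from the paper.)
    """
    num_tokens = cls_attention.shape[0]
    num_masked = int(num_tokens * mask_ratio)
    # Sort descending and mask the most-attended tokens.
    top_idx = np.argsort(cls_attention)[::-1][:num_masked]
    mask = np.zeros(num_tokens, dtype=bool)
    mask[top_idx] = True
    return mask

def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher weights = EMA of student weights (momentum is assumed)."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

For ViT-S/16 on 224x224 inputs there are 196 patch tokens, so a 0.4 ratio would hide the 78 most-attended tokens from the student each step.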
Details
