paper

TL;DR

  • task : self-supervised learning -> image classification, object detection, image segmentation
  • problem : how to select which tokens to mask in the Masked Image Modeling (MIM) objective used for SSL
  • idea : mask the tokens that receive high attention scores when the image is fed into a ViT!
  • architecture : the teacher ViT receives all input tokens, and the tokens with the highest attention scores are masked; the student then solves the MIM task on the masked input. The teacher’s weights are updated as an exponential moving average (EMA) of the student’s weights. The architecture is based on ViT-S/16
  • objective : MIM loss (= reconstruction loss) + distillation loss (difference between the [CLS]-token outputs of the student and teacher)
  • baseline : iBOT, DINO, MST
  • data : ImageNet-1k for pretraining, CIFAR-10, CIFAR-100, Oxford Flower, COCO, ADE20K
  • result : attention-guided masking achieves higher performance than random masking
  • contribution : Exploring masking strategies in MIM
  • Limitations or things I don’t understand :
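The masking strategy and EMA update above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: `attention_guided_mask` and `ema_update` are hypothetical helper names, the mask ratio is illustrative, and a real pipeline would take the [CLS] attention from the teacher ViT's last layer.

```python
import numpy as np

def attention_guided_mask(cls_attn, mask_ratio=0.5):
    """Select the patch tokens with the highest [CLS] attention scores to mask.

    cls_attn: (num_patches,) attention from the teacher's [CLS] token to each patch.
    Returns a boolean mask; True means the token is hidden from the student.
    """
    num_mask = int(len(cls_attn) * mask_ratio)
    # Indices of the most salient patches (highest attention first).
    top_idx = np.argsort(cls_attn)[::-1][:num_mask]
    mask = np.zeros(len(cls_attn), dtype=bool)
    mask[top_idx] = True
    return mask

def ema_update(teacher_w, student_w, momentum=0.996):
    """Teacher weights track an exponential moving average of the student's."""
    return momentum * teacher_w + (1.0 - momentum) * student_w

# Toy example: 8 patch tokens, mask the top half by attention score.
attn = np.array([0.30, 0.05, 0.20, 0.02, 0.15, 0.10, 0.08, 0.12])
mask = attention_guided_mask(attn, mask_ratio=0.5)
print(mask.nonzero()[0])  # → [0 2 4 7], the high-attention patches
```

Because the student never sees the most informative patches, it has to reconstruct them from context, which is what makes the task harder than random masking.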

Details
