paper

TL;DR

  • task : image pretraining
  • problem : simple and effective pretraining
  • idea : Don’t put PE in input, predict it in output!
  • architecture : Basically ViT, but with cross-attention: m context patches are drawn out of the n patches, Q = all n patches, K = V = the m context patches.
  • objective : Cross Entropy Loss
  • baseline : ResNeXt, ViT-S, MoCo v3, MAE
  • data : CIFAR-100, ImageNet, ImageNet-1K
  • result : Efficient pretraining (it doesn’t attend over all patches); better performance than ViT-S or MoCo v3 at 100 epochs (but lower than ResNeXt). Performance is lower than MAE, but when ensembled it outperforms MAE trained for 1600 epochs, suggesting it learned a different representation.
  • contribution : simple!
  • Limitations or things I don’t understand :
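
The architecture and objective above can be sketched roughly as follows. This is a hypothetical minimal implementation of the idea (not the authors' code): patch embeddings carry no positional embedding, m context patches serve as K = V in cross-attention with Q = all n patches, and a linear head predicts each patch's position as an n-way classification trained with cross-entropy.

```python
# Hypothetical sketch: predict patch positions instead of feeding PE as input.
import torch
import torch.nn as nn

class PositionPredictionBlock(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        # Q = all patches, K = V = the m sampled context patches
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                 nn.Linear(dim * 4, dim))

    def forward(self, x, ctx):
        x = x + self.attn(self.norm1(x), ctx, ctx)[0]
        return x + self.mlp(self.norm2(x))

def pretrain_step(patches, m, block, head):
    """patches: (B, n, dim) patch embeddings WITHOUT positional embedding."""
    B, n, _ = patches.shape
    idx = torch.randperm(n)[:m]           # draw m context patches out of n
    ctx = patches[:, idx]                 # K = V = context patches
    feats = block(patches, ctx)           # Q = all n patches
    logits = head(feats)                  # (B, n, n): n-way position classes
    target = torch.arange(n).expand(B, n) # true position index of each patch
    return nn.functional.cross_entropy(logits.reshape(-1, n),
                                       target.reshape(-1))

dim, n, m = 64, 16, 4
block, head = PositionPredictionBlock(dim), nn.Linear(dim, n)
loss = pretrain_step(torch.randn(2, n, dim), m, block, head)
```

Because K and V are only the m context patches, the attention cost scales with n·m rather than n², which is where the pretraining efficiency comes from.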

Details
