idea : Don't put the positional embedding (PE) in the input; predict it in the output!
architecture : Basically a ViT, but with cross-attention: m context patches are drawn out of the n patches, with Q = all n patches and K = V = the m context patches.
objective : Cross Entropy Loss
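The architecture and objective above can be sketched in a few lines of numpy: queries come from all n patches, keys/values from the m sampled context patches, and each patch is classified into one of n position classes with cross-entropy. All weights, dimensions, and the single-head/single-layer setup here are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, d = 16, 4, 8           # n patches total, m context patches, embed dim (assumed)
x = rng.normal(size=(n, d))  # patch embeddings WITHOUT positional encoding

# hypothetical random projection weights, for illustration only
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
W_head = rng.normal(size=(d, n))  # head: classify each patch into one of n positions

ctx = rng.choice(n, size=m, replace=False)  # draw m context patches out of n

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# cross-attention: queries from ALL patches, keys/values from context patches only
Q, K, V = x @ Wq, x[ctx] @ Wk, x[ctx] @ Wv
attn = softmax(Q @ K.T / np.sqrt(d))  # (n, m): each patch attends over context
h = attn @ V                          # (n, d)

# position prediction: logits over the n possible positions for each patch
logits = h @ W_head                   # (n, n)
probs = softmax(logits)
targets = np.arange(n)                # the true position of patch i is i
ce_loss = -np.log(probs[targets, targets]).mean()
```

Note the efficiency angle from the results below: attention is computed against only m context patches instead of all n, so the attention matrix is n×m rather than n×n.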
baseline : ResNeXt, ViT-S, MoCo v3, MAE
data : CIFAR-100, ImageNet-1K
result : Efficient pretraining (it does not attend over all patches). Better performance than ViT-S or MoCo v3 at 100 epochs (but lower than ResNeXt). Performance is lower than MAE alone, but an ensemble with MAE outperforms MAE trained for 1600 epochs, supporting the claim that it learns a different representation.