paper
paper summary
implementation
Problem: Solve image classification with a purely self-attentive architecture, minimizing changes to the standard Transformer.
Solution: Cut the image into P x P patches, flatten each patch, and project it to D dimensions with a learned linear projection. Prepend a [CLS] token to the resulting patch embeddings and feed the sequence into the Transformer encoder. An MLP head on the [CLS] output is trained during pre-training; fine-tuning replaces only this classification head.
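The patchify-and-project step can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation; the image size, patch size, and embedding dimension below are illustrative assumptions, and the projection, [CLS] token, and positional embeddings are randomly initialized stand-ins for learned parameters.

```python
import numpy as np

# Illustrative dimensions (assumptions, not the paper's configuration).
H = W = 32      # image height/width
C = 3           # channels
P = 8           # patch size
D = 64          # embedding dimension

rng = np.random.default_rng(0)
image = rng.standard_normal((H, W, C))

# Cut the image into non-overlapping P x P patches and flatten each one.
num_patches = (H // P) * (W // P)
patches = image.reshape(H // P, P, W // P, P, C)              # split both spatial axes
patches = patches.transpose(0, 2, 1, 3, 4).reshape(num_patches, P * P * C)

# Linear projection of each flattened patch to D dimensions.
W_proj = rng.standard_normal((P * P * C, D)) * 0.02           # stand-in for learned weights
embeddings = patches @ W_proj                                 # (num_patches, D)

# Prepend a [CLS] token and add positional embeddings; this sequence
# is the Transformer encoder input.
cls_token = rng.standard_normal((1, D)) * 0.02
pos_embed = rng.standard_normal((num_patches + 1, D)) * 0.02
encoder_input = np.concatenate([cls_token, embeddings], axis=0) + pos_embed

print(encoder_input.shape)  # (17, 64): 16 patches + 1 [CLS] token
```

At fine-tuning time, only the MLP head reading the [CLS] position of the encoder output would be swapped for a new classification head.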
Result: On small datasets it underperformed the ResNet family, but when pre-trained on large datasets it required less training compute than ResNets and outperformed the SOTA.