paper
paper summary
implementation
Problem: Solve image classification with a purely self-attentive architecture, minimizing changes to the standard Transformer.
Solution: Cut the image into P x P patches, flatten each patch, and project it to D dimensions with a learned linear projection. Prepend a [CLS] token to the resulting patch embeddings and feed the sequence into the Transformer encoder. An MLP head on the [CLS] output is trained during pre-training; fine-tuning replaces only this classification head.
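The patchify-and-project step can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation; the image size, patch size, and embedding dimension below are illustrative assumptions, and the projection, [CLS] token, and positional embeddings are randomly initialized stand-ins for learned parameters.

```python
import numpy as np

# Illustrative dimensions (assumptions, not the paper's configuration).
H = W = 32      # image height/width
C = 3           # channels
P = 8           # patch size
D = 64          # embedding dimension

rng = np.random.default_rng(0)
image = rng.standard_normal((H, W, C))

# Cut the image into non-overlapping P x P patches and flatten each one.
num_patches = (H // P) * (W // P)
patches = image.reshape(H // P, P, W // P, P, C)              # split both spatial axes
patches = patches.transpose(0, 2, 1, 3, 4).reshape(num_patches, P * P * C)

# Linear projection of each flattened patch to D dimensions.
W_proj = rng.standard_normal((P * P * C, D)) * 0.02           # stand-in for learned weights
embeddings = patches @ W_proj                                 # (num_patches, D)

# Prepend a [CLS] token and add positional embeddings; this sequence
# is the Transformer encoder input.
cls_token = rng.standard_normal((1, D)) * 0.02
pos_embed = rng.standard_normal((num_patches + 1, D)) * 0.02
encoder_input = np.concatenate([cls_token, embeddings], axis=0) + pos_embed

print(encoder_input.shape)  # (17, 64): 16 patches + 1 [CLS] token
```

At fine-tuning time, only the MLP head reading the [CLS] position of the encoder output would be swapped for a new classification head.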
Result: On small datasets it underperformed the ResNet family, but when pre-trained on large datasets it required less training compute than ResNets and outperformed the SOTA.