
paper

TL;DR

  • task : image classification
  • problem : a vision backbone that uses neither convolutions nor attention
  • idea : Keep ViT’s patch-based input pipeline, but replace attention and convolution entirely with MLPs!
  • architecture : Cut the image into non-overlapping patches and linearly project each patch to C dimensions with a single shared projection. This yields a table of S patches × C channels, $X \in \mathbb{R}^{S\times C}$. A “token-mixing MLP” is applied along the column (token) dimension and a “channel-mixing MLP” along the row (channel) dimension.
  • objective : CrossEntropy Loss
  • baseline : BiT-R, Mixer-L, HaloNet
  • data : ILSVRC2012 ImageNet, CIFAR-10/100, Oxford-IIIT-Pets, JFT-300M
  • result : accuracy comparable to the baselines, with higher throughput at similar FLOPs
  • contribution : linear complexity in the number of patches (no attention), a simple architecture, MLPs revisited!
  • Limitations or things I don’t understand :
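The token-mixing / channel-mixing idea above can be sketched in a few lines of numpy. This is a minimal, hypothetical sketch of one Mixer block (not the authors’ code): the sizes `S, C, Ds, Dc` and the `init` helper are made up for illustration, and skip connections plus pre-LayerNorm follow the standard Mixer layout.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize over the last (channel) dimension, as in pre-norm Mixer blocks
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, w1, b1, w2, b2):
    # two-layer MLP applied along the last dimension of x
    return gelu(x @ w1 + b1) @ w2 + b2

def mixer_block(X, params):
    # X: (S, C) table of S patch tokens with C channels
    # token-mixing: transpose so the MLP mixes information across the S tokens
    Y = X + mlp(layer_norm(X).T, *params["token"]).T
    # channel-mixing: the MLP mixes information across the C channels per token
    return Y + mlp(layer_norm(Y), *params["channel"])

# hypothetical sizes: S patches, C channels, hidden widths Ds (token) and Dc (channel)
rng = np.random.default_rng(0)
S, C, Ds, Dc = 16, 32, 64, 128

def init(d_in, d_hidden):
    return (rng.standard_normal((d_in, d_hidden)) * 0.02, np.zeros(d_hidden),
            rng.standard_normal((d_hidden, d_in)) * 0.02, np.zeros(d_in))

params = {"token": init(S, Ds), "channel": init(C, Dc)}
X = rng.standard_normal((S, C))
out = mixer_block(X, params)
print(out.shape)  # (16, 32) — the block preserves the (S, C) shape
```

Since each block maps $\mathbb{R}^{S\times C}\to\mathbb{R}^{S\times C}$, the full model just stacks these blocks, then global-average-pools over tokens and applies a linear classifier head.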

Details
