problem : a vision backbone that uses neither CNNs nor Transformers
idea : Let’s follow ViT’s input method, but do it with MLP only, without attention or convolution!
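ViT's input step (patchify, then a single linear projection per patch) can be sketched as follows. This is a minimal illustration, not the paper's code; the image size, patch size, and projection matrix `W` are made-up values for the example.

```python
import numpy as np

def patchify_project(img, patch, W):
    # img: (H, W, 3). Split into non-overlapping patch x patch squares,
    # flatten each square, and linearly project it to C dims with one
    # shared matrix W — exactly ViT's patch-embedding step, no convolution.
    H, Wd, ch = img.shape
    ph, pw = H // patch, Wd // patch
    patches = img.reshape(ph, patch, pw, patch, ch).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(ph * pw, patch * patch * ch)  # (S, P*P*3)
    return patches @ W                                      # (S, C)

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, 3))      # toy 32x32 RGB image
W = rng.standard_normal((16 * 16 * 3, 8))   # project each patch to C=8 dims
tokens = patchify_project(img, 16, W)
print(tokens.shape)  # (4, 8): S=4 patch tokens, C=8 channels
```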
architecture : Cut the image into S non-overlapping patches and linearly project each one to C dimensions with a single shared projection. This yields a patch-token matrix $X \in \mathbb{R}^{S\times C}$; a "token-mixing MLP" is then applied along the column (token) dimension and a "channel-mixing MLP" along the row (channel) dimension.
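One Mixer block over the $\mathbb{R}^{S\times C}$ matrix can be sketched as below. This is a simplified illustration under stated assumptions: LayerNorm is omitted for brevity, the weights and dimensions (`Ds`, `Dc`) are arbitrary toy values, and a tanh-based GELU approximation stands in for the paper's nonlinearity.

```python
import numpy as np

def mlp(x, w1, w2):
    # two-layer MLP with a tanh-approximated GELU nonlinearity
    h = x @ w1
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return h @ w2

def mixer_block(X, Wt1, Wt2, Wc1, Wc2):
    # X: (S, C) patch-token matrix
    # token-mixing: MLP along the S (column) axis, shared across channels,
    # so we transpose, mix, and transpose back; skip connection added
    Y = X + mlp(X.T, Wt1, Wt2).T
    # channel-mixing: MLP along the C (row) axis, shared across tokens
    return Y + mlp(Y, Wc1, Wc2)

rng = np.random.default_rng(0)
S, C, Ds, Dc = 4, 8, 16, 32     # toy sizes: 4 tokens, 8 channels
X = rng.standard_normal((S, C))
out = mixer_block(X,
                  rng.standard_normal((S, Ds)), rng.standard_normal((Ds, S)),
                  rng.standard_normal((C, Dc)), rng.standard_normal((Dc, C)))
print(out.shape)  # (4, 8): shape is preserved, so blocks can be stacked
```

Because each block maps $(S, C)$ to $(S, C)$, the blocks stack depth-wise without any attention or convolution.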
objective : CrossEntropy Loss
baseline : BiT-R (Big Transfer ResNets), ViT, HaloNet
data : ILSVRC2012 ImageNet, CIFAR-10/100, Oxford-IIIT Pets, JFT-300M
result : accuracy comparable to the baselines, with higher inference throughput and lower pre-training FLOPs