problem : a vision backbone that uses neither CNNs nor Transformers
idea : Let’s follow ViT’s input method, but do it with MLP only, without attention or convolution!
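ViT's input step (patchify, then a single linear projection per patch) can be sketched as follows. This is a minimal illustration, not the paper's code; the image size, patch size, and projection matrix `W` are made-up values for the example.

```python
import numpy as np

def patchify_project(img, patch, W):
    # img: (H, W, 3). Split into non-overlapping patch x patch squares,
    # flatten each square, and linearly project it to C dims with one
    # shared matrix W — exactly ViT's patch-embedding step, no convolution.
    H, Wd, ch = img.shape
    ph, pw = H // patch, Wd // patch
    patches = img.reshape(ph, patch, pw, patch, ch).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(ph * pw, patch * patch * ch)  # (S, P*P*3)
    return patches @ W                                      # (S, C)

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, 3))      # toy 32x32 RGB image
W = rng.standard_normal((16 * 16 * 3, 8))   # project each patch to C=8 dims
tokens = patchify_project(img, 16, W)
print(tokens.shape)  # (4, 8): S=4 patch tokens, C=8 channels
```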
architecture : Cut the image into S non-overlapping patches and linearly project each one to C dimensions with a single shared projection. This yields a patch-token matrix $X \in \mathbb{R}^{S\times C}$; a "token-mixing MLP" is then applied along the column (token) dimension and a "channel-mixing MLP" along the row (channel) dimension.
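One Mixer block over the $\mathbb{R}^{S\times C}$ matrix can be sketched as below. This is a simplified illustration under stated assumptions: LayerNorm is omitted for brevity, the weights and dimensions (`Ds`, `Dc`) are arbitrary toy values, and a tanh-based GELU approximation stands in for the paper's nonlinearity.

```python
import numpy as np

def mlp(x, w1, w2):
    # two-layer MLP with a tanh-approximated GELU nonlinearity
    h = x @ w1
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return h @ w2

def mixer_block(X, Wt1, Wt2, Wc1, Wc2):
    # X: (S, C) patch-token matrix
    # token-mixing: MLP along the S (column) axis, shared across channels,
    # so we transpose, mix, and transpose back; skip connection added
    Y = X + mlp(X.T, Wt1, Wt2).T
    # channel-mixing: MLP along the C (row) axis, shared across tokens
    return Y + mlp(Y, Wc1, Wc2)

rng = np.random.default_rng(0)
S, C, Ds, Dc = 4, 8, 16, 32     # toy sizes: 4 tokens, 8 channels
X = rng.standard_normal((S, C))
out = mixer_block(X,
                  rng.standard_normal((S, Ds)), rng.standard_normal((Ds, S)),
                  rng.standard_normal((C, Dc)), rng.standard_normal((Dc, C)))
print(out.shape)  # (4, 8): shape is preserved, so blocks can be stacked
```

Because each block maps $(S, C)$ to $(S, C)$, the blocks stack depth-wise without any attention or convolution.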
objective : CrossEntropy Loss
baseline : BiT-R (Big Transfer ResNets), ViT, HaloNet
data : ILSVRC2012 ImageNet, CIFAR-10/100, Oxford-IIIT Pets, JFT-300M
result : accuracy comparable to the baselines, with higher inference throughput and lower pre-training FLOPs