paper

TL;DR

  • task : image classification / object detection / instance segmentation / vision backbone
  • problem : Prior work shows that the attention module in the Transformer, which mixes information between tokens, can be replaced with an MLP and still perform well, suggesting attention itself may not be the key ingredient.
  • idea : Treat the token mixer (self-attention, MLP, etc.) as an abstract module; the paper calls this general architecture MetaFormer.
  • architecture : token -> token embedding -> [“token mixer” -> FFN] blocks with residual connections. The paper proposes simple average pooling as the token mixer (PoolFormer); see the block sketch after this list.
  • objective : standard loss for each task (e.g., cross-entropy for classification)
  • baseline : RSB-ResNet, ViT, DeiT, PVT, MLP-Mixer, ResMLP, Swin-Mixer,…
  • data : ImageNet-1K, COCO, ADE20K
  • result : Performance comparable to SOTA models. Achieves higher ImageNet-1K top-1 accuracy than DeiT or ResMLP with fewer parameters.
  • contribution : Generalizes attention- and MLP-based models into the MetaFormer abstraction, suggesting that the overall architecture, rather than the specific token mixer, is what drives performance.
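
Since the TL;DR compresses the architecture into one line, below is a minimal PyTorch sketch of a single PoolFormer block following the structure above: norm -> pooling token mixer -> residual, then norm -> channel MLP -> residual. It omits pieces of the official implementation such as LayerScale and stochastic depth, and the class names and `mlp_ratio` default are assumptions for illustration.

```python
import torch
import torch.nn as nn

class Pooling(nn.Module):
    """Token mixer: average pooling. The input is subtracted here
    because the block adds it back via the residual connection."""
    def __init__(self, pool_size: int = 3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1,
                                 padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pool(x) - x

class PoolFormerBlock(nn.Module):
    """One MetaFormer block with pooling as the token mixer.
    The channel MLP uses 1x1 convolutions so tensors stay in
    (B, C, H, W) layout throughout."""
    def __init__(self, dim: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)  # layer-norm-like over channels
        self.token_mixer = Pooling()
        self.norm2 = nn.GroupNorm(1, dim)
        hidden = dim * mlp_ratio
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, hidden, 1),
            nn.GELU(),
            nn.Conv2d(hidden, dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.token_mixer(self.norm1(x))  # token mixing sub-block
        x = x + self.mlp(self.norm2(x))          # channel mixing sub-block
        return x

# quick shape check
x = torch.randn(1, 64, 56, 56)
print(PoolFormerBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```

Stacking such blocks in a hierarchical multi-stage layout, with strided patch embeddings between stages, yields the full PoolFormer backbone described in the paper.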
