
paper
TL;DR#
- task : image classification / object detection / instance segmentation / vision backbone
- problem : Prior work shows that the attention module, which mixes information across tokens in a transformer, can be replaced by an MLP with little loss in performance.
- idea : Treat the token mixer (self-attention, MLP, etc.) as an abstract, replaceable module — the general architecture the paper calls MetaFormer.
- architecture : token -> token embedding -> "token mixer" -> FFN. The paper proposes simple pooling as the token mixer (PoolFormer).
- objective : loss for each task
- baseline : RSB-ResNet, ViT, DeiT, PVT, MLP-Mixer, ResMLP, Swin-Mixer, …
- data : ImageNet-1K, COCO, ADE20K
- result : Performance comparable to SOTA models. Achieves higher ImageNet-1K top-1 accuracy with fewer parameters than DeiT or ResMLP.
- contribution : Generalizes transformer/MLP-style models under the MetaFormer abstraction; shows that even a trivial token mixer (pooling) is competitive, suggesting the overall architecture matters more than the specific mixer.
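The pooling token mixer above is just stride-1 average pooling with the input subtracted (the official implementation uses `torch.nn.AvgPool2d` and subtracts `x` so the residual branch carries the identity). A minimal NumPy sketch, with edge padding as a simplifying assumption:

```python
import numpy as np

def pool_token_mixer(x, pool_size=3):
    """PoolFormer-style token mixer sketch: per-channel average pooling
    over a pool_size x pool_size neighborhood (stride 1, edge padding),
    minus the input. x has shape (H, W, C)."""
    H, W, C = x.shape
    pad = pool_size // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            # mean over the spatial window mixes neighboring tokens
            out[i, j] = xp[i:i + pool_size, j:j + pool_size, :].mean(axis=(0, 1))
    # subtract the input: pooling only mixes; the block's residual
    # connection supplies the identity path
    return out - x
```

Note the subtraction means a spatially constant input maps to zero, i.e. the mixer contributes nothing when there is no neighbor information to mix.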
Details#
