paper

TL;DR

  • task : image classification / object detection / instance segmentation / vision backbone
  • problem : Prior work shows that the attention module in the Transformer, which mixes information between tokens, can be replaced with an MLP and still perform well, suggesting attention itself may not be the key ingredient.
  • idea : Treat the token mixer (self-attention, MLP, etc.) as an abstract module; the paper calls this general architecture MetaFormer.
  • architecture : token -> token embedding -> [“token mixer” -> FFN] blocks with residual connections. The paper proposes simple average pooling as the token mixer (PoolFormer); see the block sketch after this list.
  • objective : standard loss for each task (e.g., cross-entropy for classification)
  • baseline : RSB-ResNet, ViT, DeiT, PVT, MLP-Mixer, ResMLP, Swin-Mixer,…
  • data : ImageNet-1K, COCO, ADE20K
  • result : Performance comparable to SOTA models. Achieves higher ImageNet-1K top-1 accuracy than DeiT or ResMLP with fewer parameters.
  • contribution : Generalizes attention- and MLP-based models into the MetaFormer abstraction, suggesting that the overall architecture, rather than the specific token mixer, is what drives performance.
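
Since the TL;DR compresses the architecture into one line, below is a minimal PyTorch sketch of a single PoolFormer block following the structure above: norm -> pooling token mixer -> residual, then norm -> channel MLP -> residual. It omits pieces of the official implementation such as LayerScale and stochastic depth, and the class names and `mlp_ratio` default are assumptions for illustration.

```python
import torch
import torch.nn as nn

class Pooling(nn.Module):
    """Token mixer: average pooling. The input is subtracted here
    because the block adds it back via the residual connection."""
    def __init__(self, pool_size: int = 3):
        super().__init__()
        self.pool = nn.AvgPool2d(pool_size, stride=1,
                                 padding=pool_size // 2,
                                 count_include_pad=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pool(x) - x

class PoolFormerBlock(nn.Module):
    """One MetaFormer block with pooling as the token mixer.
    The channel MLP uses 1x1 convolutions so tensors stay in
    (B, C, H, W) layout throughout."""
    def __init__(self, dim: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)  # layer-norm-like over channels
        self.token_mixer = Pooling()
        self.norm2 = nn.GroupNorm(1, dim)
        hidden = dim * mlp_ratio
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, hidden, 1),
            nn.GELU(),
            nn.Conv2d(hidden, dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.token_mixer(self.norm1(x))  # token mixing sub-block
        x = x + self.mlp(self.norm2(x))          # channel mixing sub-block
        return x

# quick shape check
x = torch.randn(1, 64, 56, 56)
print(PoolFormerBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```

Stacking such blocks in a hierarchical multi-stage layout, with strided patch embeddings between stages, yields the full PoolFormer backbone described in the paper.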
