paper

TL;DR

  • task : image classification
  • problem : vision backbone without CNN and transformer
  • idea : follow ViT's input scheme, but use only MLPs, with no attention or convolution!
  • architecture : cut the image into non-overlapping patches and map each patch to C dimensions with a single shared projection. This gives S tokens of dimension C, i.e. a matrix in $\mathbb{R}^{S\times C}$. Applying an MLP along the column dimension gives the "token-mixing MLP"; applying one along the row dimension gives the "channel-mixing MLP".
  • objective : CrossEntropy Loss
  • baseline : BiT-R, ViT, HaloNet
  • data : ILSVRC2012 ImageNet, CIFAR-10/100, Oxford-IIIT Pets, JFT-300M
  • result : comparable accuracy, high throughput, FLOPS
  • contribution : O(n) complexity in the number of patches, simple architecture, MLP revisited!
  • limitation or parts I did not understand :
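
The architecture bullet above can be sketched in NumPy roughly as follows. This is a minimal illustration, not the paper's code: the helper names (`patchify`, `mlp`, `mixer_block`, `init_mlp`), the sizes, and the random weights are all assumptions made for the example. The LayerNorm here has no learned scale or shift, and the residual connections and LayerNorm placement are taken from the Mixer paper rather than from this note.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-6):
    # normalize over the channel (last) axis; learned scale/shift omitted
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def mlp(x, w1, b1, w2, b2):
    # two-layer MLP applied to the last axis of x
    return gelu(x @ w1 + b1) @ w2 + b2

def patchify(img, p):
    # (H, W, ch) image -> (S, p*p*ch) matrix of non-overlapping patches
    h, w, ch = img.shape
    x = img.reshape(h // p, p, w // p, p, ch).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, p * p * ch)

def mixer_block(x, tok_params, chan_params):
    # x has shape (S, C): S tokens (rows), C channels (columns)
    y = layer_norm(x)
    x = x + mlp(y.T, *tok_params).T   # token mixing: MLP over each column (length S)
    y = layer_norm(x)
    x = x + mlp(y, *chan_params)      # channel mixing: MLP over each row (length C)
    return x

rng = np.random.default_rng(0)

def init_mlp(d_in, d_hidden):
    # hypothetical initializer: small random weights, zero biases
    return (rng.normal(size=(d_in, d_hidden)) * 0.02, np.zeros(d_hidden),
            rng.normal(size=(d_hidden, d_in)) * 0.02, np.zeros(d_in))

# Illustrative sizes only: 32x32 RGB image, 8x8 patches, C = 64
P, C, D_S, D_C = 8, 64, 32, 128
img = rng.normal(size=(32, 32, 3))
patches = patchify(img, P)                        # (S, P*P*3) = (16, 192)
S = patches.shape[0]
w_embed = rng.normal(size=(patches.shape[1], C)) * 0.02
tokens = patches @ w_embed                        # one shared projection -> (S, C)

out = mixer_block(tokens, init_mlp(S, D_S), init_mlp(C, D_C))
print(out.shape)  # (16, 64)
```

The same token-mixing weights are shared across all C columns and the same channel-mixing weights across all S rows, which is why the only difference between the two MLPs in this sketch is a transpose.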

Details

image image image image