
paper

TL;DR

  • I read this because : cited by DETR. The FFN in a transformer looks like a 1x1 convolution, so I was curious about the claim that the encoder can be viewed as an “attention augmented convolutional network” via its FFN.
  • task : image classification / object detection
  • Problem : CNNs see only local information (limited receptive field), while self-attention can capture long-range dependencies.
  • IDEA : Let’s combine the two!
  • architecture : As the image comes in, apply multi-head self-attention (MSA) over the (h, w) positions, treating the channel dimension as the hidden vector. Each pixel gets a relative positional embedding. Concatenate the attention output with the convolution output to form the attention-augmented convolution.
  • baseline : ResNet50, RetinaNet50, channel-wise reweighting (Squeeze-and-Excitation, Gather-Excite), independent channel / spatial reweighting (BAM, CBAM)
  • data : ImageNet, COCO
  • evaluation : accuracy / mAP
  • result : ImageNet / ResNet50 top-1 accuracy improves by 1.3%, and COCO / RetinaNet mAP improves by 1.4.
  • contribution :
  • limitation / things I cannot understand :
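The architecture bullet above can be sketched in a few lines. This is a minimal single-head NumPy sketch, not the paper's implementation: the convolution branch is reduced to a 1x1 conv (a matmul over channels), relative positions are omitted here, and all weight names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aug_conv(x, w_conv, w_q, w_k, w_v):
    """x: (H, W, C_in). Single-head sketch; the conv branch is a 1x1 conv
    for brevity, and relative positional logits are left out."""
    H, W, C = x.shape
    conv_out = x @ w_conv                      # convolution branch: (H, W, C_conv)
    flat = x.reshape(H * W, C)                 # attend over all (h, w) positions
    q, k, v = flat @ w_q, flat @ w_k, flat @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    attn_out = (attn @ v).reshape(H, W, -1)    # attention branch: (H, W, d_v)
    # concatenate along channels: attention-augmented convolution
    return np.concatenate([conv_out, attn_out], axis=-1)
```

The channel split between the two branches (C_conv vs. d_v) is a tunable ratio in the paper.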

Details

Architecture

(figures: attention-augmented convolution architecture)
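The per-pixel relative positional embedding is decomposed per axis: a height term and a width term are summed into one attention bias. A sketch of that bookkeeping, assuming hypothetical precomputed per-axis logit tables (the paper derives them from the queries and learned relative embeddings):

```python
import numpy as np

def rel_logits_2d(rel_h, rel_w):
    """Combine per-axis relative-position logits into a full (HW, HW) bias.
    rel_h[i, i2] scores the row pair i -> i2; rel_w[j, j2] the column pair.
    Both tables are hypothetical stand-ins for the paper's learned embeddings."""
    H, W = rel_h.shape[0], rel_w.shape[0]
    # bias for pixel (i, j) attending to (i2, j2) = rel_h[i, i2] + rel_w[j, j2]
    full = rel_h[:, None, :, None] + rel_w[None, :, None, :]   # (H, W, H, W)
    return full.reshape(H * W, H * W)
```

This bias is added to the content logits before the softmax, which is how each pixel sees where the others are.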

Result

(figures: ImageNet and COCO results)