I read this because : Cited by DETR. The FFN of the Transformer looks like a 1x1 convolution, so I was curious about the claim that the encoder can be viewed as an “attention augmented convolutional network” via the FFN.
task : image classification / object detection
Problem : A CNN sees only local information, while self-attention can capture long-range dependencies.
IDEA : Let’s combine the two!
architecture : Given the input image, apply MSA over the (h, w) positions (the hidden vector is the channel dimension), adding a relative positional embedding for each pixel. Concatenate this with the convolution output to obtain the attention-augmented convolution.
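A minimal numpy sketch of the idea above, under my own assumptions: single attention head, a 1x1 conv standing in for the convolutional branch, relative positional logits omitted for brevity, and all function/parameter names hypothetical.

```python
import numpy as np

def attention_augmented_conv(x, conv_out_ch=4, attn_ch=2, rng=None):
    """Sketch: self-attention over all h*w pixels, concatenated with a
    1x1-conv branch along the channel dimension. Shapes are illustrative."""
    rng = np.random.default_rng(0) if rng is None else rng
    c, h, w = x.shape
    n = h * w
    feats = x.reshape(c, n).T                 # (n, c): one row per pixel

    # 1x1 convolution == per-pixel linear map
    w_conv = rng.standard_normal((c, conv_out_ch))
    conv = feats @ w_conv                     # (n, conv_out_ch)

    # single-head self-attention over the flattened spatial positions
    wq = rng.standard_normal((c, attn_ch))
    wk = rng.standard_normal((c, attn_ch))
    wv = rng.standard_normal((c, attn_ch))
    q, k, v = feats @ wq, feats @ wk, feats @ wv
    logits = q @ k.T / np.sqrt(attn_ch)       # (n, n) pixel-to-pixel scores
    # (the paper also adds relative positional logits here; omitted)
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    attended = attn @ v                       # (n, attn_ch)

    # concat conv and attention outputs along the channel dimension
    out = np.concatenate([conv, attended], axis=1)
    return out.T.reshape(conv_out_ch + attn_ch, h, w)

x = np.random.default_rng(1).standard_normal((3, 4, 4))
y = attention_augmented_conv(x)
print(y.shape)  # (6, 4, 4): 4 conv channels + 2 attention channels
```

The key point the sketch shows: every output pixel mixes a local (conv) feature with a feature computed from all spatial positions, since the (n, n) attention matrix connects each pixel to every other.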