I read this because : Cited by DETR. The FFN of the Transformer looks like a 1x1 convolution, so I was curious about the claim that the encoder can be viewed as an “attention augmented convolutional network” via the FFN.
task : image classification / object detection
Problem : A CNN sees only local information, while self-attention can capture long-range dependencies.
IDEA : Let’s combine the two!
architecture : Given the input image, apply MSA over the (h, w) positions (the hidden vector is the channel dimension), adding a relative positional embedding for each pixel. Concatenate this with the convolution output to obtain the attention-augmented convolution.
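A minimal numpy sketch of the idea above, under my own assumptions: single attention head, a 1x1 conv standing in for the convolutional branch, relative positional logits omitted for brevity, and all function/parameter names hypothetical.

```python
import numpy as np

def attention_augmented_conv(x, conv_out_ch=4, attn_ch=2, rng=None):
    """Sketch: self-attention over all h*w pixels, concatenated with a
    1x1-conv branch along the channel dimension. Shapes are illustrative."""
    rng = np.random.default_rng(0) if rng is None else rng
    c, h, w = x.shape
    n = h * w
    feats = x.reshape(c, n).T                 # (n, c): one row per pixel

    # 1x1 convolution == per-pixel linear map
    w_conv = rng.standard_normal((c, conv_out_ch))
    conv = feats @ w_conv                     # (n, conv_out_ch)

    # single-head self-attention over the flattened spatial positions
    wq = rng.standard_normal((c, attn_ch))
    wk = rng.standard_normal((c, attn_ch))
    wv = rng.standard_normal((c, attn_ch))
    q, k, v = feats @ wq, feats @ wk, feats @ wv
    logits = q @ k.T / np.sqrt(attn_ch)       # (n, n) pixel-to-pixel scores
    # (the paper also adds relative positional logits here; omitted)
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    attended = attn @ v                       # (n, attn_ch)

    # concat conv and attention outputs along the channel dimension
    out = np.concatenate([conv, attended], axis=1)
    return out.T.reshape(conv_out_ch + attn_ch, h, w)

x = np.random.default_rng(1).standard_normal((3, 4, 4))
y = attention_augmented_conv(x)
print(y.shape)  # (6, 4, 4): 4 conv channels + 2 attention channels
```

The key point the sketch shows: every output pixel mixes a local (conv) feature with a feature computed from all spatial positions, since the (n, n) attention matrix connects each pixel to every other.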