image

paper

TL;DR

  • I read this because.. : cited by DETR. DETR notes that a transformer's FFN is like a 1x1 convolution, so its encoder can be viewed as an "attention augmented convolutional network" — that made me curious, so I read this paper.
  • task : image classification / object detection
  • problem : CNNs only capture local information, while self-attention can capture long-range dependencies.
  • idea : combine the two!
  • architecture : given an input image, apply MSA over the (h, w) spatial positions, treating the channel dimension as the hidden vector; each pixel gets a relative positional embedding. Concatenating this attention output with the convolution output gives the attention-augmented convolution.
  • baseline : ResNet-50, RetinaNet-50; channel-wise reweighting (Squeeze-and-Excitation, Gather-Excite); independent channel/spatial reweighting (BAM, CBAM)
  • data : ImageNet, COCO
  • evaluation : accuracy / mAP
  • result : applying it to ResNet-50 on ImageNet gives +1.3% accuracy, and to RetinaNet on COCO gives +1.4 mAP.
  • contribution :
  • limitation / things I cannot understand :

Details

Architecture

image image
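The core operation — concatenating a convolution output with a self-attention output over all spatial positions — can be sketched in a few lines. This is a minimal single-head numpy illustration of my own, not the paper's implementation: it uses a 1x1 conv (a plain matmul on the flattened feature map), omits multiple heads and the relative positional embeddings, and all weight shapes are made-up toy values.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_augmented_conv(x, w_conv, w_q, w_k, w_v):
    """x: (H*W, C_in) flattened feature map.
    Returns concat of a 1x1-conv path (local) and a single-head
    self-attention path over all H*W positions (long-range)."""
    conv_out = x @ w_conv                  # (HW, C_conv) pointwise conv path
    q, k, v = x @ w_q, x @ w_k, x @ w_v    # project to queries/keys/values
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))   # (HW, HW): every pixel attends to every pixel
    attn_out = attn @ v                    # (HW, C_attn) attention path
    return np.concatenate([conv_out, attn_out], axis=-1)

# toy example with hypothetical sizes: 4x4 image, 8 input channels
H, W, C = 4, 4, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((H * W, C))
out = attention_augmented_conv(
    x,
    rng.standard_normal((C, 6)),  # conv path -> 6 channels
    rng.standard_normal((C, 4)),  # query projection
    rng.standard_normal((C, 4)),  # key projection
    rng.standard_normal((C, 2)),  # value projection -> 2 attention channels
)
print(out.shape)  # (16, 8): 6 conv channels + 2 attention channels
```

In the paper the attention path also adds relative positional logits per pixel before the softmax, and the output channel split between the conv and attention paths is a tuned ratio; here the 6/2 split is arbitrary.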

Result

image image image image image image