
paper, code

TL;DR

  • task : object detection, efficient DETR
  • Problem : Deformable DETR reduces the number of keys per query with deformable attention, but because it uses multi-scale features, the encoder input has roughly 20× more tokens, which makes inference rather slow.
  • Idea : Images are mostly background, and only salient objects need attention, so make the tokens entering the encoder sparse!
  • architecture : Add a scoring network on top of Deformable DETR that measures the objectness of each token entering the encoder. The scoring network is trained with an auxiliary-style loss in one of two ways: 1) attach detection heads to the backbone feature map, or 2) use the Decoder cross-Attention Map (DAM) as a pseudo-label, with 1 for the top p% of tokens in the cross-attention map and 0 for the rest.
  • objective : DETR loss plus an auxiliary loss from detection heads attached to the encoder as well
  • baseline : Faster R-CNN, DETR, DETR-DC5, Deformable DETR
  • data : COCO 2017
  • result : performance comparable to Deformable DETR while using only 10% of the encoder tokens
  • contribution : more efficient DETR than deformable DETR
  • Limitations or things I don’t understand :
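The token-sparsification idea above can be sketched as follows. This is a minimal illustration, not the authors' code: the `select_salient_tokens` name, the `keep_ratio` parameter, and the linear scoring head are all assumptions for the sketch.

```python
import torch

def select_salient_tokens(tokens, score_net, keep_ratio=0.1):
    """Keep only the top-k tokens by predicted objectness score.

    tokens:     (batch, num_tokens, dim) flattened multi-scale features
    score_net:  any module mapping dim -> 1 objectness logit per token
    keep_ratio: fraction of tokens forwarded to the encoder (e.g. 10%)
    """
    scores = score_net(tokens).squeeze(-1)            # (batch, num_tokens)
    k = max(1, int(tokens.shape[1] * keep_ratio))
    topk_idx = scores.topk(k, dim=1).indices          # (batch, k)
    idx = topk_idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
    return tokens.gather(1, idx), topk_idx            # (batch, k, dim)

# Hypothetical usage: 1000 multi-scale tokens, keep the top 10%.
tokens = torch.randn(2, 1000, 256)
score_net = torch.nn.Linear(256, 1)
sparse, idx = select_salient_tokens(tokens, score_net, keep_ratio=0.1)
print(sparse.shape)  # torch.Size([2, 100, 256])
```

Only the selected tokens are refined by the encoder; the rest are passed through unchanged, which is where the inference savings come from.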

Details

  • Encoder token sparsity (figure)
  • Decoder cross-Attention Map (DAM) (figure)
  • Overall architecture (figure)
  • Detection results (figure)
  • Selection criteria (figure)
  • Ablation on the number of encoder layers (figure)
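The DAM pseudo-label idea can be sketched roughly as below. This is an illustrative reconstruction under stated assumptions, not the paper's code: summing the cross-attention over layers, heads, and queries, and the `keep_ratio` name, are my simplifications.

```python
import torch
import torch.nn.functional as F

def dam_pseudo_label(cross_attn, keep_ratio=0.1):
    """Binarize the decoder cross-attention map into a 0/1 target.

    cross_attn: (layers, heads, num_queries, num_tokens) attention weights
    Returns a (num_tokens,) tensor with 1 for the top keep_ratio fraction
    of tokens by accumulated attention and 0 for the rest.
    """
    dam = cross_attn.sum(dim=(0, 1, 2))   # accumulated attention per token
    k = max(1, int(dam.numel() * keep_ratio))
    label = torch.zeros_like(dam)
    label[dam.topk(k).indices] = 1.0
    return label

def score_net_loss(scores, label):
    """Train the scoring network to predict the DAM pseudo-label (BCE)."""
    return F.binary_cross_entropy_with_logits(scores, label)

# Hypothetical shapes: 6 decoder layers, 8 heads, 100 queries, 1000 tokens.
attn = torch.rand(6, 8, 100, 1000)
label = dam_pseudo_label(attn, keep_ratio=0.1)
print(int(label.sum()))  # 100
```

The intuition: tokens that the decoder attends to most are the ones worth keeping, so the scoring network learns to predict them before the decoder ever runs.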

Without the auxiliary losses, the encoder fails to learn once it is stacked 12 layers deep.