
paper, code

TL;DR

  • task : object detection, efficient DETR
  • Problem : Deformable DETR reduces the number of keys per query with deformable attention, but because it uses multi-scale features, the encoder input has roughly 20× more tokens, which makes inference rather slow.
  • Idea : Images are mostly background, and only salient objects need attention, so make the tokens entering the encoder sparse!
  • architecture : Add a scoring network on top of Deformable DETR that measures the objectness of each token entering the encoder. The scoring network is trained with an auxiliary-style loss in one of two ways: 1) attach detection heads to the backbone feature map, or 2) use the Decoder cross-Attention Map (DAM) as a pseudo-label, with 1 for the top p% of tokens in the cross-attention map and 0 for the rest.
  • objective : DETR loss plus an auxiliary loss from detection heads attached to the encoder as well
  • baseline : Faster R-CNN, DETR, DETR-DC5, Deformable DETR
  • data : COCO 2017
  • result : performance comparable to Deformable DETR while using only 10% of the encoder tokens
  • contribution : more efficient DETR than deformable DETR
  • Limitations or things I don’t understand :
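The token-sparsification idea above can be sketched as follows. This is a minimal illustration, not the authors' code: the `select_salient_tokens` name, the `keep_ratio` parameter, and the linear scoring head are all assumptions for the sketch.

```python
import torch

def select_salient_tokens(tokens, score_net, keep_ratio=0.1):
    """Keep only the top-k tokens by predicted objectness score.

    tokens:     (batch, num_tokens, dim) flattened multi-scale features
    score_net:  any module mapping dim -> 1 objectness logit per token
    keep_ratio: fraction of tokens forwarded to the encoder (e.g. 10%)
    """
    scores = score_net(tokens).squeeze(-1)            # (batch, num_tokens)
    k = max(1, int(tokens.shape[1] * keep_ratio))
    topk_idx = scores.topk(k, dim=1).indices          # (batch, k)
    idx = topk_idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
    return tokens.gather(1, idx), topk_idx            # (batch, k, dim)

# Hypothetical usage: 1000 multi-scale tokens, keep the top 10%.
tokens = torch.randn(2, 1000, 256)
score_net = torch.nn.Linear(256, 1)
sparse, idx = select_salient_tokens(tokens, score_net, keep_ratio=0.1)
print(sparse.shape)  # torch.Size([2, 100, 256])
```

Only the selected tokens are refined by the encoder; the rest are passed through unchanged, which is where the inference savings come from.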

Details

  • Encoder token sparsity (figure)
  • Decoder cross-Attention Map (DAM) (figure)
  • Overall architecture (figure)
  • Detection results (figure)
  • Selection criteria (figure)
  • Ablation on the number of encoder layers (figure)
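The DAM pseudo-label idea can be sketched roughly as below. This is an illustrative reconstruction under stated assumptions, not the paper's code: summing the cross-attention over layers, heads, and queries, and the `keep_ratio` name, are my simplifications.

```python
import torch
import torch.nn.functional as F

def dam_pseudo_label(cross_attn, keep_ratio=0.1):
    """Binarize the decoder cross-attention map into a 0/1 target.

    cross_attn: (layers, heads, num_queries, num_tokens) attention weights
    Returns a (num_tokens,) tensor with 1 for the top keep_ratio fraction
    of tokens by accumulated attention and 0 for the rest.
    """
    dam = cross_attn.sum(dim=(0, 1, 2))   # accumulated attention per token
    k = max(1, int(dam.numel() * keep_ratio))
    label = torch.zeros_like(dam)
    label[dam.topk(k).indices] = 1.0
    return label

def score_net_loss(scores, label):
    """Train the scoring network to predict the DAM pseudo-label (BCE)."""
    return F.binary_cross_entropy_with_logits(scores, label)

# Hypothetical shapes: 6 decoder layers, 8 heads, 100 queries, 1000 tokens.
attn = torch.rand(6, 8, 100, 1000)
label = dam_pseudo_label(attn, keep_ratio=0.1)
print(int(label.sum()))  # 100
```

The intuition: tokens that the decoder attends to most are the ones worth keeping, so the scoring network learns to predict them before the decoder ever runs.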

Without the auxiliary losses, the encoder fails to learn once it is stacked 12 layers deep.