[69] End-to-End Object Detection with Transformers

TL;DR

task : object detection
problem : existing detectors rely on many hand-designed components, such as non-maximum suppression and anchor generation, that explicitly encode prior knowledge about the task
idea : directly predict the set of objects, using bipartite matching between predictions and ground truth during training
architecture : CNN backbone + transformer encoder + transformer decoder with object queries (= learned positional embeddings) + FFN heads for bbox / cls prediction
objective : Hungarian (bipartite matching) loss = CE loss for classes + box loss (L1 + generalized IoU)
baseline : Faster R-CNN
data : COCO
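result : accuracy comparable to a well-tuned Faster R-CNN baseline on COCO; better on large objects, worse on small ones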
contribution : transformer-based object detection model that needs no NMS or anchors!
Limitations or things I don’t understand : longer training time, low performance on small objects
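
To make the bipartite matching in the idea line concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the matching cost is simplified to "negative class probability + weighted L1 box distance" (the generalized IoU term is left out), the optimal assignment is found by brute force rather than the Hungarian algorithm, and the names `match_cost`, `bipartite_match`, and `box_weight` are hypothetical. For realistic sizes one would likely use something like `scipy.optimize.linear_sum_assignment` instead.

```python
from itertools import permutations


def l1(box_a, box_b):
    """L1 distance between two boxes given as (cx, cy, w, h)."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def match_cost(pred, target, box_weight=5.0):
    """Simplified matching cost (assumption: the GIoU term is omitted)."""
    class_term = -pred["probs"][target["label"]]
    box_term = box_weight * l1(pred["box"], target["box"])
    return class_term + box_term


def bipartite_match(preds, targets):
    """Return a tuple where entry j is the prediction slot matched to target j.

    Brute force over all one-to-one assignments; only viable for tiny inputs.
    """
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[i], targets[j]) for j, i in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best


# Toy example: 3 prediction slots, 2 ground-truth objects.
# Class indices: 0 = cat, 1 = dog, 2 = "no object".
preds = [
    {"probs": [0.10, 0.80, 0.10], "box": (0.5, 0.5, 0.2, 0.2)},
    {"probs": [0.70, 0.20, 0.10], "box": (0.2, 0.3, 0.1, 0.1)},
    {"probs": [0.05, 0.05, 0.90], "box": (0.9, 0.9, 0.1, 0.1)},
]
targets = [
    {"label": 0, "box": (0.2, 0.3, 0.1, 0.1)},
    {"label": 1, "box": (0.5, 0.5, 0.2, 0.2)},
]

assignment = bipartite_match(preds, targets)
# Slots that received no target are trained to predict "no object".
unmatched = [i for i in range(len(preds)) if i not in assignment]
print(assignment, unmatched)
```

Because each ground-truth object is assigned to exactly one prediction slot, duplicate detections are penalized during training, which is why no NMS step is needed at inference time.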