
paper

TL;DR

  • task : object detection
  • problem : detection pipelines rely on many hand-designed components, such as non-maximum suppression (NMS) and anchor generation, that explicitly encode prior knowledge about the task
  • idea : directly predict the set of objects, using bipartite matching to assign predictions to ground-truth objects during training
  • architecture : CNN backbone + transformer encoder + transformer decoder with object queries (= learned positional embeddings) + bbox / class prediction heads
  • objective : Hungarian (bipartite matching) loss = CE loss on classes + box loss (L1 + generalized IoU)
  • baseline : Faster R-CNN
  • data : COCO
  • result : on par with a well-tuned Faster R-CNN baseline on COCO, with stronger results on large objects
  • contribution : transformer-based object detection model without NMS
  • Limitations or things I don’t understand : longer training time, low performance on small objects
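
A minimal sketch of the bipartite matching step from the idea and objective bullets, to make the set-prediction part concrete. Everything here is illustrative and not the authors' code: the names `giou`, `match_cost`, and `match_predictions`, the dict layout, and the (x1, y1, x2, y2) box format are assumptions, as are the weights 5.0 and 2.0. A brute-force search over assignments stands in for the Hungarian algorithm, so it is only usable for a handful of boxes, and it assumes there are at least as many predictions as targets.

```python
from itertools import permutations


def box_area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes have area 0."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def giou(a, b):
    """Generalized IoU of two (x1, y1, x2, y2) boxes, in [-1, 1]."""
    inter = box_area((max(a[0], b[0]), max(a[1], b[1]),
                      min(a[2], b[2]), min(a[3], b[3])))
    union = box_area(a) + box_area(b) - inter
    iou = inter / union if union > 0 else 0.0
    # Smallest box that encloses both a and b.
    enclose = box_area((min(a[0], b[0]), min(a[1], b[1]),
                        max(a[2], b[2]), max(a[3], b[3])))
    if enclose == 0:
        return iou
    return iou - (enclose - union) / enclose


def match_cost(pred, target, w_l1=5.0, w_giou=2.0):
    """Pairwise cost: -class probability + weighted L1 + weighted (1 - GIoU).

    The weights are assumed values for illustration.
    """
    prob = pred["probs"][target["label"]]
    l1 = sum(abs(p - t) for p, t in zip(pred["box"], target["box"]))
    return -prob + w_l1 * l1 + w_giou * (1.0 - giou(pred["box"], target["box"]))


def match_predictions(preds, targets):
    """Find the one-to-one assignment with minimum total cost.

    Brute force over all assignments (a stand-in for the Hungarian
    algorithm). Returns (pairs, unmatched): pairs is a sorted list of
    (pred_index, target_index); unmatched predictions would be trained
    to predict the "no object" class.
    """
    best_cost, best_perm = float("inf"), ()
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[p], targets[t]) for t, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    pairs = sorted((p, t) for t, p in enumerate(best_perm))
    matched = {p for p, _ in pairs}
    unmatched = [i for i in range(len(preds)) if i not in matched]
    return pairs, unmatched


# Toy example: 3 predictions (object queries), 2 ground-truth objects.
# Class indices: 0 and 1 are object classes, 2 is "no object".
preds = [
    {"probs": [0.10, 0.80, 0.10], "box": (0.5, 0.5, 0.9, 0.9)},
    {"probs": [0.90, 0.05, 0.05], "box": (0.1, 0.1, 0.4, 0.4)},
    {"probs": [0.05, 0.05, 0.90], "box": (0.0, 0.0, 1.0, 1.0)},
]
targets = [
    {"label": 0, "box": (0.1, 0.1, 0.4, 0.4)},
    {"label": 1, "box": (0.5, 0.5, 0.9, 0.9)},
]
pairs, unmatched = match_predictions(preds, targets)
```

Because each ground-truth object is matched to exactly one prediction and the remaining predictions are pushed toward "no object", duplicate detections are penalized during training, which is why no NMS step is needed at inference.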

Details

notion