image

paper, code

TL;DR

  • task : anchor-free object detection
  • Problem : anchor-based object detection is 1) sensitive to hyper-parameters, 2) limited by fixed anchor scales/aspect ratios (even though the regression is relative to the anchor), 3) forced to tile the image densely with anchors (~180K anchor boxes for an image with a shorter side of 800), and 4) complicated by the IoU computations needed to match anchors to GT boxes.
  • idea : Let’s do object detection per pixel with a fully convolutional network, like semantic segmentation
  • architecture : Build a feature pyramid: P3, P4, P5 from 1×1 convolutions on C3, C4, C5 of the CNN backbone (ResNet-50), plus P6 and P7 from stride-2 convolutions on P5. Because per-pixel prediction is ambiguous about which box to predict when objects overlap heavily, the head also predicts a center-ness score trained with a sigmoid in [0, 1].
  • objective : focal loss for cls, IoU loss for bbox regression
  • baseline : Faster R-CNN, YOLOv2, SSD, DSSD, RetinaNet, CornerNet
  • data : COCO
  • result : SOTA!
  • contribution : raises the question, “Do I really need to use an anchor box?” and answers it with awesome performance
  • Limitations or things I don’t understand : How is the BPR (upper bound of recall rate that a detector can achieve) measured?
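The per-pixel idea above boils down to this: every feature-map location that falls inside a ground-truth box regresses its distances (l, t, r, b) to the four box sides. A minimal NumPy sketch (function and variable names are my own, not from the paper's code):

```python
import numpy as np

def regression_targets(points, box):
    """For each (x, y) location, compute FCOS-style distances
    (l, t, r, b) to the sides of a box given as (x0, y0, x1, y1)."""
    x, y = points[:, 0], points[:, 1]
    x0, y0, x1, y1 = box
    # distances from the location to the left/top/right/bottom sides
    return np.stack([x - x0, y - y0, x1 - x, y1 - y], axis=1)

points = np.array([[50.0, 50.0], [10.0, 90.0]])
box = (0.0, 0.0, 100.0, 100.0)
targets = regression_targets(points, box)
# a location is a positive sample only if it lies inside the box,
# i.e. all four distances are positive
positive = (targets > 0).all(axis=1)
```

Note there is no anchor anywhere: the regression target is defined purely by the location and the GT box, which is what removes the hyper-parameters and IoU matching listed above.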

Details

Architecture

image

I found out later that it would have performed better to just separate the center-ness branch lol

image
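The head structure can be sketched roughly as follows: a shared conv tower per branch on each pyramid level, then per-pixel classification logits, a 1-channel center-ness, and a 4-channel box regression. This is a minimal PyTorch sketch with assumed channel counts and conv depths, following the paper's original placement of center-ness on the classification tower:

```python
import torch
from torch import nn

class FCOSHead(nn.Module):
    """Minimal sketch of an FCOS-style head (structure assumed from
    the paper's figure, not a faithful reimplementation)."""
    def __init__(self, in_ch=256, num_classes=80, num_convs=4):
        super().__init__()
        def tower():
            layers = []
            for _ in range(num_convs):
                layers += [nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU()]
            return nn.Sequential(*layers)
        self.cls_tower = tower()
        self.reg_tower = tower()
        self.cls_logits = nn.Conv2d(in_ch, num_classes, 3, padding=1)
        # center-ness sits on the classification tower in the original paper
        self.centerness = nn.Conv2d(in_ch, 1, 3, padding=1)
        self.bbox_pred = nn.Conv2d(in_ch, 4, 3, padding=1)

    def forward(self, x):
        c = self.cls_tower(x)
        r = self.reg_tower(x)
        # exp keeps the predicted (l, t, r, b) distances positive
        return self.cls_logits(c), self.centerness(c), torch.exp(self.bbox_pred(r))

head = FCOSHead()
cls, ctr, reg = head(torch.randn(1, 256, 25, 25))
```

The same head is shared across P3–P7, which is why the output is fully convolutional: each pyramid level just produces denser or sparser per-pixel predictions.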

Loss

L = (1/N_pos) Σ L_cls(p, c*) + (λ/N_pos) Σ 1{c* > 0} L_reg(t, t*)

  • L_cls is the focal loss
  • L_reg is the IoU loss, applied only at positive locations (c* > 0)
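The IoU loss for the regression branch can be written as -ln(IoU) between the predicted and target (l, t, r, b) boxes decoded at the same location. A small NumPy sketch (my own naming, assuming the UnitBox-style -ln(IoU) form):

```python
import numpy as np

def iou_loss(pred, target):
    """-ln(IoU) between predicted and target (l, t, r, b) distances
    decoded at the same location. Shapes: (N, 4)."""
    pl, pt, pr, pb = pred.T
    tl, tt, tr, tb = target.T
    area_p = (pl + pr) * (pt + pb)
    area_t = (tl + tr) * (tt + tb)
    # both boxes share the same anchor point, so the intersection
    # width/height is the sum of the element-wise minimum distances
    w_i = np.minimum(pl, tl) + np.minimum(pr, tr)
    h_i = np.minimum(pt, tt) + np.minimum(pb, tb)
    inter = w_i * h_i
    union = area_p + area_t - inter
    return -np.log(inter / union)

target = np.array([[10.0, 10.0, 10.0, 10.0]])
perfect = iou_loss(target.copy(), target)   # IoU = 1 → loss = 0
worse = iou_loss(np.array([[5.0, 10.0, 10.0, 10.0]]), target)
```

Because both boxes are anchored at the same pixel, the IoU reduces to simple min/sum arithmetic on the four distances.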

Center-ness

centerness* = sqrt( (min(l*, r*) / max(l*, r*)) × (min(t*, b*) / max(t*, b*)) )

Multiplying center-ness into the classification score makes the final confidence score more meaningful: low-quality boxes predicted far from object centers are down-ranked before NMS.
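The center-ness target is computed directly from the (l, t, r, b) regression targets; a tiny sketch of the formula (function name is mine):

```python
import math

def centerness(l, t, r, b):
    """Center-ness target from the (l, t, r, b) regression targets:
    sqrt(min(l, r)/max(l, r) * min(t, b)/max(t, b)).
    Equals 1.0 at the exact box center and decays toward 0 at the edges."""
    return math.sqrt(min(l, r) / max(l, r) * min(t, b) / max(t, b))

center = centerness(50, 50, 50, 50)  # exact center of a 100×100 box
edge = centerness(5, 50, 50, 50)     # location near the left edge
```

The sqrt slows the decay, so only locations quite far from the center get strongly suppressed when this value multiplies the classification score.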

Main Result

image