
TL;DR
- task : anchor-free object detection
- problem : Anchor-based object detection is 1) sensitive to hyper-parameters, 2) fixed in anchor scale/aspect ratio (even though the regression is relative to the anchor), 3) dense — roughly 180K anchor boxes for an image whose shorter side is resized to about 800 — and 4) requires IoU computations to match anchors to GT boxes, which complicates training
- idea : Let’s do object detection per pixel with fully convolutional network like semantic segmentation
- architecture : Build a feature pyramid: P3, P4, P5 come from 1 x 1 convolutions on C3, C4, C5 of the CNN backbone (ResNet-50), and P6, P7 from stride-2 convolutions on P5 and P6, respectively. Since per-pixel prediction is ambiguous when objects overlap heavily (which box should that pixel predict?), the head is trained with a 0-1 sigmoid per class together with a center-ness branch that down-weights low-quality, off-center predictions
- objective : focal loss for cls, IoU loss for bbox regression
- baseline : Faster R-CNN, YOLOv2, SSD, DSSD, RetinaNet, CornerNet
- data : COCO
- result : SOTA!
- contribution : raises the question, “Do we really need anchor boxes?” and answers it with strong performance
- limitations or things I don’t understand : How is BPR (best possible recall — the upper bound on the recall a detector can achieve) measured?
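To make the objective line above concrete, here is a minimal sketch of the two loss terms: focal loss with the RetinaNet defaults (alpha = 0.25, gamma = 2) and the -ln(IoU) regression loss. Function names are mine, not from the paper's code.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction.

    p: predicted probability (sigmoid output); y: 1 for positive, 0 for negative.
    The (1 - p_t)^gamma factor down-weights easy, well-classified examples.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def iou_loss(pred, gt):
    """IoU loss (-ln IoU) between two boxes (x1, y1, x2, y2).

    Sketch only: assumes the boxes overlap, so IoU > 0.
    """
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return -math.log(inter / (area_p + area_g - inter))
```

A hard positive (p = 0.1) gets a far larger focal loss than an easy one (p = 0.9), and a perfect box prediction gives an IoU loss of 0.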
Details
Architecture

I found out later that it would have performed better to just separate the center-ness branch lol
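For context on the per-pixel formulation: each cell of an FPN level maps back to a location in the input image, and GT boxes are divided across levels by the size of their regression targets (max of the left/top/right/bottom distances). A rough sketch, with the per-level ranges as I recall them from the paper; the function name is mine.

```python
def fpn_locations(h, w, stride):
    """Map every (x, y) cell of an H x W feature map back to image coordinates.

    FCOS treats each feature-map cell as one "pixel" prediction, centered at
    (stride // 2 + x * stride, stride // 2 + y * stride) in the input image.
    """
    return [(stride // 2 + x * stride, stride // 2 + y * stride)
            for y in range(h) for x in range(w)]

# A GT box is assigned to the level whose range contains its largest
# regression distance max(l, t, r, b), so each level handles one scale band.
RANGES = {"P3": (0, 64), "P4": (64, 128), "P5": (128, 256),
          "P6": (256, 512), "P7": (512, float("inf"))}
```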

Loss
- L_cls is the focal loss
- L_reg is the IoU loss

Center-ness

When center-ness is multiplied by the classification score, the resulting confidence score ranks detections more meaningfully.
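The center-ness target is computed from the four regression distances of a location inside its GT box, and at inference it scales the classification score so off-center boxes are down-ranked before NMS. A small sketch (function names are mine):

```python
import math

def centerness(l, t, r, b):
    """Center-ness target from the left/top/right/bottom distances of a
    location inside its GT box: 1.0 at the box center, decaying toward 0
    at the edges."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

def final_score(cls_score, l, t, r, b):
    """Inference-time confidence: classification score scaled by center-ness."""
    return cls_score * centerness(l, t, r, b)
```

A location at the exact center (equal distances on all sides) scores 1.0, while one near a box edge is strongly suppressed.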
Main Result
