
TL;DR
- task : anchor-free object detection
- problem : Anchor-based object detection is 1) sensitive to hyper-parameters, 2) fixed in anchor scale/aspect ratio (even though the regression is relative to the anchor), 3) dense — roughly 180K anchor boxes for an image whose shorter side is resized to about 800 — and 4) requires IoU computations to match anchors to GT boxes, which complicates training
- idea : Let’s do object detection per pixel with fully convolutional network like semantic segmentation
- architecture : Build a feature pyramid: P3, P4, P5 come from 1 x 1 convolutions on C3, C4, C5 of the CNN backbone (ResNet-50), and P6, P7 from stride-2 convolutions on P5 and P6, respectively. Since per-pixel prediction is ambiguous when objects overlap heavily (which box should that pixel predict?), the head is trained with a 0-1 sigmoid per class together with a center-ness branch that down-weights low-quality, off-center predictions
- objective : focal loss for cls, IoU loss for bbox regression
- baseline : Faster R-CNN, YOLOv2, SSD, DSSD, RetinaNet, CornerNet
- data : COCO
- result : SOTA!
- contribution : raises the question, “Do we really need anchor boxes?” and answers it with strong performance
- limitations or things I don’t understand : How is BPR (best possible recall — the upper bound on the recall a detector can achieve) measured?
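To make the objective line above concrete, here is a minimal sketch of the two loss terms: focal loss with the RetinaNet defaults (alpha = 0.25, gamma = 2) and the -ln(IoU) regression loss. Function names are mine, not from the paper's code.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction.

    p: predicted probability (sigmoid output); y: 1 for positive, 0 for negative.
    The (1 - p_t)^gamma factor down-weights easy, well-classified examples.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def iou_loss(pred, gt):
    """IoU loss (-ln IoU) between two boxes (x1, y1, x2, y2).

    Sketch only: assumes the boxes overlap, so IoU > 0.
    """
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return -math.log(inter / (area_p + area_g - inter))
```

A hard positive (p = 0.1) gets a far larger focal loss than an easy one (p = 0.9), and a perfect box prediction gives an IoU loss of 0.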
Details
Architecture

I found out later that it would have performed better to just separate the center-ness branch lol
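For context on the per-pixel formulation: each cell of an FPN level maps back to a location in the input image, and GT boxes are divided across levels by the size of their regression targets (max of the left/top/right/bottom distances). A rough sketch, with the per-level ranges as I recall them from the paper; the function name is mine.

```python
def fpn_locations(h, w, stride):
    """Map every (x, y) cell of an H x W feature map back to image coordinates.

    FCOS treats each feature-map cell as one "pixel" prediction, centered at
    (stride // 2 + x * stride, stride // 2 + y * stride) in the input image.
    """
    return [(stride // 2 + x * stride, stride // 2 + y * stride)
            for y in range(h) for x in range(w)]

# A GT box is assigned to the level whose range contains its largest
# regression distance max(l, t, r, b), so each level handles one scale band.
RANGES = {"P3": (0, 64), "P4": (64, 128), "P5": (128, 256),
          "P6": (256, 512), "P7": (512, float("inf"))}
```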

Loss
- L_cls is the focal loss
- L_reg is the IoU loss

Center-ness

When center-ness is multiplied by the classification score, the resulting confidence score ranks detections more meaningfully.
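The center-ness target is computed from the four regression distances of a location inside its GT box, and at inference it scales the classification score so off-center boxes are down-ranked before NMS. A small sketch (function names are mine):

```python
import math

def centerness(l, t, r, b):
    """Center-ness target from the left/top/right/bottom distances of a
    location inside its GT box: 1.0 at the box center, decaying toward 0
    at the edges."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

def final_score(cls_score, l, t, r, b):
    """Inference-time confidence: classification score scaled by center-ness."""
    return cls_score * centerness(l, t, r, b)
```

A location at the exact center (equal distances on all sides) scores 1.0, while one near a box edge is strongly suppressed.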
Main Result
