[85] Dynamic Head: Unifying Object Detection Heads with Attentions

TL;DR

task : object detection
Problem : Previous work has improved 1) scale-awareness : image pyramid, feature pyramid … 2) spatial-awareness : convolution, deformable conv … 3) task-awareness : OD with segmentation, two-stage, FCOS (center instead of bbox) … but no paper has tried to do all three well!
idea : Let's apply attention to each dimension of the feature tensor: L (= number of feature levels) x S (= spatial, W x H) x C (= number of channels, task)!
architecture : scale-aware attention applies a hard sigmoid to the output of a 1x1 conv -> spatial-aware attention uses deformable convolution -> task-aware attention slices out the c-th channel and switches it on or off for the task via a max. The dynamic head can be plugged into either a two-stage or a one-stage detector.
objective : object detection loss
baseline : Mask-RCNN, Cascade-RCNN, FCOS, ATSS, BorderDet, DETR, …
data : MS-COCO
result : Adding DyHead improved performance on every object detection model it was applied to. Almost SOTA.
contribution : attention applied separately to each dimension.
limitation or something I don't understand : The paper does not illustrate the exact shape of the output of each of the 3 attentions.

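With normal self-attention, $W(F) = \pi(F) \cdot F$ would have to attend over the whole L x S x C tensor at once, which is too expensive. This is why the dynamic head pays attention to L, S, and C separately, one after another: $W(F) = \pi_C(\pi_S(\pi_L(F) \cdot F) \cdot F) \cdot F$.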
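To make the first of these attentions concrete, here is a minimal NumPy sketch of the scale-aware step (hard sigmoid on top of a 1x1-conv-like linear map, as in the architecture line of the TL;DR). The function names, the per-channel weight `w`, and the scalar bias `b` are my own assumptions for illustration, not the paper's code. The point is mainly the shape: one gate per level, and the output keeps the (L, S, C) shape of the input.

```python
import numpy as np

def hard_sigmoid(x):
    # sigma(x) = max(0, min(1, (x + 1) / 2))
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def scale_aware_attention(F, w, b):
    """F: (L, S, C) feature tensor.
    w: (C,) and b: scalar are hypothetical stand-ins for the 1x1 conv."""
    pooled = F.mean(axis=1)            # average over space -> (L, C)
    logits = pooled @ w + b            # linear map over channels -> (L,)
    gate = hard_sigmoid(logits)        # one gate in [0, 1] per level
    return gate[:, None, None] * F     # broadcast back to (L, S, C)

# Two levels, three positions, four channels.
F = np.stack([np.full((3, 4), 1.0), np.full((3, 4), -3.0)])
out = scale_aware_attention(F, w=np.full(4, 0.25), b=0.0)
print(out.shape)  # (2, 3, 4)
```

With `w = 1/C` and `b = 0` the linear map reduces to a plain average over S and C, so the first level (mean 1) passes through unchanged and the second level (mean -3) is gated to zero.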

Spatial-aware attention: uses deformable convolution.
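A heavily simplified sketch of what deformable-style spatial attention does: sample a few positions per location using learned offsets, weight them by a learned importance, and average over levels. Everything here is an assumption made for illustration: space is flattened to 1D, offsets are integers relative to the current position (real deformable convolution uses fractional offsets with bilinear sampling), and `w`, `offsets`, and `masks` are hypothetical inputs that a real model would learn or predict.

```python
import numpy as np

def spatial_aware_attention(F, w, offsets, masks):
    """F: (L, S, C) features, S flattened to 1D.
    w: (L, K) per-level sample weights (hypothetical).
    offsets: (L, S, K) integer offsets relative to each position.
    masks: (L, S, K) importance scalars for each sample."""
    L, S, C = F.shape
    base = np.arange(S)[:, None]                      # (S, 1)
    out = np.zeros((S, C))
    for l in range(L):
        # Sampling positions, clamped to the valid range.
        idx = np.clip(base + offsets[l], 0, S - 1)    # (S, K)
        sampled = F[l][idx]                           # (S, K, C)
        weights = w[l][None, :] * masks[l]            # (S, K)
        out += (weights[:, :, None] * sampled).sum(axis=1)
    return out / L                                    # mean over levels

# One level, four positions, one channel; every position looks one step right.
F = np.arange(4, dtype=float).reshape(1, 4, 1)
out = spatial_aware_attention(
    F,
    w=np.ones((1, 1)),
    offsets=np.ones((1, 4, 1), dtype=int),
    masks=np.ones((1, 4, 1)),
)
print(out[:, 0])
```

In this simplified form the result is averaged over levels and has shape (S, C); how that is mapped back to each level is exactly the part the notes above flag as unclear.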

$F_c$ : the $c$-th channel sliced from the feature map
$\theta$ : hyper function that learns the activation thresholds. Implemented as global average pooling over the L x S dimensions -> 2 fully connected layers -> normalization -> sigmoid (these steps seem to be omitted from the formula?)
$\alpha$, $\beta$ : outputs of the thresholding function $\theta$ above.
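The definitions above can be sketched in NumPy as follows. I read the task-aware step as $\max(\alpha^1 F_c + \beta^1, \alpha^2 F_c + \beta^2)$ applied per channel. The weight names (`W1`, `c1`, `W2`, `c2`), the hidden size, and the use of a shifted sigmoid `2 * sigmoid(x) - 1` for the "normalizing -> sigmoid" step are my assumptions, not something the notes confirm.

```python
import numpy as np

def theta(F, W1, c1, W2, c2):
    """Hypothetical hyper function: F (L, S, C) -> four (C,) coefficient vectors.
    W1: (C, H), c1: (H,), W2: (H, 4*C), c2: (4*C,) are illustrative weights."""
    C = F.shape[-1]
    pooled = F.mean(axis=(0, 1))                  # global average pool over L x S -> (C,)
    hidden = np.maximum(pooled @ W1 + c1, 0.0)    # first fully connected layer + ReLU
    raw = hidden @ W2 + c2                        # second fully connected layer -> (4*C,)
    norm = 2.0 / (1.0 + np.exp(-raw)) - 1.0       # shifted sigmoid, values in [-1, 1]
    alpha1, alpha2, beta1, beta2 = norm.reshape(4, C)
    return alpha1, alpha2, beta1, beta2

def task_aware_attention(F, alpha1, alpha2, beta1, beta2):
    # Per-channel max of two linear functions; coefficients broadcast over L and S.
    return np.maximum(alpha1 * F + beta1, alpha2 * F + beta2)

F = np.array([[[-1.0, 2.0]]])                     # L=1, S=1, C=2
ones, zeros = np.ones(2), np.zeros(2)
relu_like = task_aware_attention(F, ones, zeros, zeros, zeros)
print(relu_like)
```

With `alpha1 = 1` and the other coefficients at 0, the max reduces to a plain ReLU, which is a convenient way to see how the coefficients switch channels on and off. The output keeps the (L, S, C) shape of the input.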

one-stage detector : Prior research shows that the cls subnetwork and the bbox regressor behave very differently, so they conventionally get separate branches. Unlike this conventional approach, cls and bbox are predicted from a single unified branch attached to the backbone. This is thanks to DyHead!
two-stage detector : Apply DyHead before RoI pooling.