
TL;DR
- task : object detection
- problem : Previous work has improved detection heads one aspect at a time: 1) scale-aware (image pyramid, feature pyramid, …), 2) spatial-aware (convolution, deformable conv, …), 3) task-aware (detection with segmentation, two-stage, FCOS (center instead of bbox), …). No paper has tried to do all three well in one head.
- idea : Apply attention separately along each dimension of the feature tensor: L (number of feature levels) x S (spatial, W x H) x C (number of channels, task).
- architecture : scale-aware attention applies a hard sigmoid to the output of a 1x1 conv -> spatial-aware attention uses deformable-convolution-based attention -> task-aware attention slices the c-th channel and switches it on or off for each task via a max. The dynamic head block can be plugged into either a two-stage or a one-stage detector.
- objective : object detection loss
- baseline : Mask-RCNN, Cascade-RCNN, FCOS, ATSS, BorderDet, DETR, …
- data : MS-COCO
- result : Adding DyHead to existing object detection models consistently improves performance. Close to SOTA.
- contribution : a separate attention for each dimension.
- limitation or something I don’t understand : The output shape after each of the 3 attention steps is not shown.
Details
architecture

Dynamic Head

- $F \in \mathbb{R}^{L \times S \times C}$ : feature tensor from the backbone’s feature pyramid
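If this were ordinary self-attention, a single attention function would be applied jointly over the whole L x S x C tensor, which is very expensive at this dimensionality.
This is why the dynamic head applies attention to L, S, and C separately, one after another!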
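For reference, a sketch of the two formulations being contrasted, written with the symbols above. The $\pi$ notation (one attention function per dimension) is an assumption taken from the paper rather than from these notes. The first line is joint attention over the whole tensor, the second is the sequential decomposition over L, then S, then C:

$$W(F) = \pi(F) \cdot F$$

$$W(F) = \pi_C\Big(\pi_S\big(\pi_L(F) \cdot F\big) \cdot F\Big) \cdot F$$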
Scale-aware Attention

- $f$ : linear function. Implemented as a 1x1 conv
- $\sigma$ : hard sigmoid, $\sigma(x) = \max(0, \min(1, \frac{x+1}{2}))$
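A minimal NumPy sketch of how these two pieces could fit together, assuming the attention averages the features over S and C, passes the per-level result through $f$, squashes it with the hard sigmoid, and multiplies it back onto $F$. The scalar parameters `w` and `b` are hypothetical stand-ins for the learned 1x1 conv, and the function names are illustrative only.

```python
import numpy as np

def hard_sigmoid(x):
    """sigma(x) = max(0, min(1, (x + 1) / 2))."""
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def scale_aware_attention(F, w, b):
    """F has shape (L, S, C); w and b are hypothetical scalars standing in for f."""
    pooled = F.mean(axis=(1, 2))         # average over S and C -> shape (L,)
    gate = hard_sigmoid(w * pooled + b)  # one gate in [0, 1] per feature level
    return gate[:, None, None] * F       # reweight each level, shape (L, S, C)

rng = np.random.default_rng(0)
F = rng.normal(size=(3, 16, 8))          # L=3 levels, S=16 positions, C=8 channels
out = scale_aware_attention(F, w=1.0, b=0.0)
print(out.shape)  # (3, 16, 8)
```

Each feature level ends up scaled by a single number between 0 and 1, so the output keeps the L x S x C shape of the input.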
Spatial-aware Attention

Uses deformable-convolution-based attention, so attention is computed only at a sparse set of sampling locations.
- K : # of sparse sampling locations
- $\Delta p_k$ : self-learned spatial offset that shifts the sampling location $p_k$
- $\Delta m_k$ : self-learned importance scalar at location $p_k$
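The sparse sampling can be sketched in NumPy as below. This is a simplified illustration, not the paper's implementation: it assumes the output is a weighted sum over the K sampled locations, averaged over the L levels. It uses the same integer offsets at every position (real deformable convolution predicts fractional offsets per position and samples bilinearly), and it treats the offsets, importance scalars, and the per-level weights `weights` as given arrays rather than learned ones. All names are hypothetical.

```python
import numpy as np

def spatial_aware_attention(F, offsets, masks, weights):
    """
    F:       (L, H, W, C) feature tensor, with S = H * W kept as a 2D grid.
    offsets: (K, 2) integer (dy, dx) shifts, stand-ins for the learned offsets.
    masks:   (K,) importance scalars, stand-ins for the learned scalars.
    weights: (L, K) per-level, per-sample weights (assumed given).
    Returns an (H, W, C) map aggregated over levels and sampling locations.
    """
    L, H, W, C = F.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    out = np.zeros((H, W, C))
    for l in range(L):
        for k, (dy, dx) in enumerate(offsets):
            # Shifted sampling grid, clamped at the border (a simplification).
            sy = np.clip(ys + dy, 0, H - 1)
            sx = np.clip(xs + dx, 0, W - 1)
            out += weights[l, k] * masks[k] * F[l, sy, sx]
    return out / L  # average over the L levels

rng = np.random.default_rng(0)
F = rng.normal(size=(3, 4, 5, 8))                    # L=3, H=4, W=5, C=8
offsets = np.array([[0, 0], [0, 1], [1, 0]])         # K=3 sampling locations
masks = np.array([0.5, 0.25, 0.25])
weights = np.ones((3, 3))
out = spatial_aware_attention(F, offsets, masks, weights)
print(out.shape)  # (4, 5, 8)
```

With K much smaller than the number of positions, each output location only looks at a few sampled features per level instead of the whole map.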
Task-aware Attention

- $F_c$ : the c-th channel slice of the feature map
- $\theta$ : function that sets the activation thresholds. Global average pooling over the L x S dimensions, then 2 FC layers -> normalization -> sigmoid (these steps seem to be omitted from the formula?)
- $\alpha$, $\beta$ : outputs of the activation-thresholding function $\theta$ above.
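A minimal NumPy sketch of the on-off switching, assuming the attention takes the max of two per-channel linear functions of $F_c$, with slopes $\alpha$ and intercepts $\beta$ produced by $\theta$. The FC weights `W1`, `W2`, the ReLU between them, and the rescaling of the sigmoid output to [-1, 1] are assumptions made for illustration, and the normalization step is left out.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def theta(F, W1, bias1, W2, bias2):
    """Pool F (L, S, C) over L x S, apply two FC layers, squash to [-1, 1]."""
    C = F.shape[-1]
    pooled = F.mean(axis=(0, 1))                   # (C,)
    hidden = np.maximum(W1 @ pooled + bias1, 0.0)  # first FC + ReLU (assumed)
    raw = W2 @ hidden + bias2                      # second FC -> (4 * C,)
    scaled = 2.0 * sigmoid(raw) - 1.0              # assumed range [-1, 1]
    alpha1, alpha2, beta1, beta2 = scaled.reshape(4, C)
    return alpha1, alpha2, beta1, beta2

def task_aware_attention(F, alpha1, alpha2, beta1, beta2):
    """Per-channel max of two linear functions, broadcast over L and S."""
    return np.maximum(alpha1 * F + beta1, alpha2 * F + beta2)

rng = np.random.default_rng(0)
L, S, C, hidden_dim = 3, 16, 8, 4
F = rng.normal(size=(L, S, C))
W1, bias1 = rng.normal(size=(hidden_dim, C)), np.zeros(hidden_dim)
W2, bias2 = rng.normal(size=(4 * C, hidden_dim)), np.zeros(4 * C)
params = theta(F, W1, bias1, W2, bias2)
out = task_aware_attention(F, *params)
print(out.shape)  # (3, 16, 8)
```

Setting `alpha1 = 1` and the other three parameters to 0 reduces the function to a plain ReLU, which shows how a channel can be switched fully on or off depending on what $\theta$ outputs.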

Generalizing to Existing Detectors
one-stage detector: Prior research shows that the classification subnetwork and the bbox regressor behave very differently, so the conventional approach gives each its own branch. Here, unlike that approach, a single unified branch attached to the backbone predicts both cls and bbox. This is possible thanks to DyHead!
two-stage detector: Apply DyHead before RoI pooling.
Result

Ablation

