image

paper, code

TL;DR

  • task : object detection
  • problem : previous work has improved detection heads along one axis at a time: 1) scale-aware : image pyramids, feature pyramids … 2) spatial-aware : convolution, deformable convolution … 3) task-aware : detection with segmentation, two-stage heads, FCOS (centers instead of anchor boxes) … but no paper has tried to do all three well at once!
  • idea : let’s apply attention to each dimension of the head’s feature tensor: L (= number of feature levels) x S (= spatial, W x H) x C (= number of channels, i.e. tasks)!
  • architecture : the scale attention is a hard sigmoid on top of a 1x1 conv -> the spatial attention uses deformable-convolution-style sparse sampling -> the task attention slices out the c-th channel and switches it on/off per task via a max over learned linear activations. You can put this dynamic head anywhere by sandwiching it between the backbone and either a one-stage or two-stage head.
  • objective : standard object detection loss
  • baseline : Mask R-CNN, Cascade R-CNN, FCOS, ATSS, BorderDet, DETR, …
  • data : MS-COCO
  • result : applying DyHead to an object detection model consistently improves performance. Nearly SOTA.
  • contribution : a separate attention for each dimension of the head tensor.
  • limitation or something I don’t understand : the exact shape of the tensor after each of the three attentions is not drawn.

Details

architecture

image

Dynamic Head

image

  • $F\in \mathbb{R}^{L\times S\times C}$ : feature tensor built from the backbone’s feature pyramid (levels resized to a common scale and stacked)

Ordinary self-attention over the whole tensor would be $W(F)=\pi(F)\cdot F$, but a joint attention over all of $L\times S\times C$ is too expensive. The dynamic head therefore factorizes it and pays attention to L, S, and C in sequence: $W(F)=\pi_C\big(\pi_S\big(\pi_L(F)\cdot F\big)\cdot F\big)\cdot F$.

Scale-aware Attention

$\pi_L(F)\cdot F = \sigma\!\left(f\!\left(\frac{1}{SC}\sum_{S,C}F\right)\right)\cdot F$

  • $f$ : linear function, implemented as a 1x1 conv
  • $\sigma(x) = \max\!\left(0, \min\!\left(1, \frac{x+1}{2}\right)\right)$ : hard sigmoid
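A minimal PyTorch sketch of the scale-aware attention under the definitions above. The module name, the `(B, L, C, H, W)` layout, and the pooling order are my assumptions for illustration, not the official implementation:

```python
import torch
import torch.nn as nn


def hard_sigmoid(x: torch.Tensor) -> torch.Tensor:
    # sigma(x) = max(0, min(1, (x + 1) / 2))
    return ((x + 1.0) / 2.0).clamp(0.0, 1.0)


class ScaleAwareAttention(nn.Module):
    """Sketch of pi_L: one attention scalar per pyramid level.
    Input: stacked pyramid tensor of shape (B, L, C, H, W)."""

    def __init__(self, channels: int):
        super().__init__()
        # f: linear function, implemented as a 1x1 convolution
        self.f = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, l, c, h, w = x.shape
        # global average pool over the spatial dims of every level
        pooled = x.mean(dim=(3, 4)).reshape(b * l, c, 1, 1)
        # one attention scalar per (batch, level), squashed by hard sigmoid
        attn = hard_sigmoid(self.f(pooled)).reshape(b, l, 1, 1, 1)
        return x * attn  # reweight levels; shape unchanged


x = torch.randn(2, 4, 16, 8, 8)  # B=2, L=4 levels, C=16, 8x8 spatial
out = ScaleAwareAttention(16)(x)
assert out.shape == x.shape
```

Since the attention weight lies in [0, 1], each level is only ever scaled down, never amplified.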

Spatial-aware Attention

$\pi_S(F)\cdot F = \frac{1}{L}\sum_{l=1}^{L}\sum_{k=1}^{K} w_{l,k}\cdot F\!\left(l;\, p_k+\Delta p_k;\, c\right)\cdot \Delta m_k$

Implemented with deformable convolution (sparse sampling), aggregated across the L levels.

  • $K$ : # of sparse sampling locations
  • $\Delta p_k$ : self-learned offset that shifts the sampling location $p_k$
  • $\Delta m_k$ : self-learned importance scalar at location $p_k + \Delta p_k$
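The sparse sampling above can be sketched in plain PyTorch with `grid_sample` on a single level; the paper actually uses a modulated deformable convolution, so the module name, the offset/mask predictors, and the omitted per-level averaging are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseSpatialAttention(nn.Module):
    """Sketch of pi_S on one level: sample K offset locations per output
    position, weight each sample by a learned importance scalar, and sum.
    A stand-in for the modulated deformable convolution in the paper."""

    def __init__(self, channels: int, k: int = 4):
        super().__init__()
        self.k = k
        # predict K (dy, dx) offsets (Delta p_k) and K scalars (Delta m_k)
        self.offset = nn.Conv2d(channels, 2 * k, 3, padding=1)
        self.mask = nn.Conv2d(channels, k, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        offsets = self.offset(x).view(b, self.k, 2, h, w)             # Delta p_k
        masks = torch.sigmoid(self.mask(x)).view(b, self.k, 1, h, w)  # Delta m_k

        # base grid of normalized coordinates in [-1, 1], (x, y) order
        ys = torch.linspace(-1, 1, h, device=x.device)
        xs = torch.linspace(-1, 1, w, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack((gx, gy), dim=-1)                          # (H, W, 2)

        out = torch.zeros_like(x)
        for k in range(self.k):
            # convert pixel offsets to normalized units (one pixel = 2/(n-1))
            dx = 2.0 * offsets[:, k, 1] / max(w - 1, 1)
            dy = 2.0 * offsets[:, k, 0] / max(h - 1, 1)
            grid = base.unsqueeze(0) + torch.stack((dx, dy), dim=-1)
            sampled = F.grid_sample(x, grid, align_corners=True)
            out = out + sampled * masks[:, k]
        # note: the 1/L mean over levels in the formula is omitted here,
        # since this sketch operates on a single level
        return out
```

Bilinear `grid_sample` keeps the fractional offsets differentiable, so the sampling locations themselves can be learned by backprop.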

Task-aware Attention

$\pi_C(F)\cdot F = \max\!\left(\alpha^1(F)\cdot F_c + \beta^1(F),\; \alpha^2(F)\cdot F_c + \beta^2(F)\right)$

  • $F_c$ : the c-th channel sliced from the feature map
  • $\theta(\cdot)$ : global average pooling over the L x S dimensions, then thresholding implemented as 2 FC layers -> normalization -> shifted sigmoid to $[-1, 1]$ (the pooling seems to be omitted in the formula?)
  • $[\alpha^1, \alpha^2, \beta^1, \beta^2]$ : outputs of the activation-thresholding function $\theta$ above
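A DyReLU-style sketch of the task-aware attention under these definitions; the module name, the `(B, L, C, H, W)` layout, and the two-FC reduction are assumptions, not the official code:

```python
import torch
import torch.nn as nn


class TaskAwareAttention(nn.Module):
    """Sketch of pi_C: a per-channel dynamic activation.
    theta: global average pool over (L, S) -> 2 FC layers -> shifted
    sigmoid to [-1, 1], yielding (alpha1, beta1, alpha2, beta2) per channel."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.channels = channels
        self.theta = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, 4 * channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, L, C, H, W); pool over levels and space -> (B, C)
        b = x.shape[0]
        pooled = x.mean(dim=(1, 3, 4))
        # shifted sigmoid: normalize the thresholds into [-1, 1]
        params = 2.0 * torch.sigmoid(self.theta(pooled)) - 1.0
        a1, b1, a2, b2 = params.view(b, 4, self.channels).unbind(dim=1)
        a1 = a1.view(b, 1, -1, 1, 1); b1 = b1.view(b, 1, -1, 1, 1)
        a2 = a2.view(b, 1, -1, 1, 1); b2 = b2.view(b, 1, -1, 1, 1)
        # channel-wise switch: max of two learned linear activations,
        # so each channel's activation can be turned up or off per task
        return torch.max(a1 * x + b1, a2 * x + b2)
```

With the right parameter values this reduces to a plain ReLU (e.g. $\alpha^1=1$, $\alpha^2=\beta^1=\beta^2=0$), so the module generalizes the usual activation rather than replacing it.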

image

Generalizing to Existing Detectors

  • one-stage detector : prior research shows that the cls subnetwork and the bbox regressor behave very differently, so they are conventionally separate branches. Here, unlike that convention, a single unified branch attached to the backbone predicts both cls and bbox. This is thanks to DyHead!

  • two-stage detector : apply the scale- and spatial-aware attentions before RoI pooling; the task-aware attention replaces the fully connected layers after it
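The wiring for the one-stage case can be sketched as follows. Only the stacking and placement are the point; the three attentions are stubbed with `Identity`, and the block count and tensor shapes are assumptions (the paper's default stacks several blocks):

```python
import torch
import torch.nn as nn


class DyHeadBlock(nn.Module):
    """One DyHead block: pi_L -> pi_S -> pi_C applied in sequence.
    The attentions are Identity placeholders in this sketch."""

    def __init__(self):
        super().__init__()
        self.pi_L = nn.Identity()  # scale-aware attention
        self.pi_S = nn.Identity()  # spatial-aware attention
        self.pi_C = nn.Identity()  # task-aware attention

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        return self.pi_C(self.pi_S(self.pi_L(f)))


# one-stage use: backbone/FPN features -> stacked DyHead blocks ->
# a single unified branch for both cls and bbox
head = nn.Sequential(*[DyHeadBlock() for _ in range(6)])
features = torch.randn(2, 5, 256, 32, 32)  # (B, L, C, H, W), toy shapes
out = head(features)
assert out.shape == features.shape
```

Because each block maps the L x S x C tensor back to the same shape, blocks compose freely and the head depth becomes a tunable hyperparameter.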

Result

image image image

Ablation

image

image