
paper

TL;DR

  • task : object detection
  • problem : Most object detection models rely on a fixed set of predefined object candidates such as anchor boxes, and even DETRs use a fixed set of learned object queries, so they cannot detect more objects at inference than they were set up with at training.
  • idea : use diffusion to denoise random boxes into the image's bboxes!
  • architecture : add Gaussian noise to the GT bboxes; the encoder (ResNet-50, Swin-B) extracts image features, which are cropped with RoI pooling; the decoder takes those features plus the bboxes from the previous step and predicts bbox/cls
  • objective : Hungarian loss (= DETR set-prediction loss)
  • baseline : DETR, deformable DETR, Sparse R-CNN
  • data : MS-COCO, LVIS
  • result : SOTA ?!
  • contribution : First paper applying diffusion to object detection
  • limitation or part I don’t understand : I haven’t read up on diffusion, so I don’t fully understand it, but I’m surprised the performance is this good… even SOTA? Are they tuning the setup to make the results look good?

Details

motivation


Preliminaries : diffusion model

$q(z_t \mid z_0) = \mathcal{N}(z_t;\ \sqrt{\bar\alpha_t}\, z_0,\ (1-\bar\alpha_t) I)$
  • $z_0$ : data sample
  • $z_t$ : latent noisy sample
  • $t$ : step
  • $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s = \prod_{s=1}^{t} (1-\beta_s)$ : noise schedule

The training loss is the MSE between the neural network output $f_\theta(z_t, t)$ and $z_0$: $\mathcal{L} = \frac{1}{2}\,\|f_\theta(z_t, t) - z_0\|^2$

In this paper, the GT bboxes play the role of $z_0$.
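A minimal numpy sketch of the forward corruption above (the linear beta schedule and all constants are illustrative, not the paper's exact values):

```python
import numpy as np

# illustrative linear beta schedule (not the paper's exact values)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)  # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

def forward_diffuse(z0, t, rng=None):
    """Sample z_t ~ q(z_t | z_0) = N(sqrt(abar_t) * z0, (1 - abar_t) * I)."""
    if rng is None:
        rng = np.random.default_rng(0)
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# here z0 is a single "GT box" in normalized (cx, cy, w, h) form
z0 = np.array([0.5, 0.5, 0.2, 0.3])
zt = forward_diffuse(z0, t=500)  # heavily corrupted box at mid schedule
```

At $t = 0$ almost all of the signal survives ($\bar\alpha_0 \approx 1$), while at large $t$ the box is close to pure Gaussian noise.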

architecture

  • encoder : ResNet / Swin backbone with a feature pyramid (FPN)
  • decoder : crops the proposal boxes and uses them as RoI features, similar to Sparse R-CNN #58

Difference from Sparse R-CNN

(1) Because it starts from random bboxes, inference can use more bboxes than were used in training. (2) Unlike Sparse R-CNN, the decoder takes only the RoI-pooled features as input. (3) The detector head is reused across the sampling steps.

Training

Training adds Gaussian noise to the GT bboxes to make noisy bboxes, and starts from those.

  • padding : since the number of GT bboxes differs per image, they are padded up to a fixed count. They tried padding with 1) copies of the GT bboxes, 2) random boxes, 3) image-size boxes, etc., and Gaussian random box padding worked best.
  • box corruption : the signal $\bar\alpha_t$ shrinks as step $t$ grows. The signal-to-noise ratio turned out to be important; detection needs a higher signal scaling value than image generation.
  • training losses : the DETR (Hungarian) loss is used
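The padding and corruption steps above can be sketched as follows (`num_proposals`, `scale`, and the schedule are illustrative values; the signal scale is the SNR knob the paper tunes):

```python
import numpy as np

def make_noisy_boxes(gt_boxes, t, alpha_bar, num_proposals=300, scale=2.0,
                     rng=None):
    """Pad GT boxes with Gaussian random boxes, then corrupt them at step t.

    gt_boxes: (n, 4) boxes in normalized [0, 1] (cx, cy, w, h).
    scale: signal scaling factor -- per the paper's observation, detection
    needs a larger value than image generation (2.0 here is illustrative).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(gt_boxes)
    # pad with Gaussian random boxes so every image has num_proposals boxes
    pad = rng.standard_normal((num_proposals - n, 4)) * 0.1 + 0.5
    x0 = np.concatenate([gt_boxes, pad], axis=0)
    x0 = (x0 * 2.0 - 1.0) * scale                  # map [0, 1] -> [-scale, scale]
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.clip(xt, -scale, scale)              # keep corrupted boxes in range

# toy schedule + two GT boxes
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
gt = np.array([[0.5, 0.5, 0.2, 0.3], [0.3, 0.7, 0.1, 0.1]])
noisy = make_noisy_boxes(gt, t=500, alpha_bar=alpha_bar)
```

The clipping keeps corrupted boxes inside the scaled range so they can still be mapped back to valid image coordinates for RoI pooling.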

Inference

Inference simply starts from random Gaussian bboxes.

  • ddim : DDIM is used (non-Markovian; unlike DDPM, each update is conditioned on the predicted initial value as well as the previous step) to compute the bboxes and pass them to the next step.
  • box_renewal : at each step $t$, poorly scored bboxes are filtered out by score and replaced with fresh random boxes.
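A minimal sketch of the box renewal idea (the 0.5 threshold is an assumed value, not from the paper):

```python
import numpy as np

def box_renewal(boxes, scores, thresh=0.5, rng=None):
    """Keep confident boxes, replace the rest with fresh Gaussian boxes
    so the next sampling step still starts from a full proposal set."""
    if rng is None:
        rng = np.random.default_rng(0)
    keep = scores > thresh
    renewed = rng.standard_normal(boxes.shape)   # fresh random boxes
    renewed[keep] = boxes[keep]                  # confident boxes survive
    return renewed

boxes = np.array([[0.5, 0.5, 0.2, 0.3],
                  [0.1, 0.1, 0.9, 0.9],
                  [0.3, 0.7, 0.1, 0.1]])
scores = np.array([0.9, 0.2, 0.8])
out = box_renewal(boxes, scores)  # row 1 is replaced, rows 0 and 2 kept
```

This is also what makes it possible to evaluate with more proposals than were used during training: the proposal set is just noise plus whatever survived the previous step.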

Result

  • COCO 2017

  • LVIS v1.0 val