
paper

TL;DR

  • task : object detection
  • problem : Most object detection models rely on a fixed set of predefined object candidates such as anchor boxes, and even DETRs use a fixed set of learned object queries, so they cannot detect more objects at inference than they were set up with at training.
  • idea : use diffusion to denoise random boxes into the image's bboxes!
  • architecture : add Gaussian noise to the GT bboxes; the encoder (ResNet-50, Swin-B) extracts image features, which are cropped with RoI pooling; the decoder takes those features plus the bboxes from the previous step and predicts bbox/cls
  • objective : Hungarian loss (= DETR set-prediction loss)
  • baseline : DETR, deformable DETR, Sparse R-CNN
  • data : MS-COCO, LVIS
  • result : SOTA ?!
  • contribution : First paper applying diffusion to object detection
  • limitation or part I don’t understand : I haven’t read up on diffusion, so I don’t fully understand it, but I’m surprised the performance is this good… even SOTA? Are they tuning the setup to make the results look good?

Details

motivation


Preliminaries : diffusion model

$q(z_t \mid z_0) = \mathcal{N}(z_t;\ \sqrt{\bar\alpha_t}\, z_0,\ (1-\bar\alpha_t) I)$
  • $z_0$ : data sample
  • $z_t$ : latent noisy sample
  • $t$ : step
  • $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s = \prod_{s=1}^{t} (1-\beta_s)$ : noise schedule

The training loss is the MSE between the neural network output $f_\theta(z_t, t)$ and $z_0$: $\mathcal{L} = \frac{1}{2}\,\|f_\theta(z_t, t) - z_0\|^2$

In this paper, the GT bboxes play the role of $z_0$.
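A minimal numpy sketch of the forward corruption above (the linear beta schedule and all constants are illustrative, not the paper's exact values):

```python
import numpy as np

# illustrative linear beta schedule (not the paper's exact values)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)  # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

def forward_diffuse(z0, t, rng=None):
    """Sample z_t ~ q(z_t | z_0) = N(sqrt(abar_t) * z0, (1 - abar_t) * I)."""
    if rng is None:
        rng = np.random.default_rng(0)
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# here z0 is a single "GT box" in normalized (cx, cy, w, h) form
z0 = np.array([0.5, 0.5, 0.2, 0.3])
zt = forward_diffuse(z0, t=500)  # heavily corrupted box at mid schedule
```

At $t = 0$ almost all of the signal survives ($\bar\alpha_0 \approx 1$), while at large $t$ the box is close to pure Gaussian noise.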

architecture

  • encoder : ResNet / Swin backbone with a feature pyramid (FPN)
  • decoder : crops the proposal boxes and uses them as RoI features, similar to Sparse R-CNN #58

Difference from Sparse R-CNN

(1) Because it starts from random bboxes, inference can use more bboxes than were used in training. (2) Unlike Sparse R-CNN, the decoder takes only the RoI-pooled features as input. (3) The detector head is reused across the sampling steps.

Training

Training adds Gaussian noise to the GT bboxes to make noisy bboxes, and starts from those.

  • padding : since the number of GT bboxes differs per image, they are padded up to a fixed count. They tried padding with 1) copies of the GT bboxes, 2) random boxes, 3) image-size boxes, etc., and Gaussian random box padding worked best.
  • box corruption : the signal $\bar\alpha_t$ shrinks as step $t$ grows. The signal-to-noise ratio turned out to be important; detection needs a higher signal scaling value than image generation.
  • training losses : the DETR (Hungarian) loss is used
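The padding and corruption steps above can be sketched as follows (`num_proposals`, `scale`, and the schedule are illustrative values; the signal scale is the SNR knob the paper tunes):

```python
import numpy as np

def make_noisy_boxes(gt_boxes, t, alpha_bar, num_proposals=300, scale=2.0,
                     rng=None):
    """Pad GT boxes with Gaussian random boxes, then corrupt them at step t.

    gt_boxes: (n, 4) boxes in normalized [0, 1] (cx, cy, w, h).
    scale: signal scaling factor -- per the paper's observation, detection
    needs a larger value than image generation (2.0 here is illustrative).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(gt_boxes)
    # pad with Gaussian random boxes so every image has num_proposals boxes
    pad = rng.standard_normal((num_proposals - n, 4)) * 0.1 + 0.5
    x0 = np.concatenate([gt_boxes, pad], axis=0)
    x0 = (x0 * 2.0 - 1.0) * scale                  # map [0, 1] -> [-scale, scale]
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.clip(xt, -scale, scale)              # keep corrupted boxes in range

# toy schedule + two GT boxes
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))
gt = np.array([[0.5, 0.5, 0.2, 0.3], [0.3, 0.7, 0.1, 0.1]])
noisy = make_noisy_boxes(gt, t=500, alpha_bar=alpha_bar)
```

The clipping keeps corrupted boxes inside the scaled range so they can still be mapped back to valid image coordinates for RoI pooling.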

Inference

Inference simply starts from random Gaussian bboxes.

  • ddim : DDIM is used (non-Markovian; unlike DDPM, each update is conditioned on the predicted initial value as well as the previous step) to compute the bboxes and pass them to the next step.
  • box_renewal : at each step $t$, poorly scored bboxes are filtered out by score and replaced with fresh random boxes.
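A minimal sketch of the box renewal idea (the 0.5 threshold is an assumed value, not from the paper):

```python
import numpy as np

def box_renewal(boxes, scores, thresh=0.5, rng=None):
    """Keep confident boxes, replace the rest with fresh Gaussian boxes
    so the next sampling step still starts from a full proposal set."""
    if rng is None:
        rng = np.random.default_rng(0)
    keep = scores > thresh
    renewed = rng.standard_normal(boxes.shape)   # fresh random boxes
    renewed[keep] = boxes[keep]                  # confident boxes survive
    return renewed

boxes = np.array([[0.5, 0.5, 0.2, 0.3],
                  [0.1, 0.1, 0.9, 0.9],
                  [0.3, 0.7, 0.1, 0.1]])
scores = np.array([0.9, 0.2, 0.8])
out = box_renewal(boxes, scores)  # row 1 is replaced, rows 0 and 2 kept
```

This is also what makes it possible to evaluate with more proposals than were used during training: the proposal set is just noise plus whatever survived the previous step.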

Result

  • COCO 2017

  • LVIS v1.0 val