paper

TL;DR

  • I read this because : NeurIPS, open-vocab object detection
  • task : open-vocab object detection
  • Problem : CLIP learns image-level representations, which are not well aligned with the region-level detection task.
  • idea : 1) Extend the vocabulary by generating pseudo-box labels on an image classification dataset with a class-agnostic object detector 2) Use KD to pull region features close to CLIP's embeddings 3) Tie the weights of 1) and 2) with a transfer function, since the two objectives pull in opposite directions.
  • architecture : Faster R-CNN's region proposal network, but instead of a learned classifier head, each region feature is matched against the CLIP text embeddings of "a photo of a {category}" and assigned the closest category.
  • objective : 1) point-wise embedding matching loss 2) inter-embedding relationship matching loss 3) image-level supervision loss
  • baseline : supervised, OVR-CNN, ViLD, RegionCLIP, Detic …
  • data : COCO, LVIS v1.0, ImageNet-21K, COCO-captions, LMDET
  • evaluation : $AP_{base}$, $AP_{novel}$
  • result : Decent performance
  • contribution : Propose a learning framework that utilizes image-level data for detection
  • limitation / things I cannot understand :
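The pseudo-labeling step in idea 1) above can be sketched as: run a class-agnostic detector on a classification image and attach the image-level label to the top-scoring box proposals. A minimal sketch with a hypothetical `pseudo_boxes` helper (the paper's actual selection rule may differ):

```python
def pseudo_boxes(proposals, scores, image_label, top_k=1):
    """Turn class-agnostic proposals into pseudo box labels.

    proposals : list of (x1, y1, x2, y2) boxes from a class-agnostic detector
    scores    : objectness score per proposal
    image_label : the image-level class label (e.g. from ImageNet-21K)
    """
    # Keep the top-k most object-like boxes and tag them with the image label
    order = sorted(range(len(scores)), key=lambda i: -scores[i])[:top_k]
    return [(proposals[i], image_label) for i in order]

# Toy usage: the higher-scoring box receives the image-level label
print(pseudo_boxes([[0, 0, 1, 1], [2, 2, 3, 3]], [0.2, 0.9], "zebra"))
```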

Details

Preliminaries

Detection Pipeline

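The open-vocabulary classification step of the pipeline (region features matched to CLIP text embeddings of "a photo of a {category}") can be sketched as a cosine-similarity nearest-neighbor lookup. Random arrays stand in for the real CLIP features here:

```python
import numpy as np

def classify_regions(region_feats, text_embeds):
    """Assign each region proposal the category whose CLIP text embedding
    ("a photo of a {category}") is closest in cosine similarity."""
    # L2-normalize so the dot product equals cosine similarity
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sims = r @ t.T               # (num_regions, num_categories)
    return sims.argmax(axis=1)   # closest category index per region

# Toy usage with stand-in features (no real CLIP encoder involved)
rng = np.random.default_rng(0)
labels = classify_regions(rng.normal(size=(5, 512)), rng.normal(size=(3, 512)))
print(labels.shape)  # one category index per region
```

Because the classifier is just a similarity lookup, swapping in text embeddings for new category names extends the vocabulary without retraining the head.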

Loss

  • Point-wise embedding matching loss

  • Inter-embedding relationship matching loss

  • Image-level Supervision with Pseudo Box Labels …

  • Weight Transfer Function
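A minimal numeric sketch of the two distillation losses and the weight-transfer coupling listed above. The exact distance functions and the form of the transfer map are assumptions for illustration, not the paper's definitions:

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def pointwise_loss(det_feats, clip_feats):
    # Pull each detector region embedding toward its CLIP region embedding
    # (L1 distance assumed here)
    return np.abs(det_feats - clip_feats).mean()

def relation_loss(det_feats, clip_feats):
    # Match the pairwise cosine-similarity structure of the two embedding
    # sets, so relative relationships between regions are preserved
    s_det = l2norm(det_feats) @ l2norm(det_feats).T
    s_clip = l2norm(clip_feats) @ l2norm(clip_feats).T
    return np.abs(s_det - s_clip).mean()

def weight_transfer(w_rkd, T):
    # Hypothetical linear map tying the image-level-supervision branch's
    # weights to the KD branch's weights (the paper learns this coupling)
    return T @ w_rkd

# Toy usage: identical embeddings give zero loss for both terms
rng = np.random.default_rng(0)
det, clip = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(pointwise_loss(det, clip), relation_loss(det, clip))
```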

Result
