
TL;DR
- I read this because : NeurIPS, open-vocab object detection
- task : open-vocab object detection
- problem : CLIP learns an image-level representation that is not well aligned with the region-level detection task.
- idea : 1) extend the vocabulary by generating pseudo box labels on an image-classification dataset with a class-agnostic object detector; 2) knowledge distillation (KD) to pull region features close to CLIP embeddings; 3) tie the weights of 1) and 2) with a weight transfer function, since the two objectives pull in opposite directions.
- architecture : Faster R-CNN's region proposal network, but instead of a learned classifier, region features are projected into CLIP space (via the CLIP image encoder on region crops) and classified by the nearest CLIP text embedding of "a photo of {category}".
- objective : 1) point-wise embedding matching loss 2) inter-embedding relationship matching loss 3) image-level supervision loss
- baseline : supervised, OVR-CNN, ViLD, RegionCLIP, Detic …
- data : COCO, LVIS v1.0, ImageNet-21K, COCO-captions, LMDET
- evaluation : $AP_{base}$, $AP_{novel}$
- result : Decent performance
- contribution : Propose a learning framework that utilizes image-level data for detection
- limitation / things I cannot understand :
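The classification step in the architecture bullet can be sketched as zero-shot nearest-neighbor matching in CLIP space. The vectors below are random stand-ins for the real region features and text embeddings, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in the paper, region features come from the
# detector and class embeddings from the CLIP text encoder applied to
# "a photo of {category}" prompts.
num_regions, dim, num_classes = 4, 512, 3
region_feats = rng.normal(size=(num_regions, dim))
text_embeds = rng.normal(size=(num_classes, dim))

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Cosine similarity between each region and each class prompt;
# each region is assigned the class of its closest text embedding.
sims = l2_normalize(region_feats) @ l2_normalize(text_embeds).T
pred = sims.argmax(axis=1)
```

Because the classifier is just a similarity lookup against text embeddings, new categories can be added at inference time by encoding new prompts.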
Details
Preliminaries
- Multimodal ViT (MViT) : class-agnostic object detector (https://arxiv.org/pdf/2111.11430.pdf), used here to generate pseudo box labels on classification images


Detection Pipeline



Loss
Point-wise embedding matching loss
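My reading of this loss: an L1 distance pulling each (projected) detector region embedding toward the CLIP image-encoder embedding of the same region crop. A minimal sketch with random stand-in embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 512
region_embed = rng.normal(size=(n, d))  # detector-side region embeddings (projected)
clip_embed = rng.normal(size=(n, d))    # CLIP image-encoder embeddings of the same crops

# Point-wise embedding matching: mean L1 distance between paired
# embeddings (a sketch of my understanding, not the exact formulation).
l_pem = np.abs(region_embed - clip_embed).mean()
```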

Inter-embedding relationship matching loss
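As I understand it, this loss matches the *pairwise similarity structure* of the two embedding sets rather than the embeddings themselves, so relative relationships between regions are also distilled from CLIP. A sketch under that assumption (the paper's exact normalization may differ):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 512
region_embed = rng.normal(size=(n, d))
clip_embed = rng.normal(size=(n, d))

def cos_sim_matrix(x):
    # Pairwise cosine similarities within one set of embeddings.
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    return xn @ xn.T

# Inter-embedding relationship matching: L1 between the two
# pairwise-similarity matrices.
l_irm = np.abs(cos_sim_matrix(region_embed) - cos_sim_matrix(clip_embed)).mean()
```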

Image-level Supervision with Pseudo Box Labels …
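The pseudo-labeling step from the TL;DR can be sketched as: run the class-agnostic detector on a classification image, keep the top-scoring proposals, and assign them the image-level label. The boxes, scores, and label below are made-up values for illustration:

```python
import numpy as np

# Hypothetical class-agnostic detector output for one ImageNet image:
# boxes with objectness scores only; the image-level label supplies the class.
boxes = np.array([[10, 10, 80, 90], [5, 40, 60, 100], [0, 0, 30, 30]])
scores = np.array([0.91, 0.55, 0.20])
image_label = "zebra"  # image-level class from the classification dataset

top_k = 2  # keep the top-k proposals as pseudo box labels
keep = np.argsort(-scores)[:top_k]
pseudo_labels = [(boxes[i].tolist(), image_label) for i in keep]
```

These pseudo boxes let image-level datasets like ImageNet-21K supervise the detector on categories that have no box annotations.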
Weight Transfer Function
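My understanding of the tying in idea 3): rather than learning the image-level-supervision projection independently, its weights are produced from the distillation projection's weights by a small learned function, keeping the two branches coupled. A minimal sketch assuming a tiny two-layer MLP with LeakyReLU as that function (the weight shapes and function form here are my assumptions, not the paper's exact design):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 512

# Weights of the distillation-side projection, flattened to a vector.
w_pem = rng.normal(size=(d,))

# Hypothetical weight-transfer function: a small 2-layer MLP that
# produces the image-level-supervision weights from the KD weights.
w1 = rng.normal(size=(d, d)) * 0.01
w2 = rng.normal(size=(d, d)) * 0.01

def leaky_relu(x, slope=0.1):
    return np.where(x > 0, x, slope * x)

w_ils = leaky_relu(w_pem @ w1) @ w2  # transferred weights, same shape as w_pem
```

Gradients through this mapping let the image-level objective influence the distillation projection (and vice versa) instead of the two drifting in opposite directions.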

Result

