
TL;DR
- I read this because : NeurIPS, open-vocab object detection
- task : open-vocab object detection
- problem : CLIP learns an image-level representation that is not well aligned with the region-level detection task.
- idea : 1) extend the vocabulary by generating pseudo box labels on an image-classification dataset with a class-agnostic object detector; 2) knowledge distillation (KD) to pull region features close to CLIP embeddings; 3) tie the weights of 1) and 2) with a weight transfer function, since the two objectives pull in opposite directions.
- architecture : Faster R-CNN's region proposal network, but instead of a learned classifier, region features are projected into CLIP space (via the CLIP image encoder on region crops) and classified by the nearest CLIP text embedding of "a photo of {category}".
- objective : 1) point-wise embedding matching loss 2) inter-embedding relationship matching loss 3) image-level supervision loss
- baseline : supervised, OVR-CNN, ViLD, RegionCLIP, Detic …
- data : COCO, LVIS v1.0, ImageNet-21K, COCO-captions, LMDET
- evaluation : $AP_{base}$, $AP_{novel}$
- result : Decent performance
- contribution : Propose a learning framework that utilizes image-level data for detection
- limitation / things I cannot understand :
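The classification step in the architecture bullet can be sketched as zero-shot nearest-neighbor matching in CLIP space. The vectors below are random stand-ins for the real region features and text embeddings, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in the paper, region features come from the
# detector and class embeddings from the CLIP text encoder applied to
# "a photo of {category}" prompts.
num_regions, dim, num_classes = 4, 512, 3
region_feats = rng.normal(size=(num_regions, dim))
text_embeds = rng.normal(size=(num_classes, dim))

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Cosine similarity between each region and each class prompt;
# each region is assigned the class of its closest text embedding.
sims = l2_normalize(region_feats) @ l2_normalize(text_embeds).T
pred = sims.argmax(axis=1)
```

Because the classifier is just a similarity lookup against text embeddings, new categories can be added at inference time by encoding new prompts.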
Details
Preliminaries
- Multimodal ViT (MViT) : class-agnostic object detector (https://arxiv.org/pdf/2111.11430.pdf), used here to generate pseudo box labels on classification images


Detection Pipeline



Loss
Point-wise embedding matching loss
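My reading of this loss: an L1 distance pulling each (projected) detector region embedding toward the CLIP image-encoder embedding of the same region crop. A minimal sketch with random stand-in embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 512
region_embed = rng.normal(size=(n, d))  # detector-side region embeddings (projected)
clip_embed = rng.normal(size=(n, d))    # CLIP image-encoder embeddings of the same crops

# Point-wise embedding matching: mean L1 distance between paired
# embeddings (a sketch of my understanding, not the exact formulation).
l_pem = np.abs(region_embed - clip_embed).mean()
```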

Inter-embedding relationship matching loss
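As I understand it, this loss matches the *pairwise similarity structure* of the two embedding sets rather than the embeddings themselves, so relative relationships between regions are also distilled from CLIP. A sketch under that assumption (the paper's exact normalization may differ):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 512
region_embed = rng.normal(size=(n, d))
clip_embed = rng.normal(size=(n, d))

def cos_sim_matrix(x):
    # Pairwise cosine similarities within one set of embeddings.
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    return xn @ xn.T

# Inter-embedding relationship matching: L1 between the two
# pairwise-similarity matrices.
l_irm = np.abs(cos_sim_matrix(region_embed) - cos_sim_matrix(clip_embed)).mean()
```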

Image-level Supervision with Pseudo Box Labels …
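The pseudo-labeling step from the TL;DR can be sketched as: run the class-agnostic detector on a classification image, keep the top-scoring proposals, and assign them the image-level label. The boxes, scores, and label below are made-up values for illustration:

```python
import numpy as np

# Hypothetical class-agnostic detector output for one ImageNet image:
# boxes with objectness scores only; the image-level label supplies the class.
boxes = np.array([[10, 10, 80, 90], [5, 40, 60, 100], [0, 0, 30, 30]])
scores = np.array([0.91, 0.55, 0.20])
image_label = "zebra"  # image-level class from the classification dataset

top_k = 2  # keep the top-k proposals as pseudo box labels
keep = np.argsort(-scores)[:top_k]
pseudo_labels = [(boxes[i].tolist(), image_label) for i in keep]
```

These pseudo boxes let image-level datasets like ImageNet-21K supervise the detector on categories that have no box annotations.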
Weight Transfer Function
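My understanding of the tying in idea 3): rather than learning the image-level-supervision projection independently, its weights are produced from the distillation projection's weights by a small learned function, keeping the two branches coupled. A minimal sketch assuming a tiny two-layer MLP with LeakyReLU as that function (the weight shapes and function form here are my assumptions, not the paper's exact design):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 512

# Weights of the distillation-side projection, flattened to a vector.
w_pem = rng.normal(size=(d,))

# Hypothetical weight-transfer function: a small 2-layer MLP that
# produces the image-level-supervision weights from the KD weights.
w1 = rng.normal(size=(d, d)) * 0.01
w2 = rng.normal(size=(d, d)) * 0.01

def leaky_relu(x, slope=0.1):
    return np.where(x > 0, x, slope * x)

w_ils = leaky_relu(w_pem @ w1) @ w2  # transferred weights, same shape as w_pem
```

Gradients through this mapping let the image-level objective influence the distillation projection (and vice versa) instead of the two drifting in opposite directions.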

Result

