[98] Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection

TL;DR

I read this because.. : NeurIPS, open-vocab object detection
task : open-vocab object detection
problem : CLIP produces image-level representations, so it is not well aligned for the detection task.
idea : 1) use a class-agnostic object detection model to generate pseudo-labels on an image classification dataset and thereby expand the vocabulary 2) apply KD so that region features move closer to CLIP 3) since 1 and 2 pull in opposite directions, tie the weights of the two
architecture : takes the region proposals from Faster R-CNN and, instead of using a classifier, feeds the image feature to the CLIP image encoder and classifies it as the category whose CLIP text embedding of "a photo of {category}" is closest
objective : 1) point-wise embedding matching loss 2) inter-embedding relationship matching loss 3) image-level supervision loss
baseline : supervised, OVR-CNN, ViLD, RegionCLIP, Detic …
data : COCO, LVIS v1.0, ImageNet-21K, COCO-captions, LMDET
evaluation : $AP_{base}$, $AP_{novel}$
result : decent performance
contribution : proposes a training framework that uses image-level data for detection
limitation / things I cannot understand :
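
The architecture and the first two objectives listed above can be illustrated with a small, dependency-free sketch. It is not the paper's implementation. The embeddings are made-up 2-D vectors standing in for CLIP outputs, the function names are hypothetical, and the exact loss forms (L1 distance for the point-wise term, squared difference of cosine-similarity matrices for the relationship term, mean reduction) are assumptions, since these notes only name the losses.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = [l2_normalize(r) for r in a]
    b = [l2_normalize(r) for r in b]
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]

def classify_regions(region_emb, text_emb, categories):
    """Label each region with the category whose text embedding is most similar."""
    sims = cosine_matrix(region_emb, text_emb)
    return [categories[max(range(len(row)), key=row.__getitem__)] for row in sims]

def pointwise_loss(region_emb, clip_emb):
    """Assumed form: L1 distance between matching embeddings, averaged over regions."""
    total = sum(abs(x - y)
                for r, c in zip(region_emb, clip_emb)
                for x, y in zip(r, c))
    return total / len(region_emb)

def relationship_loss(region_emb, clip_emb):
    """Assumed form: mean squared difference of the two pairwise-similarity matrices."""
    s_r = cosine_matrix(region_emb, region_emb)
    s_c = cosine_matrix(clip_emb, clip_emb)
    n = len(region_emb)
    return sum((s_r[i][j] - s_c[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)

# Toy stand-ins: in practice these would come from the detector's region head,
# the CLIP image encoder on cropped proposals, and the CLIP text encoder on
# prompts such as "a photo of {category}".
categories = ["cat", "dog"]
text_emb = [[1.0, 0.0], [0.0, 1.0]]
region_emb = [[0.9, 0.1], [0.2, 0.8]]
clip_emb = [[1.0, 0.0], [0.0, 1.0]]

labels = classify_regions(region_emb, text_emb, categories)
print(labels)  # ['cat', 'dog']
print(pointwise_loss(region_emb, clip_emb))
print(relationship_loss(region_emb, clip_emb))
```

The point-wise term pulls each region embedding toward its CLIP counterpart individually, while the relationship term only asks that the similarity structure among regions match the structure among the CLIP embeddings, so it is zero whenever the two sets of embeddings have identical pairwise similarities.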