problem : Existing object detection models predict over a closed set of classes, which is hard to scale. Open-vocabulary object detection tackles this, but existing approaches run an RPN first and then do class prediction on the proposals, and the RPN (trained only on base classes) makes it difficult to predict bboxes for novel classes.
idea : Use DETR to do object detection end-to-end! Treat each class name as text, encode it into a text embedding with CLIP, and feed that embedding to the detector as a condition.
architecture : The image and the text (= class name) are embedded with CLIP, then combined with the object queries to form conditional queries. Since one image can contain multiple objects of the same class, each query set is copied N times. Then, given the input image and a conditional query, bipartite matching is done with binary [matched] / [not matched] labels instead of the usual [obj] / [no obj] class labels.
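A minimal sketch of the conditional-query construction described above. All dimensions and the random stand-ins for CLIP embeddings are made up for illustration, and the fusion here is a plain addition (the actual model projects and fuses the embeddings inside the transformer):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the real model uses DETR's hidden dimension.
num_queries, embed_dim = 4, 8   # learnable object queries, as in DETR
num_classes = 3                 # base + novel class names

# Stand-ins for CLIP text embeddings of class names (e.g. "a photo of a {class}").
text_embeds = rng.normal(size=(num_classes, embed_dim))
object_queries = rng.normal(size=(num_queries, embed_dim))

def conditional_queries(queries, cond_embed):
    """Condition every object query on one class embedding.
    Addition is a placeholder for the paper's fusion step."""
    return queries + cond_embed[None, :]

# Copy the query set once per class embedding: N copies of the queries,
# each conditioned on a different class.
all_queries = np.concatenate(
    [conditional_queries(object_queries, e) for e in text_embeds], axis=0
)
print(all_queries.shape)  # (num_classes * num_queries, embed_dim)
```

Each group of `num_queries` rows is then decoded against the same image features, and its predictions are supervised only by the boxes of the class it was conditioned on.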
result : Compared to prior OV-OD models, SOTA on both overall AP and novel-class AP.
contribution : end-to-end open-vocabulary object detection
Limitations or things I don’t understand : At inference I already have embeddings for all base/novel classes (R in the paper), so am I supposed to run matching against all of them to make predictions? Confusing. And for training, do they do something like in-batch negatives?
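For intuition on the matching step itself, here is a toy sketch of bipartite matching: pick the prediction-to-target assignment with minimal total cost. The cost values are invented; in the paper the cost would mix the binary [matched]/[not matched] score with box losses. Brute force over permutations stands in for the Hungarian algorithm and is only viable for tiny examples:

```python
import itertools
import numpy as np

# Toy cost matrix: cost[i, j] = cost of assigning prediction i to target j.
# Numbers are made up for illustration.
cost = np.array([
    [0.9, 0.1, 0.8],
    [0.2, 0.7, 0.6],
    [0.5, 0.4, 0.1],
])

def bipartite_match(cost):
    """Try every permutation of targets and keep the cheapest assignment.
    (Real implementations use the Hungarian algorithm instead.)"""
    n = cost.shape[0]
    best_perm, best_cost = None, float("inf")
    for perm in itertools.permutations(range(n)):
        c = sum(cost[i, perm[i]] for i in range(n))
        if c < best_cost:
            best_perm, best_cost = perm, c
    return best_perm, best_cost

perm, total = bipartite_match(cost)
print(perm)  # → (1, 0, 2): prediction 0→target 1, 1→target 0, 2→target 2
```

With binary labels, each matched prediction is trained toward [matched] for the conditioning class, and the rest toward [not matched], instead of predicting a class logit per category.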