image

paper

TL;DR

  • task : open-vocabulary object detection
  • problem : no detection annotations for novel classes
  • idea : use CLIP embeddings
  • architecture : CLIP's text encoder turns each class name into a text embedding; the ViT's image tokens serve as object queries; training uses bipartite matching and the DETR loss.
  • objective : DETR loss, but with sigmoid focal loss for the class labels
  • baselines : ViLD, GLIP
  • data : OI, VG, Objects365 for training -> evaluated on LVIS (long-tailed)
  • result : looks better than GLIP
  • contribution : solves open-vocabulary OD with a very simple architecture
  • limitation or something I don’t understand : isn’t GLIP also designed for open-vocabulary detection?
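The architecture/objective bullets can be sketched as a toy head: class logits are dot products between image tokens and class text embeddings, predictions are matched to ground truth with Hungarian (bipartite) matching, and classification uses sigmoid focal loss. This is a minimal sketch under simplifying assumptions (L1-only matching cost, NumPy instead of a DL framework); all function names are hypothetical, not from the paper's code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Per-element sigmoid focal loss (RetinaNet-style), used here for class labels."""
    p = 1.0 / (1.0 + np.exp(-logits))
    ce = -(targets * np.log(p + 1e-8) + (1 - targets) * np.log(1 - p + 1e-8))
    p_t = targets * p + (1 - targets) * (1 - p)          # prob of the true label
    a_t = targets * alpha + (1 - targets) * (1 - alpha)  # class-balance weight
    return a_t * (1 - p_t) ** gamma * ce

def match_and_classify(image_tokens, text_embeds, pred_boxes, gt_boxes, gt_classes):
    """Toy open-vocab detection loss: logits = image token . text embedding,
    bipartite matching on a simple box-L1 cost, then focal loss on the targets."""
    logits = image_tokens @ text_embeds.T                # (num_tokens, num_classes)
    cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    rows, cols = linear_sum_assignment(cost)             # DETR-style matching
    targets = np.zeros_like(logits)                      # unmatched tokens -> background
    targets[rows, gt_classes[cols]] = 1.0
    return sigmoid_focal_loss(logits, targets).mean()
```

In the real model the matching cost also includes the classification and GIoU terms; the sketch keeps only the box-L1 term to show the mechanics.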

Details

Architecture

image

training details

  • Initializing each per-token box prediction so that its (x, y) center starts at that image token's grid coordinates makes training converge faster
  • Various augmentation and data-cleaning steps are applied

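The box-initialization bullet can be sketched as a fixed per-token coordinate bias added to the box head's raw output before the sigmoid, so a freshly initialized (near-zero) head already predicts boxes centered on their own tokens. A minimal sketch assuming a square token grid and normalized cxcywh boxes; the names are hypothetical.

```python
import numpy as np

def token_center_bias(grid_size):
    """Per-token (cx, cy) bias: inverse-sigmoid of each token's normalized
    grid-cell center, so sigmoid(0 + bias) recovers the token's location."""
    coords = (np.arange(grid_size) + 0.5) / grid_size   # normalized cell centers
    cx, cy = np.meshgrid(coords, coords)
    centers = np.stack([cx.ravel(), cy.ravel()], axis=-1)
    return np.log(centers / (1.0 - centers))            # logit(center)

def predict_boxes(raw_head_output, grid_size):
    """Add the per-token bias to the raw (cx, cy) logits before squashing:
    an all-zero head output yields boxes centered on their own tokens."""
    bias = token_center_bias(grid_size)
    cxcy = 1.0 / (1.0 + np.exp(-(raw_head_output[:, :2] + bias)))
    wh = 1.0 / (1.0 + np.exp(-raw_head_output[:, 2:]))  # width/height in (0, 1)
    return np.concatenate([cxcy, wh], axis=-1)
```

With the bias, early training only has to learn offsets from each token's location rather than absolute positions, which is why convergence speeds up.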
zero-shot performance

image

one-shot image-conditioned result

image

one-/few-shot performance

image