
paper
TL;DR#
- task: open-vocabulary object detection
- problem: no object-detection annotations exist for novel classes
- idea: use CLIP embeddings as the classifiers
- architecture: CLIP's text encoder turns class names into text embeddings; the ViT's output tokens act as object queries; training uses bipartite matching and the DETR losses
- objective: the DETR losses, but with a sigmoid focal loss for classification
- baselines: ViLD, GLIP
- data: OI, VG, Objects365 -> LVIS (long-tail)
- result: appears stronger than GLIP
- contribution: solves open-vocabulary OD with a very simple architecture
- limitation / open question: is GLIP actually designed for open-vocabulary detection?
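The classification step in the architecture bullet can be sketched as follows: projected ViT output tokens are scored against CLIP text embeddings of the class names by cosine similarity, with an independent sigmoid per class (matching the sigmoid focal loss objective). This is a minimal illustrative sketch, not the paper's actual code; all names and shapes are assumptions.

```python
import numpy as np

def classify_tokens(image_tokens, text_embeds):
    """Score each ViT output token against each class text embedding.

    image_tokens: (num_tokens, dim)  projected ViT output tokens (one query per token)
    text_embeds:  (num_classes, dim) CLIP text embeddings of the class names
    Returns per-token, per-class probabilities via independent sigmoids
    (multi-label, as fits a sigmoid focal classification loss).
    """
    # Cosine similarity: normalize both sides, then take dot products.
    t = image_tokens / np.linalg.norm(image_tokens, axis=-1, keepdims=True)
    c = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    logits = t @ c.T  # (num_tokens, num_classes)
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid per class, not softmax

tokens = np.random.randn(576, 512)   # e.g. a 24x24 patch grid
classes = np.random.randn(3, 512)    # e.g. "cat", "dog", "bird"
probs = classify_tokens(tokens, classes)
assert probs.shape == (576, 3)
```

Because classes enter only as embeddings, adding a novel class at test time is just embedding one more name — no detector retraining needed.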
Details#
Architecture#

Training details#
- Each image token makes one bbox prediction; initializing it so that its (x, y) starts inside that token's own grid cell makes training converge faster
- Various data augmentation and cleaning are applied
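The bbox-initialization trick in the first bullet can be sketched as adding each token's grid position as a fixed logit-space bias to the predicted box center, so an untrained head (outputting ~0) already places each box at its own token. A sketch under assumed names; the paper's exact parameterization may differ.

```python
import numpy as np

def token_grid_bias(grid_size):
    """Normalized (cx, cy) center of each patch token in a grid_size x grid_size grid."""
    xs, ys = np.meshgrid(np.arange(grid_size), np.arange(grid_size))
    cx = (xs.flatten() + 0.5) / grid_size
    cy = (ys.flatten() + 0.5) / grid_size
    return np.stack([cx, cy], axis=-1)  # (grid_size**2, 2)

def decode_boxes(raw_center, raw_size, bias):
    """Add the per-token bias before the sigmoid, so a zero raw prediction
    lands exactly at the token's own location."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    center = sigmoid(raw_center + np.log(bias / (1.0 - bias)))  # logit-space bias
    size = sigmoid(raw_size)
    return np.concatenate([center, size], axis=-1)  # (cx, cy, w, h), all in [0, 1]

bias = token_grid_bias(24)     # 24x24 = 576 tokens
raw = np.zeros((576, 2))       # an untrained head outputs roughly zero
boxes = decode_boxes(raw, raw, bias)
# with zero raw predictions, each box center equals its token's position
assert np.allclose(boxes[:, :2], bias)
```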

One-shot image-conditioned results#
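Since classes are just embeddings, the text embedding can be swapped for an image embedding of an example patch of the target object, giving one-shot image-conditioned detection with the same scoring machinery. A minimal sketch with illustrative names:

```python
import numpy as np

def image_conditioned_scores(image_tokens, query_embed):
    """Score tokens by similarity to a query-image embedding instead of text.

    image_tokens: (num_tokens, dim) projected ViT tokens of the target image
    query_embed:  (dim,) embedding of an example patch of the object to find
    """
    t = image_tokens / np.linalg.norm(image_tokens, axis=-1, keepdims=True)
    q = query_embed / np.linalg.norm(query_embed)
    logits = t @ q
    return 1.0 / (1.0 + np.exp(-logits))  # per-token score for the query class

tokens = np.random.randn(576, 512)
query = np.random.randn(512)   # would come from the image encoder in practice
scores = image_conditioned_scores(tokens, query)
assert scores.shape == (576,)
```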

