image

paper

TL;DR

  • task : open-vocabulary object detection
  • problem : no detection annotations for novel classes
  • idea : use CLIP embeddings
  • architecture : CLIP's text encoder turns each class name into a text embedding; the ViT's image tokens serve as object queries; training uses bipartite matching and the DETR loss.
  • objective : DETR loss, but with sigmoid focal loss for the class labels
  • baselines : ViLD, GLIP
  • data : OI, VG, Objects365 for training -> evaluated on LVIS (long-tailed)
  • result : looks better than GLIP
  • contribution : solves open-vocabulary OD with a very simple architecture
  • limitation or something I don’t understand : isn’t GLIP also designed for open-vocabulary detection?
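The architecture/objective bullets can be sketched as a toy head: class logits are dot products between image tokens and class text embeddings, predictions are matched to ground truth with Hungarian (bipartite) matching, and classification uses sigmoid focal loss. This is a minimal sketch under simplifying assumptions (L1-only matching cost, NumPy instead of a DL framework); all function names are hypothetical, not from the paper's code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Per-element sigmoid focal loss (RetinaNet-style), used here for class labels."""
    p = 1.0 / (1.0 + np.exp(-logits))
    ce = -(targets * np.log(p + 1e-8) + (1 - targets) * np.log(1 - p + 1e-8))
    p_t = targets * p + (1 - targets) * (1 - p)          # prob of the true label
    a_t = targets * alpha + (1 - targets) * (1 - alpha)  # class-balance weight
    return a_t * (1 - p_t) ** gamma * ce

def match_and_classify(image_tokens, text_embeds, pred_boxes, gt_boxes, gt_classes):
    """Toy open-vocab detection loss: logits = image token . text embedding,
    bipartite matching on a simple box-L1 cost, then focal loss on the targets."""
    logits = image_tokens @ text_embeds.T                # (num_tokens, num_classes)
    cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    rows, cols = linear_sum_assignment(cost)             # DETR-style matching
    targets = np.zeros_like(logits)                      # unmatched tokens -> background
    targets[rows, gt_classes[cols]] = 1.0
    return sigmoid_focal_loss(logits, targets).mean()
```

In the real model the matching cost also includes the classification and GIoU terms; the sketch keeps only the box-L1 term to show the mechanics.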

Details

Architecture

image

training details

  • Initializing each per-token box prediction so that its (x, y) center starts at that image token's grid coordinates makes training converge faster
  • Various augmentation and data-cleaning steps are applied

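The box-initialization bullet can be sketched as a fixed per-token coordinate bias added to the box head's raw output before the sigmoid, so a freshly initialized (near-zero) head already predicts boxes centered on their own tokens. A minimal sketch assuming a square token grid and normalized cxcywh boxes; the names are hypothetical.

```python
import numpy as np

def token_center_bias(grid_size):
    """Per-token (cx, cy) bias: inverse-sigmoid of each token's normalized
    grid-cell center, so sigmoid(0 + bias) recovers the token's location."""
    coords = (np.arange(grid_size) + 0.5) / grid_size   # normalized cell centers
    cx, cy = np.meshgrid(coords, coords)
    centers = np.stack([cx.ravel(), cy.ravel()], axis=-1)
    return np.log(centers / (1.0 - centers))            # logit(center)

def predict_boxes(raw_head_output, grid_size):
    """Add the per-token bias to the raw (cx, cy) logits before squashing:
    an all-zero head output yields boxes centered on their own tokens."""
    bias = token_center_bias(grid_size)
    cxcy = 1.0 / (1.0 + np.exp(-(raw_head_output[:, :2] + bias)))
    wh = 1.0 / (1.0 + np.exp(-raw_head_output[:, 2:]))  # width/height in (0, 1)
    return np.concatenate([cxcy, wh], axis=-1)
```

With the bias, early training only has to learn offsets from each token's location rather than absolute positions, which is why convergence speeds up.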
zero-shot performance

image

one-shot image-conditioned result

image

one-/few-shot performance

image