
paper, code

TL;DR

  • I read this because : it was mentioned a lot in the thesis meetings.
  • task : reformulate object detection as a phrase grounding problem
  • problem : Image classification models are hard to apply in the real world because they classify within a fixed set of categories. CLIP solved this with image-text pairs, but only for image classification; we want to solve tasks at the object detection level as well!
  • idea : Recast object detection as phrase grounding: the classes are given in the form of a prompt, and the model learns to align image regions with the words in that prompt.
  • architecture : 1) visual encoder (Swin) + DyHead 2) pretrained BERT 3) deep (early) fusion of 1 and 2.
  • objective : classification loss (with alignment scores in place of classifier logits!) + localization (regression) loss
  • baseline : Faster R-CNN, DyHead
  • data : COCO, LVIS, Flickr30K, Object365, GoldG, OpenImages, Visual Genome, ImageNetBoxes
  • evaluation : AP
  • Results : 1) Outperformed supervised baselines on COCO and LVIS without seeing them during training (zero-shot), 2) achieved SOTA when fine-tuned on COCO, 3) 1-shot GLIP outperformed the fully supervised Dynamic Head on 13 object detection downstream tasks.
  • contribution : brings CLIP-style language-image pre-training to object detection
  • limitation / things I cannot understand :

Details

preliminaries

Data

  • COCO : 80 object categories; 118K training, 5K validation, 41K test images
  • LVIS : long-tail object detection; 1000+ categories
  • Flickr30K : images, each paired with 5 reference sentences; originally image captioning data
  • Objects365 : 365 categories, 2 million images, 30 million bounding boxes
  • GoldG : 0.8M human-annotated grounding data curated in the MDETR paper
  • OpenImages : 15,851,536 boxes on 600 categories; 478,000 crowdsourced images with 6,000+ categories
  • Visual Genome : 108,077 images, 5.4M region descriptions, 2.3M relationships
  • ImageNetBoxes : ?
architecture

Object detection has two losses: a localization loss and a classification loss. Localization is out of scope for this note; we only tackle the classification part.

For a typical object detection problem, the classification loss is defined as L_cls = loss(S_cls; T), where O = Enc_I(Img) are the region features, S_cls = O W^T are the classification logits (W is the classifier weight matrix), and T are the target labels.

Instead of a classifier, we have a separate image encoder and a separate language encoder that encodes the prompt; the inner product of region features and token features gives the alignment scores S_ground = O P^T, where P = Enc_L(Prompt). These replace the classifier logits.
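To make the shapes concrete, here is a minimal numpy sketch contrasting the two kinds of logits. All sizes and variable names are made up for illustration; this is not the paper's code.

```python
import numpy as np

# Hypothetical shapes: N regions, C classes, M prompt tokens, feature dim d.
N, C, M, d = 4, 3, 6, 8
rng = np.random.default_rng(0)

O = rng.normal(size=(N, d))   # region features, O = Enc_I(Img)
W = rng.normal(size=(C, d))   # classifier weights of a standard detector
P = rng.normal(size=(M, d))   # token features, P = Enc_L(Prompt)

S_cls = O @ W.T        # (N, C): classic per-class classification logits
S_ground = O @ P.T     # (N, M): region-token alignment scores

print(S_cls.shape, S_ground.shape)
```

The only structural change is that the second axis now indexes prompt tokens rather than a fixed class vocabulary, which is what lets the detector handle classes described in free text.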

The same scores plug straight into the loss; the target matrix simply gains a token dimension instead of a class dimension. (Phrases may span multiple words, sub-word tokenization splits words into several tokens, and added tokens such as [NoObj] must be handled too.)

The loss is then computed as a binary sigmoid loss over the region-token alignment scores.
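A toy implementation of that idea, assuming plain binary cross-entropy over the (region, token) matrix (the paper actually uses a focal-loss variant; the function name and shapes here are my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def token_sigmoid_loss(S_ground, T):
    """Binary cross-entropy over region-token alignment logits.

    S_ground: (N, M) alignment logits for N regions and M tokens.
    T: (N, M) binary targets; T[i, j] = 1 if token j belongs to the
       phrase matched to region i (a row can have several positives,
       since a phrase may span multiple sub-word tokens).
    """
    p = sigmoid(S_ground)
    eps = 1e-9
    return float(np.mean(-(T * np.log(p + eps) + (1 - T) * np.log(1 - p + eps))))

# Toy example: 2 regions, 3 tokens; predictions agree with targets.
S = np.array([[ 5.0, -5.0, -5.0],
              [-5.0,  5.0,  5.0]])
T = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]])
loss = token_sigmoid_loss(S, T)
print(loss)  # small value: confident, correct predictions
```

Each region-token pair is scored independently, which is why multiple tokens per row (multi-token phrases) pose no problem.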


Faster R-CNN and Dynamic Head (SOTA) are used as detection models, Swin-T and Swin-L as image encoders, and BERT as the text encoder.

Deep fusion, in contrast to late fusion (simply combining the outputs of the separate encoders), exchanges information between the two modalities as layers are stacked. Here, new cross-modality layers are added on top of the existing BERT layers, and these added layers exchange their outputs with the visual branch.
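The exchange can be sketched as two cross-attention passes per fused layer, one in each direction. This is a simplified single-head version with no learned projections; the paper's X-MHA is a multi-head module inside DyHead/BERT, and the shapes here are invented:

```python
import numpy as np

def attend(Q, K, V):
    """Scaled dot-product attention (single head, no projections)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

def deep_fusion_step(O, P):
    """One cross-modality exchange: regions attend to tokens and
    tokens attend to regions, each updated residually."""
    O_new = O + attend(O, P, P)  # image queries, text keys/values
    P_new = P + attend(P, O, O)  # text queries, image keys/values
    return O_new, P_new

rng = np.random.default_rng(1)
O = rng.normal(size=(4, 8))  # region features from the visual branch
P = rng.normal(size=(6, 8))  # token features from the BERT branch
for _ in range(3):           # stack a few fused top layers
    O, P = deep_fusion_step(O, P)
print(O.shape, P.shape)
```

Because both streams are updated at every fused layer, the visual features become prompt-aware (and vice versa) before the final alignment scores are computed, which is the point of doing fusion early rather than only at the output.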

Result
