
paper, code

TL;DR

  • I read this because : it was mentioned a lot in the thesis meetings.
  • task : reformulate object detection as a phrase grounding problem
  • problem : Image classification models are hard to apply in the real world because they classify within a fixed set of categories. CLIP solved this with image-text pairs, but only for image classification; we want to solve tasks at the object detection level as well!
  • idea : Recast object detection as phrase grounding: the classes are given in the form of a prompt, and the model learns to align image regions with the words in that prompt.
  • architecture : 1) visual encoder (Swin) + DyHead 2) pretrained BERT 3) deep (early) fusion of 1 and 2.
  • objective : classification loss (with alignment scores in place of classifier logits!) + localization (regression) loss
  • baseline : Faster R-CNN, DyHead
  • data : COCO, LVIS, Flickr30K, Object365, GoldG, OpenImages, Visual Genome, ImageNetBoxes
  • evaluation : AP
  • Results : 1) Outperformed supervised baselines on COCO and LVIS without seeing them during training (zero-shot), 2) achieved SOTA when fine-tuned on COCO, 3) 1-shot GLIP outperformed the fully supervised Dynamic Head on 13 object detection downstream tasks.
  • contribution : brings CLIP-style language-image pre-training to object detection
  • limitation / things I cannot understand :

Details

preliminaries

Data

  • COCO : 80 object categories; 118K training, 5K validation, 41K test images
  • LVIS : long-tail object detection; 1000+ categories
  • Flickr30K : images, each paired with 5 reference sentences; originally image captioning data
  • Objects365 : 365 categories, 2 million images, 30 million bounding boxes
  • GoldG : 0.8M human-annotated grounding data curated in the MDETR paper
  • OpenImages : 15,851,536 boxes on 600 categories; 478,000 crowdsourced images with 6,000+ categories
  • Visual Genome : 108,077 images, 5.4M region descriptions, 2.3M relationships
  • ImageNetBoxes : ?
architecture

Object detection has two losses: a localization loss and a classification loss. Localization is out of scope for this note; we only tackle the classification part.

For a typical object detection problem, the classification loss is defined as L_cls = loss(S_cls; T), where O = Enc_I(Img) are the region features, S_cls = O W^T are the classification logits (W is the classifier weight matrix), and T are the target labels.

Instead of a classifier, we have a separate image encoder and a separate language encoder that encodes the prompt; the inner product of region features and token features gives the alignment scores S_ground = O P^T, where P = Enc_L(Prompt). These replace the classifier logits.
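To make the shapes concrete, here is a minimal numpy sketch contrasting the two kinds of logits. All sizes and variable names are made up for illustration; this is not the paper's code.

```python
import numpy as np

# Hypothetical shapes: N regions, C classes, M prompt tokens, feature dim d.
N, C, M, d = 4, 3, 6, 8
rng = np.random.default_rng(0)

O = rng.normal(size=(N, d))   # region features, O = Enc_I(Img)
W = rng.normal(size=(C, d))   # classifier weights of a standard detector
P = rng.normal(size=(M, d))   # token features, P = Enc_L(Prompt)

S_cls = O @ W.T        # (N, C): classic per-class classification logits
S_ground = O @ P.T     # (N, M): region-token alignment scores

print(S_cls.shape, S_ground.shape)
```

The only structural change is that the second axis now indexes prompt tokens rather than a fixed class vocabulary, which is what lets the detector handle classes described in free text.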

The same scores plug straight into the loss; the target matrix simply gains a token dimension instead of a class dimension. (Phrases may span multiple words, sub-word tokenization splits words into several tokens, and added tokens such as [NoObj] must be handled too.)

The loss is then computed as a binary sigmoid loss over the region-token alignment scores.
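A toy implementation of that idea, assuming plain binary cross-entropy over the (region, token) matrix (the paper actually uses a focal-loss variant; the function name and shapes here are my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def token_sigmoid_loss(S_ground, T):
    """Binary cross-entropy over region-token alignment logits.

    S_ground: (N, M) alignment logits for N regions and M tokens.
    T: (N, M) binary targets; T[i, j] = 1 if token j belongs to the
       phrase matched to region i (a row can have several positives,
       since a phrase may span multiple sub-word tokens).
    """
    p = sigmoid(S_ground)
    eps = 1e-9
    return float(np.mean(-(T * np.log(p + eps) + (1 - T) * np.log(1 - p + eps))))

# Toy example: 2 regions, 3 tokens; predictions agree with targets.
S = np.array([[ 5.0, -5.0, -5.0],
              [-5.0,  5.0,  5.0]])
T = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]])
loss = token_sigmoid_loss(S, T)
print(loss)  # small value: confident, correct predictions
```

Each region-token pair is scored independently, which is why multiple tokens per row (multi-token phrases) pose no problem.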


Faster R-CNN and Dynamic Head (SOTA) are used as detection models, Swin-T and Swin-L as image encoders, and BERT as the text encoder.

Deep fusion, in contrast to late fusion (simply combining the outputs of the separate encoders), exchanges information between the two modalities as layers are stacked. Here, new cross-modality layers are added on top of the existing BERT layers, and these added layers exchange their outputs with the visual branch.
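The exchange can be sketched as two cross-attention passes per fused layer, one in each direction. This is a simplified single-head version with no learned projections; the paper's X-MHA is a multi-head module inside DyHead/BERT, and the shapes here are invented:

```python
import numpy as np

def attend(Q, K, V):
    """Scaled dot-product attention (single head, no projections)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

def deep_fusion_step(O, P):
    """One cross-modality exchange: regions attend to tokens and
    tokens attend to regions, each updated residually."""
    O_new = O + attend(O, P, P)  # image queries, text keys/values
    P_new = P + attend(P, O, O)  # text queries, image keys/values
    return O_new, P_new

rng = np.random.default_rng(1)
O = rng.normal(size=(4, 8))  # region features from the visual branch
P = rng.normal(size=(6, 8))  # token features from the BERT branch
for _ in range(3):           # stack a few fused top layers
    O, P = deep_fusion_step(O, P)
print(O.shape, P.shape)
```

Because both streams are updated at every fused layer, the visual features become prompt-aware (and vice versa) before the final alignment scores are computed, which is the point of doing fusion early rather than only at the output.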

Result
