
paper, code

TL;DR

  • task : Scene Graph Generation, i.e., extracting the objects in an image and the relations between them.
  • problem : do SGG in a single stage, without a separate pre-trained object detector.
  • idea : borrow DETR's query idea. Instead of first picking entities and then predicting the predicate between them, can't we just directly predict the set of {S, P, O} triplets?
  • architecture : builds on DETR's object queries. Similarly, create subject / object queries; run them through self-attention, attend to the visual features, and attend to DETR's object queries (called entities in this paper) to get the final subject / object representations. Relations are localized with attention heatmaps.
  • objective : predict triplets <{subject cls, subject bbox}, predicate cls, {object cls, object bbox}> and compute a triplet loss by bipartite matching with the GT, then combine it with the entity loss from DETR.
  • baseline : two-stage SGG models, FCSGG
  • data : Visual Genome, Open Image V6
  • result : better performance than FCSGG. It falls short of some models that use prior information, but performance is competitive.
  • contribution : one-stage SGG model with comparable performance!

Details

architecture

[figure: overall architecture]

Overall, there are three components: A) a feature encoder extracting the visual feature context -> Z, B) an entity decoder detecting entities from that context -> Q_e, and C) a triplet decoder with subject and object branches -> Q_s, Q_o, E_t.

  1. Subject and object queries: like the object queries in DETR, these are learned d-dimensional embeddings. They are not tied to any particular triplet; a separate triplet encoding represents triplets.

  2. Coupled Self-Attention (CSA): add a subject encoding / object encoding (same size as the triplet encoding) to each branch's queries, then run self-attention jointly over both branches' queries to couple them.

  3. Decoupled Visual Attention (DVA): cross-attention onto the visual feature context Z. It is called "decoupled" because each branch attends to Z independently of whether an entity ends up as subject or object; the same DVA operation runs in both the subject and object branches.

  4. Decoupled Entity Attention (DEA): cross-attention onto DETR's entity representations, bridging entity detection and triplet detection. Entities are detected separately because the entity branch is free of S-P-O relationship constraints, so better localization is expected.
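As a rough sketch of how these three attention steps chain together in one triplet-decoder layer (single-head, with projections, residuals, norms, and FFNs all omitted; everything beyond the names Q_s, Q_o, E_t, Z, Q_e from the paper is my own simplification):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, K, V):
    """Single-head scaled dot-product attention (projections omitted)."""
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return A @ V, A

def triplet_decoder_layer(Q_s, Q_o, E_t, E_s, E_o, Z, Q_e):
    """CSA -> DVA -> DEA, as I read the paper.

    Q_s, Q_o : (N, d) subject / object queries
    E_t      : (N, d) triplet encoding; E_s / E_o : (d,) role encodings
    Z        : (HW, d) visual feature context; Q_e : (M, d) entity queries
    """
    N = Q_s.shape[0]
    # CSA: couple the two branches in one joint self-attention
    X = np.concatenate([Q_s + E_t + E_s, Q_o + E_t + E_o], axis=0)
    X, _ = attend(X, X, X)
    s, o = X[:N], X[N:]
    # DVA: each branch cross-attends to Z on its own -> heatmaps M_s, M_o
    s, M_s = attend(s, Z, Z)
    o, M_o = attend(o, Z, Z)
    # DEA: each branch cross-attends to the entity representations
    s, _ = attend(s, Q_e, Q_e)
    o, _ = attend(o, Q_e, Q_e)
    return s, o, M_s, M_o
```

The heatmaps M_s / M_o returned from the DVA step are what later feed the relation mask head.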

The output of the DEA is passed through FFNs to produce the final predictions; bbox regression predicts the center, width, and height via an FFN.
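The FFN itself is presumably a DETR-style head: a small MLP ending in a sigmoid so the box stays normalized. A minimal sketch with hypothetical weights:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bbox_head(x, W1, b1, W2, b2, W3, b3):
    """3-layer MLP regressing a normalized box (cx, cy, w, h) in (0, 1).
    All weights here are hypothetical placeholders, not the paper's."""
    h = np.maximum(0.0, x @ W1 + b1)   # ReLU
    h = np.maximum(0.0, h @ W2 + b2)
    return sigmoid(h @ W3 + b3)        # sigmoid keeps coords in (0, 1)
```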

The subject attention heatmap M_s and the object attention heatmap M_o from the DVA layer are concatenated and turned into a spatial feature vector by a convolutional mask head -> the relation is read off from where the visual encoder looked for each chosen S and O.
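A toy stand-in for that mask head (the real one is a small conv stack; here a single 1x1 channel-mixing step, with hypothetical weights W_mix and W_cls):

```python
import numpy as np

def relation_from_heatmaps(M_s, M_o, H, W, W_mix, W_cls):
    """Toy stand-in for the convolutional mask head.

    Stack the subject / object attention heatmaps as a 2-channel map,
    mix the channels (a 1x1 conv here; the real head is a conv stack),
    flatten to a spatial feature vector, and classify the predicate.
    W_mix (2, C) and W_cls (H*W*C, num_predicates) are hypothetical."""
    maps = np.stack([M_s.reshape(H, W), M_o.reshape(H, W)], axis=-1)  # (H, W, 2)
    mixed = np.maximum(0.0, maps @ W_mix)   # (H, W, C) channel mixing + ReLU
    spatial = mixed.reshape(-1)             # flatten -> spatial feature vector
    return spatial @ W_cls                  # predicate logits
```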

Set Prediction Loss for Triplet Detection

A triplet prediction is <y_sub, c_prd, y_obj>, where y_sub and y_obj each consist of a bbox and a class. The matching cost against the GT triplets (padded with <background, no-relation, background>) is obtained by bipartite matching and consists of 1) a subject cost, 2) an object cost, and 3) a predicate cost. The subject / object costs each combine a class cost and a bbox cost, as in DETR.
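The cost formulas were embedded as images; since the costs follow DETR's matching, they plausibly have this shape (the lambda weights and exact notation are my assumptions, not copied from the paper):

```latex
% class cost: negative predicted probability of the matched GT class
\mathcal{C}_{\mathrm{cls}}(i,j) = -\,\hat{p}_i(c_j)

% bbox cost: L1 distance plus generalized IoU, as in DETR
\mathcal{C}_{\mathrm{box}}(i,j) = \lambda_{L1}\,\lVert b_j - \hat{b}_i \rVert_1
    + \lambda_{\mathrm{giou}}\bigl(1 - \mathrm{GIoU}(b_j,\hat{b}_i)\bigr)

% triplet cost: subject + predicate + object
\mathcal{C}_{\mathrm{trip}} = \mathcal{C}_{\mathrm{sub}}
    + \mathcal{C}_{\mathrm{prd}} + \mathcal{C}_{\mathrm{obj}}
```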

The triplet matching cost adds the predicate class cost on top of these. These costs are used to solve the bipartite matching with the Hungarian algorithm, and the loss is computed over the matched pairs. After some training the model starts spitting out meaningless triplets, so an additional IoU-based rule is added to prevent this.
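The matching step itself can be sketched with a brute-force stand-in for the Hungarian algorithm (only sane for tiny toy cost matrices; the cost values below are made up):

```python
import numpy as np
from itertools import permutations

def min_cost_matching(cost):
    """Brute-force stand-in for the Hungarian algorithm: pick the
    prediction-to-GT assignment minimizing the total matching cost.
    cost[i, j] = cost of matching prediction i to GT triplet j."""
    n = cost.shape[0]
    best, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[i, j] for i, j in enumerate(perm))
        if total < best:
            best, best_perm = total, perm
    return list(best_perm), best

# toy cost matrix over 3 (background-padded) GT triplets
cost = np.array([[4.0, 1.0, 3.0],
                 [2.0, 0.0, 5.0],
                 [3.0, 2.0, 2.0]])
assignment, total = min_cost_matching(cost)  # assignment[i] = matched GT index
```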

When the matched sub / obj class is background but the IoU between the GT box and the predicted box is above a threshold, no loss is added for that subject or object.

  • In proposal C, the blue box localized its bbox well, so it is better not to impose a loss on it
  • In proposal D, the blue and orange boxes localized their bboxes well, so it is better not to impose a loss on them
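The rule can be sketched like this (the background class index and the exact comparison are my assumptions; the 0.7 IoU threshold comes from the implementation details below):

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def keep_entity_loss(matched_cls, pred_box, gt_box, bg_cls=0, iou_thr=0.7):
    """IoU-based rule as I read it: if the matched class is background
    but the predicted box still overlaps the GT entity well, skip the
    subject / object loss for that branch (bg_cls is an assumption)."""
    if matched_cls == bg_cls and iou(pred_box, gt_box) >= iou_thr:
        return False   # well-localized box, don't punish it
    return True
```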

The final loss is the entity loss (presumably carried over from DETR) combined with the triplet loss.

Dataset

  • Visual Genome
    • 108k images, 150 entities, 50 predicates
    • Predicate classification(PredCLS) : given GT bbox and cls, predict predicates
    • Scene graph classification(SGCLS) : given GT bbox, predict predicates and object class
    • Scene graph detection : predict all!
    • Recall@k, mean Recall@k
  • Open Image V6
    • 126k images, 288 entities, 30 predicates
    • Recall@50, weighted mean AP for relationship detection (wmAP_rel) and phrase detection (wmAP_phr)

Implementation Details

  • 8 x 2080 Ti, bs=2, AdamW, weight decay 1e-4, gradient clipping, Transformer LR 1e-4, ResNet LR 1e-5, LR dropped by 0.1 at 100 epochs

  • Auxiliary losses are used, as in DETR (the original paper is worth reading) -> stack the decoder layers and share a prediction head across them, so each layer makes its own predictions and their losses are summed.
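A minimal sketch of the auxiliary-loss idea, with a toy MSE standing in for the real set-prediction loss:

```python
import numpy as np

def aux_loss(layer_outputs, W_shared, target):
    """DETR-style auxiliary loss: every decoder layer's output goes
    through the same shared head and the per-layer losses are summed.
    A toy MSE stands in for the real set-prediction loss."""
    total = 0.0
    for h in layer_outputs:        # one (N, d) output per decoder layer
        pred = h @ W_shared        # shared prediction head
        total += np.mean((pred - target) ** 2)
    return total
```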

  • 6 encoder layers, 6 triplet decoder layers, 8 attention heads

  • num of entities 100, num of queries 200

  • IoU threshold 0.7

  • Inference is run on a 2080 Ti; evaluation time is not included in the training time.

Results

[figure: quantitative results]

Stray thoughts / questions

  • Relations come at all scales by nature, but only the relations that are annotated or targeted can be predicted.
  • Which entity becomes S and which becomes O also seems a bit arbitrary: eye above the nose vs. nose under the eye. -> Can't be helped.
  • I should look up some studies on graph learning with self-attention.
  • Is it okay that SGG excludes objects without any relation => only objects involved in a relation are in the answer key?
  • I need to actually see some of this Visual Genome data.
  • What is the current metric based on -> Scene graph detection