
TL;DR
- task : one-stage scene graph generation
- problem : A factorization that first picks an object and then picks a relation conditioned on it is limiting. Given a relation, we can do a better job of picking out the subject and object.
- idea : 1) Model the object conditionally given the subject, and the predicate conditionally given the subject and object. 2) Each layer t outputs subject, object, and predicate estimates, which propagate to layer t+1 for iterative refinement.
- architecture : CNN backbone + 3 transformer decoders, one each for {s, p, o}. The positional encodings are conditional: the subject's PE is learned, the object's PE is an MHSA over the subject's and object's PEs, and the predicate's PE is an MHSA over the subject and object PEs together. Each {s, p, o} query also runs MHSA over the {s, p, o} queries fetched from the previous layer.
- objective : CE loss + bbox loss for subject / object / predicate, plus loss re-weighting for the tail classes
- baseline : MOTIFS, HOTR, SGTR
- data : Visual Genome, Action Genome
- result : SOTA. mR (mean recall) performance is especially strong.
- contribution : Good performance with just the transformer structure!
- Limitation or something I don't understand: object detection seems to be just bolted on, i.e. there is no separate DETR decoder for it.
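The layer-to-layer query propagation in the TL;DR can be sketched in PyTorch. This is a minimal, hypothetical simplification (names like `TripletDecoderLayer` are mine, and FFN sub-blocks and masks are omitted): each of the three streams attends over the concatenated {s, p, o} queries from the previous layer, then cross-attends to the CNN features.

```python
import torch
import torch.nn as nn

class TripletDecoderLayer(nn.Module):
    """Hypothetical sketch of one s/p/o decoder layer with query propagation.

    Each stream first attends over the previous layer's {s, p, o} queries,
    then cross-attends to the backbone image features. FFNs and the
    conditional positional encodings are omitted for brevity.
    """
    def __init__(self, d=256, heads=8):
        super().__init__()
        # One propagation-attention and one cross-attention module per stream.
        self.fuse = nn.ModuleDict({k: nn.MultiheadAttention(d, heads, batch_first=True) for k in "spo"})
        self.cross = nn.ModuleDict({k: nn.MultiheadAttention(d, heads, batch_first=True) for k in "spo"})
        self.norm = nn.ModuleDict({k: nn.LayerNorm(d) for k in "spo"})

    def forward(self, qs, feats):
        # Concatenate the previous layer's s, p, o queries along the query axis.
        prev = torch.cat([qs[k] for k in "spo"], dim=1)
        out = {}
        for k in "spo":
            q, _ = self.fuse[k](qs[k], prev, prev)   # attend over previous-layer s/p/o queries
            q, _ = self.cross[k](q, feats, feats)    # cross-attend to CNN feature tokens
            out[k] = self.norm[k](q + qs[k])         # residual + norm
        return out
```

Stacking six of these layers (as in the implementation details below) gives the iterative refinement: each layer's triplet estimates feed the next.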
Details

Architecture

Conditional Positional Encodings

- $\tilde q^t_{x,i} = q^t_{x,i} + p^t_{x,i}$, where $x \in \{s, o, p\}$.
Conditional Queries

- $q^t_{x,i}$ is the feature representation of the $i$-th query at the $t$-th layer
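The conditional PE construction described in the TL;DR can be sketched as follows. This is one plausible reading, not the paper's exact code: the subject PE is a learned embedding, the object PE attends over the subject and object embeddings, and the predicate PE attends over both; each query then gets its PE added, matching $\tilde q^t_{x,i} = q^t_{x,i} + p^t_{x,i}$.

```python
import torch
import torch.nn as nn

class ConditionalPE(nn.Module):
    """Sketch (my assumption of the mechanism) of conditional positional encodings:
    p_s is learned directly, p_o is an MHSA over [p_s, p_o_init],
    and p_p is an MHSA over [p_s, p_o]."""
    def __init__(self, n_queries=300, d=256, heads=8):
        super().__init__()
        self.p_s = nn.Parameter(torch.randn(n_queries, d))       # learned subject PE
        self.p_o_init = nn.Parameter(torch.randn(n_queries, d))  # initial object PE
        self.p_p_init = nn.Parameter(torch.randn(n_queries, d))  # initial predicate PE
        self.attn_o = nn.MultiheadAttention(d, heads, batch_first=True)
        self.attn_p = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, q_s, q_o, q_p):
        B = q_s.shape[0]
        p_s = self.p_s.expand(B, -1, -1)
        p_o_init = self.p_o_init.expand(B, -1, -1)
        # Object PE is conditioned on the subject PE.
        ctx_o = torch.cat([p_s, p_o_init], dim=1)
        p_o, _ = self.attn_o(p_o_init, ctx_o, ctx_o)
        # Predicate PE is conditioned on both subject and object PEs.
        ctx_p = torch.cat([p_s, p_o], dim=1)
        p_p, _ = self.attn_p(self.p_p_init.expand(B, -1, -1), ctx_p, ctx_p)
        # Add each PE to its query: q~ = q + p.
        return q_s + p_s, q_o + p_o, q_p + p_p
```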
Result

Harmonic recall (hR) is an evaluation metric they propose, combining recall (R) and mean recall (mR). They don't evaluate with AP.
bipartite matching
The ground-truth relations are padded with "no relation" and the assignment that minimizes the total joint matching cost is found. (Why pad? It makes the prediction and ground-truth sets the same size, so a one-to-one Hungarian assignment exists.)
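The padded matching can be sketched with SciPy's Hungarian solver. The cost decomposition and the flat no-relation cost here are illustrative assumptions, not the paper's exact terms:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_triplets(cost_cls, cost_box, num_gt, no_rel_cost=0.0):
    """Sketch of set-based triplet matching (assumes n_pred >= num_gt).

    Pads the ground truth with "no relation" columns so every predicted
    triplet gets an assignment, then solves the joint cost with the
    Hungarian algorithm. Returns (pred_idx, gt_idx) pairs for real GT only.
    """
    n_pred = cost_cls.shape[0]
    cost = cost_cls + cost_box                      # (n_pred, num_gt) joint matching cost
    pad = np.full((n_pred, n_pred - num_gt), no_rel_cost)
    cost = np.concatenate([cost, pad], axis=1)      # pad with no-relation columns
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if c < num_gt]
```

Predictions matched to padding columns are trained as "no relation", analogous to the no-object class in DETR-style matching.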


The loss is then computed over the matched pairs.
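The tail re-weighting mentioned in the TL;DR objective can be sketched as below. The inverse-sqrt-frequency scheme is my assumption of one common choice, not necessarily the paper's exact weighting:

```python
import torch
import torch.nn.functional as F

def reweighted_ce(logits, targets, class_freq):
    """Sketch of tail re-weighting: weight each class inversely to the
    square root of its training frequency (assumed scheme), normalized
    so the weights average to 1, then apply weighted cross-entropy.
    """
    w = 1.0 / torch.sqrt(class_freq.float().clamp(min=1))
    w = w / w.sum() * len(w)  # normalize: mean weight == 1
    return F.cross_entropy(logits, targets, weight=w)
```

Rare (tail) predicate classes thus contribute more per example, which is what pushes the mR numbers up.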

Implementation Details
- ResNet-101
- 6 layers, feature size 256
- 300 queries
- bs = 12, lr=10e-4 gradually decaying
- Using NMS
- NMS is applied per class; each prediction is kept or suppressed by checking IoU overlap against the post-NMS boxes.
- 50 epochs on 4 T4 GPUs.
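The per-class NMS step can be sketched in pure PyTorch (a generic implementation of the idea, not the paper's code): boxes only suppress each other within the same class.

```python
import torch

def iou(a, b):
    """Pairwise IoU between two sets of xyxy boxes: (N,4) x (M,4) -> (N,M)."""
    tl = torch.max(a[:, None, :2], b[None, :, :2])
    br = torch.min(a[:, None, 2:], b[None, :, 2:])
    inter = (br - tl).clamp(min=0).prod(-1)
    area_a = (a[:, 2:] - a[:, :2]).prod(-1)
    area_b = (b[:, 2:] - b[:, :2]).prod(-1)
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def per_class_nms(boxes, scores, labels, iou_thresh=0.5):
    """Run greedy NMS independently for each class; return kept indices."""
    keep = []
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        idx = idx[scores[idx].argsort(descending=True)]  # high score first
        while idx.numel() > 0:
            keep.append(idx[0].item())                   # keep the best box
            if idx.numel() == 1:
                break
            ious = iou(boxes[idx[:1]], boxes[idx[1:]])[0]
            idx = idx[1:][ious <= iou_thresh]            # drop overlapping same-class boxes
    return sorted(keep)
```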
Ablation
number of queries

Larger num_queries doesn’t make a difference.
Effect of refinement
