[68] Iterative Scene Graph Generation

TL;DR

task : one-stage scene graph generation
problem : A fixed factorization that first picks the objects and then predicts the relation from them is limiting. Knowing the relation also helps pick out the subject and object better.
idea : 1) Model the object conditioned on the subject, and the predicate conditioned on the subject and object. 2) Each of the t layers outputs subject, object, and predicate estimates, and these propagate to the next layer.
architecture : CNN backbone + 3 transformer decoders (one each for {s, p, o}). The positional encodings (PE) are conditional: the subject uses a learned PE, the object PE comes from MHSA over the subject’s PE and the object PE, and the predicate PE comes from MHSA over the subject and object PEs together. Each {s, p, o} query also attends (MHSA) to the {s, p, o} queries from the previous layer.
objective : CE loss + bbox loss for subject / object / predicate + loss re-weighting for tail classes
baseline : MOTIF, HOTR, SGTR
data : Visual Genome, Action Genome
result : SOTA. Mean recall (mR) is especially strong.
contribution : Good performance with just the transformer structure!
Limitation or something I don’t understand: Object detection as a separate step seems to have simply been dropped, so there is no DETR decoder.
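The conditional factorization in the idea item above can be written out as below. This is only a sketch in notation chosen for these notes (I for the image, s / o / p for subject / object / predicate, Pr for probability), not necessarily the paper's exact formulation.

```latex
\Pr(s, o, p \mid I) \;=\; \Pr(s \mid I)\,\Pr(o \mid s, I)\,\Pr(p \mid s, o, I)
```

Each of the three decoders then corresponds to one factor, and the layer-to-layer propagation lets later estimates of p feed back into s and o.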

Harmonic recall (hR) is the evaluation metric that they came up with, which combines recall (R) and mean recall (mR) into a single number. AP didn’t rate you! You’re a man!
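A minimal sketch of the metric, assuming (as the name suggests) that it is the harmonic mean of R@K and mR@K. The function name and the example numbers are made up for illustration and are not results from the paper.

```python
def harmonic_recall(recall: float, mean_recall: float) -> float:
    """Harmonic mean of recall and mean recall; 0.0 when both are zero."""
    total = recall + mean_recall
    if total == 0:
        return 0.0
    return 2.0 * recall * mean_recall / total


# Hypothetical values, only to show that a low mR pulls the score down.
balanced = harmonic_recall(0.20, 0.20)
skewed = harmonic_recall(0.30, 0.10)
print(balanced, skewed)
```

Both examples have the same arithmetic mean, but the skewed one scores lower, so a model cannot score well on head classes alone.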

The ground-truth relations are padded with “no relation” entries, and then the matching between predictions and ground truth that minimizes the total joint matching cost is found. (Why? Hmm..)
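One way to see why the padding is needed: with a fixed number of queries there are more predictions than ground-truth triplets, and a one-to-one matching needs both sides to have the same size. The sketch below is a toy version under stated assumptions: the cost function is a made-up mismatch count, matching to the padding is assumed to cost nothing, and the search is brute force over permutations. A real implementation would use the Hungarian algorithm (for example `scipy.optimize.linear_sum_assignment`) with the paper's own class and box costs.

```python
from itertools import permutations

NO_RELATION = None  # padding entry standing in for "no relation"


def match_cost(pred, gt):
    """Toy joint cost: number of mismatched (subject, predicate, object) slots.
    Assumption: matching a prediction to the padding costs nothing."""
    if gt is NO_RELATION:
        return 0.0
    return float(sum(p != g for p, g in zip(pred, gt)))


def best_matching(preds, gts):
    """Pad gts up to len(preds), then find the minimum-cost one-to-one matching.
    Assumes len(preds) >= len(gts). Brute force, so only for tiny inputs."""
    padded = list(gts) + [NO_RELATION] * (len(preds) - len(gts))
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(len(preds))):
        # perm[i] is the index of the padded ground truth assigned to preds[i]
        cost = sum(match_cost(preds[i], padded[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost


preds = [("man", "riding", "horse"), ("dog", "on", "grass"), ("man", "wearing", "hat")]
gts = [("man", "wearing", "hat"), ("man", "riding", "horse")]
perm, cost = best_matching(preds, gts)
print(perm, cost)
```

Here the unmatched prediction ("dog", "on", "grass") is assigned to the padding, which is what lets the loss treat it as a "no relation" target.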

Our loss!

ResNet-101
6 layers, feature size 256
300 queries
bs = 12, lr=10e-4 gradually decaying
Using NMS
NMS is applied per class, and each predicted box is then attached to a post-NMS bbox by checking for IoU overlap.
50 epochs on 4 T4 GPUs.
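The NMS step above can be sketched roughly as follows. This is one reading of the note, not the authors' code: the function names, the (x1, y1, x2, y2) box format, and the 0.5 IoU threshold are all assumptions.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def per_class_nms(dets, iou_thresh=0.5):
    """dets: list of (box, score, label). Greedy NMS run separately per label."""
    kept = []
    for label in sorted({d[2] for d in dets}):
        cands = sorted((d for d in dets if d[2] == label),
                       key=lambda d: d[1], reverse=True)
        class_kept = []
        for d in cands:
            # keep a box only if it does not overlap a higher-scoring kept box
            if all(iou(d[0], k[0]) <= iou_thresh for k in class_kept):
                class_kept.append(d)
        kept.extend(class_kept)
    return kept


def attach_to_kept(box, label, kept, iou_thresh=0.5):
    """Return the same-class post-NMS box with the highest IoU above the
    threshold, or the original box if none overlaps enough."""
    best_box, best_iou = box, iou_thresh
    for k_box, _, k_label in kept:
        overlap = iou(box, k_box)
        if k_label == label and overlap > best_iou:
            best_box, best_iou = k_box, overlap
    return best_box


dets = [((0, 0, 10, 10), 0.9, "man"), ((1, 1, 10, 10), 0.8, "man"),
        ((0, 0, 10, 10), 0.7, "horse"), ((50, 50, 60, 60), 0.6, "man")]
kept = per_class_nms(dets)
attached = attach_to_kept((1, 1, 10, 10), "man", kept)
print(len(kept), attached)
```

Running NMS per class means the "horse" box survives even though it coincides with a "man" box, while the near-duplicate "man" box is suppressed and then mapped onto the surviving one.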

Larger num_queries doesn’t make a difference.