image

paper

TL;DR

  • task : one-stage scene graph generation
  • problem : A fixed factorization that first picks an object and then picks a relation conditioned on it is limited. Given a relation, we can do a better job of picking out the subject and object.
  • idea : 1) Model the object conditioned on the subject, and the predicate conditioned on the subject and object. 2) Each layer t outputs subject, object, and predicate estimates, and these are propagated to the next layer.
  • architecture : CNN backbone + 3 transformer decoders (one each for {s, p, o}). The positional encodings are conditional: the subject PE is the learned PE, the object PE is the MHSA of the subject PE and the object PE, and the predicate PE is the MHSA of the subject and object PEs together. The queries for {s, p, o} are likewise passed through MHSA together with the {s, p, o} queries from the previous layer.
  • objective : CE loss + bbox loss for subject / object / predicate + loss re-weighting for tail classes
  • baseline : MOTIF, HOTR, SGTR
  • data : Visual Genome, Action Genome
  • result : SOTA, with particularly strong mR.
  • contribution : Good performance with just a transformer architecture!
  • Limitation or something I don’t understand: Object detection seems to be done directly by the subject / object decoders, so there is no separate DETR decoder.
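
The conditional modeling in the idea bullet can be read as a chain-rule factorization. This is only a sketch in my own notation: the symbols $I$ (image), $\mathcal{S}$, $\mathcal{O}$, $\mathcal{P}$ (sets of subjects, objects, predicates) are assumptions and may not match the paper's exact formulation.

$$p(\mathcal{S}, \mathcal{O}, \mathcal{P} \mid I) = p(\mathcal{S} \mid I)\, p(\mathcal{O} \mid \mathcal{S}, I)\, p(\mathcal{P} \mid \mathcal{S}, \mathcal{O}, I)$$

Under this reading, each decoder layer produces a new estimate of all three factors, and the next layer conditions on the previous layer's estimates.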

Details

image

Architecture

image

Conditional Positional Encodings

image
  • $\tilde q^t_{x,i} =q^t_{x,i} +p^t_{x,i}$ ; x is one of {s, o, p}.
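
To make the conditional PE idea concrete, here is a minimal pure-Python sketch. It is not the paper's implementation: `attend` is a single-head scaled dot-product attention standing in for MHSA, there are no learned projections, and the way keys are concatenated in `conditional_pe` is my assumption about what "MHSA of subject and object" means. All function names are hypothetical.

```python
import math

def attend(queries, keys_values):
    """Single-head scaled dot-product attention (simplified stand-in for MHSA).
    Keys and values are the same vectors; no learned projections."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys_values]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w / z * v[j] for w, v in zip(weights, keys_values))
                    for j in range(d)])
    return out

def conditional_pe(p_s, p_o, p_p):
    """Assumed conditioning order: subject -> object -> predicate."""
    # Subject PE: the learned PE, used as is.
    # Object PE: object PEs attend over subject and object PEs.
    cond_o = attend(p_o, p_s + p_o)
    # Predicate PE: predicate PEs attend over subject and conditioned object PEs.
    cond_p = attend(p_p, p_s + cond_o)
    return p_s, cond_o, cond_p

def add_pe(q, p):
    """q_tilde = q + p, applied element-wise per query."""
    return [[a + b for a, b in zip(qi, pi)] for qi, pi in zip(q, p)]
```

The final `add_pe` step corresponds to $\tilde q^t_{x,i} = q^t_{x,i} + p^t_{x,i}$ for each of the three decoders.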

Conditional Queries

image
  • $q^t_{x,i}$ is the feature representation of the i-th query at the t-th layer

Result

image

Harmonic Recall (hR) is an evaluation metric the authors came up with, which combines recall (R) and mean recall (mR). AP is not evaluated.
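
As a quick sketch of the metric, assuming the "combination" is the harmonic mean of the two values, as the name suggests. The function name is hypothetical and the numbers in the comment are illustrative, not results from the paper.

```python
def harmonic_recall(recall, mean_recall):
    """Harmonic mean of recall (R) and mean recall (mR)."""
    if recall + mean_recall == 0:
        return 0.0  # avoid division by zero when both are 0
    return 2 * recall * mean_recall / (recall + mean_recall)

# Illustrative only: a model with R = 30 and mR = 10 gets hR = 15,
# so a high R cannot hide a poor mR.
```

The harmonic mean is dominated by the smaller of the two values, which is why it penalizes models that trade tail-class performance for head-class recall.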

bipartite matching

Pad the ground-truth relations with "no relation" and find the assignment between predictions and ground truth that minimizes the total joint matching cost. (Why? Hmm..)

image
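
The padding-and-matching step can be sketched as follows. This is a brute-force version for tiny inputs, not what the paper uses: in practice one would use a Hungarian solver such as `scipy.optimize.linear_sum_assignment`. The function name, the zero cost for padded "no relation" slots, and the example cost values are all assumptions; each cost entry stands in for the joint (subject + object + predicate) matching cost.

```python
from itertools import permutations

def match_bruteforce(cost, pad_cost=0.0):
    """cost[i][j]: joint matching cost between prediction i and GT relation j.
    GT is padded with 'no relation' slots so that #GT == #predictions."""
    n_pred, n_gt = len(cost), len(cost[0])
    assert n_gt <= n_pred, "expects at least as many predictions as GT relations"
    padded = [row + [pad_cost] * (n_pred - n_gt) for row in cost]

    best_total, best_perm = float("inf"), None
    for perm in permutations(range(n_pred)):  # O(n!), only for illustration
        total = sum(padded[i][perm[i]] for i in range(n_pred))
        if total < best_total:
            best_total, best_perm = total, perm

    # Padded columns map back to None, meaning "no relation".
    assignment = [j if j < n_gt else None for j in best_perm]
    return assignment, best_total

# 3 predictions, 2 ground-truth relations (made-up costs).
cost = [[0.9, 0.2],
        [0.1, 0.8],
        [0.5, 0.6]]
assignment, total = match_bruteforce(cost)
```

Here prediction 0 is matched to GT 1, prediction 1 to GT 0, and prediction 2 to "no relation".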

image

The overall loss:

image

Implementation Details

  • ResNet-101
  • 6 layers, feature size 256
  • 300 queries
  • batch size 12, lr=10e-4 with gradual decay
  • NMS is used
  • NMS is applied per class: a box is suppressed when its IoU overlap with an already-kept (post-NMS) box of the same class is too high.
  • 50 epochs on 4 T4 GPUs.
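
For reference, per-class NMS as described in the list above can be sketched like this. It is the standard greedy version in pure Python; the 0.5 IoU threshold and the function names are assumptions, and the paper's exact post-processing may differ.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms_per_class(boxes, scores, labels, iou_thresh=0.5):
    """Greedy NMS where a box is only compared against kept boxes of the same class.
    Returns the indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        suppressed = any(
            labels[i] == labels[k] and iou(boxes[i], boxes[k]) > iou_thresh
            for k in keep
        )
        if not suppressed:
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (0, 0, 10, 10), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7, 0.6]
labels = [0, 0, 1, 0]
kept = nms_per_class(boxes, scores, labels)
```

Box 1 overlaps box 0 heavily and shares its class, so it is dropped; box 2 has the same coordinates as box 0 but a different class, so it survives.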

Ablation

number of queries

image

Larger num_queries doesn’t make a difference.

Effect of refinement

image