
TL;DR
- task : one-stage SGG
- problem : N entity proposals -> O(N^2) predicate proposals -> inefficient!
- idea : Let's treat SGG as a bipartite graph: represent entities and predicates as nodes and connect them with directed edges!

- architecture : First, extract visual features with a ResNet backbone as in DETR, and create entity nodes from learnable queries. For predicate nodes, initialize queries by concatenating embeddings of the visual features and selected entity nodes. The predicate and entity-indicator branches each cross-attend to these features, and the two are fused layer by layer across L stacked layers. Finally, the outputs are assembled into a bipartite graph and converted to the final output format.
- objective : loss for entities (= the DETR loss) + loss for predicates. For predicates, build a matching cost matrix and run Hungarian matching; the loss covers localization + classification of entities, localization of the subject/object tied to each predicate, and classification of the relation.
- baseline : FCSGG, …
- data : Visual Genome, Open Image V6
- result : SOTA with more efficient inference
- contribution : tackling a graph problem with a transformer structure? not splitting subject/object queries and avoiding O(N^2) pairing? but the structure is quite complex…
Details
Architecture

1) Backbone and Entity Node Generator
Feed the ResNet features into a transformer encoder, as in DETR; the visual feature map Z comes out.
Like the DETR decoder, entities come from learnable queries. Each query attends to the feature map Z and outputs the entity's bbox B, class scores P, and the associated feature representation H.
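A minimal numpy sketch of this step, under assumed shapes and names (the actual module is a full multi-layer, multi-head transformer decoder with learned projections; everything below is simplified for illustration):

```python
import numpy as np

# Hypothetical dims: feature dim d, image tokens, entity queries.
rng = np.random.default_rng(0)
d, n_tokens, n_ent = 64, 100, 10

Z = rng.standard_normal((n_tokens, d))   # visual feature map from the encoder
Q_e = rng.standard_normal((n_ent, d))    # learnable entity queries

def cross_attention(Q, K, V):
    """Single-head scaled dot-product attention (no learned projections, for brevity)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

H = cross_attention(Q_e, Z, Z)                              # entity features H
B = 1 / (1 + np.exp(-(H @ rng.standard_normal((d, 4)))))    # bbox head -> [0,1]^4
P = H @ rng.standard_normal((d, 151))                       # class logits (150 VG classes + bg, assumed)
```

Each entity query thus yields one (bbox, class, feature) triple, exactly the per-query outputs DETR produces.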

2) Predicate Node Generator
- predicate encoder : a Transformer encoder extracts predicate-specific image features; the result is Z^p.
- predicate query initialization
A plain learnable query can't capture the compositional (subject-predicate-object) property, so the subject and object queries are concatenated.

When learning the representation for this query, it attends jointly to the features H and boxes B from step 1.
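The compositional initialization can be sketched like this (names and the projection are assumptions; the point is only that the predicate query is built from subject and object parts rather than a single learnable vector):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_pred = 64, 20                          # model dim, number of predicate queries (assumed)

q_sub = rng.standard_normal((n_pred, d))    # subject-side query part
q_obj = rng.standard_normal((n_pred, d))    # object-side query part
W = rng.standard_normal((2 * d, d))         # projection back to model dim

# Concatenate subject/object parts, then project: compositional predicate query.
q_pred = np.concatenate([q_sub, q_obj], axis=-1) @ W
```

This keeps the subject and object roles distinguishable inside each predicate query without enumerating all O(N^2) entity pairs.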


3) Structural Predicate Node Generator
Perform the final attention operations on the representations received from above.
a) predicate sub-decoder
Extracts predicate representations from the image features.

b) entity indicator sub-decoders
Pull entity indicators for the predicate queries.

c) predicate indicator fusion
Connects predicates and indicators so they can reference each other as the layers stack up.
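A rough sketch of the layer-wise fusion, assuming a simple sum-and-project fusion (the paper's actual fusion function is learned; the stand-in indicator outputs below are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_pred, L = 64, 20, 3                     # assumed dims and number of layers

h_pred = rng.standard_normal((n_pred, d))    # initial predicate representation
for layer in range(L):
    # Stand-ins for the subject/object indicator sub-decoder outputs at this layer.
    h_sub = rng.standard_normal((n_pred, d))
    h_obj = rng.standard_normal((n_pred, d))
    W = rng.standard_normal((d, d)) / np.sqrt(d)
    # Fuse predicate and indicator representations; the result feeds the next layer,
    # so the two branches can reference each other as depth increases.
    h_pred = (h_pred + h_sub + h_obj) @ W
```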

The end result of this process is the output shown below.

class prediction for the predicate, plus the bboxes and categories of the subject and object associated with it
Bipartite Graph Assembling
Assemble a bipartite graph of N entities and N_r predicates: build an adjacency matrix between entity nodes and predicate nodes, then derive the correspondence matrix from it.

As the figure shows, there are entity, subject (light green), and object (blue) predictions, and the distances between them drive the matching!
For the subject, for example, define the distance between entity and subject predictions as below and keep only the top-K entries of the distance matrix.
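A toy version of this top-K assembling step, assuming a simple L1 box distance (the paper's distance combines localization and class similarity; the names and K are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
N, N_r, K = 8, 5, 3                  # entities, predicates, kept candidates per predicate

ent_boxes = rng.random((N, 4))       # entity node boxes (cx, cy, w, h)
sub_boxes = rng.random((N_r, 4))     # subject-indicator boxes, one per predicate

# Distance matrix (N_r x N): L1 distance between each subject indicator and each entity.
dist = np.abs(sub_boxes[:, None, :] - ent_boxes[None, :, :]).sum(-1)

# Keep only the top-K closest entities per predicate -> sparse correspondence matrix.
topk = np.argsort(dist, axis=1)[:, :K]
adj = np.zeros((N_r, N))
np.put_along_axis(adj, topk, 1.0, axis=1)
```

The same procedure is repeated with the object indicators, giving the two sides of the bipartite graph.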

Learning and Inference

Entity loss: the DETR entity generator loss. Predicate loss: localization + classification for the entity indicators, localization for the entities tied to each predicate, and classification for the predicate.
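The Hungarian matching underneath these losses can be illustrated with a brute-force toy (DETR-style pipelines use `scipy.optimize.linear_sum_assignment` on a cost mixing classification and localization terms; the cost values here are made up):

```python
import itertools

# cost[i][j]: cost of matching prediction i to ground-truth j (hypothetical values).
cost = [[4.0, 1.0, 3.0],
        [2.0, 0.0, 5.0],
        [3.0, 2.0, 2.0]]

def hungarian_bruteforce(cost):
    """Exhaustively search permutations for the minimum-cost one-to-one matching."""
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in itertools.permutations(range(n)):
        c = sum(cost[i][perm[i]] for i in range(n))
        if c < best_cost:
            best_perm, best_cost = perm, c
    return best_perm, best_cost

perm, total = hungarian_bruteforce(cost)   # perm maps prediction i -> GT perm[i]
```

Once the matching is fixed, each loss term above is computed only between matched prediction/ground-truth pairs.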
Results

Random thoughts / questions
- Both the visual information and the object's location information go into the query, but was that necessary?