
TL;DR
- task : one-stage SGG
- problem : N entity proposals -> O(N^2) predicate proposals -> inefficient!
- idea : Let's treat SGG as a bipartite graph: represent entities and predicates as nodes and connect them with directed edges!

- architecture : First, extract visual features with a ResNet backbone as in DETR, and create entity nodes from learnable queries. For predicate nodes, initialize queries by concatenating embeddings of the visual features and selected entity nodes. The predicate and entity-indicator branches each cross-attend to these features, and the two are fused layer by layer across L stacked layers. Finally, the outputs are assembled into a bipartite graph and converted to the final output format.
- objective : loss for entities (= the DETR loss) + loss for predicates. For predicates, build a matching cost matrix and run Hungarian matching; the loss covers localization + classification of entities, localization of the subject/object tied to each predicate, and classification of the relation.
- baseline : FCSGG, …
- data : Visual Genome, Open Image V6
- result : SOTA with more efficient inference
- contribution : tackling a graph problem with a transformer structure? not splitting subject/object queries and avoiding O(N^2) pairing? but the structure is quite complex…
Details
Architecture

1) Backbone and Entity Node Generator
Feed the ResNet features into a transformer encoder, as in DETR; the visual feature map Z comes out.
Like the DETR decoder, entities come from learnable queries. Each query attends to the feature map Z and outputs the entity's bbox B, class scores P, and the associated feature representation H.
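A minimal numpy sketch of this step, under assumed shapes and names (the actual module is a full multi-layer, multi-head transformer decoder with learned projections; everything below is simplified for illustration):

```python
import numpy as np

# Hypothetical dims: feature dim d, image tokens, entity queries.
rng = np.random.default_rng(0)
d, n_tokens, n_ent = 64, 100, 10

Z = rng.standard_normal((n_tokens, d))   # visual feature map from the encoder
Q_e = rng.standard_normal((n_ent, d))    # learnable entity queries

def cross_attention(Q, K, V):
    """Single-head scaled dot-product attention (no learned projections, for brevity)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

H = cross_attention(Q_e, Z, Z)                              # entity features H
B = 1 / (1 + np.exp(-(H @ rng.standard_normal((d, 4)))))    # bbox head -> [0,1]^4
P = H @ rng.standard_normal((d, 151))                       # class logits (150 VG classes + bg, assumed)
```

Each entity query thus yields one (bbox, class, feature) triple, exactly the per-query outputs DETR produces.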

2) Predicate Node Generator
- predicate encoder : a Transformer encoder extracts predicate-specific image features; the result is Z^p.
- predicate query initialization
A plain learnable query can't capture the compositional (subject-predicate-object) property, so the subject and object queries are concatenated.

When learning the representation for this query, it attends jointly to the features H and boxes B from step 1.
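The compositional initialization can be sketched like this (names and the projection are assumptions; the point is only that the predicate query is built from subject and object parts rather than a single learnable vector):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_pred = 64, 20                          # model dim, number of predicate queries (assumed)

q_sub = rng.standard_normal((n_pred, d))    # subject-side query part
q_obj = rng.standard_normal((n_pred, d))    # object-side query part
W = rng.standard_normal((2 * d, d))         # projection back to model dim

# Concatenate subject/object parts, then project: compositional predicate query.
q_pred = np.concatenate([q_sub, q_obj], axis=-1) @ W
```

This keeps the subject and object roles distinguishable inside each predicate query without enumerating all O(N^2) entity pairs.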


3) Structural Predicate Node Generator
Perform the final attention operations on the representations received from above.
a) predicate sub-decoder
Extracts predicate representations from the image features.

b) entity indicator sub-decoders
Pull entity indicators for the predicate queries.

c) predicate indicator fusion
Connects predicates and indicators so they can reference each other as the layers stack up.
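A rough sketch of the layer-wise fusion, assuming a simple sum-and-project fusion (the paper's actual fusion function is learned; the stand-in indicator outputs below are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_pred, L = 64, 20, 3                     # assumed dims and number of layers

h_pred = rng.standard_normal((n_pred, d))    # initial predicate representation
for layer in range(L):
    # Stand-ins for the subject/object indicator sub-decoder outputs at this layer.
    h_sub = rng.standard_normal((n_pred, d))
    h_obj = rng.standard_normal((n_pred, d))
    W = rng.standard_normal((d, d)) / np.sqrt(d)
    # Fuse predicate and indicator representations; the result feeds the next layer,
    # so the two branches can reference each other as depth increases.
    h_pred = (h_pred + h_sub + h_obj) @ W
```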

The end result of this process is the output shown below.

class prediction for the predicate, plus the bboxes and categories of the subject and object associated with it
Bipartite Graph Assembling
Assemble a bipartite graph of N entities and N_r predicates: build an adjacency matrix between entity nodes and predicate nodes, then derive the correspondence matrix from it.

As the figure shows, there are entity, subject (light green), and object (blue) predictions, and the distances between them drive the matching!
For the subject, for example, define the distance between entity and subject predictions as below and keep only the top-K entries of the distance matrix.
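A toy version of this top-K assembling step, assuming a simple L1 box distance (the paper's distance combines localization and class similarity; the names and K are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
N, N_r, K = 8, 5, 3                  # entities, predicates, kept candidates per predicate

ent_boxes = rng.random((N, 4))       # entity node boxes (cx, cy, w, h)
sub_boxes = rng.random((N_r, 4))     # subject-indicator boxes, one per predicate

# Distance matrix (N_r x N): L1 distance between each subject indicator and each entity.
dist = np.abs(sub_boxes[:, None, :] - ent_boxes[None, :, :]).sum(-1)

# Keep only the top-K closest entities per predicate -> sparse correspondence matrix.
topk = np.argsort(dist, axis=1)[:, :K]
adj = np.zeros((N_r, N))
np.put_along_axis(adj, topk, 1.0, axis=1)
```

The same procedure is repeated with the object indicators, giving the two sides of the bipartite graph.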

Learning and Inference

Entity loss: the DETR entity generator loss. Predicate loss: localization + classification for the entity indicators, localization for the entities tied to each predicate, and classification for the predicate.
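The Hungarian matching underneath these losses can be illustrated with a brute-force toy (DETR-style pipelines use `scipy.optimize.linear_sum_assignment` on a cost mixing classification and localization terms; the cost values here are made up):

```python
import itertools

# cost[i][j]: cost of matching prediction i to ground-truth j (hypothetical values).
cost = [[4.0, 1.0, 3.0],
        [2.0, 0.0, 5.0],
        [3.0, 2.0, 2.0]]

def hungarian_bruteforce(cost):
    """Exhaustively search permutations for the minimum-cost one-to-one matching."""
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in itertools.permutations(range(n)):
        c = sum(cost[i][perm[i]] for i in range(n))
        if c < best_cost:
            best_perm, best_cost = perm, c
    return best_perm, best_cost

perm, total = hungarian_bruteforce(cost)   # perm maps prediction i -> GT perm[i]
```

Once the matching is fixed, each loss term above is computed only between matched prediction/ground-truth pairs.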
Results

Random thoughts / questions
- Both the visual information and the object's location information go into the query, but was that necessary?