
TL;DR
- I read this because : it is one of the early SGG papers
- task : Scene Graph Generation
- problem : Detect objects and handle the quadratic number of candidate relations between them efficiently. Build an enriched graph representation.
- idea : Insert a module in the middle that prunes the relations between objects, then apply an attentional GCN.
- architecture : 1) Extract objects with Faster R-CNN 2) concatenate the object class logits to score and prune relations 3) apply the attentional GCN to enrich the object and relation node representations -> attach a classifier to each subject, object, and relation representation to produce the predictions?
- objective : 1) bbox loss + cls loss 2) BCE for the relatedness score 3) CE for object cls and predicate cls
- baseline : IMP, MSDN, Neural Motifs
- data : Visual Genome
- evaluation : PredCls, PhrCls, SGGen, SGGen+(proposed in this paper)
- result : SOTA
- contribution : Probably the first paper to apply GCN?
- limitation/things I can’t understand : Does SGG really have enough graph structure to justify using a GCN?
Details
Architecture

The pipeline breaks down into 3 steps
- Object Region Proposal : Given an image, select the nodes (= vertices, V) => Faster R-CNN
- Relationship Proposal : Given the image and the nodes, prune the n*(n-1) candidate relations that exist between all ordered pairs of nodes
- Graph Labeling : Given the image, nodes, and edges, predict the object and relation labels
Relation Proposal Network
Measure the “relatedness” of an object pair from the objects’ class logits.
This acts as a kind of soft prior (e.g., person-ride-chicken is unlikely?)

The implementation concatenates the logits and then stacks MLPs on top.
After scoring and sorting, we keep the top K pairs. Because Faster R-CNN produces many overlapping boxes, many of these pairs are near-duplicates, so we apply NMS over the pairs and keep only the remaining m pairs.
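As a minimal sketch of the scoring step described above (concatenate the class logits of a pair, pass them through an MLP, squash to a score), assuming a 2-layer MLP with ReLU and a sigmoid output. The function name, weight shapes, and hidden size are hypothetical and not taken from the paper or its code.

```python
import numpy as np

def relatedness_scores(logits, w1, b1, w2, b2):
    """Score every ordered (subject, object) pair from concatenated class logits.

    logits: (n, c) class logits, one row per detected object.
    w1: (2c, h), b1: (h,), w2: (h,), b2: scalar. Hypothetical MLP weights.
    Returns an (n, n) matrix; entry [i, j] scores subject i with object j.
    """
    n, c = logits.shape
    subj = np.repeat(logits[:, None, :], n, axis=1)    # (n, n, c), row i holds subject i
    obj = np.repeat(logits[None, :, :], n, axis=0)     # (n, n, c), column j holds object j
    pair = np.concatenate([subj, obj], axis=-1)        # (n, n, 2c)
    hidden = np.maximum(pair @ w1 + b1, 0.0)           # first layer + ReLU
    score = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))  # second layer + sigmoid
    np.fill_diagonal(score, 0.0)                       # an object is never paired with itself
    return score
```

Because subject and object occupy different halves of the concatenated input, the score is not symmetric: person-ride-horse and horse-ride-person can receive different values.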
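The selection step can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact procedure: the overlap between two pairs is taken here as the smaller of the subject-box IoU and the object-box IoU, and `iou_thresh` is a hypothetical parameter.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def propose_relations(scores, boxes, top_k, iou_thresh=0.7):
    """Keep the top_k best ordered pairs, then greedily drop near-duplicate pairs.

    scores: (n, n) relatedness scores, scores[i, j] for subject i and object j.
    boxes:  (n, 4) object boxes.
    Returns the surviving (subject, object) index pairs, best first.
    """
    n = scores.shape[0]
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    pairs.sort(key=lambda p: scores[p], reverse=True)
    pairs = pairs[:top_k]

    kept = []
    for i, j in pairs:
        # Assumed pair overlap: both the subject boxes and the object boxes must overlap.
        duplicate = any(
            min(box_iou(boxes[i], boxes[ki]), box_iou(boxes[j], boxes[kj])) > iou_thresh
            for ki, kj in kept
        )
        if not duplicate:
            kept.append((i, j))
    return kept
```

The number of pairs that survive is the m used for the relation nodes later on.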

Attentional GCN
Vanilla GCN looks like this
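The equation image did not survive here. A reconstruction consistent with the symbols defined just below, assuming a residual self term, a shared weight matrix $W$, a nonlinearity $\sigma$, and a layer index $(l)$ (these four are assumptions, not recovered from the note):

$$
z_i^{(l+1)} = \sigma\Big( z_i^{(l)} + \sum_{j \in N(i)} \alpha_{ij}\, W z_j^{(l)} \Big)
$$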

- $z_i$ : Representation of the i-th node
- $N(i)$ : neighbors of the i-th node
- $\alpha_{ij}$ : connection coefficient between nodes i and j, determined by the adjacency matrix
Stacking the node representations as the columns of a matrix $Z\in \mathbb{R}^{d\times T_n}$, this becomes
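The matrix-form equation is also missing. One form that follows directly from the per-node definition, assuming $\alpha_i \in \mathbb{R}^{T_n}$ holds $\alpha_{ij}$ for $j \in N(i)$ and $0$ elsewhere, and that $W$, $\sigma$, and the layer index $(l)$ are as in the per-node update (the original figure may write this more compactly):

$$
z_i^{(l+1)} = \sigma\big( z_i^{(l)} + W Z^{(l)} \alpha_i \big), \qquad W Z^{(l)} \alpha_i = \sum_{j \in N(i)} \alpha_{ij}\, W z_j^{(l)}
$$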

Here $\alpha_{ij}$ is learned rather than given by a fixed adjacency matrix.

$\alpha_{ij}$ is predicted by a 2-layer MLP followed by a softmax over the neighbors.
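A minimal NumPy sketch of one attentional GCN update, assuming the attention logit is computed as $u_{ij} = w_h^\top \mathrm{ReLU}(W_a [z_i, z_j])$ and $\alpha_i = \mathrm{softmax}(u_i)$ over the neighbors of $i$. The names `W_a` and `w_h`, the ReLU nonlinearity, and the residual self term are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

def agcn_layer(Z, neighbors, W, W_a, w_h):
    """One attentional GCN update over all nodes.

    Z: (d, T) node features stored as columns.
    neighbors: dict mapping node index i to a list of neighbor indices.
    W: (d, d) shared transform. W_a: (h, 2d) and w_h: (h,) form the 2-layer attention MLP.
    """
    Z_new = np.empty_like(Z)
    for i in range(Z.shape[1]):
        nbrs = neighbors[i]
        if nbrs:
            # Attention logits from the concatenated pair [z_i, z_j].
            u = np.array([
                w_h @ np.maximum(W_a @ np.concatenate([Z[:, i], Z[:, j]]), 0.0)
                for j in nbrs
            ])
            alpha = softmax(u)                # learned coefficients, sum to 1 over neighbors
            msg = (W @ Z[:, nbrs]) @ alpha    # attention-weighted sum of transformed neighbors
        else:
            msg = np.zeros(Z.shape[0])        # isolated node receives no message
        Z_new[:, i] = np.maximum(Z[:, i] + msg, 0.0)
    return Z_new
```

With fixed, uniform `alpha` this reduces to a plain GCN layer; the only change is that the coefficients are predicted from the node features themselves.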
aGCN for SGG
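Create one node for each of the N object regions and each of the m relationships, and connect each relation node to its subject and object according to the pairs kept by the relation proposal network above. Additionally, add direct (skip) edges between the object nodes.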
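The graph construction above can be sketched with plain adjacency lists. The function and variable names are hypothetical; the sketch assumes skip edges between every pair of distinct objects.

```python
def build_scene_graph_edges(num_objects, relation_pairs):
    """Build the edge sets of the object/relation graph.

    relation_pairs: list of (subject, object) index pairs kept by the proposal step;
                    relation node r corresponds to relation_pairs[r].
    Returns three dicts keyed by object index:
      skip[i]    -> other object nodes directly connected to i
      subj_of[i] -> relation nodes in which object i is the subject
      obj_of[i]  -> relation nodes in which object i is the object
    """
    skip = {i: [j for j in range(num_objects) if j != i] for i in range(num_objects)}
    subj_of = {i: [] for i in range(num_objects)}
    obj_of = {i: [] for i in range(num_objects)}
    for r, (s, o) in enumerate(relation_pairs):
        subj_of[s].append(r)
        obj_of[o].append(r)
    return skip, subj_of, obj_of
```

For example, with 3 objects and kept pairs `[(0, 2), (2, 0)]`, object 0 is the subject of relation node 0 and the object of relation node 1.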
The representation for an object node is as follows
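The equation image is missing here. A reconstruction that matches the graph described above (skip edges between objects, plus edges to relation nodes in which the object acts as subject or as object), to be checked against the paper. The superscripts $o$/$r$ for object/relation features and the names $W^{\text{skip}}$, $W^{sr}$, $W^{or}$ with their attention vectors are assumed notation:

$$
z_i^{o} = \sigma\big( W^{\text{skip}} Z^{o} \alpha^{\text{skip}} + W^{sr} Z^{r} \alpha^{sr} + W^{or} Z^{r} \alpha^{or} \big)
$$

The first term carries messages from other objects and the last two carry messages from neighboring relation nodes.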

The representation for a relation node is shown below.
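The equation image is missing here as well. A reconstruction under the same assumed notation ($Z^{o}$ for object features, separate weights and attention vectors for the subject and object roles), to be checked against the paper:

$$
z_i^{r} = \sigma\big( z_i^{r} + W^{rs} Z^{o} \alpha^{rs} + W^{ro} Z^{o} \alpha^{ro} \big)
$$

A relation node only receives messages from its subject and object nodes, so there is no relation-to-relation term.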

Result

Ablation for modules

