
TL;DR
- task : two-stage SGG
- PROBLEM : Learn good representations of objects and of the relationships between them.
- idea : Let’s use bi-GRU to communicate between objects.
- architecture : Faster R-CNN extracts object proposals with visual / coordinate / class features, which are fed into a bi-GRU. The per-object hidden states then go into a transformer encoder. Relations are predicted for the n(n-1) ordered object pairs (after pruning).
- objective : cross-entropy loss
- baseline : Neural Motif, IMP, Graph R-CNN
- data : Visual Genome
- result : SOTA
- contribution : Not sure.
- Limitations or things I don’t understand : with n region proposals you get $O(n^2)$ candidate pairs, so the relation stage scales quadratically in the number of proposals.
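The quadratic blow-up in the last bullet is easy to make concrete: with n proposals, the relation head must score every ordered (subject, object) pair. A minimal sketch (function name is mine, not the paper's):

```python
from itertools import permutations

def candidate_pairs(n: int):
    """All ordered (subject, object) pairs among n proposals, i != j."""
    return list(permutations(range(n), 2))

# n proposals yield n*(n-1) ordered pairs, i.e. O(n^2) relation queries.
print(len(candidate_pairs(64)))  # 4032 pairs for only 64 proposals
```

This is why pruning the pair set before the relation head matters in practice.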
Details
Architecture

For each ordered pair of object proposals, a relation is predicted as described in Section 3.3: $d = W_p \, u_{i,j}$
- $u_{i,j}$ : a 2048-dimensional union feature of the subject-object pair? The paper doesn’t say how it was created.
$p_{i,j} = \mathrm{softmax}(W_r(o_i' * o_j' * u_{i,j}) + d \odot \tilde p_{i \to j})$
- $\odot$ is the Hadamard product

Finally, take the argmax over $p_{i,j}$ to get the predicted relation.
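The relation head above can be sketched in a few lines of numpy. Assumptions (mine, not the paper's): I read the `*` between $o_i'$, $o_j'$ and $u_{i,j}$ as an elementwise product (only $\odot$ is explicitly Hadamard), $\tilde p_{i \to j}$ is a per-pair frequency prior, and all shapes and weights are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
D, R = 2048, 51          # feature dim; relation classes (VG: 50 predicates + background)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

W_r = rng.normal(scale=0.01, size=(R, D))   # relation classifier weights
W_p = rng.normal(scale=0.01, size=(R, D))   # projects the union feature to the gate d

def relation_probs(o_i, o_j, u_ij, p_freq):
    """p_{i,j} = softmax(W_r(o_i' * o_j' * u_ij) + d ⊙ p_freq), with d = W_p u_ij."""
    d = W_p @ u_ij
    logits = W_r @ (o_i * o_j * u_ij) + d * p_freq
    return softmax(logits)

# Toy pair: contextualized object features, union feature, frequency prior
o_i, o_j, u_ij = (rng.normal(size=D) for _ in range(3))
p_freq = softmax(rng.normal(size=R))
p = relation_probs(o_i, o_j, u_ij, p_freq)
rel = int(np.argmax(p))                     # final predicted relation, as in the note
```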

Frequency Softening
Since the VG relation distribution is long-tailed, the frequency prior is softened by taking its log before the final softmax step.
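A minimal sketch of what I understand the softening to do: replacing the raw frequency prior with its log shrinks the gap between head and tail predicates (my reading; the exact placement of the log in the paper may differ, and the counts below are made up):

```python
import numpy as np

# Long-tailed predicate counts from training data (made-up numbers)
counts = np.array([100000, 5000, 200, 10], dtype=float)

raw_prior = counts / counts.sum()            # heavily skewed toward the head predicate
soft_prior = np.log(counts + 1)              # log-softened prior used as a bias instead
soft_prior = soft_prior / soft_prior.sum()   # normalized here only for comparison

# Head/tail ratio collapses after softening
print(raw_prior[0] / raw_prior[-1])          # 10000.0
print(soft_prior[0] / soft_prior[-1])        # ~4.8
```

So head predicates still get a boost, but no longer drown out the tail.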

Results
