
TL;DR
- task : two-stage SGG
- PROBLEM : Learn good representations of objects and of the relationships between them.
- idea : Let’s use bi-GRU to communicate between objects.
- architecture : Faster R-CNN extracts object proposals with visual / coordinate / class features, which are fed into a bi-GRU. The per-object hidden states then go into a transformer encoder. Relations are predicted for the n(n-1) ordered object pairs (after pruning).
- objective : cross-entropy loss
- baseline : Neural Motif, IMP, Graph R-CNN
- data : Visual Genome
- result : SOTA
- contribution : Not sure.
- Limitations or things I don’t understand : with n region proposals you get $O(n^2)$ candidate pairs, so the relation stage scales quadratically in the number of proposals.
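The quadratic blow-up in the last bullet is easy to make concrete: with n proposals, the relation head must score every ordered (subject, object) pair. A minimal sketch (function name is mine, not the paper's):

```python
from itertools import permutations

def candidate_pairs(n: int):
    """All ordered (subject, object) pairs among n proposals, i != j."""
    return list(permutations(range(n), 2))

# n proposals yield n*(n-1) ordered pairs, i.e. O(n^2) relation queries.
print(len(candidate_pairs(64)))  # 4032 pairs for only 64 proposals
```

This is why pruning the pair set before the relation head matters in practice.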
Details
Architecture

For each ordered pair of object proposals, a relation is predicted as described in Section 3.3: $d = W_p \, u_{i,j}$
- $u_{i,j}$ : a 2048-dimensional union feature of the subject-object pair? The paper doesn’t say how it was created.
$p_{i,j} = \mathrm{softmax}(W_r(o_i' * o_j' * u_{i,j}) + d \odot \tilde p_{i \to j})$
- $\odot$ is the Hadamard product

Finally, take the argmax over $p_{i,j}$ to get the predicted relation.
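The relation head above can be sketched in a few lines of numpy. Assumptions (mine, not the paper's): I read the `*` between $o_i'$, $o_j'$ and $u_{i,j}$ as an elementwise product (only $\odot$ is explicitly Hadamard), $\tilde p_{i \to j}$ is a per-pair frequency prior, and all shapes and weights are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
D, R = 2048, 51          # feature dim; relation classes (VG: 50 predicates + background)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

W_r = rng.normal(scale=0.01, size=(R, D))   # relation classifier weights
W_p = rng.normal(scale=0.01, size=(R, D))   # projects the union feature to the gate d

def relation_probs(o_i, o_j, u_ij, p_freq):
    """p_{i,j} = softmax(W_r(o_i' * o_j' * u_ij) + d ⊙ p_freq), with d = W_p u_ij."""
    d = W_p @ u_ij
    logits = W_r @ (o_i * o_j * u_ij) + d * p_freq
    return softmax(logits)

# Toy pair: contextualized object features, union feature, frequency prior
o_i, o_j, u_ij = (rng.normal(size=D) for _ in range(3))
p_freq = softmax(rng.normal(size=R))
p = relation_probs(o_i, o_j, u_ij, p_freq)
rel = int(np.argmax(p))                     # final predicted relation, as in the note
```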

Frequency Softening
Since the VG relation distribution is long-tailed, the frequency prior is softened by taking its log before the final softmax step.
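A minimal sketch of what I understand the softening to do: replacing the raw frequency prior with its log shrinks the gap between head and tail predicates (my reading; the exact placement of the log in the paper may differ, and the counts below are made up):

```python
import numpy as np

# Long-tailed predicate counts from training data (made-up numbers)
counts = np.array([100000, 5000, 200, 10], dtype=float)

raw_prior = counts / counts.sum()            # heavily skewed toward the head predicate
soft_prior = np.log(counts + 1)              # log-softened prior used as a bias instead
soft_prior = soft_prior / soft_prior.sum()   # normalized here only for comparison

# Head/tail ratio collapses after softening
print(raw_prior[0] / raw_prior[-1])          # 10000.0
print(soft_prior[0] / soft_prior[-1])        # ~4.8
```

So head predicates still get a boost, but no longer drown out the tail.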

Results
