image

paper

TL;DR

  • task : two-stage SGG
  • problem : learn good representations of objects and of the relationships between them.
  • idea : use a bi-GRU to pass context between objects.
  • architecture : Faster R-CNN extracts object proposals with visual / coordinate / class features, which are fed into a bi-GRU. The per-object hidden outputs then go into a transformer encoder. Relations are predicted for the n(n-1) ordered object pairs after pruning.
  • objective : cross-entropy loss
  • baseline : Neural Motif, IMP, Graph R-CNN
  • data : Visual Genome
  • result : SOTA
  • contribution : Not sure.
  • Limitations or things I don’t understand : with n region proposals, the number of candidate (subject, object) pairs grows as $O(n^2)$, so relation prediction gets expensive.
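The pair-explosion noted in the limitations bullet is easy to see concretely: n proposals yield n(n-1) ordered (subject, object) candidates. A tiny illustrative sketch (the function name is mine, not the paper's):

```python
# Illustrates the O(n^2) candidate growth: every ordered (subject, object)
# pair among n proposals is a potential relation to score.
from itertools import permutations

def candidate_pairs(n):
    """Return all ordered (subject, object) index pairs among n proposals."""
    return list(permutations(range(n), 2))

for n in (5, 10, 20):
    pairs = candidate_pairs(n)
    assert len(pairs) == n * (n - 1)
    print(n, len(pairs))  # 5 -> 20, 10 -> 90, 20 -> 380
```

So doubling the number of proposals roughly quadruples the number of pairs the relation head must handle, which is why the pruning step matters.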

Details

Architecture

image

Each pair of object proposals is used to predict a relation, as described in Section 3.3. $d = W_p u_{i,j}$

  • $u_{i,j}$ : the 2048-dimensional union feature of a subject-object pair? The paper doesn’t say how it is constructed.

$p_{i,j} = \mathrm{softmax}(W_r(o_i' * o_j' * u_{i,j}) + d \odot \tilde p_{i \to j})$

  • $\odot$ is the Hadamard product
  • $\tilde p_{i \to j}$ : image

Finally, take the argmax over $p_{i,j}$ to get the predicted relation. image
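The relation head above can be sketched in a few lines of numpy. This is a hedged reconstruction: I assume the `*` between $o_i'$, $o_j'$, $u_{i,j}$ is element-wise fusion (the notes don't pin down the operator), and all names and dimensions are illustrative.

```python
# Minimal sketch of the relation head, under assumed element-wise fusion.
# o_i, o_j : contextualized object features; u_ij : union feature;
# p_tilde  : frequency prior over relation classes.
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def relation_predict(o_i, o_j, u_ij, p_tilde, W_r, W_p):
    d = W_p @ u_ij                            # d = W_p u_{i,j}
    fused = o_i * o_j * u_ij                  # assumed element-wise fusion
    p = softmax(W_r @ fused + d * p_tilde)    # p_{i,j}, with d ⊙ p̃ as Hadamard
    return p, int(np.argmax(p))               # final relation = argmax

rng = np.random.default_rng(0)
dim, n_rel = 8, 5
p, rel = relation_predict(
    rng.normal(size=dim), rng.normal(size=dim), rng.normal(size=dim),
    softmax(rng.normal(size=n_rel)),
    rng.normal(size=(n_rel, dim)), rng.normal(size=(n_rel, dim)),
)
assert np.isclose(p.sum(), 1.0) and 0 <= rel < n_rel
```

Note that $d$ has one entry per relation class here, so the Hadamard product with $\tilde p_{i \to j}$ type-checks.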

Frequency Softening

Since the VG dataset is long-tailed, the frequency prior is softened by taking its log before the final softmax step. image

Results

image