image

paper

TL;DR

  • task : Generate a scene graph given detected bounding boxes
  • PROBLEM : I want to better model the relationship between pairs of objects.
  • idea : It’s helpful to learn the context between objects, and it’s even better if all other objects are referenced when predicting a predicate, not just the object in question.
  • architecture : Model the relationship between nodes as self-attention and the relationship between edges as cross-attention, following the transformer encoder-decoder example.
  • Given bboxes, 1) extract visual features with Faster R-CNN, 2) feed them into the transformer encoder to get self-attention outputs and predict the class of each bbox, 3) feed edge queries into the transformer decoder to cross-attend with the features from 1 and 2, and 4) run the result through an FCN to predict subject, object, and relation.
  • objective : cross entropy loss of subject, object, relation
  • baseline : two-stage models such as IMP, etc.
  • data : Visual Genome, CQA, VRD
  • result : SOTA
  • contribution : making full use of the transformer encoder-decoder structure.

Details

why two-stage?

Unlike the one-stage models we read about before, the bboxes go in as input! In other words, the difference is whether the bbox only appears in the final output or is provided up front.

Architecture

image

Problem Decomposition

  • We can express the conditional probability of going from an image I to the scene graph G we want as the product (R is a relation, O is an object class, B is a bbox, I is an image): $P(R|O,B,I)P(O|B,I)P(B|I)$
  • $P(B|I)$ is factored out, since that is exactly what object detection does.
  • $P(O|B,I)$ is modeled by the N2N module, which learns context between objects.
  • $P(R|O,B,I)$ is where the E2N module creates undirected edges between the entities and the queries and picks candidates, and the RPM resolves the edge direction and relation type.
  • A unique relation type is assumed between any two objects.
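Spelled out, the factorization above is just the chain rule applied to the joint distribution over boxes, object classes, and relations:

```latex
P(G \mid I) = P(B, O, R \mid I)
            = P(B \mid I)\; P(O \mid B, I)\; P(R \mid O, B, I)
```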

Object Detection

We use Faster R-CNN with a VGG16 backbone. Given a node $n_i$, we take its bbox coordinates as the spatial embedding, and from the feature map $I_{feature}$ of the top VGG16 layer we extract the 4096-dimensional feature vector $v_i$. The class vector $o_i^{init}\in\mathbb{R}^C$ (where C is the # of classes) is initialized with GloVe embeddings.

Encoder N2N Attention

Learning the context of objects helps not only object detection but also relation classification. For this purpose, the objects are fed into a transformer encoder. The inputs are: $v_i$, the image feature vector; $s_i$, the GloVe vector for the class label from object detection; and $b_i$, the bounding box. image
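A toy sketch of how the three inputs could be fused into one encoder token per node. The dimensions and the sum-of-projections fusion are my assumptions, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 4096-d visual feature, 200-d GloVe, 4-d bbox.
D_VIS, D_GLOVE, D_MODEL = 4096, 200, 512

# Hypothetical projection weights (learned in the real model).
W_vis = rng.normal(0, 0.01, (D_VIS, D_MODEL))
W_sem = rng.normal(0, 0.01, (D_GLOVE, D_MODEL))
W_box = rng.normal(0, 0.01, (4, D_MODEL))

def node_token(v_i, s_i, b_i):
    """Fuse visual (v_i), semantic (s_i), and spatial (b_i) cues into a
    single encoder input token. Sum-of-projections is an assumption here."""
    return v_i @ W_vis + s_i @ W_sem + b_i @ W_box

v = rng.normal(size=D_VIS)           # Faster R-CNN visual feature v_i
s = rng.normal(size=D_GLOVE)         # GloVe embedding s_i of the class label
b = np.array([0.1, 0.2, 0.5, 0.6])   # normalized bbox coordinates b_i
token = node_token(v, s, b)
print(token.shape)  # (512,)
```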

We then feed these tokens through the network below, and the last layer performs classification for each bbox. image
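The core of the N2N step is ordinary scaled dot-product self-attention over the node tokens. A minimal single-head numpy sketch (no multi-head, residuals, or feed-forward sublayers; toy sizes):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over the N node tokens in X.
    Returns the contextualized tokens and the (N, N) N2N attention map."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return A @ V, A

rng = np.random.default_rng(1)
N, d = 5, 16                      # 5 detected objects, toy model width
X = rng.normal(size=(N, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
print(out.shape, attn.shape)      # (5, 16) (5, 5)
```

Each row of `attn` is exactly the kind of node-to-node heatmap shown in the qualitative results later.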

Also, $f_i^{final}$ enters the decoder cross-attention.

Decoder Edge Positional Encoding

I want to feed edge queries into the transformer decoder, but applying a positional encoding is tricky because there is no natural ordering over edges. image

So, with the scheme above, we know which node is the subject and which is the object. (?? Honestly, the expression doesn't quite make sense to me.)

Decoder E2N Attention

The edge queries $e_{ij}$ together with the node vectors from above: image

image

Self-attention between edges didn't help performance, so the edge queries go straight into cross-attention.
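The E2N step is then plain cross-attention: edge queries attend over the encoder's node outputs $f_i^{final}$, so every object (not just the pair in question) can contribute context to the predicate. A toy numpy sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(E, F, Wq, Wk, Wv):
    """E2N cross-attention: each edge query in E attends over all node
    features F (the encoder outputs f_i^final)."""
    Q, K, V = E @ Wq, F @ Wk, F @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (num_edges, num_nodes)
    return A @ V, A

rng = np.random.default_rng(3)
num_edges, num_nodes, d = 6, 4, 16
E = rng.normal(size=(num_edges, d))   # edge queries
F = rng.normal(size=(num_nodes, d))   # encoder node outputs
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, attn = cross_attention(E, F, Wq, Wk, Wv)
print(out.shape, attn.shape)  # (6, 16) (6, 4)
```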

Directed Relation Prediction Module (RPM)

Relations are directional, so the rich embeddings from above are combined into a directional relation embedding, as below. image

We then put the above embedding into a module called RPM (relation prediction module) to predict the final relation.

image

LayerNorm -> Linear -> ReLU -> Linear (# of final relation categories) -> softmax
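The pipeline above is a small MLP head; a self-contained numpy sketch with toy dimensions (the hidden size is my choice, not from the paper):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def rpm_head(e, W1, b1, W2, b2):
    """LayerNorm -> Linear -> ReLU -> Linear -> softmax, as described above."""
    h = layer_norm(e)
    h = np.maximum(h @ W1 + b1, 0.0)    # Linear + ReLU
    return softmax(h @ W2 + b2)         # distribution over relation categories

rng = np.random.default_rng(4)
d, hidden, num_rel = 32, 64, 50         # toy sizes (VG has 50 predicates)
e = rng.normal(size=d)                  # directional relation embedding
W1, b1 = rng.normal(size=(d, hidden)) * 0.1, np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, num_rel)) * 0.1, np.zeros(num_rel)
p = rpm_head(e, W1, b1, W2, b2)
print(p.shape)  # (50,)
```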

A frequency bias term is also added. image
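My reading of the frequency term (an assumption, in the style of frequency-baseline approaches): a log-frequency prior over relations, indexed by the (subject class, object class) pair and estimated from training-set co-occurrence counts, is added to the relation logits:

```python
import numpy as np

num_classes, num_rel = 3, 4
# Co-occurrence counts of (subject class, object class, relation) from
# training data; toy numbers here, with add-one smoothing.
counts = np.ones((num_classes, num_classes, num_rel))
counts[0, 1, 2] = 50.0                  # e.g. (person, horse) -> "riding"
freq_bias = np.log(counts / counts.sum(-1, keepdims=True))

def biased_logits(logits, subj_cls, obj_cls):
    """Add the log-frequency prior for this class pair to the RPM logits."""
    return logits + freq_bias[subj_cls, obj_cls]

out = biased_logits(np.zeros(num_rel), 0, 1)
print(out.argmax())  # 2 — the frequent relation dominates when logits are flat
```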

Implementation Details

The object detector keeps the top 64 object labels after NMS (IoU > 0.3), and only node pairs with overlapping bounding boxes are considered, to reduce the computational cost of relation classification (a huge inductive bias...).
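The overlapping-pair filter can be sketched in a few lines; the box format (x1, y1, x2, y2) is an assumption:

```python
def overlap(a, b):
    """True if boxes a and b (x1, y1, x2, y2) intersect at all."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def candidate_pairs(boxes):
    """Keep only ordered node pairs whose boxes overlap (the inductive bias
    mentioned above: non-touching objects are assumed unrelated)."""
    return [(i, j) for i in range(len(boxes)) for j in range(len(boxes))
            if i != j and overlap(boxes[i], boxes[j])]

boxes = [(0, 0, 10, 10), (5, 5, 15, 15), (100, 100, 110, 110)]
print(candidate_pairs(boxes))  # [(0, 1), (1, 0)] — box 2 touches nothing
```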

Result

Visual Genome

image

Qualitative

image

(b) is the N2N attention heatmap, which shows how much one object influenced another. (c) is the E2N attention heatmap, which shows how much each object affected the relation.

The model often predicts "on" where "of" would be more natural, as in (Face, of, Woman) -> this is why multi-label predicate prediction is needed.

image

Using their own decoder worked better than simply reusing the standard transformer decoder (with self-attention between edges).

image

The table shows the performance drop when each feature is removed from the decoder; the frequency term's contribution is fairly large.

The authors’ explanation of each image