image

paper

TL;DR

  • Task: two-stage Scene Graph Generation
  • Problem: existing studies assume triplets are independent and predict them in parallel. image
  • Idea: if you condition on the relations already predicted and decode auto-regressively, you do better! (see above)
  • Architecture: a transformer encoder-decoder. The decoder represents each predicted triplet as [S, P, O] — the encoder's subject and object embeddings plus a relation embedding — applies self-attention over these triplet tokens, and cross-attends to the encoder output.
  • Objective: cross-entropy loss plus a reinforcement-learning term whose reward is built from Recall and mRecall
  • Baselines: Graph R-CNN, …
  • Data: VRD, Visual Genome
  • Result: SOTA
  • Contribution: the first auto-regressive approach in SGG
  • Limitation / part I don't understand: it's interesting that this is learnable… -> (after discussion) multi-object detection has also been done sequentially (learning that if there is a cat in this picture, there will also be a dog). Transformer decoders don't only look at their own input — they also use cross-attention — so the decoder input doesn't necessarily have to be related to what we want to select next.
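The core idea above can be sketched as follows. `score_pair` is a hypothetical stand-in for the model's scoring head (not the paper's actual interface); the point is only the control flow: the baseline scores every pair independently, while the auto-regressive variant conditions each step on the triplets already emitted.

```python
# Sketch of the core idea (hypothetical interface).

def predict_parallel(score_pair, pairs):
    """Baseline: each pair is scored independently of the others."""
    return [score_pair(p, []) for p in pairs]

def predict_autoregressive(score_pair, pairs, num_steps):
    """Auto-regressive: at each step, score the remaining pairs given the
    history of already-predicted triplets and commit to the best one."""
    history, remaining = [], list(pairs)
    for _ in range(num_steps):
        scored = [(score_pair(p, history), p) for p in remaining]
        _, best_pair = max(scored)
        history.append(best_pair)
        remaining.remove(best_pair)
    return history
```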

Details

Architecture

image

Object Encoder

Just a transformer encoder, though I'm not sure what goes in as input — is it just a visual feature map? $X_b$ is the output of the $b$-th transformer block.
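A minimal sketch of what one such encoder block computes over the $N \times D$ object features — single head, no FFN or LayerNorm, and all weight matrices are hypothetical placeholders:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoder_block(X, Wq, Wk, Wv):
    """One simplified transformer encoder block (single head, no FFN/LayerNorm).
    X is the (N, D) object feature matrix; the output is its contextualized
    version X_b after self-attention with a residual connection."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(X.shape[1]))  # (N, N) attention weights
    return X + A @ V                            # residual connection
```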

Relationship Decoder

Takes the contextualized object features $X_B \in \mathbb{R}^{N\times D}$ (where $N$ is the number of objects and $D$ is the embedding dimension) together with the relationships $\hat Y_{1:m}$ predicted up to the previous step, and predicts the $(m+1)$-th relationship.

The input to the decoder is the concatenation of the contextualized subject embedding, the relation embedding, and the contextualized object embedding: $(X_B[i], E[r], X_B[j])$. So it is an unusual structure where each previous prediction is embedded and fed back in to produce the next one. The concatenation is projected to $D$ dimensions by an FFN and then passed through self-attention and cross-attention. At the first step, only a $D$-dimensional <SOS> token is fed in. In cross-attention, the queries come from the decoder's self-attention output $Y_k$ and the keys/values from the encoder output $X_B$. image
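A minimal numpy sketch of how one previously predicted triplet could be turned into a single decoder token, assuming a single linear projection in place of the FFN (all names here are hypothetical):

```python
import numpy as np

def triplet_token(X_B, E_rel, i, r, j, W_proj):
    """Embed one previously-predicted triplet (subject i, relation r, object j)
    as a single D-dim decoder token: concatenate the contextualized subject
    embedding X_B[i], the relation embedding E_rel[r], and the contextualized
    object embedding X_B[j], then project from 3D back to D (here a single
    linear layer stands in for the FFN)."""
    concat = np.concatenate([X_B[i], E_rel[r], X_B[j]])  # shape (3D,)
    return concat @ W_proj                               # shape (D,)
```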

The next relationship triplet is predicted from the output $Y_K$ of the last ($K$-th) decoder layer. Scores are computed for all remaining pairs as shown below, and the one with the highest softmax probability is selected. image
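A sketch of this selection step under a simplified, assumed dot-product scoring head — the notes don't specify the exact head, so the scoring line is purely illustrative:

```python
import numpy as np

def pick_next_triplet(Y_last, X_B, E_rel, remaining_pairs):
    """Score every remaining (i, j) pair with every relation r against the
    last decoder output Y_last (shape (D,)) and return the argmax triplet.
    The dot-product score here is a placeholder, not the paper's actual head."""
    best, best_score = None, -np.inf
    for i, j in remaining_pairs:
        for r in range(len(E_rel)):
            s = Y_last @ (X_B[i] + E_rel[r] + X_B[j])  # placeholder score
            if s > best_score:
                best, best_score = (i, r, j), s
    return best
```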

$i$ : subject indices, $j$ : object indices

Training scheme

  • Triplet ordering is learned by shuffling.
  • The loss is normally added only for positive pairs, but on VRD negative pairs are also added, because predicting "no relation" is important there. image
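A toy version of this loss choice — cross-entropy over relation classes where negative pairs carry a "no relation" class that can be kept (as on VRD) or dropped. The function signature is my own sketch, not the paper's:

```python
import numpy as np

def relation_ce_loss(logits, labels, no_rel_class, include_negatives=True):
    """Mean cross-entropy over relation classes.
    logits: (P, C) scores for P pairs over C relation classes.
    labels: (P,) class indices; negative pairs are labeled no_rel_class.
    include_negatives=False reproduces the positive-pairs-only loss."""
    probs = np.exp(logits - logits.max(1, keepdims=True))
    probs /= probs.sum(1, keepdims=True)
    if not include_negatives:
        keep = labels != no_rel_class
        probs, labels = probs[keep], labels[keep]
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))
```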

Reinforcement Learning

  1. In training, the input history is the GT (teacher forcing), but in inference it is not.
  2. There is a gap between the cross-entropy loss and the recall metrics.

-> So a reinforcement-learning component is added when decoding. Recall and mRecall tend to move in opposite directions, so the reward is defined as a weighted combination of the two with a coefficient alpha. image

image

Here, the action is which pair to choose given the logits over all pairs, and the state is the set of $m$ pairs selected so far. RL decoding is better than greedy decoding.
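A sketch of how such a reward might be computed, assuming `alpha` trades off overall Recall against per-predicate mean Recall; details like top-K truncation are omitted, and the grouping argument is my own formulation:

```python
def reward(pred_triplets, gt_triplets, gt_by_predicate, alpha):
    """Hypothetical reward: alpha * Recall + (1 - alpha) * mRecall.
    gt_by_predicate maps each predicate to its list of GT triplets, so
    mRecall averages recall per predicate class (rare predicates count
    as much as frequent ones)."""
    pred = set(pred_triplets)
    recall = len(pred & set(gt_triplets)) / len(gt_triplets)
    per_pred = [len(pred & set(g)) / len(g) for g in gt_by_predicate.values()]
    m_recall = sum(per_pred) / len(per_pred)
    return alpha * recall + (1 - alpha) * m_recall
```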

Experiments

image

Qualitative Results

image

The probability assigned to the GT triplet is higher than when each triplet is predicted independently.