image

paper

TL;DR

  • Task: two-stage Scene Graph Generation
  • Problem: existing studies assume triplets are independent and predict them in parallel. image
  • Idea: if you condition on the relations already predicted and decode auto-regressively, you do better! (see above)
  • Architecture: a transformer encoder-decoder. The decoder represents each predicted triplet as [S, P, O] — the encoder's subject and object embeddings plus a relation embedding — applies self-attention over these triplet tokens, and cross-attends to the encoder output.
  • Objective: cross-entropy loss plus a reinforcement-learning term whose reward is built from Recall and mRecall
  • Baselines: Graph R-CNN, …
  • Data: VRD, Visual Genome
  • Result: SOTA
  • Contribution: the first auto-regressive approach in SGG
  • Limitation / part I don't understand: it's interesting that this is learnable… -> (after discussion) multi-object detection has also been done sequentially (learning that if there is a cat in this picture, there will also be a dog). Transformer decoders don't only look at their own input — they also use cross-attention — so the decoder input doesn't necessarily have to be related to what we want to select next.
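The core idea above can be sketched as follows. `score_pair` is a hypothetical stand-in for the model's scoring head (not the paper's actual interface); the point is only the control flow: the baseline scores every pair independently, while the auto-regressive variant conditions each step on the triplets already emitted.

```python
# Sketch of the core idea (hypothetical interface).

def predict_parallel(score_pair, pairs):
    """Baseline: each pair is scored independently of the others."""
    return [score_pair(p, []) for p in pairs]

def predict_autoregressive(score_pair, pairs, num_steps):
    """Auto-regressive: at each step, score the remaining pairs given the
    history of already-predicted triplets and commit to the best one."""
    history, remaining = [], list(pairs)
    for _ in range(num_steps):
        scored = [(score_pair(p, history), p) for p in remaining]
        _, best_pair = max(scored)
        history.append(best_pair)
        remaining.remove(best_pair)
    return history
```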

Details

Architecture

image

Object Encoder

Just a transformer encoder, though I'm not sure what goes in as input — is it just a visual feature map? $X_b$ is the output of the $b$-th transformer block.
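A minimal sketch of what one such encoder block computes over the $N \times D$ object features — single head, no FFN or LayerNorm, and all weight matrices are hypothetical placeholders:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encoder_block(X, Wq, Wk, Wv):
    """One simplified transformer encoder block (single head, no FFN/LayerNorm).
    X is the (N, D) object feature matrix; the output is its contextualized
    version X_b after self-attention with a residual connection."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(X.shape[1]))  # (N, N) attention weights
    return X + A @ V                            # residual connection
```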

Relationship Decoder

Takes the contextualized object features $X_B \in \mathbb{R}^{N\times D}$ (where $N$ is the number of objects and $D$ is the embedding dimension) together with the relationships $\hat Y_{1:m}$ predicted up to the previous step, and predicts the $(m+1)$-th relationship.

The input to the decoder is the concatenation of the contextualized subject embedding, the relation embedding, and the contextualized object embedding: $(X_B[i], E[r], X_B[j])$. So it is an unusual structure where each previous prediction is embedded and fed back in to produce the next one. The concatenation is projected to $D$ dimensions by an FFN and then passed through self-attention and cross-attention. At the first step, only a $D$-dimensional <SOS> token is fed in. In cross-attention, the queries come from the decoder's self-attention output $Y_k$ and the keys/values from the encoder output $X_B$. image
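A minimal numpy sketch of how one previously predicted triplet could be turned into a single decoder token, assuming a single linear projection in place of the FFN (all names here are hypothetical):

```python
import numpy as np

def triplet_token(X_B, E_rel, i, r, j, W_proj):
    """Embed one previously-predicted triplet (subject i, relation r, object j)
    as a single D-dim decoder token: concatenate the contextualized subject
    embedding X_B[i], the relation embedding E_rel[r], and the contextualized
    object embedding X_B[j], then project from 3D back to D (here a single
    linear layer stands in for the FFN)."""
    concat = np.concatenate([X_B[i], E_rel[r], X_B[j]])  # shape (3D,)
    return concat @ W_proj                               # shape (D,)
```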

The next relationship triplet is predicted from the output $Y_K$ of the last ($K$-th) decoder layer. Scores are computed for all remaining pairs as shown below, and the one with the highest softmax probability is selected. image
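A sketch of this selection step under a simplified, assumed dot-product scoring head — the notes don't specify the exact head, so the scoring line is purely illustrative:

```python
import numpy as np

def pick_next_triplet(Y_last, X_B, E_rel, remaining_pairs):
    """Score every remaining (i, j) pair with every relation r against the
    last decoder output Y_last (shape (D,)) and return the argmax triplet.
    The dot-product score here is a placeholder, not the paper's actual head."""
    best, best_score = None, -np.inf
    for i, j in remaining_pairs:
        for r in range(len(E_rel)):
            s = Y_last @ (X_B[i] + E_rel[r] + X_B[j])  # placeholder score
            if s > best_score:
                best, best_score = (i, r, j), s
    return best
```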

$i$ : subject indices, $j$ : object indices

Training scheme

  • Triplet ordering is learned by shuffling.
  • The loss is normally added only for positive pairs, but on VRD negative pairs are also added, because predicting "no relation" is important there. image
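A toy version of this loss choice — cross-entropy over relation classes where negative pairs carry a "no relation" class that can be kept (as on VRD) or dropped. The function signature is my own sketch, not the paper's:

```python
import numpy as np

def relation_ce_loss(logits, labels, no_rel_class, include_negatives=True):
    """Mean cross-entropy over relation classes.
    logits: (P, C) scores for P pairs over C relation classes.
    labels: (P,) class indices; negative pairs are labeled no_rel_class.
    include_negatives=False reproduces the positive-pairs-only loss."""
    probs = np.exp(logits - logits.max(1, keepdims=True))
    probs /= probs.sum(1, keepdims=True)
    if not include_negatives:
        keep = labels != no_rel_class
        probs, labels = probs[keep], labels[keep]
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))
```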

Reinforcement Learning

  1. In training, the input history is the GT (teacher forcing), but in inference it is not.
  2. There is a gap between the cross-entropy loss and the recall metrics.

-> So a reinforcement-learning component is added when decoding. Recall and mRecall tend to move in opposite directions, so the reward is defined as a weighted combination of the two with a coefficient alpha. image

image

Here, the action is which pair to choose given the logits over all pairs, and the state is the set of $m$ pairs selected so far. RL decoding is better than greedy decoding.
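A sketch of how such a reward might be computed, assuming `alpha` trades off overall Recall against per-predicate mean Recall; details like top-K truncation are omitted, and the grouping argument is my own formulation:

```python
def reward(pred_triplets, gt_triplets, gt_by_predicate, alpha):
    """Hypothetical reward: alpha * Recall + (1 - alpha) * mRecall.
    gt_by_predicate maps each predicate to its list of GT triplets, so
    mRecall averages recall per predicate class (rare predicates count
    as much as frequent ones)."""
    pred = set(pred_triplets)
    recall = len(pred & set(gt_triplets)) / len(gt_triplets)
    per_pred = [len(pred & set(g)) / len(g) for g in gt_by_predicate.values()]
    m_recall = sum(per_pred) / len(per_pred)
    return alpha * recall + (1 - alpha) * m_recall
```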

Experiments

image

Qualitative Results

image

The probability assigned to the GT triplet is higher than when each triplet is predicted independently.