image

paper, code

TL;DR

  • I read this because.. : What else can you do with scene graphs (SG)? SG annotation has its limitations. Is it possible to parse a scene graph from a caption or an image-text pair?
  • task (proposed) : Given an image and caption, build a dependency tree for the caption and predict bboxes for the objects in the tree.
  • idea : encoder-decoder formulation
  • architecture : text features are word embeddings concatenated with POS embeddings; for the image, objects are detected first, then attributes/relations are predicted, then attention produces a context encoding. Given the context encoding, generate a parse tree and tag sequence.
  • objective : MLE trained with EM + a contrastive loss (over node representations and whether an image $\mathbf{I}$ is positive or negative)
  • baseline : DMV (dependency structure induction), MAF (visual grounding)
  • data : (proposed) VLParse
  • evaluation : Directed/Undirected Dependency Accuracy (DDA/UDA); Zero-Order Alignment Accuracy (is “a table” in the caption matched to the right bbox? IoU + attribute); First/Second-Order Alignment Accuracy (first-order is the text spans within the caption; second-order is the relationship between caption text and object bboxes, i.e., zero- and first-order combined)
  • result : Better performance compared to Language Structure Induction / Evaluation on Visual Phrase Grounding task
  • contribution : Propose a new dataset/baseline
  • limitation / things I could not understand : what exactly is meant by the decoder architecture?

Details

image

This is an illustration from the introduction, but the model doesn’t actually create a scene graph; it just utilizes scene-graph data when building the parse.

proposed data: VLParse

image image

Built with heuristics + human refinement

proposed task: Unsupervised Vision-Language Parsing

input : image $\mathbf{I}$, sentence $\mathbf{w} = \{w_1, w_2, \dots, w_N\}$. output : parse tree $\mathbf{pt}$. Each object node should also predict its box region. In this paper, Faster R-CNN is used to select candidate boxes and map objects to them.

image

architecture

Feature Extraction

  • Visual Feature
  • Faster R-CNN -> RoI -> $\{ v_i^o \}^M_{i=1}$ are the features of the OBJECT nodes
  • Each OBJECT node is tagged with an ATTRIBUTE, produced by $v_i^a = \mathrm{MLP}(v_i^o)$.
  • For each pair of OBJECTs, we add a zero-order node called RELATIONSHIP, $v^{img}_{i \to j, 0}$
  • All features except those of the OBJECT nodes are randomly initialized
  • Textual Feature
  • For each word $w_i$, concatenate a pretrained word embedding with a POS-tag embedding.
  • Biaffine score for the representation $w_{i \to j}$ between two words image
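The token features and the biaffine arc score can be sketched as below. All dimensions, the exact biaffine parameterization, and the random weights are illustrative assumptions, not the paper's values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the paper does not commit to these numbers.
d_word, d_pos = 8, 4          # word-embedding / POS-embedding dims
d = d_word + d_pos            # token feature = concatenation of both

def token_feature(word_emb, pos_emb):
    """Concatenate the pretrained word embedding with the POS embedding."""
    return np.concatenate([word_emb, pos_emb])

# One common biaffine form: s(i -> j) = w_i^T U w_j + u^T [w_i; w_j] + b.
# The paper's exact parameterization may differ.
U = rng.normal(size=(d, d))
u = rng.normal(size=2 * d)
b = 0.0

def biaffine_score(w_i, w_j):
    return float(w_i @ U @ w_j + u @ np.concatenate([w_i, w_j]) + b)

w1 = token_feature(rng.normal(size=d_word), rng.normal(size=d_pos))
w2 = token_feature(rng.normal(size=d_word), rng.normal(size=d_pos))
score = biaffine_score(w1, w2)  # a single scalar arc score
```

The biaffine form is directional, so $s(i \to j)$ and $s(j \to i)$ generally differ, which is what lets it score directed dependency arcs.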

Structure Construction

  • encoder

  • Create a contextual encoding c by performing attention operations on the text feature and visual feature.

  • Perform attention operations on the tokens $\{w_i\}$ in the caption and the scene-graph representations $\{v_i, v_{i \to j}\}$ and add them together to create a context vector $c_i$.

  • That is, $Q = v_i$, $K = w_i$, $V = w_i$. image

  • Create a global context vector $s$ by average pooling over all $c_i$

  • decoder

  • Generate a tag sequence $t$ and a parse tree $\mathbf{pt}$. The parse tree is built with dynamic programming
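A minimal numpy sketch of the encoder's context encoding, assuming a single un-projected attention head with $Q = v_i$ and $K = V = w$ (the real model uses learned projections):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def context_encoding(V, W):
    """V: (M, d) scene-graph node features; W: (N, d) token features.
    Each node attends over the caption tokens (Q = v_i, K = V = W),
    giving context vectors c_i; s is their average pool."""
    C = np.stack([softmax(v @ W.T) @ W for v in V])
    s = C.mean(axis=0)
    return C, s

rng = np.random.default_rng(0)
V = rng.normal(size=(3, 6))   # 3 scene-graph nodes (toy)
W = rng.normal(size=(5, 6))   # 5 caption tokens (toy)
C, s = context_encoding(V, W)
```

Each $c_i$ is a convex combination of token features, so it lives in the text space but is conditioned on a visual node; $s$ summarizes all of them.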

Cross-Modality Matching

  • matching score image
image

You can get the posterior with the above image
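One way to read the matching step: each phrase's contextual encoding induces a softmax posterior over candidate boxes. A sketch under the assumption that the score is a cosine similarity (the paper defines its own matching score):

```python
import numpy as np

def box_posterior(c, V):
    """c: (d,) contextual encoding of a phrase; V: (M, d) candidate
    box features. Returns a distribution over the M boxes via a
    softmax over cosine similarities (an assumed, simplified score)."""
    sims = V @ c / (np.linalg.norm(V, axis=1) * np.linalg.norm(c))
    e = np.exp(sims - sims.max())
    return e / e.sum()

rng = np.random.default_rng(0)
V = rng.normal(size=(4, 6))
p = box_posterior(V[2], V)  # phrase encoding identical to box 2
```

With the phrase encoding equal to one box feature, the posterior peaks on that box.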

Learning

MLE loss image

  • $t_i$ : tag sequence.
  • $\mathbf{pt}$ : parse tree. Here, the MLE loss is trained with the EM algorithm, without gold targets! E-step: generate parse trees given $\theta$; M-step: learn $\theta$ by gradient descent on the likelihood given the parse trees

where “tag” refers to a methodology for expressing dependency parsing as a tagging problem.
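The EM loop above can be sketched as a hard-EM skeleton. Everything below (the scalar model, `decode`, `grad`) is a toy stand-in: the real E-step runs the dynamic-programming parser and the real M-step backpropagates the likelihood.

```python
def hard_em(data, theta, decode, grad, lr=0.1, steps=50):
    """Hard-EM skeleton: alternately (E) predict the latent structure
    with the current parameters and (M) take gradient steps on the
    likelihood of the predicted structures."""
    for _ in range(steps):
        latents = [decode(x, theta) for x in data]      # E-step
        for x, t in zip(data, latents):                 # M-step
            theta += lr * grad(x, t, theta)
    return theta

# Toy instantiation: latent t in {-1, +1}, log-likelihood -(theta - t*x)^2.
def loglik(x, t, theta):
    return -(theta - t * x) ** 2

def decode(x, theta):
    return max((-1, 1), key=lambda t: loglik(x, t, theta))

def grad(x, t, theta):
    return 2 * (t * x - theta)   # d/dtheta of loglik

theta = hard_em([1.0, -2.0, 3.0], theta=0.1, decode=decode, grad=grad)
```

In the toy, the E-step flips each latent sign to match the data and the M-step pulls `theta` toward the (absolute) values, so it settles around 2.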

image

c.f. Parsing as Tagging

image
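As a concrete illustration of “parsing as tagging”: one simple scheme encodes each token's head as a relative-offset tag. This particular tag set is my example, not necessarily the one used in the paper.

```python
def tree_to_tags(heads):
    """heads[i] is the index of token i's head, or -1 for the root.
    Tag each token with the relative offset of its head (0 = root)."""
    return [0 if h == -1 else h - i for i, h in enumerate(heads)]

def tags_to_tree(tags):
    """Invert the encoding: recover head indices from offset tags."""
    return [-1 if t == 0 else i + t for i, t in enumerate(tags)]

# "a red table": "a" -> "table", "red" -> "table", "table" = root
heads = [2, 2, -1]
tags = tree_to_tags(heads)   # one tag per token
```

Because the encoding is invertible, predicting the tag sequence is equivalent to predicting the dependency tree.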

Contrastive loss image

  • $\mathbf{\hat{I}}$ : negative image
  • $c$ : contextual encoding. $c_i = \sum Attn(w_i, v_i)w_i$
  • $w_i$ : the i-th token in the caption
  • $v_i$: image feature of the object.
  • The sim function is just defined internally
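The contrastive term can be sketched as an InfoNCE-style loss over a positive image and sampled negatives, assuming cosine similarity for `sim` (as noted above, the paper's `sim` is its own internal definition):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_loss(c, v_pos, v_negs, tau=0.1):
    """Push the contextual encoding c toward the matching image's
    feature v_pos and away from negative-image features v_negs."""
    logits = np.array([cosine(c, v_pos)]
                      + [cosine(c, v) for v in v_negs]) / tau
    logits -= logits.max()                 # numerical stability
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))

c = np.array([1.0, 0.0])
good = contrastive_loss(c, np.array([1.0, 0.0]), [np.array([0.0, 1.0])])
bad = contrastive_loss(c, np.array([0.0, 1.0]), [np.array([1.0, 0.0])])
```

When the encoding matches the positive image the loss is near zero; when it matches a negative instead, the loss is large.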

Inference image

Enumerate all possible parse trees and pick the most likely one. And if we find the closest $v$ to each contextual encoding $c$, we can also create a scene graph (the relations presumably come from the caption?)
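The nearest-$v$ step reads as a simple argmax over similarities; a sketch assuming cosine similarity:

```python
import numpy as np

def ground_phrases(C, V):
    """C: (N, d) contextual encodings; V: (M, d) object box features.
    Assign each encoding the index of its most similar box."""
    Cn = C / np.linalg.norm(C, axis=1, keepdims=True)
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    return (Cn @ Vn.T).argmax(axis=1)

rng = np.random.default_rng(0)
V = rng.normal(size=(4, 6))
C = V[[2, 0]]                 # encodings identical to boxes 2 and 0
idx = ground_phrases(C, V)
```

Attaching the caption's relation words to the grounded object pairs then yields the scene graph described above.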

image

Result

image

etc.