
TL;DR
- I read this because: what else can you do with scene graphs? SG annotation has its limitations. Is it possible to parse a scene graph from a caption or an image-text pair?
- task (proposed): given an image and a caption, build a dependency tree for the caption and predict bboxes for the objects in the tree
- idea: encoder-decoder formulation
- architecture: on the text side, word embedding and POS embedding are concatenated; on the image side, objects come first, then attribute/relation prediction, then attention to create a context encoding. Given the context encoding, generate a parse tree and a tag sequence.
- objective: MLE trained with EM + a contrastive loss (representation of each node vs. whether image $\mathbf{I}$ is positive or negative)
- baseline: DMV (dependency structure induction), MAF (visual grounding)
- data : (proposed) VLParse
- evaluation : Directed / Undirected Dependency Accuracy (DDA/UDA); Zero-Order Alignment Accuracy (is “a table” in the caption matched with the right bbox? IoU + attribute); First/Second-Order Alignment Accuracy (first-order covers the dependencies within the caption text, second-order covers the relationship between the caption text and the object bboxes, i.e., zero-order + first-order combined)
- result : better performance on language structure induction and on the visual phrase grounding evaluation
- contribution : Propose a new dataset/baseline
- limitation / things I couldn’t understand: what exactly does the decoder architecture look like?
Details

This is an illustration from the introduction, but the method doesn’t actually create a scene graph; it just utilizes the scene-graph data when creating it.
proposed data: VLParse

Built with heuristics + human refinement
proposed task: Unsupervised Vision-Language Parsing
input : image $\mathbf{I}$, sentence $\mathbf{w} = \{w_1, w_2, \dots, w_N\}$; output : parse tree $\mathbf{pt}$. Each object node should also predict its box region. The paper uses Faster R-CNN to select candidate boxes and map objects to them.

architecture
Feature Extraction
- Visual Feature
  - Faster R-CNN -> RoI -> $\{v_i^o\}_{i=1}^M$ are the features of the OBJECT nodes
  - Each OBJECT node is tagged with an ATTRIBUTE node, created by $v_i^a = \mathrm{MLP}(v_i^o)$
  - For each pair of OBJECTs, a zero-order RELATIONSHIP node $v^{img}_{i \to j, 0}$ is added
  - All features except those of the OBJECT nodes are randomly initialized
- Textual Feature
  - For each word $w_i$, a POS-tag embedding and a pretrained word embedding are concatenated
  - A biaffine score gives the representation $w_{i \to j}$ between two words
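A minimal NumPy sketch of the feature construction above (the dimensions, the random stand-ins for pretrained embeddings, and the exact biaffine form are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Textual features: concat(pretrained word embedding, POS-tag embedding)
d_word, d_pos, N = 8, 4, 5                # hypothetical dims, 5-token caption
word_emb = rng.normal(size=(N, d_word))   # stand-in for pretrained embeddings
pos_emb = rng.normal(size=(N, d_pos))     # stand-in for POS-tag embeddings
w = np.concatenate([word_emb, pos_emb], axis=-1)  # (N, d) token features
d = d_word + d_pos

# Biaffine score for a candidate arc i -> j between two words
U = rng.normal(size=(d, d))
b = rng.normal(size=(2 * d,))

def biaffine(wi, wj):
    return wi @ U @ wj + b @ np.concatenate([wi, wj])

scores = np.array([[biaffine(w[i], w[j]) for j in range(N)] for i in range(N)])
```

`scores[i, j]` then plays the role of the arc representation $w_{i \to j}$.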

Structure Construction
encoder
Create a contextual encoding $c$ by attention between the text features and the visual features: attend from each scene-graph representation $\{v_i, v_{i \to j}\}$ over the caption tokens $\{w_i\}$ and sum the results into a context vector $c_i$.
Roughly, $Q = v_i$, $K = w_i$, $V = w_i$.

Create a global context vector $s$ by average pooling over all the $c_i$.
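The encoder step can be sketched as single-head dot-product attention with visual nodes as queries and caption tokens as keys/values (the shared embedding dimension and all sizes are assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
M, N, d = 3, 5, 6             # M visual nodes, N caption tokens, shared dim
v = rng.normal(size=(M, d))   # scene-graph node features (queries)
w = rng.normal(size=(N, d))   # token features (keys and values)

# c_i = sum_k Attn(v_i, w_k) * w_k, i.e. Q = v_i, K = w, V = w
c = np.stack([softmax(v_i @ w.T) @ w for v_i in v])  # (M, d)

# global context vector: average pooling over all c_i
s = c.mean(axis=0)
```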
decoder
Create a tag sequence $t$ and a parse tree $\mathbf{pt}$; the parse tree is built with dynamic programming.
Cross-Modality Matching
- matching score


The posterior can be obtained from the matching score above.

Learning
MLE loss

- $t_i$ : tag sequence
- $\mathbf{pt}$ : parse tree. The MLE loss is trained with the EM algorithm, without targets! E step: generate parse trees given $\theta$. M step: learn $\theta$ by gradient descent on the likelihood given the parse trees.
Here “tag” refers to a methodology that expresses dependency parsing as a tagging problem.

c.f. Parsing as Tagging
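The EM procedure can be sketched as hard (Viterbi) EM on a toy model: arc scores $\theta$ are the parameters, a greedy head-picker stands in for the paper's dynamic-programming parser, and the M step takes a gradient step on a log-linear head distribution. Everything here is a simplified stand-in, not the paper's actual model:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 3
theta = rng.normal(size=(N, N))   # theta[h, m]: score of arc h -> m (learnable)
lr = 0.1

def best_tree(scores):
    # toy E step: token 0 is fixed as root; every other token greedily
    # picks its best head (a stand-in for the paper's DP parser)
    s = scores.copy()
    np.fill_diagonal(s, -np.inf)
    return [-1] + [int(np.argmax(s[:, m])) for m in range(1, N)]

for _ in range(20):                    # hard (Viterbi) EM
    heads = best_tree(theta)           # E step: parse given current theta
    for m, h in enumerate(heads):      # M step: ascend the log-likelihood
        if h == -1:
            continue
        p = np.exp(theta[:, m]) / np.exp(theta[:, m]).sum()  # P(head | m)
        grad = -p
        grad[h] += 1.0                 # gradient of log P(h | m)
        theta[:, m] += lr * grad
```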

Contrastive loss

- $\mathbf{\hat{I}}$ : a negative image
- $c$ : contextual encoding, $c_i = \sum \mathrm{Attn}(w_i, v_i)\, w_i$
- $w_i$ : the $i$-th token in the caption
- $v_i$ : image feature of the object
- the sim function is just defined internally in the paper
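A sketch of the contrastive term as an InfoNCE-style loss over one positive image and several negatives; the cosine similarity and temperature are assumptions, since the paper only says sim is defined internally:

```python
import numpy as np

def sim(c, v):
    # assumed cosine similarity; the paper's sim is defined internally
    return (c @ v) / (np.linalg.norm(c) * np.linalg.norm(v) + 1e-8)

rng = np.random.default_rng(4)
d = 6
c = rng.normal(size=d)            # contextual encoding of the caption
v_pos = rng.normal(size=d)        # feature from the matching image I
v_negs = rng.normal(size=(5, d))  # features from negative images \hat{I}

tau = 0.1                         # assumed temperature
logits = np.array([sim(c, v_pos)] + [sim(c, v) for v in v_negs]) / tau
# pull the positive pair together, push the negatives apart
loss = -np.log(np.exp(logits[0]) / np.exp(logits).sum())
```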
Inference

Enumerate all possible parse trees and pick the most likely one. If we then find the closest $v$ to each contextual encoding $c$, we can also construct a scene graph (the relations being the ones in the caption, presumably?)
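For a short sentence, "all possible parse trees" can literally be enumerated; a brute-force sketch over single-head assignments (the paper uses dynamic programming instead, and the arc scores here are random stand-ins):

```python
from itertools import product
import numpy as np

def is_tree(heads):
    # heads[m] is the head of token m; -1 marks the root.
    # A valid tree has exactly one root and no cycles.
    if list(heads).count(-1) != 1:
        return False
    for i in range(len(heads)):
        seen, j = set(), i
        while heads[j] != -1:
            if j in seen:
                return False
            seen.add(j)
            j = heads[j]
    return True

rng = np.random.default_rng(2)
N = 4
arc = rng.normal(size=(N, N))   # arc[h, m]: score of head h -> modifier m
root = rng.normal(size=N)       # score of token m being the root

best, best_heads = -np.inf, None
for heads in product([-1] + list(range(N)), repeat=N):
    if not is_tree(heads):
        continue
    score = sum(root[m] if h == -1 else arc[h, m] for m, h in enumerate(heads))
    if score > best:
        best, best_heads = score, heads
```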

Result

- Since there is no ground-truth structure for the captions, it was built with an external parser; they used Visually Grounded Neural Syntax Acquisition https://arxiv.org/pdf/1906.02890.pdf
etc.
- finish reading and clean up the model part
- Is SGG used for visual grounding? - https://github.com/TheShadow29/awesome-grounding
- The datasets Flickr30k Entities / RefCOCO / RefCOCO+ / RefCOCOg have a lot of visual grounding work on them.
- There is a line of work called scene-graph grounding, which localizes objects given a scene graph + image, but what is it actually used for… - https://openaccess.thecvf.com/content/WACV2023/papers/Tripathi_Grounding_Scene_Graphs_on_Natural_Images_via_Visio-Lingual_Message_Passing_WACV_2023_paper.pdf
- Seems like it isn’t.