
TL;DR
- I read this because: what else can you do with scene graphs? SG annotation has its limitations. Is it possible to parse a scene graph from a caption or an image-text pair?
- task (proposed): given an image and a caption, build a dependency tree for the caption and predict bboxes for the objects in the tree
- idea: encoder-decoder formulation
- architecture: on the text side, word embedding and POS embedding are concatenated; on the image side, objects come first, then attribute/relation prediction, then attention to create a context encoding. Given the context encoding, generate a parse tree and a tag sequence.
- objective: MLE trained with EM + a contrastive loss (representation of each node vs. whether image $\mathbf{I}$ is positive or negative)
- baseline: DMV (dependency structure induction), MAF (visual grounding)
- data : (proposed) VLParse
- evaluation : Directed / Undirected Dependency Accuracy (DDA/UDA); Zero-Order Alignment Accuracy (is “a table” in the caption matched with the right bbox? IoU + attribute); First/Second-Order Alignment Accuracy (first-order covers the dependencies within the caption text, second-order covers the relationship between the caption text and the object bboxes, i.e., zero-order + first-order combined)
- result : better performance on language structure induction and on the visual phrase grounding evaluation
- contribution : Propose a new dataset/baseline
- limitation / things I couldn’t understand: what exactly does the decoder architecture look like?
Details

This is an illustration from the introduction, but the method doesn’t actually create a scene graph; it just utilizes the scene-graph data when creating it.
proposed data: VLParse

Built with heuristics + human refinement
proposed task: Unsupervised Vision-Language Parsing
input : image $\mathbf{I}$, sentence $\mathbf{w} = \{w_1, w_2, \dots, w_N\}$; output : parse tree $\mathbf{pt}$. Each object node should also predict its box region. The paper uses Faster R-CNN to select candidate boxes and map objects to them.

architecture
Feature Extraction
- Visual Feature
  - Faster R-CNN -> RoI -> $\{v_i^o\}_{i=1}^M$ are the features of the OBJECT nodes
  - Each OBJECT node is tagged with an ATTRIBUTE node, created by $v_i^a = \mathrm{MLP}(v_i^o)$
  - For each pair of OBJECTs, a zero-order RELATIONSHIP node $v^{img}_{i \to j, 0}$ is added
  - All features except those of the OBJECT nodes are randomly initialized
- Textual Feature
  - For each word $w_i$, a POS-tag embedding and a pretrained word embedding are concatenated
  - A biaffine score gives the representation $w_{i \to j}$ between two words
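A minimal NumPy sketch of the feature construction above (the dimensions, the random stand-ins for pretrained embeddings, and the exact biaffine form are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Textual features: concat(pretrained word embedding, POS-tag embedding)
d_word, d_pos, N = 8, 4, 5                # hypothetical dims, 5-token caption
word_emb = rng.normal(size=(N, d_word))   # stand-in for pretrained embeddings
pos_emb = rng.normal(size=(N, d_pos))     # stand-in for POS-tag embeddings
w = np.concatenate([word_emb, pos_emb], axis=-1)  # (N, d) token features
d = d_word + d_pos

# Biaffine score for a candidate arc i -> j between two words
U = rng.normal(size=(d, d))
b = rng.normal(size=(2 * d,))

def biaffine(wi, wj):
    return wi @ U @ wj + b @ np.concatenate([wi, wj])

scores = np.array([[biaffine(w[i], w[j]) for j in range(N)] for i in range(N)])
```

`scores[i, j]` then plays the role of the arc representation $w_{i \to j}$.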

Structure Construction
encoder
Create a contextual encoding $c$ by attention between the text features and the visual features: attend from each scene-graph representation $\{v_i, v_{i \to j}\}$ over the caption tokens $\{w_i\}$ and sum the results into a context vector $c_i$.
Roughly, $Q = v_i$, $K = w_i$, $V = w_i$.

Create a global context vector $s$ by average pooling over all the $c_i$.
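The encoder step can be sketched as single-head dot-product attention with visual nodes as queries and caption tokens as keys/values (the shared embedding dimension and all sizes are assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
M, N, d = 3, 5, 6             # M visual nodes, N caption tokens, shared dim
v = rng.normal(size=(M, d))   # scene-graph node features (queries)
w = rng.normal(size=(N, d))   # token features (keys and values)

# c_i = sum_k Attn(v_i, w_k) * w_k, i.e. Q = v_i, K = w, V = w
c = np.stack([softmax(v_i @ w.T) @ w for v_i in v])  # (M, d)

# global context vector: average pooling over all c_i
s = c.mean(axis=0)
```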
decoder
Create a tag sequence $t$ and a parse tree $\mathbf{pt}$; the parse tree is built with dynamic programming.
Cross-Modality Matching
- matching score


The posterior can be obtained from the matching score above.

Learning
MLE loss

- $t_i$ : tag sequence
- $\mathbf{pt}$ : parse tree. The MLE loss is trained with the EM algorithm, without targets! E step: generate parse trees given $\theta$. M step: learn $\theta$ by gradient descent on the likelihood given the parse trees.
Here “tag” refers to a methodology that expresses dependency parsing as a tagging problem.

c.f. Parsing as Tagging
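The EM procedure can be sketched as hard (Viterbi) EM on a toy model: arc scores $\theta$ are the parameters, a greedy head-picker stands in for the paper's dynamic-programming parser, and the M step takes a gradient step on a log-linear head distribution. Everything here is a simplified stand-in, not the paper's actual model:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 3
theta = rng.normal(size=(N, N))   # theta[h, m]: score of arc h -> m (learnable)
lr = 0.1

def best_tree(scores):
    # toy E step: token 0 is fixed as root; every other token greedily
    # picks its best head (a stand-in for the paper's DP parser)
    s = scores.copy()
    np.fill_diagonal(s, -np.inf)
    return [-1] + [int(np.argmax(s[:, m])) for m in range(1, N)]

for _ in range(20):                    # hard (Viterbi) EM
    heads = best_tree(theta)           # E step: parse given current theta
    for m, h in enumerate(heads):      # M step: ascend the log-likelihood
        if h == -1:
            continue
        p = np.exp(theta[:, m]) / np.exp(theta[:, m]).sum()  # P(head | m)
        grad = -p
        grad[h] += 1.0                 # gradient of log P(h | m)
        theta[:, m] += lr * grad
```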

Contrastive loss

- $\mathbf{\hat{I}}$ : a negative image
- $c$ : contextual encoding, $c_i = \sum \mathrm{Attn}(w_i, v_i)\, w_i$
- $w_i$ : the $i$-th token in the caption
- $v_i$ : image feature of the object
- the sim function is just defined internally in the paper
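A sketch of the contrastive term as an InfoNCE-style loss over one positive image and several negatives; the cosine similarity and temperature are assumptions, since the paper only says sim is defined internally:

```python
import numpy as np

def sim(c, v):
    # assumed cosine similarity; the paper's sim is defined internally
    return (c @ v) / (np.linalg.norm(c) * np.linalg.norm(v) + 1e-8)

rng = np.random.default_rng(4)
d = 6
c = rng.normal(size=d)            # contextual encoding of the caption
v_pos = rng.normal(size=d)        # feature from the matching image I
v_negs = rng.normal(size=(5, d))  # features from negative images \hat{I}

tau = 0.1                         # assumed temperature
logits = np.array([sim(c, v_pos)] + [sim(c, v) for v in v_negs]) / tau
# pull the positive pair together, push the negatives apart
loss = -np.log(np.exp(logits[0]) / np.exp(logits).sum())
```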
Inference

Enumerate all possible parse trees and pick the most likely one. If we then find the closest $v$ to each contextual encoding $c$, we can also construct a scene graph (the relations being the ones in the caption, presumably?)
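For a short sentence, "all possible parse trees" can literally be enumerated; a brute-force sketch over single-head assignments (the paper uses dynamic programming instead, and the arc scores here are random stand-ins):

```python
from itertools import product
import numpy as np

def is_tree(heads):
    # heads[m] is the head of token m; -1 marks the root.
    # A valid tree has exactly one root and no cycles.
    if list(heads).count(-1) != 1:
        return False
    for i in range(len(heads)):
        seen, j = set(), i
        while heads[j] != -1:
            if j in seen:
                return False
            seen.add(j)
            j = heads[j]
    return True

rng = np.random.default_rng(2)
N = 4
arc = rng.normal(size=(N, N))   # arc[h, m]: score of head h -> modifier m
root = rng.normal(size=N)       # score of token m being the root

best, best_heads = -np.inf, None
for heads in product([-1] + list(range(N)), repeat=N):
    if not is_tree(heads):
        continue
    score = sum(root[m] if h == -1 else arc[h, m] for m, h in enumerate(heads))
    if score > best:
        best, best_heads = score, heads
```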

Result

- Since there is no ground-truth structure for the captions, it was built with an external parser; they used Visually Grounded Neural Syntax Acquisition https://arxiv.org/pdf/1906.02890.pdf
etc.
- finish reading and clean up the model part
- Is SGG used for visual grounding? - https://github.com/TheShadow29/awesome-grounding
- The datasets Flickr30k Entities / RefCOCO / RefCOCO+ / RefCOCOg have a lot of visual grounding work on them.
- There is a line of work called scene-graph grounding, which localizes objects given a scene graph + image, but what is it actually used for… - https://openaccess.thecvf.com/content/WACV2023/papers/Tripathi_Grounding_Scene_Graphs_on_Natural_Images_via_Visio-Lingual_Message_Passing_WACV_2023_paper.pdf
- Seems like it isn’t.