
TL;DR
- I read this because.. : SGG Early Papers
- task : one-stage SGG
- problem : Detect objects without an RPN and also detect the relations between them
- IDEA : Borrow the idea of associative embeddings from multi-person pose estimation, where the network groups body joints with similar embeddings as belonging to the same person.
- architecture : hourglass + CNN + 1 x 1 conv to generate heatmaps of likely objects and likely relations. Features are taken at the GT locations for training and at the top-k activated pixels for inference. The object branch predicts an anchor-based box regression and the cls id. The relation branch predicts the relation class and the subject / object ids.
- objective : bbox regression loss + sigmoid loss for heatmap + ce for subject / object id + pull together loss + push apart loss
- baseline : VRD with Language Priors, Scene Graph Generation by Iterative Message Passing
- data : Visual Genome
- evaluation : SGGen, SGCls, PredCls
- result : SOTA
- contribution : first one-stage SGG
- limitation/things I cannot understand : The same feature vector has to make the predictions and is also trained with the additional pull together / push apart losses, and these objectives seem to point in different directions. It is interesting that all of this is learned in one space.
Details

A cut just because the picture is cute
Preliminaries : Hourglass network
https://deep-learning-study.tistory.com/617

A network similar to U-Net. It is used because pose estimation needs both local and global information.
Architecture

Detecting graph elements
image -> hourglass network -> CNN -> 1 x 1 conv + sigmoid to produce heatmaps for objects and relations (a relation is located at the midpoint of its sbj and obj) -> (for training) take the features at the GT vertex and edge locations, then
1) obj : predicts anchor-based offset regression, cls, and id, in the Faster R-CNN manner
2) rel : predicts rel cls, sbj (src in the paper) id, and obj (dest in the paper) id
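
The inference-side steps of this pipeline can be sketched roughly in plain Python. This is only an illustration, not the paper's code: `top_k_peaks` and `edge_location` are hypothetical helper names, the heatmap is a toy nested list, and both the midpoint grounding of a relation and the flooring to a pixel are assumptions made for the sketch.

```python
import heapq


def top_k_peaks(heatmap, k):
    """Return the k highest-scoring cells of a 2D heatmap as (row, col, score)."""
    cells = [
        (score, r, c)
        for r, row in enumerate(heatmap)
        for c, score in enumerate(row)
    ]
    # nlargest sorts by score first because score is the first tuple element.
    best = heapq.nlargest(k, cells)
    return [(r, c, score) for score, r, c in best]


def edge_location(src_center, dst_center):
    """Assumed grounding of a relation: midpoint of subject and object centers (floored)."""
    (r1, c1), (r2, c2) = src_center, dst_center
    return ((r1 + r2) // 2, (c1 + c2) // 2)


# Toy sigmoid heatmap (values in [0, 1]).
heatmap = [
    [0.1, 0.2, 0.1],
    [0.0, 0.9, 0.3],
    [0.7, 0.1, 0.0],
]
peaks = top_k_peaks(heatmap, k=2)
```

During training the GT locations would be used instead of `top_k_peaks`, as described above.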
Connecting elements with associative embeddings
So far we have only extracted objects and relations separately, so now they need to be connected. Each vertex gets an embedding vector, which has to be learned so that it differs from vertex to vertex. Each edge has to produce embeddings that identify its subject and object. To achieve this, the paper adds a pull together loss and a push apart loss.
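
As a rough illustration of how such embeddings could be used to connect elements at inference, each id embedding predicted by an edge can be matched to the vertex whose embedding is closest. This is a minimal sketch: `nearest_vertex` is a hypothetical helper, and matching by squared Euclidean distance is an assumption, not something stated in these notes.

```python
def nearest_vertex(edge_embedding, vertex_embeddings):
    """Return the index of the vertex embedding closest to edge_embedding."""

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(
        range(len(vertex_embeddings)),
        key=lambda i: sq_dist(edge_embedding, vertex_embeddings[i]),
    )


# Toy example: three detected vertices and one edge that predicts
# a subject-id embedding and an object-id embedding.
vertex_embeddings = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
edge = {"sbj_emb": [0.9, 1.2], "obj_emb": [4.6, 5.3]}

sbj_index = nearest_vertex(edge["sbj_emb"], vertex_embeddings)
obj_index = nearest_vertex(edge["obj_emb"], vertex_embeddings)
```

Here the edge would be read as a relation from vertex `sbj_index` to vertex `obj_index`.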
pull together loss

$h_i\in\mathbb{R}^d$ : embedding of vertex $v_i$
$h'_{ik}$ : embeddings of all edges connected to vertex $v_i$, for $k=1,\dots,K_i$.
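
With these symbols, a pull together term can be sketched as below. The exact form is an assumption for illustration (mean squared distance between each vertex embedding and the embeddings of the edges connected to it); `pull_loss` is a hypothetical name, and the paper's formula should be checked for the actual normalization.

```python
def pull_loss(vertex_emb, edge_refs):
    """Assumed pull term: mean squared distance between h_i and each h'_ik.

    vertex_emb: list of vectors, vertex_emb[i] is h_i.
    edge_refs:  list of lists, edge_refs[i][k] is h'_ik (may be empty).
    """
    total, count = 0.0, 0
    for h_i, refs in zip(vertex_emb, edge_refs):
        for h_ik in refs:
            total += sum((a - b) ** 2 for a, b in zip(h_i, h_ik))
            count += 1
    # No edge references at all means there is nothing to pull together.
    return total / count if count else 0.0


# Two vertices with 1-d embeddings; vertex 0 has two edge references,
# vertex 1 has one.
vertex_emb = [[0.0], [2.0]]
edge_refs = [[[1.0], [0.0]], [[2.0]]]
loss = pull_loss(vertex_emb, edge_refs)
```

The loss goes to zero only when every edge reference coincides with the embedding of the vertex it points to.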
push apart loss

To let different vertices have different embeddings, a push apart loss is used.
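
The push apart idea can be sketched as a margin-based hinge over pairs of vertex embeddings. This form is an assumption for illustration, not taken from these notes: `push_loss` and the `margin` parameter are hypothetical, and the distance is plain Euclidean.

```python
import math


def push_loss(vertex_emb, margin=1.0):
    """Assumed push term: sum over vertex pairs of max(0, margin - ||h_i - h_j||)."""
    total = 0.0
    n = len(vertex_emb)
    for i in range(n - 1):
        for j in range(i + 1, n):
            dist = math.sqrt(
                sum((a - b) ** 2 for a, b in zip(vertex_emb[i], vertex_emb[j]))
            )
            # Only pairs closer than the margin are penalized.
            total += max(0.0, margin - dist)
    return total


# Vertices 0 and 1 are closer than the margin, so only that pair contributes.
vertex_emb = [[0.0], [0.5], [3.0]]
loss = push_loss(vertex_emb, margin=1.0)
```

Pairs that are already farther apart than the margin contribute nothing, so the term only separates vertices that could be confused with each other.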
Result
