
TL;DR
- I read this because : it is one of the early two-stage SGG papers
- task : two-stage SGG
- problem : builds on prior two-stage SGG work; before this paper there were Neural Motifs, #104, SGG with Iterative Message Passing, etc.
- idea : build an enhanced embedding for each object and predict from it!
- architecture : Faster R-CNN + an embedding representing each object, used to classify the relation for each of the $O(n^2)$ object pairs. The embedding combines the global image feature, the embedding of the class predicted by the detector, the RoI visual feature, and relative geometric information.
- objective : 1) image-level multi-label object-class loss 2) per-object classification loss 3) relation classification loss
- baseline : Neural Motifs, #104, SGG with Iterative Message Passing
- data : Visual Genome
- evaluation : SGdet, SGcls, PredCls
- result : SOTA
- contribution : simplicity!
Details
Architecture

Global Context Encoding Module: average-pool the backbone feature map, then an FC layer for image-level multi-label classification.
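A minimal NumPy sketch of this module: global average pooling over the spatial dimensions followed by a fully connected layer with per-class sigmoids (multi-label, so no softmax). All shapes and weights are toy values, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy backbone feature map: (C, H, W). Sizes are illustrative only.
C, H, W, num_classes = 8, 4, 4, 5
feat = rng.standard_normal((C, H, W))

# Global average pooling over the spatial dims -> (C,)
pooled = feat.mean(axis=(1, 2))

# FC layer for image-level multi-label classification (hypothetical weights).
W_fc = rng.standard_normal((num_classes, C))
b_fc = np.zeros(num_classes)
logits = W_fc @ pooled + b_fc

# Multi-label output: an independent sigmoid per class.
probs = 1.0 / (1.0 + np.exp(-logits))
```

The sigmoid (rather than softmax) head is what makes this a multi-label classifier: each object category is predicted present/absent independently.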
Relation Embedding Module: for each object $i$, the object feature $o_i$ is built from the embedding of the detector's predicted class $l_i$, the RoI-pooled visual feature, and the image-wide context feature $c$; stacked FC layers on top of this embedding then predict the object class.
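A sketch of that per-object embedding, assuming a simple concatenate-then-MLP form (dimensions, a single ReLU hidden layer, and all weights are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

emb_dim, roi_dim, ctx_dim, hid, num_obj_cls = 16, 32, 8, 24, 10

# Hypothetical inputs for one detected object i.
label_emb = rng.standard_normal(emb_dim)  # embedding of the predicted class l_i
roi_feat = rng.standard_normal(roi_dim)   # RoI-pooled visual feature
ctx_feat = rng.standard_normal(ctx_dim)   # image-wide context feature c

# Object embedding o_i: stacked FC layers over the concatenation.
x = np.concatenate([label_emb, roi_feat, ctx_feat])
W1 = rng.standard_normal((hid, x.size))
b1 = np.zeros(hid)
W2 = rng.standard_normal((num_obj_cls, hid))
b2 = np.zeros(num_obj_cls)

h = np.maximum(W1 @ x + b1, 0.0)  # ReLU hidden layer
obj_logits = W2 @ h + b2          # refined object-class logits
```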



Relative geometric features of the box pair are also included when classifying a relation.
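One common way to encode the relative geometry of a subject/object box pair in SGG is normalized center offsets plus log size ratios; the exact encoding below is an assumption, not the paper's formula.

```python
import math

def rel_geometry(box_s, box_o):
    """Relative geometry of an (x, y, w, h) box pair (hypothetical encoding)."""
    xs, ys, ws, hs = box_s
    xo, yo, wo, ho = box_o
    return [
        (xo - xs) / ws,     # x offset, normalized by subject width
        (yo - ys) / hs,     # y offset, normalized by subject height
        math.log(wo / ws),  # log width ratio
        math.log(ho / hs),  # log height ratio
    ]

print(rel_geometry((10, 10, 20, 20), (20, 15, 10, 40)))
```

Because offsets are normalized and sizes enter as ratios, the feature is invariant to image-wide translation and scaling of the pair.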

Loss
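The three loss terms from the TL;DR (image-level multi-label loss, per-object classification loss, relation classification loss) can be sketched as below; the toy inputs and unweighted sum are assumptions, since the note does not give the balancing weights.

```python
import numpy as np

def bce(probs, targets):
    """Multi-label (per-class binary cross-entropy) image-level loss."""
    probs = np.clip(probs, 1e-7, 1.0 - 1e-7)
    return float(-(targets * np.log(probs) + (1 - targets) * np.log(1 - probs)).mean())

def ce(logits, target):
    """Softmax cross-entropy for a single ground-truth label."""
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    return float(-logp[target])

# Toy terms for one training example (values are illustrative only).
L_img = bce(np.array([0.9, 0.2]), np.array([1.0, 0.0]))  # image-level multi-label
L_obj = ce(np.array([2.0, 0.5, -1.0]), 0)                # per-object class
L_rel = ce(np.array([0.1, 1.5]), 1)                      # relation class
total = L_img + L_obj + L_rel
```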

Result
