
TL;DR
- task : image-to-graph generation
- problem : Two-stage image-to-graph generation models are complex, and their pair-wise relation modeling costs O(n**2).
- idea : Instead of pair-wise interaction between entities (=> O(n**2)), let each entity interact with a shared relation token.

- architecture : CNN backbone + deformable DETR (encoder, decoder with N [obj] tokens + 1 [rln] token) + object detection head and relation prediction head.

- objective : bbox loss (gIoU + regression loss) + cross-entropy for entity class + cross-entropy loss for relations between the objects selected by Hungarian matching.
- baseline : two-stage models, FCSGG, #40
- data : Toulouse, 20 US Cities, DeepVesselNet, and Visual Genome.
- result : on SGG, SOTA without extra features (GloVe word vectors, knowledge graph).
- contribution : simple architecture with inductive bias!
Details
Parameter

Added log softmax, frequency-bias.
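
The note does not say how the frequency bias is computed. A minimal sketch, assuming it is the usual SGG prior (the log of the empirical predicate frequency for each subject/object class pair, added to the relation logits before the log softmax), might look like this. The function names, the smoothing constant `eps`, and the toy triples are all illustrative, not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def build_frequency_bias(triples, num_predicates, eps=1e-3):
    """Map (subj_cls, obj_cls) -> list of smoothed log p(pred | subj_cls, obj_cls).

    `triples` is an iterable of (subj_cls, pred, obj_cls) from the training set.
    `eps` is an assumed smoothing constant so unseen predicates get a finite log.
    """
    counts = defaultdict(Counter)
    for subj_cls, pred, obj_cls in triples:
        counts[(subj_cls, obj_cls)][pred] += 1
    bias = {}
    for pair, ctr in counts.items():
        total = sum(ctr.values()) + eps * num_predicates
        bias[pair] = [math.log((ctr[p] + eps) / total) for p in range(num_predicates)]
    return bias


def log_softmax(logits):
    """Numerically stable log softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def biased_log_probs(logits, subj_cls, obj_cls, bias):
    """Add the class-pair prior to the logits, then apply log softmax."""
    prior = bias.get((subj_cls, obj_cls), [0.0] * len(logits))  # no prior for unseen pairs
    return log_softmax([l + b for l, b in zip(logits, prior)])


# Toy data: class pair (0, 2) is seen with predicate 1 twice and predicate 2 once.
triples = [(0, 1, 2), (0, 1, 2), (0, 2, 2), (1, 0, 2)]
bias = build_frequency_bias(triples, num_predicates=3)
log_probs = biased_log_probs([0.0, 0.0, 0.0], subj_cls=0, obj_cls=2, bias=bias)
```

With uninformative logits, the prediction falls back to the most frequent predicate for that class pair, which is the point of the bias.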

Relation Prediction Head
pair-wise [obj] tokens, shared [rln] token -> $\mathrm{MLP}_{rln}(\{o^i, r, o^j\})_{i \neq j}$
- For the k objects kept by object detection, each of the k(k-1) ordered pairs is concatenated with the [rln] token output and passed through a 3-layer FCN to predict the relation. -> still $O(n^2)$!
- MLP -> 3-layer FCN + LN
- In cases like SGG, the order of the pair determines which is the subject and which is the object.
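
The pair-wise head described above can be sketched in plain Python to make the k(k-1) count concrete. This is only an illustration under assumptions: the note does not say where the LN sits (here it is applied to the concatenated input), and the dimensions, ReLU activations, and random weights are made up for the toy run.

```python
import itertools
import random


def linear(x, w, b):
    """Dense layer: w is a list of output rows, b a list of biases."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]


def init_linear(rng, n_in, n_out):
    """Random weights for the toy example only."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out


def relation_head(obj_tokens, rln_token, layers):
    """Return {(i, j): relation logits} for every ordered pair with i != j."""
    out = {}
    for i, j in itertools.permutations(range(len(obj_tokens)), 2):
        # Concatenate {o^i, r, o^j}; the order encodes subject vs. object.
        x = layer_norm(obj_tokens[i] + rln_token + obj_tokens[j])  # LN placement is assumed
        for idx, (w, b) in enumerate(layers):
            x = linear(x, w, b)
            if idx < len(layers) - 1:
                x = [max(0.0, v) for v in x]  # ReLU between layers (assumed)
        out[(i, j)] = x
    return out


rng = random.Random(0)
d, k, num_relations = 8, 4, 5  # toy sizes
obj_tokens = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(k)]
rln_token = [rng.gauss(0, 1) for _ in range(d)]
layers = [init_linear(rng, 3 * d, 16), init_linear(rng, 16, 16), init_linear(rng, 16, num_relations)]
logits = relation_head(obj_tokens, rln_token, layers)
```

For k = 4 this yields 12 logit vectors, one per ordered pair, so the head itself still scales quadratically in the number of detected objects even though the decoder only carries one extra token.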
The authors’ claims about [rln] tokens
- a relation has a higher-order topology than an object, so it requires additional expressive capacity
- the [rln] token reduces the burden on [obj] tokens of carrying relation information
- [obj] tokens engage in global semantic reasoning as the [rln] token competes for attention with [obj] tokens
Compared to SGTR,
- No distinction between entity and subject / object -> the entity loss is computed only once!
- In SGTR, the image feature was still explicitly put into the model, which is not the case here.
Loss
Stochastic Relation Loss

For the objects matched to a GT object by the Hungarian matcher, a cross-entropy loss is computed over pair-wise relations. A pair is labeled "valid" if a relation exists and "background" if it doesn’t; since background pairs are far more numerous, they are subsampled so that valid : background is 1:3.
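
The subsampling step can be sketched as follows. This assumes ordered (directed) pairs, background encoded as label 0, and uniform random sampling of background pairs; the function name and the `neg_per_pos` parameter are hypothetical.

```python
import itertools
import random

BACKGROUND = 0  # assumed label id for "no relation"


def sample_relation_pairs(num_objects, gt_edges, neg_per_pos=3, rng=None):
    """Pick the pairs that enter the relation loss.

    gt_edges maps an ordered pair (i, j) of matched object indices to its
    relation label (>= 1). All valid pairs are kept; background pairs are
    subsampled to at most `neg_per_pos` per valid pair.
    """
    rng = rng or random.Random()
    valid = [(pair, label) for pair, label in sorted(gt_edges.items())]
    background = [
        pair
        for pair in itertools.permutations(range(num_objects), 2)
        if pair not in gt_edges
    ]
    n_bg = min(len(background), neg_per_pos * len(valid))
    sampled_bg = [(pair, BACKGROUND) for pair in rng.sample(background, n_bg)]
    return valid + sampled_bg


# Toy example: 6 matched objects give 30 ordered pairs, only 2 of them valid.
gt_edges = {(0, 1): 3, (2, 4): 1}
samples = sample_relation_pairs(6, gt_edges, rng=random.Random(0))
```

Here the loss would see 2 valid and 6 background pairs instead of 2 valid and 28 background pairs, which keeps the background class from dominating the cross-entropy.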
Ablation
Ablation with and without the [rln] token

Large performance differences
Results
