image

paper , code

TL;DR

  • task : image-to-graph generation
  • problem : Two-stage image-to-graph generation models are complex, and their pair-wise relation prediction has a complexity of $O(n^2)$.
  • idea : Instead of pair-wise interaction between entities (=> $O(n^2)$), let a relation token interact with the entities. image
  • architecture : CNN backbone + Deformable DETR (encoder, decoder with N [obj] tokens + 1 [rln] token) + object detection head and relation prediction head. image
  • objective : bbox loss (gIoU + regression loss) + cross-entropy loss for entity class + cross-entropy loss for relations between the objects selected by Hungarian matching.
  • baseline : two-stage models, FCSGG, #40
  • data : Toulouse, 20 US Cities, DeepVesselNet, and Visual Genome.
  • result : On SGG, SOTA without extra features (GloVe word vectors, knowledge graph).
  • contribution : simple architecture with inductive bias!

Details

Parameter

image image

Added log softmax, frequency-bias.

image

Relation Prediction Head

pair-wise [obj] tokens, shared [rln] token -> $\mathrm{MLP}_{rln}(\{o^i, r, o^j\})_{i \neq j}$

  • For the k objects kept by object detection, each of the k(k-1) ordered pairs is concatenated with the output of the [rln] token and passed through a 3-layer FCN to get the relation. -> still $O(n^2)$!

MLP -> 3-layer FCN + LN. In cases like SGG, the order of the pair determines which is the subject and which is the object.
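
To make the pair construction above concrete, here is a minimal sketch in plain Python. It only builds the concatenated inputs $[o^i, r, o^j]$ for every ordered pair; the head itself is passed in as a callable. The names `build_pair_inputs`, `predict_relations`, and `relation_head` are hypothetical and not taken from the authors' code, and the toy vectors are made up for illustration.

```python
from itertools import permutations
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
Pair = Tuple[int, int]


def build_pair_inputs(obj_tokens: Sequence[Vector], rln_token: Vector) -> Dict[Pair, Vector]:
    """Concatenate [o_i, r, o_j] for every ordered pair (i, j) with i != j.

    Ordered pairs are used because (i, j) and (j, i) swap subject and object.
    """
    inputs: Dict[Pair, Vector] = {}
    for i, j in permutations(range(len(obj_tokens)), 2):
        inputs[(i, j)] = list(obj_tokens[i]) + list(rln_token) + list(obj_tokens[j])
    return inputs


def predict_relations(
    obj_tokens: Sequence[Vector],
    rln_token: Vector,
    relation_head: Callable[[Vector], int],
) -> Dict[Pair, int]:
    """Apply a relation head (assumed here: any callable) to each pair input."""
    return {pair: relation_head(x) for pair, x in build_pair_inputs(obj_tokens, rln_token).items()}


# Toy example: k = 3 objects with 2-dim tokens and one shared relation token.
obj_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
rln_token = [9.0, 9.0]
pair_inputs = build_pair_inputs(obj_tokens, rln_token)
```

With k = 3 this yields k(k-1) = 6 inputs, which is why the head itself remains quadratic in the number of detected objects even though the decoder only carries one extra token.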

The authors’ claims about [rln] tokens

  • a relation has a higher-order topology than an object, so it requires additional expressive capacity
  • the [rln] token reduces the burden on the [obj] tokens of carrying relation information
  • [obj] tokens engage in global semantic reasoning, since the [rln] token competes for attention with the [obj] tokens

Compared to SGTR,

  • No distinction between entity and subject / object -> the entity loss is computed only once!
  • In SGTR, the image feature was still explicitly put into the model, which is not the case here.

Loss

Stochastic Relation Loss

image

For the objects matched to ground-truth objects by the Hungarian matcher, a cross-entropy loss is computed over the pair-wise relations. A pair is labeled "valid" if a relation exists between the two objects and "background" if it does not. Since background pairs far outnumber valid ones, they are sampled so that valid and background pairs are at a 1:3 ratio.
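
The sampling step can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the function name `sample_relation_pairs`, the `bg_per_valid` parameter, and the choice of uniform sampling without replacement are all hypothetical, and the 1:3 ratio is read as three background pairs per valid pair.

```python
import random
from typing import List, Optional, Sequence, Tuple

Pair = Tuple[int, int]


def sample_relation_pairs(
    valid_pairs: Sequence[Pair],
    num_objects: int,
    bg_per_valid: int = 3,
    seed: Optional[int] = None,
) -> Tuple[List[Pair], List[Pair]]:
    """Keep all valid pairs and sample background pairs at a fixed ratio.

    valid_pairs: ordered (subject, object) index pairs that have a GT relation.
    num_objects: number of matched objects; all ordered pairs i != j are candidates.
    """
    rng = random.Random(seed)
    valid = set(valid_pairs)
    background = [
        (i, j)
        for i in range(num_objects)
        for j in range(num_objects)
        if i != j and (i, j) not in valid
    ]
    # Cap at the number of available background pairs.
    n_bg = min(len(background), bg_per_valid * len(valid))
    return sorted(valid), rng.sample(background, n_bg)


# Toy example: 5 matched objects, 2 ground-truth relations.
valid, background = sample_relation_pairs([(0, 1), (2, 3)], num_objects=5, seed=0)
```

The cross-entropy loss would then be computed only over `valid + background`, with the background pairs assigned the "no relation" class.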

Ablation

Ablation with and without the [rln] token

image

The performance difference is large.

Results

image