[51] Structured Sparse R-CNN for Direct Scene Graph Generation

paper , code

TL;DR

task : one-stage Scene Graph Generation
problem : SGG should be solved as a single set prediction, with one integrated model covering object detection, relation graph construction, and relation prediction.
idea : Use triplet queries, analogous to the region proposal queries in Sparse R-CNN, inject priors for object pairs and relations into them, and use a triplet detector that predicts object detection and relations in parallel.
architecture : CNN with FPN as backbone. The triplet query consists of bbox, obj vec, and rel vec. Box features are extracted with RoIAlign and the remaining features through MHSA. The bbox and obj vec features are used for object detection, and the rel vec features are fused with them to predict the relation.
objective : bbox loss, CE loss for relation and object cls
baseline : IMP, G-RCNN, MOTIF, transformer, vctree …
data : Visual Genome, Open Images
result : SOTA.
contribution : Looks like a two-stage SGG, but is it one-stage? If it’s one-stage, it performs very well.
limitation or something I don’t understand : Sharing parameters with the Siamese Sparse R-CNN… is this like distillation? 🤔 The question of whether GT objects could have been matched within one model and given an object detection loss is not resolved… or could unrelated object pairs have been combined to create triplets?!

Details

Sparse R-CNN

https://github.com/long8v/PTIR/issues/58

Architecture

Triplet query

Express the general distribution of triplets with learnable queries, like the queries in Sparse R-CNN
2 proposal box coordinates : 4d each
2 object content vectors (representing appearance, acting like the proposal features in Sparse R-CNN) : 1024, 256
1 relation content vector (structure information between the objects) : 1024, 256
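As a rough illustration of the query layout listed above, here is a minimal sketch in plain Python. The class name `TripletQuery`, the helper `init_triplet_queries`, the default `dim=256`, the normalized `(cx, cy, w, h)` box format, and the whole-image box initialization are all assumptions made for the example, not details confirmed by these notes. In a real model these fields would be learnable parameters rather than Python lists.

```python
from dataclasses import dataclass
from typing import List
import random


@dataclass
class TripletQuery:
    """One triplet query: two boxes, two object vectors, one relation vector."""
    subject_box: List[float]   # 4-d proposal box for the subject
    object_box: List[float]    # 4-d proposal box for the object
    subject_vec: List[float]   # object content vector (appearance) for the subject
    object_vec: List[float]    # object content vector (appearance) for the object
    relation_vec: List[float]  # relation content vector (structure between objects)


def init_triplet_queries(num_queries: int, dim: int = 256, seed: int = 0) -> List[TripletQuery]:
    """Create `num_queries` queries (hypothetical helper, for illustration only)."""
    rng = random.Random(seed)

    def rand_vec() -> List[float]:
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]

    # Assumed initialization: every box starts as the whole image in (cx, cy, w, h).
    whole_image = [0.5, 0.5, 1.0, 1.0]
    return [
        TripletQuery(list(whole_image), list(whole_image), rand_vec(), rand_vec(), rand_vec())
        for _ in range(num_queries)
    ]


queries = init_triplet_queries(num_queries=3, dim=8)
print(len(queries), len(queries[0].subject_box), len(queries[0].relation_vec))
```

Keeping the three content vectors separate mirrors the split described above: the object vectors feed object detection, while the relation vector carries the pairwise structure.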

Triplet detection head

Object pair detection

MSA is applied over the object vectors, and a pair fusion module is added on top to improve on it.

$X_s'$ and $X_o'$ are used as the query and key, and the value is the two object vectors themselves. The enhanced object feature is then used for Dynamic Conv.

Relation recognition

The relation feature is also extracted with RoIAlign, from the largest region spanned by the boxes, and is fused with E2R on top of DyConv+.
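If "the largest region" is read as the smallest box enclosing both the subject and the object box (an interpretation, not something these notes state explicitly), the region can be computed as below. The function name `union_box` and the `(x1, y1, x2, y2)` corner format are assumptions for the sketch.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) corner format


def union_box(subject_box: Box, object_box: Box) -> Box:
    """Smallest axis-aligned box that encloses both input boxes."""
    return (
        min(subject_box[0], object_box[0]),
        min(subject_box[1], object_box[1]),
        max(subject_box[2], object_box[2]),
        max(subject_box[3], object_box[3]),
    )


print(union_box((10, 20, 50, 60), (30, 10, 80, 40)))  # (10, 10, 80, 60)
```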

Learning with Siamese Sparse R-CNN

Ground-truth triplets are too sparse to learn from alone. A Siamese Sparse R-CNN, sharing parameters with the Structured Sparse R-CNN, is used as an object detector, and the virtual object pairs it produces serve as pseudo-labels for knowledge distillation.

two-stage triplet label assignment

Match the ground-truth triplets with the predicted triplets
For predicted triplets not matched to a ground truth, match them with the object pairs output by the Siamese Sparse R-CNN. For the remaining triplets, leave the box as is and replace only the object classification score with the label. Hungarian matching is applied between the pseudo-labels from the Siamese Sparse R-CNN and the objects of the remaining triplets, using a matching cost.

Then pad the relation with background and compute the loss
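The two-stage assignment can be sketched as follows. Because the matching cost itself is not given in these notes, the sketch takes precomputed cost matrices as input: `gt_cost[i][j]` and `pseudo_cost[i][j]` are hypothetical costs between predicted triplet `i` and ground-truth triplet or pseudo-label pair `j`. The function names are illustrative, and a brute-force search stands in for the Hungarian algorithm so that the example needs only the standard library; in practice one would typically use `scipy.optimize.linear_sum_assignment`.

```python
from itertools import permutations
from typing import Dict, List, Optional, Tuple


def min_cost_matching(cost: List[List[float]]) -> Dict[int, int]:
    """Optimal one-to-one matching, returned as {prediction index: target index}.

    Brute force, so only suitable for tiny examples. Assumes there are at
    least as many predictions (rows) as targets (columns).
    """
    n_pred = len(cost)
    n_tgt = len(cost[0]) if cost else 0
    if n_tgt > n_pred:
        raise ValueError("need at least as many predictions as targets")
    best, best_cost = (), float("inf")
    for preds in permutations(range(n_pred), n_tgt):
        total = sum(cost[p][t] for t, p in enumerate(preds))
        if total < best_cost:
            best, best_cost = preds, total
    return {p: t for t, p in enumerate(best)}


def two_stage_assign(
    gt_cost: List[List[float]], pseudo_cost: List[List[float]]
) -> Dict[int, Tuple[str, Optional[int]]]:
    """Stage 1: match GT triplets. Stage 2: match pseudo-labels to leftovers."""
    stage1 = min_cost_matching(gt_cost)
    leftover = [i for i in range(len(gt_cost)) if i not in stage1]
    local = min_cost_matching([pseudo_cost[i] for i in leftover])
    stage2 = {leftover[i]: j for i, j in local.items()}

    labels: Dict[int, Tuple[str, Optional[int]]] = {}
    for i in range(len(gt_cost)):
        if i in stage1:
            labels[i] = ("gt", stage1[i])
        elif i in stage2:
            # Object targets come from the pseudo-label; relation is background.
            labels[i] = ("pseudo", stage2[i])
        else:
            labels[i] = ("background", None)
    return labels


# 3 predicted triplets, 1 ground-truth triplet, 1 pseudo-label pair.
gt_cost = [[0.9], [0.1], [0.8]]
pseudo_cost = [[0.2], [0.5], [0.7]]
print(two_stage_assign(gt_cost, pseudo_cost))
```

In this toy case prediction 1 takes the ground-truth triplet, prediction 0 takes the pseudo-label pair, and prediction 2 is left as background.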

Imbalance Class Distribution

Adaptive focusing parameter : reduce the weight of classes that are too dominant in object classification.
Logit adjustment
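For reference, logit adjustment in its common form adds `tau * log(prior)` to each class logit before the softmax cross-entropy, so that errors on rare classes are penalized more. The sketch below shows that generic form only; whether the paper uses exactly this variant, and the names `class_priors`, `logit_adjusted_ce`, and the default `tau=1.0`, are assumptions.

```python
import math
from typing import List


def class_priors(counts: List[int]) -> List[float]:
    """Empirical class frequencies from per-class training counts."""
    total = sum(counts)
    return [c / total for c in counts]


def logit_adjusted_ce(logits: List[float], target: int, priors: List[float], tau: float = 1.0) -> float:
    """Cross-entropy on logits shifted by tau * log(prior) (generic logit adjustment)."""
    adjusted = [z + tau * math.log(p) for z, p in zip(logits, priors)]
    m = max(adjusted)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(a - m) for a in adjusted))
    return log_norm - adjusted[target]


priors = class_priors([900, 100])  # class 0 is frequent, class 1 is rare
same_logits = [0.0, 0.0]
loss_frequent = logit_adjusted_ce(same_logits, target=0, priors=priors)
loss_rare = logit_adjusted_ce(same_logits, target=1, priors=priors)
print(loss_frequent < loss_rare)  # True
```

With identical raw logits, the rare class receives the larger loss, which pushes the model to assign it a larger margin during training.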
