
TL;DR
- task : one-stage Scene Graph Generation
- problem : Solve SGG as a single set prediction with one integrated model, instead of separate steps for object detection, relation graph construction, and relation prediction.
- idea : Use triplet queries, analogous to the region proposal queries in Sparse R-CNN, inject a prior for object pairs and relations into them, and use a triplet detector that predicts object detection and relations in parallel.
- architecture : CNN with FPN as backbone. A triplet query consists of bboxes, object vectors, and a relation vector. RoIAlign extracts features for each box, and the remaining features are refined through MHSA. The bbox and object-vector features are used for object detection, and the relation-vector features are fused with those features to predict the relation.
- objective : bbox loss, CE loss for relation and object cls
- baseline : IMP, G-RCNN, MOTIF, transformer, vctree …
- data : Visual Genome, Open Images
- result : SOTA.
- contribution : It looks like a two-stage SGG model, but is it one-stage? If it counts as one-stage, it performs very well.
- limitation or something I don’t understand : Sharing parameters with the Siamese Sparse R-CNN… is this like distillation? 🤔 It is not resolved whether matching only GT objects in a single model would have hurt object detection… or whether unrelated object pairs could have been combined to create triplets?!
Details
Sparse R-CNN
https://github.com/long8v/PTIR/issues/58
Architecture

Triplet query
- Express the general distribution of triplets, like the queries in Sparse R-CNN
- 2 proposal box coordinates : 4d
- 2 object content vectors (representing appearance, acting like the proposal features in Sparse R-CNN) : 1024, 256
- 1 relation content vector (structure information between objects) : 1024, 256
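The query layout listed above can be sketched as a small data structure. This is only an illustration: the class name `TripletQuery`, the helper `init_triplet_queries`, the normalized `(cx, cy, w, h)` box format, the whole-image box initialization, and `dim=256` are all assumptions, not details taken from the paper.

```python
import random
from dataclasses import dataclass


@dataclass
class TripletQuery:
    """One learnable triplet query (hypothetical layout)."""
    subject_box: list   # 4d box, assumed normalized (cx, cy, w, h)
    object_box: list    # 4d box, same assumed format
    subject_vec: list   # object content vector for the subject
    object_vec: list    # object content vector for the object
    relation_vec: list  # relation content vector


def init_triplet_queries(num_queries, dim=256, seed=0):
    """Create num_queries triplet queries with random content vectors."""
    rng = random.Random(seed)

    def vec():
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]

    def box():
        # Assumption: start every box as the whole image.
        return [0.5, 0.5, 1.0, 1.0]

    return [
        TripletQuery(box(), box(), vec(), vec(), vec())
        for _ in range(num_queries)
    ]


queries = init_triplet_queries(num_queries=3, dim=8)
print(len(queries), len(queries[0].relation_vec))
```

In a real model these would be learnable parameters (for example embedding tables) rather than plain lists.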
Triplet detection head
- Object pair detection : MSA is applied over the object vectors, and a pair fusion module is added to improve on it


$X_s'$ and $X_o'$ are used as query and key, and the values are the two object vectors themselves. The enhanced object features are then used for Dynamic Conv.

Relation recognition

The relation feature is also taken from the largest region of the bboxes and fused through E2R on top of DyConv+.

Learning with Siamese Sparse R-CNN
The annotated objects are too sparse to learn from ground-truth triplets alone. A Siamese Sparse R-CNN, which shares parameters with the Structured Sparse R-CNN, acts as an object detector, and the virtual object pairs it produces are used as pseudo-labels for knowledge distillation.

two-stage triplet label assignment
Matching ground-truth triplets with predicted triplets

Triplets not matched to a GT triplet are matched with the object pairs output by the Siamese Sparse R-CNN. For these remaining triplets, the box is left as is and only the object classification score is replaced with the label. Hungarian matching is applied between the pseudo-labels output by the Siamese Sparse R-CNN and the objects of the remaining triplets, with the following matching cost:

The relation labels are then padded with background, and the loss is computed.
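A rough sketch of the two-stage assignment described above. The matching costs are taken as given matrices, since the cost formula itself is not reproduced in these notes. The function names `match` and `two_stage_assign` are hypothetical, and a brute-force search stands in for the Hungarian algorithm, so this only works for very small inputs.

```python
from itertools import permutations


def match(cost):
    """Minimum-cost one-to-one matching by brute force (Hungarian stand-in).

    cost[i][j] is the cost of assigning target i to prediction j.
    Returns a dict {prediction index: target index}.
    """
    n_t = len(cost)
    n_p = len(cost[0]) if n_t else 0
    if n_t > n_p:
        # More targets than predictions: solve the transposed problem.
        transposed = [[cost[i][j] for i in range(n_t)] for j in range(n_p)]
        return {pred: tgt for tgt, pred in match(transposed).items()}
    best_total, best_perm = float("inf"), ()
    for perm in permutations(range(n_p), n_t):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if total < best_total:
            best_total, best_perm = total, perm
    return {j: i for i, j in enumerate(best_perm)}


def two_stage_assign(gt_cost, pseudo_cost, num_preds):
    """Stage 1: GT triplets. Stage 2: pseudo-labels on the leftover predictions."""
    gt_match = match(gt_cost)
    remaining = [j for j in range(num_preds) if j not in gt_match]
    sub_cost = [[row[j] for j in remaining] for row in pseudo_cost]
    pseudo_match = {remaining[j]: k for j, k in match(sub_cost).items()}
    labels = []
    for j in range(num_preds):
        if j in gt_match:
            labels.append(("gt", gt_match[j]))
        elif j in pseudo_match:
            # Object labels come from the pseudo-label; relation is background.
            labels.append(("pseudo", pseudo_match[j]))
        else:
            labels.append(("background", None))
    return labels


gt_cost = [[0.1, 0.9, 0.8, 0.7]]          # 1 GT triplet x 4 predictions
pseudo_cost = [[0.5, 0.2, 0.9, 0.6],      # 2 pseudo pairs x 4 predictions
               [0.5, 0.8, 0.3, 0.9]]
print(two_stage_assign(gt_cost, pseudo_cost, num_preds=4))
```

For realistic numbers of queries, an actual Hungarian solver would replace `match`.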

Imbalance Class Distribution
Adaptive focusing parameter : reduce the weight of classes that are too frequent in object classification.
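One way to picture this is a focal loss whose focusing parameter depends on class frequency. The mapping below (gamma proportional to class count) and the names `adaptive_gamma`, `focal_loss`, and `gamma_max` are illustrative assumptions, not the formula from the paper.

```python
import math


def adaptive_gamma(class_counts, gamma_max=2.0):
    """Assumed rule: the most frequent class gets gamma_max, rarer ones less."""
    top = max(class_counts)
    return [gamma_max * c / top for c in class_counts]


def focal_loss(prob_true, gamma):
    """Focal loss for a single sample, given the probability of its true class."""
    return -((1.0 - prob_true) ** gamma) * math.log(prob_true)


gammas = adaptive_gamma([1000, 10])  # one frequent class, one rare class
# The same well-classified sample contributes less loss under the frequent class.
print(focal_loss(0.9, gammas[0]), focal_loss(0.9, gammas[1]))
```

A larger gamma shrinks the loss of easy samples more, so frequent classes contribute less to the total.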

logit adjustment
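A minimal sketch of logit adjustment, assuming the common formulation in which `tau * log(class prior)` is added to the logits before the training loss. Whether the paper uses exactly this form is not stated in these notes. `class_priors`, `adjust_logits`, and `tau` are hypothetical names.

```python
import math


def class_priors(counts):
    """Empirical class frequencies from label counts."""
    total = sum(counts)
    return [c / total for c in counts]


def adjust_logits(logits, priors, tau=1.0):
    """Add tau * log(prior) to each logit (used when computing the training loss)."""
    return [z + tau * math.log(p) for z, p in zip(logits, priors)]


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for one sample."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[target]


priors = class_priors([900, 100])      # frequent vs. rare relation class
raw = [0.0, 0.0]
adjusted = adjust_logits(raw, priors)
# With equal raw logits, the rare class (index 1) now incurs a larger loss,
# so the model has to push its raw logit higher to compensate.
print(cross_entropy(raw, 1), cross_entropy(adjusted, 1))
```

At inference the raw logits are used without the prior term, which is what shifts predictions toward rare classes.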

Results
