
TL;DR
- I read this because.. : Mentioned in #58. I have a feeling the name is related to SGG
- task : object detection
- problem : There was a long-standing intuition that modeling the relations between objects would improve object recognition, but no work had demonstrated it. SOTA object detectors still process each instance individually.
- idea : Use the attention module to get the relation between objects and do a weighted sum to strengthen the vector
- architecture : CNN -> RPN -> RoI -> FC -> object relation module -> fc -> object relation module -> cls / bbox prediction -> duplicate removal network
- objective : binary cross entropy (BCE) for the duplicate removal network, cross entropy loss for classification
- baseline : Faster R-CNN, feature pyramid network (FPN), deformable convolutional network (DCN)
- data : COCO
- evaluation : mAP, mAP50, mAP75
- result : SOTA. Best on mAP; mAP50 is best when the duplicate removal network is trained with threshold 0.5, and mAP75 is best when it is trained with 0.75
- contribution : first fully end-to-end object detector (without NMS)
- limitation / things I cannot understand : duplicate removal network
Details

Object Relation Module

- $f_R$ : relation feature
- $f_G$ : geometric feature
- $f_A$ : appearance feature
- $w_{mn}$ : How much does the mth object affect the nth object?

$w_A^{mn}$ is computed just like scaled dot-product attention
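
A minimal sketch of that appearance weight, assuming the usual key/query form of scaled dot-product attention: project $f_A^m$ with a key matrix, $f_A^n$ with a query matrix, take the dot product and divide by $\sqrt{d_k}$. The names `W_K`, `W_Q`, and `appearance_weight` are hypothetical helpers for illustration, not taken from the authors' code.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, x) for row in W]

def appearance_weight(f_m, f_n, W_K, W_Q):
    """w_A^{mn}: scaled dot product of the projected appearance features.

    f_m, f_n : appearance features of objects m and n
    W_K, W_Q : key / query projection matrices (assumed names)
    """
    k = matvec(W_K, f_m)  # key from object m
    q = matvec(W_Q, f_n)  # query from object n
    d_k = len(k)          # projected dimension used for scaling
    return dot(k, q) / math.sqrt(d_k)
```

With identity projections, `appearance_weight([1.0, 2.0], [3.0, 4.0], I, I)` is simply the dot product 11 divided by $\sqrt{2}$.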

$w_G^{mn}$ is obtained by extracting the geometric features of the two objects (combined into $\varepsilon_G$), embedding them with sine/cosine, multiplying by $W_g$, and applying ReLU
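
The geometric weight can be sketched as follows. This is only an illustration under assumptions: boxes are given as `(cx, cy, w, h)`, the relative geometry is the 4-d log-ratio feature as I remember it from the paper, and the `eps` clamp, the 16 dimensions per value, and the wavelength of 1000 are my own choices to keep the example runnable. All function names are hypothetical.

```python
import math

def relative_geometry(box_m, box_n, eps=1e-3):
    """4-d relative geometry of two boxes given as (cx, cy, w, h).

    eps avoids log(0) when the centers coincide (an assumption of this sketch).
    """
    xm, ym, wm, hm = box_m
    xn, yn, wn, hn = box_n
    return [
        math.log(max(abs(xm - xn), eps) / wm),
        math.log(max(abs(ym - yn), eps) / hm),
        math.log(wn / wm),
        math.log(hn / hm),
    ]

def sinusoidal_embedding(values, dim_per_value=16, wavelength=1000.0):
    """Embed each scalar with sines and cosines at several frequencies."""
    half = dim_per_value // 2
    freqs = [1.0 / (wavelength ** (i / half)) for i in range(half)]
    emb = []
    for v in values:
        emb.extend(math.sin(v * f) for f in freqs)
        emb.extend(math.cos(v * f) for f in freqs)
    return emb

def geometry_weight(box_m, box_n, w_g):
    """w_G^{mn} = ReLU(w_g . embedding); w_g stands in for the learned W_g."""
    emb = sinusoidal_embedding(relative_geometry(box_m, box_n))
    return max(0.0, sum(a * b for a, b in zip(w_g, emb)))
```

Because of the ReLU, pairs whose projected geometry comes out negative get a weight of exactly zero, so they contribute nothing to the relation for that pair.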

Pull features

In the end, $f_A^n$ is augmented with the concatenation of the $N_r$ relation features computed for object $n$ (the concatenation is added back onto $f_A^n$).
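
Putting the pieces together, a rough sketch of one relation module for a single target object $n$: the geometric weight gates a softmax over the appearance weights, the resulting $w^{mn}$ weight a sum over the other objects' (already projected) features, and the $N_r$ relation features are concatenated and added to $f_A^n$. This follows my reading of the paper's formulation; the function names and the zero-sum guard are assumptions of this example.

```python
import math

def relation_weights(w_A_col, w_G_col):
    """w^{mn} for a fixed n: geometry-gated softmax over source objects m."""
    a_max = max(w_A_col)  # subtract the max for numerical stability
    unnorm = [g * math.exp(a - a_max) for a, g in zip(w_A_col, w_G_col)]
    total = sum(unnorm)
    if total == 0.0:      # every geometric weight was zeroed by the ReLU
        return [0.0] * len(unnorm)
    return [u / total for u in unnorm]

def relation_feature(weights, values):
    """f_R(n): weighted sum of the projected value vectors, one per object m."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def augment(f_A_n, relation_feats):
    """f_A^n + Concat[f_R^1(n), ..., f_R^{N_r}(n)]."""
    concat = [x for feat in relation_feats for x in feat]
    if len(concat) != len(f_A_n):
        raise ValueError("concatenated relation features must match f_A^n in size")
    return [a + c for a, c in zip(f_A_n, concat)]
```

The size check reflects the design constraint that each of the $N_r$ relation features has dimension $1/N_r$ of the appearance feature, so the module keeps the feature size unchanged and can be dropped in after any fc layer.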
Relation for Instance Recognition

Relation for Duplicate Removal

Nothing special here: a binary classifier predicts {0, 1} (duplicate or correct) for each detection. Because the relation module lets each detection look at the others, duplicates can be removed well.
- rank feature : Converting the score to a rank and embedding it worked better than feeding the raw score directly.
- The correct / duplicate labels depend on the IoU threshold used for matching, so the threshold that works best differs between AP50, AP75…
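
To make the two bullets above concrete, here is a small sketch of how ranks and correct/duplicate labels could be derived. It assumes boxes are `(x1, y1, x2, y2)`, that among the detections matching a ground-truth box at IoU >= `thr` only the highest-scoring one is labeled correct (1) and the rest duplicate (0), and it ignores class handling. The function names are hypothetical, and this is a simplification rather than the authors' exact procedure.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def ranks_by_score(scores):
    """Rank 1 is the highest score; the rank (not the score) gets embedded."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def duplicate_labels(boxes, scores, gt_boxes, thr):
    """1 = correct, 0 = duplicate or unmatched, for a given IoU threshold."""
    labels = [0] * len(boxes)
    for gt in gt_boxes:
        matched = [i for i, b in enumerate(boxes) if iou(b, gt) >= thr]
        if matched:
            best = max(matched, key=lambda i: scores[i])
            labels[best] = 1
    return labels
```

For example, with one ground-truth box `(0, 0, 10, 10)` and detections `(0, 0, 10, 10)` (score 0.6) and `(0, 0, 10, 8)` (score 0.9), the second detection is the correct one at `thr=0.5`, but at `thr=0.9` it no longer matches and the first becomes correct. This is why the training threshold changes which AP metric benefits.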
Result
