
TL;DR
- I read this because.. : #75 showed quite a large performance improvement, and a recent AAAI paper used this work as a baseline.
- task : two-stage SGG
- PROBLEM : SGG data is long-tailed.
- idea : proposes a confidence-aware bipartite graph neural network and a bi-level data resampling strategy.
- architecture : A combination of relationship confidence estimation (RCE) and confidence-aware message propagation (CMP)
- objective : CE loss for predicates and entities, plus a loss for relationship confidence estimation (class-specific / overall)
- baseline : graph-RCNN, GPS-Net, Motif, …
- data : Visual Genome, Open Images V4/6
- evaluation : PredCls, SGCls, SGGen(head, body, tail), OI evaluation
- result : SOTA. Tail-class scores improved a lot.
- contribution : a confidence-aware (?) GNN for SGG. I'm not familiar with the related papers, so I can't tell what counts as the contribution.
- limitation / things I cannot understand : What does the confidence actually do? It looks like a loss is applied directly to the confidence, but how is it supervised? Is it supervised like the “relatedness” in graph-RCNN?
Details
Architecture

Proposal generation network
Detect objects with Faster R-CNN and build an entity representation $e_i$ for each one from its visual feature $v_i$, geometric feature $g_i$, and class word embedding feature $w_i$.

The relation representation $r_{i \to j}$ is constructed from the entity representations $e_i$, $e_j$ and $u_{i,j}$, where $u_{i,j}$ is the convolutional feature of the union region of the two entities.

Bipartite Graph Neural Network
- Relationship Confidence Estimation Module
Estimate the confidence of a relationship given the class probabilities of its two entities $e_i$, $e_j$.

(???) I don’t understand this part, at what point is it global?

- Confidence-aware message

entity-to-predicate

predicate-to-entity

$\alpha$ and $\beta$ are threshold parameters.
Each entity node $e_i$ is updated by aggregating its neighbors’ messages.
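
To make the role of the two thresholds concrete, here is a minimal sketch of a gated aggregation step. The notes above do not record the actual gating formula, so the piecewise-linear gate `clip(alpha * (s - beta), 0, 1)`, the residual mean update, and the names `gate` and `aggregate_entity_messages` are all assumptions for illustration, not the paper's definition.

```python
def gate(score, alpha, beta):
    """Assumed piecewise-linear gate: 0 at or below beta, 1 at or above beta + 1/alpha."""
    return min(1.0, max(0.0, alpha * (score - beta)))


def aggregate_entity_messages(entity, messages, confidences, alpha=2.0, beta=0.3):
    """Update one entity vector with the gated mean of its predicate messages.

    entity      : list of floats, current representation e_i
    messages    : list of vectors (same length as entity), one per neighboring predicate
    confidences : list of confidence scores in [0, 1], one per message
    """
    if not messages:
        return list(entity)  # no neighbors: representation is unchanged
    weights = [gate(s, alpha, beta) for s in confidences]
    dim = len(entity)
    agg = [0.0] * dim
    for w, msg in zip(weights, messages):
        for k in range(dim):
            agg[k] += w * msg[k]  # low-confidence messages contribute little or nothing
    n = len(messages)
    # Residual update (an assumption): add the averaged gated message to e_i.
    return [entity[k] + agg[k] / n for k in range(dim)]
```

With this gate, a relationship whose confidence is below $\beta$ sends no message at all, which is one way to read "confidence-aware" propagation: noisy proposals are blocked instead of being averaged in.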

Scene Graph Prediction

Bi-level Resampling

- image-level over-sampling: compute a repeat factor per category and sample images that contain rarely appearing classes more often. $r^c = \max(1, \sqrt{t/f^c})$
- $c$ : category
- $f^c$ : frequency of category $c$ over the entire dataset
- $t$ : hyperparameter
- instance-level under-sampling: drop instances in each image at rates that depend on the predicate class. -> Iterative SGG is one-stage, so how was this applied there? Were the instances simply removed from the GT labels?
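
The image-level repeat factor above can be sketched in a few lines. Two things here are assumptions not stated in the notes: $f^c$ is taken to be the fraction of images that contain category $c$, and an image's repeat factor is taken to be the maximum $r^c$ over the categories it contains (the convention used in LVIS-style repeat factor sampling). The function names are hypothetical.

```python
import math
from collections import Counter


def category_repeat_factors(image_labels, t):
    """r^c = max(1, sqrt(t / f^c)).

    f^c is assumed to be the fraction of images containing category c.
    image_labels : list of lists, the predicate categories present in each image
    t            : threshold hyperparameter
    """
    n = len(image_labels)
    # Count each category once per image, even if it appears several times.
    counts = Counter(c for labels in image_labels for c in set(labels))
    return {c: max(1.0, math.sqrt(t / (k / n))) for c, k in counts.items()}


def image_repeat_factors(image_labels, t):
    """Assumed image-level factor: the largest r^c among the image's categories."""
    rc = category_repeat_factors(image_labels, t)
    return [max((rc[c] for c in labels), default=1.0) for labels in image_labels]


# Toy dataset: "on" is in every image, "riding" is in only one of four.
toy = [["on"], ["on"], ["on"], ["on", "riding"]]
factors = image_repeat_factors(toy, t=1.0)
```

In this toy case only the last image gets a factor above 1, so only images with the rare predicate are repeated. Categories with $f^c \ge t$ are never over-sampled because of the $\max(1, \cdot)$.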
Result
