
TL;DR
- task : Visual Relationship Detection
- problem : predict the full (subject, predicate, object) triplet in one stage by fully utilizing the transformer.
- Idea : use part queries, tensors whose slices represent the subject, object, and predicate, together with a composite (sum) query, a vector that predicts the final triplet as a whole, and let the two interact through attention.
- architecture : CNN + DETR encoder + Part-and-Sum Transformer (PST) decoder; the decoder outputs are converted into the final composite predictions.

- objective : bbox / cls for each part; bbox / cls for the sum
- baseline : Zoom-Net, …
- data : Visual Relationship Detection (VRD) dataset, HICO-DET
- result : SOTA
- contribution : one-stage SGG with a simple architecture.
Details
Part-and-Sum Transformer Decoder

(SA: self-attention, CA: cross-attention)
Part-and-Sum separate decoding
The decoder is split into two streams, part-query decoding and sum-query decoding, each consisting of SA, CA, and FFN.
In part-query decoding, the (s, p, o) part queries of all triplets self-attend to one another and cross-attend to the tokenized image features (= I).

The sum queries go through the same SA, CA, and FFN steps.

By going through SA -> CA -> FFN, part-level and global embeddings are produced at the same time. In particular, since SA sees all queries, it can 1) let a part query that predicts "person" steer the predicate toward "eat" or "hold", and 2) let a sum query predict the whole triplet "person read book".
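As a rough illustration, the shared SA -> CA -> FFN pattern of the two streams can be sketched as follows. This is a minimal NumPy sketch under strong assumptions: single-head attention, no layer norm, no learned weights, and the FFN stand-in is mine, not the paper's exact implementation.

```python
import numpy as np

np.random.seed(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention (single head)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def decoder_layer(queries, image_tokens):
    # SA: queries attend to each other
    x = queries + attention(queries, queries, queries)
    # CA: queries attend to the tokenized image features I
    x = x + attention(x, image_tokens, image_tokens)
    # FFN stand-in (a real layer would use learned weights)
    return x + np.maximum(x, 0.0)

N, d = 4, 8                                # N triplet queries, embedding dim d
part_queries = np.random.randn(N * 3, d)   # flattened (N, 3, d) tensor: s, p, o parts
sum_queries  = np.random.randn(N, d)       # one composite vector per triplet
I = np.random.randn(16, d)                 # tokenized image features

part_out = decoder_layer(part_queries, I)  # part stream
sum_out  = decoder_layer(sum_queries, I)   # sum stream
```

Both streams reuse the same layer pattern; only the query granularity differs (3 part vectors vs. 1 sum vector per triplet).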
Factorized self-attention layer
To capture more structural information during the SA above, we do not self-attend over all part queries at once: intra-relation attention (among the s, p, o queries of the same triplet) is applied first, followed by inter-relation attention (across triplets).
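The intra-then-inter factorization can be mimicked with attention masks. A sketch under the same simplified single-head, no-weights assumptions; the mask construction here is my illustration, not the paper's code.

```python
import numpy as np

np.random.seed(0)

def masked_self_attention(x, mask):
    # self-attention where False entries in `mask` block attention
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = np.where(mask, scores, -1e9)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

N, P, d = 3, 3, 8              # N triplets, P = 3 parts (s, p, o)
x = np.random.randn(N * P, d)

# intra-relation: each part attends only to parts of its own triplet
triplet_id = np.repeat(np.arange(N), P)
intra_mask = triplet_id[:, None] == triplet_id[None, :]
x = masked_self_attention(x, intra_mask)

# inter-relation: afterwards, attend across all triplets
inter_mask = np.ones((N * P, N * P), dtype=bool)
x = masked_self_attention(x, inter_mask)
```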
Part-Sum interaction
Both streams are passed through their FFNs and then fused: the s, o, and p part outputs of each triplet are summed and combined with that triplet's sum query.
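One way to read this fusion step (the exact fusion operator is an assumption on my part; the text only says the s, o, p part outputs are summarized into the sum query):

```python
import numpy as np

np.random.seed(0)
N, d = 4, 8
part_out = np.random.randn(N, 3, d)  # decoded s, p, o embeddings per triplet
sum_out  = np.random.randn(N, d)     # decoded composite (sum) embedding

# sum the s, o, p part embeddings and add them to the sum query
fused = sum_out + part_out.sum(axis=1)
```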

Composite Prediction

For the part queries:
- bbox for s, o
- cls for s, p, o
- cls for the spo triplet

For the sum query:
- bbox for s, o
- cls for s, p, o
Composite bipartite matching
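DETR-style set prediction matches predictions to ground truth via bipartite matching on a cost matrix; here the cost is composite, combining part-level terms (boxes/classes of s, o, p) with a sum-level term. A brute-force sketch for tiny N follows; real implementations use the Hungarian algorithm (e.g. `scipy.optimize.linear_sum_assignment`), and the cost values below are made up for illustration.

```python
import numpy as np
from itertools import permutations

def match(cost):
    # brute-force bipartite matching; fine for tiny N, illustration only
    n = cost.shape[0]
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        c = sum(cost[i, j] for i, j in enumerate(perm))
        if c < best_cost:
            best_cost, best_perm = c, perm
    return best_perm

# composite cost = part-level cost + sum-level cost (illustrative values)
part_cost = np.array([[2., 1., 3.],
                      [1., 0., 4.],
                      [2., 2., 1.]])
sum_cost  = np.array([[2., 0., 0.],
                      [1., 0., 1.],
                      [1., 0., 1.]])
assignment = match(part_cost + sum_cost)  # -> (1, 0, 2)
```

Because part and sum costs are added before matching, a prediction must be good at both the part level and the composite level to win an assignment.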


Training Loss

Result
