
TL;DR
- task : Visual Relationship Detection
- problem : predict the full (subject, predicate, object) triplet in one stage by fully utilizing the transformer.
- Idea : use part queries, tensors whose slices represent the subject, object, and predicate, together with a composite (sum) query, a vector that predicts the final triplet as a whole, and let the two interact through attention.
- architecture : CNN + DETR encoder + Part-and-Sum Transformer (PST) decoder; the decoder outputs are converted into the final composite predictions.

- objective : bbox / cls for each part; bbox / cls for the sum
- baseline : Zoom-Net, …
- data : Visual Relationship Detection (VRD) dataset, HICO-DET
- result : SOTA
- contribution : one-stage SGG with a simple architecture.
Details
Part-and-Sum Transformer Decoder

(SA: self-attention, CA: cross-attention)
Part-and-Sum separate decoding
The decoder is split into two streams, part-query decoding and sum-query decoding, each consisting of SA, CA, and FFN.
In part-query decoding, the (s, p, o) part queries of all triplets self-attend to one another and cross-attend to the tokenized image features (= I).

The sum queries go through the same SA, CA, and FFN steps.

By going through SA -> CA -> FFN, part-level and global embeddings are produced at the same time. In particular, since SA sees all queries, it can 1) let a part query that predicts "person" steer the predicate toward "eat" or "hold", and 2) let a sum query predict the whole triplet "person read book".
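As a rough illustration, the shared SA -> CA -> FFN pattern of the two streams can be sketched as follows. This is a minimal NumPy sketch under strong assumptions: single-head attention, no layer norm, no learned weights, and the FFN stand-in is mine, not the paper's exact implementation.

```python
import numpy as np

np.random.seed(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention (single head)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def decoder_layer(queries, image_tokens):
    # SA: queries attend to each other
    x = queries + attention(queries, queries, queries)
    # CA: queries attend to the tokenized image features I
    x = x + attention(x, image_tokens, image_tokens)
    # FFN stand-in (a real layer would use learned weights)
    return x + np.maximum(x, 0.0)

N, d = 4, 8                                # N triplet queries, embedding dim d
part_queries = np.random.randn(N * 3, d)   # flattened (N, 3, d) tensor: s, p, o parts
sum_queries  = np.random.randn(N, d)       # one composite vector per triplet
I = np.random.randn(16, d)                 # tokenized image features

part_out = decoder_layer(part_queries, I)  # part stream
sum_out  = decoder_layer(sum_queries, I)   # sum stream
```

Both streams reuse the same layer pattern; only the query granularity differs (3 part vectors vs. 1 sum vector per triplet).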
Factorized self-attention layer
To capture more structural information during the SA above, we do not self-attend over all part queries at once: intra-relation attention (among the s, p, o queries of the same triplet) is applied first, followed by inter-relation attention (across triplets).
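The intra-then-inter factorization can be mimicked with attention masks. A sketch under the same simplified single-head, no-weights assumptions; the mask construction here is my illustration, not the paper's code.

```python
import numpy as np

np.random.seed(0)

def masked_self_attention(x, mask):
    # self-attention where False entries in `mask` block attention
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = np.where(mask, scores, -1e9)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

N, P, d = 3, 3, 8              # N triplets, P = 3 parts (s, p, o)
x = np.random.randn(N * P, d)

# intra-relation: each part attends only to parts of its own triplet
triplet_id = np.repeat(np.arange(N), P)
intra_mask = triplet_id[:, None] == triplet_id[None, :]
x = masked_self_attention(x, intra_mask)

# inter-relation: afterwards, attend across all triplets
inter_mask = np.ones((N * P, N * P), dtype=bool)
x = masked_self_attention(x, inter_mask)
```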
Part-Sum interaction
Both streams are passed through their FFNs and then fused: the s, o, and p part outputs of each triplet are summed and combined with that triplet's sum query.
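One way to read this fusion step (the exact fusion operator is an assumption on my part; the text only says the s, o, p part outputs are summarized into the sum query):

```python
import numpy as np

np.random.seed(0)
N, d = 4, 8
part_out = np.random.randn(N, 3, d)  # decoded s, p, o embeddings per triplet
sum_out  = np.random.randn(N, d)     # decoded composite (sum) embedding

# sum the s, o, p part embeddings and add them to the sum query
fused = sum_out + part_out.sum(axis=1)
```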

Composite Prediction

For the part queries:
- bbox for s, o
- cls for s, p, o
- cls for the spo triplet

For the sum query:
- bbox for s, o
- cls for s, p, o
Composite bipartite matching
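DETR-style set prediction matches predictions to ground truth via bipartite matching on a cost matrix; here the cost is composite, combining part-level terms (boxes/classes of s, o, p) with a sum-level term. A brute-force sketch for tiny N follows; real implementations use the Hungarian algorithm (e.g. `scipy.optimize.linear_sum_assignment`), and the cost values below are made up for illustration.

```python
import numpy as np
from itertools import permutations

def match(cost):
    # brute-force bipartite matching; fine for tiny N, illustration only
    n = cost.shape[0]
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        c = sum(cost[i, j] for i, j in enumerate(perm))
        if c < best_cost:
            best_cost, best_perm = c, perm
    return best_perm

# composite cost = part-level cost + sum-level cost (illustrative values)
part_cost = np.array([[2., 1., 3.],
                      [1., 0., 4.],
                      [2., 2., 1.]])
sum_cost  = np.array([[2., 0., 0.],
                      [1., 0., 1.],
                      [1., 0., 1.]])
assignment = match(part_cost + sum_cost)  # -> (1, 0, 2)
```

Because part and sum costs are added before matching, a prediction must be good at both the part level and the composite level to win an assignment.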


Training Loss

Result
