image

paper

TL;DR

  • task : Visual Relationship Detection
  • problem : predict the (subject, predicate, object) triplet in a single stage while fully utilizing the Transformer.
  • Idea : use a composite query: a tensor-shaped query representing the parts (subject, object, predicate) together with a vector-shaped query that predicts the final whole triplet (the sum), and attach attention over them.
  • architecture : CNN backbone + DETR encoder + Part-and-Sum Transformer (PST) decoder, whose outputs are decoded into the final triplet predictions.
  • objective : bbox/cls losses for the part queries, and bbox/cls losses for the sum query
  • baseline : Zoom-Net, …
  • data : Visual Relationship Detection (VRD) dataset, HICO-DET
  • result : SOTA
  • contribution : one-stage SGG (scene graph generation) with a simple architecture.
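To make the composite-query idea concrete, here is a minimal shape sketch (the sizes and variable names are ours, not from the paper's code): each of the N triplet queries is a composite of three part vectors (s, p, o) plus one sum vector.

```python
import torch

# Hypothetical toy sizes for illustration only
N, d = 100, 256

# Tensor-shaped query: the subject / predicate / object parts of each triplet
part_queries = torch.randn(N, 3, d)

# Vector-shaped query: the whole triplet (the "sum")
sum_queries = torch.randn(N, d)
```

The PST decoder consumes both forms; part-level and triplet-level predictions come from the two streams' outputs.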

Details

Part-and-Sum Transformer Decoder

image

(SA: self-attention, CA: cross-attention)

Part-and-Sum separate decoding. The decoder is split into a two-stream architecture, part-query decoding and sum-query decoding, each consisting of SA, CA, and FFN. In part-query decoding, we self-attend over the (s, p, o) parts of all queries and cross-attend to the tokenized image features (= I). image
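The part-query decoding step above can be sketched as follows (a plain scaled-dot-product sketch under our own toy sizes, not the paper's implementation; the FFN and residual norms are omitted):

```python
import torch
import torch.nn.functional as F

def attend(q, k, v):
    # plain scaled dot-product attention
    w = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return w @ v

N, d, hw = 100, 256, 400            # queries, embed dim, image tokens (toy sizes)
parts = torch.randn(N, 3, d)        # (s, p, o) parts of every query
img = torch.randn(hw, d)            # tokenized image features I

# SA: self-attend over the s/p/o parts of ALL queries (flattened to N*3 tokens)
flat = parts.reshape(N * 3, d)
flat = flat + attend(flat, flat, flat)

# CA: cross-attend from the part tokens to the image tokens
flat = flat + attend(flat, img, img)
parts = flat.reshape(N, 3, d)       # an FFN would follow in the real layer
```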

The sum query is decoded the same way. image

Passing both streams through SA -> CA -> FFN yields part-level and global embeddings at the same time. In particular, since SA attends over all queries, it can 1) let a part query that predicts “person” bias the predicate toward verbs such as “eat” or “hold”, and 2) let a sum query predict the whole triplet, e.g. “person read book”.

Factorized self-attention layer. To extract more structural information during the SA above, we do not self-attend over all part queries at once; intra-relation attention (among the parts within one triplet query) comes first, followed by inter-relation attention (across triplet queries).
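A rough sketch of the factorization, under one plausible reading (intra: the three parts of each triplet attend only to each other; inter: corresponding parts attend across triplets; this is our simplification, not the paper's code):

```python
import torch
import torch.nn.functional as F

def attend(q, k, v):
    # plain scaled dot-product attention, batched over the leading dim
    w = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return w @ v

N, d = 100, 256
parts = torch.randn(N, 3, d)

# 1) intra-relation SA: s/p/o of each triplet attend among themselves
parts = parts + attend(parts, parts, parts)        # batched over N triplets

# 2) inter-relation SA: each role attends across triplets
#    (e.g. all subject parts attend to all subject parts)
roles = parts.transpose(0, 1)                      # (3, N, d)
roles = roles + attend(roles, roles, roles)        # batched over the 3 roles
parts = roles.transpose(0, 1)
```

Factorizing this way keeps the attention maps small ((3 x 3) and (N x N)) instead of one dense (3N x 3N) map, while still letting structure flow both within and across triplets.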

Part-Sum interaction. After both streams pass through their FFNs, they are fused: the s, o, and p part embeddings of each query are aggregated and combined with the corresponding sum query. image
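A minimal sketch of this fusion, assuming a simple additive aggregation (the exact fusion operator is the paper's; this illustrates only the data flow):

```python
import torch

N, d = 100, 256
parts = torch.randn(N, 3, d)   # s, p, o part embeddings after the FFN
sums = torch.randn(N, d)       # sum-query embeddings after the FFN

# Part -> Sum interaction: aggregate the three part embeddings of each
# query and fuse them into the matching sum query (additive fusion here)
fused_sum = sums + parts.sum(dim=1)
```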

Composite Prediction

image

  • bbox for s, o
  • cls for s, p, o
  • cls for the spo triplet
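The part-stream prediction heads above can be sketched like so (head names, hidden sizes, and label-space sizes are our assumptions, loosely VRD-like, not the paper's configuration):

```python
import torch
import torch.nn as nn

d, n_obj, n_pred = 256, 100, 70     # toy label-space sizes (VRD-like assumption)

# Hypothetical heads over the part-stream output
bbox_head = nn.Sequential(nn.Linear(d, d), nn.ReLU(),
                          nn.Linear(d, 4), nn.Sigmoid())   # normalized boxes
obj_cls = nn.Linear(d, n_obj + 1)   # +1 for a "no object" class
pred_cls = nn.Linear(d, n_pred + 1)

parts = torch.randn(8, 3, d)        # (s, p, o) embeddings for 8 queries
s_box, o_box = bbox_head(parts[:, 0]), bbox_head(parts[:, 2])
s_cls, o_cls = obj_cls(parts[:, 0]), obj_cls(parts[:, 2])
p_cls = pred_cls(parts[:, 1])
```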

image

  • bbox for s, o from the sum query
  • cls for s, p, o

Composite bipartite matching

image

image
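DETR-style models assign predictions to ground-truth targets with Hungarian matching; here the cost is composite, mixing part-level and sum-level terms. A toy sketch with random costs (real costs combine classification probabilities with box L1/GIoU terms; the split into `part_cost` and `sum_cost` is our illustration):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n_pred, n_gt = 8, 3                       # predicted composites vs. GT triplets

part_cost = rng.random((n_pred, n_gt))    # cost from the s/p/o part predictions
sum_cost = rng.random((n_pred, n_gt))     # cost from the sum (triplet) prediction

# Composite cost matrix, then one-to-one Hungarian matching
cost = part_cost + sum_cost
rows, cols = linear_sum_assignment(cost)  # matched (prediction, GT) index pairs
```

Each ground-truth triplet is matched to exactly one composite query; unmatched queries are trained toward the "no object / no relation" classes.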

Training Loss

image

Result

image