paper

TL;DR

  • I read this because : I was going to read RLIP (NeurIPS 2022), and this paper is the prior work it builds on.
  • task : Human-Object Interaction (HOI) detection
  • problem : Disadvantages of two-stage HOI : 1) high time complexity, because all M x N pairs of M humans and N objects go through action classification 2) class imbalance, because only a few of the M x N pairs have an actual relation 3) features learned for bounding-box localization focus on object edges rather than object content, so they perform poorly when reused to predict relations. <-> Disadvantage of one-stage HOI : hard to generalize, because it tries to solve two different tasks with a single feature representation.
  • idea : go one-stage but separate the decoders. A Human-Object Pair Decoder takes the queries and outputs a human-object interaction score for each one; an Interaction Decoder then takes the output representation of that decoder and classifies the action class.
  • architecture : DETR
  • objective : DETR loss + BCE on the interaction score (target 1 if a relation exists, 0 if not)
  • baseline : QPIC, AS-Net, HOTR, ATL, …
  • data : HICO-Det, V-COCO
  • evaluation : mAP over triplets (a predicted box counts as matched only if its IoU with the ground-truth box is 0.5 or higher)
  • result : 9.32% improvement over the previous SOTA on HICO-Det
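
To make the objective bullet concrete, here is a minimal sketch of the BCE term on the interaction score, in plain Python. The function name `interaction_score_bce`, the clamping constant `eps`, and the example numbers are all hypothetical. How the paper assigns queries to ground truth and how it weights this term against the DETR loss are not covered in these notes, so this only illustrates the "1 if a relation exists, 0 if not" target.

```python
import math

def interaction_score_bce(scores, has_relation, eps=1e-7):
    """Mean binary cross-entropy between predicted interaction scores
    (probabilities in [0, 1]) and 0/1 targets.

    Hypothetical helper for illustration, not the paper's code.
    """
    total = 0.0
    for p, y in zip(scores, has_relation):
        # Clamp to avoid log(0) on saturated predictions.
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(scores)

# Made-up example: three human-object pair queries, only the first one
# corresponds to a pair that actually interacts.
targets = [1, 0, 0]
good_loss = interaction_score_bce([0.9, 0.2, 0.1], targets)
bad_loss = interaction_score_bce([0.1, 0.8, 0.9], targets)
```

Scores that agree with the targets (`good_loss`) give a lower loss than scores that contradict them (`bad_loss`), which is what pushes the pair decoder to suppress non-interacting pairs.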

Details

image image

Result

image
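
For reading the mAP numbers, a rough sketch of when a predicted triplet counts as a true positive may help. The notes only state the IoU 0.5 box criterion. Requiring it for both the human box and the object box, together with matching object and action classes, is my assumption based on the usual HICO-Det protocol. The dictionary keys and function names below are hypothetical.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def triplet_matches(pred, gt, iou_thresh=0.5):
    """True if classes agree and both boxes overlap enough (assumed protocol)."""
    return (
        pred["action"] == gt["action"]
        and pred["object_class"] == gt["object_class"]
        and box_iou(pred["human_box"], gt["human_box"]) >= iou_thresh
        and box_iou(pred["object_box"], gt["object_box"]) >= iou_thresh
    )

# Made-up ground truth and two predictions that differ only in the object box.
gt = {"action": "ride", "object_class": "bicycle",
      "human_box": (0, 0, 10, 10), "object_box": (20, 20, 30, 30)}
close_pred = dict(gt, human_box=(0, 0, 10, 8), object_box=(22, 20, 32, 30))
far_pred = dict(gt, object_box=(26, 20, 36, 30))

close_ok = triplet_matches(close_pred, gt)  # object IoU is 80/120
far_ok = triplet_matches(far_pred, gt)      # object IoU is 40/160
```

Under this assumption a triplet with the right action still fails if either box is localized poorly, so the metric penalizes detection and interaction errors together.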