image

paper

TL;DR

  • I read this because.. : it is an early SGG paper
  • task : Scene Graph Generation
  • problem : after extracting objects, handle the quadratic number of candidate relations well. Build an enhanced graph representation.
  • idea : insert a module in the middle of the pipeline that prunes relations between objects. Apply an attentional GCN.
  • architecture : 1) extract objects with Faster RCNN 2) concat the object cls logits and use them for relation pruning 3) apply an attentional GCN to strengthen the object and relation node representations -> predictions seem to come from a classifier attached to each subject, object, and relation representation?
  • objective : 1) bbox loss + cls loss 2) bce for relationship score 3) ce for object cls and predicate cls
  • baseline : IMP, MSDN, Neural Motifs
  • data : Visual Genome
  • evaluation : PredCls, PhrCls, SGGen, SGGen+(proposed in this paper)
  • result : SOTA
  • contribution : possibly the first paper to apply a GCN?
  • limitation / things I cannot understand : does SGG really have enough graph-like structure to justify using a GCN?

Details

Architecture

image image image

Split into 3 stages

  1. Object Region Proposal : given the image, extract the nodes (= vertices, V) => Faster RCNN
  2. Relationship Proposal : given the image and the nodes, prune the n*(n-1) possible pairs down to the plausible relations
  3. Graph Labeling : given the image, nodes, and edges, find the relation and object labels
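
Read as conditional steps, the three stages above suggest a factorization of the scene graph distribution. A sketch only: the symbols $G$ (scene graph), $I$ (image), $E$ (pruned edges), $R$ (relation labels), and $O$ (object labels) are introduced here for illustration and do not appear in these notes.

$$P(G \mid I) = P(V \mid I)\; P(E \mid V, I)\; P(R, O \mid V, E, I)$$

Each factor corresponds to one stage: object proposal, relationship proposal, graph labeling.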

Relation Proposal Network

Measures "relatedness" from the objects' class logits. This acts as a kind of soft prior (e.g., person-ride-chicken is implausible?) image

Implemented by concatenating the logits and stacking an MLP on top. Pairs are scored and sorted, and the top K pairs are selected. Because Faster RCNN yields many boxes, NMS is applied over the pairs so that only the top m pairs remain. image
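
The select-then-suppress step can be sketched in plain Python. This is a minimal illustration, not the paper's code: the relatedness scores are taken as given, the names `prune_pairs` and `pair_overlap` are hypothetical, and the pair-overlap measure (summed intersections over summed unions of the two subject boxes and the two object boxes) and the 0.7 threshold are assumptions.

```python
# Boxes are (x1, y1, x2, y2). All names and the overlap measure are illustrative.

def area(a):
    return max(a[2] - a[0], 0.0) * max(a[3] - a[1], 0.0)

def inter_area(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0.0) * max(h, 0.0)

def pair_overlap(boxes, p, q):
    """Overlap of pairs p=(subj, obj) and q=(subj, obj): summed intersections
    divided by summed unions (an assumed measure)."""
    inter = (inter_area(boxes[p[0]], boxes[q[0]])
             + inter_area(boxes[p[1]], boxes[q[1]]))
    total = sum(area(boxes[i]) for i in (p[0], q[0], p[1], q[1]))
    union = total - inter
    return inter / union if union > 0 else 0.0

def prune_pairs(boxes, scores, top_k, top_m, nms_thresh=0.7):
    """scores maps (subj_idx, obj_idx) -> relatedness. Keeps at most top_m pairs."""
    # Sort by score and keep the K best candidates.
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_k]
    kept = []
    for pair in ranked:  # greedy NMS over pairs
        if all(pair_overlap(boxes, pair, k) <= nms_thresh for k in kept):
            kept.append(pair)
        if len(kept) == top_m:
            break
    return kept

boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30), (50, 50, 60, 60)]
scores = {(0, 2): 0.9, (1, 2): 0.8, (0, 3): 0.6, (3, 0): 0.1, (2, 3): 0.5}
kept = prune_pairs(boxes, scores, top_k=4, top_m=2)
print(kept)
```

Here the pair (1, 2) is suppressed because its subject box nearly coincides with that of the higher-scoring pair (0, 2) and both share the same object box.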

Attentional GCN

The vanilla GCN is as follows image

  • $z_i$ : representation of the i-th node
  • $N(i)$ : neighbors of the i-th node
  • $\alpha_{ij}$ : connection coefficient between i and j, determined by the adjacency matrix

Written with the matrix $Z\in \mathbb{R}^{d\times T_n}$, this becomes image

Here, $\alpha_{ij}$ is learned rather than given image

$\alpha_{ij}$ is learned with a 2 layer MLP + softmax
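
A rough sketch of one such layer in plain Python, to make the "2 layer MLP + softmax" attention concrete. The exact update in the paper is not reproduced in these notes, so the form used here (ReLU nonlinearity, concatenation of $z_i$ and $z_j$ as MLP input, no separate self term) is an assumption, and the names `attention`, `agcn_layer`, `W`, `W_a`, and `w_h` are hypothetical.

```python
import math
import random

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(x):
    return [max(v, 0.0) for v in x]

def softmax(u):
    if not u:
        return []
    m = max(u)
    e = [math.exp(v - m) for v in u]
    s = sum(e)
    return [v / s for v in e]

def attention(z, neighbors, i, W_a, w_h):
    """alpha_ij for j in N(i): 2-layer MLP on [z_i, z_j], softmax over N(i)."""
    logits = []
    for j in neighbors[i]:
        hidden = relu(matvec(W_a, z[i] + z[j]))  # layer 1 on the concatenation
        logits.append(sum(a * b for a, b in zip(w_h, hidden)))  # layer 2 -> scalar
    return softmax(logits)

def agcn_layer(z, neighbors, W, W_a, w_h):
    """Assumed update: z_i' = relu(sum_j alpha_ij * W z_j)."""
    out = []
    for i in range(len(z)):
        alpha = attention(z, neighbors, i, W_a, w_h)
        agg = [0.0] * len(W)
        for a, j in zip(alpha, neighbors[i]):
            msg = matvec(W, z[j])
            agg = [s + a * m for s, m in zip(agg, msg)]
        out.append(relu(agg))
    return out

random.seed(0)
d, hidden = 2, 3
z = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # toy node features
neighbors = [[1, 2], [0], [0, 1]]          # N(i) as index lists
W = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(d)]
W_a = [[random.uniform(-1, 1) for _ in range(2 * d)] for _ in range(hidden)]
w_h = [random.uniform(-1, 1) for _ in range(hidden)]
out = agcn_layer(z, neighbors, W, W_a, w_h)
```

With a fixed adjacency matrix, `attention` would be replaced by a lookup of precomputed coefficients; the aggregation stays the same. Adding `i` to its own neighbor list is one way to include a self connection.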

aGCN for SGG

The N object regions and the m relationships each become nodes, and edges are connected according to the output of the network above. In addition, direct edges are added between objects.
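
The graph layout described above can be sketched as neighbor lists. This is an illustration under assumptions: the function and field names are hypothetical, each kept (subject, object) pair becomes one relation node, and the direct object-object edges are assumed to connect every object to every other object.

```python
def build_sgg_graph(num_objects, pairs):
    """pairs: kept (subject_idx, object_idx) proposals, one relation node each.

    Returns neighbor lists by edge type (hypothetical names):
      obj_to_obj[i]  : object nodes directly linked to object i
      rel_of_subj[i] : relation nodes in which object i is the subject
      rel_of_obj[i]  : relation nodes in which object i is the object
      rel_ends[r]    : (subject, object) endpoints of relation node r
    """
    # Assumed: direct edges between all distinct object pairs.
    obj_to_obj = [[j for j in range(num_objects) if j != i]
                  for i in range(num_objects)]
    rel_of_subj = [[] for _ in range(num_objects)]
    rel_of_obj = [[] for _ in range(num_objects)]
    rel_ends = []
    for r, (s, o) in enumerate(pairs):
        rel_of_subj[s].append(r)
        rel_of_obj[o].append(r)
        rel_ends.append((s, o))
    return obj_to_obj, rel_of_subj, rel_of_obj, rel_ends

obj_to_obj, rel_of_subj, rel_of_obj, rel_ends = build_sgg_graph(3, [(0, 2), (0, 1)])
```

Keeping the edge types in separate lists makes it possible to aggregate messages from other objects, from relations where the node is the subject, and from relations where it is the object with different weights.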

The representation of an object node is as follows image

The representation of a relation node is as follows image

Result

image image

Ablation for modules

image

image