image

paper

TL;DR

  • task : two-stage Scene Graph Generator
  • problem : Prior work assumes that triplets are independent and predicts them in parallel. image
  • idea : Predicting auto-regressively, conditioned on the relations already predicted, should work better (see the figure above).
  • architecture : Transformer encoder-decoder. The decoder takes the encoder outputs together with a relation embedding as [S, P, O] tokens, runs self-attention over them, and also cross-attends to the encoder outputs.
  • objective : cross-entropy loss, plus a reinforcement learning approach that targets recall and mRecall
  • baseline : Graph R-CNN, …
  • data : VRD, Visual Genome
  • result : SOTA
  • contribution : the first auto-regressive approach I have seen in SGG
  • limitation or unclear parts : It is surprising that this trains at all. -> (after discussion) There are cases in multi-object detection where predictions are also fed in sequentially (the model learns things like "if there is a cat in this photo, there is probably a dog too"). The transformer decoder does not only look at its input; it also cross-attends to the encoder, so the input does not have to be directly related to what I want to predict next.
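
The difference between the parallel and the auto-regressive view in the TL;DR can be written as two factorizations. This is a sketch in the notation of the Details section ($X_B$ is the encoder output, $\hat Y_{1:m}$ the first $m$ predicted relationships); $y_m$ and the triplet count $M$ are symbols assumed here for illustration, not taken from the paper.

$$
p_{\text{parallel}}(Y \mid X_B) = \prod_{m=0}^{M-1} p(y_{m+1} \mid X_B),
\qquad
p_{\text{AR}}(Y \mid X_B) = \prod_{m=0}^{M-1} p(y_{m+1} \mid \hat Y_{1:m}, X_B)
$$

The only change is the extra conditioning on $\hat Y_{1:m}$, which lets each prediction depend on the relations chosen before it.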

Details

Architecture

image

Object Encoder

Just a standard transformer encoder. It is not clear to me what exactly is fed in as input, though. Possibly just the visual feature map? $X_b$ is the output of the b-th transformer block.

Relationship Decoder

Takes the contextualized object features $X_B\in \mathbb{R}^{N\times D}$ (N appears to be the number of objects and D the embedding dimension) and the relationships predicted up to the previous step, $\hat Y_{1:m}$, and produces the (m+1)-th relationship.

The decoder input is the concatenation of the subject's contextualized embedding, the embedding of the previously predicted relation, and the object's contextualized embedding: $(X_B[i], E[r], X_B[j])$. In other words, it is an unusual setup in which the previous predictions are embedded and fed back in to produce the next one. The concatenation is mapped to D dimensions by an FFN and then passed through self-attention and cross-attention. The very first input is simply a D-dimensional <SOS> token. The cross-attention is computed between $Y_k$, the output of the decoder's self-attention, and $X_B$ from the encoder.

image
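
The construction of the decoder input can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: a single linear layer stands in for the FFN, the weights are random, `<SOS>` is a zero vector instead of a learned one, and the names `triplet_token`, `linear`, `rel_emb`, and `history` are hypothetical.

```python
import random

def linear(vec, weight, bias):
    """Apply y = W @ vec + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]

def triplet_token(x_b, rel_emb, i, r, j, weight, bias):
    """Concatenate (X_B[i], E[r], X_B[j]) and project the 3*D vector back to D dims."""
    concat = x_b[i] + rel_emb[r] + x_b[j]  # list concatenation, length 3 * D
    return linear(concat, weight, bias)    # length D

random.seed(0)
N, R, D = 4, 5, 8  # toy sizes: objects, predicate classes, embedding dimension
x_b = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]      # encoder output X_B
rel_emb = [[random.gauss(0, 1) for _ in range(D)] for _ in range(R)]  # predicate table E
weight = [[random.gauss(0, 0.1) for _ in range(3 * D)] for _ in range(D)]
bias = [0.0] * D
sos = [0.0] * D  # stand-in for the D-dimensional <SOS> token

# Previously predicted triplets as (subject index, predicate index, object index).
history = [(0, 2, 1), (3, 4, 0)]
decoder_input = [sos] + [
    triplet_token(x_b, rel_emb, i, r, j, weight, bias) for i, r, j in history
]
print(len(decoder_input), len(decoder_input[0]))  # sequence length, token dimension
```

Each past triplet becomes one D-dimensional token, so the decoder sequence grows by one token per predicted relationship.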

The output of the last (K-th) decoder layer, $Y_K$, is used to predict the next relationship triplet. Every remaining pair is scored as shown below, and the candidate with the highest softmax probability is selected.

image

$i$ : subject indices, $j$ : object indices
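
The selection step can be sketched as a softmax over all candidates from the remaining pairs, followed by an argmax. The scoring function itself is in the figure and is not reproduced here, so the sketch takes the logits as given. The dictionary layout `{(i, j): [score per predicate]}` and the name `select_next_triplet` are assumptions made for illustration.

```python
import math

def select_next_triplet(logits, used_pairs):
    """Softmax over every (i, r, j) whose pair (i, j) is still unused; return the argmax."""
    candidates = [
        (i, r, j, score)
        for (i, j), row in logits.items()
        if (i, j) not in used_pairs
        for r, score in enumerate(row)
    ]
    if not candidates:
        return None  # nothing left to predict
    top = max(c[3] for c in candidates)               # subtract the max for stability
    exps = [math.exp(c[3] - top) for c in candidates]
    total = sum(exps)
    best = max(range(len(candidates)), key=lambda k: exps[k])
    i, r, j, _ = candidates[best]
    return (i, r, j), exps[best] / total

# Toy logits: three (subject, object) pairs, three predicate classes each.
logits = {
    (0, 1): [0.1, 2.0, -1.0],
    (1, 0): [0.5, 0.2, 0.0],
    (0, 2): [1.5, 0.3, 0.9],
}
used = {(0, 1)}  # this pair already has a predicted relationship
triplet, prob = select_next_triplet(logits, used)
print(triplet)  # (0, 0, 2)
```

Pair (0, 1) holds the largest raw score, but it is excluded because it was already used, so the best remaining candidate is chosen instead.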

Training scheme

  • The order of the triplets is shuffled during training.
  • The loss is originally applied only to positive pairs, but on VRD predicting "no relation" also matters, so negative pairs are added as well. image
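
One way to read the second point is as a cross-entropy term over positive pairs plus a second term that pushes negative pairs toward a "no relation" class. The exact loss is in the figure, so the following is only a sketch under assumptions: class index 0 as "no relation", a mean over each group, and the `neg_weight` factor are all hypothetical choices.

```python
import math

NO_RELATION = 0  # assumed index of the "no relation" class

def cross_entropy(logits, target):
    """Negative log-softmax probability of the target class."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(s - top) for s in logits))
    return log_z - logits[target]

def relation_loss(pair_logits, gt_labels, neg_weight=1.0):
    """Mean CE over positive pairs plus weighted mean CE over negative pairs."""
    pos, neg = [], []
    for pair, logits in pair_logits.items():
        if pair in gt_labels:
            pos.append(cross_entropy(logits, gt_labels[pair]))
        else:
            # Pairs without an annotated relationship are trained toward "no relation".
            neg.append(cross_entropy(logits, NO_RELATION))
    pos_term = sum(pos) / len(pos) if pos else 0.0
    neg_term = sum(neg) / len(neg) if neg else 0.0
    return pos_term + neg_weight * neg_term

pair_logits = {(0, 1): [0.0, 3.0, 0.0], (1, 0): [2.0, 0.0, 0.0]}
gt_labels = {(0, 1): 1}  # only pair (0, 1) has an annotated predicate
loss_pos_only = relation_loss(pair_logits, gt_labels, neg_weight=0.0)
loss_with_neg = relation_loss(pair_logits, gt_labels, neg_weight=1.0)
print(loss_pos_only < loss_with_neg)
```

Setting `neg_weight=0.0` recovers the positive-only variant described as the original loss.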

Reinforcement Learning

  1. During training the input history is taken from the GT (teacher forcing), but at inference it is not.
  2. There is a gap between the cross-entropy loss and recall.

-> So add a reinforcement learning component to decoding. Recall and mRecall tend to move in opposite directions, so the reward is defined with a weighting term alpha.

image

image

Here the action is which candidate to select once logits have been produced for all pairs, and the state is the set of m triplets selected so far. With RL applied, results were better than with greedy decoding.
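
To make the recall versus mRecall trade-off concrete, here is a small sketch of a reward that mixes the two. The exact reward is in the figure; the form `recall + alpha * mean_recall` is an assumption, as are the function names. Recall is computed over all GT triplets without a top-K cutoff, which simplifies the usual R@K metric.

```python
from collections import defaultdict

def recall_and_mean_recall(predicted, ground_truth):
    """Recall over all GT triplets, and recall averaged over predicate classes."""
    if not ground_truth:
        return 0.0, 0.0
    predicted = set(predicted)
    hits, totals = defaultdict(int), defaultdict(int)
    for triplet in ground_truth:
        predicate = triplet[1]
        totals[predicate] += 1
        hits[predicate] += triplet in predicted  # a bool counts as 0 or 1
    recall = sum(hits.values()) / sum(totals.values())
    mean_recall = sum(hits[p] / totals[p] for p in totals) / len(totals)
    return recall, mean_recall

def reward(predicted, ground_truth, alpha=0.5):
    """Assumed reward: recall plus alpha times mean recall."""
    recall, mean_recall = recall_and_mean_recall(predicted, ground_truth)
    return recall + alpha * mean_recall

ground_truth = [
    ("man", "on", "horse"),
    ("hat", "on", "man"),
    ("man", "on", "grass"),
    ("man", "riding", "horse"),
]
predicted = [("man", "on", "horse"), ("hat", "on", "man"), ("man", "on", "grass")]
print(recall_and_mean_recall(predicted, ground_truth))  # (0.75, 0.5)
```

In this toy case the frequent predicate "on" is fully recovered while the rare "riding" is missed, so recall is high and mRecall is low. A larger alpha would favor predictions that cover rare predicates.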

Experiments

image

Qualitative Results

image

Compared with predicting independently, the probability of matching the GT went up.