image

paper , code

TL;DR

  • task : graph representation
  • problem : Graph๋ฅผ ํ‘œํ˜„ํ•˜๊ธฐ ์œ„ํ•ด attention์„ ์‚ฌ์šฉํ•˜๋Š” GAT(Graph Attention Network) ๋ชจ๋ธ๋“ค์ด ์žˆ์ง€๋งŒ, ์‹ค์ œ ๊ทธ๋ž˜ํ”„๋Š” ๋งค์šฐ ํฌ๊ณ  noisyํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ผ๋ฐ˜ํ™”๋˜๊ธฐ ์–ด๋ ต๋‹ค.
  • idea : ์–ด๋–ค edge๊ฐ€ ์ค‘์š”ํ•œ์ง€๋ฅผ ์„ ํƒํ•˜๋Š” attention score๋ฅผ ๋ชจ๋“  ๋ ˆ์ด์–ด์—์„œ ํ•˜๋‚˜๋งŒ ์„ ํƒํ•˜๊ณ  ๋‚˜๋จธ์ง€๋Š” ๋‹ค 0์œผ๋กœ ๋‘์ž.
  • architecture : GAT์ธ๋ฐ attention score๋ฅผ ํ•œ๊ฐœ๋งŒ ์„ ํƒํ•จ.
  • objective : cross entropy loss์™€ edge์˜ ๊ฐœ์ˆ˜์— ๋Œ€ํ•œ loss
  • baseline : GCN, GAT, GraphSage
  • data : Cora, Citesser, Pubmed, Amazon computer, Amazon Photo, PPI, Reddit
  • result : GAT์™€ ์œ ์‚ฌํ•œ ์„ฑ๋Šฅ. layer ๋ณ„๋กœ ํ•˜๋‚˜๋งŒ ์„ ํƒ๋˜๊ธฐ ๋•Œ๋ฌธ์— ๋” ํ•ด์„ ๊ฐ€๋Šฅํ•จ.
  • contribution : ์ตœ์ดˆ๋กœ spareํ•œ GNN์ด ๊ทธ๋ ‡์ง€ ์•Š์€ GNN๋งŒํผ์˜ ์„ฑ๋Šฅ์ด ๋‚˜์˜จ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์ž„
  • limitation or ์ดํ•ด ์•ˆ๋˜๋Š” ๋ถ€๋ถ„ : GNN์ด transductive ํ•˜๋‹ค๋Š”๋ฐ ์ดํ•ด๋ฅผ ํ™•์‹คํžˆ ๋ชปํ•จ. $z_{ij}$๋Š” ์–ด๋–ป๊ฒŒ ํ•™์Šต์ด ๋˜๋Š”๊ฑด์ง€?

Details

GNN in general

  • $G$ = ( $V$, $E$ ) ๊ทธ๋ž˜ํ”„๋Š” node( $V$ )์™€ edge ( $E$ )๋กœ ๊ตฌ์„ฑ

  • node๋“ค์€ feature $X$๋กœ ํ‘œํ˜„๋จ. ์ฐจ์›์€ $N$(๋…ธ๋“œ ๊ฐœ์ˆ˜) x $D$(feature ์ฐจ์›)

  • adjacency matrix $A$๋Š” edge๋กœ ์—ฐ๊ฒฐ๋˜์–ด ์žˆ์œผ๋ฉด 1, ์•„๋‹ˆ๋ฉด 0์ธ matrix์ž„. ์ฐจ์›์€ $N$x$N$ ์šฐ๋ฆฌ ๊ฒฐ๊ตญ ํ•˜๊ณ ์ž ํ•˜๋Š” ๊ฒƒ์€ feature $X$์™€ $A$๋ฅผ ๋ฐ›๊ณ  ๊ฐ•ํ™”๋œ node ํ‘œํ˜„ $H$๋ฅผ ๋งŒ๋“œ๋Š” ๊ฒƒ์ด๋‹ค. ์ด๋•Œ ํ•จ์ˆ˜๋Š” ๋ณดํ†ต ์ด๋ ‡๊ฒŒ ์ •์˜๋œ๋‹ค. image

  • $f$๋Š” ๊ทธ๋ž˜ํ”„๋ฅผ ์ธ์ฝ”๋”ฉ ํ•˜๋Š” ํ•จ์ˆ˜๋‹ค.

  • ์—ฌ๊ธฐ์„œ loss๋Š” node ๋ถ„๋ฅ˜ ํƒœ์Šคํฌ์™€ ๊ฐ™์€ ๊ฑธ ํ•œ๋‹ค๋ฉด cross entropy loss๊ฐ€ ๊ฑธ๋ฆฌ๊ฒŒ ๋œ๋‹ค

  • ๊ฒฐ๊ตญ GNN์˜ ๋‹ค์–‘ํ•œ variants๋“ค์€ $f$๋ฅผ ์–ด๋–ป๊ฒŒ ๊ตฌ์„ฑํ• ๊นŒ?๊ฐ€ ๋ฌธ์ œ๋‹ค.

Neighbor Aggregation Methods

graph learning์„ ํ•  ๋•Œ ๊ฐ€์žฅ ํšจ์œจ์ ์ธ ๋ฐฉ๋ฒ• ์ค‘ ํ•˜๋‚˜๊ฐ€ neighbor aggregation mechanism์ธ๋ฐ, feature vector $x_i$์™€ ๊ทธ neighbor์ธ j๋“ค์— ๋Œ€ํ•ด์„œ feature vector๋ฅผ aggregateํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ๊ฐ€๋ น Graph Convolution Network(GCN)๋„ ๊ทธ ์ข…๋ฅ˜๋“ค ์ค‘ ํ•˜๋‚œ๋ฐ, ์‹์„ ์•„๋ž˜์™€ ๊ฐ™์ด ์“ธ ์ˆ˜ ์žˆ๋‹ค. image image

์ด๊ฑธ ์กฐ๊ธˆ ๋” general ํ•˜๊ฒŒ ์“ฐ๋ฉด ์•„๋ž˜์™€ ๊ฐ™์ด ์“ธ ์ˆ˜ ์žˆ๋‹ค. image

ํ•˜์ง€๋งŒ GCN์€ transductive ํ•˜๊ฒŒ ๋ฐ–์— ๋ชป์“ฐ๋Š”๋ฐ, ๊ทธ๋ž˜ํ”„ ๊ตฌ์กฐ๊ฐ€ ๋ฐ”๋€Œ๋ฉด ์ƒˆ๋กœ ํ•™์Šต์„ ํ•ด์ค˜์•ผํ•œ๋‹ค.

Graph Attention Networks(GAT)

GCN๊ณผ ์œ ์‚ฌํ•˜๊ฒŒ neighborhood๋ฅผ aggregateํ•˜๋Š”๋ฐ, attention ์„ ์‚ฌ์šฉํ•ด์„œ ์–ด๋–ค ege์— ์ง‘์ค‘ํ• ์ง€๋ฅผ attention score๋ฅผ ๊ตฌํ•˜๊ฒŒ ๋˜๋ฉด GAT๊ฐ€ ๋œ๋‹ค. image

๊ทธ๋Ÿฐ๋ฐ ์ด ๊ฒฝ์šฐ์—, attention score๋ฅผ ๊ฐ edge๋ณ„ ์ค‘์š”๋„๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋Š”๋ฐ layer๋งˆ๋‹ค attention score๊ฐ€ ๋‹ฌ๋ผ์ง€๊ธฐ ๋•Œ๋ฌธ์— ํ•ด์„์€ ์–ด๋ ต๋‹ค. ๊ทธ๋ž˜ํ”„๋ฅผ ๋งŒ๋“ค ๋•Œ noisy/task-irrevalentํ•œ edge๋“ค์„ ์ •๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด SGAT๋ฅผ ์ œ์•ˆํ•œ๋‹ค!

SparseGAT(SGAT)

์ค‘์š”ํ•œ edge๋งŒ ๋‚จ๊ธฐ๊ธฐ ์œ„ํ•ด์„œ binary gate $z_{ij}$๋ฅผ ๊ฐ ege ๋ณ„๋กœ ์ถ”๊ฐ€ํ•œ๋‹ค. ์ด $z_{ij}$๋Š” edge $e_{ij}$๋ฅผ ์‚ฌ์šฉํ• ์ง€ ๋ง์ง€์— ๋Œ€ํ•œ bianry masking์„ ํ•˜๊ฒŒ ๋œ๋‹ค. image

์ตœ๋Œ€ํ•œ ์ ์€ edge๋ฅผ ๋‚จ๊ธฐ๊ธฐ ์œ„ํ•ด loss term์— L0 loss๋ฅผ ์ถ”๊ฐ€ํ•œ๋‹ค. $z_{ij}$ ๊ฐ€ 1์ด๋ฉด 1 ์•„๋‹ˆ๋ฉด 0์ธ๊ฑธ sumํ•˜๋Š” term์ด๋‹ค.(edge ๊ฐœ์ˆ˜์— ๋Œ€ํ•œ loss) image

attention based aggregation function์€ ์•„๋ž˜์™€ ๊ฐ™์ด ์“ธ ์ˆ˜ ์žˆ๋Š”๋ฐ (GAT)์™€ ๋‹ค๋ฅธ ๊ฒŒ ์—†์Œ image

์ด๋•Œ attention score๋ฅผ ์•„๋ž˜์™€ ๊ฐ™์ด ๊ตฌํ•œ๋‹ค. image

-> $z_{ij}$๋Š” ์–ด๋–ป๊ฒŒ ํ•™์Šต์ด ๋˜๋Š”๊ฑด์ง€?

์ด๋ ‡๊ฒŒ sparse ํ•˜๊ฒŒ ๊ตฌ์„ฑํ•˜๊ฒŒ ๋œ ์‹œ์ž‘์€ attention score์— ๋Œ€ํ•œ head๋ณ„ layer ๋ณ„ ๋ถ„์‚ฐ์„ ๊ตฌํ•ด๋ดค๋Š”๋ฐ ์•„๋ž˜์™€ ๊ฐ™์ด ๊ฑฐ์˜ 0์— ๊ฐ€๊นŒ์› ๊ธฐ ๋•Œ๋ฌธ์ž„. image