
paper

TL;DR

  • task : efficient Transformer -> Machine Translation, Language Modeling, Representation Learning on Graphs, Image Classification
  • problem : the $O(n^2)$ cost of self-attention is inefficient
  • idea : view the input sequence as a graph and perform attention only over connected nodes
  • architecture : an LSTM predicts target edges given a source node; self-attention is then performed only over the connected edges
  • objective : since ground-truth edges are unavailable, edge prediction is trained with policy gradients, using the downstream performance after self-attention as the reward; self-attention itself is trained with the task-specific loss
  • baseline : Transformer, Sparse Graph Attention Networks, Reformer
  • data : newstest2013 (WMT) for MT, Enwiki8/Text8 for LM, CIFAR100/ImageNet for Image Classification
  • result : performance competitive with SOTA, with greatly reduced memory cost
  • contribution : replacing the Transformer's quadratic attention with graph-based sparse attention
  • limitation or unclear points : training seems quite tricky; wouldn't the LSTM edge prediction introduce significant latency?
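To make the idea concrete, here is a minimal sketch of attention restricted to a predicted edge set, expressed as an adjacency mask. All names (`sparse_attention`, `adj`) are hypothetical; this is a conceptual illustration, not the paper's implementation (a real sparse implementation would avoid materializing the full $n \times n$ score matrix, which is the whole point of the method):

```python
import numpy as np

def sparse_attention(Q, K, V, adj):
    """Single-head attention where node i may attend to j only if adj[i, j] == 1.

    adj would come from the edge predictor (the LSTM in the paper);
    here it is just a hand-made binary matrix.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                         # dense scores, for clarity only
    scores = np.where(adj.astype(bool), scores, -np.inf)  # mask out non-edges
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d = 4, 8
Q, K, V = rng.normal(size=(3, n, d))
adj = np.eye(n, dtype=int)        # self-loops so every row has at least one edge
adj[0, 1] = adj[2, 3] = 1         # a few "predicted" edges
out = sparse_attention(Q, K, V, adj)
print(out.shape)                  # (4, 8)
```

Note that a node with only a self-loop simply copies its own value vector, which makes the masking easy to sanity-check.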

Details
