
paper, code problem : the Transformer's self-attention scales quadratically with sequence length. solution : compute attention with a sliding window (+ a dilated variant) and stack those layers; add global attention on tokens at task-specific positions. result : SOTA on text8 and enwik8; on long-document tasks such as WikiHop and TriviaQA it outperforms RoBERTa and sets a new SOTA. The encoder-decoder variant is shown to be effective on the arXiv summarization dataset. details :

  • windowed local-context self-attention is used to learn contextual representations, while global attention is used to build a representation of the full sequence for prediction.

  • The model is evaluated not only on auto-regressive LM tasks; it is also pretrained with an MLM-style objective and shown to reach SOTA after fine-tuning.

  • An encoder-decoder variant, LED (Longformer-Encoder-Decoder), is also proposed.

  • Two lines of prior work on long-document transformers: 1) left-to-right approaches, which process the document in chunks moving from left to right; these are unstable when transferred to other tasks. 2) sparse-attention approaches, with Sparse Transformer as the representative example.

  • ๊ธด ๋ฌธ์žฅ์„ ๋‹ค๋ฃจ๋Š” ๋Œ€ํ‘œ์ ์ธ ๋ฐฉ๋ฒ•์€ ๋ฌธ์„œ๋ฅผ ์ตœ๋Œ€ ํ† ํฐ ๊ฐœ์ˆ˜์ธ 512๋กœ ์ž๋ฅด๊ฑฐ๋‚˜, ์ž๋ฅธ ๋’ค ๊ฒฐํ•ฉํ•˜๋Š” ๋ฐฉ๋ฒ•์ด ์žˆ๋‹ค. ๋˜๋Š” multihop์ด๋‚˜ open QA์—์„œ ์‚ฌ์šฉ๋˜๋Š” ๋ฐฉ๋ฒ•์ธ๋ฐ, ๋จผ์ € ๊ด€๋ จ์žˆ๋Š” ๋ฌธ์„œ๋ฅผ retrieveํ•˜๊ณ  ๊ทธ ๋’ค์— answer extraction์„ ์œ„ํ•ด ์ „๋‹ฌํ•˜๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค.

  • Attention Pattern

    • Sliding Window : since local context matters most, a fixed-size window attention is used, and stacking these layers (as in a CNN) enlarges the receptive field; with window size w and l layers, the full model's receptive field is w * l. (A mask-building sketch follows this list.)
    • Dilated Sliding Window : to enlarge the receptive field further without additional computation, the sliding window can be dilated (as in a dilated CNN), giving a receptive field of w * l * d. Using a different dilation size d for each head in multi-head attention was found to improve performance.
    • Global Attention : the optimal input format differs by task ([CLS] token for classification, question + document concatenated for QA, etc.), and the windowed attentions above are not flexible enough to learn task-specific representations, so global attention is added on tokens at task-specific positions.
    • Linear Projections for Global Attention : using separate linear projections for sliding-window attention and for global attention helped performance; the global projections are initialized from the sliding-window projections (see the second sketch below).
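
To make the combined pattern concrete, here is a minimal NumPy sketch that builds the boolean attention mask for a (dilated) sliding window plus symmetric global attention; the function and parameter names are my own, not from the released code. For a sense of scale, with n = 4096 and w = 512, full attention scores n² ≈ 16.8M query-key pairs while the sliding window scores only n·w ≈ 2.1M.

```python
import numpy as np

def longformer_style_mask(seq_len, window, dilation=1, global_idx=()):
    """Boolean mask (True = attention allowed): each token attends to
    ~window neighbors spaced `dilation` apart; global tokens attend to
    all positions and all positions attend back to them."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    half = window // 2
    for i in range(seq_len):
        for off in range(-half, half + 1):
            j = i + off * dilation          # dilated sliding window
            if 0 <= j < seq_len:
                mask[i, j] = True
    for g in global_idx:                    # symmetric global attention
        mask[g, :] = True
        mask[:, g] = True
    return mask

# 16 tokens, window of 4, one [CLS]-style global token at position 0
m = longformer_style_mask(16, window=4, global_idx=(0,))
print(int(m.sum()), "of", 16 * 16, "query-key pairs are computed")
```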
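
And a sketch of the two projection sets described in the last item (single-head linear layers for brevity; the class and attribute names are illustrative, not the paper's implementation):

```python
import torch
import torch.nn as nn

class LocalGlobalProjections(nn.Module):
    """Separate Q/K/V projections for sliding-window vs. global attention,
    with the global set initialized as copies of the local set."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.q = nn.Linear(hidden_size, hidden_size)
        self.k = nn.Linear(hidden_size, hidden_size)
        self.v = nn.Linear(hidden_size, hidden_size)
        self.q_global = nn.Linear(hidden_size, hidden_size)
        self.k_global = nn.Linear(hidden_size, hidden_size)
        self.v_global = nn.Linear(hidden_size, hidden_size)
        with torch.no_grad():  # initialize global from local, per the paper
            for loc, glo in [(self.q, self.q_global),
                             (self.k, self.k_global),
                             (self.v, self.v_global)]:
                glo.weight.copy_(loc.weight)
                glo.bias.copy_(loc.bias)
```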