
paper

TL;DR

  • I read this because.. : it was presented at a paper-reading group. An analysis of self-attention's properties, which looked interesting~
  • task : ASR
  • problem : Transformers are used for ASR, but the properties of self-attention there have not been analyzed
  • idea : measure a diagonality metric and compare it across layers / observe the tendency of similar phonemes to attend to each other / probe each layer with a phoneme classification task -> suggests the attention map can be reused
  • architecture : Conformer-M + attention map reuse
  • objective : CTC loss
  • baseline : Conformer-M w/o reuse
  • data : LibriSpeech
  • evaluation :
  • result : 1.96x inference speedup and 33% reduction in training time
  • contribution : first analysis of self-attention in the ASR domain! Could this analysis methodology be applied to other domains as well?
  • limitation / things I cannot understand : not sure about the architecture details

Details

(figure: cumulative attention diagonality)

audio-to-text transition์„ ํ•  ๋•Œ ๊ทผ์ฒ˜์— ์žˆ๋Š” (neighbor) ๊ฒƒ๋“ค์— attendํ•˜๋Š” ๊ฒฝํ–ฅ์ด ์žˆ๋‹ค. -> neighbor์— ๋งŽ์ด attendํ•˜๋ฉด diagonality๊ฐ€ ์ปค์ง ๊ทผ๋ฐ upper layers์—์„œ diagnolatiy๊ฐ€ ์ฒ˜์น˜๋ฏ€๋กœ ์œ„์˜ ๋ ˆ์ด์–ด์—์„œ linguistic์„ ๋ณด๊ณ  ์žˆ์Œ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค

So what do the lower layers handle? Phonemes — which you can see from the two figures below. (figure)

Looking at the attention map at the phoneme level, the tendency of similar-sounding phonemes to attend to each other does not appear in the upper layers.

(figure: equation for measuring the attention map at the phoneme level)
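A minimal sketch of that phoneme-level aggregation, with my own names and shapes (not the paper's exact formula): assuming each time frame carries a phoneme label (e.g. from forced alignment), average the attention that frames of phoneme p put on frames of phoneme q.

```python
import numpy as np

def phoneme_attention(attn, labels, num_phonemes):
    """Aggregate a frame-level attention map (T, T) into a
    phoneme-to-phoneme matrix (P, P): entry [p, q] is the average
    attention a frame labeled p puts on a frame labeled q.
    """
    T = attn.shape[0]
    agg = np.zeros((num_phonemes, num_phonemes))
    cnt = np.zeros((num_phonemes, num_phonemes))
    for i in range(T):
        for j in range(T):
            agg[labels[i], labels[j]] += attn[i, j]
            cnt[labels[i], labels[j]] += 1
    return agg / np.maximum(cnt, 1)  # avoid division by zero
```

Off-diagonal hot spots in the resulting (P, P) matrix are exactly the "similar phonemes attend to each other" pattern of the lower layers.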


์•„๋ž˜ ๋ ˆ์ด์–ด๋“ค์ด Phoneme classification๋ฅผ ๋” ์ž˜ํ•จ. ์œ„์˜ ๋ ˆ์ด์–ด ๊ฐ€์„œ ์„ฑ๋Šฅ์ด ์•ˆ์ข‹์•„์ง.

์ด๋Ÿฌํ•œ ๋ฐœ๊ฒฌ๋“ค์„ ๊ธฐ๋ฐ˜์œผ๋กœ SA๋ฅผ ์žฌ์‚ฌ์šฉํ•˜๋Š” ์•„ํ‚คํ…์ณ๋ฅผ ์ œ์•ˆํ•œ๋‹ค. image

Attention map reuse is not first proposed here — it already existed on the NLP side — but there was no analysis of why it can be reused. This paper does that analysis, which is what makes it meaningful.


V๋งŒ ๋ ˆ์ด์–ด๋ณ„๋กœ ์ƒˆ๋กœ project๋˜๋Š” ๊ผด Sharing Attention Weights for Fast Transformer

https://arxiv.org/pdf/1906.11024.pdf
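A single-head numpy sketch of that reuse idea (class and argument names are mine, not the paper's module): when a cached attention map is passed in, the layer skips the Q/K projections and the softmax entirely, and only the V projection runs.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class ReuseSelfAttention:
    """Single-head self-attention that can reuse an attention map
    from an earlier layer, so only V is projected per layer.
    Sketch under assumed shapes, not the paper's exact module."""
    def __init__(self, dim):
        self.Wq = rng.standard_normal((dim, dim)) / dim ** 0.5
        self.Wk = rng.standard_normal((dim, dim)) / dim ** 0.5
        self.Wv = rng.standard_normal((dim, dim)) / dim ** 0.5
        self.scale = dim ** -0.5

    def __call__(self, x, reuse_attn=None):
        if reuse_attn is None:
            q, k = x @ self.Wq, x @ self.Wk
            attn = softmax(q @ k.T * self.scale)
        else:
            attn = reuse_attn          # skip Q/K projections and softmax
        return attn @ (x @ self.Wv), attn

# compute the attention map once, reuse it in the next layer
layer1, layer2 = ReuseSelfAttention(8), ReuseSelfAttention(8)
x = rng.standard_normal((5, 8))        # (time, dim)
y, attn = layer1(x)
z, _ = layer2(y, reuse_attn=attn)      # no new Q/K computation here
```

The speedup comes from dropping two of the three projections plus the T×T score matrix and softmax in every reusing layer.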

c.f. Conformer block: FFN + SA + conv + FFN (macaron-style) (figure)