[110] Understanding the Role of Self Attention for Efficient Speech Recognition

TL;DR

I read this because.. : It was presented at our paper-reading group. It analyzes the properties of self-attention (SA), so it looked fun~
task : ASR
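problem : Transformers are used for ASR, but the properties of their self-attention have not been analyzed
idea : define a measure of attention diagonality and compare it across layers / observe that similar phonemes tend to attend to each other / run a phoneme classification task per layer -> attention maps can probably be reused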
architecture : Conformer-M + attention map reuse
objective : CTC loss
baseline : Conformer-M w/o reuse
data : LibriSpeech
evaluation :
result : 1.96x inference speedup and 33% shorter training time
contribution : first analysis of SA in the ASR domain! Could this analysis methodology be applied to other domains too?
limitation / things I cannot understand : I don't know the details of the architecture
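
As a rough illustration of the "attention map reuse" idea in the architecture line, here is a minimal NumPy sketch. It is not the paper's Conformer-M implementation: it assumes a single head, no layer norm, no convolution or feed-forward modules, and the names `attention_map` and `reuse_group` are hypothetical. The only point is that the softmax(QK^T) map is computed once and shared across a group of layers, while each layer keeps its own value projection.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_map(x, w_q, w_k):
    # Standard scaled dot-product attention weights, shape (T, T).
    q, k = x @ w_q, x @ w_k
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

def reuse_group(x, w_q, w_k, w_vs):
    # Hypothetical "reuse group": the attention map is computed once from the
    # group's input and reused by every layer in the group. Each layer still
    # applies its own value projection w_v plus a residual connection.
    attn = attention_map(x, w_q, w_k)
    for w_v in w_vs:
        x = x + attn @ (x @ w_v)
    return x, attn

# Toy usage: 5 frames, 4-dim features, 3 layers sharing one attention map.
rng = np.random.default_rng(0)
T, d = 5, 4
x = rng.normal(size=(T, d))
w_q, w_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
w_vs = [rng.normal(size=(d, d)) for _ in range(3)]
out, attn = reuse_group(x, w_q, w_k, w_vs)
```

In this sketch the query/key projections and the softmax run once instead of three times, which is where a speedup would come from; how the paper groups layers is not covered in these notes.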

Details

cumulative attention diagonality

In the audio-to-text transition, attention tends to go to nearby (neighbor) frames. -> The more a layer attends to neighbors, the higher its diagonality. But diagonality drops in the upper layers, which suggests that the upper layers are looking at linguistic information.
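
To make "attending to neighbors gives high diagonality" concrete, here is a minimal sketch of one way such a score could be computed. This is an assumption for illustration, not necessarily the paper's exact definition: for each normalized range r in [0, 1] it measures how much attention mass falls within distance r of the diagonal, then averages over r. The function name and the `num_ranges` parameter are hypothetical.

```python
import numpy as np

def cumulative_attention_diagonality(attn, num_ranges=50):
    # attn: (T, T) attention map whose rows sum to 1.
    # Returns a score in [0, 1]; higher means more mass near the diagonal.
    attn = np.asarray(attn, dtype=float)
    T = attn.shape[0]
    if T < 2:
        return 1.0
    idx = np.arange(T)
    # Normalized distance |i - j| / (T - 1) between query i and key j.
    dist = np.abs(idx[:, None] - idx[None, :]) / (T - 1)
    ranges = np.linspace(0.0, 1.0, num_ranges)
    # For each range r: attention mass within distance r, averaged over queries.
    mass = [(attn * (dist <= r)).sum(axis=1).mean() for r in ranges]
    # Averaging over r approximates the area under the mass-vs-range curve.
    return float(np.mean(mass))

# Toy comparison: a purely diagonal map vs. a uniform map.
diag_score = cumulative_attention_diagonality(np.eye(6))
uniform_score = cumulative_attention_diagonality(np.full((6, 6), 1 / 6))
```

Under this sketch a purely diagonal map scores higher than a uniform one, so a layer whose score drops is spreading attention over distant frames, matching the reading of the upper layers above.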

So what do the lower layers handle? Phonemes, which can be seen from the two figures below.