
TL;DR
- I read this because : it was presented at a paper meeting, and a characterization of self-attention (SA) looked interesting~
- task : ASR
- problem : transformers are widely used in ASR, but there is little analysis of what self-attention actually learns there.
- idea : measure attention diagonality per layer / check whether similar phonemes tend to attend to each other / run a per-layer phoneme classification probe -> conclude that attention maps can be reused across layers
- architecture : Conformer-M + attention map reuse
- objective : CTC loss
- baseline : Conformer-M w/o reuse
- data : LibriSpeech
- evaluation :
- result : 1.96× inference speedup and 33% reduction in training time
- contribution : First SA analysis in ASR! Can this analysis methodology be applied to other domains?
- limitation / things I cannot understand : I do not know the details of the architecture
Details

- cumulative attention diagonality


When mapping audio to text, a frame mostly attends to its neighbors, and attending heavily to neighbors raises diagonality. Since high diagonality appears in the upper layers, the upper layers seem to be doing linguistic processing.
The lower layers, in turn, are responsible for phonetic information, as the two figures below show.
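One way to operationalize this diagonality measure is to take the fraction of attention mass within ±k frames of the diagonal. This is my own sketch, not necessarily the paper's exact formula:

```python
import numpy as np

def cumulative_diagonality(attn, max_offset=None):
    """Fraction of total attention mass within +/-k frames of the diagonal.

    attn: (T, T) row-stochastic attention map.  Returns an array whose
    entry k is the cumulative mass inside the band of half-width k.
    """
    T = attn.shape[0]
    if max_offset is None:
        max_offset = T - 1
    # offsets[i, j] = |i - j|, the distance of each cell from the diagonal.
    offsets = np.abs(np.arange(T)[:, None] - np.arange(T)[None, :])
    return np.array([attn[offsets <= k].sum() / T
                     for k in range(max_offset + 1)])

# A perfectly diagonal (identity) map has all its mass at offset 0,
# while a uniform map spreads it evenly.
print(cumulative_diagonality(np.eye(4))[0])              # 1.0
print(cumulative_diagonality(np.full((4, 4), 0.25))[0])  # 0.25
```

Under this definition, the curve for a highly diagonal layer saturates at small k, which matches the "attends to neighbors" picture above.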

Looking at the attention maps phoneme by phoneme, the tendency for similar-sounding phonemes to attend to each other does not show up in the upper layers.
(The paper gives a formula that aggregates attention weights per phoneme to measure this.)
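A sketch of what such a phoneme-level aggregation might look like — my guess at the shape of the metric, not the paper's exact formula; `phoneme_pair_attention` and its averaging scheme are my own:

```python
import numpy as np

def phoneme_pair_attention(attn, labels, num_phonemes):
    """Mean attention weight from frames labeled phoneme p to frames
    labeled phoneme q.  attn: (T, T) attention map; labels: (T,)
    frame-level phoneme ids.  Returns M with M[p, q] = average
    attention from p-frames to q-frames."""
    M = np.zeros((num_phonemes, num_phonemes))
    C = np.zeros((num_phonemes, num_phonemes))
    T = len(labels)
    for i in range(T):
        for j in range(T):
            M[labels[i], labels[j]] += attn[i, j]
            C[labels[i], labels[j]] += 1
    # Average over the number of (p, q) frame pairs actually observed.
    return M / np.maximum(C, 1)

# Uniform attention over 3 frames labeled [0, 0, 1]: every phoneme pair
# ends up with the same average weight of 1/3.
M = phoneme_pair_attention(np.full((3, 3), 1 / 3), np.array([0, 0, 1]), 2)
```

With a metric like this, "similar phonemes attend to each other" would show up as large off-diagonal entries M[p, q] for acoustically similar pairs (p, q).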

Lower layers perform better on a layer-wise phoneme classification probe; upper layers perform worse.
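A toy stand-in for such a layer-wise probe — the paper's actual classifier setup isn't in my notes, so this uses a nearest-centroid classifier of my own choosing over per-layer frame features:

```python
import numpy as np

def centroid_probe_accuracy(feats, labels):
    """Nearest-class-centroid probe accuracy (a crude stand-in for a
    learned phoneme classifier).  feats: (N, d) frame features from one
    layer; labels: (N,) frame-level phoneme ids."""
    classes = np.unique(labels)
    centroids = np.stack([feats[labels == c].mean(axis=0) for c in classes])
    # Distance from every frame to every class centroid: (N, num_classes).
    dists = np.linalg.norm(feats[:, None, :] - centroids[None, :, :], axis=-1)
    preds = classes[dists.argmin(axis=1)]
    return (preds == labels).mean()

# Two well-separated clusters are classified perfectly.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 0.1, (20, 4)), rng.normal(5, 0.1, (20, 4))])
labels = np.repeat([0, 1], 20)
print(centroid_probe_accuracy(feats, labels))  # 1.0
```

Running a probe like this on each layer's features would reproduce the qualitative finding: accuracy should be higher where phonetic information is best preserved.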
Based on these findings, the authors propose an architecture that reuses self-attention maps across layers.

Attention map reuse was not first proposed here but in NLP; that earlier work, however, did not analyze why reuse works. This paper supplies the analysis, which makes the idea convincing.

Only V is newly projected per layer, as in “Sharing Attention Weights for Fast Transformer”:
https://arxiv.org/pdf/1906.11024.pdf
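A minimal single-head sketch of the reuse idea (my own illustration; residuals, layer norm, multi-head splitting, and Conformer's convolution modules are all omitted):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_sublayer(x, Wq, Wk, Wv, reuse_map=None):
    """One self-attention sub-layer.  When reuse_map (an attention map
    precomputed by a lower layer) is given, the Q/K projections and the
    softmax are skipped entirely; only the V projection is computed."""
    if reuse_map is None:
        scores = (x @ Wq) @ (x @ Wk).T / np.sqrt(x.shape[-1])
        reuse_map = softmax(scores)
    return reuse_map @ (x @ Wv), reuse_map

rng = np.random.default_rng(0)
T, d = 5, 8
x = rng.normal(size=(T, d))
Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Wv1, Wv2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))

y1, amap = attention_sublayer(x, Wq, Wk, Wv1)           # layer i: full attention
y2, _ = attention_sublayer(y1, None, None, Wv2, amap)   # layer i+1: reuses the map
```

The second call does no Q/K matmuls and no softmax, which is where the inference speedup comes from when several consecutive layers share one map.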
c.f. Conformer
Each block is (half-step) FFN + SA + conv module + (half-step) FFN, macaron-style.
