
TL;DR
- I read this because : it was presented at a paper meeting, and a characterization of self-attention (SA) looked interesting~
- task : ASR
- problem : transformers are widely used in ASR, but there is little analysis of what self-attention actually learns there.
- idea : measure attention diagonality per layer / check whether similar phonemes tend to attend to each other / run a per-layer phoneme classification probe -> conclude that attention maps can be reused across layers
- architecture : Conformer-M + attention map reuse
- objective : CTC loss
- baseline : Conformer-M w/o reuse
- data : LibriSpeech
- evaluation :
- result : 1.96× inference speedup and 33% reduction in training time
- contribution : First SA analysis in ASR! Can this analysis methodology be applied to other domains?
- limitation / things I cannot understand : I do not know the details of the architecture
Details

- cumulative attention diagonality


When mapping audio to text, a frame mostly attends to its neighbors, and attending heavily to neighbors raises diagonality. Since high diagonality appears in the upper layers, the upper layers seem to be doing linguistic processing.
The lower layers, in turn, are responsible for phonetic information, as the two figures below show.
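One way to operationalize this diagonality measure is to take the fraction of attention mass within ±k frames of the diagonal. This is my own sketch, not necessarily the paper's exact formula:

```python
import numpy as np

def cumulative_diagonality(attn, max_offset=None):
    """Fraction of total attention mass within +/-k frames of the diagonal.

    attn: (T, T) row-stochastic attention map.  Returns an array whose
    entry k is the cumulative mass inside the band of half-width k.
    """
    T = attn.shape[0]
    if max_offset is None:
        max_offset = T - 1
    # offsets[i, j] = |i - j|, the distance of each cell from the diagonal.
    offsets = np.abs(np.arange(T)[:, None] - np.arange(T)[None, :])
    return np.array([attn[offsets <= k].sum() / T
                     for k in range(max_offset + 1)])

# A perfectly diagonal (identity) map has all its mass at offset 0,
# while a uniform map spreads it evenly.
print(cumulative_diagonality(np.eye(4))[0])              # 1.0
print(cumulative_diagonality(np.full((4, 4), 0.25))[0])  # 0.25
```

Under this definition, the curve for a highly diagonal layer saturates at small k, which matches the "attends to neighbors" picture above.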

Looking at the attention maps phoneme by phoneme, the tendency for similar-sounding phonemes to attend to each other does not show up in the upper layers.
(The paper gives a formula that aggregates attention weights per phoneme to measure this.)
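A sketch of what such a phoneme-level aggregation might look like — my guess at the shape of the metric, not the paper's exact formula; `phoneme_pair_attention` and its averaging scheme are my own:

```python
import numpy as np

def phoneme_pair_attention(attn, labels, num_phonemes):
    """Mean attention weight from frames labeled phoneme p to frames
    labeled phoneme q.  attn: (T, T) attention map; labels: (T,)
    frame-level phoneme ids.  Returns M with M[p, q] = average
    attention from p-frames to q-frames."""
    M = np.zeros((num_phonemes, num_phonemes))
    C = np.zeros((num_phonemes, num_phonemes))
    T = len(labels)
    for i in range(T):
        for j in range(T):
            M[labels[i], labels[j]] += attn[i, j]
            C[labels[i], labels[j]] += 1
    # Average over the number of (p, q) frame pairs actually observed.
    return M / np.maximum(C, 1)

# Uniform attention over 3 frames labeled [0, 0, 1]: every phoneme pair
# ends up with the same average weight of 1/3.
M = phoneme_pair_attention(np.full((3, 3), 1 / 3), np.array([0, 0, 1]), 2)
```

With a metric like this, "similar phonemes attend to each other" would show up as large off-diagonal entries M[p, q] for acoustically similar pairs (p, q).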

Lower layers perform better on a layer-wise phoneme classification probe; upper layers perform worse.
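A toy stand-in for such a layer-wise probe — the paper's actual classifier setup isn't in my notes, so this uses a nearest-centroid classifier of my own choosing over per-layer frame features:

```python
import numpy as np

def centroid_probe_accuracy(feats, labels):
    """Nearest-class-centroid probe accuracy (a crude stand-in for a
    learned phoneme classifier).  feats: (N, d) frame features from one
    layer; labels: (N,) frame-level phoneme ids."""
    classes = np.unique(labels)
    centroids = np.stack([feats[labels == c].mean(axis=0) for c in classes])
    # Distance from every frame to every class centroid: (N, num_classes).
    dists = np.linalg.norm(feats[:, None, :] - centroids[None, :, :], axis=-1)
    preds = classes[dists.argmin(axis=1)]
    return (preds == labels).mean()

# Two well-separated clusters are classified perfectly.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 0.1, (20, 4)), rng.normal(5, 0.1, (20, 4))])
labels = np.repeat([0, 1], 20)
print(centroid_probe_accuracy(feats, labels))  # 1.0
```

Running a probe like this on each layer's features would reproduce the qualitative finding: accuracy should be higher where phonetic information is best preserved.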
Based on these findings, the authors propose an architecture that reuses self-attention maps across layers.

Attention map reuse was not first proposed here but in NLP; that earlier work, however, did not analyze why reuse works. This paper supplies the analysis, which makes the idea convincing.

Only V is newly projected per layer, as in “Sharing Attention Weights for Fast Transformer”:
https://arxiv.org/pdf/1906.11024.pdf
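A minimal single-head sketch of the reuse idea (my own illustration; residuals, layer norm, multi-head splitting, and Conformer's convolution modules are all omitted):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_sublayer(x, Wq, Wk, Wv, reuse_map=None):
    """One self-attention sub-layer.  When reuse_map (an attention map
    precomputed by a lower layer) is given, the Q/K projections and the
    softmax are skipped entirely; only the V projection is computed."""
    if reuse_map is None:
        scores = (x @ Wq) @ (x @ Wk).T / np.sqrt(x.shape[-1])
        reuse_map = softmax(scores)
    return reuse_map @ (x @ Wv), reuse_map

rng = np.random.default_rng(0)
T, d = 5, 8
x = rng.normal(size=(T, d))
Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Wv1, Wv2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))

y1, amap = attention_sublayer(x, Wq, Wk, Wv1)           # layer i: full attention
y2, _ = attention_sublayer(y1, None, None, Wv2, amap)   # layer i+1: reuses the map
```

The second call does no Q/K matmuls and no softmax, which is where the inference speedup comes from when several consecutive layers share one map.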
c.f. Conformer
Each block is (half-step) FFN + SA + conv module + (half-step) FFN, macaron-style.
