
paper

TL;DR

  • I read this because: it was presented at a thesis meeting, and a characterization of self-attention (SA) looked interesting~.
  • task : ASR
  • problem : Transformers are widely used in ASR, but the behavior of self-attention there has not been analyzed.
  • idea : measure the diagonality of attention maps layer by layer / observe the tendency of similar phonemes to attend to each other / run a phoneme classification task per layer -> attention maps could be reused across layers
  • architecture : Conformer-M + attention map reuse
  • objective : CTC loss
  • baseline : Conformer-M w/o reuse
  • data : LibriSpeech
  • evaluation :
  • result : 1.96× inference speedup and 33% reduction in training time
  • contribution : first analysis of SA in ASR! Could this analysis methodology be applied to other domains?
  • limitation / things I did not understand : the details of the architecture

Details

  • cumulative attention diagonality (figures)

When mapping audio to text, each frame tends to attend to its neighbors, and attending heavily to neighbors increases diagonality. Since diagonality shows up in the upper layers, we can infer that the upper layers are handling linguistic information.
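The diagonality idea above can be sketched as a simple metric: the fraction of attention mass that falls within a small band around the diagonal. This is my own approximation, not the paper's exact formula (the paper uses a cumulative, weighted version).

```python
import numpy as np

def diagonality(attn: np.ndarray, band: int = 2) -> float:
    """Fraction of attention mass within `band` frames of the diagonal.

    attn: (T, T) row-stochastic attention map (each row sums to 1).
    Higher values mean each frame mostly attends to its temporal
    neighbors, i.e. the map is more diagonal.
    """
    T = attn.shape[0]
    rows, cols = np.indices((T, T))
    near_diag = np.abs(rows - cols) <= band
    return float(attn[near_diag].sum() / attn.sum())

# A perfectly diagonal (identity) map scores 1.0; a uniform map scores low.
identity = np.eye(8)
uniform = np.full((8, 8), 1 / 8)
```

Computing this per layer and comparing the values is enough to reproduce the qualitative trend the paper reports (upper layers more diagonal).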

The lower layers, in turn, are responsible for phonemes, as the two figures below show. image

Looking at the attention maps phoneme by phoneme, the tendency for similar pronunciations to attend to each other does not show up in the upper layers.

(formula for measuring the attention map per phoneme: image)

image

The lower layers are better at phoneme classification; the upper layers perform worse.
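The layer-wise comparison above can be sketched with a simple probe: freeze each layer's frame features and fit a cheap classifier on phoneme labels. A minimal stand-in (my own sketch, not the paper's classifier) is a nearest-class-centroid probe:

```python
import numpy as np

def probe_accuracy(features: np.ndarray, labels: np.ndarray) -> float:
    """Nearest-class-centroid probe on frozen per-layer features.

    features: (N, D) frame features from one layer; labels: (N,) phoneme ids.
    A crude stand-in for training a linear phoneme classifier per layer:
    higher accuracy means the layer's features separate phonemes better.
    """
    classes = np.unique(labels)
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])
    # Assign each frame to the class with the nearest centroid (Euclidean).
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    preds = classes[dists.argmin(axis=1)]
    return float((preds == labels).mean())

# Synthetic sanity check: two well-separated phoneme clusters.
rng = np.random.default_rng(0)
feats = np.concatenate([rng.normal(0, 0.1, (20, 4)), rng.normal(5, 0.1, (20, 4))])
labels = np.array([0] * 20 + [1] * 20)
```

Running such a probe on every layer's output and plotting accuracy against depth reproduces the shape of the paper's finding: phoneme information peaks in the lower layers.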

Based on these findings, the authors propose an architecture that reuses SA maps. image

Attention map reuse was not first proposed here; it originated in NLP, but that work did not analyze why reuse works. This paper supplies the analysis, which makes the design convincing.

image

Only V is newly projected per layer (cf. "Sharing Attention Weights for Fast Transformer")
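The reuse scheme can be sketched in a few lines: a layer that is handed an earlier layer's attention map skips the Q/K projections and the softmax entirely and only projects V, which is where the speedup comes from. This is a hypothetical single-head sketch of the idea, not the paper's Conformer implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_layer(x, Wq, Wk, Wv, reuse_map=None):
    """Single-head self-attention with optional attention map reuse.

    x: (T, D) input frames. When reuse_map (an earlier layer's (T, T)
    attention map) is given, Q/K projections and the softmax are skipped;
    only the value projection V = x @ Wv is computed.
    """
    V = x @ Wv
    if reuse_map is None:
        Q, K = x @ Wq, x @ Wk
        reuse_map = softmax(Q @ K.T / np.sqrt(Wq.shape[1]))
    return reuse_map @ V, reuse_map

# Layer i computes the map; layer i+1 reuses it with its own V projection.
rng = np.random.default_rng(1)
x = rng.normal(size=(6, 4))
Wq, Wk, Wv1, Wv2 = (rng.normal(size=(4, 4)) for _ in range(4))
out1, amap = attention_layer(x, Wq, Wk, Wv1)
out2, amap2 = attention_layer(x, None, None, Wv2, reuse_map=amap)
```

The reusing layer still gets a layer-specific output because its value projection differs, even though the mixing weights are shared.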

https://arxiv.org/pdf/1906.11024.pdf

cf. Conformer block: FFN + SA + conv + FFN image