[110] Understanding the Role of Self Attention for Efficient Speech Recognition 2022Q1 ICLR 25min transformer
[89] Relational Attention: Generalizing Transformers for Graph-Structured Tasks microsoft graph 2022Q4 transformer
[71] Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers 25min sparse 2022Q4 transformer