
paper

TL;DR

  • task : efficient Transformer -> machine translation, language modeling, representation learning on graphs, image classification
  • problem : the $O(n^2)$ cost of self-attention is inefficient
  • IDEA : treat the input sequence as a graph and perform attention only between connected nodes.
  • architecture : an LSTM predicts target edges given source nodes; self-attention is then performed only over the predicted edges
  • objective : since the ground-truth edges are unknown, the edge predictor is trained with a policy gradient that rewards performance after self-attention. The self-attention layers are trained with each task's own loss.
  • baseline : Transformer, Sparse Graph Attention Networks, Reformer
  • data : newstest2013 (WMT), enwik8/text8 (LM), CIFAR-100/ImageNet (image classification)
  • result : performance comparable to SOTA at much lower memory cost.
  • contribution : replacing quadratic self-attention with graph-based sparse attention in the Transformer.
  • Limitations or things I don't understand : training seems really tricky. Wouldn't the LSTM introduce a lot of latency when predicting edges?
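To make the core idea concrete, here is a minimal sketch (my own toy implementation, not the paper's code) of attention restricted to a given edge set: disallowed query-key pairs are masked to $-\infty$ before the softmax, so each node only attends to itself and its predicted neighbors.

```python
import numpy as np

def sparse_attention(Q, K, V, edges):
    """Q, K, V: (n, d) arrays; edges: iterable of (src, dst) node pairs.
    Each node attends only to itself plus its predicted neighbors."""
    n, d = Q.shape
    mask = np.full((n, n), -np.inf)
    np.fill_diagonal(mask, 0.0)               # self-loop keeps every row valid
    for src, dst in edges:
        mask[src, dst] = 0.0
    scores = Q @ K.T / np.sqrt(d) + mask      # disallowed pairs stay at -inf
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)                  # exp(-inf) = 0: masked pairs drop out
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d = 5, 4
Q, K, V = rng.standard_normal((3, n, d))
out = sparse_attention(Q, K, V, edges=[(0, 1), (1, 2), (3, 4)])
```

With only the listed edges plus self-loops, the cost scales with the number of edges rather than $n^2$.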
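The objective bullet above (policy gradient over unknown edges) can be sketched with plain REINFORCE. This is a hypothetical toy setup of my own, not the paper's training code: an edge predictor samples a target node for a fixed source node, and the sampled edge's log-probability is reinforced by a stand-in downstream reward.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4                          # number of candidate target nodes
logits = np.zeros(n)           # edge predictor's scores for node 0's target

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reward(dst):
    # Stand-in for "task performance after self-attention";
    # pretend node 3 is the genuinely useful edge.
    return 1.0 if dst == 3 else 0.0

lr = 0.5
for _ in range(200):
    probs = softmax(logits)
    dst = rng.choice(n, p=probs)       # sample an edge (0 -> dst)
    # REINFORCE: grad of log pi(dst) w.r.t. logits = one_hot(dst) - probs
    grad_logp = -probs
    grad_logp[dst] += 1.0
    logits += lr * reward(dst) * grad_logp

best = int(np.argmax(softmax(logits)))
```

After a few hundred samples the predictor concentrates its probability on the edge that yields reward, which is the mechanism the note describes for learning edges without ground truth.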

Details

image

image