task : efficient Transformer -> Machine Translation, Language Modeling, Representation Learning on Graphs, Image Classification
problem : the $O(n^2)$ cost of self-attention over a length-$n$ sequence is inefficient for long inputs
IDEA : represent the input sequence as a graph and perform attention only between connected nodes, reducing cost from all $n^2$ pairs to the number of edges.
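The idea above can be sketched as masked self-attention, where an adjacency matrix zeroes out attention between disconnected nodes. This is a minimal single-head sketch for illustration, not the exact method; function and variable names are my own.

```python
import numpy as np

def sparse_attention(x, adj):
    """Self-attention restricted to connected nodes.

    x:   (n, d) token embeddings
    adj: (n, n) boolean adjacency; node i may attend to j only if adj[i, j].
    """
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)            # (n, n) dot-product scores
    scores = np.where(adj, scores, -1e9)     # mask out disconnected pairs
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)    # row-wise softmax over neighbors
    return w @ x                             # (n, d) attended outputs
```

With a dense computation this is still $O(n^2)$; a real implementation would iterate only over the edge list, but the mask makes the restriction explicit.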
architecture : an LSTM predicts the target node for each source node (forming the edges), then self-attention is performed only along the predicted edges
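A minimal sketch of the edge predictor: an LSTM cell consumes the source nodes in order and, at each step, scores every candidate target node. All shapes, initializations, and names here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

class EdgePredictor:
    """LSTM cell that scores candidate target nodes for each source node."""

    def __init__(self, d, n):
        self.W = rng.normal(0, 0.1, (4 * d, 2 * d))  # gates i, f, o, g over [x; h]
        self.b = np.zeros(4 * d)
        self.out = rng.normal(0, 0.1, (n, d))        # scores h against n targets
        self.d = d

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, o, g = np.split(z, 4)
        sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # update cell state
        h = sigmoid(o) * np.tanh(c)                   # new hidden state
        return h, c

    def scores(self, nodes):
        """Run over source nodes in order; return (len(nodes), n) edge logits."""
        h = np.zeros(self.d)
        c = np.zeros(self.d)
        logits = []
        for x in nodes:
            h, c = self.step(x, h, c)
            logits.append(self.out @ h)  # logit for every candidate target
        return np.stack(logits)
```

Taking the argmax (or a sample) of each row of the logits yields one predicted edge per source node, which defines the adjacency used by the sparse attention.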
objective : since ground-truth edges are unknown, train the edge predictor with a policy gradient whose reward is the downstream performance measured after self-attention; train the self-attention itself with the task-specific loss.
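The policy-gradient step can be sketched with REINFORCE: sample one target edge per source node from a softmax over the edge logits, observe the scalar task reward after running attention on the sampled edges, and form the gradient of $-R \log p(\text{edges})$. This is a minimal sketch under my own assumptions (softmax policy, no baseline); the actual estimator may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_step(logits, reward):
    """One REINFORCE update for discrete edge choices.

    logits: (n, n) edge scores from the edge predictor
    reward: scalar downstream reward observed after sparse attention
    Returns sampled target indices and d(-reward * log p)/d logits.
    """
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    p = p / p.sum(axis=-1, keepdims=True)              # softmax edge policy
    targets = np.array([rng.choice(len(row), p=row) for row in p])
    onehot = np.eye(logits.shape[1])[targets]
    grad = reward * (p - onehot)   # softmax log-prob gradient, scaled by reward
    return targets, grad
```

In training, `grad` would be backpropagated into the LSTM edge predictor, while the attention weights are updated separately through the ordinary task loss.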