
paper

TL;DR

  • task : efficient Transformer -> machine translation, language modeling, representation learning on graphs, image classification
  • problem : the $O(n^2)$ cost of self-attention is inefficient
  • IDEA : treat the input sequence as a graph and perform attention only between connected nodes.
  • architecture : an LSTM predicts target edges given source nodes; self-attention is then performed only over the predicted edges
  • objective : since the ground-truth edges are unknown, the edge predictor is trained with a policy gradient that rewards performance after self-attention. The self-attention layers are trained with each task's own loss.
  • baseline : Transformer, Sparse Graph Attention Networks, Reformer
  • data : newstest2013 (WMT), enwik8/text8 (LM), CIFAR-100/ImageNet (image classification)
  • result : performance comparable to SOTA at much lower memory cost.
  • contribution : replacing quadratic self-attention with graph-based sparse attention in the Transformer.
  • Limitations or things I don't understand : training seems really tricky. Wouldn't the LSTM introduce a lot of latency when predicting edges?
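To make the core idea concrete, here is a minimal sketch (my own toy implementation, not the paper's code) of attention restricted to a given edge set: disallowed query-key pairs are masked to $-\infty$ before the softmax, so each node only attends to itself and its predicted neighbors.

```python
import numpy as np

def sparse_attention(Q, K, V, edges):
    """Q, K, V: (n, d) arrays; edges: iterable of (src, dst) node pairs.
    Each node attends only to itself plus its predicted neighbors."""
    n, d = Q.shape
    mask = np.full((n, n), -np.inf)
    np.fill_diagonal(mask, 0.0)               # self-loop keeps every row valid
    for src, dst in edges:
        mask[src, dst] = 0.0
    scores = Q @ K.T / np.sqrt(d) + mask      # disallowed pairs stay at -inf
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)                  # exp(-inf) = 0: masked pairs drop out
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d = 5, 4
Q, K, V = rng.standard_normal((3, n, d))
out = sparse_attention(Q, K, V, edges=[(0, 1), (1, 2), (3, 4)])
```

With only the listed edges plus self-loops, the cost scales with the number of edges rather than $n^2$.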
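The objective bullet above (policy gradient over unknown edges) can be sketched with plain REINFORCE. This is a hypothetical toy setup of my own, not the paper's training code: an edge predictor samples a target node for a fixed source node, and the sampled edge's log-probability is reinforced by a stand-in downstream reward.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4                          # number of candidate target nodes
logits = np.zeros(n)           # edge predictor's scores for node 0's target

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reward(dst):
    # Stand-in for "task performance after self-attention";
    # pretend node 3 is the genuinely useful edge.
    return 1.0 if dst == 3 else 0.0

lr = 0.5
for _ in range(200):
    probs = softmax(logits)
    dst = rng.choice(n, p=probs)       # sample an edge (0 -> dst)
    # REINFORCE: grad of log pi(dst) w.r.t. logits = one_hot(dst) - probs
    grad_logp = -probs
    grad_logp[dst] += 1.0
    logits += lr * reward(dst) * grad_logp

best = int(np.argmax(softmax(logits)))
```

After a few hundred samples the predictor concentrates its probability on the edge that yields reward, which is the mechanism the note describes for learning edges without ground truth.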

Details

image

image