
paper

Introduction

  • In the vanilla Transformer, the dot product plus softmax is what models long-range dependencies, but the softmax attention is computationally heavy, especially for long sequence lengths.

  • Q (batch_size, query_seq_len, hid_dim) * transpose(K) (batch_size, hid_dim, key_seq_len) gives us (batch_size, query_seq_len, key_seq_len), and we take the softmax over the last dimension to get the attention scores. -> as key_seq_len gets longer, computing and normalizing this score matrix gets heavier
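The quadratic cost is easy to see in code. A minimal NumPy sketch of the vanilla attention described above (shapes and names are illustrative, not from the paper):

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Vanilla attention: the (query_seq_len, key_seq_len) score matrix
    makes time and memory grow quadratically with sequence length."""
    scores = Q @ K.transpose(0, 2, 1)                     # (batch, q_len, k_len)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over keys
    return weights @ V                                    # (batch, q_len, hid_dim)

batch, L, d = 2, 8, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((batch, L, d)) for _ in range(3))
out = softmax_attention(Q, K, V)
print(out.shape)  # (2, 8, 4)
```

Both the `scores` matrix and the `exp`/normalization over it scale with `q_len * k_len`, which is the bottleneck for long sequences.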

  • To reduce this cost, Sparse Transformer, Random Feature Attention, Reformer, etc. were proposed, but they (1) make assumptions about the attention matrix and (2) rely on approximations, so when the assumptions do not hold or the approximation error grows, they underperform the vanilla Transformer. Also, some of these methods cannot be applied to the causal attention used in language modeling; for example, Linformer and BigBird are restricted to non-causal (e.g., cross) attention.

  • Against this backdrop, we ask whether softmax can be replaced with a linear function while preserving its two key properties:

  • (i) the attention matrix is non-negative

  • (ii) a non-linear re-weighting scheme concentrates the distribution of attention weights and stabilizes training

  • For example, the linear transformer achieves (i) by using exponential linear units (elu(x) + 1) as the feature map, but it has no re-weighting, which is why it performs poorly on the Long-Range Arena.

  • We propose CosFormer, which satisfies both properties: (i) it passes Q and K through ReLU before computing attention scores, and (ii) it uses a cos-based re-weighting scheme to emphasize local correlations.

  • This re-weighting decomposes exactly into a linear form.

  • Took first place in the Long-Range Arena.

Our Method

The idea of CosFormer is to replace the non-decomposable non-linear softmax operation with a decomposable non-linear re-weighting mechanism and a linear operation.

A typical Transformer has O(L^2) complexity in the sequence length L. The important thing to note is that any similarity function can be chosen. To make S(Q, K) linear, we use a decomposable similarity function -> the similarity is \phi(Q)\phi(K)^T rather than exp(QK^T).
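Written out (notation follows the linear-attention literature; $\phi$ is the feature map applied row-wise), the generalized attention and the effect of a decomposable similarity look like:

```latex
% Generalized attention with an arbitrary similarity function S:
%   O_i = \sum_j \frac{S(Q_i, K_j)}{\sum_{j'} S(Q_i, K_{j'})} V_j
% With a decomposable similarity S(Q_i, K_j) = \phi(Q_i)\,\phi(K_j)^\top,
% the sums over keys can be computed once and reused for every query:
O_i = \frac{\phi(Q_i)\sum_j \phi(K_j)^\top V_j}{\phi(Q_i)\sum_{j'} \phi(K_{j'})^\top}
```

The key point is that the two sums over keys no longer depend on $i$, so the quadratic score matrix never needs to be materialized.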

After that, we achieve linear complexity O(N d^2) by computing K^T V first, using the associativity of matrix multiplication. Since in general N ≫ d, this is effectively O(N) in the sequence length.
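The associativity trick can be checked numerically. A minimal NumPy sketch (single head, no batch; the ReLU feature map and all names are illustrative, not the paper's exact implementation):

```python
import numpy as np

def phi(x):
    return np.maximum(x, 0.0)  # non-negative feature map (here: ReLU)

def linear_attention(Q, K, V, eps=1e-6):
    """Linearized attention: phi(Q) @ (phi(K)^T @ V) costs O(N d^2),
    instead of the O(N^2 d) of forming the full score matrix."""
    Qp, Kp = phi(Q), phi(K)                # (N, d)
    KV = Kp.T @ V                          # (d, d): computed once for all queries
    Z = Qp @ Kp.sum(axis=0)                # (N,): row-wise normalizer
    return (Qp @ KV) / (Z[:, None] + eps)  # (N, d)

N, d = 16, 4
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

# Equivalent quadratic form (same similarity, same normalization) for comparison
S = phi(Q) @ phi(K).T
ref = (S / (S.sum(axis=-1, keepdims=True) + 1e-6)) @ V
assert np.allclose(linear_attention(Q, K, V), ref)
```

Changing the multiplication order from (QK^T)V to Q(K^T V) is exact, which is why this requires a decomposable similarity rather than softmax.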

The authors characterize softmax as (i) non-negative and (ii) a non-linear re-weighting, and verify this by computing the similarity matrix with different functions: non-negativity explains ReLU > LeakyReLU > Identity, and the non-linear re-weighting explains softmax > ReLU.

CosFormer

  1. Linear projection kernel = ReLU, dot product.

To remove negative values, the kernel uses ReLU: Q' = ReLU(Q), K' = ReLU(K).

The similarity function is a row-wise dot product: s(Q'_i, K'_j) = Q'_i K'_j^T.

  2. cos-Based Re-weighting Mechanism. The non-linear re-weighting that softmax performs is important: it concentrates the distribution of attention weights and stabilizes training. It also penalizes distant connections and enforces local ones.
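From the paper, the re-weighted similarity multiplies each ReLU-based score by a cosine of the distance between positions ($M$ is the sequence length):

```latex
s(Q'_i, K'_j) = Q'_i K'^{\top}_j \cos\!\left(\frac{\pi}{2}\cdot\frac{i-j}{M}\right)
```

Since $|i-j| < M$, the cosine stays positive, so the re-weighted scores remain non-negative while nearby positions are weighted more heavily than distant ones.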

This cosine strategy is fully decomposable (formula omitted).
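The decomposability follows from cos(a - b) = cos(a)cos(b) + sin(a)sin(b): multiplying Q' and K' rows by position-dependent cos/sin factors splits the re-weighted attention into two linear-attention terms. A NumPy sketch checking this against the direct quadratic computation (single head, M set to the sequence length; names are illustrative):

```python
import numpy as np

def cosformer_attention(Q, K, V, eps=1e-6):
    """Cos re-weighted linear attention via cos(a-b) = cos a cos b + sin a sin b."""
    N, d = Q.shape
    Qp, Kp = np.maximum(Q, 0), np.maximum(K, 0)       # ReLU feature map
    ang = np.pi * np.arange(N) / (2 * N)              # pi/2 * i/M, with M = N
    Qc, Qs = Qp * np.cos(ang)[:, None], Qp * np.sin(ang)[:, None]
    Kc, Ks = Kp * np.cos(ang)[:, None], Kp * np.sin(ang)[:, None]
    num = Qc @ (Kc.T @ V) + Qs @ (Ks.T @ V)           # two linear-attention terms
    den = Qc @ Kc.sum(axis=0) + Qs @ Ks.sum(axis=0)   # row-wise normalizer
    return num / (den[:, None] + eps)

# Direct quadratic computation of the same re-weighted attention, for comparison
N, d = 12, 4
rng = np.random.default_rng(2)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
Qp, Kp = np.maximum(Q, 0), np.maximum(K, 0)
i, j = np.arange(N)[:, None], np.arange(N)[None, :]
S = (Qp @ Kp.T) * np.cos(np.pi * (i - j) / (2 * N))
ref = (S @ V) / (S.sum(axis=-1, keepdims=True) + 1e-6)
assert np.allclose(cosformer_attention(Q, K, V), ref)
```

The linear version never forms the N-by-N matrix `S`; it only needs the four (N, d) factor matrices, so the O(N d^2) complexity from above is preserved.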

Results

(result tables omitted)

papers

  • Random Feature Attention
  • Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
  • Performer (Rethinking Attention with Performers)
  • One-vs-Each Approximation to Softmax for Scalable Estimation of Probabilities
  • Stochastic Positional Encoding
  • Rotary Position Embedding (RoFormer)