
Introduction
In the vanilla Transformer, the dot product followed by softmax is what models long-range dependencies, but the softmax operation is expensive, especially for long sequence lengths.
Q (batch_size, query_seq_len, hid_dim) * transpose(K) (batch_size, hid_dim, key_seq_len) gives us (batch_size, query_seq_len, key_seq_len), and we take the softmax over the last dimension to get the attention scores. -> as key_seq_len gets longer, the softmax gets heavier.
To reduce this cost, Sparse Transformer, Random Feature Attention, Reformer, etc. were proposed, but they (1) make assumptions about the attention matrix and (2) rely on approximations, so when the assumptions do not hold or the approximation error grows, they do not perform as well as the vanilla Transformer. Also, some methods cannot be applied to the causal attention used for language modeling; for example, Linformer and BigBird could only be used for cross attention.
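The quadratic pattern above can be sketched with toy numpy shapes (the sizes here are arbitrary choices for illustration):

```python
import numpy as np

# Toy shapes (arbitrary): softmax attention materializes a full
# (query_seq_len, key_seq_len) score matrix, so cost grows quadratically.
batch_size, seq_len, hid_dim = 2, 8, 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((batch_size, seq_len, hid_dim))
K = rng.standard_normal((batch_size, seq_len, hid_dim))
V = rng.standard_normal((batch_size, seq_len, hid_dim))

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(hid_dim)  # (batch, q_len, k_len)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)         # softmax over keys
out = weights @ V                                      # (batch, q_len, hid_dim)
```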

Against this backdrop, we wanted to see if we could replace softmax with a linear function while preserving its key properties.
(i) the attention matrix is non-negative
(ii) non-linear re-weighting scheme acts to stabilize attention weights
For example, the linear transformer achieves (i) with exponential linear units, but has no re-weighting, which may be why it performed poorly on the Long-Range Arena.
- Long-Range Arena: A Benchmark for Efficient Transformers
We propose CosFormer, which satisfies the above properties: (i) it passes the features through ReLU before computing the attention scores; and (ii) it uses a cos-based re-weighting scheme to amplify local correlations a bit more.
This re-weighting decomposes exactly into a linear form.
CosFormer took first place on the Long-Range Arena.
Our Method
The idea of CosFormer is to replace the non-decomposable non-linear softmax operation with a decomposable non-linear re-weighting mechanism and a linear operation.

A typical transformer's attention has O(N^2) complexity in the sequence length N.
The important thing to note is that any similarity function can be chosen. To make S(Q, K) linear, we choose a decomposable similarity function. -> the similarity is \phi(Q)\phi(K)^T rather than exp(Q K^T)

After that, we can achieve linear complexity O(Nd^2) by computing KV first, using the associativity of matrix multiplication. Since in general N >> d, this is effectively O(N).
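A minimal numpy sketch of this re-association, assuming ReLU as the feature map \phi (normalization omitted for brevity):

```python
import numpy as np

# Sketch of the re-association trick; phi = ReLU is an assumption here
# (any non-negative feature map works), and normalization is omitted.
def phi(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
N, d = 16, 4
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

# Quadratic order: (phi(Q) phi(K)^T) V builds an N x N matrix first.
quadratic = (phi(Q) @ phi(K).T) @ V
# Linear order: phi(Q) (phi(K)^T V) only ever forms a d x d intermediate.
linear = phi(Q) @ (phi(K).T @ V)
assert np.allclose(quadratic, linear)  # same result, O(N d^2) cost
```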

We characterized softmax as (i) non-negative and (ii) non-linear re-weighting, and tried different similarity functions to verify this. Non-negativity explains why ReLU > LeakyReLU > Identity, and non-linear re-weighting explains why softmax > ReLU.

CosFormer
- linear projection kernel = ReLU, dot product.

To remove negative values, the kernel used ReLU.

The similarity function is a row-wise dot product.
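A minimal sketch of the normalized linear-attention output with a ReLU kernel (toy shapes; the small epsilon guard is my own addition, not from the paper):

```python
import numpy as np

# Toy sketch: ReLU feature maps, with the softmax denominator replaced
# by a row-wise sum of similarities. The 1e-6 epsilon is an assumption
# to guard against all-zero rows after ReLU.
rng = np.random.default_rng(1)
N, d = 16, 4
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
phiQ, phiK = np.maximum(Q, 0.0), np.maximum(K, 0.0)

kv = phiK.T @ V               # (d, d), shared by every query position
z = phiK.sum(axis=0)          # (d,), normalizer replacing softmax's denominator
out = (phiQ @ kv) / (phiQ @ z + 1e-6)[:, None]  # row-wise normalized output
```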

- cos-Based Re-weighting Mechanism
The non-linear re-weighting that softmax performs is important: it concentrates the distribution of attention weights and stabilizes the learning process. We also saw that it penalizes distant connections and emphasizes local ones.

This cosine strategy is fully decomposable (formula omitted)
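Although the formula is omitted above, the decomposition rests on the identity cos(a - b) = cos(a)cos(b) + sin(a)sin(b). A numerical sketch (shapes and M are my own toy choices) checks that the cos-re-weighted attention equals two ordinary linear-attention passes:

```python
import numpy as np

# Numerical check of the decomposition: cos(pi*(i-j)/(2M)) splits into
# cos(pi*i/2M)cos(pi*j/2M) + sin(pi*i/2M)sin(pi*j/2M), i.e. terms that
# depend only on i or only on j. M, d are toy choices for illustration.
M, d = 16, 4
rng = np.random.default_rng(2)
Q, K, V = (np.maximum(rng.standard_normal((M, d)), 0.0) for _ in range(3))
idx = np.arange(M)

# Direct form: re-weight every (i, j) score by cos(pi*(i - j)/(2M)).
w = np.cos(np.pi * (idx[:, None] - idx[None, :]) / (2 * M))  # (M, M)
direct = ((Q @ K.T) * w) @ V

# Decomposed form: two linear-attention passes with cos/sin-scaled Q, K.
c = np.cos(np.pi * idx / (2 * M))[:, None]
s = np.sin(np.pi * idx / (2 * M))[:, None]
decomposed = (Q * c) @ ((K * c).T @ V) + (Q * s) @ ((K * s).T @ V)
assert np.allclose(direct, decomposed)
```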

Result

papers
- RFA (Random Feature Attention)
- Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
- Performer
- One-vs-Each Approximation to Softmax for Scalable Estimation of Probabilities
- Stochastic Positional Encoding
- Rotary Position Embedding