
TL;DR
- I read this because: thesis research. I wanted to use an efficient transformer in my model, but no kernel implementation of it supported RPE.
- task : positional embedding
- problem : absolute PEs generalize poorly to inputs longer than the max_len seen during training; relative PEs are injected additively into the attention matrix, so efficient-attention tricks like Linformer don't apply
- idea : view each pair of embedding dimensions as a complex number with a magnitude and a phase, and replace the additive PE with a position-dependent rotation of the affine-transformed embedding (multiplicative and norm-preserving)
- input/output : token / token
- architecture : transformer
- objective : MLM (BERT-style pretraining)
- baseline : BERT
- data : English corpora (BERT-style pretraining), WMT-14 En-De (MT), CAIL2019-SCM (Chinese long-document legal case matching)
- evaluation : GLUE (fine-tuning), BLEU (WMT-14), accuracy (CAIL2019-SCM)
- result : Fast convergence. Better performance than BERT on GLUE.
- contribution : organized the RPE family once and for all, and proposed RoPE, a multiplicative RPE compatible with linear attention
Details
Related Work : PEs
absolute PE

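For reference (my addition, from the original Transformer paper), the sinusoidal absolute PE is:

```latex
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)
```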
Shaw et al.

clipping
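As I remember Shaw et al., the score adds a learned relative-position embedding to the key, with the offset clipped to a window of size k:

```latex
e_{ij} = \frac{x_i W^Q \left(x_j W^K + a^K_{ij}\right)^{\!\top}}{\sqrt{d_z}}, \qquad
a^K_{ij} = w^K_{\mathrm{clip}(j-i,\,k)}, \qquad
\mathrm{clip}(x, k) = \max(-k,\, \min(k,\, x))
```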
- Transformer-XL


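My hedged summary of the Transformer-XL score: it decomposes the attention logit into four terms, replacing absolute positions with sinusoidal relative encodings R_{i-j} and two learned global bias vectors u, v:

```latex
A^{\mathrm{rel}}_{ij} =
\underbrace{E_i^\top W_q^\top W_{k,E}\, E_j}_{\text{content-content}}
+ \underbrace{E_i^\top W_q^\top W_{k,R}\, R_{i-j}}_{\text{content-position}}
+ \underbrace{u^\top W_{k,E}\, E_j}_{\text{global content bias}}
+ \underbrace{v^\top W_{k,R}\, R_{i-j}}_{\text{global position bias}}
```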
- T5

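T5 goes further and drops the position vectors entirely: a learned scalar bias, looked up by the bucketed relative offset, is added to each attention logit (my summary):

```latex
A_{ij} = q_i^\top k_j + b_{\mathrm{bucket}(j-i)}
```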
Proposed

The figure (omitted here) shows the d=2 case: the affine-transformed embedding is rotated by an angle proportional to its position index.

- f : maps a token embedding plus its position to a query/key vector
- g : the attention score, which should depend only on the relative position
After rotating each vector by (position index) × (base angle), the attention score depends only on the relative position. Incorporating the relative position embedding is thus straightforward: simply rotate the affine-transformed word embedding by an angle that is a multiple of its position index. That is the intuition behind Rotary Position Embedding.
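A minimal sketch of the d=2 intuition (my own toy code, not the authors'): rotate q and k by their position angles and check that the dot product depends only on the offset n − m. The base angle `theta` is a hypothetical value.

```python
import math

def rot2d(v, angle):
    """Rotate a 2-D vector by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

theta = 0.5                      # base angle (hypothetical value)
q, k = (1.0, 2.0), (0.3, -0.7)

# score for query at position m=3, key at position n=7
s1 = dot(rot2d(q, 3 * theta), rot2d(k, 7 * theta))
# shift both positions by +10: the offset n - m = 4 is unchanged
s2 = dot(rot2d(q, 13 * theta), rot2d(k, 17 * theta))

print(abs(s1 - s2) < 1e-9)       # True: only n - m matters
```

This works because rotations are orthogonal: R(α)q · R(β)k = q · R(β − α)k, so only the angle difference (n − m)θ survives.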
Extending to d dimensions, the rotation becomes a block-diagonal matrix of d/2 independent 2-D rotations, one per dimension pair, with per-pair frequencies θ_i = 10000^(−2(i−1)/d).

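A sketch of the d-dimensional version (assuming the paper's frequency schedule θ_i = 10000^(−2(i−1)/d); plain-Python for clarity, not an efficient implementation): each consecutive pair of coordinates is rotated at its own frequency, and again the score depends only on the relative offset.

```python
import math

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to vector x at position pos."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)         # frequency for pair i//2
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        out += [c * x[i] - s * x[i + 1], s * x[i] + c * x[i + 1]]
    return out

q = [0.5, -1.0, 0.25, 2.0]
k = [1.5, 0.5, -0.75, 1.0]

# same relative offset (3) at two very different absolute positions
s1 = sum(a * b for a, b in zip(rope(q, 2), rope(k, 5)))
s2 = sum(a * b for a, b in zip(rope(q, 42), rope(k, 45)))
print(abs(s1 - s2) < 1e-9)               # True
```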
Result
