
TL;DR
- I read this because: thesis research. I wanted to use an efficient transformer in my model, but no kernel implementation of it supported RPE.
- task : positional embedding
- problem : absolute PEs generalize poorly to inputs longer than the max_len seen during training; relative PEs are injected additively into the attention matrix, so efficient-attention tricks like Linformer don't apply
- idea : view each pair of embedding dimensions as a complex number with a magnitude and a phase, and replace the additive PE with a position-dependent rotation of the affine-transformed embedding (multiplicative and norm-preserving)
- input/output : token / token
- architecture : transformer
- objective : MLM (BERT-style pretraining)
- baseline : BERT
- data : English corpora (BERT-style pretraining), WMT-14 En-De (MT), CAIL2019-SCM (Chinese long-document legal case matching)
- evaluation : GLUE (fine-tuning), BLEU (WMT-14), accuracy (CAIL2019-SCM)
- result : Fast convergence. Better performance than BERT on GLUE.
- contribution : organized the RPE family once and for all, and proposed RoPE, a multiplicative RPE compatible with linear attention
Details
Related Work : PEs
absolute PE

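For reference (my addition, from the original Transformer paper), the sinusoidal absolute PE is:

```latex
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)
```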
Shaw et al.

clipping
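As I remember Shaw et al., the score adds a learned relative-position embedding to the key, with the offset clipped to a window of size k:

```latex
e_{ij} = \frac{x_i W^Q \left(x_j W^K + a^K_{ij}\right)^{\!\top}}{\sqrt{d_z}}, \qquad
a^K_{ij} = w^K_{\mathrm{clip}(j-i,\,k)}, \qquad
\mathrm{clip}(x, k) = \max(-k,\, \min(k,\, x))
```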
- Transformer-XL


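My hedged summary of the Transformer-XL score: it decomposes the attention logit into four terms, replacing absolute positions with sinusoidal relative encodings R_{i-j} and two learned global bias vectors u, v:

```latex
A^{\mathrm{rel}}_{ij} =
\underbrace{E_i^\top W_q^\top W_{k,E}\, E_j}_{\text{content-content}}
+ \underbrace{E_i^\top W_q^\top W_{k,R}\, R_{i-j}}_{\text{content-position}}
+ \underbrace{u^\top W_{k,E}\, E_j}_{\text{global content bias}}
+ \underbrace{v^\top W_{k,R}\, R_{i-j}}_{\text{global position bias}}
```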
- T5

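T5 goes further and drops the position vectors entirely: a learned scalar bias, looked up by the bucketed relative offset, is added to each attention logit (my summary):

```latex
A_{ij} = q_i^\top k_j + b_{\mathrm{bucket}(j-i)}
```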
Proposed

The figure (omitted here) shows the d=2 case: the affine-transformed embedding is rotated by an angle proportional to its position index.

- f : maps a token embedding plus its position to a query/key vector
- g : the attention score, which should depend only on the relative position
After rotating each vector by (position index) × (base angle), the attention score depends only on the relative position. Incorporating the relative position embedding is thus straightforward: simply rotate the affine-transformed word embedding by an angle that is a multiple of its position index. That is the intuition behind Rotary Position Embedding.
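A minimal sketch of the d=2 intuition (my own toy code, not the authors'): rotate q and k by their position angles and check that the dot product depends only on the offset n − m. The base angle `theta` is a hypothetical value.

```python
import math

def rot2d(v, angle):
    """Rotate a 2-D vector by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

theta = 0.5                      # base angle (hypothetical value)
q, k = (1.0, 2.0), (0.3, -0.7)

# score for query at position m=3, key at position n=7
s1 = dot(rot2d(q, 3 * theta), rot2d(k, 7 * theta))
# shift both positions by +10: the offset n - m = 4 is unchanged
s2 = dot(rot2d(q, 13 * theta), rot2d(k, 17 * theta))

print(abs(s1 - s2) < 1e-9)       # True: only n - m matters
```

This works because rotations are orthogonal: R(α)q · R(β)k = q · R(β − α)k, so only the angle difference (n − m)θ survives.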
Extending to d dimensions, the rotation becomes a block-diagonal matrix of d/2 independent 2-D rotations, one per dimension pair, with per-pair frequencies θ_i = 10000^(−2(i−1)/d).

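A sketch of the d-dimensional version (assuming the paper's frequency schedule θ_i = 10000^(−2(i−1)/d); plain-Python for clarity, not an efficient implementation): each consecutive pair of coordinates is rotated at its own frequency, and again the score depends only on the relative offset.

```python
import math

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to vector x at position pos."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)         # frequency for pair i//2
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        out += [c * x[i] - s * x[i + 1], s * x[i] + c * x[i + 1]]
    return out

q = [0.5, -1.0, 0.25, 2.0]
k = [1.5, 0.5, -0.75, 1.0]

# same relative offset (3) at two very different absolute positions
s1 = sum(a * b for a, b in zip(rope(q, 2), rope(k, 5)))
s2 = sum(a * b for a, b in zip(rope(q, 42), rope(k, 45)))
print(abs(s1 - s2) < 1e-9)               # True
```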
Result
