
paper, code

TL;DR

  • I read this because.. : thesis research. I wanted to use an efficient transformer in my model, but there was no kernel implementation of it that supports RPE.
  • task : positional embedding
  • problem : absolute PEs don’t generalize well when a sequence longer than the learned max_len comes in; relative PEs are added inside the attention computation, so linearization tricks like Linformer don’t apply
  • idea : move the d-dimensional embedding into complex space, view each coordinate pair as a vector with a magnitude and a direction, and replace the additive PE with a rotation applied to the affine-transformed (weight-projected) embedding
  • input/output : token / token
  • architecture : transformer
  • objective : MLE
  • baseline : BERT
  • data : English corpora, WMT-14 (MT), CAIL2019-SCM (similar case matching)
  • evaluation : GLUE
  • result : Fast convergence. Better performance than BERT on GLUE.
  • contribution : Organized RPE families once and for all

Details

  • absolute PE

  • Shaw et al.: relative distances are clipped to a fixed window

  • Transformer-XL
  • T5
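As a minimal sketch of the clipping trick mentioned for Shaw et al. (the window size `k` and function name are my own, for illustration): relative distances are clipped to [-k, k] and shifted into [0, 2k] so they can index a small embedding table.

```python
import numpy as np

def clipped_relative_positions(seq_len, k=4):
    # Shaw et al.-style indexing: relative distance j - i, clipped to
    # [-k, k], then shifted to [0, 2k] to index a (2k+1)-row table.
    pos = np.arange(seq_len)
    rel = pos[None, :] - pos[:, None]      # (L, L) signed distances
    return np.clip(rel, -k, k) + k         # indices into 2k+1 embeddings

idx = clipped_relative_positions(6, k=2)
# All distances beyond +/-k share one embedding, bounding the table size.
```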

Proposed


The figure illustrates the case d = 2: the 2-dimensional embedding is rotated in the plane by an angle proportional to its position index.

  • f : applies the PE to the token embedding (producing the query/key)
  • g : the attention score, which should depend only on the relative position

After rotating each embedding by (position index × angle), the attention score depends only on the relative position. In other words, incorporating the relative position embedding is straightforward: simply rotate the affine-transformed word embedding by an angle that is a multiple of its position index. This is the intuition behind Rotary Position Embedding.
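A quick numpy sketch of the d = 2 case (the angle `theta` is arbitrary here): the dot product of two rotated vectors is invariant to shifting both positions by the same offset, so the score depends only on the relative position.

```python
import numpy as np

def rotate(v, pos, theta=0.3):
    # Rotate a 2-D vector by (position index * theta).
    ang = pos * theta
    R = np.array([[np.cos(ang), -np.sin(ang)],
                  [np.sin(ang),  np.cos(ang)]])
    return R @ v

q = np.array([1.0, 2.0])
k = np.array([0.5, -1.0])

# Score between positions m and n ...
m, n = 7, 3
score = rotate(q, m) @ rotate(k, n)

# ... equals the score after shifting both positions by the same offset:
# only the relative position (m - n) matters.
shifted = rotate(q, m + 5) @ rotate(k, n + 5)
assert np.allclose(score, shifted)
```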

Extending to d dimensions, the same rotation is applied independently to each consecutive pair of coordinates, with each pair getting its own angle.
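A sketch of the d-dimensional extension (the per-pair frequencies follow the usual base-10000 schedule; `rope` is my own helper name): each pair (x[2i], x[2i+1]) is rotated by pos * theta_i, and the relative-position property carries over.

```python
import numpy as np

def rope(x, pos, base=10000):
    # Rotate each consecutive coordinate pair (x[2i], x[2i+1])
    # by pos * theta_i, with theta_i = base^(-2i/d).
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # (d/2,) frequencies
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal(8)

# Scores depend only on the relative position (10 - 4 == 16 - 10 + ... no,
# both pairs have the same offset of 6):
s1 = rope(q, 10) @ rope(k, 4)
s2 = rope(q, 16) @ rope(k, 10)
assert np.allclose(s1, s2)
```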

Result

(result figures: the proposed model converges faster and scores better than BERT on GLUE)