TL;DR
- I read this because : I didn’t understand the interleaved mRoPE used in Qwen3-VL.
- task : RoPE in video LLM
- problem : mRoPE splits the head dimension into thirds, but it’s strange that the temporal dimension is assigned to the front (highest-frequency) third.
- idea : assign temporal to the back (lowest-frequency) dimensions.
- input/output : {video, question} -> answer
- architecture : ViT from Qwen2-7B, Qwen-7B LLM
- objective : CE loss
- baseline : Vanilla RoPE, mRoPE, RoPE-TIE
- data : 1.3M video pairs from LLaVA-Video-178K
- evaluation : LongVideoBench, MLVU, Video-MME, V-NIAH, V-NIAH-D(proposed)
- result : better performance everywhere except Video-MME, and extrapolation also seems to work better.
- contribution : simple SOTA
- etc. :
Details
Existing
- RoPE general
watch here
We apply a rotation ([[cos, -sin], [sin, cos]]) to Q and K, and in the self-attention dot product the two rotations combine into a single rotation by the relative distance (m - n).
Since our Q and K vectors are d-dimensional rather than 2-dimensional, the trick is to split them into d/2 pairs and rotate each pair independently.
The theta multiplied in each pair gets smaller the higher (later) the dimension, so later pairs are low-frequency: they rotate only slightly as (m - n) changes.
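A minimal sketch of this (function name mine): rotate each 2-D pair of a vector by `pos * theta_i`, then check that the Q·K score depends only on the offset m - n.

```python
import torch

def rope_rotate(x, pos, theta_base=10000.0):
    # x: [..., d] with even d; rotate each pair (x[2i], x[2i+1])
    # by pos * theta_i, where theta_i = theta_base**(-2i/d)
    # (later pairs get smaller theta -> lower frequency).
    d = x.shape[-1]
    i = torch.arange(d // 2, dtype=torch.float32)
    angle = pos * theta_base ** (-2 * i / d)       # [d/2]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * torch.cos(angle) - x2 * torch.sin(angle)
    out[..., 1::2] = x1 * torch.sin(angle) + x2 * torch.cos(angle)
    return out

# the score <R_m q, R_n k> depends only on the offset m - n:
torch.manual_seed(0)
q, k = torch.randn(8), torch.randn(8)
s1 = rope_rotate(q, 5) @ rope_rotate(k, 2)    # offset 3
s2 = rope_rotate(q, 13) @ rope_rotate(k, 10)  # same offset 3
print(torch.allclose(s1, s2, atol=1e-5))  # True
```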
In mRoPE, only the position ids (= m) fed into $R_{\theta,m}^d$ change: instead of a single 1-D index, they are split into (t, h, w) components.
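A rough sketch of that split (function names and the contiguous-thirds layout are my simplification; the actual Qwen implementation interleaves differently): each token carries three position ids (t, h, w), and each id drives its own third of the rotary frequency pairs. With vanilla mRoPE, t takes the first (highest-frequency) third.

```python
import torch

def mrope_position_ids(t_len, h_len, w_len):
    # [3, num_tokens] position ids for a t x h x w video patch grid;
    # a text token would use the same id for all three rows.
    t = torch.arange(t_len).repeat_interleave(h_len * w_len)
    h = torch.arange(h_len).repeat_interleave(w_len).repeat(t_len)
    w = torch.arange(w_len).repeat(t_len * h_len)
    return torch.stack([t, h, w])

def mrope_angles(pos_ids, head_dim, theta_base=10000.0):
    # pos_ids: [3, n] -> rotation angles [n, head_dim // 2]
    n_pairs = head_dim // 2
    freqs = theta_base ** (-2 * torch.arange(n_pairs).float() / head_dim)
    angles = pos_ids[:, :, None].float() * freqs   # [3, n, n_pairs]
    third = n_pairs // 3
    # t -> first (high-freq) third, h -> middle, w -> last (low-freq);
    # the note's idea is to move t to the last, lowest-frequency slice.
    return torch.cat([angles[0, :, :third],
                      angles[1, :, third:2 * third],
                      angles[2, :, 2 * third:]], dim=-1)

pos = mrope_position_ids(2, 2, 2)            # 2x2x2 grid -> 8 tokens
print(mrope_angles(pos, head_dim=24).shape)  # torch.Size([8, 12])
```

When all three ids are equal (a text token), this collapses back to vanilla 1-D RoPE.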
# Q = [batch size, n heads, query len, head dim]
# K = [batch size, n heads, key len, head dim]
# V = [batch size, n heads, value len, head dim]
# k.permute(0, 1, 3, 2) = [batch size, n heads, head dim, key len]
energy = torch.matmul(Q, K.permute(0, 1, 3, 2)) / self.scale
# energy = [batch size, n heads, query len, key len]
I’m embarrassed to say I was confused here, but the attention logit is a dot product over head_dim, so splitting head_dim into chunks and summing the per-chunk dot products gives exactly the same scalar lol;
So why is putting temporal at the front wrong?
Round and Round We Go! What makes Rotary Positional Encodings useful? (https://arxiv.org/abs/2410.06205) — the observation that high frequencies capture local information and low frequencies capture long context.