
paper

TL;DR

  • I read this because.. : I didn’t understand the interleaved mRoPE used in Qwen3-VL.
  • task : RoPE in video LLM
  • problem : mRoPE splits the head dimension into thirds (t, h, w), but it’s odd that the temporal dimension is assigned to the front (highest-frequency) slice.
  • Idea : assign the temporal dimension to the back of the head dimension (lowest frequency).
  • input/output : {video, question} -> answer
  • architecture : ViT from Qwen2-7B, Qwen-7B LLM
  • objective : CE loss
  • baseline : Vanilla RoPE, mRoPE, RoPE-TIE
  • data : 1.3M video pairs from LLaVA-Video-178K
  • evaluation : LongVideoBench, MLVU, Video-MME, V-NIAH, V-NIAH-D(proposed)
  • result : better performance everywhere except Video-MME, and extrapolation also seems to work better.
  • contribution : simple sota
  • etc. :

Details

Existing

  • RoPE in general
Image

watch here

We apply a rotation ([[cos, -sin], [sin, cos]]) to Q and K, and in the self-attention dot product the two rotations combine so that only the relative distance (m - n) survives. Since our Q and K vectors are d-dimensional rather than 2-dimensional, the trick is to split the head dimension into d/2 consecutive pairs and rotate each pair independently (a block-diagonal rotation).

The theta multiplied into each pair gets smaller the later the dimension (θ_i = 10000^{-2i/d}), so the later pairs become low-frequency components that move only slightly as (m - n) changes.
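The pairwise-rotation trick can be sketched in a few lines of plain Python (a toy sketch, not the actual model code). The key property to check is that the Q·K score depends only on the relative distance m - n:

```python
import math

def rope(x, m, base=10000.0):
    """Rotate each consecutive pair of x by angle m * theta_i,
    where theta_i = base^(-2i/d) shrinks for later pairs."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)  # pair index i/2 -> base^(-2(i/2)/d)
        c, s = math.cos(m * theta), math.sin(m * theta)
        out += [c * x[i] - s * x[i + 1], s * x[i] + c * x[i + 1]]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same relative distance (m - n = 2) -> same attention score:
s1 = dot(rope(q, 5), rope(k, 3))
s2 = dot(rope(q, 9), rope(k, 7))
print(abs(s1 - s2) < 1e-9)  # True
```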

Image Image

In mRoPE, only the position ids (= m) plugged into $R_{\theta,m}^d$ change: the head dimension is split into (t, h, w) parts, and each part uses the corresponding coordinate as its position id.
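A minimal sketch of what those per-axis position ids look like (my reading, not the paper's code): each visual token gets a (t, h, w) triple, and each third of the head dimension is rotated with one of the three coordinates as its m.

```python
# Hypothetical helper: enumerate (t, h, w) position ids for a video laid out
# as n_frames x height x width tokens. Each third of the head dim would then
# use one of these coordinates as its rotation position m.
def mrope_position_ids(n_frames, height, width):
    ids = []
    for t in range(n_frames):
        for h in range(height):
            for w in range(width):
                ids.append((t, h, w))
    return ids

ids = mrope_position_ids(2, 2, 2)
print(ids[0], ids[-1])  # (0, 0, 0) (1, 1, 1)
```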

Image Image
# Q = [batch size, n heads, query len, head dim]
# K = [batch size, n heads, key len, head dim]
# V = [batch size, n heads, value len, head dim]
		
# k.permute(0, 1, 3, 2) = [batch size, n heads, head dim, key len]
energy = torch.matmul(Q, K.permute(0, 1, 3, 2)) / self.scale
# energy = [batch size, n heads, query len, key len]

I’m embarrassed to admit I was confused here: the attention score is a dot product over head_dim that produces a scalar, so splitting the head dimension into parts, rotating each part separately, and summing the partial dot products gives exactly the same result lol;
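That equivalence is just the distributivity of the dot product over a partition of the dimensions, which is why rotating each third with its own position id still yields a well-defined score:

```python
# The full dot product over head_dim equals the sum of the dot products over
# any partition of the dimensions (here: thirds of a 6-dim vector).
q = [0.3, -1.2, 0.7, 0.5, -0.8, 1.4]
k = [1.1, 0.4, -0.6, 0.9, 0.2, -0.5]

full = sum(x * y for x, y in zip(q, k))
in_thirds = sum(
    sum(x * y for x, y in zip(q[i:i + 2], k[i:i + 2]))
    for i in range(0, 6, 2)
)
print(abs(full - in_thirds) < 1e-12)  # True
```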

Why is assigning temporal to the front wrong?

Round and Round We Go! What makes Rotary Positional Encodings useful? (https://arxiv.org/abs/2410.06205) makes the observation that high-frequency components pull in local information while low-frequency components pull in long context.
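You can see that observation directly from a single rotation pair's contribution, cos((m - n)·θ): a large θ swings wildly between nearby tokens, while a tiny θ barely moves even at distance 100 (illustrative values, not the paper's):

```python
import math

# One RoPE pair contributes cos((m - n) * theta) to the attention score.
# high theta -> rapid oscillation across neighbours (local discrimination);
# tiny theta -> nearly constant over long spans (long-context signal).
high_theta, low_theta = 1.0, 1e-4
for dist in (1, 10, 100):
    print(f"dist={dist:3d}  high: {math.cos(dist * high_theta):+.3f}  "
          f"low: {math.cos(dist * low_theta):+.3f}")
```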

Image

Proposed

Image
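Putting the TL;DR idea into code terms (my own sketch of the allocation, not the paper's implementation): mRoPE puts t on the front, highest-frequency slice of the head dim, while the proposed scheme moves t to the back, where θ is smallest, so long temporal distances produce slowly-varying low-frequency rotations.

```python
# Hypothetical helper contrasting the two head-dim allocations.
# Dimension index 0 has the largest theta (highest frequency); the last
# index has the smallest theta (lowest frequency).
def split_dims(head_dim):
    third = head_dim // 3
    # mRoPE: t gets the front (high-frequency) slice
    mrope = {"t": range(0, third),
             "h": range(third, 2 * third),
             "w": range(2 * third, head_dim)}
    # proposed: t gets the back (low-frequency) slice
    proposed = {"h": range(0, third),
                "w": range(third, 2 * third),
                "t": range(2 * third, head_dim)}
    return mrope, proposed

mrope, proposed = split_dims(96)
print(list(mrope["t"])[:2], list(proposed["t"])[:2])  # [0, 1] [64, 65]
```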

Result

Image Image