TL;DR
- I read this because : I didn’t understand the interleaved mRoPE used in Qwen3-VL.
- task : RoPE in video LLM
- problem : mRoPE splits the head dimension into thirds, but it’s strange that the temporal dimension is assigned to the front (highest-frequency) third.
- idea : assign temporal to the back (lowest-frequency) dimensions.
- input/output : {video, question} -> answer
- architecture : ViT from Qwen2-7B, Qwen-7B LLM
- objective : CE loss
- baseline : Vanilla RoPE, mRoPE, RoPE-TIE
- data : 1.3M video pairs from LLaVA-Video-178K
- evaluation : LongVideoBench, MLVU, Video-MME, V-NIAH, V-NIAH-D(proposed)
- result : better performance everywhere except Video-MME, and extrapolation also seems to work better.
- contribution : simple SOTA
- etc. :
Details
Existing
- RoPE general
watch here
We apply a rotation ([[cos, -sin], [sin, cos]]) to Q and K, and in the self-attention dot product the two rotations combine into a single rotation by the relative distance (m - n).
Since our Q and K vectors are d-dimensional rather than 2-dimensional, the trick is to split them into d/2 pairs and rotate each pair independently.
The theta multiplied in each pair gets smaller the higher (later) the dimension, so later pairs are low-frequency: they rotate only slightly as (m - n) changes.
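A minimal sketch of this (function name mine): rotate each 2-D pair of a vector by `pos * theta_i`, then check that the Q·K score depends only on the offset m - n.

```python
import torch

def rope_rotate(x, pos, theta_base=10000.0):
    # x: [..., d] with even d; rotate each pair (x[2i], x[2i+1])
    # by pos * theta_i, where theta_i = theta_base**(-2i/d)
    # (later pairs get smaller theta -> lower frequency).
    d = x.shape[-1]
    i = torch.arange(d // 2, dtype=torch.float32)
    angle = pos * theta_base ** (-2 * i / d)       # [d/2]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * torch.cos(angle) - x2 * torch.sin(angle)
    out[..., 1::2] = x1 * torch.sin(angle) + x2 * torch.cos(angle)
    return out

# the score <R_m q, R_n k> depends only on the offset m - n:
torch.manual_seed(0)
q, k = torch.randn(8), torch.randn(8)
s1 = rope_rotate(q, 5) @ rope_rotate(k, 2)    # offset 3
s2 = rope_rotate(q, 13) @ rope_rotate(k, 10)  # same offset 3
print(torch.allclose(s1, s2, atol=1e-5))  # True
```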
In mRoPE, only the position ids (= m) fed into $R_{\theta,m}^d$ change: instead of a single 1-D index, they are split into (t, h, w) components.
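A rough sketch of that split (function names and the contiguous-thirds layout are my simplification; the actual Qwen implementation interleaves differently): each token carries three position ids (t, h, w), and each id drives its own third of the rotary frequency pairs. With vanilla mRoPE, t takes the first (highest-frequency) third.

```python
import torch

def mrope_position_ids(t_len, h_len, w_len):
    # [3, num_tokens] position ids for a t x h x w video patch grid;
    # a text token would use the same id for all three rows.
    t = torch.arange(t_len).repeat_interleave(h_len * w_len)
    h = torch.arange(h_len).repeat_interleave(w_len).repeat(t_len)
    w = torch.arange(w_len).repeat(t_len * h_len)
    return torch.stack([t, h, w])

def mrope_angles(pos_ids, head_dim, theta_base=10000.0):
    # pos_ids: [3, n] -> rotation angles [n, head_dim // 2]
    n_pairs = head_dim // 2
    freqs = theta_base ** (-2 * torch.arange(n_pairs).float() / head_dim)
    angles = pos_ids[:, :, None].float() * freqs   # [3, n, n_pairs]
    third = n_pairs // 3
    # t -> first (high-freq) third, h -> middle, w -> last (low-freq);
    # the note's idea is to move t to the last, lowest-frequency slice.
    return torch.cat([angles[0, :, :third],
                      angles[1, :, third:2 * third],
                      angles[2, :, 2 * third:]], dim=-1)

pos = mrope_position_ids(2, 2, 2)            # 2x2x2 grid -> 8 tokens
print(mrope_angles(pos, head_dim=24).shape)  # torch.Size([8, 12])
```

When all three ids are equal (a text token), this collapses back to vanilla 1-D RoPE.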
# Q = [batch size, n heads, query len, head dim]
# K = [batch size, n heads, key len, head dim]
# V = [batch size, n heads, value len, head dim]
# k.permute(0, 1, 3, 2) = [batch size, n heads, head dim, key len]
energy = torch.matmul(Q, K.permute(0, 1, 3, 2)) / self.scale
# energy = [batch size, n heads, query len, key len]
I’m embarrassed to say I was confused here, but the attention logit is a dot product over head_dim, so splitting head_dim into chunks and summing the per-chunk dot products gives exactly the same scalar lol;
So why is putting temporal at the front wrong?
Round and Round We Go! What makes Rotary Positional Encodings useful? (https://arxiv.org/abs/2410.06205) — the observation that high frequencies capture local information and low frequencies capture long context.