
paper

TL;DR

  • I read this because.. : I couldn't make sense of the interleaved mRoPE used in Qwen3-VL.
  • task : RoPE in video LLM
  • problem : mRoPE splits the head dimension into three parts, but it is odd that the temporal axis gets the front (high-frequency) part.
  • idea : allocate temporal to the very back (the low frequencies) instead.
  • input/output : {video, question} -> answer
  • architecture : ViT from Qwen2-7B, Qwen-7B LLM
  • objective : CE loss
  • baseline : Vanilla RoPE, mRoPE, RoPE-TIE
  • data : 1.3M video pairs from LLaVA-Video-178k
  • evaluation : LongVideoBench, MLVU, Video-MME, V-NIAH, V-NIAH-D (proposed)
  • result : better than the baselines everywhere except Video-MME, and it seems to extrapolate better too.
  • contribution : simple SOTA
  • etc. :

Details

Existing

  • RoPE general

watch here

Q์™€ K์— ๊ฐ๊ฐ ๊ฐ๋„ ๋ณ€ํ™˜ ([[cos, -sin], [sin, cos]]) ์„ ์‹œํ‚ค๋Š”๋ฐ, ์ด ๊ฐ๋„ ๋ณ€ํ™˜์„ ์‹œํ‚จ๊ฑธ self-attention ์—ฐ์‚ฐ์„ ํ•˜๊ฒŒ ๋˜๋ฉด (m-n) (์ƒ๋Œ€ ๊ฑฐ๋ฆฌ)์— ๋Œ€ํ•œ ๊ฐ๋„๋ณ€ํ™˜์œผ๋กœ๋งŒ ๋‚˜์˜ค๊ฒŒ ๋จ. ์ด๋•Œ ์šฐ๋ฆฌ์˜ Q, K ๋ฒกํ„ฐ๋Š” 2์ฐจ์›์ด ์•„๋‹ˆ๋ผ n์ฐจ์›์ด๊ธฐ ๋•Œ๋ฌธ์— ์ด์— ๋Œ€ํ•œ trick์œผ๋กœ ๊ฐ๊ฐ 2๋กœ ๋‚˜๋ˆ ์„œ ์ € rotation weight ์—ฐ์‚ฐ์„ ํ•ด์ฃผ๊ฒŒ ๋จ.

์ด๋•Œ ์ €๊ธฐ ๊ณฑํ•ด์ง€๋Š” theta๊ฐ€ dimension์ด ๋†’์„์ˆ˜๋ก (๋’ค์— ์žˆ๋Š” dimension์ผ ์ˆ˜๋ก) ์ž‘์•„์ ธ์„œ (m-n)์˜ ๋ณ€ํ™”์— ์กฐ๊ธˆ ์”ฉ ์›€์ง์ด๋Š” low-frequency๊ฐ€ ๋จ.


์œ„์˜ $R_{\theta,m}^d$ ์—์„œ position ids(=m)์„ ๊ตฌํ•˜๋Š” ๊ณ„์‚ฐ์‹๋งŒ (w, h, t)๋กœ ๋‚˜๋ˆ„์–ด์ง„๋‹ค๊ณ  ๋ณด๋ฉด ๋จ.

# Q = [batch size, n heads, query len, head dim]
# K = [batch size, n heads, key len, head dim]
# V = [batch size, n heads, value len, head dim]
		
# K.permute(0, 1, 3, 2) = [batch size, n heads, head dim, key len]
energy = torch.matmul(Q, K.permute(0, 1, 3, 2)) / self.scale
# energy = [batch size, n heads, query len, key len]

๋ถ€๋„๋Ÿฝ์ง€๋งŒ ํ—ท๊ฐˆ๋ ธ๋Š”๋ฐ head_dim ๋ผ๋ฆฌ dot product๋ฅผ ํ•ด์„œ scalar๋ฅผ ๊ตฌํ•˜๋Š” ๊ฑฐ๊ธฐ ๋•Œ๋ฌธ์— ์ด๊ฑธ ๋‚˜๋ˆ ์„œ dot product๋กœ ํ•ด์„œ ๋”ํ•ด๋„ ๊ฐ™์€ ๊ฒƒ์ž„ ใ…Žใ…Ž;

why is putting temporal on the high frequencies wrong?

Round and Round We Go! What makes Rotary Positional Encodings useful? https://arxiv.org/abs/2410.06205 — the observation that the high frequencies pick up local information while the low frequencies carry long-context information.


Proposed
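
Per the TL;DR, the change is to the frequency allocation: vanilla mRoPE gives the temporal axis the first (highest-frequency) pairs, while the paper moves it to the last (lowest-frequency) ones. An illustrative sketch of the two allocations (chunk sizes are made up for the example, not taken from the paper):

```python
import torch

axes = torch.tensor([0, 1, 2])         # 0 = t, 1 = h, 2 = w
sections = torch.tensor([16, 24, 24])  # rotary pairs per axis (sizes illustrative)

# vanilla mRoPE: temporal gets the first (highest-frequency) pairs
mrope = torch.repeat_interleave(axes, sections)
# proposed: spatial axes in front, temporal on the last (lowest-frequency) pairs
proposed = torch.repeat_interleave(axes[[1, 2, 0]], sections[[1, 2, 0]])

print(mrope[:3].tolist(), mrope[-3:].tolist())        # [0, 0, 0] [2, 2, 2]
print(proposed[:3].tolist(), proposed[-3:].tolist())  # [1, 1, 1] [0, 0, 0]
```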


Result
