image

paper

TL;DR

  • I read this because.. : TextSpan (https://github.com/long8v/PTIR/issues/172 ) says it uses the OV circuit from this paper, and it also seems to be what the mean ablation relies on, but I couldn't follow the details, so I read this.
  • problem : Let's understand how a Transformer works by decomposing it into circuits.

Details

I wondered what the word "circuit" meant, and it turns out it started with https://distill.pub/2020/circuits/zoom-in/ , a paper by a similar set of authors. It analyzes sub-graphs of how features are connected inside a neural network. Hmm.. I'd need to read it carefully to be sure, but it seems to be an approach of separating out whatever can be separated. I was curious how the visualizations there are made; for the activated layers they apparently use a method called DeepDream (https://en.wikipedia.org/wiki/DeepDream , code). I'd always wondered how those LSD-looking pictures were drawn; who knew the paper was that old..

High-Level Architecture

image

transformer๋Š” ๋Œ€์ถฉ ๋ณด๋ฉด ์ด๋ ‡๊ฒŒ ์ƒ๊ฒผ๋‹ค

  1. token embedding
  2. the part that adds each head's output $h(x_i)$ to the residual stream
  3. the part that applies an MLP to the residual stream and adds the result back to the residual stream
  4. word unembedding (=> logit prediction)

์—ฌ๊ธฐ์„œ “residual stream"์„ ๋ถ„์„ํ•˜๊ธฐ๋ฅผ channel ๊ฐ„ ์ปค๋ฎค๋‹ˆ์ผ€์ด์…˜์„ ํ•˜๋Š” ๊ณณ์ด๋ผ๊ณ  ๋ถ„์„ํ•œ๋‹ค. image

residual๋กœ ์—ฐ๊ฒฐ๋˜๋Š” ๋ถ€๋ถ„์ด ์žˆ์œผ๋‹ˆ๊นŒ ๊ฐ ๋ ˆ์ด์–ด์˜ hidden ๋ผ๋ฆฌ๋Š” ์„œ๋กœ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•˜๋‹ค image

image

Attention Heads are independent and additive

image

์š”๊ฑด ๊ทธ๋ƒฅ ํ–‰๋ ฌ ์—ฐ์‚ฐ์ธ๋ฐ ๊ฐ head ๋ณ„๋กœ concatํ•˜๊ณ  $W_o$๋ฅผ ํ•˜๋Š” ์‹์œผ๋กœ ๋˜์–ด์žˆ์ง€๋งŒ ์‹ค์ œ๋กœ ์ด๊ฑด ๊ฐ head๋ณ„๋กœ $W_o^{h_i}$๋ฅผ ๊ณฑํ•œ๋‹ค์Œ summation ํ•˜๋Š” ๊ฒƒ๊ณผ ๋™์น˜์ด๋‹ค. ์ฆ‰ ๊ฐ head ๋ณ„๋กœ residual stream์— ์ •๋ณด๋ฅผ ๋„ฃ์—ˆ๋‹ค ๋บ๋‹ค ํ•œ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

Attention Heads as Information Movement

์ด๋•Œ residual stream์—์„œ ์ •๋ณด๋ฅผ ์ฝ๋Š” ๊ฒƒ๊ณผ ์“ฐ๋Š” ๊ฒƒ์ด ์™„์ „ ๋ถ„๋ฆฌ๋  ์ˆ˜ ์žˆ๋‹ค. ์ด๋ฅผ ๋ณด๊ธฐ ์œ„ํ•ด attention ์—ฐ์‚ฐ์„ ์กฐ๊ธˆ ๋‹ค๋ฅด๊ฒŒ ์จ๋ณด์ž.

  1. ๊ฐ ํ† ํฐ๋“ค์ด residual stream์œผ๋กœ ๋ถ€ํ„ฐ ๋ด…ํ˜€์ ธ value vector๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค $v_i=W_Vx_i$
  2. attention score $A_i$๋ฅผ ๋ฐ›๊ณ  linear combination ํ•˜์—ฌ result vector๋ฅผ ๊ตฌํ•œ๋‹ค $r_i=\sum_j A_{i,j} v_j$
  3. ๊ฐ head๋ณ„๋กœ output vector๋ฅผ ๊ตฌํ•œ๋‹ค $h(x)_i=W_Or_i

๊ฐ step์€ matrix multiply๋กœ ์ ์„ ์ˆ˜ ์žˆ๋Š”๋ฐ, ์™œ ํ•˜๋‚˜์˜ matrix๋กœ ํ•ฉ์น˜์ง€ ์•Š๋ƒ๋ฉด, $x$๋Š” (seq_len, head_dim)์˜ 2์ฐจ์› ํ…์„œ์ธ๋ฐ, $W_v$, $W_o$๋ฅผ ๊ณฑํ•˜๋Š”๊ฑด head_dim ์ฐจ์›์—์„œ ์ผ์–ด๋‚˜๊ณ  $A$๋ฅผ ๊ณฑํ•˜๋Š”๊ฑด seq_len ์—์„œ ์ผ์–ด๋‚˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ์œ„์˜ ์—ฐ์‚ฐ์„ Tensor product ๋กœ ํ‘œํ˜„ํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค. image

The contextualized embedding $x$ is turned into values $V$, multiplied by the attention scores $A$, and then multiplied by the output matrix. Rearranged, this becomes the following, and $W_OW_V$ can be merged into a single matrix.
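A small sketch of this (toy dimensions, my own code) comparing the step-by-step form against $A\,x\,(W_OW_V)^\top$, where $A$ acts on the sequence axis and $W_OW_V$ on the embedding axis:

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model, d_head = 5, 16, 4

x = rng.normal(size=(seq_len, d_model))          # residual stream
W_V = rng.normal(size=(d_head, d_model))          # read: residual -> value
W_O = rng.normal(size=(d_model, d_head))          # write: value -> residual
A = rng.random(size=(seq_len, seq_len))
A = A / A.sum(axis=-1, keepdims=True)             # stand-in attention pattern (rows sum to 1)

# step-by-step form: v_i = W_V x_i, r_i = sum_j A_ij v_j, h(x)_i = W_O r_i
v = x @ W_V.T                                     # (seq_len, d_head)
r = A @ v                                         # (seq_len, d_head)
h_step = r @ W_O.T                                # (seq_len, d_model)

# merged form: A moves information between positions, W_OV = W_O W_V decides
# what is read and what is written; the two factors act on different axes of x
W_OV = W_O @ W_V                                  # (d_model, d_model), rank <= d_head
h_merged = A @ x @ W_OV.T

print(np.allclose(h_step, h_merged))              # True
```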

image

Observations about attention heads

  • attention head๋Š” residual stream์—์„œ token์ด ๋‹ค๋ฅธ ํ† ํฐ์œผ๋กœ ์˜ฎ๊ฒจ๊ฐ€๋Š” ์—ญํ• ์„ ํ•œ๋‹ค. residual vector space๋ฅผ “contextualized word embedding"์ด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค.
  • ์ด๋•Œ $A$์™€ $W_OW_V$ ๋‘๊ฐœ์˜ linear operation์œผ๋กœ ๋ณผ ์ˆ˜ ์žˆ๋Š”๋ฐ ๋‘๊ฐœ๊ฐ€ ๋‹ค๋ฅธ ์—ญํ• ์„ ํ•˜๋ฉฐ ์›€์ง์ธ๋‹ค.
    • $A$๋Š” “์–ด๋–ค token"์˜ ์ •๋ณด๊ฐ€ ์–ด๋””์„œ ์–ด๋””๋กœ ๊ฐ€๋Š”์ง€๋ฅผ ๊ด€์žฅํ•œ๋‹ค
    • $W_OW_V$๋Š” source token์—์„œ “์–ด๋–ค ์ •๋ณด"๊ฐ€ ์ฝํžˆ๊ณ  ์ž‘์„ฑ๋˜๋Š”์ง€๋ฅผ ์ •ํ•œ๋‹ค.
  • ์ด๋•Œ $A$๋งŒ softmax๊ฐ€ ์žˆ์–ด์„œ nonlinearํ•˜๊ณ  $A$๋ฅผ ๊ณ ์ •ํ•˜๋ฉด linear์—ฐ์‚ฐ์œผ๋กœ ๋ณผ ์ˆ˜์žˆ๋‹ค.
  • $W_Q$, $W_K$๋Š” ํ•ญ์ƒ ๊ฐ™์ด ์›€์ง์ด๊ณ  ๊ทธ๋ž˜์„œ ์šฐ๋ฆฌ๋Š” $W_OW_V$, $W_Q^TW_V$๋ฅผ ํ•˜๋‚˜์˜ low rank matrix์ฒ˜๋Ÿผ ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ๋‹ค.
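A quick check of that last bullet (toy shapes, my own code): the pre-softmax score between two positions can be written with a single matrix $W_{QK} = W_Q^T W_K$ of rank at most d_head, just like $W_{OV} = W_O W_V$:

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_head = 16, 4

W_Q = rng.normal(size=(d_head, d_model))
W_K = rng.normal(size=(d_head, d_model))
x_i = rng.normal(size=d_model)   # destination-token residual vector
x_j = rng.normal(size=d_model)   # source-token residual vector

# usual form: project to query/key, then dot product
score_qk = (W_Q @ x_i) @ (W_K @ x_j)

# merged form: one bilinear form with a rank-<=d_head matrix
W_QK = W_Q.T @ W_K               # (d_model, d_model)
score_merged = x_i @ W_QK @ x_j

print(np.isclose(score_qk, score_merged))        # True
print(np.linalg.matrix_rank(W_QK) <= d_head)     # True: low rank
```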

Zero-Layer Transformer

mhsa๊ฐ€ ์—†๋Š” ๊ทธ๋ƒฅ zero-layer transformer๋Š” ์ผ์ข…์˜ bigram์„ ํ•™์Šตํ•˜๋Š” ์—ญํ• ์„ ํ•œ๋‹ค. image

One-Layer Attention-Only Transformer

image

์•„๋ž˜์™€ ๊ฐ™์ด ์ •๋ฆฌ๋  ์ˆ˜ ์žˆ๋‹ค. h๋Š” ๊ฐ head๋ณ„ ์—ฐ์‚ฐ์ด๊ณ  sum์œผ๋กœ ๊ตฌํ•  ์ˆ˜ ์žˆ๋‹ค (์œ„์˜ ์„น์…˜์—์„œ ์ •๋ฆฌํ–ˆ๋“ฏ์ด) ์ด๊ฑธ tensor notation์œผ๋กœ ๋ฐ”๊พธ๋ฉด image

์ด๋ ‡๊ณ  ์ด๊ฑธ ๋‹ค์‹œ ๋ฐ”๊พธ๋ฉด image

์ด๋ ‡๊ฒŒ ๋‘๊ฐœ๋กœ ๋ถ„๋ฆฌ๋œ๋‹ค. ์•ž์˜ term์€ zero-layer transformer์˜ bigram statistics๋ฅผ ์ „๋‹ฌํ•˜๋Š” ์—ญํ•  ๋’ค์˜ ํ•ญ์€ attention head

Splitting Attention Head terms into Query-Key and Output-Value Circuits

๋‘๋ฒˆ์งธ ํ•ญ์„ ๋˜ ๋ถ„๋ฆฌํ•  ์ˆ˜ ์žˆ๋‹ค. image

image

์•ž์— ์„ค๋ช…ํ–ˆ๋“ฏ์ด OV cirtcuit์€ how to attend ์ด๊ณ  QK circuit์€ ์–ด๋–ค token์„ attend ํ•  ๊ฒƒ์ด๋ƒ ์ด๋‹ค.

OV AND QK INDEPENDENCE (THE FREEZING ATTENTION PATTERNS TRICK)

์ด๊ฑฐ ๋ณด๋ ค๊ณ  ๋‚ด๊ฐ€ ์ฝ์Œ.. ๊ฒฐ๋ก ์€ ๋‘๋ฒˆ forwardํ•ด์„œ QK circuit์„ ์ €์žฅํ•ด ๋†“๊ณ  ์ด๊ฑธ ๊ณ ์ •๋œ ๊ฐ’์œผ๋กœ ๋ณด๊ณ  OV circuit์„ ๋ถ„์„ํ•˜๋ฉด linear ํ•˜๋ฏ€๋กœ ์—ฌ๋Ÿฌ ์žฌ๋ฐŒ๋Š” ๋ถ„์„์„ ํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ !

Thinking of the OV and QK circuits separately can be very useful, since they’re both individually functions we can understand (linear or bilinear functions operating on matrices we understand). But is it really principled to think about them independently? One thought experiment which might be helpful is to imagine running the model twice. The first time you collect the attention patterns of each head. This only depends on the QK circuit. The second time, you replace the attention patterns with the “frozen” attention patterns you collected the first time. This gives you a function where the logits are a linear function of the tokens! We find this a very powerful way to think about transformers.
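A rough sketch of that "run twice" idea (toy single-head model, my own code, not the paper's): pass 1 records the attention pattern (the only nonlinear part, coming from the QK circuit); pass 2 reuses it as a constant, so the logits become an exactly linear function of the input tokens and the OV circuit can be studied on its own.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(6)
seq_len, d_vocab, d_model, d_head = 5, 10, 16, 4

W_E = rng.normal(size=(d_model, d_vocab))
W_U = rng.normal(size=(d_vocab, d_model))
W_Q = rng.normal(size=(d_head, d_model))
W_K = rng.normal(size=(d_head, d_model))
W_V = rng.normal(size=(d_head, d_model))
W_O = rng.normal(size=(d_model, d_head))

def forward(one_hot, frozen_A=None):
    x = one_hot @ W_E.T
    if frozen_A is None:                       # pass 1: compute A from the QK circuit
        A = softmax((x @ W_Q.T) @ (x @ W_K.T).T / np.sqrt(d_head))
    else:                                      # pass 2: treat A as a constant
        A = frozen_A
    resid = x + A @ x @ (W_O @ W_V).T
    return resid @ W_U.T, A

tokens = rng.integers(0, d_vocab, size=seq_len)
one_hot = np.eye(d_vocab)[tokens]

_, A_frozen = forward(one_hot)                 # first run: collect attention patterns
logits, _ = forward(one_hot, frozen_A=A_frozen)

# with A frozen, logits are linear in the one-hot tokens:
# scaling the input scales the logits by the same factor
logits_2x, _ = forward(2 * one_hot, frozen_A=A_frozen)
print(np.allclose(2 * logits, logits_2x))      # True
```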

์‚ฌ์‹ค ์ด ๋’ค์— ๋ถ€ํ„ฐ๊ฐ€ ๋” ์žฌ๋ฐŒ๋Š” ๊ฒƒ ๊ฐ™์€๋ฐ… ์ง€์ณ์„œ ์—ฌ๊ธฐ๊นŒ์ง€๋งŒ ์ฝ๋Š”๋‹ค.