
paper, code

TL;DR

  • I read this because : aka. CheferCAM. I'm interested in explainable CLIP scores, and this paper's repo publishes a Colab where you can see per-token visualization results.
  • task : explainability in neural networks
  • problem : the predecessor TiBA (https://github.com/long8v/PTIR/issues/158 ) only handles self-attention; we also want to handle the co-attention and encoder-decoder structures of multi-modal settings
  • idea : use the gradient with respect to the attention map, rather than the gradient with respect to earlier outputs (== LRP)
  • input/output : model // heatmap over text or vision tokens
  • architecture : ViT, VisualBERT, LXMERT, DETR
  • baseline : rollout, raw attention, Grad-CAM, Partial LRP, TiBA
  • evaluation : perturbation (on both image and text tokens for VisualBERT), weakly-supervised and semantic segmentation
  • result : better performance than the predecessor
  • contribution : a work that makes cross-attention and co-attention explainable as well. ICCV oral.
  • etc. : I wore myself out beforehand on deep Taylor decomposition and the like, but you can ignore all that and just read this paper; no heavy theory is needed and it is clean. And the performance is good. On the flip side, the lack of theoretical grounding leaves me a bit hesitant. For CLIP the final output is an embedding, so this may not actually be a visualization of the CLIPScore..? I should look at the Colab closely.

Details

Some notation

  • $i$ denotes image tokens
  • $t$ denotes text tokens
  • $A^{tt}$ is the text-to-text self-attention / $A^{ii}$ is the image-to-image self-attention
  • $A^{ti}$ is the multi-modal attention interaction

Relevancy initialization

relevancy map์„ ์ดˆ๊ธฐํ™” / ์—…๋ฐ์ดํŠธ ํ•  ๊ฑฐ์ž„ image

SA ์ „์—๋Š” ์„œ๋กœ ์ƒํ˜ธ์ž‘์šฉ์ด ์—†์–ด์„œ $R^{ii}$, $R^{tt}$๋Š” identity. $R^{it}$๋Š” zero tensor.

Relevancy update rules

We will update the relevancies using the attention map $A$. Following the predecessor, we average over the heads and use the gradient.

$$\bar{A} = \mathbb{E}_h\left[(\nabla A \odot A)^{+}\right]$$

Here $\nabla A$ is the derivative of $y_t$, the output for the class $t$ we want to visualize, with respect to $A$. Before taking the average over heads, only the positive values are kept (clamp); there is no particular justification for this, it simply follows the predecessor.
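A sketch of this head-averaging step in NumPy (my own naming; `attn` and `grad` are assumed to hold $A$ and $\partial y_t / \partial A$ with shape `(heads, query, key)`):

```python
import numpy as np

def averaged_attention(attn: np.ndarray, grad: np.ndarray) -> np.ndarray:
    """Gradient-weighted attention, positive part only, averaged over heads.

    attn, grad: (n_heads, n_query, n_key) -- the raw attention map A and
    the gradient of the target class score y_t with respect to A.
    Returns A-bar of shape (n_query, n_key).
    """
    weighted = grad * attn                 # elementwise product: grad ⊙ A
    weighted = np.clip(weighted, 0, None)  # keep positive values only
    return weighted.mean(axis=0)           # expectation over heads
```

In a real model the gradient would come from backprop (e.g. a hook on the attention tensor); here it is just passed in as an array.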

The relevance update rule for self-attention is as follows, where $s$ denotes the query tokens and $q$ the key tokens:

$$R^{ss} = R^{ss} + \bar{A}^{ss} \cdot R^{ss}$$

$$R^{sq} = R^{sq} + \bar{A}^{ss} \cdot R^{sq}$$

Here $R^{xx}$ can be split in two: the identity $I$ it was initialized with, and the residual $\hat{R}^{xx} = R^{xx} - I$. Because $\hat{R}^{xx}$ is built from gradients, its values are small in absolute terms. To fix this, each of its rows is normalized to sum to 1:

$$\bar{R}^{xx}_{m,n} = \hat{R}^{xx}_{m,n} \Big/ \sum_{k} \hat{R}^{xx}_{m,k}$$

co-attention / cross-attention์˜ ๊ฒฝ์šฐ update rule์„ ์•„๋ž˜์™€ ๊ฐ™์ด ์ •์˜ํ•ด์คŒ image

Obtaining classification relevancies

[CLS] ํ† ํฐ์˜ row์— ํ•ด๋‹นํ•˜๋Š” relevancy map์„ ๋ณด๋ฉด ๋˜๋Š”๋ฐ text ์— ๋Œ€ํ•œ๊ฑธ ๋ณด๋ ค๋ฉด $R^{tt}$์˜ ์ฒซ๋ฒˆ์งธ row๋ฅผ ๋ณด๋ฉด ๋˜๊ณ  image์— ๋Œ€ํ•œ๊ฑธ ๋ณด๋ ค๋ฉด $R^{ti}$์˜ ์ฒซ๋ฒˆ์งธ row๋ฅผ ๋ณด๋ฉด ๋จ

Adaptation to attention type

  • ๋‘ modality์˜ ํ† ํฐ์ด concat๋˜์–ด SA์— ๋“ค์–ด๊ฐ€๋Š” ๊ฒฝ์šฐ: ์ „์ฒด $R^{(i+t, i+t)}$์—์„œ [cls] token์— ํ•ด๋‹นํ•˜๋Š” row($R^{i+t}$)์˜ Relevancy map์œผ๋กœ ๋งŒ๋“ค ์ˆ˜ ์žˆ์Œ.
  • ๋‘ modality๊ฐ€ ๊ฐ๊ฐ SA ๋จผ์ € ํ•˜๊ณ  ์„œ๋กœ CA๋กœ ์ •๋ณด๊ตํ™˜ํ•˜๋Š” ๊ฒฝ์šฐ(co-attention): ์œ„์—์„œ ์„ค๋ช…ํ•œ propagation์„ ๋‹ค ํ•ด์•ผ ํ•จ. ์ดํ›„ relavancy map์€ ๋ถ„๋ฅ˜ ๋ชจ๋ธ์˜ relevancy๋ฅผ ๋ณด๋Š” ๊ฒƒ๊ณผ ๊ฐ™์€ ๋ฐฉ์‹์œผ๋กœ ๋ณด๋ฉด ๋จ
  • encoder-decoder๊ตฌ์กฐ: cross-attention์ด ํ•œ ๋ฐฉํ–ฅ์œผ๋กœ๋งŒ ์ด๋ฃจ์–ด์ง€๋ฏ€๋กœ equation 11์€ ์•ˆํ•ด๋„ ๋จ

Result

(result figures from the paper omitted)