image

a.k.a. TextSpan (paper, code)

TL;DR

  • I read this because.. : it kept coming up while searching for work on CLIP spurious cues
  • task : extracting text descriptions of the representations in CLIP ViT's layers and heads
  • idea : build a pool of 3,948 general description sentences with humans + GPT, then pick the text whose dot products with the image representations have the highest variance and add it to the projection basis
  • input/output : {image, model} -> text explanation of ViT layers and heads
  • architecture : ViT-B-16, ViT-L-14, ViT-H-14
  • baseline : LRP, Partial-LRP, rollout, raw attention, GradCAM, Chefer2021
  • data : ImageNet (mean ablation), Waterbirds dataset (reducing spurious cues), ImageNet-Segmentation (zs-segmentation)
  • evaluation : accuracy (ImageNet), worst-group accuracy (Waterbirds), pixel accuracy/mIoU/mAP (zs-segmentation)
  • result : only the last 4 MSA layers affect the final prediction while the other layers barely matter; qualitatively very interesting results; SOTA on zs-segmentation
  • contribution : proposes an algorithm that makes each of CLIP's representations explainable in text.
  • etc. :

Details

  • Multimodal neurons in artificial neural networks https://openai.com/index/multimodal-neurons
    • a paper showing that the representations CLIP learns per layer and head are highly interpretable
  • Disentangling visual and written concepts in CLIP
    • a paper that uses the methodology above to write and erase text in image representations

Preliminary findings

image

last 4 layer์˜ MSA๋งŒ ์„ฑ๋Šฅ์— ์˜ํ–ฅ์„ ์ฃผ๊ณ  MLP๋‚˜ ๊ทธ์ „์˜ MSA ๋ ˆ์ด์–ด๋“ค์€ mean ablate๋ฅผ ํ•ด๋„ ์„ฑ๋Šฅ์— ํฐ ์˜ํ–ฅ์ด ์—†์—ˆ๋‹ค.

Decomposition to head

image

The MSA output can be written as above, where $\alpha$ is the attention score.

image

์—ฌ๊ธฐ์— projection $P$ ๊นŒ์ง€ ํฌํ•จํ•ด์„œ ํ‘œํ˜„ํ•˜๋ฉด ์œ„์™€ ๊ฐ™์€ ์‹์ด ๋จ. ์ฆ‰ ๋ ˆ์ด์–ด, head, patch ๋ณ„๋กœ projection๊ณผ attention ์—ฐ์‚ฐ $c_{i, j, h}$๋ฅผ summationํ•˜์—ฌ ๊ฐ ๋ ˆ์ด์–ด, ํ—ค๋“œ ๋“ฑ์˜ ํ‘œํ˜„์„ ๊ตฌํ•  ์ˆ˜ ์žˆ์Œ

TextSpan algorithm

image

๋ณต์žกํ•ด ๋ณด์ด๋Š”๋ฐ ๋ณ„๊ฑฐ ์—†์Œ

  • Multiply the per-layer, per-head attention outputs $C\in\mathbb{R}^{K\times d'}$ with the text representations $R\in\mathbb{R}^{M\times d'}$, find the text $j$ whose projections have the highest variance, and add this $\tau$ to the projection basis. Then update $C$ and $R$ by projecting this direction out, so that the next selected representation is orthogonal to it (similar in spirit to PCA).
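The greedy loop above can be sketched in a few lines of numpy. Variable names, the normalization, and the fixed iteration count `m` are my assumptions, not the paper's released code:

```python
import numpy as np

def textspan(C, R, m):
    """TextSpan-style greedy selection (sketch): given one head's attention
    outputs C (K images x d') and a pool of text representations R (M x d'),
    repeatedly pick the text whose dot products with C have the highest
    variance, then Gram-Schmidt it out of both sides."""
    C_res, R_res = C.astype(float).copy(), R.astype(float).copy()
    selected = []
    for _ in range(m):
        scores = C_res @ R_res.T                  # (K, M) image-output x text dot products
        j = int(scores.var(axis=0).argmax())      # text with highest projection variance
        tau = R_res[j] / (np.linalg.norm(R_res[j]) + 1e-8)
        selected.append(j)
        # project tau out of both C and R so the next pick is orthogonal
        C_res -= np.outer(C_res @ tau, tau)
        R_res -= np.outer(R_res @ tau, tau)
    return selected
```

Running this per `(layer, head)` pair yields the ranked text descriptions shown below.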

The resulting per-layer / per-head text descriptions: image

Result

Quantitative

image

Qualitative

image image image