image

paper, code

Abstract

ViT์˜ multi-head self-attention์€ ์ด๋ฏธ์ง€ ํŒจ์น˜๋“ค์˜ ์‹œํ€€์Šค๋“ค์„ ์œ ์—ฐํ•˜๊ฒŒ ์ฐธ์กฐํ•œ๋‹ค. ์ค‘์š”ํ•œ ์ ์€ ๊ทธ๋Ÿฌ ์œ ์—ฐํ•จ์ด ์ž์—ฐ์ด๋ฏธ์ง€์—์„œ์˜ nuisances(๋ฐฉํ•ด๋ฌผ)์„ ์–ด๋–ป๊ฒŒ ์ž˜ ์ด์šฉํ•˜๋ƒ์ด๋‹ค. ์šฐ๋ฆฌ๋Š” ๋‹ค์–‘ํ•œ ์‹คํ—˜๋“ค์„ ํ†ตํ•ด CNN๊ณผ ๋น„๊ตํ•˜์—ฌ ViT๋ฅ˜๋“ค์ด ์–ด๋–ค ํŠน์„ฑ์„ ๊ฐ€์ง€๊ณ  ์žˆ๋Š”์ง€ ์‹คํ—˜ํ•ด๋ณด์•˜๋‹ค. (a) ํŠธ๋žœ์Šคํฌ๋จธ๋Š” ์‹ฌํ•œ occlusion, perturbation, domain shift์— ๊ฐ•ํ•˜๋‹ค. ๊ฐ€๋ น ์ด๋ฏธ์ง€์˜ 80%๋ฅผ occlusion์œผ๋กœ ์ œ๊ฑฐํ•ด๋„ 60%์˜ top-1 accuracy๋ฅผ ๋‹ฌ์„ฑํ–ˆ๋‹ค. image

(b) This robustness in (a) is not due to texture bias; rather, it is because ViTs are far less biased toward local texture. When trained to properly encode shape-based features, ViTs reach a level of shape recognition comparable to that of humans, which no prior work had shown. (c) Using a ViT to encode shape representations enables accurate semantic segmentation without any pixel-level supervision. (d) Off-the-shelf features from a single ViT model can be combined into a feature ensemble, achieving higher accuracy.
We show that the flexible and dynamic receptive field of ViTs is what makes their features effective.

Intriguing Properties of Vision Transformer

Are Vision Transformers Robust to Occlusions?

Occlusion Modeling:

Given an image x with label y, x is represented as a sequence of N patches. We select M of these N patches and set them to 0 to create an occluded image x' (a method the paper calls PatchDrop). PatchDrop is applied in the three variants below. image
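The random variant of the patch-dropping step above can be sketched as follows (a minimal NumPy sketch; the 16-pixel patch size and the function/parameter names are assumptions, and the paper's salient/non-salient variants differ only in how the M indices are chosen):

```python
import numpy as np

def patch_drop(x, drop_ratio, patch=16, rng=None):
    """Zero out a random subset of patches (Random PatchDrop).

    x          : image array of shape (H, W, C), H and W divisible by `patch`
    drop_ratio : fraction of patches to drop, i.e. information loss IL = M / N
    """
    rng = rng or np.random.default_rng()
    h, w = x.shape[0] // patch, x.shape[1] // patch
    n = h * w                               # N patches in total
    m = int(n * drop_ratio)                 # M patches to drop
    idx = rng.choice(n, size=m, replace=False)
    out = x.copy()
    for i in idx:
        r, c = divmod(i, w)
        out[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch, :] = 0
    return out

# Example: drop 50% of the 14x14 = 196 patches of a 224x224 image
img = np.ones((224, 224, 3), dtype=np.float32)
occluded = patch_drop(img, drop_ratio=0.5)
```

Since each dropped patch is zeroed entirely, the fraction of zeroed pixels equals the information loss IL.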

Robust Performance of Transformer Against Occlusions

  • ํ•™์Šต์€ ImageNet์œผ๋กœ ๋ถ„๋ฅ˜ ๋ฌธ์ œ๋ฅผ ํ’€์—ˆ๊ณ , validation set์˜ ์ •ํ™•๋„๋กœ ํ‰๊ฐ€ํ–ˆ๋‹ค.
  • Information Loss : ์ „์ฒด ํŒจ์น˜์ค‘ ๋“œ๋ž๋œ ํŒจ์น˜์˜ ๋น„์œจ์„ IL๋กœ ์ •์˜ (= M / N)
  • ์•„๋ž˜ ๊ทธ๋ž˜ํ”„๋ฅผ ๋ณด๋ฉด CNN๋ณด๋‹ค ViT๊ฐ€ ํ›จ์”ฌ ๊ฐ•๊ฑดํ•˜๋‹ค. image

ViT Representations are Robust against Information Loss

occlusion์— ๋Œ€ํ•œ ๋ชจ๋ธ์˜ ๋ฐ˜์‘์„ ๋” ์ž˜ ํŒŒ์•…ํ•˜๊ธฐ ์œ„ํ•˜์—ฌ, ๋‹ค๋ฅธ ๋ ˆ์ด์–ด์˜ ๊ฐ ํ—ค๋“œ๋“ค์˜ ์–ดํ…์…˜์„ ์‹œ๊ฐํ™”ํ•ด๋ณด์•˜๋‹ค. ์ดˆ๋ฐ˜์˜ ๋ ˆ์ด์–ด์—์„œ๋Š” ๋ชจ๋“  ์˜์—ญ์„ attendํ•˜์ง€๋งŒ ๊นŠ์–ด์งˆ ์ˆ˜๋ก ์ด๋ฏธ์ง€์—์„œ occlude๋˜์ง€ ์•Š์€ ์˜์—ญ์— ์ง‘์ค‘ํ•˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์—ˆ๋‹ค. image

์œ„์—์„œ ๋งํ•œ ๋ ˆ์ด์–ด๊ฐ€ ๊นŠ์–ด์งˆ ๋•Œ ๋‹ฌ๋ผ์ง€๋Š” ๋ณ€ํ™”์— ๋Œ€ํ•ด token invariance๊ฐ€ ์žˆ๋Š”์ง€ ํ™•์ธ ํ•ด๋ณด๊ณ ์ž ํ•œ๋‹ค. ์šฐ๋ฆฌ๋Š” ์›๋ž˜ ์ด๋ฏธ์ง€์™€ occlude๋œ ์ด๋ฏธ์ง€์— ๋Œ€ํ•ด feature(๋˜๋Š” token)๊ฐ„์˜ correlation coefficient๋ฅผ ๊ณ„์‚ฐํ•˜์˜€๋‹ค. ResNet50์˜ ๊ฒฝ์šฐ์—, logit ๋ ˆ์ด์–ด ์ „์˜ feature๋ฅผ ์‚ฌ์šฉํ–ˆ๊ณ , ViT์˜ ๊ฒฝ์šฐ ๋งˆ์ง€๋ง‰ transformer block์˜ class ํ† ํฐ์„ ๊ฐ€์ ธ์™”๋‹ค. ResNet์— ๋น„ํ•ด ViT์˜ class token์€ ๋” ๊ฐ•๊ฑดํ–ˆ๋‹ค.(=correlation์ด ๋†’์•˜๋‹ค) ์ด๋Ÿฌํ•œ ์„ฑํ–ฅ์€ ๋น„๊ต์  ์ž‘์€ object๋ฅผ ๊ฐ€์ง„ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์…‹์—์„œ๋„ ๋™์ผํ–ˆ๋‹ค. image

Shape vs. Texture: Can Transformers Model Both Characteristics?

Does Positional Encoding preserve the global image context?