image paper

Problem : ViT underperforms CNNs on small datasets. It captures local features poorly, and its attention structure was not designed for vision.

Solution : Instead of feeding plain tokens to the Transformer, use the output of a T2T module. A T2T module feeds the image, cut into n x n patches, into a Transformer, then restructures the output tokens so that they again have a width and height (w, h) like an image. Adjacent tokens are then grouped into one patch, the tokens in each patch are concatenated, and the result is passed to the next T2T module. The output after repeating this n times is fed to an efficient Transformer backbone for training.

Result : Outperforms CNNs and ViTs of similar or larger size on image classification.

Thoughts : ViT was introduced with the claim that it has no inductive bias, yet models that borrow CNN structures end up learning faster, so inductive bias seems necessary for learning quickly. (On large datasets the model presumably picks up such inductive biases by itself during training, which may be why it does better there.) Even so, why is the Transformer better than a CNN? Parallelization, perhaps? Or the ability to handle multi-modal input?

details : paper summary
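
To make the restructure-then-merge step in the Solution concrete, here is a minimal pure-Python sketch. It is not the paper's implementation: the function names `reshape_to_grid` and `soft_split`, and the window size, stride, and padding values, are assumptions chosen for illustration. The Transformer layer that sits between the steps is left out.

```python
def reshape_to_grid(tokens, h, w):
    """Lay a flat token sequence out as an h x w grid, like an image."""
    assert len(tokens) == h * w, "token count must equal h * w"
    return [tokens[r * w:(r + 1) * w] for r in range(h)]


def soft_split(grid, k, stride, pad):
    """Concatenate every k x k window of neighbouring tokens into one token.

    Windows overlap when stride < k. Positions outside the grid are
    zero-padded. Returns (new_tokens, out_h, out_w).
    """
    h, w = len(grid), len(grid[0])
    dim = len(grid[0][0])
    zero = [0.0] * dim
    out_h = (h + 2 * pad - k) // stride + 1
    out_w = (w + 2 * pad - k) // stride + 1
    out = []
    for i in range(out_h):
        for j in range(out_w):
            merged = []
            for di in range(k):
                for dj in range(k):
                    r = i * stride - pad + di
                    c = j * stride - pad + dj
                    inside = 0 <= r < h and 0 <= c < w
                    merged.extend(grid[r][c] if inside else zero)
            out.append(merged)
    return out, out_h, out_w


# Hypothetical toy input: 16 tokens of dimension 2 on a 4 x 4 grid.
tokens = [[float(i), float(-i)] for i in range(16)]
grid = reshape_to_grid(tokens, 4, 4)
merged, new_h, new_w = soft_split(grid, k=3, stride=2, pad=1)
print(new_h, new_w, len(merged), len(merged[0]))
```

Each pass shortens the sequence (here 16 tokens become 4) and makes each token wider (2 values become 18), so every new token carries information from its local neighbourhood.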