image

paper

TL;DR

  • task : self-supervised learning -> image classification, object detection, image segmentation
  • problem : consider the strategy for choosing which tokens get masked in Masked Image Modeling (MIM), as used in SSL
  • idea : mask the tokens that receive high attention scores when the image is fed through a ViT!
  • architecture : a teacher ViT receives all input tokens, and the tokens with high attention scores are masked. The student solves the MIM task. The teacher's weights are updated as an exponential moving average (EMA) of the student's weights. The architecture is ViT-S/16
  • objective : MIM loss (= reconstruction loss), distillation loss (difference between the student's and the teacher's outputs for the [CLS] token)
  • baseline : iBOT, DINO, MST
  • data : ImageNet-1k for pretraining, CIFAR-10, CIFAR-100, Oxford Flower, COCO, ADE20K
  • result : higher performance than random masking
  • contribution : exploration of masking strategies in MIM
  • limitation or parts I did not understand :
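
The two mechanisms named under "architecture" (masking the highly attended tokens, and updating the teacher as an EMA of the student) can be sketched roughly as below. This is a minimal illustration and not the paper's implementation: it assumes one attention score per patch token has already been extracted from the teacher, and the function names, the `mask_ratio` parameter, and the `momentum` value are all hypothetical.

```python
def attention_guided_mask(attention_scores, mask_ratio):
    """Return one boolean per patch token; True means the token is masked.

    attention_scores: one score per patch token, assumed to come from the
    teacher ViT (how the scores are extracted is outside this sketch).
    mask_ratio: fraction of tokens to mask (hypothetical parameter).
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must lie in [0, 1]")
    num_masked = int(len(attention_scores) * mask_ratio)
    # Rank token indices from most attended to least attended.
    ranked = sorted(range(len(attention_scores)),
                    key=lambda i: attention_scores[i],
                    reverse=True)
    masked = set(ranked[:num_masked])
    return [i in masked for i in range(len(attention_scores))]


def ema_update(teacher_weights, student_weights, momentum=0.99):
    """Move each teacher weight toward the matching student weight.

    The momentum value is an assumption made for illustration only.
    """
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_weights, student_weights)]


# Toy usage: five patch tokens, mask the two most attended ones.
scores = [0.05, 0.40, 0.10, 0.30, 0.15]
mask = attention_guided_mask(scores, mask_ratio=0.4)
print(mask)  # [False, True, False, True, False]

# Toy usage: one EMA step on a two-parameter "model".
teacher = ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.9)
print(teacher)
```

In this sketch the student would receive the input with the `True` positions hidden and be trained to reconstruct them, while random masking would replace the ranking step with a random shuffle of the indices.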

Details

image image