image

paper

TL;DR

  • I read this because.. : multilingual CLIP
  • task : multimodal alignment
  • problem : we want a multilingual CLIP, but models trained on machine-translated captions fail to capture the target language's culture and vocabulary
  • idea : collect large-scale native data and train on it
  • input/output : image + text / similarity score (for CLIP)
  • architecture : image encoder (ViT-B/32) and text encoder (Transformer)
  • objective : MSE (for MAE) and InfoNCE (for CLIP)
  • baseline : CLIP, UNITER, Visual N-Grams, ImageBERT
  • data : Korean {image, text} pairs collected from the web + whatever English {image, text} pairs are publicly available
  • evaluation : image classification / retrieval
  • result : beats CLIP even on English benchmarks
  • contribution : a Korean CLIP, plus several training findings; the results section even includes diffusion experiments. A lot of analysis for just two authors.
  • etc. :

Details

motivation

  • multi-lingual CLIP์„ ๋งŒ๋“ค๊ณ  ์‹ถ์Œ
  • ์ฃผ๋กœ ํ•˜๋Š” approach๋Š” ๊ทธ๋ƒฅ text๋ฅผ machine translation ๋Œ๋ ค์„œ ํ•˜๋Š”๋ฐ ์ด๊ฑด ๊ทธ ๋‚˜๋ผ๋งŒ์˜ ์–ดํœ˜๋‚˜ ๋ฌธํ™”๋ฅผ ๋‹ด์„ ์ˆ˜ ์—†๋‹ค
  • english-korean bilingual ํ•™์Šต
    • ๋ฐ์ดํ„ฐ์…‹ ์ œ์•ˆ
    • training scheme ์ œ์•ˆ
      • MAE๋กœ 1๋‹จ๊ณ„๋กœ ํ•™์Šต
      • multi crop๊ธฐ๋ฒ• ์‚ฌ์šฉ
    • ๋ช‡๊ฐ€์ง€ finding
      • ์ง์ ‘ bilingual supervision์„ ์•ˆ์ฃผ๋”๋ผ๋„ image๋ฅผ ํ†ตํ•ด์„œ embedding space๊ฐ€ ๋งž์ถฐ์ง€๋”๋ผ
      • SimCLR์—์„œ ์‚ฌ์šฉ๋˜๋Š” strong augmentation ์˜คํžˆ๋ ค ๋ฐฉํ•ด๋˜๋”๋ผ

training scheme

  • CLIP๊ณผ ๋‹ค๋ฅธ ๋‘ ๊ฐ€์ง€
  • ๋ฐ”๋กœ contrastive๋กœ ํ•™์Šตํ•˜๋Š”๊ฒŒ ์•„๋‹ˆ๋ผ MAE๋กœ ๋จผ์ € vision encoder๋ฅผ ํ•™์Šต
image
  • multi-crop augmentation ์‚ฌ์šฉ
    • standard resolution 224 x 224 / low resolution 96 x 96
  • ์œ„ ๋‘๊ฐ€์ง€์— ๋Œ€ํ•œ ablation
image
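After MAE pretraining, the second stage is standard CLIP-style contrastive training. A minimal PyTorch sketch of the symmetric InfoNCE objective (my own sketch, not the authors' code; the 0.07 temperature is an assumption following CLIP's usual initialization):

```python
import torch
import torch.nn.functional as F

def clip_infonce_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    # L2-normalize so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # [B, B] similarity matrix; diagonal entries are the positive pairs.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    # Cross-entropy in both directions (image->text and text->image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

With multi-crop, the same loss would additionally be applied to the low-resolution 96 x 96 crops against the same text batch.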

dataset

  • English {image, text} pairs
    • CUB200
    • 37.4M pairs from WIT (108 languages)
    • YFCC15M (the subset CLIP filtered from the full 100M)
    • CC3M
    • CC12M
    • LAION400M
    • 70M additional pairs built from Common Crawl web dumps, following LAION's pipeline
  • Korean {image, text} pairs : 708M
    • the paper just says they were crawled
    • includes 50M pairs of celebrity faces with names
    • includes Korean Wikipedia
    • far larger than LAION400M or CLIP's WIT-400M
  • in total the training set comes to roughly ≥ 1B pairs

training detail

  • implementation : plain PyTorch, nothing else

  • text encoder

    • GPT-2 style Transformer (?) / 63M params / 12 layers / hidden dim 512 / 8 heads

      • gpt-2 style transformer

        Layer normalization (Ba et al., 2016) was moved to the input of each sub-block, similar to a pre-activation residual network (He et al., 2016) and an additional layer normalization was added after the final self-attention block. A modified initialization which accounts for the accumulation on the residual path with model depth is used. We scale the weights of residual layers at initialization by a factor of 1/ โˆšN where N is the number of residual layers.

    • tokenizer : 2M์˜ english / 1.5M korean์œผ๋กœ ํ•™์Šตํ•œ BPE 98K vocab size

      • (โ†” CLIP์€ 49K)
  • visual encoder

    • ViT-B/32
    • 256 x 256 input resolution (?)
  • ๊ธฐํƒ€ hparams

image
  • training
    • half precision
    • 80 A100s → 16 hours for MAE pretraining / 362 hours (~15 days) for multimodal training

Benchmark Dataset

  • zs-classification
    • the English benchmark labels were machine-translated into Korean
    • ImageNet / CIFAR10 / CIFAR100 / CLEVR Counts / Describable Textures Dataset / EuroSAT / FER2013 / Food101 / GTSRB / MNIST / RESISC45 / StanfordCars
    • (in-house data) WebKorean
      • 36,826 images ↔ 428 Korean labels
  • zs-retrieval
    • Flickr30k / MSCOCO (English) / MSCOCO (Korean)
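Zero-shot classification here works the CLIP way: each label is wrapped in a prompt (e.g. "a photo of a {label}", or its Korean translation for the Korean benchmarks), encoded with the text encoder, and an image is assigned to the class whose prompt embedding is nearest by cosine similarity. A minimal sketch assuming precomputed embeddings (my own illustration, not the paper's code):

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_embs: torch.Tensor,
                       class_text_embs: torch.Tensor) -> torch.Tensor:
    """Predict a class index for each image embedding.

    class_text_embs: [num_classes, dim], the encoded prompts such as
    'a photo of a {label}' (Korean labels would be the machine-translated
    versions of the English benchmark labels).
    """
    image_embs = F.normalize(image_embs, dim=-1)
    class_text_embs = F.normalize(class_text_embs, dim=-1)
    # Cosine similarity to every class prompt; take the nearest one.
    return (image_embs @ class_text_embs.t()).argmax(dim=-1)
```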

result

image
  • zero-shot classification
    • 3.3% higher than CLIP on average
    • CLIP's Korean performance is terrible
      • makes sense: CLIP classifies an image by the nearest "a photo of { }" prompt, and Korean prompts don't work for it
      • Korean data wasn't entirely absent from CLIP's training set, just far too scarce
  • zero-shot retrieval
    • strong performance in both English and Korean
image
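The retrieval numbers are the usual Recall@K: for each query, does its paired item land in the top-K results ranked by cosine similarity? A minimal sketch, assuming query and gallery embeddings are paired by index (my own illustration):

```python
import torch
import torch.nn.functional as F

def recall_at_k(query_embs: torch.Tensor, gallery_embs: torch.Tensor,
                k: int = 5) -> float:
    """Fraction of queries whose paired gallery item (same index)
    appears in the top-k by cosine similarity."""
    q = F.normalize(query_embs, dim=-1)
    g = F.normalize(gallery_embs, dim=-1)
    sims = q @ g.t()                               # [N, N]
    topk = sims.topk(k, dim=-1).indices            # [N, k]
    targets = torch.arange(q.size(0)).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()
```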

findings

  • color ์™œ๊ณก ๋“ฑ strong augmentation์ด classification ๋“ฑ์€ ๋” ์„ฑ๋Šฅ์„ ๋†’์ด์ง€๋งŒ ๋” ๋†’์€ ์ฐจ์›์˜ ๋ฌธ์ œ์ธ retrieval์€ ๋” ๋ชปํ•˜๋”๋ผ
image
  • ๋‘ ์–ธ์–ด๊ฐ„ ์ง์ ‘ contrastive loss๋ฅผ ๋„ฃ์ง€ ์•Š์•˜๋Š”๋ฐ๋„ image๋ฅผ ๊ฐ™์ด ๋ณด๊ณ  ์žˆ์–ด์„œ ๊ทธ๋Ÿฐ์ง€ ๊ณต๊ฐ„์ด ๋งž์ถฐ์ง€๋”๋ผ
image
  • diffusion์„ ๋ถ™์—ฌ์„œ ํ•ด๋ดค๋Š”๋ฐ ํ™•์‹คํžˆ ํ•œ๊ตญ์–ด๊ฐ€ ๋‹ค๋ฅธ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์ด๋”๋ผ?
    • ์ด๊ฑด ์œ„์˜ ๊ฒฐ๊ณผ๋ž‘ ์ข€ ๋‹ค๋ฅธ๊ฑฐ ์•„๋‹Œ๊ฐ€ ใ…‹ใ…‹ similiarity๊ฐ€ 1.0์€ ์•„๋‹ˆ๋‹ˆ๊นŒ ๊ทธ๋Ÿฐ๊ฑด๊ฐ€
image