image

paper

TL;DR

  • I read this because... : multilingual CLIP
  • task : multimodal alignment
  • problem : I want to train a multilingual CLIP, but models trained on machine-translated text don’t capture the culture or vocabulary of the target country
  • idea : collect the data directly and train on it
  • input/output : image + text / similarity score (for CLIP)
  • architecture : image encoder(ViT-B/32) and text encoder(transformer)
  • objective : MSE (for MAE) and InfoNCE (for CLIP)
  • baseline : CLIP, UNITER, Visual N-Gram, ImageBERT
  • data : Korean {image-text} pairs collected from the web + publicly available English {image-text} pairs
  • evaluation : image classification / retrieval
  • result : higher performance than CLIP in English
  • contribution : a Korean CLIP, plus several training-related findings. The results section also covers diffusion. Only two authors, yet many analyses.
  • etc. :
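
The InfoNCE objective listed above can be sketched in plain Python as a symmetric loss over a batch of matched image-text pairs. This is a minimal illustration, not the paper's code: the function names are made up for this note, and the temperature of 0.07 is CLIP's commonly cited starting value, assumed here rather than taken from the paper.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def clip_infonce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE; pair i (image i, text i) is the positive for row i.

    temperature=0.07 is an assumed value, not one reported in the notes.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and text j, divided by temperature
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    # image -> text: each image should pick its own caption
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # text -> image: the same loss on the transposed similarity matrix
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)
```

The loss is close to zero when every image is most similar to its own caption and grows when captions are shuffled within the batch.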

Details

motivation

  • I want to create a multilingual CLIP
  • The common approach is to just machine-translate the text, which doesn’t capture the vocabulary or culture of the country.
  • Train an English-Korean bilingual model
    • Propose a dataset
    • Propose a training scheme
      • Pre-train with MAE in stage 1
      • Use the multi-crop technique
  • Some findings
    • Even without direct bilingual supervision, the embedding spaces of the two languages still became aligned through the images.
    • The strong augmentation used in SimCLR turned out to get in the way.

training scheme

  • Two differences from CLIP
  • Pre-train the vision encoder with MAE first, instead of training it directly with the contrastive loss
image
  • Using multi-crop augmentation
    • standard resolution 224 x 224 / low resolution 96 x 96
  • Ablation of the two changes above
image
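
The MAE stage and the two crop resolutions above can be sketched as follows. This is a rough illustration only: the 0.75 mask ratio is the usual MAE default and is assumed here (the notes do not give it), the patch size of 32 comes from ViT-B/32, and all function names are hypothetical.

```python
import random

def patches_per_image(resolution, patch_size=32):
    """Number of non-overlapping patches for a square image."""
    side = resolution // patch_size
    return side * side

def random_mask(num_patches, mask_ratio, rng):
    """Sorted indices of the patches hidden from the encoder."""
    num_masked = int(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), num_masked))

def masked_mse(target_patches, predicted_patches, masked_idx):
    """Mean squared error computed only over the masked patches."""
    total, count = 0.0, 0
    for i in masked_idx:
        for t, p in zip(target_patches[i], predicted_patches[i]):
            total += (t - p) ** 2
            count += 1
    return total / count

standard_patches = patches_per_image(224)  # standard-resolution crop
low_res_patches = patches_per_image(96)    # low-resolution crop
rng = random.Random(0)
masked = random_mask(standard_patches, 0.75, rng)  # assumed mask ratio
```

With patch size 32, a 224 x 224 crop gives 7 x 7 patches and a 96 x 96 crop gives 3 x 3, which is why the low-resolution crops are much cheaper to encode.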

dataset

  • English {image-text} pairs
    • CUB200
    • WIT 37.4M (108 languages)
    • YFCC15M (the subset CLIP filtered from YFCC100M)
    • CC3M
    • CC12M
    • LAION400M
    • Another 70M created from the Common Crawl web dump, following the way LAION was built
  • Korean {image-text} pairs : 708M scale
    • The paper just says they were crawled.
    • Includes 50M celebrity faces and names
    • Includes Korean Wikipedia
    • Much larger than LAION400M or CLIP’s WIT 400M
  • The dataset seems to be ≥ 1B pairs in total.
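
A quick sanity check of the "≥ 1B" guess, using only the nominal sizes listed above. This is a back-of-the-envelope sketch: the English figures are read off the dataset names and the notes, WIT's 37.4M covers 108 languages so it overcounts English, and CUB200 is left out because the notes give no size for it. The real usable counts are likely lower.

```python
# Nominal sizes in millions of image-text pairs (upper bounds, see caveats above).
english_nominal_m = {
    "WIT": 37.4,            # all 108 languages, so an overcount for English
    "YFCC15M": 15,
    "CC3M": 3,
    "CC12M": 12,
    "LAION400M": 400,
    "CC web dump (new)": 70,
}
korean_m = 708

english_total_m = sum(english_nominal_m.values())
total_m = korean_m + english_total_m
exceeds_one_billion = total_m >= 1000
```

Even Korean plus LAION400M alone already passes 1B, so the guess holds without relying on the smaller sets.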

training detail

  • implementation : plain PyTorch, without any other framework

  • text encoder

    • GPT-2 style transformer(?) / 63M / 12 layers / hidden dim 512 / 8 heads
      • gpt-2 style transformer

        Layer normalization (Ba et al., 2016) was moved to the input of each sub-block, similar to a pre-activation residual network (He et al., 2016) and an additional layer normalization was added after the final self-attention block. A modified initialization which accounts for the accumulation on the residual path with model depth is used. We scale the weights of residual layers at initialization by a factor of 1/ √N where N is the number of residual layers.

  • tokenizer : BPE, 98K vocab size, trained on 2M English / 1.5M Korean

  • (↔ CLIP is 49K)

  • visual encoder

    • ViT-B/32
    • 256 x 256 ?
  • Other hparams

image
  • training
    • half precision
    • 80 A100s : 16 hours for MAE pre-training / 362 hours for multimodal training (about 15 days?)
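
Two small calculations from the list above, sketched in Python: the 1/√N residual scaling from the GPT-2 quote, and the compute implied by the training times. The assumption that each transformer layer contributes two residual layers (attention and MLP) is mine and is not confirmed by the notes; the function names are hypothetical.

```python
import math

def residual_init_scale(num_layers, sublayers_per_layer=2):
    """1/sqrt(N), where N is the number of residual layers.

    sublayers_per_layer=2 (attention + MLP) is an assumption.
    """
    n_residual = num_layers * sublayers_per_layer
    return 1.0 / math.sqrt(n_residual)

def gpu_hours(num_gpus, wall_clock_hours):
    """Total accelerator hours for a run."""
    return num_gpus * wall_clock_hours

scale = residual_init_scale(12)        # 12-layer text encoder from the notes
mae_cost = gpu_hours(80, 16)           # MAE pre-training
multimodal_cost = gpu_hours(80, 362)   # multimodal training
multimodal_days = 362 / 24             # wall-clock days for multimodal training
```

362 hours is just over 15 days of wall-clock time, which matches the estimate in the notes, and the two stages together come to 30,240 A100-hours.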

Benchmark Dataset

  • zs-classification
    • For Korean, the benchmarks’ English labels were translated into Korean
    • ImageNet / CIFAR10 / CIFAR100 / CLEVR Counts / Describable Textures Dataset / EuroSAT / FER2013 / Food101 / GTSRB / MNIST / RESISC45 / StanfordCars
    • (in-house data) WebKorean
      • 36,826 images ↔ 428 Korean labels
  • zs-retrieval
    • Flickr30k / MSCOCO (English) / MSCOCO (Korean)
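
Zero-shot classification on these benchmarks generally works by turning each label into a text prompt, embedding the prompts, and picking the prompt closest to the image embedding. A minimal sketch with toy vectors follows; the embeddings are made up and stand in for the encoder outputs, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def build_prompts(labels, template="a photo of {}"):
    """Fill each class label into the prompt template."""
    return [template.format(label) for label in labels]

def zero_shot_classify(image_emb, prompt_embs):
    """Index of the prompt whose embedding is most similar to the image."""
    sims = [cosine(image_emb, p) for p in prompt_embs]
    return max(range(len(sims)), key=sims.__getitem__)

labels = ["dog", "cat"]
prompts = build_prompts(labels)
# Toy embeddings standing in for text/image encoder outputs (made up).
prompt_embs = [[1.0, 0.0], [0.0, 1.0]]
image_emb = [0.9, 0.1]
predicted = labels[zero_shot_classify(image_emb, prompt_embs)]
```

For the Korean evaluation the same procedure would run on the translated labels, with the template passed in as a parameter.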

result

image
  • zero-shot classification
  • 3.3% higher performance than CLIP on average
  • On Korean, CLIP’s performance collapses
  • clip is the closest thing to a photo of { }, so we’ll categorize it as a photo of { }, and then use
  • It’s not that I didn’t have any Korean data, but it was too small, so I used the
  • zero-shot retrieval
  • Performance is good in both English and Korean
image
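
Retrieval on Flickr30k and MSCOCO is commonly scored with Recall@K; the notes do not name the metric, so treat that as an assumption. A minimal sketch, where row i of the similarity matrix is query i and the correct item is taken to sit at index i (real benchmarks have several captions per image, which this toy version ignores):

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth item (same index) ranks in the top k."""
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)

# Toy query-by-item similarity matrix (made-up numbers).
sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.1, 0.7],
    [0.3, 0.4, 0.5],
]
r1 = recall_at_k(sim, 1)
r3 = recall_at_k(sim, 3)
```

In the toy matrix, the second query ranks its own item last, so it only counts as a hit once K covers every item.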

findings

  • Strong augmentations, such as color distortion, help classification but not retrieval, which is a higher-level task.
image
  • Even though no contrastive loss was applied directly between the two languages, they ended up aligned in the embedding space because both were trained against the same images.
image
  • They also tried it with diffusion, and the results for Korean are clearly different?
  • This is a little different from the above result, isn’t it?
image