[120] Large-scale Bilingual Language-Image Contrastive Learning

TL;DR

I read this because.. : multilingual CLIP
task : multimodal alignment
problem : we want to train a multilingual CLIP, but models built on machine-translated text don't capture the culture/vocabulary of each country
idea : collect native data and train on it
input/output : image + text / similarity score (for CLIP)
architecture : image encoder(ViT-B/32) and text encoder(transformer)
objective : MSE(for MAE) and infoNCE(for CLIP)
baseline : CLIP, UNITER, Visual N-Gram, ImageBERT
data : collect korean {image-text} pair from web + collect available english {image-text pair}
evaluation : image classification / retrieval
result : higher performance than CLIP in English
contribution : Korean CLIP, plus some training-related findings. The results section has diffusion as well .. only two authors, but a lot of analysis
etc. :
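
The InfoNCE objective listed above can be sketched as the symmetric CLIP-style contrastive loss. This is a minimal plain-Python illustration, not the paper's code (the notes say the real implementation is PyTorch); the function names and the temperature of 0.07 are assumptions made for the example.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_rows(logits, targets):
    """Mean cross-entropy over rows; targets[i] is the correct column of row i."""
    total = 0.0
    for row, t in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[t]
    return total / len(logits)

def clip_info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: the i-th image and the i-th text form the positive pair.

    temperature=0.07 is an assumed value for illustration only.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # Cosine similarity matrix (images x texts), scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # texts x images
    targets = list(range(len(imgs)))
    loss_i2t = cross_entropy_rows(logits, targets)
    loss_t2i = cross_entropy_rows(logits_t, targets)
    return 0.5 * (loss_i2t + loss_t2i)

# Toy batch: matched pairs give a much lower loss than mismatched pairs.
aligned = clip_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In a bilingual setup, each batch would simply mix English and Korean captions on the text side; no extra English-Korean term is needed for this loss.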

Goal: build a multilingual CLIP
- The common approach is to just machine-translate the text, which doesn't capture the vocabulary or culture of each country.
Train an English-Korean bilingual model
- Propose a dataset
- Propose a training scheme
  - Stage 1: pre-train with MAE
  - Use the multi-crop technique
Some findings
- Even without direct bilingual supervision, the two languages were still aligned in the embedding space through the images.
- The strong augmentation used in SimCLR turned out to get in the way.

Using multi-crop augmentation
- standard resolution 224 x 224 / low resolution 96 x 96
There is an ablation over these two resolutions.
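
The two-resolution multi-crop setup above can be sketched as a crop plan: a few large crops resized to 224 x 224 and a few small crops resized to 96 x 96. Only the two output resolutions come from the notes; the number of crops, the area ranges, and the square-crop simplification are assumptions for illustration.

```python
import random

def sample_crop_box(img_w, img_h, scale_range, rng):
    """Return (left, top, width, height) of a random square crop.

    The crop covers a fraction of the image area drawn from scale_range
    and is clamped so that it always fits inside the image.
    """
    area_frac = rng.uniform(*scale_range)
    side = int(round((area_frac * img_w * img_h) ** 0.5))
    side = max(1, min(side, img_w, img_h))
    left = rng.randint(0, img_w - side)
    top = rng.randint(0, img_h - side)
    return left, top, side, side

def multi_crop_plan(img_w, img_h, n_global=1, n_local=2, seed=0):
    """List the crops to take and the resolution each is resized to.

    n_global, n_local and both scale ranges are hypothetical defaults.
    """
    rng = random.Random(seed)
    plan = []
    for _ in range(n_global):
        box = sample_crop_box(img_w, img_h, (0.4, 1.0), rng)
        plan.append({"box": box, "out_size": 224})  # standard resolution
    for _ in range(n_local):
        box = sample_crop_box(img_w, img_h, (0.05, 0.4), rng)
        plan.append({"box": box, "out_size": 96})   # low resolution
    return plan

plan = multi_crop_plan(640, 480)
```

Each entry would then be cropped and resized by whatever image library is in use; the low-resolution crops are cheap, which is the usual motivation for multi-crop.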

implementation : just pytorch without using anything else
text encoder
- GPT-2 style transformer(?) / 63M / 12 layers / hid dim 512 / 8 heads
  - gpt-2 style transformer
    Layer normalization (Ba et al., 2016) was moved to the input of each sub-block, similar to a pre-activation residual network (He et al., 2016) and an additional layer normalization was added after the final self-attention block. A modified initialization which accounts for the accumulation on the residual path with model depth is used. We scale the weights of residual layers at initialization by a factor of 1/ √N where N is the number of residual layers.
tokenizer : BPE, 98K vocab size, trained on 2M English / 1.5M Korean
(↔ CLIP uses 49K)
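
The two GPT-2 details quoted above (layer norm at the input of each sub-block, and residual weights scaled by 1/√N at initialization) can be sketched in a few lines. This is plain Python for illustration only; the base std of 0.02 and counting two residual layers per block (attention + MLP) are assumptions, not values taken from the paper.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_ln_residual(x, sublayer):
    """Pre-LN sub-block: normalize the input first, then add the result back to x."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def residual_init_std(base_std, n_residual_layers):
    """Scale the init std of residual-path weights by 1/sqrt(N)."""
    return base_std / math.sqrt(n_residual_layers)

n_blocks = 12                     # 12 layers, from the notes
n_residual_layers = 2 * n_blocks  # assumption: attention + MLP per block
std = residual_init_std(0.02, n_residual_layers)  # 0.02 is an assumed base std

# With a sub-layer that outputs zeros, a pre-LN block is the identity.
out = pre_ln_residual([1.0, 2.0, 3.0], lambda h: [0.0] * len(h))
```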
visual encoder
- ViT-B/32
- 256 x 256 ?
Other hparams

zs-classification
Used each benchmark's English labels translated into Korean
- ImageNet / CIFAR10 / CIFAR100 / CLEVR Counts / Describable Textures Dataset / EuroSAT / FER2013 / Food101 / GTSRB / MNIST / RESISC45 / StanfordCars
- (in-house data) WebKorean
  - 36,826 images ↔ 428 Korean labels
zs-retrieval
- Flickr30k / MSCOCO(english) / MSCOCO(korean)

zero-shot classification
3.3% higher performance than CLIP on average
In Korean, CLIP's performance falls apart
clip is the closest thing to a photo of { }, so we’ll categorize it as a photo of { }, and then use
It’s not that I didn’t have any Korean data, but it was too small, so I used the
zero-shot retrieval
Performance is good in both English and Korean
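
Retrieval quality is typically reported as recall@K; a minimal sketch of that metric follows. It assumes one ground-truth candidate per query (query i matches candidate i), which is a simplification: Flickr30k and MSCOCO have several captions per image.

```python
def recall_at_k(sim, k):
    """Fraction of queries whose ground-truth match is among the top-k candidates.

    sim[i][j] is the similarity between query i and candidate j;
    the correct candidate for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        hits += i in ranked[:k]
    return hits / len(sim)

# Toy text-to-image similarity matrix (3 queries x 3 candidates).
sim = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
```

The same function covers image-to-text retrieval by passing the transposed matrix.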

Strong augmentations, such as color distortion, perform better for classification, but not for retrieval, which is a higher-level problem.

Even though no contrastive loss was added directly between the two languages, the English and Korean text embeddings ended up aligned in the same space because both were trained against the same images.