image

paper

TL;DR

  • I read this because.. : aka CLOSE. I went through the ICCV explorer PPT to find the Kakao paper, but this one is very similar to CapDec, so I read it wondering what the difference is.
  • task : zero-shot cross modal transfer (taking what you’ve learned in one modality and transferring it to another)
  • Problem : Text and images have different embedding spaces, even when trained contrastively! For example, on COCO captions, the similarity of a positive {image, text} pair is 0.26, while the similarity between unrelated captions is 0.35.
  • IDEA : Let’s add gaussian noise to the text embedding space!
  • input/output : (train) text -> text (infer) image, text -> text
  • architecture : CLIP ViT-L/14 + T5 base
  • objective : cross entropy loss
  • baseline : ESPER, CLIP Cls, TAP-C (zero-shot multimodal transfer models)
  • data : COCO Captioning, SNLI (->SNLI-VE), VQA (->VQA-E), Visual News, synthetic captions with GPT-J RNG, GPT-J unigram, CURIE
  • evaluation : For each benchmark, create an
  • result : SOTA among existing multimodal models trained with text only.
  • contribution : A simple idea that makes something work.
  • etc. : In conclusion, it’s very similar to CapDec lol. CapDec is in their related work, so I guess they added some analysis to differentiate. Is learning this way more scalable for VLMs, or is this just the same as LLaVA’s approach?
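The core noise idea from the TL;DR can be sketched in a few lines: train text→text with a noised CLIP text embedding, then swap in the CLIP image embedding at inference. A minimal numpy sketch, not the authors' code; the dimension and noise scale are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 768  # illustrative CLIP embedding size

def normalize(v):
    return v / np.linalg.norm(v)

def train_input(text_emb, noise_scale=0.1):
    """Training: perturb the CLIP *text* embedding with Gaussian noise
    so the decoder can't latch onto text-only directions."""
    noisy = text_emb + noise_scale * rng.standard_normal(text_emb.shape)
    return normalize(noisy)

def infer_input(image_emb):
    """Inference: swap in the CLIP *image* embedding unchanged."""
    return normalize(image_emb)

text_emb = normalize(rng.standard_normal(DIM))
x = train_input(text_emb)
print(x.shape)  # (768,)
```

The point is that the decoder only ever sees a fuzzy ball around the text embedding, so the image embedding (which lies nearby but not exactly on it) still works at test time.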

Details

  • pipeline image

Text embeddings also come from CLIP; auxiliary text like the context in VQA and the premise in SNLI uses T5 embeddings. The paper is a little vague on how the CLIP vector is fed to T5, but it seems that if the CLIP embedding is 2048-dim and T5 takes 512-dim inputs, the 2048-dim embedding is cut into four 512-dim vectors.
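If that reading is right, the conversion is just a reshape. A sketch; the 2048 → 4×512 split is this note's guess about the paper, not something it confirms:

```python
import numpy as np

clip_emb = np.arange(2048, dtype=np.float32)  # one flat CLIP embedding (toy values)
t5_tokens = clip_emb.reshape(4, 512)          # four 512-dim "token" vectors for T5
print(t5_tokens.shape)  # (4, 512)
```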

image

Freeze CLIP’s image/text encoder and only finetune T5

  • modality adaptor image

In conclusion, they add Gaussian noise scaled by a training hyperparameter w.
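One way to see what w controls: the larger w is, the further the noised vector drifts from the original text embedding. A quick numpy probe (w values are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
e = rng.standard_normal(768)
e /= np.linalg.norm(e)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sims = {}
for w in [0.01, 0.1, 0.3]:
    noisy = e + w * rng.standard_normal(768)  # noise scaled by w
    sims[w] = cos(e, noisy)
    print(w, round(sims[w], 3))
```

Small w keeps the decoder's input nearly on the text embedding; large w blurs it enough to cover the image embedding too, which is the knob being tuned.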

image
  • sensitivity image

Performance was insensitive to adding some noise to the text vector; shifting it slightly toward the image mean improved performance on VE, while shifting it in the opposite direction (-mean) hurt performance.
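The mean-shift probe is easy to restate: estimate the modality gap as the difference of the image and text embedding means, then move each text vector along (or against) it. A toy sketch with random stand-ins for CLIP embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 768
img = rng.standard_normal((n, d)) + 0.5  # toy "image" cluster
txt = rng.standard_normal((n, d)) - 0.5  # toy "text" cluster, offset = modality gap

gap = img.mean(axis=0) - txt.mean(axis=0)  # direction from text toward image

e_txt = txt[0]
shift_toward_image = e_txt + gap  # the shift that helped on VE
shift_away = e_txt - gap          # the shift that hurt
```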

  • learned adapter analysis image

Training shows that plain zero-mean Gaussian noise is not the best, i.e. better adapters exist. But those adapters can't be learned text-only, so they can't go into the main model. "linear" trains a linear map, and "cov." adds structured noise using a covariance estimated from paired text and image embeddings.
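My reading of the "cov." variant: estimate a covariance from the per-pair image−text differences and sample structured noise from it instead of isotropic Gaussian. A toy numpy sketch (not the paper's code; note it needs paired images, which is exactly why it can't stay text-only):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 32  # small d so the covariance estimate stays cheap

txt = rng.standard_normal((n, d))
img = txt + 0.2 * rng.standard_normal((n, d)) + 0.3  # paired, correlated embeddings

diff = img - txt                       # per-pair modality gap
cov = np.cov(diff, rowvar=False)      # structured noise covariance (d x d)
eps = rng.multivariate_normal(diff.mean(axis=0), cov, size=1)[0]

noisy_txt = txt[0] + eps  # structured noise instead of isotropic Gaussian
print(noisy_txt.shape)  # (32,)
```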

  • training data with language model image

The model can also be trained by using GPT-J etc. to generate captions from words that appear frequently in COCO.
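A toy version of that unigram trick: sample words by their COCO frequency and hand them to a language model as a prompt. A pure-Python sketch; the word list and prompt template are made up, and the actual GPT-J generation call is omitted:

```python
import random

random.seed(0)

# hypothetical COCO-style unigram frequencies
unigrams = {"man": 50, "dog": 30, "beach": 20, "riding": 15, "skateboard": 10}
words, weights = zip(*unigrams.items())

def make_prompt(k=3):
    """Sample k words proportionally to frequency and build a caption prompt."""
    picked = random.choices(words, weights=weights, k=k)
    return "Write a caption using: " + ", ".join(picked)

prompt = make_prompt()
print(prompt)
# The prompt would then be fed to GPT-J to generate a synthetic training caption.
```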

  • Can also do style captions image