image

paper

TL;DR

  • I read this because.. : aka CLOSE. I went through the ICCV explorer PPT to find the Kakao paper, but this one is very similar to CapDec, so I read it wondering what the difference is.
  • task : zero-shot cross modal transfer (taking what you’ve learned in one modality and transferring it to another)
  • Problem : Text and images have different embedding spaces, even when trained contrastively! For example, on COCO captions, the similarity of a positive {image, text} pair is 0.26, while the similarity between unrelated captions is 0.35.
  • IDEA : Let’s add gaussian noise to the text embedding space!
  • input/output : (train) text -> text (infer) image, text -> text
  • architecture : CLIP ViT-L/14 + T5 base
  • objective : cross entropy loss
  • baseline : ESPER, CLIP Cls, TAP-C (zero-shot multimodal transfer models)
  • data : COCO Captioning, SNLI (->SNLI-VE), VQA (->VQA-E), Visual News, synthetic captions with GPT-J RNG, GPT-J unigram, CURIE
  • evaluation : For each benchmark, create an
  • result : SOTA among existing multimodal models trained with text only.
  • contribution : A simple idea that makes something work.
  • etc. : In conclusion, it’s very similar to CapDec lol. CapDec is in their related work, so I guess they added some analysis to differentiate. Is learning this way more scalable for VLMs, or is this just the same as LLaVA’s approach?
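The core noise idea from the TL;DR can be sketched in a few lines: train text→text with a noised CLIP text embedding, then swap in the CLIP image embedding at inference. A minimal numpy sketch, not the authors' code; the dimension and noise scale are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 768  # illustrative CLIP embedding size

def normalize(v):
    return v / np.linalg.norm(v)

def train_input(text_emb, noise_scale=0.1):
    """Training: perturb the CLIP *text* embedding with Gaussian noise
    so the decoder can't latch onto text-only directions."""
    noisy = text_emb + noise_scale * rng.standard_normal(text_emb.shape)
    return normalize(noisy)

def infer_input(image_emb):
    """Inference: swap in the CLIP *image* embedding unchanged."""
    return normalize(image_emb)

text_emb = normalize(rng.standard_normal(DIM))
x = train_input(text_emb)
print(x.shape)  # (768,)
```

The point is that the decoder only ever sees a fuzzy ball around the text embedding, so the image embedding (which lies nearby but not exactly on it) still works at test time.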

Details

  • pipeline image

Text embeddings also come from CLIP; auxiliary text like the context in VQA and the premise in SNLI uses T5 embeddings. The paper is a little vague on how the CLIP vector is fed to T5, but it seems that if the CLIP embedding is 2048-dim and T5 takes 512-dim inputs, the 2048-dim embedding is cut into four 512-dim vectors.
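If that reading is right, the conversion is just a reshape. A sketch; the 2048 → 4×512 split is this note's guess about the paper, not something it confirms:

```python
import numpy as np

clip_emb = np.arange(2048, dtype=np.float32)  # one flat CLIP embedding (toy values)
t5_tokens = clip_emb.reshape(4, 512)          # four 512-dim "token" vectors for T5
print(t5_tokens.shape)  # (4, 512)
```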

image

Freeze CLIP’s image/text encoder and only finetune T5

  • modality adaptor image

In conclusion, they add Gaussian noise scaled by a training hyperparameter w.
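One way to see what w controls: the larger w is, the further the noised vector drifts from the original text embedding. A quick numpy probe (w values are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
e = rng.standard_normal(768)
e /= np.linalg.norm(e)

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sims = {}
for w in [0.01, 0.1, 0.3]:
    noisy = e + w * rng.standard_normal(768)  # noise scaled by w
    sims[w] = cos(e, noisy)
    print(w, round(sims[w], 3))
```

Small w keeps the decoder's input nearly on the text embedding; large w blurs it enough to cover the image embedding too, which is the knob being tuned.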

image
  • sensitivity image

Performance was insensitive to adding some noise to the text vector; shifting it slightly toward the image mean improved performance on VE, while shifting it in the opposite direction (-mean) hurt performance.
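The mean-shift probe is easy to restate: estimate the modality gap as the difference of the image and text embedding means, then move each text vector along (or against) it. A toy sketch with random stand-ins for CLIP embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 768
img = rng.standard_normal((n, d)) + 0.5  # toy "image" cluster
txt = rng.standard_normal((n, d)) - 0.5  # toy "text" cluster, offset = modality gap

gap = img.mean(axis=0) - txt.mean(axis=0)  # direction from text toward image

e_txt = txt[0]
shift_toward_image = e_txt + gap  # the shift that helped on VE
shift_away = e_txt - gap          # the shift that hurt
```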

  • learned adapter analysis image

Training shows that plain zero-mean Gaussian noise is not the best, i.e. better adapters exist. But those adapters can't be learned text-only, so they can't go into the main model. "linear" trains a linear map, and "cov." adds structured noise using a covariance estimated from paired text and image embeddings.
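My reading of the "cov." variant: estimate a covariance from the per-pair image−text differences and sample structured noise from it instead of isotropic Gaussian. A toy numpy sketch (not the paper's code; note it needs paired images, which is exactly why it can't stay text-only):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 32  # small d so the covariance estimate stays cheap

txt = rng.standard_normal((n, d))
img = txt + 0.2 * rng.standard_normal((n, d)) + 0.3  # paired, correlated embeddings

diff = img - txt                       # per-pair modality gap
cov = np.cov(diff, rowvar=False)      # structured noise covariance (d x d)
eps = rng.multivariate_normal(diff.mean(axis=0), cov, size=1)[0]

noisy_txt = txt[0] + eps  # structured noise instead of isotropic Gaussian
print(noisy_txt.shape)  # (32,)
```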

  • training data with language model image

The model can also be trained by using GPT-J etc. to generate captions from words that appear frequently in COCO.
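A toy version of that unigram trick: sample words by their COCO frequency and hand them to a language model as a prompt. A pure-Python sketch; the word list and prompt template are made up, and the actual GPT-J generation call is omitted:

```python
import random

random.seed(0)

# hypothetical COCO-style unigram frequencies
unigrams = {"man": 50, "dog": 30, "beach": 20, "riding": 15, "skateboard": 10}
words, weights = zip(*unigrams.items())

def make_prompt(k=3):
    """Sample k words proportionally to frequency and build a caption prompt."""
    picked = random.choices(words, weights=weights, k=k)
    return "Write a caption using: " + ", ".join(picked)

prompt = make_prompt()
print(prompt)
# The prompt would then be fed to GPT-J to generate a synthetic training caption.
```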

  • Can also do style captions image