
paper, code

TL;DR

  • I read this because : while reading about fine-grained CLIP, I saw this paper cited and decided to read it.
  • task : vision-and-language pre-training -> VQAv2, NLVR2, retrieval
  • problem : Extracting visual features through the image -> CNN -> region pipeline takes too long.
  • idea : Patchify the image as in ViT, apply a linear projection, and feed the patches directly into the multi-modal Transformer.
  • input/output : {image, text} -> matching score, masked token prediction
  • architecture : Transformer encoder
  • objective : image-text matching, masked language modeling
  • baseline : Pixel-BERT, ViLBERT, OSCAR, VisualBERT
  • data : MSCOCO, VG, GCC, SBU
  • evaluation : standard metrics for each downstream task
  • result : ~10x faster runtime with similar or better performance
  • contribution : Minimized modality-specific design; no CNN backbone or region features are needed.
  • etc. :
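The core idea (replacing the CNN/region pipeline with a ViT-style patch projection) can be sketched as follows. This is an illustrative NumPy sketch, not the paper's code; `patch_size`, `dim`, and the projection weights are assumed values.

```python
import numpy as np

# Illustrative sketch of ViT-style patch embedding (not the paper's code).
# patch_size, dim, and the projection weights are assumed values.
rng = np.random.default_rng(0)
patch_size, dim = 32, 768
W_proj = rng.standard_normal((3 * patch_size * patch_size, dim)) * 0.02

def patch_embed(image):
    """image: (C, H, W) -> (N, dim), N = (H/p) * (W/p) patch embeddings."""
    C, H, W = image.shape
    p = patch_size
    patches = (image.reshape(C, H // p, p, W // p, p)
                    .transpose(1, 3, 0, 2, 4)   # (H/p, W/p, C, p, p)
                    .reshape(-1, C * p * p))    # one flattened vector per patch
    return patches @ W_proj                     # single linear projection

x = patch_embed(rng.standard_normal((3, 224, 224)))
print(x.shape)  # (49, 768): ready to be concatenated with word embeddings
```

The whole visual embedder is one matrix multiply, which is where the runtime savings over a CNN + region-proposal pipeline come from.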

Details

Motivation

(figures omitted)

Word Patch Alignment

(figure omitted)

What’s going on here? This is almost the same as the earlier UNITER: UNiversal Image-TExt Representation Learning, except that patches are used instead of regions.

Direct word-region supervision is not given; instead, an algorithm called [Optimal Transport](https://en.wikipedia.org/wiki/Transportation_theory_(mathematics)) is used to find the transport plan that minimizes the cost of transporting mass between the image (patch) embeddings and the word embeddings, and the resulting minimum cost is added as a loss to obtain better alignment.

$$\mathcal{L}_{\mathrm{WPA}} = \min_{T \in \Pi(\mathbf{a}, \mathbf{b})} \sum_{i=1}^{T} \sum_{j=1}^{K} T_{ij} \cdot c(w_i, v_j)$$
  • $c$ : cost (distance) function; cosine distance is used.
  • $T\in \mathbb{R}^{T\times K}$ : transport plan, learned to optimize the alignment between $w$ and $v$.
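Given these definitions, the WPA cost for a fixed transport plan is just $\sum_{ij} T_{ij}\, c(w_i, v_j)$. A minimal NumPy sketch (the helper names `cosine_cost` and `transport_cost` are mine, not from the paper):

```python
import numpy as np

# Illustrative sketch (helper names are mine, not ViLT's code): the WPA loss
# is the total transport cost sum_ij T_ij * c(w_i, v_j), where c is the
# cosine distance between a word embedding w_i and a patch embedding v_j.

def cosine_cost(w, v):
    """w: (T, d) word embeddings, v: (K, d) patch embeddings -> (T, K) cost."""
    w = w / np.linalg.norm(w, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    return 1.0 - w @ v.T  # cosine distance, in [0, 2]

def transport_cost(T_plan, C):
    """Total cost of moving mass according to plan T_plan under cost matrix C."""
    return float((T_plan * C).sum())

rng = np.random.default_rng(0)
w, v = rng.standard_normal((5, 8)), rng.standard_normal((7, 8))
C = cosine_cost(w, v)
# A trivial feasible plan (uniform independent coupling); the OT solver's job
# is to find, within the feasible set, the plan that makes this cost minimal.
T_plan = np.full((5, 7), 1.0 / (5 * 7))
print(transport_cost(T_plan, C))
```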

The difficulty is computing this minimum exactly, so it is approximated with a method called IPOT (Inexact Proximal point method for Optimal Transport), which approximates the Wasserstein distance. A partial implementation can be found here
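As a rough sketch of the IPOT iteration (following Xie et al.; uniform marginals are assumed, and `beta` and `n_iter` are assumed hyperparameters, not values from the paper):

```python
import numpy as np

# Rough sketch of IPOT (Xie et al.), assuming uniform marginals; beta and
# n_iter are assumed hyperparameters, not taken from the paper.
def ipot(C, beta=0.5, n_iter=50):
    """C: (n, m) cost matrix -> approximate optimal transport plan T."""
    n, m = C.shape
    sigma = np.ones(m) / m
    T = np.ones((n, m))
    A = np.exp(-C / beta)  # kernel from the entropic proximal term
    for _ in range(n_iter):
        Q = A * T  # elementwise product
        # one Sinkhorn-style scaling step toward the uniform marginals
        delta = 1.0 / (n * (Q @ sigma))
        sigma = 1.0 / (m * (Q.T @ delta))
        T = delta[:, None] * Q * sigma[None, :]
    return T

# Toy check: with a cost that favors the diagonal, mass concentrates there.
C = 1.0 - np.eye(4)
T = ipot(C)
print(np.round(T, 3))  # close to 0.25 * identity
```

Unlike plain Sinkhorn, the proximal-point scheme lets IPOT approach the unregularized optimal plan without shrinking the entropic weight to zero.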

(figure: results table, omitted)

Looking at this result: is the improvement really due to WPA?