TL;DR
- I read this because.. : it was cited in a fine-grained CLIP paper I was reading, so I followed the citation.
- task : VLM -> VQAv2, NLVR2, image-text retrieval
- problem : extracting and training visual features via image -> CNN -> region proposals is too slow.
- idea : drop the region pipeline; patch the image like ViT, apply a linear projection, and feed it directly into the multi-modal Transformer.
- input/output : {image, text} -> matching score, masked token prediction
- architecture : Transformer encoder
- objective : Image-Text Matching, Masked Language Modeling
- baseline : Pixel-BERT, ViLBERT, OSCAR, VisualBERT
- data : MSCOCO, VG, GCC, SBU
- evaluation : standard metrics for each downstream task
- result : ~10x faster runtime with similar or better performance
- contribution : Minimized complex design for each modality.
- etc. :
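A minimal sketch of the "patch + projection" idea from the TL;DR (all names and shapes here are illustrative, not the authors' code): instead of a CNN region detector, the image is split into fixed-size patches, each patch is flattened, and a single linear projection maps it into the same embedding space as the word tokens, ViT-style.

```python
import numpy as np

def patch_embed(image, patch_size=32, dim=768, rng=None):
    """ViT-style patch embedding: split, flatten, linearly project.

    image: (H, W, C) array; H and W must be divisible by patch_size.
    Returns (num_patches, dim) patch embeddings.
    """
    rng = rng or np.random.default_rng(0)
    H, W, C = image.shape
    ph, pw = H // patch_size, W // patch_size
    # (ph, pw, patch_size, patch_size, C) -> (ph*pw, patch_size*patch_size*C)
    patches = (image.reshape(ph, patch_size, pw, patch_size, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(ph * pw, -1))
    # one learned linear projection replaces the whole CNN/region pipeline
    # (random weights here stand in for the trained projection)
    W_proj = rng.standard_normal((patches.shape[1], dim)) * 0.02
    return patches @ W_proj
```

For a 224x224x3 image with 32x32 patches this yields 49 patch embeddings, which are concatenated with the word embeddings before entering the Transformer encoder.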
Details
Motivation
Word Patch Alignment
What’s going on here?
This is almost the same as the earlier UNITER: UNiversal Image-TExt Representation Learning, except that patches are used in place of regions.
There is no direct word-region supervision; instead, [Optimal Transport](https://en.wikipedia.org/wiki/Transportation_theory_(mathematics)) is used to find the minimal transport cost between the image patch embeddings and the word embeddings, and this cost is added as a loss to encourage better alignment.
- $c$ : cost (distance) function, computed with cosine similarity.
- $T\in \mathbb{R}^{T\times K}$ : transport plan, learned to optimize the alignment between $w$ and $v$.
The difficulty is computing this minimum-cost transport exactly, so it is approximated with IPOT (Inexact Proximal point method for Optimal Transport), which approximates the Wasserstein distance. A partial implementation can be found here
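A generic numpy sketch of the IPOT iteration (following Xie et al.; not the paper's implementation): given a cost matrix $C$ built from cosine distances between word and patch embeddings, it alternates Sinkhorn-style scaling updates around a proximal step to approximate the transport plan $T$ and the resulting Wasserstein distance $\langle T, C\rangle$.

```python
import numpy as np

def ipot(C, beta=0.5, n_iter=50, k_inner=1):
    """IPOT: approximate optimal transport under uniform marginals.

    C: (n, m) nonnegative cost matrix (e.g. 1 - cosine similarity).
    Returns the transport plan T and the approximate distance <T, C>.
    """
    n, m = C.shape
    a = np.ones(n) / n              # uniform marginal over rows (words)
    b = np.ones(m) / m              # uniform marginal over cols (patches)
    T = np.ones((n, m)) / (n * m)   # initial plan
    A = np.exp(-C / beta)           # Gibbs kernel from the cost matrix
    sigma = np.ones(m) / m
    for _ in range(n_iter):
        Q = A * T                   # proximal step: elementwise product
        for _ in range(k_inner):    # Sinkhorn-style marginal corrections
            delta = a / (Q @ sigma)
            sigma = b / (Q.T @ delta)
        T = np.diag(delta) @ Q @ np.diag(sigma)
    return T, float((T * C).sum())
```

Usage under the cosine-cost convention of the note: normalize word embeddings $w$ and patch embeddings $v$, set `C = 1 - wn @ vn.T`, and add the returned distance to the training loss as the WPA term.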
This result -> Is it because of WPA?