TL;DR
- I read this because.. : it was cited in a fine-grained CLIP paper I was reading, so I followed the citation.
- task : VLM -> VQAv2, NLVR2, image-text retrieval
- problem : extracting and training visual features via image -> CNN -> region proposals is too slow.
- idea : drop the region pipeline; patch the image like ViT, apply a linear projection, and feed it directly into the multi-modal Transformer.
- input/output : {image, text} -> matching score, masked token prediction
- architecture : Transformer encoder
- objective : Image-Text Matching, Masked Language Modeling
- baseline : Pixel-BERT, ViLBERT, OSCAR, VisualBERT
- data : MSCOCO, VG, GCC, SBU
- evaluation : standard metrics for each downstream task
- result : ~10x faster runtime with similar or better performance
- contribution : Minimized complex design for each modality.
- etc. :
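A minimal sketch of the "patch + projection" idea from the TL;DR (all names and shapes here are illustrative, not the authors' code): instead of a CNN region detector, the image is split into fixed-size patches, each patch is flattened, and a single linear projection maps it into the same embedding space as the word tokens, ViT-style.

```python
import numpy as np

def patch_embed(image, patch_size=32, dim=768, rng=None):
    """ViT-style patch embedding: split, flatten, linearly project.

    image: (H, W, C) array; H and W must be divisible by patch_size.
    Returns (num_patches, dim) patch embeddings.
    """
    rng = rng or np.random.default_rng(0)
    H, W, C = image.shape
    ph, pw = H // patch_size, W // patch_size
    # (ph, pw, patch_size, patch_size, C) -> (ph*pw, patch_size*patch_size*C)
    patches = (image.reshape(ph, patch_size, pw, patch_size, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(ph * pw, -1))
    # one learned linear projection replaces the whole CNN/region pipeline
    # (random weights here stand in for the trained projection)
    W_proj = rng.standard_normal((patches.shape[1], dim)) * 0.02
    return patches @ W_proj
```

For a 224x224x3 image with 32x32 patches this yields 49 patch embeddings, which are concatenated with the word embeddings before entering the Transformer encoder.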
Details
Motivation
Word Patch Alignment
What’s going on here?
This is almost the same as the earlier UNITER: UNiversal Image-TExt Representation Learning, except that patches are used in place of regions.
There is no direct word-region supervision; instead, [Optimal Transport](https://en.wikipedia.org/wiki/Transportation_theory_(mathematics)) is used to find the minimal transport cost between the image patch embeddings and the word embeddings, and this cost is added as a loss to encourage better alignment.
- $c$ : cost (distance) function, computed with cosine similarity.
- $T\in \mathbb{R}^{T\times K}$ : transport plan, learned to optimize the alignment between $w$ and $v$.
The difficulty is computing this minimum-cost transport exactly, so it is approximated with IPOT (Inexact Proximal point method for Optimal Transport), which approximates the Wasserstein distance. A partial implementation can be found here
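A generic numpy sketch of the IPOT iteration (following Xie et al.; not the paper's implementation): given a cost matrix $C$ built from cosine distances between word and patch embeddings, it alternates Sinkhorn-style scaling updates around a proximal step to approximate the transport plan $T$ and the resulting Wasserstein distance $\langle T, C\rangle$.

```python
import numpy as np

def ipot(C, beta=0.5, n_iter=50, k_inner=1):
    """IPOT: approximate optimal transport under uniform marginals.

    C: (n, m) nonnegative cost matrix (e.g. 1 - cosine similarity).
    Returns the transport plan T and the approximate distance <T, C>.
    """
    n, m = C.shape
    a = np.ones(n) / n              # uniform marginal over rows (words)
    b = np.ones(m) / m              # uniform marginal over cols (patches)
    T = np.ones((n, m)) / (n * m)   # initial plan
    A = np.exp(-C / beta)           # Gibbs kernel from the cost matrix
    sigma = np.ones(m) / m
    for _ in range(n_iter):
        Q = A * T                   # proximal step: elementwise product
        for _ in range(k_inner):    # Sinkhorn-style marginal corrections
            delta = a / (Q @ sigma)
            sigma = b / (Q.T @ delta)
        T = np.diag(delta) @ Q @ np.diag(sigma)
    return T, float((T * C).sum())
```

Usage under the cosine-cost convention of the note: normalize word embeddings $w$ and patch embeddings $v$, set `C = 1 - wn @ vn.T`, and add the returned distance to the training loss as the WPA term.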
This result -> Is it because of WPA?