[126] ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

paper , code

TL;DR

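I read this because.. : I was about to read fine-grained CLIP, saw this paper cited there, and read it first.
task : vision-language model (VLM) -> VQAv2, NLVR2, retrieval
problem : extracting features through image -> CNN -> region and then training on them takes too long.
idea : just split the image into patches as in ViT, linearly project them, and feed the result directly into a multi-modal Transformer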
input/output : {image, text} -> matching score, masked token prediction
architecture : Transformer encoder
objective : Image-Text Matching (with Word Patch Alignment), Masked Language Modeling
baseline : Pixel-BERT, ViLBERT, OSCAR, VisualBERT
data : MSCOCO, VG, GCC, SBU
evaluation : the standard evaluation for each task
result : runtime is cut by roughly 10x, with comparable or better performance
contribution : minimizes the complex per-modality design
etc. :
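
The "idea" line above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the authors' code: the image size, patch size, hidden size, and the names `patchify` and `embed_patches` are all hypothetical, and position embeddings, modal-type embeddings, and special tokens are left out.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C).
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be multiples of patch"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C) -> (gh * gw, patch * patch * C)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)


def embed_patches(image: np.ndarray, proj: np.ndarray, patch: int) -> np.ndarray:
    """Linearly project flattened patches; no CNN or region detector involved."""
    return patchify(image, patch) @ proj


rng = np.random.default_rng(0)
image = rng.standard_normal((64, 64, 3))      # toy image (hypothetical size)
patch, hidden = 32, 16                        # toy sizes, chosen for illustration
proj = rng.standard_normal((patch * patch * 3, hidden)) * 0.02

visual_tokens = embed_patches(image, proj, patch)    # one token per patch
text_tokens = rng.standard_normal((5, hidden))       # stand-in for word embeddings

# The multi-modal Transformer would consume the concatenated sequence.
sequence = np.concatenate([text_tokens, visual_tokens], axis=0)
print(sequence.shape)
```

The point of the sketch is that the visual side is a single reshape plus one matrix multiply, which is where the runtime saving over a CNN or region detector would come from.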

Details

Motivation

Word Patch Alignment

What happens here? It is almost the same as in the earlier UNITER: UNiversal Image-TExt Representation Learning. The difference is that it uses patches instead of regions.

There is no direct supervision on word-region pairs. Instead, an algorithm called Optimal Transport computes the minimum cost of transport between the image embeddings and the word embeddings, and this cost is added as a loss so that the alignment improves.

c: distance (cost) function. Uses cosine distance, i.e. 1 - cosine similarity.
$\mathbf{T}\in \mathbb{R}^{T\times K}$ : transport plan. Learned to optimize the alignment between $w$ and $v$. So it seems this is learned..
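
For reference, the two quantities above fit together in the standard discrete optimal-transport objective. The notation below follows UNITER as far as I recall and should be read as an assumption: the plan is written in bold as $\mathbf{T}$ to keep it apart from the number of words $T$, $K$ is the number of patches, and $\mathbf{a}$, $\mathbf{b}$ are the word-side and patch-side marginal weights.

$$
\mathcal{D}_{ot}(\mu, \nu) = \min_{\mathbf{T} \in \Pi(\mathbf{a}, \mathbf{b})} \sum_{i=1}^{T} \sum_{j=1}^{K} \mathbf{T}_{ij}\, c(w_i, v_j),
\qquad
\Pi(\mathbf{a}, \mathbf{b}) = \left\{ \mathbf{T} \in \mathbb{R}_{+}^{T \times K} \;\middle|\; \mathbf{T}\mathbf{1}_{K} = \mathbf{a},\ \mathbf{T}^{\top}\mathbf{1}_{T} = \mathbf{b} \right\}
$$

With $c(w_i, v_j) = 1 - \frac{w_i^{\top} v_j}{\lVert w_i \rVert \, \lVert v_j \rVert}$, a small $\mathcal{D}_{ot}$ means most of the transport mass sits on word-patch pairs that already point in similar directions.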

Finding this minimum distance is hard, so it is approximated with IPOT, a fairly complicated method for approximating the Wasserstein distance. The implementation of this part is here
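
To make the approximation step more concrete, here is a minimal NumPy sketch of an IPOT-style proximal-point iteration. It is not the authors' implementation: the function names `cosine_cost` and `ipot`, the uniform marginals, and the values `beta=0.5` and `n_iters=50` are assumptions picked for illustration.

```python
import numpy as np


def cosine_cost(w: np.ndarray, v: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Pairwise cosine distance 1 - cos(w_i, v_j), shape (T, K)."""
    w = w / (np.linalg.norm(w, axis=1, keepdims=True) + eps)
    v = v / (np.linalg.norm(v, axis=1, keepdims=True) + eps)
    return 1.0 - w @ v.T


def ipot(cost: np.ndarray, beta: float = 0.5, n_iters: int = 50, inner: int = 1) -> np.ndarray:
    """Approximate the optimal transport plan with proximal-point iterations.

    Marginals are assumed uniform over words and patches (an assumption).
    """
    n, m = cost.shape
    mu = np.full(n, 1.0 / n)            # word-side marginal
    nu = np.full(m, 1.0 / m)            # patch-side marginal
    b = np.full(m, 1.0 / m)
    kernel = np.exp(-cost / beta)       # Gibbs kernel of the cost matrix
    plan = np.ones((n, m))
    for _ in range(n_iters):
        q = kernel * plan               # proximal step: reweight by the previous plan
        for _ in range(inner):          # Sinkhorn-style scaling toward the marginals
            a = mu / (q @ b)
            b = nu / (q.T @ a)
        plan = a[:, None] * q * b[None, :]
    return plan


rng = np.random.default_rng(0)
words = rng.standard_normal((5, 16))     # toy word embeddings
patches = rng.standard_normal((4, 16))   # toy patch embeddings

cost = cosine_cost(words, patches)
plan = ipot(cost)
wpa_loss = float((plan * cost).sum())    # transport cost that would be added to the loss
print(plan.shape, wpa_loss)
```

In a real training setup the plan would presumably be computed per image-text pair with padding tokens masked out; that bookkeeping is omitted here.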

This kind of result -> is it because of WPA?
