problem : Existing VLP models require a CNN backbone and an object detector, so the visual encoder is heavy. This is good for performance but not suitable for real applications.
Idea : Create a unified VLP model without a CNN.
architecture : visual patch embeddings as in ViT and word embeddings as in BERT. The embeddings from each modality are summed with their modal-type embeddings, concatenated, and fed into a single transformer encoder, which is trained with the pretraining objectives below.
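The fusion step above can be sketched as follows. This is a minimal NumPy illustration, not the actual model code; the function name `embed_inputs` and the use of constant (rather than learned) modal-type vectors are assumptions for clarity.

```python
import numpy as np

def embed_inputs(patch_emb, word_emb, dim):
    # patch_emb: (num_patches, dim) linear projections of image patches (ViT-style)
    # word_emb: (num_tokens, dim) word embeddings (BERT-style)
    # Modal-type embeddings tell the transformer which modality a position
    # belongs to; in the real model they are learned, here they are constants.
    text_type = np.zeros(dim)
    image_type = np.ones(dim)
    text = word_emb + text_type
    image = patch_emb + image_type
    # Concatenate along the sequence axis: one sequence, one transformer encoder.
    return np.concatenate([text, image], axis=0)
```

The key design point is that all cross-modal interaction happens inside the single transformer, since there is no deep visual encoder in front of it.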
objective : Image Text Matching (with 50% probability, replace the image in an image-text pair with a different image, and learn a binary classifier for whether the pair is the original match), MLM, and whole word masking (mask the whole word, not token by token: masking only the middle piece of gi, ##raf, ##fe would let the model predict it from the surrounding text pieces alone, without using visual information).
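The two objectives above can be sketched in a few lines. This is an illustrative sketch, not the paper's code; the function names and the stub `rng` interface are assumptions.

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, seed=0):
    # Group WordPiece pieces into words: a "##"-prefixed piece continues
    # the previous word (e.g. ["gi", "##raf", "##fe"] is one word).
    words, cur = [], []
    for t in tokens:
        if t.startswith("##") and cur:
            cur.append(t)
        else:
            if cur:
                words.append(cur)
            cur = [t]
    if cur:
        words.append(cur)
    rng = random.Random(seed)
    out = []
    for w in words:
        if rng.random() < mask_prob:
            out.extend(["[MASK]"] * len(w))  # mask every piece of the word
        else:
            out.extend(w)
    return out

def sample_itm_pair(pair, image_pool, rng):
    # Image Text Matching sampling: with 50% probability swap in a
    # different image and label the pair as mismatched (0), else keep
    # the original matched pair (1).
    text, image = pair
    if rng.random() < 0.5:
        negative = rng.choice([im for im in image_pool if im != image])
        return (text, negative), 0
    return (text, image), 1
```

The whole-word grouping is what prevents the textual-shortcut problem: once any piece of a word is masked, all of its pieces are masked together.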
baseline : ViLBERT, UNITER, PixelBERT …
result : reduced runtime (ms) by 4-60x compared to the baselines, while keeping competitive, and on some tasks better, performance.
contribution : 1) Improved runtime/efficiency by removing the deep visual encoder 2) Comparable performance with a simple architecture, without region features or deep convolutions 3) Showed that whole word masking and image augmentation improve VLP performance
data : (pretraining) MSCOCO, Visual Genome, SBU Captions, Google Conceptual Captions
(downstream)
VQA v2, NLVR2 (Natural Language for Visual Reasoning: given a triplet of two images and a natural-language description relating them, binary classification of whether the description is true)
Retrieval : MSCOCO and Flickr30k for image-to-text and text-to-image retrieval
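Retrieval with a single-stream model like this reduces to scoring each candidate pair and ranking. A minimal sketch, assuming a hypothetical `score_fn` that returns a matching score for a (text, image) pair (e.g. from the ITM head):

```python
def rank_images(text, images, score_fn):
    # score_fn(text, image) -> matching score; higher means a better match.
    # Returns candidate indices sorted from best to worst match.
    scores = [(score_fn(text, im), i) for i, im in enumerate(images)]
    return [i for _, i in sorted(scores, reverse=True)]
```

Note that a single-stream encoder must run once per candidate pair, so a fast encoder matters for retrieval latency in particular.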