image

paper

TL;DR

  • I read this because : it's part of my multi-modal series.
  • task : VLM -> captioning, image2text retrieval, text2image retrieval, VQAv2, Video QA, Video Captioning, Open-Vocab Detection
  • Problem : contrastive learning gives the ability to retrieve, and captioning gives the ability to generate text. However, the two objectives conflict and are difficult to combine.
  • idea : decoder-only for the language model! Forward twice with different masking, because contrastive and captioning need different things.
  • input/output : (pretraining) image / text -> similarity score / caption text
  • architecture : Image Encoder (ViT Huge, 650M) + Language Decoder (1B Transformer)
  • objective : Image Captioning Loss + Focal Contrastive Loss
  • baseline : CoCa, Florence, CLIP, ALIGN, …
  • data : (pretraining) only ALIGN -> (finetune) MSCOCO, Flickr30K, VQAv2, …
  • result : SOTA on zero-shot image-text retrieval. VQAv2 also performs well relative to parameter size.
  • contribution : fully e2e training without separate vision-encoder pretraining. Similar to CoCa, but with a comparatively simpler architecture. They also claim not having to forward the vision encoder multiple times for video as a strength, but that seems like TubeViT's contribution?…
  • etc. : CoCa also used JFT, so why does this beat its retrieval performance? Is it because CoCa juggles more tasks, so its retrieval suffers a bit? Or is it the training method? A controlled comparison against CoCa's training setup would be better.

Details

  • Video Processing -> TubeViT

  • The claim seems to be that with good sampling, we can treat images and videos the same way. image

  • To fill the gap between PE from object detection and PE from pretraining -> Cropped PE / Focal Loss for Contrastive loss

  • I saw this in “Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers,” but I can’t find the exact reference.

Architecture

image image image

The main contribution: two-pass learning. Captioning requires causal masking (a conditioned, left-to-right representation), while contrastive requires a full bidirectional text representation. So just use the same decoder with different masking and forward twice! (Different masking / cross-attention toggled, so strictly speaking this is not an encoder lol.) It's roughly like CoCa's Unimodal Text Decoder + Multimodal Text Decoder sharing the same weights. CoCa instead stacks the two and uses causal self-attention throughout its text decoder, so it only needs one forward pass. image
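The two-pass idea above can be sketched as: identical decoder weights, two forward passes, and the only difference is the self-attention mask (plus whether cross-attention is used). A minimal mask-building sketch, assuming additive masks and illustrative names (`attention_mask`, `decoder` are not from the paper):

```python
import numpy as np

def attention_mask(seq_len: int, causal: bool) -> np.ndarray:
    """Additive attention mask: 0 = attend, -inf = blocked.

    Pass 1 (contrastive): causal=False -> fully bidirectional text
    features for the similarity score, no cross-attention.
    Pass 2 (captioning): causal=True -> each token only sees its
    predecessors, as next-token prediction requires, with
    cross-attention to image features.
    """
    if causal:
        # Upper triangle is blocked: position i attends to positions <= i.
        return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
    return np.zeros((seq_len, seq_len))

# Two passes through the SAME decoder weights; only the mask differs.
contrastive_mask = attention_mask(4, causal=False)
captioning_mask = attention_mask(4, causal=True)
```

The point is that no extra parameters are introduced for the second objective; the two "models" are one set of weights seen under two masking regimes.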

Loss

  • Captioning loss image

  • Focal Contrastive Loss: contrastive learning usually requires a large batch size. To learn more from challenging examples than plain CE does -> use focal loss

image image

Video Processing

image

Result

image image

Ablations

image
  • With the captioning loss, text2image improves and image2text gets worse. Generation seems to build a better text representation -> not sure about this… isn’t retrieval a two-way street?!
  • Cross-attention helped generative tasks like VQA, but not so much retrieval.
image

Small is good.