TL;DR
PROBLEM: Creating a good vision backbone. The goal is a single model, trainable from scratch, that unifies three image-pretraining paradigms: a single-encoder model trained on classification labels, a dual-encoder model that receives image-text pairs and is trained with a contrastive loss, and an encoder-decoder model whose text decoder receives image features via cross-attention (for captioning, VQA, etc.).
SOLUTION: Given an image-text pair, apply a contrastive loss between a pooled token from the image encoder and the cls-token of a unimodal text decoder; stack a multimodal text decoder, which cross-attends to the image tokens, on top of that unimodal decoder, and train it with an autoregressive captioning (cross-entropy) loss. Pretrain with the weighted sum of the two losses.
RESULT: SOTA or near-SOTA on a broad range of visual recognition and vision-language tasks.

Details
Architecture
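A minimal sketch of the data flow described above, with stand-in random tensors for the real modules and illustrative names and shapes (none of these sizes come from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
B, N_img, N_txt, D = 2, 16, 8, 32  # batch, image/text tokens, width (illustrative)

# Image encoder (e.g. a ViT) -> a sequence of image tokens.
img_tokens = rng.normal(size=(B, N_img, D))

# Bottom half of the text decoder: unimodal, no cross-attention.
# A cls token appended to the text yields the contrastive text embedding.
uni_out = rng.normal(size=(B, N_txt + 1, D))
cls_embed = uni_out[:, -1, :]  # (B, D), paired with a pooled image token

# Top half: a multimodal decoder cross-attends to img_tokens and emits
# next-token logits for the captioning loss (modules elided in this sketch).
```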

Loss
Captioning loss
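The captioning loss is standard autoregressive cross-entropy over the text tokens, predicted by the multimodal decoder conditioned on the image. A minimal numpy sketch (shapes and function name are illustrative):

```python
import numpy as np

def captioning_loss(logits, targets):
    """Autoregressive cross-entropy: the decoder predicts token t from
    tokens < t (plus the image, via cross-attention).
    logits: (B, T, V) decoder outputs; targets: (B, T) token ids."""
    # Numerically stable log-softmax over the vocabulary.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    B, T = targets.shape
    # Negative log-likelihood of each target token, averaged.
    nll = -log_probs[np.arange(B)[:, None], np.arange(T)[None, :], targets]
    return nll.mean()
```

With uniform logits the loss reduces to log(vocab_size), a useful sanity check.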

Dual-encoder contrastive loss
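The dual-encoder term is a symmetric InfoNCE loss over the batch: matched image-text pairs are positives and all other pairings are negatives. A sketch, assuming L2-normalized embeddings (the temperature value here is illustrative):

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch.
    img_emb, txt_emb: (B, D), assumed L2-normalized."""
    logits = img_emb @ txt_emb.T / temperature  # (B, B) pairwise similarities
    labels = np.arange(len(logits))             # positives on the diagonal

    def xent(l):
        # Cross-entropy of each row against its diagonal entry.
        l = l - l.max(axis=-1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=-1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average of image-to-text and text-to-image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

When image and text embeddings match exactly, the loss approaches zero.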

- Attentional Poolers: The contrastive loss uses only a single pooled token from the image, while the encoder-decoder's captioning task uses a longer sequence of image tokens. In preliminary experiments, a single pooled image embedding performed better on visual recognition, whereas multimodal tasks benefit from attending to more tokens, since region-level features matter there. For this reason, task-specific attentional pooling is used so that each downstream task gets its own visual representation. A pooler is a single multi-head attention layer with n learnable queries (the encoder output serves as keys and values), so the two losses can be trained with queries of different lengths. Naturally, these learnable queries also act as a task adapter.
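The pooler described above can be sketched as follows; this is single-head for brevity (the paper uses multi-head attention), and all shapes and projection matrices are illustrative stand-ins:

```python
import numpy as np

def attentional_pool(x, queries, w_k, w_v):
    """One attention layer with n learnable queries.
    x: (N, D) encoder tokens (keys/values); queries: (n, D) learnable."""
    k, v = x @ w_k, x @ w_v                        # project keys and values
    scores = queries @ k.T / np.sqrt(k.shape[-1])  # (n, N) attention logits
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True) # softmax over the N tokens
    return attn @ v                                # (n, D): n pooled tokens

rng = np.random.default_rng(0)
D, N = 32, 49
x = rng.normal(size=(N, D))
w_k, w_v = rng.normal(size=(D, D)), rng.normal(size=(D, D))

# Contrastive head: a single query pools the image into one token;
# captioning head: many queries (256 here) keep region-level detail.
one_token = attentional_pool(x, rng.normal(size=(1, D)), w_k, w_v)
many_tokens = attentional_pool(x, rng.normal(size=(256, D)), w_k, w_v)
```

The number of queries fixes the output length, which is how one pooler yields a single global embedding and the other a region-level token sequence from the same encoder output.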