arxiv
Problem : Vision-Language Pretraining (VLP) typically requires object-detection annotations (bounding boxes and class labels) for images, which makes annotation costly and makes zero-shot transfer difficult.
Solution : The image is encoded with CoAtNet and prepended, together with the encoded text prefix, to train an encoder-decoder model with a PrefixLM objective. Training data: ALIGN (noisy image-text pairs) and C4 (text-only). Fine-tuning covers image captioning, visual reasoning, VQA, and multimodal translation.
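The prefix construction described above can be sketched roughly as follows. All shapes, dimensions, and variable names here are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 16 image patch tokens, a 3-token text prefix,
# and 5 remaining text tokens as the generation target.
d_model = 8
img_tokens = rng.normal(size=(16, d_model))   # e.g. CoAtNet patch features
prefix_txt = rng.normal(size=(3, d_model))    # embedded text prefix
target_txt = rng.normal(size=(5, d_model))    # decoder learns to generate these

# PrefixLM-style setup: image tokens plus text prefix form the
# (bidirectionally attended) encoder input; the decoder then predicts
# the remaining text autoregressively.
encoder_input = np.concatenate([img_tokens, prefix_txt], axis=0)
assert encoder_input.shape == (19, d_model)
```

The point of the sketch is only the concatenation: the image acts as part of the prefix, and no bounding-box annotation is involved anywhere.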
Result : SOTA performance on the various fine-tuning tasks, plus strong zero-shot results.
On the image-captioning task, the zero-shot model (no fine-tuning) performs comparably to fully supervised models.
Confirmed that including a text-only corpus when training the Vision-Language model is useful (it strengthens the decoder's ability to generate).
etc :
- A separate loss called CIDEr is mentioned for VQA (note: CIDEr is a consensus-based image-captioning metric, so this may actually refer to the captioning task)
- VQA is learned by putting the image into the encoder and the question text into the decoder, then appending a fully connected classification head to the output of the decoder's last token
- multimodal translation is the task of translating a caption into another language, given the corresponding image
- The encoder-decoder structure performed better than a decoder-only structure
- PrefixLM attends bidirectionally within the prefix and autoregressively (causal LM) afterwards (is this the first time PrefixLM has been mentioned in a paper?)
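The PrefixLM attention pattern in the last bullet can be sketched as a mask. This is a minimal illustration; the function name and the sizes are my own, not from the paper:

```python
import numpy as np

def prefix_lm_mask(prefix_len: int, total_len: int) -> np.ndarray:
    """mask[i, j] == 1 means position j is visible to position i.
    Prefix tokens attend bidirectionally among themselves; tokens after
    the prefix attend causally (standard left-to-right LM)."""
    mask = np.tril(np.ones((total_len, total_len), dtype=int))  # causal base
    mask[:prefix_len, :prefix_len] = 1  # full attention inside the prefix
    return mask

m = prefix_lm_mask(prefix_len=2, total_len=4)
# Row i lists which positions token i can see:
# [[1 1 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

With `prefix_len=0` this reduces to an ordinary causal LM mask, which is why PrefixLM is often described as a generalization of the decoder-only objective.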
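The VQA fine-tuning setup noted above (image into the encoder, question into the decoder, classification head on the last decoder token) can be sketched like this. The hidden width and the 3,129-way answer vocabulary (the common VQAv2 convention) are assumptions here, not figures from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: model width, answer-vocabulary size, decoder length.
d_model, n_answers, seq_len = 8, 3129, 10

decoder_out = rng.normal(size=(seq_len, d_model))  # decoder hidden states
W = rng.normal(size=(d_model, n_answers)) * 0.01   # FC classification head
b = np.zeros(n_answers)

# Only the LAST decoder token's representation feeds the classifier,
# turning VQA into an n_answers-way classification problem.
logits = decoder_out[-1] @ W + b
pred_answer = int(np.argmax(logits))
assert logits.shape == (n_answers,)
```

This framing treats VQA as classification over a fixed answer set rather than free-form generation.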