arxiv
Problem : Vision-Language Pretraining (VLP) typically requires object-detection annotations (bounding boxes and class labels) for images, which makes annotation costly and makes zero-shot transfer difficult.
Solution : The image is encoded with CoAtNet and prepended, together with the encoded text prefix, to train an encoder-decoder model with a PrefixLM objective. Training data: ALIGN (noisy image-text pairs) and C4 (text-only). Fine-tuning covers image captioning, visual reasoning, VQA, and multimodal translation.
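The prefix construction described above can be sketched roughly as follows. All shapes, dimensions, and variable names here are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 16 image patch tokens, a 3-token text prefix,
# and 5 remaining text tokens as the generation target.
d_model = 8
img_tokens = rng.normal(size=(16, d_model))   # e.g. CoAtNet patch features
prefix_txt = rng.normal(size=(3, d_model))    # embedded text prefix
target_txt = rng.normal(size=(5, d_model))    # decoder learns to generate these

# PrefixLM-style setup: image tokens plus text prefix form the
# (bidirectionally attended) encoder input; the decoder then predicts
# the remaining text autoregressively.
encoder_input = np.concatenate([img_tokens, prefix_txt], axis=0)
assert encoder_input.shape == (19, d_model)
```

The point of the sketch is only the concatenation: the image acts as part of the prefix, and no bounding-box annotation is involved anywhere.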
Result : SOTA performance on the various fine-tuning tasks, plus strong zero-shot results.
On the image-captioning task, the zero-shot model (no fine-tuning) performs comparably to fully supervised models.
Confirmed that including a text-only corpus when training the Vision-Language model is useful (it strengthens the decoder's ability to generate).
etc :
- A separate loss called CIDEr is mentioned for VQA (note: CIDEr is a consensus-based image-captioning metric, so this may actually refer to the captioning task)
- VQA is learned by putting the image into the encoder and the question text into the decoder, then appending a fully connected classification head to the output of the decoder's last token
- multimodal translation is the task of translating a caption into another language, given the corresponding image
- The encoder-decoder structure performed better than a decoder-only structure
- PrefixLM attends bidirectionally within the prefix and autoregressively (causal LM) afterwards (is this the first time PrefixLM has been mentioned in a paper?)
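The PrefixLM attention pattern in the last bullet can be sketched as a mask. This is a minimal illustration; the function name and the sizes are my own, not from the paper:

```python
import numpy as np

def prefix_lm_mask(prefix_len: int, total_len: int) -> np.ndarray:
    """mask[i, j] == 1 means position j is visible to position i.
    Prefix tokens attend bidirectionally among themselves; tokens after
    the prefix attend causally (standard left-to-right LM)."""
    mask = np.tril(np.ones((total_len, total_len), dtype=int))  # causal base
    mask[:prefix_len, :prefix_len] = 1  # full attention inside the prefix
    return mask

m = prefix_lm_mask(prefix_len=2, total_len=4)
# Row i lists which positions token i can see:
# [[1 1 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

With `prefix_len=0` this reduces to an ordinary causal LM mask, which is why PrefixLM is often described as a generalization of the decoder-only objective.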
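The VQA fine-tuning setup noted above (image into the encoder, question into the decoder, classification head on the last decoder token) can be sketched like this. The hidden width and the 3,129-way answer vocabulary (the common VQAv2 convention) are assumptions here, not figures from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: model width, answer-vocabulary size, decoder length.
d_model, n_answers, seq_len = 8, 3129, 10

decoder_out = rng.normal(size=(seq_len, d_model))  # decoder hidden states
W = rng.normal(size=(d_model, n_answers)) * 0.01   # FC classification head
b = np.zeros(n_answers)

# Only the LAST decoder token's representation feeds the classifier,
# turning VQA into an n_answers-way classification problem.
logits = decoder_out[-1] @ W + b
pred_answer = int(np.argmax(logits))
assert logits.shape == (n_answers,)
```

This framing treats VQA as classification over a fixed answer set rather than free-form generation.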