[138] ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | multimodal dataset 2023Q4 MLLM
[137] mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | multimodal LLM 2023Q4 alibaba
[136] Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models | multimodal naver 2023Q2 document emnlp
[135] Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text | multimodal dataset NeurIPS 2023Q2
[126] ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | multimodal 2021Q1 25min kakao
[121] Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities | multimodal CLIP 2023Q1 retrieval
[113] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | multimodal 2023Q1 salesforce
[111] Perceiver IO: A General Architecture for Structured Inputs & Outputs | multimodal 2021Q2 ICLR DeepMind MTL
[97] Contrastive Language-Image Pre-Training with Knowledge Graph | multimodal NeurIPS graph 2022Q4 CLIP
[32] ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | multimodal 2021Q2 kakao
[29] Grounded Language-Image Pre-training | multimodal 2021Q4 few-shot zero-shot microsoft object detection
[19] Multimodal Explanations: Justifying Decisions and Pointing to the Evidence | multimodal 2018 dataset
[8] SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | multimodal SSL 2021Q2 zero-shot
[6] Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling | multimodal 2021Q4 backbone multitask