[138] ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | multimodal dataset 2023Q4 MLLM
[137] mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | multimodal LLM 2023Q4 alibaba
[136] Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models | multimodal naver 2023Q2 document emnlp
[135] Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text | multimodal dataset NeurIPS 2023Q2
[126] ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | multimodal 2021Q1 25min kakao
[121] Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities | multimodal CLIP 2023Q1 retrieval
[113] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | multimodal 2023Q1 salesforce
[111] Perceiver IO: A General Architecture for Structured Inputs & Outputs | multimodal 2021Q2 ICLR DeepMind MTL
[97] Contrastive Language-Image Pre-Training with Knowledge Graph | multimodal NeurIPS graph 2022Q4 CLIP
[32] ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | multimodal 2021Q2 kakao
[29] Grounded Language-Image Pre-training | multimodal 2021Q4 few-shot zero-shot microsoft object detection
[19] Multimodal Explanations: Justifying Decisions and Pointing to the Evidence | multimodal 2018 dataset
[8] SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | multimodal SSL 2021Q2 zero-shot
[6] Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling | multimodal 2021Q4 backbone multitask