[198] Kimi k1.5: Scaling Reinforcement Learning with LLMs

January 23, 2025 · 4 min · long8v

[141] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

December 15, 2023 · 3 min · long8v

[140] Improved Baselines with Visual Instruction Tuning

December 12, 2023 · 3 min · long8v

[138] ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

December 8, 2023 · 2 min · long8v

[137] mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

December 5, 2023 · 3 min · long8v

[136] Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models

November 28, 2023 · 3 min · long8v

[135] Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text

November 23, 2023 · 3 min · long8v

[127] Linearly Mapping from Image to Text Space

August 17, 2023 · 2 min · long8v

[126] ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

August 9, 2023 · 2 min · long8v

[121] Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities

June 23, 2023 · 3 min · long8v

[118] PaLI-X: On Scaling up a Multilingual Vision and Language Model

June 8, 2023 · 4 min · long8v

[117] Multimodal Chain-of-Thought Reasoning in Language Models

June 7, 2023 · 2 min · long8v

[115] ImageBind: One Embedding Space To Bind Them All

May 16, 2023 · 2 min · long8v

[114] MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks

May 9, 2023 · 3 min · long8v

[113] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

April 27, 2023 · 3 min · long8v

[111] Perceiver IO: A General Architecture for Structured Inputs & Outputs

April 24, 2023 · 2 min · long8v

[109] 🦩 Flamingo: a Visual Language Model for Few-Shot Learning

April 10, 2023 · 4 min · long8v

[32] ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

June 28, 2022 · 2 min · long8v

[31] GIT: A Generative Image-to-text Transformer for Vision and Language

June 26, 2022 · 2 min · long8v

[30] CoCa: Contrastive Captioners are Image-Text Foundation Models

June 22, 2022 · 2 min · long8v

[19] Multimodal Explanations: Justifying Decisions and Pointing to the Evidence

April 6, 2022 · 1 min · long8v

[8] SimVLM: Simple Visual Language Model Pretraining with Weak Supervision

January 24, 2022 · 1 min · long8v

[7] SLIP: Self-supervision meets Language-Image Pre-training

January 20, 2022 · 1 min · long8v

[6] Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling

January 18, 2022 · 1 min · long8v