[152] Sigmoid Loss for Language Image Pre-Training

March 12, 2024 · 2 min · long8v

[129] Grounding Language Models to Images for Multimodal Inputs and Outputs

September 4, 2023 · 1 min · long8v

[127] Linearly Mapping from Image to Text Space

August 17, 2023 · 2 min · long8v

[125] RILS: Masked Visual Reconstruction in Language Semantic Space

August 2, 2023 · 2 min · long8v

[121] Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities

June 23, 2023 · 3 min · long8v

[117] Multimodal Chain-of-Thought Reasoning in Language Models

June 7, 2023 · 2 min · long8v

[114] MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks

May 9, 2023 · 3 min · long8v

[113] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

April 27, 2023 · 3 min · long8v

[107] Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

March 30, 2023 · 2 min · long8v