[152] Sigmoid Loss for Language Image Pre-Training

March 12, 2024 · 2 min · long8v

[129] Grounding Language Models to Images for Multimodal Inputs and Outputs

September 4, 2023 · 1 min · long8v

[127] Linearly Mapping from Image to Text Space

August 17, 2023 · 2 min · long8v

[125] RILS: Masked Visual Reconstruction in Language Semantic Space

August 2, 2023 · 2 min · long8v

[121] Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities

June 23, 2023 · 3 min · long8v

[117] Multimodal Chain-of-Thought Reasoning in Language Models

June 7, 2023 · 2 min · long8v

[114] MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks

May 9, 2023 · 2 min · long8v

[113] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

April 27, 2023 · 3 min · long8v

[107] Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

March 30, 2023 · 2 min · long8v