[152] Sigmoid Loss for Language Image Pre-Training

March 12, 2024 · 2 min · long8v

[129] Grounding Language Models to Images for Multimodal Inputs and Outputs

September 4, 2023 · 1 min · long8v

[127] Linearly Mapping from Image to Text Space

August 17, 2023 · 2 min · long8v

[125] RILS: Masked Visual Reconstruction in Language Semantic Space

August 2, 2023 · 2 min · long8v

[121] Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities

June 23, 2023 · 3 min · long8v

[117] Multimodal Chain-of-Thought Reasoning in Language Models

June 7, 2023 · 2 min · long8v

[114] MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks

May 9, 2023 · 2 min · long8v

[113] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

April 27, 2023 · 3 min · long8v

[107] Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

March 30, 2023 · 2 min · long8v