[196] Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

January 17, 2025 · 2 min · long8v

[167] Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation

July 24, 2024 · 2 min · long8v

[165] Rich Human Feedback for Text-to-Image Generation

July 19, 2024 · 2 min · long8v

feat: add text span

May 7, 2024 · 1 min · long8v

[156] Interpreting CLIP's Image Representation via Text-Based Decomposition

May 6, 2024 · 2 min · long8v

[143] Honeybee: Locality-enhanced Projector for Multimodal LLM

December 22, 2023 · 3 min · long8v

[141] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

December 15, 2023 · 2 min · long8v

[139] Davidsonian Scene Graph: Improving Reliability in Fine-Grained Evaluation for Text-to-Image Generation

December 11, 2023 · 2 min · long8v

[138] ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

December 8, 2023 · 2 min · long8v

[137] mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

December 5, 2023 · 2 min · long8v