[149] Noise-aware Learning from Web-crawled Image-Text Data for Image Captioning

February 12, 2024 · 1 min · long8v

[143] Honeybee: Locality-enhanced Projector for Multimodal LLM

December 22, 2023 · 3 min · long8v

[126] ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision

August 9, 2023 · 2 min · long8v