[219] GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning | RL MLLM 2025Q3
[209] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training | google RL Berkeley 2025Q1
[208] FastCuRL: Curriculum Reinforcement Learning with Progressive Context Extension for Efficient Training R1-like Reasoning Models | 25min RL 2025Q1
[206] Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models | 25min RL MLLM 2025Q1
[207] MM-EUREKA: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning | RL MLLM 2025Q1
[205] LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters! | Berkeley reasoning 2025Q1
[201] VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment | RL reasoning 2025Q1
[200] Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling | 25min RL 2025Q1 THU
[199] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | RL reasoning 2025Q1
[196] Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations | ACL RL 2023Q4 reasoning
[194] Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters | DeepMind 2024Q3 reasoning
[193] Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective | survey 2024Q4 reasoning
[187] Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization | RL MLLM 2024Q4 SHU
[183] MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models | MLLM 2024Q3 STEM
[178] RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness | RL MLLM 2024Q2
[171] CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs | ECCV RL MLLM 2024Q3
[172] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | CVPR RL MLLM 2024Q2
[173] Detecting and Preventing Hallucinations in Large Vision Language Models | AAAI RL 2023Q3 MLLM ScaleAI
[170] Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback | RL AI2 2024Q2
[167] Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation | NeurIPS 2023Q4 generation
[163] What You See is What You Read? Improving Text-Image Alignment Evaluation | google NeurIPS 2023Q2 evaluation
[164] TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering | ICCV evaluation 2023Q3
[161] MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks | 25min 2022Q4 XAI ACL
[157] LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity | CLIP XAI 2024Q2
[155] Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings | google evaluation generation 2024Q2
[154] Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment | google XAI evaluation 2024Q2
[149] Noise-aware Learning from Web-crawled Image-Text Data for Image Captioning | ICCV 25min 2022Q4 kakao
[148] I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision | ICCV 25min CLIP 2023Q3 AI2
[147] Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers | ICCV 2021Q1 XAI
[145] CLIPScore: A Reference-free Evaluation Metric for Image Captioning | 2021Q2 CLIP EMNLP evaluation AI2
[144] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | multilingual alibaba 2023Q3 MLLM qwen
[139] Davidsonian Scene Graph: Improving Reliability in Fine-Grained Evaluation for Text-to-Image Generation | google 2023Q4 evaluation generation
[138] ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | multimodal dataset 2023Q4 MLLM
[137] mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | multimodal LLM 2023Q4 alibaba
[136] Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models | multimodal naver 2023Q2 document EMNLP
[135] Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text | multimodal dataset NeurIPS 2023Q2
[131] Re-labeling ImageNet: from Single to Multi-Labels, from Global to Localized Labels | 2021Q1 CVPR naver
[128] Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding | ICML google 2022Q3 document
[126] ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | multimodal 2021Q1 25min kakao
[121] Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities | multimodal CLIP 2023Q1 retrieval
[116] Data Distributional Properties Drive Emergent In-Context Learning in Transformers | DeepMind NeurIPS 2022Q2
[113] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | multimodal 2023Q1 salesforce
[111] Perceiver IO: A General Architecture for Structured Inputs & Outputs | multimodal 2021Q2 ICLR DeepMind MTL
[110] Understanding the Role of Self Attention for Efficient Speech Recognition | 2022Q1 ICLR 25min transformer
[108] Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships | 2022Q1 dataset CVPR graph
[101] Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics | 2017 uncertainty MTL
[98] Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection | NeurIPS object detection 2022Q3 CLIP
[97] Contrastive Language-Image Pre-Training with Knowledge Graph | multimodal NeurIPS graph 2022Q4 CLIP
[89] Relational Attention: Generalizing Transformers for Graph-Structured Tasks | microsoft graph 2022Q4 transformer
[87] Bipartite Graph Network with Adaptive Message Passing for Unbiased Scene Graph Generation | 2021Q4 CVPR SGG imbalance
[85] Dynamic Head: Unifying Object Detection Heads with Attentions | 2021Q2 CVPR microsoft object detection
[83] Variance Networks: When Expectation Does Not Meet Your Expectations | 2018 ICLR uncertainty later.. bayesian
[82] Estimating and Evaluating Regression Predictive Uncertainty in Deep Object Detectors | 2021Q1 ICLR object detection uncertainty later..
[80] Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection | 2020Q3 object detection imbalance uncertainty
[74] “This is my unicorn, Fluffy”: Personalizing frozen vision-language representations | dataset 2022Q3 25min ECCV nvidia CLIP
[73] Simple Open-Vocabulary Object Detection with Vision Transformers | google object detection 2022Q2 25min ECCV OV
[72] Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity | 2021Q4 ICLR object detection sparse kakao
[71] Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers | 25min sparse 2022Q4 transformer
[67] Deformable DETR: Deformable Transformers for End-to-End Object Detection | 2020Q3 ICLR long object detection SenseTime
[54] Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models | LM MoE 2022Q3 25min
[53] InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets | openAI 2016 fundamental generative
[48] SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection | 2020Q1 long NeurIPS graph 25min
[38] Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries | ICCV 2021Q2 SGG one-stage
[37] Relationformer: A Unified Framework for Image-to-Graph Generation | 2022Q1 SGG graph one-stage ECCV
[34] What Regularized Auto-Encoders Learn from the Data Generating Distribution | fundamental 2012 generative
[32] ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | multimodal 2021Q2 naver
[29] Grounded Language-Image Pre-training | multimodal 2021Q4 few-shot zero-shot microsoft object detection
[27] Cross-Domain Few-Shot Classification via Learned Feature-Wise Transformation | few-shot 2020Q1 ICLR
[26] Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts | 2018 MoE KDD
[22] Transformers without Tears: Improving the Normalization of Self-Attention | NLP 2019 fundamental norm
[19] Multimodal Explanations: Justifying Decisions and Pointing to the Evidence | multimodal 2018 dataset
[9] SimCLR: A Simple Framework for Contrastive Learning of Visual Representations | few-shot SSL 2020Q3 ICML google
[8] SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | multimodal SSL 2021Q2 zero-shot
[6] Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling | multimodal 2021Q4 backbone multitask
[5] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale | ViT backbone 2021Q1 re-read