[219] GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning | RL MLLM 2025Q3
[209] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training | Google RL Berkeley 2025Q1
[208] FastCuRL: Curriculum Reinforcement Learning with Progressive Context Extension for Efficient Training R1-like Reasoning Models | 25min RL 2025Q1
[206] Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models | 25min RL MLLM 2025Q1
[207] MM-EUREKA: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning | RL MLLM 2025Q1
[201] VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment | RL reasoning 2025Q1
[200] Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling | 25min RL 2025Q1 THU
[199] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | RL reasoning 2025Q1
[196] Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations | ACL RL 2023Q4 reasoning
[187] Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization | RL MLLM 2024Q4 SHU
[178] RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness | RL MLLM 2024Q2
[171] CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs | ECCV RL MLLM 2024Q3
[172] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | CVPR RL MLLM 2024Q2
[173] Detecting and Preventing Hallucinations in Large Vision Language Models | AAAI RL 2023Q3 MLLM ScaleAI
[170] Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback | RL AI2 2024Q2
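Several entries in this section ([170], [171], [187]) build on DPO-style preference optimization. As a quick orientation, below is a minimal PyTorch sketch of the standard DPO loss (Rafailov et al., 2023): maximize the Bradley-Terry likelihood of the chosen response over the rejected one, with implicit rewards defined as the beta-scaled log-ratio between the policy and a frozen reference model. The function name and toy batch are illustrative, not taken from any of the listed papers.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Minimal DPO loss sketch (illustrative helper, not from the papers above).

    Each argument is a (batch,) tensor of summed token log-probs:
      policy_* -- log-probs under the model being trained
      ref_*    -- log-probs under the frozen reference model
    beta scales the implicit KL constraint to the reference policy.
    """
    # Implicit rewards: beta * (policy log-prob minus reference log-prob).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry preference likelihood: -log sigmoid(reward margin).
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

if __name__ == "__main__":
    # Toy usage: random log-probs for a batch of 4 preference pairs.
    b = 4
    loss = dpo_loss(torch.randn(b), torch.randn(b),
                    torch.randn(b), torch.randn(b))
    print(loss.item())
```

In contrast, the PPO-style methods in this section ([199], [201], [219]) optimize an explicit scalar reward with an on-policy policy-gradient loop rather than a closed-form preference objective; [170] compares the two families directly.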