RL | 🍎 Paper Today I Read 🦔

[219] GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

RL MLLM 2025Q3

[215] Group Sequence Policy Optimization

LLM RL 2025Q3

[214] Learning to Model the World With Language

ICML RL 2023Q3 WORLD-MODEL

[210] Weight Ensembling Improves Reasoning in Language Models

RL reasoning 2025Q2

[209] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Google RL Berkeley 2025Q1

[208] FastCuRL: Curriculum Reinforcement Learning with Progressive Context Extension for Efficient Training R1-like Reasoning Models

25min RL 2025Q1

[206] Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

25min RL MLLM 2025Q1

[207] MM-EUREKA: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning

RL MLLM 2025Q1

[204] DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL

25min RL reasoning 2025Q1

[203] DeepSeek-V3 Technical Report

WIP 25min LLM RL 2024Q4

[201] VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment

RL reasoning 2025Q1

[200] Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling

25min RL 2025Q1 THU

[199] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

RL reasoning 2025Q1

[198] Kimi k1.5: Scaling Reinforcement Learning with LLMs

multimodal RL reasoning 2025Q1

[197] Free Process Rewards without Process Labels

25min RL 2024Q4

[196] Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

ACL RL 2023Q4 reasoning

[191] Critique-out-Loud Reward Models

AllenAI LLM RL 2024Q3

[190] Solving math word problems with process- and outcome-based feedback

DeepMind 2022Q4 RL

[189] Training Verifiers to Solve Math Word Problems

2021Q4 OpenAI 25min RL

[187] Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

RL MLLM 2024Q4 SHU

[182] Calibrated Self-Rewarding Vision Language Models

NeurIPS 25min RL MLLM 2024Q2

[181] Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

LLM RL 2024Q2

[179] Aligning Large Multimodal Models with Factually Augmented RLHF

25min RL 2023Q3 MLLM Berkeley

[178] RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness

RL MLLM 2024Q2

[177] Fine-grained Image Captioning with CLIP Reward

2022Q2 25min RL NAACL

[176] Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive

LLM RL 2024Q1

[175] Dense Reward for Free in Reinforcement Learning from Human Feedback

ICML LLM RL 2024Q3

[171] CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs

ECCV RL MLLM 2024Q3

[172] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback

CVPR RL MLLM 2024Q2

[173] Detecting and Preventing Hallucinations in Large Vision Language Models

AAAI RL 2023Q3 MLLM ScaleAI

[170] Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback

RL AI2 2024Q2

[169] Direct Preference Optimization: Your Language Model is Secretly a Reward Model

2023Q2 RL

[168] Proximal Policy Optimization Algorithms

2017 RL

[142] Trust Region Policy Optimization

2015 RL

[134] Asynchronous Methods for Deep Reinforcement Learning

2016 DeepMind RL