[219] GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

November 12, 2025 · 4 min · long8v · 

[215] Group Sequence Policy Optimization

August 1, 2025 · 3 min · long8v · 

[214] Learning to Model the World With Language

July 17, 2025 · 4 min · long8v · 

[210] Weight Ensembling Improves Reasoning in Language Models

May 30, 2025 · 2 min · long8v · 

[209] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

May 21, 2025 · 2 min · long8v · 

[208] FastCuRL: Curriculum Reinforcement Learning with Progressive Context Extension for Efficient Training R1-like Reasoning Models

March 27, 2025 · 1 min · long8v · 

[206] Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

March 12, 2025 · 1 min · long8v · 

[207] MM-EUREKA: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning

March 12, 2025 · 3 min · long8v · 

[204] DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL

February 19, 2025 · 2 min · long8v · 

[203] DeepSeek-V3 Technical Report

February 13, 2025 · 2 min · long8v · 

[201] VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment

February 8, 2025 · 3 min · long8v · 

[200] Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling

February 3, 2025 · 2 min · long8v · 

[199] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

January 24, 2025 · 2 min · long8v · 

[198] Kimi k1.5: Scaling Reinforcement Learning with LLMs

January 23, 2025 · 4 min · long8v · 

[197] Free Process Rewards without Process Labels

January 20, 2025 · 1 min · long8v · 

[196] Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

January 17, 2025 · 2 min · long8v · 

[191] Critique-out-Loud Reward Models

December 17, 2024 · 2 min · long8v · 

[190] Solving math word problems with process and outcome-based feedback

December 16, 2024 · 4 min · long8v · 

[189] Training Verifiers to Solve Math Word Problems

December 9, 2024 · 1 min · long8v · 

[187] Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

November 21, 2024 · 2 min · long8v · 

[182] Calibrated Self-Rewarding Vision Language Models

October 10, 2024 · 2 min · long8v · 

[181] Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

October 7, 2024 · 2 min · long8v · 

[179] Aligning Large Multimodal Models with Factually Augmented RLHF

September 25, 2024 · 2 min · long8v · 

[178] RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness

September 23, 2024 · 3 min · long8v · 

[177] Fine-grained Image Captioning with CLIP Reward

September 6, 2024 · 2 min · long8v · 

[176] Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive

September 5, 2024 · 2 min · long8v · 

[175] Dense Reward for Free in Reinforcement Learning from Human Feedback

September 4, 2024 · 2 min · long8v · 

[171] CLIP-DPO: Vision-Language Models as a Source of Preference for Fixing Hallucinations in LVLMs

August 30, 2024 · 2 min · long8v · 

[172] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback

August 30, 2024 · 2 min · long8v · 

[173] Detecting and Preventing Hallucinations in Large Vision Language Models

August 30, 2024 · 2 min · long8v · 

[170] Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback

August 27, 2024 · 2 min · long8v · 

[169] Direct Preference Optimization: Your Language Model is Secretly a Reward Model

August 26, 2024 · 1 min · long8v · 

[168] Proximal Policy Optimization Algorithms

August 21, 2024 · 2 min · long8v · 

[142] Trust Region Policy Optimization

December 17, 2023 · 1 min · long8v · 

[134] Asynchronous Methods for Deep Reinforcement Learning

October 18, 2023 · 4 min · long8v ·