[221] Scaling Synthetic Data Creation with 1,000,000,000 Personas

January 19, 2026 · 1 min · long8v

[215] Group Sequence Policy Optimization

August 1, 2025 · 3 min · long8v

[203] DeepSeek-V3 Technical Report

February 13, 2025 · 2 min · long8v

[191] Critique-out-Loud Reward Models

December 17, 2024 · 2 min · long8v

[186] The Llama 3 Herd of Models

November 15, 2024 · 8 min · long8v

[181] Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

October 7, 2024 · 2 min · long8v

[176] Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive

September 5, 2024 · 2 min · long8v

[175] Dense Reward for Free in Reinforcement Learning from Human Feedback

September 4, 2024 · 2 min · long8v

[140] Improved Baselines with Visual Instruction Tuning

December 12, 2023 · 3 min · long8v

[137] mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

December 5, 2023 · 3 min · long8v

[109] 🦩 Flamingo: a Visual Language Model for Few-Shot Learning

April 10, 2023 · 4 min · long8v

[106] Prefix-Tuning: Optimizing Continuous Prompts for Generation

March 28, 2023 · 1 min · long8v