Reinforcement Learning Papers
October 9, 2025
If I have left out any important papers, please let me know!
Classical Foundations
- Learning to Predict by the Methods of Temporal Differences (1988)
- Q-learning (1992)
- Reinforcement Learning: An Introduction (2018)
- Playing Atari with Deep Reinforcement Learning (2013)
- Human-level control through deep reinforcement learning (2015)
- Deep Reinforcement Learning with Double Q-learning (2015)
- Deterministic Policy Gradient Algorithms (2014)
- Trust Region Policy Optimization (2015)
- Proximal Policy Optimization Algorithms (2017)
- Continuous control with deep reinforcement learning (2015)
- Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor (2018)
- Rainbow: Combining Improvements in Deep Reinforcement Learning (2017)
- Distributional Reinforcement Learning with Quantile Regression (2017)
Reasoning
- Skywork Open Reasoner 1 Technical Report (2025)
- AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy (2025)
- GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models (2025)
- Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning (2025)
- Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning (2025)
- AM-Thinking-v1: Advancing the Frontier of Reasoning at 32B Scale (2025)
- MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention (2025)
- Hunyuan-A13B (2025)
- POLARIS: A POst-training recipe for scaling reinforcement Learning on Advanced ReasonIng modelS (2025)
- DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level (2025)
- Your Efficient RL Framework Secretly Brings You Off-Policy RL Training (2025)
Infrastructure
- AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning (2025)
- Magistral (2025)
Agents
- SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents (2025)
- DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL (2025)
- AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning (2025)
- WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents (2025)
- DeepSWE: Training a Fully Open-sourced, State-of-the-Art Coding Agent by Scaling RL (2025), https://pretty-radio-b75.notion.site/DeepSWE-Training-a-Fully-Open-sourced-State-of-the-Art-Coding-Agent-by-Scaling-RL-22281902c1468193aabbe9a8c59bbe33
General RL
- Writing-Zero: Bridge the Gap Between Non-verifiable Tasks and Verifiable Rewards (2025)
- Kimi K2: Open Agentic Intelligence (2025)
- Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning (2025)
- The Majority is not always right: RL training for solution aggregation (2025)
- DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization (2025)
- Inference-Time Scaling for Generalist Reward Modeling (2025)
Robotics
- Vision Language Models are In-Context Value Learners (2024)
- End-to-end RL Improves Dexterous Grasping Policies (2025)
Surveys
- Model-based Reinforcement Learning: A Survey (2020)
- Reinforcement Learning for Combinatorial Optimization: A Survey (2020)
- A Comprehensive Survey of Reinforcement Learning: From Algorithms to Practical Challenges (2024)
- Statistical and Algorithmic Foundations of Reinforcement Learning (2025)
- Reinforcement Learning Foundations for Deep Research Systems: A Survey (2025)
- A Survey of Reinforcement Learning for Large Reasoning Models (2025)