# Machine learning terms/Reinforcement Learning

> Source: https://aiwiki.ai/wiki/machine_learning_terms_reinforcement_learning
> Updated: 2026-06-27
> Categories: Machine Learning, Reinforcement Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

**Reinforcement learning** (RL) is a branch of [machine learning](/wiki/machine_learning) in which an [agent](/wiki/agent) learns to make sequential decisions by interacting with an [environment](/wiki/environment) and receiving feedback in the form of [rewards](/wiki/reward) or penalties. In the standard reference text, Richard Sutton and Andrew Barto define it as "learning what to do, how to map situations to actions, so as to maximize a numerical reward signal" [1]. Unlike [supervised learning](/wiki/supervised_learning), where a model is trained on labeled examples, an RL agent must discover good behavior through trial and error, balancing the need to gather new information (exploration) with the need to use what it already knows (exploitation). This page is a glossary index of the key machine-learning terms used in reinforcement learning: the core elements (agent, environment, state, action, reward, policy, value function, return, discount factor), the [Markov decision process](/wiki/markov_decision_process_mdp) framework, the main algorithm families (value-based, policy-gradient, actor-critic, and model-based), the exploration-exploitation tradeoff, and modern uses such as [reinforcement learning from human feedback](/wiki/reinforcement_learning_from_human_feedback) (RLHF).

RL has produced many landmark results in artificial intelligence, including [DeepMind](/wiki/deepmind)'s [Atari](/wiki/atari)-playing networks, [AlphaGo](/wiki/alphago), [AlphaZero](/wiki/alphazero), [MuZero](/wiki/muzero), [OpenAI Five](/wiki/openai_five), and the RLHF systems used to align modern [large language models](/wiki/large_language_model) such as [InstructGPT](/wiki/instructgpt), [ChatGPT](/wiki/chatgpt), [Claude](/wiki/claude), and [DeepSeek-R1](/wiki/deepseek-r1). Detailed pages on individual RL terms are linked from the [index of reinforcement learning terms](#index-of-reinforcement-learning-terms) below.

## What are the core elements of reinforcement learning?

The standard reinforcement learning loop is built around an interaction between an [agent](/wiki/agent) and an [environment](/wiki/environment). At each discrete time step, the agent observes a [state](/wiki/state), chooses an [action](/wiki/action) according to its [policy](/wiki/policy), and the environment responds with a new state and a scalar [reward](/wiki/reward). The agent's goal is to maximize the expected cumulative reward, often called the [return](/wiki/return), over time. Sutton and Barto identify trial-and-error search and delayed reward as the two most important distinguishing features of reinforcement learning [1].

| Concept | Symbol | Description |
|---|---|---|
| [Agent](/wiki/agent) | | The learner or decision maker that chooses actions. |
| [Environment](/wiki/environment) | | Everything outside the agent that responds to actions and produces states and rewards. |
| [State](/wiki/state) | s | A representation of the current situation that the agent observes. |
| [Action](/wiki/action) | a | A choice the agent makes at a given state. |
| [Reward](/wiki/reward) | r | A scalar signal indicating how good the most recent transition was. |
| [Policy](/wiki/policy) | π(a\|s) | A mapping from states to actions, possibly stochastic. |
| [Return](/wiki/return) | G | The total discounted future reward from a given time step. |
| [Value function](/wiki/value_function) | V(s) | Expected return starting from state s under a policy. |
| [Action-value function](/wiki/state-action_value_function) | Q(s, a) | Expected return after taking action a in state s and then following the policy. |
| [Discount factor](/wiki/discount_factor) | γ | A number in [0, 1] that reduces the weight of distant rewards. |
| [Trajectory](/wiki/trajectory) | τ | A sequence of states, actions, and rewards. |
| [Episode](/wiki/episode) | | A complete trajectory from an initial state to a terminal state. |
| [Termination condition](/wiki/termination_condition) | | A rule that ends an episode, for example reaching a goal or running out of time. |

A policy can be deterministic, choosing a single action per state, or stochastic, defining a probability distribution over actions. The optimal policy, usually written π*, is one that achieves the highest possible expected return from every state.
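The loop and the return defined above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Corridor` environment, the `random_policy` function, and the reward of 1.0 at the goal are all hypothetical choices made for the example.

```python
import random


class Corridor:
    """Hypothetical toy environment: positions 0..4, the goal is position 4."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # Action 0 moves left, action 1 moves right; walls clamp the position.
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1  # termination condition
        reward = 1.0 if done else 0.0            # scalar reward signal
        return self.position, reward, done


def random_policy(state, rng):
    """A stochastic policy that ignores the state and picks uniformly."""
    return rng.choice([0, 1])


def run_episode(env, policy, rng, max_steps=100):
    """Roll out one episode (or max_steps) and return the list of rewards."""
    state = env.reset()
    rewards = []
    for _ in range(max_steps):
        action = policy(state, rng)
        state, reward, done = env.step(action)
        rewards.append(reward)
        if done:
            break
    return rewards


def discounted_return(rewards, gamma):
    """G = r_1 + gamma * r_2 + gamma^2 * r_3 + ..., accumulated backwards."""
    g = 0.0
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g


rng = random.Random(0)
rewards = run_episode(Corridor(), random_policy, rng)
print(len(rewards), discounted_return(rewards, gamma=0.9))
```

With γ below 1, an episode that reaches the goal sooner earns a larger return, which is what pushes a learning agent toward shorter paths.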

## What is a Markov decision process?

Most RL problems are modeled as a [Markov decision process](/wiki/markov_decision_process_mdp) (MDP), defined by a tuple (S, A, P, R, γ) where S is the set of states, A is the set of actions, P(s'|s, a) is the transition probability, R(s, a) is the reward function, and γ is the [discount factor](/wiki/discount_factor) [1]. The defining feature is the [Markov property](/wiki/markov_property): the next state depends only on the current state and action, not on the history of how the agent arrived there. When the agent cannot directly observe the full state, the problem is a partially observable MDP (POMDP), which often requires memory based policies built from [recurrent neural networks](/wiki/recurrent_neural_network) or [transformers](/wiki/transformer).

The [Bellman equation](/wiki/bellman_equation) expresses the value of a state as the expected immediate reward plus the discounted value of the next state. For the optimal action-value function, the Bellman optimality equation is:

Q*(s, a) = E[r + γ max_{a'} Q*(s', a')]

Most RL algorithms can be viewed as approximate ways of solving this equation. Classical methods such as [dynamic programming](/wiki/dynamic_programming), value iteration, and policy iteration require a known model of the environment and are described in Sutton and Barto's textbook *Reinforcement Learning: An Introduction* [1].
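As an illustration of solving the Bellman optimality equation with a known model, here is a minimal value iteration sketch. The three-state MDP, its state and action names, and the choice to attach the reward to each transition tuple are assumptions made for the example.

```python
def value_iteration(states, actions, transitions, gamma, tol=1e-8):
    """Return optimal state values for a small, fully known MDP.

    transitions[(s, a)] is a list of (probability, next_state, reward) tuples.
    """
    values = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality backup: best expected one-step lookahead.
            q_values = [
                sum(p * (r + gamma * values[s2]) for p, s2, r in transitions[(s, a)])
                for a in actions
            ]
            best = max(q_values)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values


# Hypothetical MDP: A -> B -> end, with reward 1.0 for entering "end".
states = ["A", "B", "end"]
actions = ["stay", "go"]
transitions = {
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "go"): [(1.0, "B", 0.0)],
    ("B", "stay"): [(1.0, "B", 0.0)],
    ("B", "go"): [(1.0, "end", 1.0)],
    ("end", "stay"): [(1.0, "end", 0.0)],  # absorbing terminal state
    ("end", "go"): [(1.0, "end", 0.0)],
}
values = value_iteration(states, actions, transitions, gamma=0.9)
print(values)
```

In this toy problem the value of B is the immediate reward of 1.0 and the value of A is that reward discounted once by γ.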

## What are tabular reinforcement learning methods?

When the state and action spaces are small, RL can be solved with tabular methods that store one value per state or state-action pair.

* [Q-learning](/wiki/q-learning), introduced by Christopher Watkins in his 1989 PhD thesis, is an off-policy [temporal difference](/wiki/temporal_difference_learning) algorithm [2]. The agent updates Q(s, a) toward r + γ max_{a'} Q(s', a'). [Tabular Q-learning](/wiki/tabular_q-learning) converges to the optimal action-value function, and hence to an optimal greedy policy, provided every state-action pair is visited infinitely often and the learning rate decays appropriately.
* SARSA (state, action, reward, state, action), described by Rummery and Niranjan in 1994, is an on-policy variant that updates toward r + γ Q(s', a') using the action actually taken under the current policy.
* Monte Carlo methods estimate value functions by averaging returns from complete episodes.
* Dyna-Q, proposed by Richard Sutton in 1990, blends real experience with simulated experience from a learned model, which is one of the earliest examples of model-based RL.

These algorithms typically use an [epsilon greedy policy](/wiki/epsilon_greedy_policy) for exploration: with probability ε the agent picks a random action and otherwise it picks the [greedy policy](/wiki/greedy_policy) action. A [random policy](/wiki/random_policy) selects actions uniformly at random and is often used as a baseline.
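The Q-learning update and epsilon greedy exploration described above can be combined into a short tabular sketch. The five-state corridor, the hyperparameter values, and the random tie-breaking rule are assumptions chosen for illustration.

```python
import random

N_STATES = 5      # hypothetical corridor: states 0..4, state 4 is terminal
ACTIONS = (0, 1)  # 0 = left, 1 = right


def step(state, action):
    """Deterministic corridor dynamics with reward 1.0 on reaching the goal."""
    next_state = min(max(state + (1 if action == 1 else -1), 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def epsilon_greedy(q, state, epsilon, rng):
    """With probability epsilon explore, otherwise act greedily (random ties)."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[state])
    return rng.choice([a for a in ACTIONS if q[state][a] == best])


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # one value per (state, action)
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = epsilon_greedy(q, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            # Off-policy target: bootstrap from the best next action.
            bootstrap = 0.0 if done else max(q[next_state])
            target = reward + gamma * bootstrap
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q = q_learning()
greedy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(greedy)
```

Replacing `max(q[next_state])` with the value of the action the agent actually takes next would turn this into SARSA, the on-policy variant.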

## How does value-based deep reinforcement learning work?

For large or continuous state spaces, tabular storage is infeasible and value functions must be approximated, usually with [neural networks](/wiki/neural_network). The combination of [deep learning](/wiki/deep_learning) with RL is known as [deep reinforcement learning](/wiki/deep_reinforcement_learning).

* [Deep Q-Network](/wiki/deep_q-network_dqn) (DQN), introduced by Mnih and colleagues at DeepMind in the 2013 arXiv paper *Playing Atari with Deep Reinforcement Learning* [3] and the 2015 *Nature* paper *Human-level control through deep reinforcement learning* [4], parameterizes the [Q-function](/wiki/q-function) with a convolutional neural network. Evaluated on 49 [Atari 2600](/wiki/atari_2600) games, DQN reached a level comparable to a professional human games tester while using the same algorithm, network architecture, and hyperparameters for every game [4].
* Two key stabilization tricks made DQN work. The [replay buffer](/wiki/replay_buffer), also called [experience replay](/wiki/experience_replay), stores past transitions and samples mini batches uniformly to break the correlations between consecutive samples. A separate [target network](/wiki/target_network) copies the online weights periodically and provides stable bootstrap targets [4].
* Double DQN (van Hasselt and colleagues, 2016) decouples action selection from action evaluation to reduce the systematic overestimation bias of standard Q-learning [5].
* Dueling DQN (Wang and colleagues, 2016) splits the network into a state-value stream and an advantage stream, then recombines them, which improves learning when many actions yield similar values [6].
* Prioritized experience replay (Schaul and colleagues, 2016) samples transitions with high temporal difference error more often [7].
* Rainbow DQN (Hessel and colleagues, 2018) combines six DQN improvements, namely double Q-learning, prioritized replay, dueling networks, multi-step targets, distributional RL, and noisy networks, to set new benchmark scores on Atari [8].
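The stabilization tricks above can be sketched without a neural network by treating Q-values as plain lists. This is a simplified illustration: the `ReplayBuffer` class and the two target functions are hypothetical names, and a real agent would compute `next_q_online` and `next_q_target` with its online and target networks.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=0):
        self.storage = deque(maxlen=capacity)  # oldest transitions fall out
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive steps.
        return self.rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


def dqn_target(reward, done, next_q_target, gamma):
    """Standard DQN: the target network selects and evaluates the action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, done, next_q_online, next_q_target, gamma):
    """Double DQN: the online network selects, the target network evaluates."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add(t, 0, 0.0, t + 1, False)
print(len(buffer), buffer.sample(2))
print(dqn_target(1.0, False, [0.5, 2.0], 0.9))
print(double_dqn_target(1.0, False, [3.0, 1.0], [0.5, 2.0], 0.9))
```

In the last two calls the Double DQN target is lower than the standard one because the action preferred by the online estimates is scored by the target estimates instead of taking their maximum.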

## What are policy gradient methods?

Policy gradient methods directly parameterize the policy π_θ(a|s) and update θ to increase expected return using the policy gradient theorem (Sutton, McAllester, Singh, and Mansour, 2000).

* REINFORCE, introduced by Ronald Williams in 1992, computes a Monte Carlo estimate of the policy gradient using complete episode returns [9].
* Actor-critic methods combine a policy network (the actor) with a value network (the [critic](/wiki/critic)) that estimates baselines, reducing variance.
* Asynchronous Advantage Actor-Critic (A3C), introduced by Mnih and colleagues in 2016, runs many parallel actors to decorrelate experience without a replay buffer. Advantage Actor-Critic (A2C) is its synchronous variant.
* [Trust Region Policy Optimization](/wiki/trust_region_policy_optimization) (TRPO), proposed by Schulman and colleagues in 2015, constrains each policy update by a KL-divergence trust region for monotonic improvement [10].
* [Proximal Policy Optimization](/wiki/proximal_policy_optimization) (PPO), introduced by Schulman and colleagues in 2017, replaces TRPO's hard constraint with a clipped surrogate objective [11]. PPO is widely used because it is simple to implement, relatively sample efficient for an on-policy method, and works well on many tasks. It became the default RL backbone for [OpenAI Five](/wiki/openai_five) and for the RLHF stage of [InstructGPT](/wiki/instructgpt) and [ChatGPT](/wiki/chatgpt).
* [Deep Deterministic Policy Gradient](/wiki/deep_deterministic_policy_gradient) (DDPG), proposed by Lillicrap and colleagues in 2016, extends actor-critic to continuous action spaces using off-policy data [12].
* Twin Delayed DDPG (TD3), introduced by Fujimoto, van Hoof, and Meger in 2018, fixes the overestimation bias of DDPG with twin critics, delayed policy updates, and target policy smoothing [13].
* [Soft Actor-Critic](/wiki/soft_actor_critic) (SAC), introduced by Haarnoja and colleagues in 2018, adds an entropy bonus to the objective so that the policy is as random as possible while still maximizing return [14]. SAC is a leading off-policy method for continuous control.
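To make PPO's clipped surrogate objective concrete, the sketch below evaluates it on plain Python lists. The function name, the per-sample inputs, and the default clip range of 0.2 are assumptions for the example; a practical implementation would compute the same quantity on batched tensors and maximize it by gradient ascent.

```python
import math


def ppo_clip_objective(new_logps, old_logps, advantages, clip_eps=0.2):
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) over samples."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        # Probability ratio between the updated policy and the data-collecting one.
        ratio = math.exp(new_lp - old_lp)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # The pessimistic minimum removes the incentive to move the ratio
        # far outside the clip range in the direction the advantage favors.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# A ratio of 1.5 with a positive advantage is capped at 1.2 * advantage.
print(ppo_clip_objective([math.log(1.5)], [0.0], [1.0]))
# A ratio of 0.5 with a negative advantage is capped at 0.8 * advantage.
print(ppo_clip_objective([math.log(0.5)], [0.0], [-1.0]))
```

The clip only binds when the policy has already moved in the direction the advantage rewards, which is how PPO approximates TRPO's trust region with a first-order method.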

## What is the difference between value-based, policy-based, and actor-critic methods?

RL algorithms are often grouped into three families.

| Family | Learns | Typical algorithms | Strengths |
|---|---|---|---|
| Value based | Q(s, a) | Q-learning, SARSA, DQN, Rainbow | Sample efficient, easy to use with discrete actions |
| Policy based | π_θ(a\|s) | REINFORCE, TRPO, PPO | Handles continuous and stochastic actions, smooth policy improvement |
| Actor-critic | both | A2C, A3C, DDPG, TD3, SAC | Combines variance reduction of values with flexibility of policies |

Value-based methods such as Q-learning and DQN are usually off-policy, which means they can learn from data collected by a different policy (SARSA is an on-policy exception), while pure policy gradient methods are on-policy. Off-policy actor-critic methods such as DDPG, TD3, and SAC try to combine the best of both worlds.

## What is model-based reinforcement learning?

Model-based RL learns or uses a model of the environment to plan or to generate synthetic experience. This often improves sample efficiency at the cost of additional complexity.

* AlphaZero (Silver and colleagues, 2018) combines [Monte Carlo tree search](/wiki/monte_carlo_tree_search) with a deep network that predicts moves and values, and learns purely from self play [15]. It mastered Go, chess, and shogi from scratch.
* MuZero (Schrittwieser and colleagues, 2020) extends AlphaZero by learning the dynamics of the environment in a latent space, so it does not need a known set of rules [16]. It matched AlphaZero on Go, chess, and shogi and set a new state of the art on the Atari benchmark at the time of publication [16].
* World models (Ha and Schmidhuber, 2018) train a generative model of pixels and learn policies inside the imagined environment.
* Dreamer, DreamerV2, and DreamerV3 (Hafner and colleagues, 2020 to 2023) learn a recurrent latent world model and train an actor-critic by backpropagating through imagined trajectories [17]. DreamerV3 is notable for solving a wide range of tasks with one set of hyperparameters and was the first method to collect diamonds in [Minecraft](/wiki/minecraft) without curriculum [17].
* Other notable model-based methods include PETS, PlaNet, and SimPLe.
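The idea of generating synthetic experience from a learned model can be illustrated with a tabular, Dyna-Q style update. This sketch assumes a deterministic environment, so the model simply remembers the last outcome seen for each state-action pair; the function names and the tiny example transition are hypothetical.

```python
import random
from collections import defaultdict


def td_update(q, state, action, reward, next_state, done, actions, alpha, gamma):
    """One Q-learning backup toward r + gamma * max_a' Q(s', a')."""
    bootstrap = 0.0 if done else max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * bootstrap - q[(state, action)])


def dyna_q_step(q, model, transition, actions, rng,
                alpha=0.1, gamma=0.9, planning_steps=10):
    """Learn from one real transition, then replay simulated ones."""
    state, action, reward, next_state, done = transition
    # 1. Direct RL: update from the real experience.
    td_update(q, state, action, reward, next_state, done, actions, alpha, gamma)
    # 2. Model learning: remember what this state-action pair led to.
    model[(state, action)] = (reward, next_state, done)
    # 3. Planning: extra updates from experience simulated by the model.
    remembered = list(model.items())
    for _ in range(planning_steps):
        (s, a), (r, s2, d) = rng.choice(remembered)
        td_update(q, s, a, r, s2, d, actions, alpha, gamma)


q = defaultdict(float)
model = {}
dyna_q_step(q, model, ("s0", "go", 1.0, "s1", True),
            actions=("go", "stay"), rng=random.Random(0),
            alpha=0.5, planning_steps=3)
print(q[("s0", "go")])
```

One real transition here produces four value updates, which is the sample-efficiency argument for model-based methods in miniature.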

## What is the exploration-exploitation tradeoff?

Exploration is the problem of trying actions whose value is uncertain in order to discover better strategies, while exploitation means using current knowledge to maximize reward. Balancing the two is one of the central challenges of RL [1]. Naive random exploration scales poorly in large or sparse reward problems.

* [Epsilon greedy](/wiki/epsilon_greedy_policy) and Boltzmann (softmax) exploration are simple and widely used.
* Upper Confidence Bound (UCB) methods, formalized by Auer, Cesa-Bianchi, and Fischer in 2002, choose the action with the highest optimistic upper bound on its value [18].
* Thompson sampling samples a model from a posterior over environments and acts greedily with respect to it. The idea goes back to William Thompson's 1933 paper.
* Intrinsic motivation rewards the agent for visiting novel states. Examples include count based bonuses, Random Network Distillation (Burda and colleagues, 2018), and curiosity driven exploration via prediction error (Pathak and colleagues, 2017).
* Maximum entropy methods such as SAC encourage exploration by rewarding randomness in the policy [14].
* Go-Explore (Ecoffet and colleagues, 2021) explicitly remembers promising states and returns to them before exploring further.
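As a small example of optimism-based exploration, the sketch below runs the UCB1 rule on a Bernoulli multi-armed bandit. The arm means, the number of steps, and the seed are arbitrary assumptions chosen for illustration.

```python
import math
import random


def ucb1(arm_means, steps, seed=0):
    """Run UCB1 on a Bernoulli bandit and return how often each arm was pulled."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms    # pulls per arm
    totals = [0.0] * n_arms  # summed reward per arm
    for t in range(1, steps + 1):
        if t <= n_arms:
            arm = t - 1  # initialization: play every arm once
        else:
            # Empirical mean plus a bonus that shrinks as an arm is pulled more.
            arm = max(
                range(n_arms),
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts


counts = ucb1([0.1, 0.5, 0.9], steps=3000)
print(counts)
```

Arms that have been tried only a few times keep a large bonus, so they are revisited occasionally, while most pulls concentrate on the arm with the best estimated value.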

## What is multi-agent reinforcement learning?

Multi-agent RL studies environments where two or more agents interact, possibly cooperating, competing, or both. Classic ideas come from game theory, including Nash equilibria and self play. Notable systems include [OpenAI Five](/wiki/openai_five) for Dota 2, [AlphaStar](/wiki/alphastar) for [StarCraft II](/wiki/starcraft_ii) (Vinyals and colleagues, *Nature* 2019) [19], and CICERO for the language game Diplomacy (Meta AI, 2022). Algorithms include independent Q-learning, MADDPG (Lowe and colleagues, 2017), QMIX (Rashid and colleagues, 2018), and population based training.

## What is reinforcement learning from human feedback (RLHF)?

[Reinforcement learning from human feedback](/wiki/reinforcement_learning_from_human_feedback) (RLHF) trains a model using a learned reward model fitted to human preferences. The standard recipe was popularized by Christiano and colleagues in the 2017 paper *Deep reinforcement learning from human preferences* [20]. It uses three steps:

1. Collect pairs of model outputs and ask humans which one they prefer.
2. Train a [reward model](/wiki/reward_model) to predict these preferences.
3. Fine-tune the policy with an RL algorithm, usually [PPO](/wiki/proximal_policy_optimization), to maximize the reward model under a KL penalty against a reference policy.

RLHF is the central alignment step in [InstructGPT](/wiki/instructgpt) (Ouyang and colleagues, 2022) [21] and [ChatGPT](/wiki/chatgpt), and similar techniques underlie [Claude](/wiki/claude), [Gemini](/wiki/gemini), and many open source instruction tuned models. Variants and successors include [Direct Preference Optimization](/wiki/direct_preference_optimization) (DPO) by Rafailov and colleagues in 2023, which removes the explicit reward model [22], [reinforcement learning from AI feedback](/wiki/reinforcement_learning_from_human_feedback) (RLAIF), [Constitutional AI](/wiki/constitutional_ai) (Bai and colleagues, Anthropic 2022) [23], and Identity Preference Optimization (IPO).
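Two pieces of the recipe above are easy to show in isolation: the pairwise loss commonly used to fit a reward model to preferences (a Bradley-Terry style logistic loss) and the KL-penalized reward optimized during fine-tuning. The function names, the single-sample KL estimate, and the value of `beta` are assumptions for this sketch, not a description of any specific system.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """-log sigmoid(r_chosen - r_rejected), computed in a numerically stable way."""
    margin = reward_chosen - reward_rejected
    # log(1 + exp(-margin)) rewritten to avoid overflow for large |margin|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_penalized_reward(reward, logp_policy, logp_reference, beta=0.1):
    """Reward model score minus a penalty for drifting from the reference policy.

    logp_policy - logp_reference is a single-sample estimate of the KL term.
    """
    return reward - beta * (logp_policy - logp_reference)


# The loss is small when the preferred output scores higher, large otherwise.
print(preference_loss(3.0, 0.0), preference_loss(0.0, 3.0))
# A response the policy finds more likely than the reference does is penalized.
print(kl_penalized_reward(1.0, logp_policy=-1.0, logp_reference=-2.0, beta=0.5))
```

The penalty keeps the fine-tuned policy close to the reference model, which limits how far it can exploit errors in the learned reward model.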

## What is reinforcement learning for reasoning models?

A new wave of large language model training uses RL with verifiable rewards (often grading code or math answers automatically) to elicit long chain of thought reasoning.

* [OpenAI o1](/wiki/openai_o1), previewed on 12 September 2024, was the first widely deployed reasoning model trained with large scale RL on chains of thought [24].
* [DeepSeek-R1](/wiki/deepseek-r1) (DeepSeek AI, January 2025) was trained with Group Relative Policy Optimization (GRPO), an actor only RL algorithm that replaces PPO's learned value function with a group baseline computed from multiple sampled responses to the same prompt [25]. GRPO was originally described in the DeepSeekMath paper (Shao and colleagues, 2024) [26]. On the AIME 2024 math competition, RL raised DeepSeek-R1-Zero's pass@1 score from 15.6% to 71.0%, and majority voting lifted it further to 86.7% [25]. The DeepSeek-R1 work was published in *Nature* in September 2025 with a reported training cost of about $294,000 [27].
* DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization), released by ByteDance Seed and Tsinghua in 2025, builds on GRPO with decoupled clipping ranges, dynamic sampling, and token level loss for long chain of thought stability [28].
* Tülu 3 (Allen Institute for AI, 2024) is a fully open post training recipe that combines supervised fine tuning, DPO, and RL with verifiable rewards (RLVR) to reach state of the art results among open weight models [29].
* Other recent methods include ReMax, RLOO, and various length normalized policy optimization variants.
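The group baseline used by GRPO can be sketched as follows: sample several responses to one prompt, score each with a verifiable reward, and normalize the scores within the group. The exact-match reward function, the use of the population standard deviation, and the small `eps` constant are assumptions for this example; implementations differ in these details.

```python
import statistics


def exact_match_reward(answer, reference):
    """Hypothetical verifiable reward: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each response relative to its own group: (r - mean) / std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to the same prompt, graded against the reference "42".
answers = ["42", "41", "7", " 42 "]
rewards = [exact_match_reward(a, "42") for a in answers]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Because the baseline comes from the group itself, no separate value network is needed. When every response in a group receives the same reward, all advantages are zero and the group contributes no learning signal.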

## What frameworks and benchmarks are used for reinforcement learning?

Researchers and engineers usually rely on standard libraries and benchmark suites.

| Project | Maintainer | Description |
|---|---|---|
| [OpenAI Gym](/wiki/openai_gym) | OpenAI (now Farama Foundation as Gymnasium) | Standard environment API including Atari, classic control, and MuJoCo tasks. |
| Gymnasium | Farama Foundation | Maintained fork of Gym used by most current research code. |
| [DeepMind Control Suite](/wiki/deepmind_control_suite) | DeepMind | Continuous control benchmarks built on MuJoCo. |
| MuJoCo | DeepMind (open source) | Physics simulator widely used for continuous control. |
| [Stable Baselines3](/wiki/stable_baselines3) | DLR-RM | PyTorch implementations of PPO, SAC, TD3, DQN, and others. |
| [RLlib](/wiki/rllib) | Anyscale (Ray) | Scalable distributed RL library. |
| [Dopamine](/wiki/dopamine_rl) | Google | Research framework focused on Atari and reproducibility. |
| [CleanRL](/wiki/cleanrl) | Costa Huang and contributors | Single file implementations of RL algorithms for clarity. |
| [Tianshou](/wiki/tianshou) | Tsinghua TSAIL | PyTorch RL library with broad algorithm coverage. |
| [Acme](/wiki/acme_rl) | DeepMind | Distributed agents library. |
| [TRL](/wiki/trl_library) | Hugging Face | RLHF, DPO, GRPO, and PPO for transformer language models. |
| [verl](/wiki/verl) | ByteDance Seed | Volcano Engine reinforcement learning library used for large scale LLM RL. |
| Procgen, MiniGrid, NetHack, Crafter | various | Generalization benchmarks. |

## What are the notable milestones in reinforcement learning?

| Year | Milestone |
|---|---|
| 1959 | Arthur Samuel's checkers program uses temporal difference style learning, an early precursor to modern RL. |
| 1989 | Christopher Watkins introduces Q-learning in his Cambridge PhD thesis [2]. |
| 1992 | Gerald Tesauro's TD-Gammon learns to play backgammon at a world class level using temporal difference learning with a neural network. |
| 1998 | First edition of *Reinforcement Learning: An Introduction* by Sutton and Barto [1]. |
| 2013 | Mnih and colleagues at DeepMind release the original DQN paper on arXiv, learning Atari games from raw pixels [3]. |
| 2015 | DQN paper appears in *Nature*, reporting performance comparable to a professional human games tester across 49 Atari games [4]. |
| 2016 | DeepMind's [AlphaGo](/wiki/alphago) defeats Lee Sedol four games to one in a five-game match in Seoul, 9-15 March [30]. |
| 2017 | Christiano and colleagues publish *Deep RL from human preferences*, laying the foundation for RLHF [20]. |
| 2017 | Schulman and colleagues introduce PPO [11]. |
| 2017 | AlphaZero masters Go, chess, and shogi entirely through self play [15]. |
| 2018 | OpenAI Five defeats a team of former professionals and high-ranked players in restricted Dota 2; in April 2019 it beats the reigning world champions, OG. |
| 2019 | DeepMind's [AlphaStar](/wiki/alphastar) reaches grandmaster level in StarCraft II [19]. |
| 2019 | OpenAI's robot hand solves a Rubik's cube using PPO with domain randomization. |
| 2020 | MuZero matches AlphaZero without knowing the rules [16]. |
| 2022 | OpenAI releases [InstructGPT](/wiki/instructgpt) and then [ChatGPT](/wiki/chatgpt), both trained with PPO based RLHF [21]. |
| 2023 | Hafner and colleagues publish DreamerV3, a world model agent that solves a wide range of tasks with one set of hyperparameters [17]. |
| 2024 | OpenAI launches the o1 reasoning model trained with large scale RL on chains of thought [24]. |
| 2025 | DeepSeek releases [DeepSeek-R1](/wiki/deepseek-r1) and the GRPO recipe, which kicks off widespread adoption of RL with verifiable rewards in open source LLM training [25]. |

## What are the limitations and challenges of reinforcement learning?

Reinforcement learning is powerful but notoriously difficult to use in practice. Common challenges include sample inefficiency (deep RL often needs millions of frames), unstable training due to bootstrapping with function approximation and off-policy data, sensitivity to hyperparameters, [reward hacking](/wiki/reward_hacking) where the agent finds unintended ways to maximize the reward, [credit assignment](/wiki/credit_assignment) over long horizons, and a [sim-to-real gap](/wiki/sim-to-real) that limits transfer from simulation to physical robots. Safety, [interpretability](/wiki/interpretability), and alignment with human intent are active research areas, particularly for RL fine tuned language models.

## What fields are related to reinforcement learning?

RL connects to many other fields. [Imitation learning](/wiki/imitation_learning) and [behavior cloning](/wiki/behavior_cloning) train a policy directly from expert demonstrations. [Inverse reinforcement learning](/wiki/inverse_reinforcement_learning) recovers a reward function from observed behavior. [Offline reinforcement learning](/wiki/offline_reinforcement_learning), also known as batch RL, learns from a fixed dataset without further interaction. [Meta reinforcement learning](/wiki/meta_reinforcement_learning) learns algorithms that adapt quickly to new tasks. [Hierarchical reinforcement learning](/wiki/hierarchical_reinforcement_learning) decomposes long horizon problems into reusable sub policies, often using the options framework of Sutton, Precup, and Singh.

## Index of reinforcement learning terms

- [action](/wiki/action)

- [agent](/wiki/agent)

- [Bellman equation](/wiki/bellman_equation)

- [critic](/wiki/critic)

- [Deep Q-Network (DQN)](/wiki/deep_q-network_dqn)

- [DQN](/wiki/dqn)

- [environment](/wiki/environment)

- [episode](/wiki/episode)

- [epsilon greedy policy](/wiki/epsilon_greedy_policy)

- [experience replay](/wiki/experience_replay)

- [greedy policy](/wiki/greedy_policy)

- [Markov decision process (MDP)](/wiki/markov_decision_process_mdp)

- [Markov property](/wiki/markov_property)

- [policy](/wiki/policy)

- [Q-function](/wiki/q-function)

- [Q-learning](/wiki/q-learning)

- [random policy](/wiki/random_policy)

- [reinforcement learning (RL)](/wiki/reinforcement_learning_rl)

- [replay buffer](/wiki/replay_buffer)

- [return](/wiki/return)

- [reward](/wiki/reward)

- [state](/wiki/state)

- [state-action value function](/wiki/state-action_value_function)

- [tabular Q-learning](/wiki/tabular_q-learning)

- [target network](/wiki/target_network)

- [termination condition](/wiki/termination_condition)

- [trajectory](/wiki/trajectory)

## References

1. Sutton, R. S., and Barto, A. G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press. http://incompleteideas.net/book/the-book-2nd.html
2. Watkins, C. J. C. H. (1989). *Learning from delayed rewards*. PhD thesis, Cambridge University.
3. Mnih, V., et al. (2013). Playing Atari with Deep Reinforcement Learning. arXiv:1312.5602. https://arxiv.org/abs/1312.5602
4. Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. *Nature*, 518(7540), 529-533. https://www.nature.com/articles/nature14236
5. van Hasselt, H., Guez, A., and Silver, D. (2016). Deep reinforcement learning with double Q-learning. AAAI.
6. Wang, Z., et al. (2016). Dueling network architectures for deep reinforcement learning. ICML.
7. Schaul, T., et al. (2016). Prioritized experience replay. ICLR.
8. Hessel, M., et al. (2018). Rainbow: Combining improvements in deep reinforcement learning. AAAI.
9. Williams, R. J. (1992). Simple statistical gradient following algorithms for connectionist reinforcement learning. *Machine Learning*, 8.
10. Schulman, J., et al. (2015). Trust Region Policy Optimization. ICML.
11. Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347. https://arxiv.org/abs/1707.06347
12. Lillicrap, T. P., et al. (2016). Continuous control with deep reinforcement learning. ICLR.
13. Fujimoto, S., van Hoof, H., and Meger, D. (2018). Addressing function approximation error in actor-critic methods. ICML.
14. Haarnoja, T., et al. (2018). Soft actor-critic. ICML.
15. Silver, D., et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self play. *Science*, 362(6419), 1140-1144.
16. Schrittwieser, J., et al. (2020). Mastering Atari, Go, chess, and shogi by planning with a learned model. *Nature*, 588.
17. Hafner, D., et al. (2023). Mastering diverse domains through world models (DreamerV3). arXiv:2301.04104. https://arxiv.org/abs/2301.04104
18. Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. *Machine Learning*, 47.
19. Vinyals, O., et al. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. *Nature*, 575.
20. Christiano, P., et al. (2017). Deep reinforcement learning from human preferences. NeurIPS.
21. Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback (InstructGPT). NeurIPS. https://arxiv.org/abs/2203.02155
22. Rafailov, R., et al. (2023). Direct Preference Optimization: your language model is secretly a reward model. NeurIPS.
23. Bai, Y., et al. (2022). Constitutional AI: harmlessness from AI feedback. arXiv:2212.08073.
24. OpenAI (2024). Introducing OpenAI o1-preview. https://openai.com/index/introducing-openai-o1-preview/
25. DeepSeek-AI (2025). DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948. https://arxiv.org/abs/2501.12948
26. Shao, Z., et al. (2024). DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv:2402.03300.
27. DeepSeek-AI (2025). DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning. *Nature*, 645. https://www.nature.com/articles/s41586-025-09422-z
28. Yu, Q., et al. (2025). DAPO: an open source LLM reinforcement learning system at scale. arXiv:2503.14476.
29. Lambert, N., et al. (2024). Tülu 3: pushing frontiers in open language model post training. arXiv:2411.15124.
30. AlphaGo versus Lee Sedol. Wikipedia. https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol

