See also: Reinforcement learning, Q-learning, Markov decision process
In reinforcement learning (RL), a policy is the strategy or rule that an agent uses to decide which actions to take in a given state of the environment. Formally, a policy is a mapping from states to actions (or probability distributions over actions) that defines the agent's behavior at every point in time. The policy is the central object of interest in RL: the entire goal of training an agent is to find a policy that maximizes the expected cumulative reward over time.
Policies can take many forms, from simple lookup tables in small discrete environments to complex neural networks that process high-dimensional sensory inputs. The concept of a policy is grounded in the mathematical framework of the Markov decision process (MDP), which models sequential decision-making problems as a tuple of states, actions, transition probabilities, rewards, and a discount factor. Within this framework, the policy is the component that the agent controls directly, and optimizing it is equivalent to solving the MDP.
The study of policies spans several major research areas, including policy evaluation, policy improvement, policy gradient methods, and policy search. In recent years, the concept of a policy has taken on additional significance in the context of large language model (LLM) alignment, where reinforcement learning from human feedback (RLHF) treats the language model itself as a policy that generates text token by token.
A policy operates within the framework of an MDP, defined by the tuple (S, A, P, R, gamma), where S is the set of states, A is the set of actions, P is the state-transition function, R is the reward function, and gamma is the discount factor.
A policy pi defines how the agent selects actions. There are two principal forms:
| Type | Notation | Description |
|---|---|---|
| Deterministic policy | a = mu(s) | Maps each state s to a single, fixed action a. Given the same state, the agent always takes the same action. |
| Stochastic policy | a ~ pi(a \| s) | Maps each state s to a probability distribution over actions. The agent samples an action from this distribution. |
In the deterministic case, the policy is conventionally denoted by mu, while stochastic policies are typically denoted by pi. When the policy is parameterized (for example, by a neural network with weights theta), it is written as pi_theta(a | s), indicating that the probability of selecting action a in state s depends on the learned parameters theta.
The objective of the agent is to find a policy that maximizes the expected discounted return:
J(pi) = E_pi [ sum_{t=0}^{infinity} gamma^t * R(s_t, a_t) ]
where gamma (0 <= gamma <= 1) is the discount factor that controls the trade-off between immediate and future rewards.
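As a concrete illustration, the return of a single trajectory can be computed directly from its reward sequence. The sketch below (plain Python, with an arbitrary example reward list) simply sums gamma^t * r_t; it is illustrative only.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r_t over one trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Three intermediate rewards of 1.0 followed by a terminal reward of 10.0
print(discounted_return([1.0, 1.0, 1.0, 10.0]))  # 1 + 0.9 + 0.81 + 7.29 ≈ 10.0
```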
The distinction between deterministic and stochastic policies is fundamental in reinforcement learning theory and has practical implications for algorithm design.
A deterministic policy produces a single action for each state: a_t = mu(s_t). This means that given the same state, the agent will always select the same action. Deterministic policies are conceptually simple and computationally efficient, since no randomness is involved in action selection.
Deterministic policies work well in fully observable environments where the agent has complete information about the state. They are commonly found as the output of value-based methods like Q-learning, where the greedy policy selects the action with the highest Q-value: mu(s) = argmax_a Q(s, a).
However, deterministic policies have notable limitations. They do not naturally explore the environment, since the same state always produces the same action. This lack of exploration can prevent the agent from discovering better strategies. In partially observable environments or adversarial settings (such as games like rock-paper-scissors), a deterministic policy can be exploited by an opponent who predicts the agent's actions.
A stochastic policy assigns a probability to each action given a state: pi(a | s) = P(A_t = a | S_t = s). The agent samples from this distribution at each time step, which introduces controlled randomness into decision-making.
Stochastic policies offer several advantages:
| Advantage | Explanation |
|---|---|
| Built-in exploration | The agent naturally tries different actions, which helps it discover new and potentially better strategies without requiring a separate exploration mechanism. |
| Robustness in partial observability | When the agent cannot fully observe the environment state, randomizing over actions can be optimal because the same observation may correspond to different underlying states requiring different responses. |
| Smooth optimization landscape | Small changes to policy parameters produce small changes in action probabilities, making gradient-based optimization more stable. In contrast, deterministic policies can exhibit discontinuous behavior changes from small parameter shifts. |
| Game-theoretic optimality | In competitive multi-agent settings, stochastic (mixed) strategies can be optimal where deterministic strategies are not. |
Two common parameterizations of stochastic policies are the categorical (softmax) policy, which assigns a probability to each action in a discrete action set, and the Gaussian policy, which outputs the mean and standard deviation of a continuous action distribution.
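A minimal NumPy sketch of these two parameterizations is shown below; the feature vector, weight matrices, and dimensions are hypothetical, and the code is meant only to make the state-to-distribution mapping concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

def categorical_policy(theta, phi):
    """Discrete actions: softmax over linear scores theta @ phi, then sample."""
    logits = theta @ phi
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs), probs

def gaussian_policy(W_mean, log_std, phi):
    """Continuous actions: sample from a Gaussian whose mean depends on the state."""
    mean = W_mean @ phi
    return rng.normal(mean, np.exp(log_std))

phi = np.array([0.2, -1.0, 0.5])                                   # hypothetical state features
discrete_action, probs = categorical_policy(rng.normal(size=(4, 3)), phi)
continuous_action = gaussian_policy(rng.normal(size=(2, 3)), np.zeros(2), phi)
```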
As a general guideline, deterministic policies are preferred when the environment is fully observable and non-adversarial, and when the agent only needs to exploit a known-good strategy. Stochastic policies are preferred during training (to facilitate exploration), in partially observable environments, in multi-agent or adversarial settings, and whenever policy gradient methods are used for optimization.
It is worth noting that for any finite MDP there always exists at least one deterministic optimal policy. However, finding that optimal policy often requires the use of stochastic policies during training.
An optimal policy pi* is a policy that achieves the maximum expected cumulative reward from every state. Formally, a policy pi* is optimal if and only if, for all states s and all other policies pi:
V^{pi*}(s) >= V^{pi}(s)
where V^{pi}(s) is the state-value function under policy pi, representing the expected return starting from state s and following policy pi thereafter.
The existence of an optimal policy is guaranteed by the theory of MDPs. The Bellman optimality equation provides a recursive characterization of the optimal value function:
V*(s) = max_a [ R(s, a) + gamma * sum_{s'} P(s' | s, a) * V*(s') ]
Similarly, the optimal action-value function Q* satisfies:
Q*(s, a) = R(s, a) + gamma * sum_{s'} P(s' | s, a) * max_{a'} Q*(s', a')
Once Q* is known, the optimal deterministic policy is obtained by:
pi*(s) = argmax_a Q*(s, a)
Classical dynamic programming algorithms such as value iteration and policy iteration compute the optimal policy by iteratively applying the Bellman equation. These methods require a complete model of the environment (transition probabilities and rewards), making them suitable for planning problems but not for model-free settings where Q-learning and policy gradient methods are used instead.
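As a sketch of how the Bellman optimality equation is applied in practice, the following NumPy value iteration assumes a small tabular MDP supplied as explicit transition and reward arrays (both hypothetical); the greedy argmax it returns is exactly the deterministic policy pi*(s) = argmax_a Q*(s, a) described above.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """P[s, a, s']: transition probabilities; R[s, a]: expected rewards."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_{s'} P(s' | s, a) * V(s')
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)                 # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)    # optimal values and greedy policy
        V = V_new
```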
How a policy is represented computationally determines the types of environments it can handle and the algorithms that can be used to optimize it.
| Representation | Description | Suitable For |
|---|---|---|
| Lookup table | A table storing the action (or action probabilities) for each state. Each state-action pair is an independent entry. | Small, discrete state and action spaces (e.g., gridworlds, simple board games). |
| Linear function approximator | The policy is a linear function of state features: pi(a \| s) = softmax(theta^T * phi(s)). | Moderate-sized problems with well-designed features. |
| Neural network | A deep neural network maps states (or raw observations like images) to actions or action distributions. | Large, continuous, or high-dimensional state and action spaces (e.g., Atari games, robotic control, autonomous driving). |
| Decision tree or rule set | A tree or set of if-then rules that maps states to actions. | Interpretable policies in domains requiring transparency. |
In modern deep reinforcement learning, neural networks are the dominant policy representation. The network takes the current state (or observation) as input and outputs either a single action (deterministic) or parameters of a probability distribution over actions (stochastic). Convolutional neural networks are used when the input is an image, while recurrent neural networks or transformers handle sequential or partially observable inputs.
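For concreteness, here is a minimal PyTorch sketch of a stochastic neural-network policy for a discrete action space; the layer sizes, activation choice, and observation dimension are assumptions for illustration, not a specific published architecture.

```python
import torch
import torch.nn as nn

class DiscretePolicy(nn.Module):
    """Maps an observation vector to a categorical distribution over actions."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),          # one logit per action
        )

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

policy = DiscretePolicy(obs_dim=4, n_actions=2)
dist = policy(torch.randn(4))      # pi_theta(. | s) for a dummy observation
action = dist.sample()             # stochastic action selection
log_prob = dist.log_prob(action)   # used later by policy gradient methods
```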
Two fundamental operations in reinforcement learning are policy evaluation and policy improvement, which together form the basis of the policy iteration algorithm.
Policy evaluation is the process of computing the value function V^{pi}(s) for a given policy pi. This tells the agent how good it is to be in each state when following the current policy. Policy evaluation can be performed exactly (using the Bellman expectation equation) or approximately (using sampling-based methods like Monte Carlo estimation or temporal-difference learning).
The Bellman expectation equation for policy evaluation is:
V^{pi}(s) = sum_a pi(a | s) * [ R(s, a) + gamma * sum_{s'} P(s' | s, a) * V^{pi}(s') ]
Iterative policy evaluation repeatedly applies this equation until the value function converges.
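A compact NumPy sketch of iterative policy evaluation under the same assumptions as the value-iteration example above (a tabular MDP given as arrays P and R, and a stochastic policy given as a state-by-action probability matrix, all hypothetical):

```python
import numpy as np

def evaluate_policy(pi, P, R, gamma=0.95, tol=1e-8):
    """pi[s, a]: action probabilities; P[s, a, s']: transitions; R[s, a]: rewards."""
    V = np.zeros(P.shape[0])
    while True:
        # V(s) = sum_a pi(a|s) * [ R(s, a) + gamma * sum_{s'} P(s'|s, a) * V(s') ]
        V_new = (pi * (R + gamma * (P @ V))).sum(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```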
Given the value function from policy evaluation, policy improvement constructs a new, better policy by acting greedily with respect to the current value function:
pi'(s) = argmax_a [ R(s, a) + gamma * sum_{s'} P(s' | s, a) * V^{pi}(s') ]
The policy improvement theorem guarantees that the new policy pi' is at least as good as the original policy pi, and strictly better unless pi was already optimal.
Policy iteration alternates between evaluation and improvement: the current policy is evaluated to obtain V^{pi}, a new policy is formed by acting greedily with respect to V^{pi}, and the cycle repeats until the policy no longer changes.
Policy iteration is guaranteed to converge to the optimal policy in a finite number of steps for finite MDPs. A variant called generalized policy iteration (GPI) interleaves partial evaluation and improvement steps, and most practical RL algorithms can be viewed as instances of GPI.
RL algorithms are commonly categorized by how they approach the problem of finding a good policy.
| Aspect | Value-Based Methods | Policy-Based Methods |
|---|---|---|
| What is learned | A value function (V or Q), from which a policy is derived implicitly | A parameterized policy pi_theta directly |
| Examples | Q-learning, DQN, SARSA | REINFORCE, PPO, A2C |
| Action spaces | Best suited for discrete action spaces | Handle both discrete and continuous action spaces naturally |
| Policy type | Typically deterministic (greedy w.r.t. value function) | Can represent stochastic policies |
| Optimization | Updates value estimates using Bellman equation backups | Updates policy parameters using gradient ascent on expected reward |
| Stability | Can suffer from instability with function approximation (the "deadly triad") | Policy changes are smooth; small parameter changes produce small behavior changes |
| Convergence | Strong convergence guarantees in tabular settings | Local optima possible, but strong practical performance |
Value-based methods like Q-learning learn an action-value function Q(s, a) and derive a policy by selecting the action with the highest Q-value. These methods are highly effective in discrete action spaces but face challenges in continuous spaces, where finding the argmax over a continuous action domain is computationally expensive.
Policy-based methods parameterize the policy directly and optimize it using gradient-based techniques. They handle continuous action spaces naturally, since the policy can output parameters of a continuous distribution (e.g., the mean and variance of a Gaussian). They also support stochastic policies, which can be essential for exploration and for problems where randomized behavior is optimal.
Policy gradient methods are a family of algorithms that optimize a parameterized policy pi_theta by computing the gradient of the expected return with respect to the policy parameters theta, and then performing gradient ascent.
The theoretical foundation of policy gradient methods is the policy gradient theorem, introduced by Richard Sutton, David McAllester, Satinder Singh, and Yishay Mansour in their landmark paper "Policy Gradient Methods for Reinforcement Learning with Function Approximation," published in Advances in Neural Information Processing Systems (NIPS) in 2000.
The key challenge in computing the gradient of J(theta) is that the expected return depends on the state distribution induced by the policy, which itself depends on theta. Computing the gradient of this state distribution directly is intractable. The policy gradient theorem elegantly sidesteps this issue by showing that the gradient can be expressed as:
∇_theta J(theta) = E_pi [ Q^{pi}(s, a) * ∇_theta ln pi_theta(a | s) ]
This formulation is remarkable because it does not require computing the gradient of the state distribution. The term ∇_theta ln pi_theta(a | s) is the score function (the gradient of the log-probability of taking action a in state s), and Q^{pi}(s, a) weights this gradient by how good the action was. This expectation can be estimated from sample trajectories, making the theorem the basis for practical algorithms.
The REINFORCE algorithm, introduced by Ronald J. Williams in 1992, was the first practical policy gradient method. It uses Monte Carlo sampling to estimate the policy gradient from complete episodes.
The REINFORCE update rule is:
theta <- theta + alpha * gamma^t * G_t * ∇_theta ln pi_theta(a_t | s_t)
where G_t is the return (cumulative discounted reward) from time step t onward, and alpha is the learning rate.
REINFORCE is simple and unbiased, but it suffers from high variance in its gradient estimates because the return G_t can fluctuate significantly across episodes. A common remedy is to subtract a baseline b(s) from the return:
theta <- theta + alpha * (G_t - b(s_t)) * ∇_theta ln pi_theta(a_t | s_t)
A natural choice for the baseline is the state-value function V(s), which does not introduce bias but reduces variance. When the baseline is V(s), the quantity G_t - V(s_t) approximates the advantage function A(s, a) = Q(s, a) - V(s), which measures how much better an action is compared to the average.
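The sketch below shows one REINFORCE update in PyTorch for a single recorded episode, using a simple mean-return baseline rather than a learned V(s) and omitting the gamma^t factor, as is common in practice; the tiny linear policy, the episode data, and the hyperparameters are all hypothetical.

```python
import torch
import torch.nn as nn

policy = nn.Linear(4, 2)                         # logits for 2 actions from 4-dim states
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

states  = torch.randn(5, 4)                      # s_0 ... s_4 from one episode
actions = torch.tensor([0, 1, 1, 0, 1])          # a_0 ... a_4
rewards = [1.0, 0.0, 1.0, 1.0, 0.0]              # r_0 ... r_4
gamma = 0.99

# Returns G_t = sum_{k >= t} gamma^(k - t) * r_k, computed backwards.
returns, G = [], 0.0
for r in reversed(rewards):
    G = r + gamma * G
    returns.insert(0, G)
returns = torch.tensor(returns)
baseline = returns.mean()                        # simple constant baseline b

# Loss is -(G_t - b) * ln pi_theta(a_t | s_t), summed over the episode.
log_probs = torch.distributions.Categorical(logits=policy(states)).log_prob(actions)
loss = -((returns - baseline) * log_probs).sum()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```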
Actor-critic methods address the high variance of REINFORCE by introducing a learned value function (the critic) that evaluates the policy (the actor). Instead of waiting for a complete episode to compute returns, the critic provides bootstrapped value estimates that allow updates after every step.
| Component | Role | What It Learns |
|---|---|---|
| Actor | The policy pi_theta | Which actions to take in each state |
| Critic | The value function V_w(s) or Q_w(s, a) | How good states (or state-action pairs) are under the current policy |
The actor is updated using the policy gradient, with the critic's estimate replacing the Monte Carlo return:
theta <- theta + alpha_theta * A_w(s_t, a_t) * ∇_theta ln pi_theta(a_t | s_t)
where A_w is the advantage estimate from the critic. The critic is updated using temporal-difference (TD) learning:
w <- w + alpha_w * delta_t * ∇_w V_w(s_t)
where delta_t = r_t + gamma * V_w(s_{t+1}) - V_w(s_t) is the TD error.
Actor-critic methods combine the low bias of policy gradient methods with the low variance of value function estimation, producing faster and more stable learning than pure policy gradient approaches.
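A minimal PyTorch sketch of a single actor-critic update using the TD error as the advantage estimate; the linear actor and critic, the shared optimizer, and the dummy transition are hypothetical choices for illustration.

```python
import torch
import torch.nn as nn

actor  = nn.Linear(4, 2)     # policy logits for 2 actions from 4-dim states
critic = nn.Linear(4, 1)     # state-value estimate V_w(s)
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)
gamma = 0.99

def actor_critic_step(s, a, r, s_next, done):
    v = critic(s)
    v_next = critic(s_next).detach()
    td_target = r + gamma * v_next * (1.0 - done)      # bootstrapped target
    td_error = (td_target - v).detach()                # advantage estimate A_w(s, a)

    log_prob = torch.distributions.Categorical(logits=actor(s)).log_prob(a)
    actor_loss = -td_error * log_prob                  # policy gradient step for the actor
    critic_loss = (td_target - v).pow(2)               # TD learning for the critic

    opt.zero_grad()
    (actor_loss + critic_loss).sum().backward()
    opt.step()

actor_critic_step(torch.randn(4), torch.tensor(1), 1.0, torch.randn(4), done=0.0)
```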
A2C (Advantage Actor-Critic) is a synchronous variant of the A3C (Asynchronous Advantage Actor-Critic) algorithm introduced by Mnih et al. in 2016. A2C runs multiple parallel actors that each collect experience in their own copy of the environment, then performs a single synchronized update using data from all actors.
The key features of A2C include:
PPO, introduced by John Schulman and colleagues at OpenAI in 2017, is one of the most widely used policy gradient algorithms in practice. It addresses a fundamental challenge: how to take the largest possible policy improvement step without causing a catastrophic collapse in performance.
PPO achieves this through a clipped surrogate objective. Let r(theta) = pi_theta(a | s) / pi_{theta_old}(a | s) be the probability ratio between the new and old policies. The PPO objective is:
L^{CLIP}(theta) = E [ min( r(theta) * A, clip(r(theta), 1 - epsilon, 1 + epsilon) * A ) ]
where epsilon is a hyperparameter (typically 0.1 or 0.2) and A is the advantage estimate. The clipping operation prevents the ratio r(theta) from moving too far from 1, which limits the size of each policy update.
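A short PyTorch sketch of the clipped surrogate objective written as a loss to minimize; the log-probabilities and advantages are assumed to be precomputed tensors, and in a full implementation this term would be combined with the value-function loss and entropy bonus listed in the table below.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, epsilon=0.2):
    """Negative clipped surrogate objective L^CLIP (to minimize)."""
    ratio = torch.exp(logp_new - logp_old)                          # r(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    return -torch.min(unclipped, clipped).mean()
```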
| PPO Feature | Detail |
|---|---|
| Clipping range | epsilon = 0.1 to 0.2 (controls maximum policy change per update) |
| Data reuse | Multiple epochs of minibatch updates on collected trajectories |
| Advantage estimation | Generalized Advantage Estimation (GAE) with lambda parameter |
| Additional objectives | Value function loss and entropy bonus added to the total loss |
| Computational cost | Lower than TRPO (no conjugate gradient or line search needed) |
PPO has become the default algorithm in many RL applications due to its simplicity, stability, and strong empirical performance. It is used in robotics (OpenAI's Dactyl hand), game AI (OpenAI Five for Dota 2), and LLM alignment (the RL fine-tuning step in RLHF).
Before PPO, TRPO (Schulman et al., 2015) addressed the same problem of safe policy updates by imposing a hard constraint on the KL divergence between the old and new policies:
maximize E [ r(theta) * A ] subject to D_KL(pi_{theta_old} || pi_theta) <= delta
TRPO uses conjugate gradient methods and line search to solve this constrained optimization problem, which makes it more computationally expensive than PPO. While TRPO provides stronger theoretical guarantees on monotonic improvement, PPO's simpler clipping mechanism achieves comparable or better results in practice.
Policies play a central role in the distinction between on-policy and off-policy learning, which is one of the most important taxonomic divisions in reinforcement learning.
| Aspect | On-Policy | Off-Policy |
|---|---|---|
| Definition | The agent learns about the same policy it uses to generate behavior | The agent learns about a target policy different from the behavior policy |
| Data source | Trajectories generated by the current policy pi | Data can come from any policy (historical data, human demonstrations, different agents) |
| Examples | SARSA, REINFORCE, A2C, PPO | Q-learning, DQN, SAC, importance-weighted policy gradient |
| Sample efficiency | Lower (data becomes stale after each policy update) | Higher (can reuse data from old policies via experience replay) |
| Stability | Generally more stable during training | Requires correction techniques (importance sampling, target networks) to remain stable |
| Exploration | The policy being learned must also explore | Exploration and learning can be separated |
On-policy methods learn the value of the policy they are currently following and can only use data generated by that policy. After each policy update, all previously collected data is typically discarded because it was generated under a different policy. This makes on-policy methods less sample-efficient but simpler and more stable.
Off-policy methods learn about one policy (the target policy) while following a different policy (the behavior policy). This allows them to reuse past experience stored in a replay buffer, learn from demonstrations, and separate exploration from exploitation. However, off-policy learning requires techniques like importance sampling to correct for the mismatch between the behavior and target policies, which can introduce variance and instability.
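As a small illustration of that correction, the per-trajectory importance weight is the product of the ratios of target-policy to behavior-policy probabilities for the actions actually taken; the probabilities and rewards below are hypothetical.

```python
import numpy as np

pi_target   = np.array([0.9, 0.7, 0.8])    # pi(a_t | s_t), the target policy
pi_behavior = np.array([0.5, 0.5, 0.5])    # b(a_t | s_t), the behavior policy
rewards = np.array([1.0, 0.0, 2.0])
gamma = 0.99

rho = np.prod(pi_target / pi_behavior)     # trajectory importance weight
G = np.sum(gamma ** np.arange(len(rewards)) * rewards)
corrected_return = rho * G                 # unbiased estimate of the target policy's return
```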
PPO is sometimes described as "on-policy" because it collects fresh data using the current policy, but it reuses each batch of data for several epochs of gradient updates under the clipped objective, blurring the boundary between strictly on-policy and off-policy methods.
Policy search refers to a broad class of methods that optimize the policy directly in policy parameter space, without necessarily computing value functions. While policy gradient methods are one form of policy search, the term also encompasses gradient-free optimization approaches.
Gradient-based policy search methods include REINFORCE, actor-critic algorithms, PPO, and TRPO. These methods compute or estimate the gradient of the expected return and use it to update the policy parameters.
Gradient-free policy search methods treat policy optimization as a black-box optimization problem. They include evolution strategies (ES), the cross-entropy method (CEM), and Bayesian optimization. These approaches evaluate entire policies by running them in the environment and use the resulting scores to guide the search. Gradient-free methods can be advantageous when the gradient is difficult to compute, the reward function is non-differentiable, or the policy parameterization is non-smooth. OpenAI demonstrated in 2017 that evolution strategies can be competitive with gradient-based methods on continuous control benchmarks while offering natural parallelization.
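A minimal sketch of an evolution-strategies update in the style of that 2017 work: the parameters are perturbed with Gaussian noise, each perturbation is scored by a black-box evaluate(theta) function (hypothetical here), and the scores are combined into a gradient estimate.

```python
import numpy as np

def es_update(theta, evaluate, n_samples=50, sigma=0.1, lr=0.02):
    """One evolution-strategies step on a 1-D parameter vector theta.

    `evaluate(theta)` is a hypothetical black-box function returning the
    episodic return of the policy with parameters theta; no gradients needed.
    """
    rng = np.random.default_rng()
    noise = rng.normal(size=(n_samples, theta.size))
    returns = np.array([evaluate(theta + sigma * eps) for eps in noise])
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)   # normalize scores
    grad_estimate = noise.T @ returns / (n_samples * sigma)
    return theta + lr * grad_estimate
```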
Guided policy search (GPS), developed by Sergey Levine and colleagues, combines trajectory optimization with supervised learning. GPS first solves the control problem using local trajectory optimization (which is easier) and then trains a neural network policy to match the optimized trajectories. This approach avoids the poor local optima that can trap direct policy gradient methods.
The concept of a policy has gained significant importance outside traditional RL through its role in reinforcement learning from human feedback (RLHF), the technique used to align large language models like ChatGPT, Claude, and Gemini with human preferences.
In the RLHF framework, the language model itself is treated as a policy:
| RL Concept | RLHF Interpretation |
|---|---|
| State | The prompt plus all tokens generated so far |
| Action | The next token to generate (from a vocabulary of tens of thousands of tokens) |
| Policy | The language model pi_theta, which outputs a probability distribution over the next token |
| Reward | A score from a trained reward model that rates how well the generated text matches human preferences |
| Episode | The generation of a complete response to a prompt |
The RLHF process involves three stages: supervised fine-tuning of a pretrained model on demonstration data, training a reward model on human preference comparisons between candidate responses, and reinforcement-learning fine-tuning of the policy (typically with PPO) against the reward model.
The RL fine-tuning objective includes a KL divergence penalty to prevent the policy from drifting too far from the reference model:
R_total = R_reward_model(response) - lambda * D_KL(pi_theta || pi_reference)
This KL penalty is critical because without it, the policy could learn to produce outputs that "hack" the reward model (achieving high reward scores through degenerate text that does not actually reflect human preferences). By constraining the policy to remain close to the reference model, RLHF preserves the language model's fluency and general capabilities while steering its behavior toward alignment with human values.
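A sketch of how the KL-penalized reward can be computed for a single generated response, assuming the per-token log-probabilities of the sampled tokens under the policy and the frozen reference model are available; the single-sample difference of log-probabilities used here is a common estimate of the KL term, and all names are illustrative.

```python
import torch

def rlhf_reward(reward_model_score, logprobs_policy, logprobs_reference, kl_coef=0.1):
    """KL-penalized reward for one response: R_rm - lambda * KL estimate."""
    # Per-token log pi_theta - log pi_ref for the tokens actually generated,
    # summed over the response (a single-sample estimate of the KL term).
    kl_estimate = (logprobs_policy - logprobs_reference).sum()
    return reward_model_score - kl_coef * kl_estimate
```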
More recent approaches simplify the RLHF pipeline in different ways. Direct Preference Optimization (DPO) eliminates both the separate reward model and the RL loop, optimizing the policy directly on preference data. Group Relative Policy Optimization (GRPO, introduced by DeepSeek) keeps reward-based RL but omits the learned value function entirely, instead sampling multiple outputs per prompt and computing relative advantages within each group.
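A sketch of the group-relative advantage computation at the heart of GRPO: several responses are sampled for the same prompt, each is scored, and each score is normalized against the group's mean and standard deviation; the tensor of scores below is hypothetical.

```python
import torch

def group_relative_advantages(group_rewards):
    """Advantage of each response relative to the other responses for the same prompt."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)

# Hypothetical reward-model scores for 4 responses sampled from one prompt
advantages = group_relative_advantages(torch.tensor([0.2, 1.1, -0.3, 0.8]))
```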
Imagine you are playing a maze game. A policy is like a set of instructions that tells you what to do at every spot in the maze. At this corner, turn left. At that dead end, go back. A really simple policy might be written on a piece of paper with one instruction for every spot. A more complicated policy might be a helper who looks at where you are and figures out the best move right then and there.
Some policies always tell you the same thing at the same spot (these are called deterministic). Other policies might say "turn left 70% of the time and turn right 30% of the time" (these are called stochastic), which can be helpful when you are still figuring out the maze and want to try different paths.
The goal of reinforcement learning is to find the best possible policy, the set of instructions that gets you through the maze the fastest and collects the most treats along the way. The agent tries different things, sees what works, and slowly makes its instructions better and better until it has a really good plan.