See also: reinforcement learning, Markov decision process, Q-learning, dynamic programming, value function, Machine learning terms
The Bellman equation is a recursive formula that lies at the heart of reinforcement learning, dynamic programming, and optimal control theory. Named after the American mathematician Richard Bellman, the equation expresses the value of a decision problem at a given point in time as the sum of the immediate reward and the discounted value of future decision problems. In practical terms, it captures a simple but powerful idea: the value of your current situation equals the reward you receive right now, plus the value of wherever you end up next.
First formulated in 1957, the Bellman equation transformed how researchers and engineers approach sequential decision-making. Rather than evaluating every possible sequence of decisions from start to finish, the equation allows a problem to be broken into smaller subproblems that can be solved recursively. This principle, which Bellman called the "principle of optimality," became the foundation of dynamic programming and later the theoretical backbone of modern reinforcement learning algorithms including Q-learning, SARSA, and Deep Q-Networks. The Bellman equation also underlies methods such as value iteration, policy iteration, and temporal difference learning, and it is widely used in operations research, economics, and optimal control theory.
The Bellman equation appears in two main forms: the Bellman expectation equation, which evaluates a fixed policy, and the Bellman optimality equation, which characterizes the best possible policy. Both forms operate on value functions that assign numerical scores to states (or state-action pairs), quantifying how desirable it is for an agent to be in a particular situation within an environment.
Richard Ernest Bellman (August 26, 1920 to March 19, 1984) was an American applied mathematician born in Brooklyn, New York. He studied mathematics at Brooklyn College, graduating with a B.A. in 1941, and later completed his Ph.D. at Princeton University in 1946 under Solomon Lefschetz, with research focused on the stability of differential equations. He spent much of his career at the RAND Corporation and the University of Southern California.
Bellman began working at the RAND Corporation in 1949, initially as a summer researcher. By 1952, he had joined RAND full-time, suspending his teaching position at Stanford University to focus on developing the theory of dynamic programming. His first publication on the subject appeared in 1952, and his foundational book, Dynamic Programming, was published by Princeton University Press in 1957. This book formally introduced both the principle of optimality and the recursive equation that now bears his name.
Bellman chose the term "dynamic programming" deliberately. In his autobiography Eye of the Hurricane (1984), he explained that he needed a name that would sound impressive and avoid political opposition from then-Secretary of Defense Charles Wilson, who had an aversion to the word "research." The word "programming" referred to planning and scheduling (as in linear programming), while "dynamic" conveyed the multi-stage, time-varying nature of the problems. Bellman found that "dynamic programming" sounded impressive enough to avoid scrutiny from funders.
In the same 1957 book, Bellman coined the phrase "curse of dimensionality" to describe the exponential growth of computational requirements as the number of state variables increases. This observation remains one of the central challenges in applying the Bellman equation to real-world problems with large or continuous state spaces.
Over his career, Bellman authored over 600 papers and 41 books. He received the first Norbert Wiener Prize in Applied Mathematics in 1970 and the IEEE Medal of Honor in 1979. He was elected to the National Academy of Sciences in 1983, one year before his death.
Bellman's work influenced fields far beyond operations research. In economics, his recursive approach enabled what is now known as recursive economics. In control engineering, the continuous-time extension of his equation became the Hamilton-Jacobi-Bellman (HJB) equation. And beginning in the late 1980s, researchers in artificial intelligence adopted Bellman's framework as the theoretical foundation for reinforcement learning.
At the core of the Bellman equation lies Bellman's principle of optimality, which states:
An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
This principle allows complex, multi-stage optimization problems to be broken into a sequence of simpler subproblems. Instead of optimizing over all decisions simultaneously, one can solve the problem backward in time (backward induction), determining the optimal decision at each stage given the optimal solution to all subsequent stages. The Bellman equation is the mathematical expression of this recursive relationship.
Before diving into the mathematical formulations, it is important to understand the components that the Bellman equation operates on. These concepts are defined within the framework of a Markov decision process (MDP). An MDP provides a formal model for sequential decision-making under uncertainty and is defined by five components: states, actions, transition probabilities, rewards, and a discount factor. The table below summarizes these components along with the policy and value functions defined on top of them:
| Component | Symbol | Description |
|---|---|---|
| State | s (State space S) | The current situation of the agent in the environment; S is the set of all possible states |
| Action | a (Action space A) | A decision the agent can take; A is the set of all possible actions |
| Policy | pi | A mapping from states to actions (or a probability distribution over actions) |
| Reward | R(s, a) | The immediate numerical signal received after taking action a in state s |
| Transition probability | P(s' | s, a) | The probability of reaching state s' after taking action a in state s |
| Discount factor | gamma | A number between 0 and 1 that controls how much the agent values future rewards relative to immediate rewards |
| Value function | V(s) | The expected cumulative reward from state s onward |
| Action-value function | Q(s, a) | The expected cumulative reward from state s after taking action a |
At each time step, the agent observes the current state s, selects an action a according to its policy pi, receives a reward R(s, a), and transitions to a new state s' according to the transition probabilities P(s'|s, a). The Markov property requires that the transition probabilities depend only on the current state and action, not on the history of prior states.
The agent's objective is to find a policy that maximizes the expected cumulative discounted reward, also known as the return:
G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ... = sum_{k=0}^{infinity} gamma^k * R_{t+k+1}
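To make the return concrete, here is a minimal Python sketch (the reward sequence and discount factor are purely illustrative) that computes G_t for a finite episode by summing discounted rewards directly from the formula above.

```python
# Minimal sketch: compute the discounted return G_t for a finite episode.
# The reward sequence and gamma below are purely illustrative.
rewards = [1.0, 0.0, 2.0, 5.0]   # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}
gamma = 0.9                      # discount factor

G = sum(gamma**k * r for k, r in enumerate(rewards))
print(G)  # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*5.0 = 6.265
```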
The discount factor gamma plays a critical role in the Bellman equation. It is a scalar between 0 and 1 (inclusive) that controls how the agent weights future rewards relative to immediate ones.
| Gamma Value | Behavior | Description |
|---|---|---|
| gamma = 0 | Myopic | The agent cares only about the immediate reward; future consequences are ignored entirely |
| gamma close to 0 | Short-sighted | The agent focuses heavily on near-term rewards with little regard for the distant future |
| gamma close to 1 | Far-sighted | The agent places nearly equal weight on immediate and future rewards, planning over long horizons |
| gamma = 1 | Undiscounted | All future rewards are weighted equally; the total return may diverge in infinite-horizon problems |
In most practical applications, gamma is set between 0.9 and 0.99. Mathematically, the discount factor serves several purposes. First, it ensures that the infinite sum of rewards converges to a finite value when gamma < 1, provided rewards are bounded. Second, it encodes a preference for receiving rewards sooner rather than later, which is both intuitively sensible and consistent with economic theories of time preference. Third, it accounts for uncertainty about the future: there may be a nonzero probability that the process terminates at each step, and gamma can be interpreted as the survival probability per time step. Finally, it provides a tunable knob for controlling the planning horizon of the agent.
Two value functions are central to the Bellman equation framework.
The state-value function V^pi(s) represents the expected return starting from state s and following policy pi thereafter:
V^pi(s) = E_pi[G_t | S_t = s] = E_pi[R_{t+1} + gamma * G_{t+1} | S_t = s]
This function answers the question: "How good is it to be in state s, given that I follow policy pi?"
The action-value function (also called the Q-function) Q^pi(s, a) represents the expected return starting from state s, taking action a, and then following policy pi thereafter:
Q^pi(s, a) = E_pi[G_t | S_t = s, A_t = a] = E_pi[R_{t+1} + gamma * G_{t+1} | S_t = s, A_t = a]
This function answers a more specific question: "How good is it to take action a in state s, and then follow policy pi?" The Q-function is especially important in reinforcement learning because it allows an agent to compare the value of different actions directly without needing a model of the environment's transition dynamics.
The state-value and action-value functions are closely related:
V^pi(s) = sum_a pi(a|s) * Q^pi(s, a)
That is, the value of a state is the expected Q-value across all actions weighted by the policy's action probabilities. Conversely:
Q^pi(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) * V^pi(s')
The Q-value of taking action a in state s is the immediate reward plus the discounted expected value of the resulting next state.
The Bellman expectation equation describes the value of a state (or state-action pair) under a specific, fixed policy pi. It does not attempt to find the best policy; instead, it answers the question: "If the agent follows policy pi forever, what is the expected cumulative reward starting from state s?" It expresses the recursive relationship that any value function must satisfy.
V^pi(s) = sum_a pi(a|s) * sum_s' P(s'|s, a) * [R(s, a) + gamma * V^pi(s')]
This equation states that the value of state s under policy pi equals the expected immediate reward plus the discounted value of the next state, where the expectation is taken over both the action selected by the policy and the resulting state transition. The recursive structure is the defining feature: V^pi appears on both sides of the equation.
Q^pi(s, a) = sum_s' P(s'|s, a) * [R(s, a) + gamma * sum_a' pi(a'|s') * Q^pi(s', a')]
Similarly, the action value for taking action a in state s under policy pi equals the expected immediate reward plus the discounted expected action value at the next state, where the next action is chosen according to pi. This version conditions on a specific action a in state s and then assumes the agent follows policy pi from the next state onward.
For finite MDPs, the Bellman expectation equation for V^pi can be written in matrix form:
V^pi = R^pi + gamma * P^pi * V^pi
where R^pi is a vector of expected immediate rewards under policy pi, and P^pi is the state transition matrix under policy pi. This is a system of linear equations that can be solved directly:
V^pi = (I - gamma * P^pi)^{-1} * R^pi
where I is the identity matrix. This closed-form solution exists whenever gamma < 1 because (I - gamma * P^pi) is guaranteed to be invertible. However, direct matrix inversion has O(|S|^3) complexity, making it impractical for large state spaces.
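As an illustration, the following NumPy sketch solves this linear system for a hypothetical two-state MDP; the transition matrix P^pi and reward vector R^pi are made-up values, and np.linalg.solve is used instead of forming the inverse explicitly, which is the numerically preferable way to evaluate the closed-form solution.

```python
import numpy as np

# Minimal sketch: solve the Bellman expectation equation in matrix form,
# V^pi = (I - gamma * P^pi)^{-1} R^pi, for a hypothetical two-state MDP.
gamma = 0.9
P_pi = np.array([[0.8, 0.2],      # P^pi[i, j] = probability of moving i -> j under pi
                 [0.3, 0.7]])
R_pi = np.array([1.0, -0.5])      # expected immediate reward in each state under pi

I = np.eye(2)
V_pi = np.linalg.solve(I - gamma * P_pi, R_pi)  # avoids computing the inverse explicitly
print(V_pi)
```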
The Bellman optimality equation characterizes the value functions for an optimal policy pi*. Instead of averaging over actions according to a fixed policy, the optimality equation selects the action that maximizes value. This single change, replacing the expectation over actions with a maximization, is the key distinction between the expectation and optimality forms.
The optimal value function V*(s) is defined as the maximum value function over all possible policies:
V*(s) = max_pi V^pi(s) for all s in S
Similarly, the optimal action-value function is:
Q*(s, a) = max_pi Q^pi(s, a) for all s in S, a in A
V*(s) = max_a { R(s, a) + gamma * sum_s' P(s'|s, a) * V*(s') }
The optimal value of a state equals the maximum expected return achievable by any action, assuming optimal behavior from that point forward. The key difference from the expectation equation is the max operator, which replaces the averaging over actions dictated by a fixed policy.
Q*(s, a) = sum_s' P(s'|s, a) * [R(s, a) + gamma * max_a' Q*(s', a')]
Once Q* is known, the optimal policy can be extracted by simply choosing the action with the highest Q-value in each state:
pi*(s) = argmax_a Q*(s, a)
This property makes Q* particularly useful in practice because it directly yields the optimal policy without requiring separate policy representation.
An important distinction between the expectation and optimality equations is that the Bellman expectation equation is linear (for a fixed policy), while the Bellman optimality equation is nonlinear due to the max operator. This means the optimality equation generally cannot be solved by direct matrix methods. Instead, iterative algorithms are required.
The Bellman equation can be derived from the definition of the value function and the law of total expectation.
Starting from the definition of the state-value function:
V^pi(s) = E_pi[G_t | S_t = s]
Substituting the definition of the return G_t:
V^pi(s) = E_pi[R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ... | S_t = s]
Factoring out gamma from the second term onward:
V^pi(s) = E_pi[R_{t+1} + gamma * (R_{t+2} + gamma * R_{t+3} + ...) | S_t = s]
Recognizing that (R_{t+2} + gamma * R_{t+3} + ...) = G_{t+1}:
V^pi(s) = E_pi[R_{t+1} + gamma * G_{t+1} | S_t = s]
Applying the law of total expectation, conditioning on the action A_t and next state S_{t+1}:
V^pi(s) = sum_a pi(a|s) * sum_s' P(s'|s, a) * [R(s, a) + gamma * E_pi[G_{t+1} | S_{t+1} = s']]
By the Markov property, E_pi[G_{t+1} | S_{t+1} = s'] = V^pi(s'), yielding:
V^pi(s) = sum_a pi(a|s) * sum_s' P(s'|s, a) * [R(s, a) + gamma * V^pi(s')]
This is the Bellman expectation equation. The derivation for Q^pi(s, a) follows a similar pattern. The Bellman optimality equation is then obtained by taking the maximum over all possible actions instead of averaging according to a fixed policy.
The Bellman backup operator (sometimes called the Bellman operator) is the mathematical operation defined by the right-hand side of the Bellman equation. When applied to a value function, it produces an updated value function that is one step closer to the true solution. The Bellman equation can be expressed in terms of Bellman operators, which provide a compact and theoretically powerful way to analyze convergence properties.
The Bellman expectation operator T^pi is defined as:
(T^pi V)(s) = sum_a pi(a|s) * sum_s' P(s'|s, a) * [R(s, a) + gamma * V(s')]
The Bellman optimality operator T is defined as:
(T V)(s) = max_a { sum_s' P(s'|s, a) * [R(s, a) + gamma * V(s')] }
The term "backup" refers to the direction of information flow: value information is "backed up" from successor states to the current state. Both operators have a crucial mathematical property: they are contraction mappings under the supremum norm (also called the infinity norm) when gamma < 1, with contraction factor gamma. Specifically:
||T V_1 - T V_2||_infinity <= gamma * ||V_1 - V_2||_infinity
By the Banach fixed-point theorem, a contraction mapping on a complete metric space has a unique fixed point, and repeated application of the operator converges to this fixed point. This property is the theoretical foundation for the convergence of value iteration and policy evaluation algorithms. The rate of convergence is governed by the discount factor gamma. A smaller gamma means faster convergence but a more myopic agent.
The Bellman equation is not merely a theoretical construct; it directly yields practical algorithms for solving MDPs when the transition probabilities and rewards are known. These algorithms fall under the umbrella of dynamic programming.
Policy evaluation (also called prediction) is the task of computing the value function V^pi for a given policy pi. This amounts to solving the Bellman expectation equation. The algorithm repeatedly applies the Bellman expectation backup to an arbitrary initial estimate V_0:
V_{k+1}(s) = sum_a pi(a|s) * sum_s' P(s'|s, a) * [R(s, a) + gamma * V_k(s')]
This process is called iterative policy evaluation; it is a prediction task because it predicts the expected return under a known policy. Because T^pi is a contraction mapping, V_k converges to V^pi as k approaches infinity, regardless of the initial values.
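A minimal tabular sketch of iterative policy evaluation is shown below; it assumes the model is given as NumPy arrays P[s, a, s'] (transition probabilities), R[s, a] (expected rewards), and pi[s, a] (policy action probabilities), which are illustrative conventions rather than a standard API.

```python
import numpy as np

# Minimal sketch of iterative policy evaluation for a finite MDP.
def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # Bellman expectation backup:
        # V_{k+1}(s) = sum_a pi(a|s) * sum_s' P(s'|s, a) * [R(s, a) + gamma * V_k(s')]
        Q = R + gamma * P @ V            # Q[s, a] = R(s, a) + gamma * sum_s' P(s'|s, a) * V(s')
        V_new = np.sum(pi * Q, axis=1)   # average over actions according to pi
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```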
Policy improvement takes a value function V^pi computed for the current policy and produces a new, improved policy pi'. The improvement is achieved by acting greedily with respect to the value function:
pi'(s) = argmax_a { sum_s' P(s'|s, a) * [R(s, a) + gamma * V^pi(s')] }
The policy improvement theorem guarantees that the resulting policy pi' is at least as good as pi for every state:
V^{pi'}(s) >= V^pi(s) for all s in S
If no improvement occurs (V^{pi'} = V^pi), then the current policy is already optimal.
Policy iteration alternates between two phases to find the optimal policy:
1. Policy evaluation: compute V^pi for the current policy pi by solving (or iteratively approximating) the Bellman expectation equation.
2. Policy improvement: construct a new policy pi' that acts greedily with respect to V^pi.
Policy iteration is guaranteed to converge to the optimal policy in a finite number of steps for finite MDPs. Each policy improvement step produces a policy that is at least as good as the previous one, and since there are finitely many deterministic policies, the process must terminate. Policy iteration typically converges in fewer iterations than value iteration because each step performs a complete policy evaluation. However, each policy evaluation step is itself computationally expensive (requiring either iterative computation or O(|S|^3) matrix inversion). In practice, the total computation time depends on the problem structure.
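The following sketch implements policy iteration under the same illustrative tabular conventions as above, reusing the policy_evaluation routine from the previous sketch.

```python
import numpy as np

# Minimal sketch of policy iteration; P[s, a, s'] and R[s, a] are assumed
# tabular arrays as in the policy evaluation sketch above.
def policy_iteration(P, R, gamma=0.9):
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)     # start with an arbitrary deterministic policy
    while True:
        # 1. Policy evaluation: V^pi for the current deterministic policy
        pi = np.eye(n_actions)[policy]          # one-hot encode as action probabilities
        V = policy_evaluation(P, R, pi, gamma)
        # 2. Policy improvement: act greedily with respect to V^pi
        Q = R + gamma * P @ V
        new_policy = np.argmax(Q, axis=1)
        if np.array_equal(new_policy, policy):  # no change => policy is already optimal
            return policy, V
        policy = new_policy
```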
Value iteration combines the evaluation and improvement steps into a single update by directly applying the Bellman optimality backup:
V_{k+1}(s) = max_a { R(s, a) + gamma * sum_s' P(s'|s, a) * V_k(s') }
Value iteration effectively truncates the policy evaluation step to a single sweep, relying on the contraction property of the Bellman optimality operator to guarantee convergence. Because the Bellman optimality operator is a gamma-contraction, value iteration is guaranteed to converge to V*. The convergence rate is geometric with factor gamma: after k iterations, the error is bounded by gamma^k * ||V_0 - V*||_infinity.
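A corresponding value iteration sketch, again assuming the illustrative tabular arrays P[s, a, s'] and R[s, a], also shows how the greedy optimal policy is read off from the converged Q-values.

```python
import numpy as np

# Minimal sketch of value iteration; the stopping tolerance is an arbitrary
# illustrative choice.
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup:
        # V_{k+1}(s) = max_a { R(s, a) + gamma * sum_s' P(s'|s, a) * V_k(s') }
        Q = R + gamma * P @ V
        V_new = np.max(Q, axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    greedy_policy = np.argmax(Q, axis=1)   # extract pi*(s) = argmax_a Q*(s, a)
    return V_new, greedy_policy
```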
| Method | Update Rule | Convergence | Per-Iteration Cost | When to Use |
|---|---|---|---|---|
| Policy evaluation | Bellman expectation backup | Converges to V^pi | O(|S|^2 * |A|) | Evaluating a known policy |
| Policy iteration | Alternates evaluation and improvement | Finite steps to optimality; typically fewer iterations | O(|S|^3) for exact evaluation | Small to medium state spaces |
| Value iteration | Bellman optimality backup | Converges to V*; often needs many iterations | O(|S|^2 * |A|) per sweep | Large state spaces, when cheap per-sweep updates are preferred |
In reinforcement learning, the agent typically does not have access to the environment's transition model P(s'|s, a) or reward function R(s, a). Instead, the agent must learn from sampled experience. The Bellman equation still provides the foundation, but it is applied through sampling-based approximations rather than exact sweeps over all states.
Temporal difference (TD) learning is a family of model-free reinforcement learning methods that learn value functions from experience without requiring knowledge of the transition probabilities or reward function. TD methods combine ideas from Monte Carlo sampling and dynamic programming, and they rely directly on the Bellman equation. The simplest variant, TD(0), updates the value function after each observed transition:
V(S_t) <- V(S_t) + alpha * [R_{t+1} + gamma * V(S_{t+1}) - V(S_t)]
The quantity in brackets is called the TD error (or TD residual):
delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)
It represents the difference between the one-step Bellman backup estimate R_{t+1} + gamma * V(S_{t+1}) and the current estimate V(S_t). The learning rate alpha controls how much the agent adjusts its estimates based on each new observation. When the value function satisfies the Bellman equation exactly, the expected TD error is zero.
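The update can be written as a one-line function; in this minimal sketch the value table is a plain Python dict and the sample transition is invented for illustration.

```python
# Minimal sketch of a single TD(0) update on a tabular value function V.
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # TD error: delta = r + gamma * V(s') - V(s)
    td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error

V = {}
delta = td0_update(V, s="A", r=1.0, s_next="B")
print(V, delta)  # {'A': 0.1} 1.0
```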
The Bellman residual (also called the Bellman error) generalizes the concept of the TD error to the full Bellman equation. For a given value function V and policy pi, the Bellman residual at state s is:
BR(s) = V(s) - T^pi V(s) = V(s) - sum_a pi(a|s) * sum_s' P(s'|s, a) * [R(s, a) + gamma * V(s')]
When V = V^pi, the Bellman residual is zero everywhere. The Bellman residual provides a measure of how far a given value function is from satisfying the Bellman equation. Some algorithms, known as Bellman residual minimization (BRM) methods, directly minimize the squared Bellman residual as an objective function, which can provide stronger convergence guarantees in certain settings compared to standard TD methods.
Q-learning, introduced by Christopher Watkins in 1989, is a model-free algorithm that directly learns the optimal action-value function Q* by applying the Bellman optimality equation as an update rule, without knowing the environment model. After observing a transition (s, a, r, s'), the agent updates:
Q(S_t, A_t) <- Q(S_t, A_t) + alpha * [R_{t+1} + gamma * max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)]
The target r + gamma * max_a' Q(s', a') is a sampled version of the right-hand side of the Bellman optimality equation for Q*. Q-learning is an off-policy algorithm because the max operator selects the best action for the update regardless of which action the agent actually takes next.
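A minimal tabular sketch of this update is shown below; the state and action labels are arbitrary, and exploration (for example, epsilon-greedy action selection) is omitted for brevity.

```python
from collections import defaultdict

# Minimal sketch of a single tabular Q-learning update; the states, actions,
# and reward below are illustrative.
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Off-policy target: r + gamma * max_a' Q(s', a'),
    # regardless of which action the agent actually takes next.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2, actions=[0, 1])
print(Q[(0, 1)])  # 0.1
```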
SARSA is the on-policy counterpart to Q-learning. Its update uses the Bellman expectation equation instead of the optimality equation:
Q(s, a) <- Q(s, a) + alpha * [r + gamma * Q(s', a') - Q(s, a)]
Here, a' is the action the agent actually selects in state s' according to its current policy, rather than the greedy maximizing action. The name SARSA comes from the quintuple (s, a, r, s', a') used in each update.
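For comparison with the Q-learning sketch above, the SARSA update differs only in its target, which uses the action the agent actually selects next; this sketch reuses the same defaultdict-style Q-table.

```python
# Minimal sketch of the SARSA update; unlike Q-learning, the target uses
# a_next, the action actually chosen in s_next under the current
# (e.g. epsilon-greedy) policy.
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
```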
When the state space is too large for tabular methods (for example, when states are raw pixel images), Deep Q-Networks (DQN) use a neural network to approximate the Q-function. The network is trained to minimize the squared difference between its Q-value predictions and the Bellman optimality backup targets:
Loss = (r + gamma * max_a' Q(s', a'; theta_target) - Q(s, a; theta))^2
DQN, introduced by Mnih et al. at DeepMind in 2013 and published in Nature in 2015, incorporated two key innovations to stabilize training: experience replay (storing past transitions in a buffer and sampling mini-batches for training) and a target network (a periodically updated copy of the Q-network used to compute backup targets). These techniques address the instabilities that arise when a neural network's predictions and training targets are derived from the same rapidly changing parameters.
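The following schematic sketch shows how the backup targets and squared loss are formed, using plain NumPy arrays to stand in for the online and target networks; every number is illustrative, and the actual network forward passes and gradient step are omitted.

```python
import numpy as np

# Schematic sketch of forming DQN training targets with a separate target
# network. q_target_next stands in for Q(s', .; theta_target) and q_pred
# for Q(s, a; theta); all values are illustrative.
def dqn_targets(q_target_next, rewards, dones, gamma=0.99):
    # y = r + gamma * max_a' Q(s', a'; theta_target), with no bootstrap at terminal states
    return rewards + gamma * (1.0 - dones) * q_target_next.max(axis=1)

q_target_next = np.array([[0.5, 1.2], [0.0, 0.3]])  # target-network Q-values at next states
rewards = np.array([1.0, -1.0])
dones = np.array([0.0, 1.0])                        # second transition ended the episode
y = dqn_targets(q_target_next, rewards, dones)
q_pred = np.array([0.8, -0.2])                      # online-network Q(s, a; theta) for the taken actions
loss = np.mean((y - q_pred) ** 2)                   # squared Bellman error minimized by gradient descent
print(y, loss)
```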
Variants such as Double DQN address the overestimation bias inherent in the max operator of the Bellman optimality equation by decoupling action selection from action evaluation.
The Hamilton-Jacobi-Bellman (HJB) equation is the continuous-time analog of the discrete-time Bellman equation. While the standard Bellman equation applies to problems where decisions are made at discrete time steps, the HJB equation governs problems where decisions can be made at every instant. Many problems in engineering, physics, and finance involve continuous dynamics.
The HJB equation takes the form of a nonlinear partial differential equation:
-dV/dt(x, t) = max_u { r(x, u) + (grad_x V(x, t))^T * f(x, u) }
or equivalently, in the minimization (cost) form often used in control:
-dV/dt(x, t) = min_u { L(x, u) + (grad_x V(x, t))^T * f(x, u) }
where x is the continuous state, u is the control input, f(x, u) describes the system dynamics (dx/dt = f(x, u)), r is the running reward, and L is the running cost. The connection to classical physics was first drawn by Rudolf Kalman, who recognized the structural similarity between Bellman's optimality equation and the Hamilton-Jacobi equation from analytical mechanics.
Solving the HJB equation provides both the optimal value function and the optimal control policy. The HJB equation plays a central role in optimal control theory and robotics. For linear systems with quadratic costs (LQR problems), the HJB equation reduces to the algebraic Riccati equation, which has a well-known closed-form solution. For general nonlinear systems, finding analytical solutions is possible only for special problem classes. Numerical methods are required for general nonlinear systems, and the computational cost grows exponentially with the state dimension.
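As an illustration of the LQR special case, the sketch below solves the continuous-time algebraic Riccati equation with SciPy for a simple double-integrator system; the system matrices and cost weights are illustrative choices.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Minimal sketch: for a continuous-time LQR problem (dx/dt = A x + B u, cost
# integral of x'Qx + u'Ru), the HJB equation reduces to the algebraic Riccati
# equation, solved here for an illustrative double-integrator system.
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)          # state cost
R = np.array([[1.0]])  # control cost

P = solve_continuous_are(A, B, Q, R)   # V*(x) = x' P x solves the HJB equation
K = np.linalg.solve(R, B.T @ P)        # optimal feedback law u = -K x
print(K)
```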
Recent work in deep reinforcement learning has explored connections between the HJB equation and continuous-time RL formulations, bridging the gap between classical control theory and modern machine learning approaches.
Bellman himself coined the term "curse of dimensionality" to describe a fundamental computational challenge in dynamic programming. As the number of state variables grows, the number of entries in the value function grows exponentially. A problem with n state variables, each taking k possible values, has k^n states. For even moderate values of n, exhaustive computation of the value function becomes infeasible.
This challenge, the curse of dimensionality, motivates much of modern reinforcement learning research. Key strategies for coping with it include:
| Strategy | Description | Example |
|---|---|---|
| Function approximation | Represent V or Q with a parameterized model instead of a table | Neural networks, linear function approximation, deep reinforcement learning |
| State abstraction | Group similar states together | Tile coding, radial basis functions |
| Hierarchical methods | Decompose the problem into subproblems at different levels of abstraction | Options framework, feudal networks |
| Monte Carlo tree search | Sample promising paths rather than evaluating all states | AlphaGo, AlphaZero |
| Model-based planning | Learn a model and plan using simulated experience | Dyna, MuZero |
The Bellman equation finds applications across numerous fields due to its generality in modeling sequential decision problems.
The Bellman equation is the theoretical backbone of modern reinforcement learning. Nearly every RL algorithm is derived from or inspired by the Bellman equations:
| Algorithm | Bellman Equation Used | Type |
|---|---|---|
| Value iteration | Bellman optimality equation for V* | Dynamic programming |
| Policy iteration | Bellman expectation equation for V^pi | Dynamic programming |
| Q-learning | Bellman optimality equation for Q* | Model-free, off-policy |
| SARSA | Bellman expectation equation for Q^pi | Model-free, on-policy |
| TD(0) | Bellman expectation equation for V^pi | Model-free, on-policy |
| Deep Q-Network (DQN) | Bellman optimality equation for Q* | Deep RL |
Bellman originally developed dynamic programming to address problems in operations research. Inventory management, scheduling, and resource allocation problems are naturally framed as MDPs and solved using Bellman-equation-based methods. In inventory models, the Bellman equation determines optimal reorder quantities at each time period given current stock levels and demand uncertainty. In shortest path problems, the Bellman-Ford algorithm (which applies the Bellman equation to graph routing) finds shortest paths in weighted graphs even when some edge weights are negative, provided the graph contains no negative cycles.
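A minimal Python sketch of Bellman-Ford is shown below; the edge list is illustrative, and a final pass detects negative cycles.

```python
# Minimal sketch of the Bellman-Ford shortest-path algorithm.
def bellman_ford(n_nodes, edges, source):
    INF = float("inf")
    dist = [INF] * n_nodes
    dist[source] = 0.0
    for _ in range(n_nodes - 1):        # relax every edge |V| - 1 times
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    for u, v, w in edges:               # one more pass detects negative cycles
        if dist[u] + w < dist[v]:
            raise ValueError("graph contains a negative cycle")
    return dist

edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, -2.0), (1, 3, 3.0)]
print(bellman_ford(4, edges, source=0))  # [0.0, -1.0, 1.0, 2.0]
```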
The Bellman equation is a foundational tool in modern economics. Key applications include:
Lars Ljungqvist and Thomas Sargent, in their textbook Recursive Macroeconomic Theory, extensively apply the Bellman equation to study monetary policy, fiscal policy, taxation, and labor economics.
In control engineering, the Bellman equation (through the HJB formulation) is used to design optimal controllers for dynamical systems. Applications include autonomous vehicle navigation, robotic manipulation, process control in manufacturing, aerospace trajectory optimization, and path planning. The linear-quadratic regulator (LQR), one of the most widely used controllers in engineering, is a special case where the Bellman equation has a closed-form solution via the Riccati equation.
Sequence alignment algorithms, including those based on hidden Markov models, use dynamic programming with Bellman-style recursions.
Multi-agent decision problems extend the Bellman equation to settings where multiple agents interact, leading to Nash equilibrium concepts for stochastic games.
Imagine you are standing at the start of a treasure hunt with many different paths. Each path has some coins along the way, and you want to collect as many coins as possible by the time you reach the end.
The Bellman equation is like a rule that says: "The best score you can get from where you are standing right now equals the coins you pick up on your very next step, plus the best score you can get from wherever that step takes you." So instead of figuring out the entire path all at once, you only need to think one step ahead, as long as you already know the best score from each place you might land.
If you start at the end of the treasure hunt and work backwards, you can figure out the best score for every spot on the map. The last spots are easy because there are no more steps to take. Then the second-to-last spots are easy because you already know the last spots. You keep going backwards until you reach the start, and by then you know exactly which path collects the most coins.
Another way to think about it: imagine you're playing a video game, and you're in a room with multiple doors. Each door leads to a different room, and you want to find the best path to get the most points. The Bellman equation helps you decide which door to choose by considering the points you would get immediately and the points you would get in the future rooms. In machine learning, the Bellman equation is used to help computer programs (called "agents") make the best decisions when they interact with different environments, like playing a game or navigating a maze.
That is the core idea behind the Bellman equation: solve a big problem by breaking it into tiny, one-step problems, starting from the end and working your way back.
| Term | Definition |
|---|---|
| State (s) | A specific situation or configuration of the environment |
| Action (a) | A choice available to the agent in a given state |
| Policy (pi) | A strategy mapping states to actions (or action probabilities) |
| Reward (R) | The immediate numerical feedback received after taking an action |
| Return (G_t) | The total discounted cumulative reward from time t onward |
| Discount factor (gamma) | A value in [0, 1] controlling the weight of future rewards |
| State-value function V^pi(s) | Expected return from state s following policy pi |
| Action-value function Q^pi(s, a) | Expected return from state s, taking action a, then following pi |
| Bellman expectation equation | Recursive equation for V^pi or Q^pi under a fixed policy |
| Bellman optimality equation | Recursive equation for V* or Q* under the optimal policy |
| Bellman operator | The mapping T or T^pi that defines the Bellman update |
| Bellman residual | The error V(s) - (T^pi V)(s), measuring how far V is from satisfying the Bellman equation |