In reinforcement learning, a greedy policy is a decision-making strategy that always selects the action with the highest estimated value in a given state. Formally, a greedy policy chooses the action that maximizes the action-value function (also called the Q-function) for each state, without any randomness or exploration. Greedy policies play a central role in many reinforcement learning algorithms, including Q-learning, value iteration, and policy iteration, where they serve as the mechanism through which learned value estimates are converted into concrete behavior.
A greedy policy is defined as the policy that, for every state s, selects the action a that maximizes the estimated action-value function Q(s, a):
pi(s) = argmax_a Q(s, a)
Here, pi(s) denotes the action chosen by the policy in state s, and Q(s, a) represents the expected cumulative reward from taking action a in state s and following the optimal policy thereafter. The argmax operator selects the action yielding the highest Q-value. If multiple actions share the same maximum Q-value, the tie can be broken arbitrarily (for example, by selecting the first such action or by random selection among tied actions).
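As a minimal illustration, here is a Python sketch of greedy action selection from a tabular Q-function, with ties broken uniformly at random; the Q-table contents and the state/action names are invented for the example.

```python
import random

def greedy_action(Q, state, actions):
    """Return an action maximizing Q[(state, a)], breaking ties at random."""
    q_values = [Q.get((state, a), 0.0) for a in actions]
    best = max(q_values)
    # Collect every action whose Q-value equals the maximum, then pick one at random.
    best_actions = [a for a, q in zip(actions, q_values) if q == best]
    return random.choice(best_actions)

# Hypothetical Q-table for a single state with three actions.
Q = {("s0", "left"): 1.2, ("s0", "right"): 1.2, ("s0", "stay"): 0.3}
print(greedy_action(Q, "s0", ["left", "right", "stay"]))  # "left" or "right"
```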
In the context of state-value functions rather than action-value functions, a greedy policy selects the action that maximizes the expected immediate reward plus the discounted value of the next state:
pi(s) = argmax_a [ R(s, a) + gamma * sum_s' P(s' | s, a) * V(s') ]
where R(s, a) is the expected reward, gamma is the discount factor, P(s' | s, a) is the state transition probability, and V(s') is the value of the successor state.
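The same one-step lookahead can be written as a short sketch, assuming the model is available as nested dictionaries R[s][a] and P[s][a][s'] (a layout chosen just for this example):

```python
def greedy_from_v(state, actions, R, P, V, gamma=0.9):
    """Pick the action maximizing expected reward plus discounted next-state value."""
    def lookahead(a):
        # R(s, a) + gamma * sum over s' of P(s' | s, a) * V(s')
        return R[state][a] + gamma * sum(p * V[s_next]
                                         for s_next, p in P[state][a].items())
    return max(actions, key=lookahead)

# Tiny made-up MDP fragment: from "s0", "go" reaches "s1" or "s2" stochastically.
R = {"s0": {"go": 1.0, "wait": 0.0}}
P = {"s0": {"go": {"s1": 0.8, "s2": 0.2}, "wait": {"s0": 1.0}}}
V = {"s0": 0.0, "s1": 5.0, "s2": -1.0}
print(greedy_from_v("s0", ["go", "wait"], R, P, V))  # "go"
```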
A greedy policy represents pure exploitation: the agent always acts on its current best knowledge without ever trying new or uncertain actions. In the exploration-exploitation framework, exploitation means selecting the action that the agent currently believes will yield the highest return, based on its learned value estimates.
This is a strength in settings where the agent has accurate value estimates, because the greedy policy directly translates good knowledge into high-reward behavior. However, pure exploitation becomes problematic when the agent's value estimates are incomplete or inaccurate, which is often the case early in learning or in environments with stochastic dynamics.
Policy iteration is a dynamic programming algorithm that alternates between two steps: policy evaluation (computing the value function for the current policy) and policy improvement (constructing a new greedy policy with respect to the current value function). The policy improvement theorem guarantees that the greedy policy with respect to the value function of any policy pi is at least as good as pi itself. By repeatedly evaluating and improving, policy iteration converges to the optimal policy in a finite number of steps for finite Markov decision processes (MDPs).
The policy improvement step works as follows: given the current value function V_pi, construct a new policy pi' such that for every state s, pi'(s) = argmax_a Q_pi(s, a). The policy improvement theorem then states that V_pi'(s) >= V_pi(s) for all states s. Equality holds for all states only when both policies are already optimal.
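A compact, illustrative sketch of policy iteration on a small tabular MDP, reusing the R/P dictionary layout assumed above; iterative evaluation is used here rather than an exact linear solve.

```python
def policy_iteration(states, actions, R, P, gamma=0.9, eval_tol=1e-8):
    """Alternate policy evaluation and greedy policy improvement until stable."""
    policy = {s: actions[0] for s in states}      # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: sweep the Bellman expectation update for the current policy.
        while True:
            delta = 0.0
            for s in states:
                a = policy[s]
                v_new = R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < eval_tol:
                break
        # Policy improvement: make the policy greedy with respect to the evaluated V.
        stable = True
        for s in states:
            best = max(actions, key=lambda a: R[s][a] +
                       gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return policy, V
```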
Value iteration combines the evaluation and improvement steps into a single update. At each iteration, the Bellman optimality equation is applied as the update rule:
V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s' | s, a) * V(s') ]
The max operation here is equivalent to acting greedily: the algorithm implicitly constructs a greedy policy at each step. Once the value function converges (that is, when the maximum change across all states falls below a threshold), extracting the greedy policy from the converged value function yields the optimal policy. The convergence is guaranteed because the Bellman optimality operator is a contraction mapping with respect to the infinity norm.
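A minimal value iteration sketch under the same assumed model layout; the greedy policy is extracted once the largest per-sweep change drops below the tolerance.

```python
def value_iteration(states, actions, R, P, gamma=0.9, tol=1e-8):
    """Apply the Bellman optimality update until convergence, then act greedily."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Greedy backup: max over actions of reward plus discounted next-state value.
            v_new = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                        for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    # Extract the greedy policy from the converged value function.
    policy = {s: max(actions, key=lambda a: R[s][a] +
                     gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
              for s in states}
    return V, policy
```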
Q-learning, introduced by Christopher Watkins in 1989, is an off-policy temporal difference learning algorithm. Its update rule uses the greedy action in the next state:
Q(s, a) <- Q(s, a) + alpha * [ r + gamma * max_a' Q(s', a') - Q(s, a) ]
The term max_a' Q(s', a') reflects the assumption that the agent will behave greedily from the next state onward. This is what makes Q-learning "off-policy": the behavior policy (which the agent actually follows to collect data) can be exploratory (such as epsilon-greedy), while the target policy (the one being learned) is greedy. Watkins and Dayan proved in 1992 that Q-learning converges to the optimal action-value function with probability 1, provided that all state-action pairs are visited infinitely often and the learning rate satisfies standard stochastic approximation conditions.
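A hedged sketch of the tabular Q-learning update together with an epsilon-greedy behavior policy; the Q-table layout (a dict keyed by state-action pairs) is an assumption of the example.

```python
import random

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Off-policy TD update: bootstrap from the greedy action in s_next."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)   # max_a' Q(s', a')
    td_target = r + gamma * best_next
    td_error = td_target - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Behavior policy: random action with probability epsilon, greedy otherwise."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```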
SARSA (State-Action-Reward-State-Action) is an on-policy temporal difference method. Unlike Q-learning, SARSA updates its value estimates using the action actually taken by the behavior policy rather than the greedy action:
Q(s, a) <- Q(s, a) + alpha * [ r + gamma * Q(s', a') - Q(s, a) ]
Here a' is the action selected by the current policy in state s', not necessarily the greedy action. When SARSA uses an epsilon-greedy behavior policy, it learns the value of that epsilon-greedy policy rather than the optimal greedy policy. This makes SARSA more conservative than Q-learning in risky environments: it accounts for the possibility of exploratory (potentially dangerous) actions, whereas Q-learning assumes optimal behavior going forward.
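The corresponding SARSA update, again as a sketch over the same assumed Q-table layout; note that it bootstraps from a_next, the action the behavior policy actually selected, rather than from the greedy action.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy TD update: bootstrap from the action actually taken in s_next."""
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)
    td_error = td_target - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
```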
In deep Q-networks (DQN), a neural network approximates the action-value function, and the greedy policy is derived by selecting the action with the highest predicted Q-value. DQN uses an epsilon-greedy behavior policy during training, where epsilon is typically annealed from a high value (for example, 1.0) to a low value (for example, 0.01) over the course of training. The target policy remains greedy. Two mechanisms stabilize the learning process: experience replay, which stores transitions and samples random mini-batches for training, and a target network, which is a periodically updated copy of the Q-network used to compute target values.
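A schematic sketch of how the annealed epsilon-greedy behavior policy and the greedy bootstrap target fit together in DQN-style training; q_network and target_network are hypothetical callables returning per-action value arrays, not part of any specific library.

```python
import numpy as np

def linear_epsilon(step, start=1.0, end=0.01, anneal_steps=100_000):
    """Linearly anneal the exploration rate from start down to end."""
    frac = min(step / anneal_steps, 1.0)
    return start + frac * (end - start)

def act(q_network, state, n_actions, epsilon, rng=np.random.default_rng()):
    """Epsilon-greedy behavior policy over the network's predicted Q-values."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(q_network(state)))        # greedy w.r.t. current estimates

def td_targets(target_network, rewards, next_states, dones, gamma=0.99):
    """Greedy bootstrap targets r + gamma * max_a Q_target(s', a), zeroed at episode ends."""
    next_q = np.array([np.max(target_network(s)) for s in next_states])
    return np.asarray(rewards) + gamma * (1.0 - np.asarray(dones)) * next_q
```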
The Bellman optimality equation establishes the theoretical foundation for greedy policies. It states that the optimal value function satisfies:
V*(s) = max_a [ R(s, a) + gamma * sum_s' P(s' | s, a) * V*(s') ]
The max operator in this equation is exactly the greedy selection over actions. Any policy that is greedy with respect to the optimal value function V* is an optimal policy. More precisely, the Bellman optimality equation asserts that the optimal policy can always be obtained by acting greedily with respect to the optimal value function. This means that if an agent has access to perfect value estimates, a greedy policy is provably optimal.
The analogous result for action-value functions is:
Q*(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) * max_a' Q*(s', a')
Once Q* is known, the optimal greedy policy simply selects argmax_a Q*(s, a) in each state.
Generalized policy iteration (GPI) is the overarching framework described by Sutton and Barto (2018) that unifies nearly all reinforcement learning methods. GPI consists of two interacting processes: policy evaluation (estimating the value function for the current policy) and policy improvement (making the policy greedy with respect to the current value function). These two processes compete and cooperate: evaluation makes the value function consistent with the current policy, while improvement changes the policy to be greedy with respect to the current value function, which in turn makes the value function inaccurate for the new policy.
Almost all reinforcement learning algorithms can be described as instances of GPI. The key insight is that as long as both processes continue to update, the policy and value function will converge to the optimal policy and optimal value function, regardless of the specific evaluation and improvement methods used.
The primary limitation of a purely greedy policy is the exploration-exploitation tradeoff. A greedy policy exploits current knowledge but never explores, which means it can fail to discover better actions that have not been tried or whose values have been underestimated. This is especially problematic early in learning, when value estimates are still unreliable, and in stochastic environments where a few lucky outcomes can inflate an action's estimated value.
The multi-armed bandit problem illustrates this tradeoff clearly. An agent facing multiple slot machines (arms) with unknown payout distributions must decide whether to pull the arm with the best observed average payout (exploit) or try other arms that might have higher true payouts (explore). A purely greedy strategy can lock onto a suboptimal arm early if that arm happened to produce a high initial payout.
Several strategies modify the greedy policy to incorporate exploration. The following table compares the most common approaches:
| Strategy | Mechanism | Exploration style | Key parameter | When to use |
|---|---|---|---|---|
| Greedy | Always picks argmax Q(s, a) | None (pure exploitation) | None | When value estimates are accurate |
| Epsilon-greedy | With probability epsilon, pick a random action; otherwise pick argmax | Undirected random | epsilon (exploration rate) | General-purpose; simple to implement |
| Epsilon-greedy with decay | Epsilon decreases over time (for example, epsilon_t = 1/t) | Decreasing random | Initial epsilon, decay schedule | When convergence to greedy is desired |
| Boltzmann (softmax) | Sample actions proportionally to exp(Q(s,a)/tau) | Directed, value-weighted | tau (temperature) | When action values should influence exploration |
| Upper confidence bound (UCB) | Pick argmax [Q(s,a) + c * sqrt(ln(t)/N(a))] | Directed, uncertainty-based | c (confidence parameter) | When visit counts are available |
| Thompson sampling | Sample from posterior distribution of Q-values; act greedily on sample | Directed, Bayesian | Prior distribution | When Bayesian modeling is feasible |
The epsilon-greedy policy is the most widely used modification of the greedy policy. It selects the greedy action with probability 1 - epsilon and a uniformly random action with probability epsilon. Formally, for a state s with action set A(s), each non-greedy action is chosen with probability epsilon / |A(s)|, and the greedy action is chosen with probability 1 - epsilon + epsilon / |A(s)| (the random draw can also land on the greedy action).
The parameter epsilon controls the exploration rate. A common practice is to start with a high epsilon (for example, 1.0, meaning fully random behavior) and decay it over time toward a small value (for example, 0.01), so the agent explores broadly at first and then increasingly exploits its learned knowledge. The decay schedule can be linear, exponential, or follow a formula such as epsilon_t = c / t^(1/3), which has been shown to minimize the regret upper bound for deep epsilon-greedy policies.
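A small sketch of the decay schedules mentioned above; the constants and horizons are arbitrary illustrative choices.

```python
import math

def epsilon_linear(t, start=1.0, end=0.01, horizon=10_000):
    """Linear decay from start to end over a fixed horizon."""
    return max(end, start - (start - end) * t / horizon)

def epsilon_exponential(t, start=1.0, end=0.01, rate=5e-4):
    """Exponential decay toward the floor value end."""
    return end + (start - end) * math.exp(-rate * t)

def epsilon_power(t, c=1.0):
    """epsilon_t = c / t^(1/3), clipped to 1.0 so early steps are fully random."""
    return min(1.0, c / (t ** (1 / 3))) if t > 0 else 1.0

for t in (1, 100, 10_000):
    print(t, round(epsilon_linear(t), 3), round(epsilon_exponential(t), 3),
          round(epsilon_power(t), 3))
```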
Boltzmann exploration (also called softmax exploration) selects actions with probabilities proportional to their exponentiated Q-values:
P(a | s) = exp(Q(s, a) / tau) / sum_b exp(Q(s, b) / tau)
The temperature parameter tau controls the degree of exploration. When tau is high, the distribution becomes nearly uniform (maximum exploration). When tau approaches zero, the distribution concentrates on the greedy action (maximum exploitation). Unlike epsilon-greedy, Boltzmann exploration uses the relative magnitudes of Q-values to guide exploration, meaning that actions with Q-values close to the maximum are selected more often than clearly suboptimal ones.
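A numerically stable sketch of Boltzmann action selection; subtracting the maximum Q-value before exponentiating leaves the probabilities unchanged while avoiding overflow.

```python
import numpy as np

def boltzmann_action(q_values, tau=0.5, rng=np.random.default_rng()):
    """Sample an action with probability proportional to exp(Q(s, a) / tau)."""
    q = np.asarray(q_values, dtype=float)
    prefs = (q - q.max()) / tau                 # shift by the max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q), p=probs))

# With a low temperature, action 0 dominates, action 1 is occasional, action 2 is rare.
print(boltzmann_action([1.0, 0.9, -2.0], tau=0.1))
```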
The upper confidence bound strategy selects the action that maximizes a sum of the estimated Q-value and an exploration bonus:
a_t = argmax_a [ Q(s, a) + c * sqrt( ln(t) / N_t(a) ) ]
Here, N_t(a) is the number of times action a has been selected up to time t, and c is a parameter controlling the degree of exploration. The exploration bonus is inversely proportional to the number of times an action has been tried, so under-explored actions receive a higher bonus. UCB provides theoretical guarantees on regret and is particularly effective in stationary bandit problems.
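A sketch of UCB1-style selection in a bandit setting, assuming running value estimates Q[a] and visit counts N[a]; untried arms are selected first so the bonus term is always well defined.

```python
import math

def ucb_action(Q, N, t, c=2.0):
    """Pick argmax_a [Q(a) + c * sqrt(ln(t) / N(a))], trying each arm once first."""
    for a, n in enumerate(N):
        if n == 0:
            return a                              # untried arm gets priority
    scores = [Q[a] + c * math.sqrt(math.log(t) / N[a]) for a in range(len(Q))]
    return max(range(len(Q)), key=lambda a: scores[a])

Q = [0.4, 0.6, 0.5]        # running mean rewards (made up)
N = [10, 50, 2]            # pull counts per arm
print(ucb_action(Q, N, t=sum(N)))   # arm 2 wins thanks to its large uncertainty bonus
```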
Thompson sampling maintains a posterior distribution over the parameters of the reward distribution for each action. At each time step, the agent samples a value from each action's posterior and acts greedily with respect to those samples. This Bayesian approach naturally balances exploration and exploitation: actions with high uncertainty are sampled more widely, leading to greater exploration, while actions with well-estimated high values are exploited.
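A sketch of Thompson sampling for a Bernoulli bandit with Beta(1, 1) priors; the hidden payout probabilities are invented for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = [0.3, 0.55, 0.5]                  # hidden Bernoulli payout rates (made up)
alpha = np.ones(len(true_probs))               # Beta posterior: 1 + observed successes
beta = np.ones(len(true_probs))                # Beta posterior: 1 + observed failures

for _ in range(2000):
    samples = rng.beta(alpha, beta)            # one draw per arm from its posterior
    a = int(np.argmax(samples))                # act greedily on the sampled values
    reward = float(rng.random() < true_probs[a])   # observe a Bernoulli reward
    alpha[a] += reward                         # update the chosen arm's posterior
    beta[a] += 1 - reward

print("posterior means:", np.round(alpha / (alpha + beta), 2))
```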
A policy is said to be GLIE (greedy in the limit with infinite exploration) if it satisfies two conditions: every state-action pair is visited infinitely often, and the policy converges to the greedy policy in the limit.
The GLIE property is important because it guarantees convergence to the optimal policy in Monte Carlo control methods and certain temporal difference methods. An epsilon-greedy policy can be made GLIE by using a decaying epsilon schedule such as epsilon_t = c / n_t(s), where n_t(s) is the number of visits to state s and 0 < c < 1. Under this schedule, the agent explores less frequently over time but never stops exploring entirely, eventually converging to greedy behavior while having sampled all state-action pairs sufficiently.
The term "greedy" appears in both reinforcement learning and classical algorithm design, but the concepts are distinct:
| Aspect | Greedy policy (RL) | Greedy algorithm (classical CS) |
|---|---|---|
| Domain | Reinforcement learning, MDPs | Combinatorial optimization, graph theory |
| Decision basis | Maximizes estimated Q-value | Makes locally optimal choice at each step |
| Optimality | Optimal when value estimates are exact | Optimal only for problems with greedy-choice property (for example, minimum spanning trees) |
| Revisits decisions | Value estimates are updated over time | Never reconsiders past choices |
| Stochastic | Can be deterministic or probabilistic | Deterministic |
| Examples | Q-learning target policy, policy improvement step | Dijkstra's algorithm, Kruskal's algorithm, Huffman coding |
In classical computer science, a greedy algorithm builds a solution piece by piece, always choosing the locally optimal option at each step without backtracking. This approach yields optimal solutions only for problems exhibiting the greedy-choice property and optimal substructure (for example, activity selection, fractional knapsack). In reinforcement learning, a greedy policy is optimal when the underlying value function is exact, which is guaranteed by the Bellman optimality equation.
In practice, purely greedy policies are rarely used during training. Most reinforcement learning implementations use an epsilon-greedy policy or another exploration strategy during the learning phase and switch to a greedy policy (or near-greedy policy) at deployment time, once the agent's value estimates are sufficiently accurate.
Imagine you are at a candy store and you have tried three types of candy so far. The chocolate was really good, the gummy bear was okay, and the lollipop was not great. A greedy approach means you always pick the chocolate from now on, because it was the best one you have tried. But there are 20 other candies you have never tasted. Some of them might be even better than chocolate. The problem is, you will never know because you never try anything new. That is what a greedy policy does: it always picks the thing it thinks is best right now, even if trying something different might lead to finding something even better.