# Greedy Policy

> Source: https://aiwiki.ai/wiki/greedy_policy
> Updated: 2026-07-11
> Categories: Machine Learning, Reinforcement Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

In [reinforcement learning](/wiki/reinforcement_learning_rl), a **greedy policy** is a decision rule that, in every state, selects the action with the highest estimated value, formally the action that maximizes the [action-value function](/wiki/q-learning) $$Q(s, a)$$ (written $$\pi(s) = \arg\max_a Q(s, a)$$), with no randomness and no exploration. It is the mechanism that turns learned value estimates into concrete behavior: a greedy policy is provably optimal when its value estimates are exact, but it can lock onto a suboptimal choice when they are not [1]. Greedy action selection sits at the core of many [reinforcement learning](/wiki/reinforcement_learning_rl) algorithms, including [Q-learning](/wiki/q-learning), value iteration, and [policy](/wiki/policy) iteration, and it is the limiting case ($$\epsilon = 0$$) of the widely used [epsilon-greedy](/wiki/epsilon_greedy_policy) strategy.

## What is a greedy policy? (formal definition)

A greedy policy is the policy that, for every state *s*, selects the action $$a$$ that maximizes the estimated action-value function $$Q(s, a)$$:

$$
\pi(s) = \arg\max_a Q(s, a)
$$

Here, $$\pi(s)$$ denotes the action chosen by the policy in state $$s$$, and $$Q(s, a)$$ represents the expected cumulative [reward](/wiki/reward) from taking action $$a$$ in state $$s$$ and following the optimal policy thereafter. The argmax operator selects the action yielding the highest Q-value. If multiple actions share the same maximum Q-value, the tie can be broken arbitrarily (for example, by selecting the first such action or by random selection among tied actions).

In the context of state-value functions rather than action-value functions, a greedy policy selects the action that maximizes the expected immediate reward plus the discounted value of the next state:

$$
\pi(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V(s') \right]
$$

where $$R(s, a)$$ is the expected reward, $$\gamma$$ is the [discount factor](/wiki/discount_factor), $$P(s' \mid s, a)$$ is the state transition probability, and $$V(s')$$ is the value of the successor state. Sutton and Barto note that the term "greedy" is used in computer science to describe any procedure that selects alternatives "based only on local or immediate considerations, without considering the possibility that such a selection may prevent future access to even better alternatives" [1].

## How does a greedy policy relate to exploitation?

A greedy policy represents pure exploitation: the agent always acts on its current best knowledge without ever trying new or uncertain actions. In the exploration-exploitation framework, exploitation means selecting the action that the agent currently believes will yield the highest return, based on its learned value estimates [1].

This is a strength in settings where the agent has accurate value estimates, because the greedy policy directly translates good knowledge into high-reward behavior. However, pure exploitation becomes problematic when the agent's value estimates are incomplete or inaccurate, which is often the case early in learning or in environments with stochastic dynamics.

## How is a greedy policy used in reinforcement learning algorithms?

### Policy iteration

Policy iteration is a [dynamic programming](/wiki/markov_decision_process_mdp) algorithm that alternates between two steps: policy evaluation (computing the value function for the current policy) and policy improvement (constructing a new greedy policy with respect to the current value function). The policy improvement theorem guarantees that the greedy policy with respect to the value function of any policy $$\pi$$ is at least as good as $$\pi$$ itself [1]. By repeatedly evaluating and improving, policy iteration converges to the optimal policy in a finite number of steps for finite [Markov decision processes](/wiki/markov_decision_process_mdp) (MDPs) [1][10].

The policy improvement step works as follows: given the current value function $$V_\pi$$, construct a new policy $$\pi'$$ such that for every state $$s$$, $$\pi'(s) = \arg\max_a Q_\pi(s, a)$$. The policy improvement theorem then states that $$V_{\pi'}(s) \ge V_\pi(s)$$ for all states $$s$$. Equality holds for all states only when both policies are already optimal. Sutton and Barto state the result precisely: "Let pi and pi' be any pair of deterministic policies such that, for all s, q_pi(s, pi'(s)) >= v_pi(s). Then the policy pi' must be as good as, or better than, pi" [1]. They add that, "by construction, the greedy policy meets the conditions of the policy improvement theorem, so we know that it is as good as, or better than, the original policy" [1].

### Value iteration

Value iteration combines the evaluation and improvement steps into a single update. At each iteration, the [Bellman](/wiki/bellman_equation) optimality equation is applied as the update rule:

$$
V(s) \leftarrow \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V(s') \right]
$$

The max operation here is equivalent to acting greedily: the algorithm implicitly constructs a greedy policy at each step. Once the value function converges (that is, when the maximum change across all states falls below a threshold), extracting the greedy policy from the converged value function yields the optimal policy. The convergence is guaranteed because the Bellman optimality operator is a contraction mapping with respect to the infinity norm [1][10].

### Q-learning

[Q-learning](/wiki/q-learning), introduced by Christopher Watkins in 1989, is an off-policy [temporal difference](/wiki/reinforcement_learning_rl) learning algorithm [2]. Its update rule uses the greedy action in the next state:

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$

The term $$\max_{a'} Q(s', a')$$ reflects the assumption that the agent will behave greedily from the next state onward. This is what makes Q-learning "off-policy": the behavior policy (which the agent actually follows to collect data) can be exploratory (such as [epsilon-greedy](/wiki/epsilon_greedy_policy)), while the target policy (the one being learned) is greedy. Watkins and Dayan proved in 1992 that Q-learning converges to the optimal action-value function with probability 1, provided that all state-action pairs are visited (sampled) infinitely often and the learning rate satisfies standard stochastic approximation conditions [3].

### SARSA

SARSA (State-Action-Reward-State-Action) is an on-policy temporal difference method [11]. Unlike Q-learning, SARSA updates its value estimates using the action actually taken by the behavior policy rather than the greedy action:

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma Q(s', a') - Q(s, a) \right]
$$

Here $$a'$$ is the action selected by the current policy in state $$s'$$, not necessarily the greedy action. When SARSA uses an epsilon-greedy behavior policy, it learns the value of that epsilon-greedy policy rather than the optimal greedy policy [1]. This makes SARSA more conservative than Q-learning in risky environments: it accounts for the possibility of exploratory (potentially dangerous) actions, whereas Q-learning assumes optimal behavior going forward.

### Deep Q-networks (DQN)

In [deep Q-networks](/wiki/deep_q-network_dqn) (DQN), a [neural network](/wiki/neural_network) approximates the action-value function, and the greedy policy is derived by selecting the action with the highest predicted Q-value [6]. DQN uses an epsilon-greedy behavior policy during training, where epsilon is annealed (in the original Atari agent) linearly from 1.0 to 0.1 over the first one million frames and then held fixed at 0.1 [6]. The target policy remains greedy. Two mechanisms stabilize the learning process: [experience replay](/wiki/experience_replay), which stores transitions and samples random mini-batches for training, and a [target network](/wiki/target_network), which is a periodically updated copy of the Q-network used to compute target values [6]. The 2015 DQN agent reached human-level or above performance on 29 of the 49 Atari 2600 games it was tested on, learning directly from pixels [6].

## How does a greedy policy connect to the Bellman optimality equation?

The [Bellman optimality equation](/wiki/bellman_equation) establishes the theoretical foundation for greedy policies [4]. It states that the optimal value function satisfies:

$$
V^*(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^*(s') \right]
$$

The max operator in this equation is exactly the greedy selection over actions. Any policy that is greedy with respect to the optimal value function $$V^*$$ is an optimal policy [1]. More precisely, the Bellman optimality equation asserts that the optimal policy can always be obtained by acting greedily with respect to the optimal value function. This means that if an agent has access to perfect value estimates, a greedy policy is provably optimal.

The analogous result for action-value functions is:

$$
Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q^*(s', a')
$$

Once $$Q^*$$ is known, the optimal greedy policy simply selects $$\arg\max_a Q^*(s, a)$$ in each state.

## Generalized policy iteration

Generalized policy iteration (GPI) is the overarching framework described by Sutton and Barto (2018) that unifies nearly all reinforcement learning methods [1]. GPI consists of two interacting processes: policy evaluation (estimating the value function for the current policy) and policy improvement (making the policy greedy with respect to the current value function). These two processes compete and cooperate: evaluation makes the value function consistent with the current policy, while improvement changes the policy to be greedy with respect to the current value function, which in turn makes the value function inaccurate for the new policy.

Almost all reinforcement learning algorithms can be described as instances of GPI. The key insight is that as long as both processes continue to update, the policy and value function will converge to the optimal policy and optimal value function, regardless of the specific evaluation and improvement methods used [1].

## Why is a purely greedy policy a problem? (the exploration-exploitation tradeoff)

The primary limitation of a purely greedy policy is the exploration-exploitation tradeoff. A greedy policy exploits current knowledge but never explores, which means it can fail to discover better actions that have not been tried or whose values have been underestimated [1]. This is especially problematic in the following situations:

- **Early in learning**, when value estimates are based on limited data and may be highly inaccurate
- **In stochastic environments**, where the true value of an action can only be estimated through repeated sampling
- **In non-stationary environments**, where the reward distribution changes over time, so previously suboptimal actions may become optimal
- **In environments with sparse rewards**, where many states yield no reward and greedy behavior may never reach rewarding regions

The multi-armed bandit problem illustrates this tradeoff clearly. An agent facing multiple slot machines (arms) with unknown payout distributions must decide whether to pull the arm with the best observed average payout (exploit) or try other arms that might have higher true payouts (explore) [5]. A purely greedy strategy can lock onto a suboptimal arm early if that arm happened to produce a high initial payout. Lai and Robbins showed in 1985 that, for the bandit problem, the regret of any good allocation rule must grow at least logarithmically in the number of trials, which is impossible for a purely greedy rule that stops sampling alternative arms [9].

## Greedy vs epsilon-greedy and other exploration strategies?

Several strategies modify the greedy policy to incorporate exploration. The following table compares the most common approaches:

| Strategy | Mechanism | Exploration style | Key parameter | When to use |
|---|---|---|---|---|
| Greedy | Always picks $$\arg\max Q(s, a)$$ | None (pure exploitation) | None | When value estimates are accurate |
| [Epsilon-greedy](/wiki/epsilon_greedy_policy) | With probability $$\epsilon$$, pick a random action; otherwise pick $$\arg\max$$ | Undirected random | $$\epsilon$$ (exploration rate) | General-purpose; simple to implement |
| Epsilon-greedy with decay | Epsilon decreases over time (for example, $$\epsilon_t = 1/t$$) | Decreasing random | Initial $$\epsilon$$, decay schedule | When convergence to greedy is desired |
| Boltzmann (softmax) | Sample actions proportionally to $$\exp(Q(s, a) / \tau)$$ | Directed, value-weighted | $$\tau$$ (temperature) | When action values should influence exploration |
| Upper confidence bound (UCB) | Pick $$\arg\max [Q(s, a) + c \sqrt{\ln(t) / N(a)}]$$ | Directed, uncertainty-based | $$c$$ (confidence parameter) | When visit counts are available |
| Thompson sampling | Sample from posterior distribution of Q-values; act greedily on sample | Directed, Bayesian | Prior distribution | When Bayesian modeling is feasible |

### Epsilon-greedy policy

The [epsilon-greedy](/wiki/epsilon_greedy_policy) policy is the most widely used modification of the greedy policy. It selects the greedy action with probability $$1 - \epsilon$$ and a uniformly random action with probability $$\epsilon$$ [1]. Formally:

- With probability $$1 - \epsilon$$: choose $$\arg\max_a Q(s, a)$$
- With probability $$\epsilon$$: choose a random action uniformly from the action space

The parameter $$\epsilon$$ controls the exploration rate. A common practice is to start with a high epsilon (for example, 1.0, meaning fully random behavior) and decay it over time toward a small value (for example, 0.01), so the agent explores broadly at first and then increasingly exploits its learned knowledge. The decay schedule can be linear, exponential, or follow a formula such as $$\epsilon_t = c / t^{1/3}$$, which has been shown to minimize the regret upper bound for deep epsilon-greedy policies [12].

### Boltzmann exploration

Boltzmann exploration (also called [softmax](/wiki/softmax) exploration) selects actions with probabilities proportional to their exponentiated Q-values:

$$
P(a \mid s) = \frac{\exp(Q(s, a) / \tau)}{\sum_b \exp(Q(s, b) / \tau)}
$$

The temperature parameter $$\tau$$ controls the degree of exploration. When $$\tau$$ is high, the distribution becomes nearly uniform (maximum exploration). When $$\tau$$ approaches zero, the distribution concentrates on the greedy action (maximum exploitation). Unlike epsilon-greedy, Boltzmann exploration uses the relative magnitudes of Q-values to guide exploration, meaning that actions with Q-values close to the maximum are selected more often than clearly suboptimal ones.

### Upper confidence bound (UCB)

The upper confidence bound strategy selects the action that maximizes a sum of the estimated Q-value and an exploration bonus:

$$
a_t = \arg\max_a \left[ Q(s, a) + c \sqrt{\frac{\ln(t)}{N_t(a)}} \right]
$$

Here, $$N_t(a)$$ is the number of times action $$a$$ has been selected up to time $$t$$, and $$c$$ is a parameter controlling the degree of exploration. The exploration bonus is inversely proportional to the number of times an action has been tried, so under-explored actions receive a higher bonus. UCB provides theoretical guarantees on regret and is particularly effective in stationary bandit problems [5].

### Thompson sampling

Thompson sampling maintains a posterior distribution over the parameters of the reward distribution for each action [8]. At each time step, the agent samples a value from each action's posterior and acts greedily with respect to those samples. This Bayesian approach naturally balances exploration and exploitation: actions with high uncertainty are sampled more widely, leading to greater exploration, while actions with well-estimated high values are exploited.

## What is GLIE (greedy in the limit with infinite exploration)?

A policy is said to be GLIE (greedy in the limit with infinite exploration) if it satisfies two conditions [7]:

1. Every state-action pair is visited (each action is taken) infinitely often in every state that is visited infinitely often, ensuring complete exploration
2. In the limit, the policy converges to the greedy policy with respect to the learned Q-function with probability 1

The GLIE property is important because it guarantees convergence to the optimal policy in Monte Carlo control methods and certain temporal difference methods [7]. An epsilon-greedy policy can be made GLIE by using a decaying epsilon schedule such as $$\epsilon_k = 1 / k$$, where $$k$$ is the episode number, or $$\epsilon_t = c / n_t(s)$$, where $$n_t(s)$$ is the number of visits to state $$s$$ [7]. Under such a schedule, the agent explores less frequently over time but never stops exploring entirely, eventually converging to greedy behavior while having sampled all state-action pairs sufficiently. Singh, Jaakkola, Littman, and Szepesvari (2000) proved that, for a GLIE policy whose learning rates satisfy the standard stochastic approximation conditions (the sum of learning rates diverges and the sum of squared learning rates is finite), on-policy methods such as SARSA converge to the optimal action-value function with probability 1 [7].

## How does a greedy policy differ from a greedy algorithm in classical CS?

The term "greedy" appears in both reinforcement learning and classical algorithm design, but the concepts are distinct:

| Aspect | Greedy policy (RL) | Greedy algorithm (classical CS) |
|---|---|---|
| Domain | Reinforcement learning, MDPs | Combinatorial optimization, graph theory |
| Decision basis | Maximizes estimated Q-value | Makes locally optimal choice at each step |
| Optimality | Optimal when value estimates are exact | Optimal only for problems with greedy-choice property (for example, minimum spanning trees) |
| Revisits decisions | Value estimates are updated over time | Never reconsiders past choices |
| Stochastic | Can be deterministic or probabilistic | Deterministic |
| Examples | Q-learning target policy, policy improvement step | Dijkstra's algorithm, Kruskal's algorithm, Huffman coding |

In classical computer science, a greedy algorithm builds a solution piece by piece, always choosing the locally optimal option at each step without backtracking. This approach yields optimal solutions only for problems exhibiting the greedy-choice property and optimal substructure (for example, activity selection, fractional knapsack). In reinforcement learning, a greedy policy is optimal when the underlying value function is exact, which is guaranteed by the Bellman optimality equation [1].

## What are the advantages and disadvantages of a greedy policy?

### Advantages

- **Simplicity**: A greedy policy requires only computing an argmax over action values, making it straightforward to implement and computationally efficient.
- **Optimality guarantee**: When value estimates are exact, a greedy policy is provably optimal by the Bellman optimality equation [1].
- **Fast convergence**: In later stages of learning, when value estimates are accurate, acting greedily can accelerate convergence to the optimal policy.
- **Works well in fully observed settings**: In environments where the agent has a complete and accurate model (as in dynamic programming), greedy policies produce optimal behavior directly.

### Disadvantages

- **No exploration**: A purely greedy policy never tries new actions, risking permanent suboptimality if early value estimates are inaccurate.
- **Sensitivity to initial conditions**: If the initial value estimates happen to favor a suboptimal action, a greedy policy may never discover the true optimum.
- **Local optima**: In complex environments with many local optima, a greedy policy can become trapped.
- **Poor performance in stochastic environments**: When rewards are noisy, early estimates can be misleading, and a greedy policy will exploit these unreliable estimates.
- **Failure in sparse reward settings**: If most actions yield zero reward, a greedy policy has no signal to guide behavior and may behave arbitrarily.

## When should a greedy policy be used in practice?

In practice, purely greedy policies are rarely used during training. Most reinforcement learning implementations use an epsilon-greedy policy or another exploration strategy during the learning phase and switch to a greedy policy (or near-greedy policy) at deployment time when the agent's value estimates are sufficiently accurate. Key practical considerations include:

- **Epsilon scheduling**: The decay rate of epsilon significantly affects performance. Decaying too quickly causes premature convergence to a suboptimal policy; decaying too slowly wastes time on unnecessary exploration. Common schedules include linear decay, exponential decay ($$\epsilon_t = \epsilon_0 \cdot \text{decay}^t$$), and inverse-time decay ($$\epsilon_t = \epsilon_0 / (1 + \text{decay} \cdot t)$$).
- **Evaluation vs. training policy**: During evaluation or deployment, epsilon is typically set to zero (pure greedy) to assess the agent's learned behavior without the noise of exploratory actions.
- **Continuous action spaces**: In environments with continuous action spaces, computing the argmax is not straightforward. Methods like DDPG (Deep Deterministic Policy Gradient) learn a deterministic policy network that outputs the greedy action directly, while methods like SAC (Soft Actor-Critic) learn a stochastic policy that incorporates entropy regularization.

## Explain like I'm 5 (ELI5)

Imagine you are at a candy store and you have tried three types of candy so far. The chocolate was really good, the gummy bear was okay, and the lollipop was not great. A greedy approach means you always pick the chocolate from now on, because it was the best one you have tried. But there are 20 other candies you have never tasted. Some of them might be even better than chocolate. The problem is, you will never know because you never try anything new. That is what a greedy policy does: it always picks the thing it thinks is best right now, even if trying something different might lead to finding something even better.

## See also

- [Epsilon-greedy policy](/wiki/epsilon_greedy_policy)
- [Q-learning](/wiki/q-learning)
- [Reinforcement learning](/wiki/reinforcement_learning_rl)
- [Policy](/wiki/policy)
- [Bellman equation](/wiki/bellman_equation)
- [Markov decision process (MDP)](/wiki/markov_decision_process_mdp)
- [Deep Q-network (DQN)](/wiki/deep_q-network_dqn)
- [Softmax](/wiki/softmax)

## References

1. Sutton, R. S., & Barto, A. G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press. Sections 4.2 (Policy Improvement), 4.3 (Policy Iteration), 4.4 (Value Iteration), and 4.7 (Generalized Policy Iteration).
2. Watkins, C. J. C. H. (1989). *Learning from Delayed Rewards*. PhD thesis, King's College, Cambridge.
3. Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. *Machine Learning*, 8(3-4), 279-292.
4. Bellman, R. (1957). *Dynamic Programming*. Princeton University Press.
5. Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. *Machine Learning*, 47(2-3), 235-256.
6. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Hassabis, D. (2015). Human-level control through deep reinforcement learning. *Nature*, 518(7540), 529-533.
7. Singh, S. P., Jaakkola, T., Littman, M. L., & Szepesvari, C. (2000). Convergence results for single-step on-policy reinforcement-learning algorithms. *Machine Learning*, 38(3), 287-308.
8. Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. *Biometrika*, 25(3-4), 285-294.
9. Lai, T. L., & Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. *Advances in Applied Mathematics*, 6(1), 4-22.
10. Puterman, M. L. (1994). *Markov Decision Processes: Discrete Stochastic Dynamic Programming*. John Wiley & Sons.
11. Rummery, G. A., & Niranjan, M. (1994). *On-line Q-learning using connectionist systems*. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department.
12. Fan, J., Wang, Z., Xie, Y., & Yang, Z. (2020). A theoretical analysis of deep Q-learning. *Proceedings of the 2nd Conference on Learning for Dynamics and Control (L4DC)*, PMLR 120, 486-489.