In reinforcement learning, a greedy policy is a decision-making strategy that always selects the action with the highest estimated value in a given state. Formally, a greedy policy chooses the action that maximizes the action-value function (also called the Q-function) for each state, without any randomness or exploration. Greedy policies play a central role in many reinforcement learning algorithms, including Q-learning, value iteration, and policy iteration, where they serve as the mechanism through which learned value estimates are converted into concrete behavior.
A greedy policy is defined as the policy that, for every state s, selects the action a that maximizes the estimated action-value function Q(s, a):
pi(s) = argmax_a Q(s, a)
Here, pi(s) denotes the action chosen by the policy in state s, and Q(s, a) represents the expected cumulative reward from taking action a in state s and following the optimal policy thereafter. The argmax operator selects the action yielding the highest Q-value. If multiple actions share the same maximum Q-value, the tie can be broken arbitrarily (for example, by selecting the first such action or by random selection among tied actions).
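As a minimal illustration, here is a Python sketch of greedy action selection from a tabular Q-function, with ties broken uniformly at random; the Q-table contents and the state/action names are invented for the example.

```python
import random

def greedy_action(Q, state, actions):
    """Return an action maximizing Q[(state, a)], breaking ties at random."""
    q_values = [Q.get((state, a), 0.0) for a in actions]
    best = max(q_values)
    # Collect every action whose Q-value equals the maximum, then pick one at random.
    best_actions = [a for a, q in zip(actions, q_values) if q == best]
    return random.choice(best_actions)

# Hypothetical Q-table for a single state with three actions.
Q = {("s0", "left"): 1.2, ("s0", "right"): 1.2, ("s0", "stay"): 0.3}
print(greedy_action(Q, "s0", ["left", "right", "stay"]))  # "left" or "right"
```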
In the context of state-value functions rather than action-value functions, a greedy policy selects the action that maximizes the expected immediate reward plus the discounted value of the next state:
pi(s) = argmax_a [ R(s, a) + gamma * sum_s' P(s' | s, a) * V(s') ]
where R(s, a) is the expected reward, gamma is the discount factor, P(s' | s, a) is the state transition probability, and V(s') is the value of the successor state.
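The same one-step lookahead can be written as a short sketch, assuming the model is available as nested dictionaries R[s][a] and P[s][a][s'] (a layout chosen just for this example):

```python
def greedy_from_v(state, actions, R, P, V, gamma=0.9):
    """Pick the action maximizing expected reward plus discounted next-state value."""
    def lookahead(a):
        # R(s, a) + gamma * sum over s' of P(s' | s, a) * V(s')
        return R[state][a] + gamma * sum(p * V[s_next]
                                         for s_next, p in P[state][a].items())
    return max(actions, key=lookahead)

# Tiny made-up MDP fragment: from "s0", "go" reaches "s1" or "s2" stochastically.
R = {"s0": {"go": 1.0, "wait": 0.0}}
P = {"s0": {"go": {"s1": 0.8, "s2": 0.2}, "wait": {"s0": 1.0}}}
V = {"s0": 0.0, "s1": 5.0, "s2": -1.0}
print(greedy_from_v("s0", ["go", "wait"], R, P, V))  # "go"
```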
A greedy policy represents pure exploitation: the agent always acts on its current best knowledge without ever trying new or uncertain actions. In the exploration-exploitation framework, exploitation means selecting the action that the agent currently believes will yield the highest return, based on its learned value estimates.
This is a strength in settings where the agent has accurate value estimates, because the greedy policy directly translates good knowledge into high-reward behavior. However, pure exploitation becomes problematic when the agent's value estimates are incomplete or inaccurate, which is often the case early in learning or in environments with stochastic dynamics.
Policy iteration is a dynamic programming algorithm that alternates between two steps: policy evaluation (computing the value function for the current policy) and policy improvement (constructing a new greedy policy with respect to the current value function). The policy improvement theorem guarantees that the greedy policy with respect to the value function of any policy pi is at least as good as pi itself. By repeatedly evaluating and improving, policy iteration converges to the optimal policy in a finite number of steps for finite Markov decision processes (MDPs).
The policy improvement step works as follows: given the current value function V_pi, construct a new policy pi' such that for every state s, pi'(s) = argmax_a Q_pi(s, a). The policy improvement theorem then states that V_pi'(s) >= V_pi(s) for all states s. Equality holds for all states only when both policies are already optimal.
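A compact, illustrative sketch of policy iteration on a small tabular MDP, reusing the R/P dictionary layout assumed above; iterative evaluation is used here rather than an exact linear solve.

```python
def policy_iteration(states, actions, R, P, gamma=0.9, eval_tol=1e-8):
    """Alternate policy evaluation and greedy policy improvement until stable."""
    policy = {s: actions[0] for s in states}      # arbitrary initial policy
    V = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: sweep the Bellman expectation update for the current policy.
        while True:
            delta = 0.0
            for s in states:
                a = policy[s]
                v_new = R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < eval_tol:
                break
        # Policy improvement: make the policy greedy with respect to the evaluated V.
        stable = True
        for s in states:
            best = max(actions, key=lambda a: R[s][a] +
                       gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return policy, V
```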
Value iteration combines the evaluation and improvement steps into a single update. At each iteration, the Bellman optimality equation is applied as the update rule:
V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s' | s, a) * V(s') ]
The max operation here is equivalent to acting greedily: the algorithm implicitly constructs a greedy policy at each step. Once the value function converges (that is, when the maximum change across all states falls below a threshold), extracting the greedy policy from the converged value function yields the optimal policy. The convergence is guaranteed because the Bellman optimality operator is a contraction mapping with respect to the infinity norm.
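A minimal value iteration sketch under the same assumed model layout; the greedy policy is extracted once the largest per-sweep change drops below the tolerance.

```python
def value_iteration(states, actions, R, P, gamma=0.9, tol=1e-8):
    """Apply the Bellman optimality update until convergence, then act greedily."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Greedy backup: max over actions of reward plus discounted next-state value.
            v_new = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                        for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    # Extract the greedy policy from the converged value function.
    policy = {s: max(actions, key=lambda a: R[s][a] +
                     gamma * sum(p * V[s2] for s2, p in P[s][a].items()))
              for s in states}
    return V, policy
```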
Q-learning, introduced by Christopher Watkins in 1989, is an off-policy temporal difference learning algorithm. Its update rule uses the greedy action in the next state:
Q(s, a) <- Q(s, a) + alpha * [ r + gamma * max_a' Q(s', a') - Q(s, a) ]
The term max_a' Q(s', a') reflects the assumption that the agent will behave greedily from the next state onward. This is what makes Q-learning "off-policy": the behavior policy (which the agent actually follows to collect data) can be exploratory (such as epsilon-greedy), while the target policy (the one being learned) is greedy. Watkins and Dayan proved in 1992 that Q-learning converges to the optimal action-value function with probability 1, provided that all state-action pairs are visited infinitely often and the learning rate satisfies standard stochastic approximation conditions.
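A hedged sketch of the tabular Q-learning update together with an epsilon-greedy behavior policy; the Q-table layout (a dict keyed by state-action pairs) is an assumption of the example.

```python
import random

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Off-policy TD update: bootstrap from the greedy action in s_next."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)   # max_a' Q(s', a')
    td_target = r + gamma * best_next
    td_error = td_target - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """Behavior policy: random action with probability epsilon, greedy otherwise."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```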
SARSA (State-Action-Reward-State-Action) is an on-policy temporal difference method. Unlike Q-learning, SARSA updates its value estimates using the action actually taken by the behavior policy rather than the greedy action:
Q(s, a) <- Q(s, a) + alpha * [ r + gamma * Q(s', a') - Q(s, a) ]
Here a' is the action selected by the current policy in state s', not necessarily the greedy action. When SARSA uses an epsilon-greedy behavior policy, it learns the value of that epsilon-greedy policy rather than the optimal greedy policy. This makes SARSA more conservative than Q-learning in risky environments: it accounts for the possibility of exploratory (potentially dangerous) actions, whereas Q-learning assumes optimal behavior going forward.
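The corresponding SARSA update, again as a sketch over the same assumed Q-table layout; note that it bootstraps from a_next, the action the behavior policy actually selected, rather than from the greedy action.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy TD update: bootstrap from the action actually taken in s_next."""
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)
    td_error = td_target - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
```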
In deep Q-networks (DQN), a neural network approximates the action-value function, and the greedy policy is derived by selecting the action with the highest predicted Q-value. DQN uses an epsilon-greedy behavior policy during training, where epsilon is typically annealed from a high value (for example, 1.0) to a low value (for example, 0.01) over the course of training. The target policy remains greedy. Two mechanisms stabilize the learning process: experience replay, which stores transitions and samples random mini-batches for training, and a target network, which is a periodically updated copy of the Q-network used to compute target values.
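A schematic sketch of how the annealed epsilon-greedy behavior policy and the greedy bootstrap target fit together in DQN-style training; q_network and target_network are hypothetical callables returning per-action value arrays, not part of any specific library.

```python
import numpy as np

def linear_epsilon(step, start=1.0, end=0.01, anneal_steps=100_000):
    """Linearly anneal the exploration rate from start down to end."""
    frac = min(step / anneal_steps, 1.0)
    return start + frac * (end - start)

def act(q_network, state, n_actions, epsilon, rng=np.random.default_rng()):
    """Epsilon-greedy behavior policy over the network's predicted Q-values."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(q_network(state)))        # greedy w.r.t. current estimates

def td_targets(target_network, rewards, next_states, dones, gamma=0.99):
    """Greedy bootstrap targets r + gamma * max_a Q_target(s', a), zeroed at episode ends."""
    next_q = np.array([np.max(target_network(s)) for s in next_states])
    return np.asarray(rewards) + gamma * (1.0 - np.asarray(dones)) * next_q
```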
The Bellman optimality equation establishes the theoretical foundation for greedy policies. It states that the optimal value function satisfies:
V*(s) = max_a [ R(s, a) + gamma * sum_s' P(s' | s, a) * V*(s') ]
The max operator in this equation is exactly the greedy selection over actions. Any policy that is greedy with respect to the optimal value function V* is an optimal policy. More precisely, the Bellman optimality equation asserts that the optimal policy can always be obtained by acting greedily with respect to the optimal value function. This means that if an agent has access to perfect value estimates, a greedy policy is provably optimal.
The analogous result for action-value functions is:
Q*(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) * max_a' Q*(s', a')
Once Q* is known, the optimal greedy policy simply selects argmax_a Q*(s, a) in each state.
Generalized policy iteration (GPI) is the overarching framework described by Sutton and Barto (2018) that unifies nearly all reinforcement learning methods. GPI consists of two interacting processes: policy evaluation (estimating the value function for the current policy) and policy improvement (making the policy greedy with respect to the current value function). These two processes compete and cooperate: evaluation makes the value function consistent with the current policy, while improvement changes the policy to be greedy with respect to the current value function, which in turn makes the value function inaccurate for the new policy.
Almost all reinforcement learning algorithms can be described as instances of GPI. The key insight is that as long as both processes continue to update, the policy and value function will converge to the optimal policy and optimal value function, regardless of the specific evaluation and improvement methods used.
The primary limitation of a purely greedy policy is the exploration-exploitation tradeoff. A greedy policy exploits current knowledge but never explores, which means it can fail to discover better actions that have not been tried or whose values have been underestimated. This is especially problematic early in learning, when value estimates are still unreliable, and in stochastic environments where a few lucky outcomes can inflate an action's estimated value.
The multi-armed bandit problem illustrates this tradeoff clearly. An agent facing multiple slot machines (arms) with unknown payout distributions must decide whether to pull the arm with the best observed average payout (exploit) or try other arms that might have higher true payouts (explore). A purely greedy strategy can lock onto a suboptimal arm early if that arm happened to produce a high initial payout.
Several strategies modify the greedy policy to incorporate exploration. The following table compares the most common approaches:
| Strategy | Mechanism | Exploration style | Key parameter | When to use |
|---|---|---|---|---|
| Greedy | Always picks argmax Q(s, a) | None (pure exploitation) | None | When value estimates are accurate |
| Epsilon-greedy | With probability epsilon, pick a random action; otherwise pick argmax | Undirected random | epsilon (exploration rate) | General-purpose; simple to implement |
| Epsilon-greedy with decay | Epsilon decreases over time (for example, epsilon_t = 1/t) | Decreasing random | Initial epsilon, decay schedule | When convergence to greedy is desired |
| Boltzmann (softmax) | Sample actions proportionally to exp(Q(s,a)/tau) | Directed, value-weighted | tau (temperature) | When action values should influence exploration |
| Upper confidence bound (UCB) | Pick argmax [Q(s,a) + c * sqrt(ln(t)/N(a))] | Directed, uncertainty-based | c (confidence parameter) | When visit counts are available |
| Thompson sampling | Sample from posterior distribution of Q-values; act greedily on sample | Directed, Bayesian | Prior distribution | When Bayesian modeling is feasible |
The epsilon-greedy policy is the most widely used modification of the greedy policy. It selects the greedy action with probability 1 - epsilon and a uniformly random action with probability epsilon. Formally, for a state s with action set A(s), each non-greedy action is chosen with probability epsilon / |A(s)|, and the greedy action is chosen with probability 1 - epsilon + epsilon / |A(s)| (the random draw can also land on the greedy action).
The parameter epsilon controls the exploration rate. A common practice is to start with a high epsilon (for example, 1.0, meaning fully random behavior) and decay it over time toward a small value (for example, 0.01), so the agent explores broadly at first and then increasingly exploits its learned knowledge. The decay schedule can be linear, exponential, or follow a formula such as epsilon_t = c / t^(1/3), which has been shown to minimize the regret upper bound for deep epsilon-greedy policies.
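A small sketch of the decay schedules mentioned above; the constants and horizons are arbitrary illustrative choices.

```python
import math

def epsilon_linear(t, start=1.0, end=0.01, horizon=10_000):
    """Linear decay from start to end over a fixed horizon."""
    return max(end, start - (start - end) * t / horizon)

def epsilon_exponential(t, start=1.0, end=0.01, rate=5e-4):
    """Exponential decay toward the floor value end."""
    return end + (start - end) * math.exp(-rate * t)

def epsilon_power(t, c=1.0):
    """epsilon_t = c / t^(1/3), clipped to 1.0 so early steps are fully random."""
    return min(1.0, c / (t ** (1 / 3))) if t > 0 else 1.0

for t in (1, 100, 10_000):
    print(t, round(epsilon_linear(t), 3), round(epsilon_exponential(t), 3),
          round(epsilon_power(t), 3))
```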
Boltzmann exploration (also called softmax exploration) selects actions with probabilities proportional to their exponentiated Q-values:
P(a | s) = exp(Q(s, a) / tau) / sum_b exp(Q(s, b) / tau)
The temperature parameter tau controls the degree of exploration. When tau is high, the distribution becomes nearly uniform (maximum exploration). When tau approaches zero, the distribution concentrates on the greedy action (maximum exploitation). Unlike epsilon-greedy, Boltzmann exploration uses the relative magnitudes of Q-values to guide exploration, meaning that actions with Q-values close to the maximum are selected more often than clearly suboptimal ones.
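A numerically stable sketch of Boltzmann action selection; subtracting the maximum Q-value before exponentiating leaves the probabilities unchanged while avoiding overflow.

```python
import numpy as np

def boltzmann_action(q_values, tau=0.5, rng=np.random.default_rng()):
    """Sample an action with probability proportional to exp(Q(s, a) / tau)."""
    q = np.asarray(q_values, dtype=float)
    prefs = (q - q.max()) / tau                 # shift by the max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q), p=probs))

# With a low temperature, action 0 dominates, action 1 is occasional, action 2 is rare.
print(boltzmann_action([1.0, 0.9, -2.0], tau=0.1))
```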
The upper confidence bound strategy selects the action that maximizes a sum of the estimated Q-value and an exploration bonus:
a_t = argmax_a [ Q(s, a) + c * sqrt( ln(t) / N_t(a) ) ]
Here, N_t(a) is the number of times action a has been selected up to time t, and c is a parameter controlling the degree of exploration. The exploration bonus is inversely proportional to the number of times an action has been tried, so under-explored actions receive a higher bonus. UCB provides theoretical guarantees on regret and is particularly effective in stationary bandit problems.
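A sketch of UCB1-style selection in a bandit setting, assuming running value estimates Q[a] and visit counts N[a]; untried arms are selected first so the bonus term is always well defined.

```python
import math

def ucb_action(Q, N, t, c=2.0):
    """Pick argmax_a [Q(a) + c * sqrt(ln(t) / N(a))], trying each arm once first."""
    for a, n in enumerate(N):
        if n == 0:
            return a                              # untried arm gets priority
    scores = [Q[a] + c * math.sqrt(math.log(t) / N[a]) for a in range(len(Q))]
    return max(range(len(Q)), key=lambda a: scores[a])

Q = [0.4, 0.6, 0.5]        # running mean rewards (made up)
N = [10, 50, 2]            # pull counts per arm
print(ucb_action(Q, N, t=sum(N)))   # arm 2 wins thanks to its large uncertainty bonus
```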
Thompson sampling maintains a posterior distribution over the parameters of the reward distribution for each action. At each time step, the agent samples a value from each action's posterior and acts greedily with respect to those samples. This Bayesian approach naturally balances exploration and exploitation: actions with high uncertainty are sampled more widely, leading to greater exploration, while actions with well-estimated high values are exploited.
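A sketch of Thompson sampling for a Bernoulli bandit with Beta(1, 1) priors; the hidden payout probabilities are invented for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = [0.3, 0.55, 0.5]                  # hidden Bernoulli payout rates (made up)
alpha = np.ones(len(true_probs))               # Beta posterior: 1 + observed successes
beta = np.ones(len(true_probs))                # Beta posterior: 1 + observed failures

for _ in range(2000):
    samples = rng.beta(alpha, beta)            # one draw per arm from its posterior
    a = int(np.argmax(samples))                # act greedily on the sampled values
    reward = float(rng.random() < true_probs[a])   # observe a Bernoulli reward
    alpha[a] += reward                         # update the chosen arm's posterior
    beta[a] += 1 - reward

print("posterior means:", np.round(alpha / (alpha + beta), 2))
```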
A policy is said to be GLIE (greedy in the limit with infinite exploration) if it satisfies two conditions: every state-action pair is visited infinitely often, and the policy converges to the greedy policy in the limit.
The GLIE property is important because it guarantees convergence to the optimal policy in Monte Carlo control methods and certain temporal difference methods. An epsilon-greedy policy can be made GLIE by using a decaying epsilon schedule such as epsilon_t = c / n_t(s), where n_t(s) is the number of visits to state s and 0 < c < 1. Under this schedule, the agent explores less frequently over time but never stops exploring entirely, eventually converging to greedy behavior while having sampled all state-action pairs sufficiently.
The term "greedy" appears in both reinforcement learning and classical algorithm design, but the concepts are distinct:
| Aspect | Greedy policy (RL) | Greedy algorithm (classical CS) |
|---|---|---|
| Domain | Reinforcement learning, MDPs | Combinatorial optimization, graph theory |
| Decision basis | Maximizes estimated Q-value | Makes locally optimal choice at each step |
| Optimality | Optimal when value estimates are exact | Optimal only for problems with greedy-choice property (for example, minimum spanning trees) |
| Revisits decisions | Value estimates are updated over time | Never reconsiders past choices |
| Stochastic | Can be deterministic or probabilistic | Deterministic |
| Examples | Q-learning target policy, policy improvement step | Dijkstra's algorithm, Kruskal's algorithm, Huffman coding |
In classical computer science, a greedy algorithm builds a solution piece by piece, always choosing the locally optimal option at each step without backtracking. This approach yields optimal solutions only for problems exhibiting the greedy-choice property and optimal substructure (for example, activity selection, fractional knapsack). In reinforcement learning, a greedy policy is optimal when the underlying value function is exact, which is guaranteed by the Bellman optimality equation.
In practice, purely greedy policies are rarely used during training. Most reinforcement learning implementations use an epsilon-greedy policy or another exploration strategy during the learning phase and switch to a greedy policy (or near-greedy policy) at deployment time, once the agent's value estimates are sufficiently accurate.
Imagine you are at a candy store and you have tried three types of candy so far. The chocolate was really good, the gummy bear was okay, and the lollipop was not great. A greedy approach means you always pick the chocolate from now on, because it was the best one you have tried. But there are 20 other candies you have never tasted. Some of them might be even better than chocolate. The problem is, you will never know because you never try anything new. That is what a greedy policy does: it always picks the thing it thinks is best right now, even if trying something different might lead to finding something even better.