See also: Machine learning terms, Reinforcement Learning, Q-Learning
The epsilon-greedy policy (also written as ε-greedy) is one of the most widely used exploration strategies in reinforcement learning (RL). It provides a simple mechanism for balancing the fundamental tradeoff between exploration (trying new actions to gather information) and exploitation (choosing the best-known action to maximize reward). Under this policy, an agent selects the action with the highest estimated value most of the time, but with a small probability ε it instead picks a random action.
The epsilon-greedy approach was popularized by Richard Sutton and Andrew Barto in their influential textbook Reinforcement Learning: An Introduction, first published in 1998 and updated in 2018. It plays a central role in algorithms such as Q-learning, SARSA, Monte Carlo control, and Deep Q-Networks (DQN). Its simplicity, requiring only a single hyperparameter, makes it a practical default choice for many RL applications.
Given a set of actions A and a state s, the epsilon-greedy policy selects actions according to the following rule: with probability 1 - ε, choose the greedy action a* = argmax_a Q(s, a); with probability ε, choose an action uniformly at random from A.
The parameter ε (epsilon) is a value between 0 and 1 that controls the degree of exploration. Under this scheme, each non-greedy action is selected with probability ε / |A|, while the greedy action is selected with probability (1 - ε) + ε / |A|, where |A| is the total number of actions.
When ε = 0, the policy becomes a purely greedy policy that always exploits the current best estimate. When ε = 1, the policy becomes fully random, selecting actions with uniform probability regardless of their estimated values.
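To make these probabilities concrete, the following sketch computes the full epsilon-greedy action distribution for a given set of Q-value estimates. The function name and example values are illustrative, not taken from any particular library.

```python
import numpy as np

def epsilon_greedy_probabilities(q_values, epsilon):
    """Return the probability of selecting each action under epsilon-greedy.

    Every action receives epsilon / |A| from the uniform random branch;
    the greedy action additionally receives the (1 - epsilon) exploitation mass.
    """
    q_values = np.asarray(q_values, dtype=float)
    n_actions = len(q_values)
    probs = np.full(n_actions, epsilon / n_actions)
    probs[int(np.argmax(q_values))] += 1.0 - epsilon
    return probs

# Example: 4 actions, epsilon = 0.2
# greedy action: 0.8 + 0.05 = 0.85, every other action: 0.05
print(epsilon_greedy_probabilities([0.1, 0.5, 0.2, 0.3], 0.2))
```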
The exploration-exploitation dilemma is a core challenge in reinforcement learning. At every step, an agent must decide whether to exploit its current knowledge by choosing the action with the highest estimated value, or explore by trying other actions whose values are still uncertain and might turn out to be better.
Pure exploitation risks getting stuck in locally optimal behavior, missing better actions that the agent has not yet tried. Pure exploration wastes time on actions that are already known to be suboptimal. The epsilon-greedy policy offers a straightforward compromise: the agent mostly exploits its current knowledge, but occasionally explores at random.
This approach ensures that every action retains a nonzero probability of being selected at every step. In tabular settings, this continued exploration of all state-action pairs is a prerequisite for the convergence guarantees of methods such as Q-learning.
The epsilon-greedy strategy was originally studied in the context of the multi-armed bandit problem, a simplified RL setting where an agent must choose among k actions (called "arms") at each time step, each providing a stochastic reward drawn from an unknown distribution. The goal is to maximize total reward over a sequence of trials.
In the bandit setting, the agent maintains an estimate of the expected reward for each arm, typically using the sample-average method:
Q_t(a) = (sum of rewards received when action a was selected) / (number of times action a was selected)
The epsilon-greedy agent then selects the arm with the highest Q_t(a) with probability 1 - ε, and a random arm with probability ε. Over T time steps, approximately εT actions will have been chosen randomly, ensuring continued exploration.
Typical epsilon values for bandit problems range from 0.01 to 0.1. Sutton and Barto demonstrated in their 10-armed testbed that ε = 0.1 outperforms ε = 0.01 in the short run due to faster exploration, but ε = 0.01 eventually achieves better performance because it exploits more frequently after sufficient learning.
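A minimal sketch of an epsilon-greedy bandit agent, using the incremental form of the sample-average update, might look like the following. The class name, reward distribution, and run length are illustrative assumptions rather than a reproduction of Sutton and Barto's testbed.

```python
import numpy as np

class EpsilonGreedyBandit:
    """Minimal epsilon-greedy agent for a k-armed bandit (illustrative sketch)."""

    def __init__(self, k, epsilon=0.1):
        self.epsilon = epsilon
        self.q = np.zeros(k)   # sample-average estimates Q_t(a)
        self.n = np.zeros(k)   # number of times each arm was pulled

    def select_arm(self):
        if np.random.random() < self.epsilon:
            return np.random.randint(len(self.q))   # explore: random arm
        return int(np.argmax(self.q))               # exploit: best-known arm

    def update(self, arm, reward):
        # Incremental form of the sample-average estimate
        self.n[arm] += 1
        self.q[arm] += (reward - self.q[arm]) / self.n[arm]

# Example run on a toy 10-armed problem with Gaussian rewards
true_means = np.random.normal(0.0, 1.0, size=10)
agent = EpsilonGreedyBandit(k=10, epsilon=0.1)
for _ in range(1000):
    arm = agent.select_arm()
    reward = np.random.normal(true_means[arm], 1.0)
    agent.update(arm, reward)
```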
Q-learning is an off-policy temporal-difference (TD) control algorithm that uses the epsilon-greedy policy for action selection during training. At each step, the agent selects an action a in state s using the epsilon-greedy policy, observes the reward r and next state s', and updates its value estimate with:
Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') - Q(s, a)]
Here, α is the learning rate and γ is the discount factor. The key observation is that while the agent selects actions using the epsilon-greedy policy (the behavior policy), the Q-value update uses the maximum over all actions (the target policy is greedy). This off-policy nature allows Q-learning to learn the optimal policy even while following an exploratory behavior policy.
At the beginning of training, ε is typically set high (often 1.0 or close to it) so the agent explores broadly. As training progresses and Q-value estimates become more accurate, ε is reduced so the agent increasingly exploits its learned knowledge.
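The following sketch shows a tabular Q-learning training loop with an epsilon-greedy behavior policy and a decaying epsilon. The environment interface (n_states, n_actions, reset(), step()) and the hyperparameter values are assumptions made for illustration, not part of any specific library.

```python
import numpy as np

def train_q_learning(env, n_episodes=500, alpha=0.1, gamma=0.99,
                     epsilon_start=1.0, epsilon_min=0.05, decay=0.995):
    """Tabular Q-learning with an epsilon-greedy behavior policy (sketch).

    Assumes a small environment exposing reset() -> state and
    step(action) -> (next_state, reward, done), with integer states/actions.
    """
    Q = np.zeros((env.n_states, env.n_actions))
    epsilon = epsilon_start
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection (behavior policy)
            if np.random.random() < epsilon:
                action = np.random.randint(env.n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            # Off-policy TD update: the target uses the greedy (max) action
            td_target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state, action] += alpha * (td_target - Q[state, action])
            state = next_state
        # Decay epsilon after each episode to shift toward exploitation
        epsilon = max(epsilon_min, epsilon * decay)
    return Q
```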
The Deep Q-Network (DQN) algorithm, introduced by Mnih et al. in 2013 and refined in 2015, uses epsilon-greedy as its primary exploration mechanism. In the original DQN paper that achieved human-level performance on Atari 2600 games, the authors used the following schedule:
| Parameter | Value |
|---|---|
| Starting epsilon | 1.0 |
| Final epsilon | 0.1 |
| Annealing period | 1,000,000 frames |
| Evaluation epsilon | 0.05 |
| Total training frames | 10,000,000 |
During training, epsilon was linearly annealed from 1.0 to 0.1 over the first million frames, then held constant at 0.1 for the remainder of training. During evaluation, a lower epsilon of 0.05 was used to allow a small amount of stochasticity while predominantly exploiting the learned policy.
This linear annealing schedule became a standard practice in deep RL. The initial period of high exploration helps populate the experience replay buffer with diverse transitions, which is critical for stable training of the neural network function approximator.
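A minimal sketch of this frame-based linear annealing schedule, using the values from the table above (the function name is illustrative):

```python
def dqn_epsilon(frame, eps_start=1.0, eps_final=0.1, anneal_frames=1_000_000):
    """Linearly anneal epsilon over the first `anneal_frames` frames,
    then hold it constant (sketch of the schedule described above)."""
    fraction = min(frame / anneal_frames, 1.0)
    return eps_start + fraction * (eps_final - eps_start)

# dqn_epsilon(0) == 1.0, dqn_epsilon(500_000) == 0.55, dqn_epsilon(2_000_000) == 0.1
```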
Rather than using a fixed value of epsilon throughout training, it is common practice to decay epsilon over time. The rationale is that the agent should explore more during early training when it knows little about the environment, and exploit more as its estimates improve. Several decay strategies are used in practice:
Linear decay: epsilon decreases at a constant rate per time step:
ε_t = max(ε_min, ε_start - (ε_start - ε_min) * t / T)
where ε_start is the initial epsilon, ε_min is the minimum epsilon, t is the current time step, and T is the total number of annealing steps. This is the schedule used in the original DQN paper.
Exponential decay: epsilon decreases multiplicatively:
ε_t = max(ε_min, ε_start * d^t)
where d is a decay factor slightly less than 1 (for example, 0.995 or 0.999). Exponential decay reduces epsilon rapidly at first and more slowly later, which can be beneficial when the agent needs to transition quickly from exploration to exploitation.
Inverse decay: epsilon decreases in proportion to the time step:
ε_t = 1 / t
This schedule satisfies the theoretical conditions for Greedy in the Limit with Infinite Exploration (GLIE), which guarantees convergence of Monte Carlo control methods. Under GLIE, every state-action pair must be explored infinitely many times, and the policy must converge to a greedy policy in the limit.
Reward-based decay, introduced by Aakash et al. (2019), ties epsilon reduction to the agent's performance: epsilon is decreased only when the agent crosses a reward threshold, providing evidence that learning has occurred. This avoids premature exploitation when the agent has not yet gathered sufficient information.
| Decay Schedule | Formula | Characteristics |
|---|---|---|
| Constant | ε_t = ε | Simple; no adaptation over time |
| Linear | ε_t = ε_start - (ε_start - ε_min) * t / T | Steady decrease; easy to tune |
| Exponential | ε_t = ε_start * d^t | Fast early decay; slow later decay |
| Inverse (1/t) | ε_t = 1 / t | Satisfies GLIE conditions; can be slow |
| Reward-based | Decay on reward thresholds | Adaptive; tied to agent performance |
The choice of epsilon depends on the problem, the stage of training, and the algorithm. The following table summarizes common values used across different settings:
| Setting | Typical ε Value | Notes |
|---|---|---|
| Multi-armed bandits | 0.01 to 0.1 | Lower values after sufficient exploration |
| Tabular Q-learning (early) | 0.5 to 1.0 | High exploration during initial learning |
| Tabular Q-learning (late) | 0.01 to 0.1 | Shift toward exploitation |
| DQN training (start) | 1.0 | Fully random to fill replay buffer |
| DQN training (final) | 0.01 to 0.1 | Mostly exploiting the learned policy |
| DQN evaluation | 0.05 | Small randomness for robustness |
| Production / deployment | 0.0 to 0.05 | Mostly or fully greedy |
Using a constant epsilon is simpler but has drawbacks. If ε is set too high, the agent continues to take many random actions even after it has learned a good policy, degrading performance. If ε is set too low, the agent may not explore enough early on and can get stuck with poor estimates.
A decaying epsilon addresses this by starting with high exploration and gradually reducing it. The tradeoff is that the decay schedule itself introduces additional hyperparameters (the starting value, the minimum value, and the decay rate) that must be tuned for each environment.
Several alternative exploration strategies exist, each with different tradeoffs relative to epsilon-greedy:
Instead of choosing between a random action and the greedy action, softmax exploration samples actions from a Boltzmann distribution over Q-values:
P(a|s) = exp(Q(s,a) / τ) / Σ_b exp(Q(s,b) / τ)
The temperature parameter τ controls exploration: high τ produces near-uniform selection, while low τ concentrates probability on the best action. Unlike epsilon-greedy, softmax exploration considers the relative values of actions, giving higher probability to actions with higher Q-values even during exploration. However, tuning τ can be more difficult than tuning ε.
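A minimal sketch of Boltzmann action selection, assuming Q-value estimates stored in an array (the function name is illustrative):

```python
import numpy as np

def softmax_action(q_values, tau=1.0):
    """Sample an action from a Boltzmann distribution over Q-values (sketch)."""
    q_values = np.asarray(q_values, dtype=float)
    # Subtract the max before exponentiating for numerical stability
    prefs = (q_values - np.max(q_values)) / tau
    probs = np.exp(prefs) / np.sum(np.exp(prefs))
    return int(np.random.choice(len(q_values), p=probs))
```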
UCB algorithms select the action that maximizes an upper confidence bound on the estimated value:
a_t = argmax_a [Q_t(a) + c * sqrt(ln(t) / N_t(a))]
where N_t(a) is the number of times action a has been selected and c is an exploration constant. UCB is guided by the principle of "optimism in the face of uncertainty" and naturally explores less-visited actions without requiring a separate randomization step. UCB methods often outperform epsilon-greedy in bandit problems because they direct exploration toward uncertain actions rather than exploring uniformly at random.
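A minimal sketch of UCB action selection in the bandit setting, assuming arrays of value estimates and visit counts (the function name and the handling of untried arms are illustrative choices):

```python
import numpy as np

def ucb_action(q_values, counts, t, c=2.0):
    """Select an action by upper confidence bound (sketch).

    q_values: current value estimates Q_t(a)
    counts:   N_t(a), number of times each action has been selected
    t:        current time step (1-indexed)
    c:        exploration constant
    """
    q_values = np.asarray(q_values, dtype=float)
    counts = np.asarray(counts, dtype=float)
    # Untried actions get an infinite bonus so each is tried at least once
    bonus = np.where(counts > 0,
                     c * np.sqrt(np.log(t) / np.maximum(counts, 1)),
                     np.inf)
    return int(np.argmax(q_values + bonus))
```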
Thompson sampling maintains a Bayesian posterior distribution over the expected reward of each action. At each step, it samples a value from each action's posterior and selects the action with the highest sampled value. This approach naturally balances exploration and exploitation: actions with uncertain value estimates have wider posteriors and are more likely to produce high samples. Thompson sampling often provides strong empirical performance and has theoretical regret bounds, but it requires maintaining and sampling from posterior distributions, which adds computational complexity.
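For the common case of Bernoulli-reward bandits with Beta posteriors, a minimal Thompson sampling sketch might look like this (the Beta(1, 1) prior and the function name are illustrative assumptions):

```python
import numpy as np

def thompson_sample_action(successes, failures):
    """Thompson sampling for Bernoulli bandits with Beta(1, 1) priors (sketch).

    successes[a], failures[a]: counts of observed outcomes for each arm.
    """
    successes = np.asarray(successes, dtype=float)
    failures = np.asarray(failures, dtype=float)
    # Draw one sample from each arm's posterior and pick the best
    samples = np.random.beta(successes + 1, failures + 1)
    return int(np.argmax(samples))

# After pulling `arm` and observing binary reward r:
#   successes[arm] += r; failures[arm] += 1 - r
```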
NoisyNet, introduced by Fortunato et al. (2017), replaces epsilon-greedy exploration with learned parametric noise added directly to the neural network weights. The parameters of the noise are learned through gradient descent alongside the network weights. A key advantage of NoisyNet is that exploration becomes state-dependent: the degree and direction of exploration vary based on the current state, unlike epsilon-greedy where exploration is state-independent. NoisyNet has been shown to substantially improve performance over epsilon-greedy in DQN, Dueling DQN, and A3C agents across a wide range of Atari games.
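The following is a deliberately simplified sketch of a noisy linear layer in the spirit of NoisyNet, using independent Gaussian noise and fixed noise scales; the original method uses factorised Gaussian noise and learns the noise parameters by gradient descent, which is omitted here.

```python
import numpy as np

class NoisyLinear:
    """Simplified noisy linear layer inspired by NoisyNet (illustrative sketch)."""

    def __init__(self, in_features, out_features, sigma_init=0.5):
        self.w_mu = np.random.randn(out_features, in_features) * 0.1
        self.b_mu = np.zeros(out_features)
        self.w_sigma = np.full((out_features, in_features), sigma_init / np.sqrt(in_features))
        self.b_sigma = np.full(out_features, sigma_init / np.sqrt(in_features))

    def forward(self, x):
        # Sample fresh noise on each forward pass; exploration comes from
        # the perturbed weights rather than from an epsilon schedule.
        w = self.w_mu + self.w_sigma * np.random.randn(*self.w_mu.shape)
        b = self.b_mu + self.b_sigma * np.random.randn(*self.b_mu.shape)
        return x @ w.T + b
```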
| Strategy | Exploration Type | Key Parameter | Strengths | Weaknesses |
|---|---|---|---|---|
| Epsilon-greedy | Uniform random | ε (epsilon) | Simple; easy to implement and tune | Explores uniformly; state-independent |
| Softmax / Boltzmann | Probability-weighted | τ (temperature) | Considers action values during exploration | Temperature tuning can be difficult |
| UCB | Optimism-based | c (confidence) | Directed exploration; strong theory | Harder to apply in large state spaces |
| Thompson Sampling | Posterior sampling | Prior distribution | Strong empirical and theoretical results | Requires Bayesian inference |
| NoisyNet | Learned noise | Noise parameters | State-dependent; no epsilon schedule needed | More complex; adds parameters |
A greedy policy is the special case where ε = 0. The agent always selects the action with the highest estimated value, performing no exploration at all. While a greedy policy maximizes expected immediate reward based on current estimates, it can fail to discover better actions and may converge to suboptimal behavior.
In practice, a greedy policy is typically used only during evaluation or deployment, after the agent has been trained with an exploratory policy. During training, even a very small amount of exploration (such as ε = 0.01) can significantly improve the quality of the learned policy.
Despite its popularity, the epsilon-greedy strategy has several notable limitations:
Uniform random exploration: When the agent explores, it selects actions uniformly at random. This means it does not prioritize exploring uncertain or promising actions. In environments with large action spaces, random exploration can be extremely inefficient.
State-independent exploration: The probability of exploring is the same regardless of the current state. The agent explores equally in well-understood states and poorly-understood states, which wastes exploration budget.
No exploration memory: Each exploration step is independent. The agent does not form coherent exploratory strategies that span multiple time steps, which can make it difficult to reach distant or hard-to-find rewards (the "sparse reward" problem).
Sensitivity to ε and decay schedule: Performance can be sensitive to the choice of ε and the decay schedule. Tuning these hyperparameters often requires experimentation, and optimal values may differ significantly across environments.
Continued suboptimal actions: Even with a small ε, the agent periodically takes random actions during deployment, which can be undesirable in safety-critical applications.
Large action spaces: In continuous or very large discrete action spaces, the probability of randomly selecting any particular informative action becomes vanishingly small, making epsilon-greedy exploration impractical without modifications.
A basic epsilon-greedy action selection in Python can be implemented as follows:
```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon, n_actions):
    """Select an action using the epsilon-greedy policy."""
    if np.random.random() < epsilon:
        # Explore: choose a random action
        return np.random.randint(n_actions)
    else:
        # Exploit: choose the action with the highest Q-value
        return np.argmax(Q[state])
```
For epsilon decay, a common pattern during training is:
```python
epsilon = max(epsilon_min, epsilon * decay_rate)  # Exponential decay
# or
epsilon = max(epsilon_min, epsilon_start - step * (epsilon_start - epsilon_min) / total_steps)  # Linear decay
```
Imagine you are in a new city and want to find the best ice cream shop. You could try a different shop every day (exploring), or you could keep going back to the one you liked best so far (exploiting). The epsilon-greedy approach says: most of the time, go to your favorite shop. But once in a while, pick a random shop you have not tried. Maybe you will discover something even better! As you try more shops over time, you become more confident about which one is best, so you visit random shops less and less often. That is epsilon decay: you explore a lot at first when you do not know much, and gradually settle on your favorite as you learn.