See also: Machine learning terms, Reinforcement Learning, Q-Learning
The epsilon-greedy policy (also written as ε-greedy) is one of the most widely used exploration strategies in reinforcement learning (RL). It provides a simple mechanism for balancing the fundamental tradeoff between exploration (trying new actions to gather information) and exploitation (choosing the best-known action to maximize reward). Under this policy, an agent selects the action with the highest estimated value most of the time, but with a small probability ε it instead picks a random action.
The epsilon-greedy approach was popularized by Richard Sutton and Andrew Barto in their influential textbook Reinforcement Learning: An Introduction, first published in 1998 and updated in 2018. It plays a central role in algorithms such as Q-learning, SARSA, Monte Carlo control, and Deep Q-Networks (DQN). Its simplicity, requiring only a single hyperparameter, makes it a practical default choice for many RL applications.
Given a set of actions A and a state s, the epsilon-greedy policy selects actions according to the following rule: with probability 1 - ε, choose the greedy action a* = argmax_a Q(s, a); with probability ε, choose an action uniformly at random from A.
The parameter ε (epsilon) is a value between 0 and 1 that controls the degree of exploration. Under this scheme, each non-greedy action is selected with probability ε / |A|, while the greedy action is selected with probability (1 - ε) + ε / |A|, where |A| is the total number of actions.
When ε = 0, the policy becomes a purely greedy policy that always exploits the current best estimate. When ε = 1, the policy becomes fully random, selecting actions with uniform probability regardless of their estimated values.
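To make these probabilities concrete, the following sketch computes the full epsilon-greedy action distribution for a given set of Q-value estimates. The function name and example values are illustrative, not taken from any particular library.

```python
import numpy as np

def epsilon_greedy_probabilities(q_values, epsilon):
    """Return the probability of selecting each action under epsilon-greedy.

    Every action receives epsilon / |A| from the uniform random branch;
    the greedy action additionally receives the (1 - epsilon) exploitation mass.
    """
    q_values = np.asarray(q_values, dtype=float)
    n_actions = len(q_values)
    probs = np.full(n_actions, epsilon / n_actions)
    probs[int(np.argmax(q_values))] += 1.0 - epsilon
    return probs

# Example: 4 actions, epsilon = 0.2
# greedy action: 0.8 + 0.05 = 0.85, every other action: 0.05
print(epsilon_greedy_probabilities([0.1, 0.5, 0.2, 0.3], 0.2))
```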
The exploration-exploitation dilemma is a core challenge in reinforcement learning. At every step, an agent must decide whether to exploit its current knowledge by choosing the action with the highest estimated value, or explore by trying other actions whose values are still uncertain and might turn out to be better.
Pure exploitation risks getting stuck in locally optimal behavior, missing better actions that the agent has not yet tried. Pure exploration wastes time on actions that are already known to be suboptimal. The epsilon-greedy policy offers a straightforward compromise: the agent mostly exploits its current knowledge, but occasionally explores at random.
This approach ensures that every action retains a nonzero probability of being selected at every step. In tabular settings, this continued exploration of all state-action pairs is a prerequisite for the convergence guarantees of methods such as Q-learning.
The epsilon-greedy strategy was originally studied in the context of the multi-armed bandit problem, a simplified RL setting where an agent must choose among k actions (called "arms") at each time step, each providing a stochastic reward drawn from an unknown distribution. The goal is to maximize total reward over a sequence of trials.
In the bandit setting, the agent maintains an estimate of the expected reward for each arm, typically using the sample-average method:
Q_t(a) = (sum of rewards received when action a was selected) / (number of times action a was selected)
The epsilon-greedy agent then selects the arm with the highest Q_t(a) with probability 1 - ε, and a random arm with probability ε. Over T time steps, approximately εT actions will have been chosen randomly, ensuring continued exploration.
Typical epsilon values for bandit problems range from 0.01 to 0.1. Sutton and Barto demonstrated in their 10-armed testbed that ε = 0.1 outperforms ε = 0.01 in the short run due to faster exploration, but ε = 0.01 eventually achieves better performance because it exploits more frequently after sufficient learning.
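A minimal sketch of an epsilon-greedy bandit agent, using the incremental form of the sample-average update, might look like the following. The class name, reward distribution, and run length are illustrative assumptions rather than a reproduction of Sutton and Barto's testbed.

```python
import numpy as np

class EpsilonGreedyBandit:
    """Minimal epsilon-greedy agent for a k-armed bandit (illustrative sketch)."""

    def __init__(self, k, epsilon=0.1):
        self.epsilon = epsilon
        self.q = np.zeros(k)   # sample-average estimates Q_t(a)
        self.n = np.zeros(k)   # number of times each arm was pulled

    def select_arm(self):
        if np.random.random() < self.epsilon:
            return np.random.randint(len(self.q))   # explore: random arm
        return int(np.argmax(self.q))               # exploit: best-known arm

    def update(self, arm, reward):
        # Incremental form of the sample-average estimate
        self.n[arm] += 1
        self.q[arm] += (reward - self.q[arm]) / self.n[arm]

# Example run on a toy 10-armed problem with Gaussian rewards
true_means = np.random.normal(0.0, 1.0, size=10)
agent = EpsilonGreedyBandit(k=10, epsilon=0.1)
for _ in range(1000):
    arm = agent.select_arm()
    reward = np.random.normal(true_means[arm], 1.0)
    agent.update(arm, reward)
```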
Q-learning is an off-policy temporal-difference (TD) control algorithm that uses the epsilon-greedy policy for action selection during training. At each step, the agent selects an action a in state s using the epsilon-greedy policy, observes the reward r and next state s', and updates its value estimate with:
Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') - Q(s, a)]
Here, α is the learning rate and γ is the discount factor. The key observation is that while the agent selects actions using the epsilon-greedy policy (the behavior policy), the Q-value update uses the maximum over all actions (the target policy is greedy). This off-policy nature allows Q-learning to learn the optimal policy even while following an exploratory behavior policy.
At the beginning of training, ε is typically set high (often 1.0 or close to it) so the agent explores broadly. As training progresses and Q-value estimates become more accurate, ε is reduced so the agent increasingly exploits its learned knowledge.
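The following sketch shows a tabular Q-learning training loop with an epsilon-greedy behavior policy and a decaying epsilon. The environment interface (n_states, n_actions, reset(), step()) and the hyperparameter values are assumptions made for illustration, not part of any specific library.

```python
import numpy as np

def train_q_learning(env, n_episodes=500, alpha=0.1, gamma=0.99,
                     epsilon_start=1.0, epsilon_min=0.05, decay=0.995):
    """Tabular Q-learning with an epsilon-greedy behavior policy (sketch).

    Assumes a small environment exposing reset() -> state and
    step(action) -> (next_state, reward, done), with integer states/actions.
    """
    Q = np.zeros((env.n_states, env.n_actions))
    epsilon = epsilon_start
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection (behavior policy)
            if np.random.random() < epsilon:
                action = np.random.randint(env.n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            # Off-policy TD update: the target uses the greedy (max) action
            td_target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state, action] += alpha * (td_target - Q[state, action])
            state = next_state
        # Decay epsilon after each episode to shift toward exploitation
        epsilon = max(epsilon_min, epsilon * decay)
    return Q
```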
The Deep Q-Network (DQN) algorithm, introduced by Mnih et al. in 2013 and refined in 2015, uses epsilon-greedy as its primary exploration mechanism. In the original DQN paper that achieved human-level performance on Atari 2600 games, the authors used the following schedule:
| Parameter | Value |
|---|---|
| Starting epsilon | 1.0 |
| Final epsilon | 0.1 |
| Annealing period | 1,000,000 frames |
| Evaluation epsilon | 0.05 |
| Total training frames | 10,000,000 |
During training, epsilon was linearly annealed from 1.0 to 0.1 over the first million frames, then held constant at 0.1 for the remainder of training. During evaluation, a lower epsilon of 0.05 was used to allow a small amount of stochasticity while predominantly exploiting the learned policy.
This linear annealing schedule became a standard practice in deep RL. The initial period of high exploration helps populate the experience replay buffer with diverse transitions, which is critical for stable training of the neural network function approximator.
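A minimal sketch of this frame-based linear annealing schedule, using the values from the table above (the function name is illustrative):

```python
def dqn_epsilon(frame, eps_start=1.0, eps_final=0.1, anneal_frames=1_000_000):
    """Linearly anneal epsilon over the first `anneal_frames` frames,
    then hold it constant (sketch of the schedule described above)."""
    fraction = min(frame / anneal_frames, 1.0)
    return eps_start + fraction * (eps_final - eps_start)

# dqn_epsilon(0) == 1.0, dqn_epsilon(500_000) == 0.55, dqn_epsilon(2_000_000) == 0.1
```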
Rather than using a fixed value of epsilon throughout training, it is common practice to decay epsilon over time. The rationale is that the agent should explore more during early training when it knows little about the environment, and exploit more as its estimates improve. Several decay strategies are used in practice:
Linear decay: epsilon decreases at a constant rate per time step:
ε_t = max(ε_min, ε_start - (ε_start - ε_min) * t / T)
where ε_start is the initial epsilon, ε_min is the minimum epsilon, t is the current time step, and T is the total number of annealing steps. This is the schedule used in the original DQN paper.
Exponential decay: epsilon decreases multiplicatively:
ε_t = max(ε_min, ε_start * d^t)
where d is a decay factor slightly less than 1 (for example, 0.995 or 0.999). Exponential decay reduces epsilon rapidly at first and more slowly later, which can be beneficial when the agent needs to transition quickly from exploration to exploitation.
Inverse decay: epsilon decreases in proportion to the time step:
ε_t = 1 / t
This schedule satisfies the theoretical conditions for Greedy in the Limit with Infinite Exploration (GLIE), which guarantees convergence of Monte Carlo control methods. Under GLIE, every state-action pair must be explored infinitely many times, and the policy must converge to a greedy policy in the limit.
Reward-based decay, introduced by Aakash et al. (2019), ties epsilon reduction to the agent's performance: epsilon is decreased only when the agent crosses a reward threshold, providing evidence that learning has occurred. This avoids premature exploitation when the agent has not yet gathered sufficient information.
| Decay Schedule | Formula | Characteristics |
|---|---|---|
| Constant | ε_t = ε | Simple; no adaptation over time |
| Linear | ε_t = ε_start - (ε_start - ε_min) * t / T | Steady decrease; easy to tune |
| Exponential | ε_t = ε_start * d^t | Fast early decay; slow later decay |
| Inverse (1/t) | ε_t = 1 / t | Satisfies GLIE conditions; can be slow |
| Reward-based | Decay on reward thresholds | Adaptive; tied to agent performance |
The choice of epsilon depends on the problem, the stage of training, and the algorithm. The following table summarizes common values used across different settings:
| Setting | Typical ε Value | Notes |
|---|---|---|
| Multi-armed bandits | 0.01 to 0.1 | Lower values after sufficient exploration |
| Tabular Q-learning (early) | 0.5 to 1.0 | High exploration during initial learning |
| Tabular Q-learning (late) | 0.01 to 0.1 | Shift toward exploitation |
| DQN training (start) | 1.0 | Fully random to fill replay buffer |
| DQN training (final) | 0.01 to 0.1 | Mostly exploiting the learned policy |
| DQN evaluation | 0.05 | Small randomness for robustness |
| Production / deployment | 0.0 to 0.05 | Mostly or fully greedy |
Using a constant epsilon is simpler but has drawbacks. If ε is set too high, the agent continues to take many random actions even after it has learned a good policy, degrading performance. If ε is set too low, the agent may not explore enough early on and can get stuck with poor estimates.
A decaying epsilon addresses this by starting with high exploration and gradually reducing it. The tradeoff is that the decay schedule itself introduces additional hyperparameters (the starting value, the minimum value, and the decay rate) that must be tuned for each environment.
Several alternative exploration strategies exist, each with different tradeoffs relative to epsilon-greedy:
Instead of choosing between a random action and the greedy action, softmax exploration samples actions from a Boltzmann distribution over Q-values:
P(a|s) = exp(Q(s,a) / τ) / Σ_b exp(Q(s,b) / τ)
The temperature parameter τ controls exploration: high τ produces near-uniform selection, while low τ concentrates probability on the best action. Unlike epsilon-greedy, softmax exploration considers the relative values of actions, giving higher probability to actions with higher Q-values even during exploration. However, tuning τ can be more difficult than tuning ε.
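A minimal sketch of Boltzmann action selection, assuming Q-value estimates stored in an array (the function name is illustrative):

```python
import numpy as np

def softmax_action(q_values, tau=1.0):
    """Sample an action from a Boltzmann distribution over Q-values (sketch)."""
    q_values = np.asarray(q_values, dtype=float)
    # Subtract the max before exponentiating for numerical stability
    prefs = (q_values - np.max(q_values)) / tau
    probs = np.exp(prefs) / np.sum(np.exp(prefs))
    return int(np.random.choice(len(q_values), p=probs))
```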
UCB algorithms select the action that maximizes an upper confidence bound on the estimated value:
a_t = argmax_a [Q_t(a) + c * sqrt(ln(t) / N_t(a))]
where N_t(a) is the number of times action a has been selected and c is an exploration constant. UCB is guided by the principle of "optimism in the face of uncertainty" and naturally explores less-visited actions without requiring a separate randomization step. UCB methods often outperform epsilon-greedy in bandit problems because they direct exploration toward uncertain actions rather than exploring uniformly at random.
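A minimal sketch of UCB action selection in the bandit setting, assuming arrays of value estimates and visit counts (the function name and the handling of untried arms are illustrative choices):

```python
import numpy as np

def ucb_action(q_values, counts, t, c=2.0):
    """Select an action by upper confidence bound (sketch).

    q_values: current value estimates Q_t(a)
    counts:   N_t(a), number of times each action has been selected
    t:        current time step (1-indexed)
    c:        exploration constant
    """
    q_values = np.asarray(q_values, dtype=float)
    counts = np.asarray(counts, dtype=float)
    # Untried actions get an infinite bonus so each is tried at least once
    bonus = np.where(counts > 0,
                     c * np.sqrt(np.log(t) / np.maximum(counts, 1)),
                     np.inf)
    return int(np.argmax(q_values + bonus))
```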
Thompson sampling maintains a Bayesian posterior distribution over the expected reward of each action. At each step, it samples a value from each action's posterior and selects the action with the highest sampled value. This approach naturally balances exploration and exploitation: actions with uncertain value estimates have wider posteriors and are more likely to produce high samples. Thompson sampling often provides strong empirical performance and has theoretical regret bounds, but it requires maintaining and sampling from posterior distributions, which adds computational complexity.
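For the common case of Bernoulli-reward bandits with Beta posteriors, a minimal Thompson sampling sketch might look like this (the Beta(1, 1) prior and the function name are illustrative assumptions):

```python
import numpy as np

def thompson_sample_action(successes, failures):
    """Thompson sampling for Bernoulli bandits with Beta(1, 1) priors (sketch).

    successes[a], failures[a]: counts of observed outcomes for each arm.
    """
    successes = np.asarray(successes, dtype=float)
    failures = np.asarray(failures, dtype=float)
    # Draw one sample from each arm's posterior and pick the best
    samples = np.random.beta(successes + 1, failures + 1)
    return int(np.argmax(samples))

# After pulling `arm` and observing binary reward r:
#   successes[arm] += r; failures[arm] += 1 - r
```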
NoisyNet, introduced by Fortunato et al. (2017), replaces epsilon-greedy exploration with learned parametric noise added directly to the neural network weights. The parameters of the noise are learned through gradient descent alongside the network weights. A key advantage of NoisyNet is that exploration becomes state-dependent: the degree and direction of exploration vary based on the current state, unlike epsilon-greedy where exploration is state-independent. NoisyNet has been shown to substantially improve performance over epsilon-greedy in DQN, Dueling DQN, and A3C agents across a wide range of Atari games.
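The following is a deliberately simplified sketch of a noisy linear layer in the spirit of NoisyNet, using independent Gaussian noise and fixed noise scales; the original method uses factorised Gaussian noise and learns the noise parameters by gradient descent, which is omitted here.

```python
import numpy as np

class NoisyLinear:
    """Simplified noisy linear layer inspired by NoisyNet (illustrative sketch)."""

    def __init__(self, in_features, out_features, sigma_init=0.5):
        self.w_mu = np.random.randn(out_features, in_features) * 0.1
        self.b_mu = np.zeros(out_features)
        self.w_sigma = np.full((out_features, in_features), sigma_init / np.sqrt(in_features))
        self.b_sigma = np.full(out_features, sigma_init / np.sqrt(in_features))

    def forward(self, x):
        # Sample fresh noise on each forward pass; exploration comes from
        # the perturbed weights rather than from an epsilon schedule.
        w = self.w_mu + self.w_sigma * np.random.randn(*self.w_mu.shape)
        b = self.b_mu + self.b_sigma * np.random.randn(*self.b_mu.shape)
        return x @ w.T + b
```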
| Strategy | Exploration Type | Key Parameter | Strengths | Weaknesses |
|---|---|---|---|---|
| Epsilon-greedy | Uniform random | ε (epsilon) | Simple; easy to implement and tune | Explores uniformly; state-independent |
| Softmax / Boltzmann | Probability-weighted | τ (temperature) | Considers action values during exploration | Temperature tuning can be difficult |
| UCB | Optimism-based | c (confidence) | Directed exploration; strong theory | Harder to apply in large state spaces |
| Thompson Sampling | Posterior sampling | Prior distribution | Strong empirical and theoretical results | Requires Bayesian inference |
| NoisyNet | Learned noise | Noise parameters | State-dependent; no epsilon schedule needed | More complex; adds parameters |
A greedy policy is the special case where ε = 0. The agent always selects the action with the highest estimated value, performing no exploration at all. While a greedy policy maximizes expected immediate reward based on current estimates, it can fail to discover better actions and may converge to suboptimal behavior.
In practice, a greedy policy is typically used only during evaluation or deployment, after the agent has been trained with an exploratory policy. During training, even a very small amount of exploration (such as ε = 0.01) can significantly improve the quality of the learned policy.
Despite its popularity, the epsilon-greedy strategy has several notable limitations:
Uniform random exploration: When the agent explores, it selects actions uniformly at random. This means it does not prioritize exploring uncertain or promising actions. In environments with large action spaces, random exploration can be extremely inefficient.
State-independent exploration: The probability of exploring is the same regardless of the current state. The agent explores equally in well-understood states and poorly-understood states, which wastes exploration budget.
No exploration memory: Each exploration step is independent. The agent does not form coherent exploratory strategies that span multiple time steps, which can make it difficult to reach distant or hard-to-find rewards (the "sparse reward" problem).
Sensitivity to ε and decay schedule: Performance can be sensitive to the choice of ε and the decay schedule. Tuning these hyperparameters often requires experimentation, and optimal values may differ significantly across environments.
Continued suboptimal actions: Even with a small ε, the agent periodically takes random actions during deployment, which can be undesirable in safety-critical applications.
Large action spaces: In continuous or very large discrete action spaces, the probability of randomly selecting any particular informative action becomes vanishingly small, making epsilon-greedy exploration impractical without modifications.
A basic epsilon-greedy action selection in Python can be implemented as follows:
```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon, n_actions):
    """Select an action using the epsilon-greedy policy."""
    if np.random.random() < epsilon:
        # Explore: choose a random action
        return np.random.randint(n_actions)
    else:
        # Exploit: choose the action with the highest Q-value
        return np.argmax(Q[state])
```
For epsilon decay, a common pattern during training is:
```python
epsilon = max(epsilon_min, epsilon * decay_rate)  # Exponential decay
# or
epsilon = max(epsilon_min, epsilon_start - step * (epsilon_start - epsilon_min) / total_steps)  # Linear decay
```
Imagine you are in a new city and want to find the best ice cream shop. You could try a different shop every day (exploring), or you could keep going back to the one you liked best so far (exploiting). The epsilon-greedy approach says: most of the time, go to your favorite shop. But once in a while, pick a random shop you have not tried. Maybe you will discover something even better! As you try more shops over time, you become more confident about which one is best, so you visit random shops less and less often. That is epsilon decay: you explore a lot at first when you do not know much, and gradually settle on your favorite as you learn.