Tabular Q-learning is a reinforcement learning algorithm that stores action-value estimates in a lookup table, known as a Q-table, rather than approximating them with a function such as a neural network. Introduced by Chris Watkins in 1989, Q-learning is one of the most widely studied algorithms in machine learning, and its tabular form serves as the foundational version from which more advanced methods like Deep Q-Networks (DQN) were later developed. The algorithm learns an optimal policy by iteratively updating estimates of expected cumulative reward for each state-action pair, without requiring a model of the environment.
Imagine you are exploring a new building with many rooms. Each room has several doors, and behind each door you might find candy or nothing at all. You carry a notebook where you write down how much candy you expect to find if you go through a specific door in a specific room. At first, your notebook is blank, so you just guess. Every time you go through a door and see what happens, you update your notebook with better numbers. After exploring for a long time, your notebook becomes really accurate, and you can just look at it to know the best door to pick in every room. That notebook is the Q-table, and the process of updating it is tabular Q-learning.
Chris Watkins introduced Q-learning in his 1989 doctoral dissertation, "Learning from Delayed Rewards," at the University of Cambridge. The formal convergence proof was published three years later by Watkins and Peter Dayan in a 1992 paper in the journal Machine Learning. The paper demonstrated that Q-learning converges to optimal action values with probability 1, provided that all state-action pairs are visited infinitely often and the action values are stored in a discrete table.
A conceptually similar system predates Watkins' work. In 1981, Stevo Bozinovski proposed the Crossbar Adaptive Array (CAA), which used a memory matrix W = ||w(a,s)|| that was functionally equivalent to what would later be called the Q-table. Bozinovski's system addressed "delayed reinforcement learning" in a framework that paralleled Q-learning's structure.
Q-learning attracted substantial attention in the decades that followed. In 2013, researchers at DeepMind combined Q-learning with deep convolutional neural networks to create the Deep Q-Network (DQN), which learned to play Atari 2600 games directly from pixel inputs, matching or surpassing expert human play on several of them. This result demonstrated that the principles underlying Q-learning could scale to high-dimensional problems when paired with function approximation.
Tabular Q-learning operates within the framework of a Markov decision process (MDP). An MDP is defined by a tuple (S, A, P, R, gamma), where:
| Symbol | Name | Description |
|---|---|---|
| S | State space | A finite set of all possible states the agent can occupy |
| A | Action space | A finite set of all possible actions the agent can take |
| P(s'|s, a) | Transition function | The probability of moving to state s' after taking action a in state s |
| R(s, a) | Reward function | The immediate reward received after taking action a in state s |
| gamma | Discount factor | A value in [0, 1) that determines how much future rewards are weighted relative to immediate rewards |
The Markov property states that the future state depends only on the current state and action, not on the sequence of states and actions that preceded it. This memoryless property is what makes the MDP framework tractable for algorithms like Q-learning.
The goal in an MDP is to find a policy (a mapping from states to actions) that maximizes the expected cumulative discounted reward, also called the return:
G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
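For example, with gamma = 0.9 and a reward of +1 received at every step indefinitely, the return is G_t = 1 + 0.9 + 0.81 + ... = 1 / (1 - 0.9) = 10; with gamma = 0.5 the same reward stream is worth only 1 / (1 - 0.5) = 2, which shows how the discount factor limits the influence of distant rewards.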
The Q-table is a two-dimensional lookup table with one row for each state and one column for each action. Each cell Q(s, a) holds the agent's current estimate of the expected return from taking action a in state s and then following the optimal policy thereafter. This value is called the action-value, or Q-value.
For an environment with |S| states and |A| actions, the Q-table contains |S| x |A| entries. In small environments (for example, a 4x4 grid world with 4 movement actions), this results in a manageable table of 64 entries. In larger environments, the table can grow quickly; an environment with 10,000 states and 10 actions produces a table with 100,000 entries.
The Q-table is typically initialized with all zeros, though other initialization strategies exist. Optimistic initialization sets all Q-values to a high number, which encourages the agent to explore broadly early in training because every action initially looks promising, and the agent is "disappointed" by actual rewards, driving it to try other actions. Pessimistic initialization does the opposite, setting values low.
| Initialization strategy | Initial Q-values | Effect on behavior |
|---|---|---|
| Zero initialization | Q(s, a) = 0 for all s, a | Neutral starting point; relies on exploration strategy |
| Optimistic initialization | Q(s, a) = large positive value | Encourages broad exploration early in training |
| Pessimistic initialization | Q(s, a) = large negative value | Encourages conservative behavior; agent avoids untested actions |
| Random initialization | Q(s, a) = random value | Breaks symmetry but can introduce bias |
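As a minimal sketch of these strategies (the table sizes and the optimistic constant are illustrative, not prescribed by the text), the initializations can be set up directly with NumPy:

```python
import numpy as np

n_states, n_actions = 16, 4  # e.g., a 4x4 grid world with 4 movement actions

# Zero initialization: neutral starting point
q_zero = np.zeros((n_states, n_actions))

# Optimistic initialization: every action initially looks promising,
# so the agent is "disappointed" by real rewards and keeps exploring
q_optimistic = np.full((n_states, n_actions), 10.0)

# Random initialization: breaks ties/symmetry but can introduce bias
rng = np.random.default_rng(seed=0)
q_random = rng.uniform(-0.01, 0.01, size=(n_states, n_actions))
```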
Q-learning is a model-free, off-policy, temporal difference (TD) learning algorithm. Each of these properties has a specific meaning: it is model-free because it learns directly from sampled transitions without estimating the environment's transition or reward functions; it is off-policy because it learns the value of the greedy (optimal) policy while following a different, more exploratory behavior policy; and it is a TD method because it updates its estimates after every step by bootstrapping from its own current estimate of the next state's value, rather than waiting for the final outcome of an episode.
The Q-value for a state-action pair is updated after each transition using the following rule:
Q(s, a) <- Q(s, a) + alpha * [R + gamma * max_a' Q(s', a') - Q(s, a)]
This can also be written as:
Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * [R + gamma * max_a' Q(s', a')]
The terms in the update rule are:
| Term | Name | Role |
|---|---|---|
| Q(s, a) | Current Q-value | The existing estimate for this state-action pair |
| alpha | Learning rate | Controls how much the new information overrides the old estimate (0 < alpha <= 1) |
| R | Immediate reward | The reward received after taking action a in state s |
| gamma | Discount factor | Determines the weight of future rewards (0 <= gamma < 1) |
| max_a' Q(s', a') | Best next-state value | The highest Q-value achievable from the next state s' |
| R + gamma * max_a' Q(s', a') | TD target | The new estimate of the return |
| TD target - Q(s, a) | TD error | The difference between the new estimate and the current estimate |
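In code, the update is a single line once these terms are in place. The sketch below assumes a NumPy Q-table indexed as q_table[state, action]; the function name and default hyperparameters are illustrative:

```python
import numpy as np

def q_update(q_table, s, a, reward, s_next, alpha=0.1, gamma=0.99, terminal=False):
    """Apply one tabular Q-learning update for the transition (s, a, reward, s_next)."""
    # When s_next is terminal there is no future value to bootstrap from
    best_next = 0.0 if terminal else q_table[s_next].max()
    td_target = reward + gamma * best_next   # R + gamma * max_a' Q(s', a')
    td_error = td_target - q_table[s, a]     # difference from the current estimate
    q_table[s, a] += alpha * td_error        # move the estimate toward the target
    return td_error
```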
The update rule is derived from the Bellman equation for the optimal action-value function. The Bellman optimality equation states that the true optimal Q-value satisfies:
Q*(s, a) = E[R + gamma * max_a' Q*(s', a')]
Q-learning applies stochastic approximation to this equation, replacing the expectation with individual samples and iteratively refining the estimate.
The full tabular Q-learning algorithm proceeds as follows:
Initialize Q(s, a) = 0 for all s in S, a in A
For each episode:
Initialize state s
For each step of the episode:
Choose action a from s using an exploration policy (e.g., epsilon-greedy)
Take action a, observe reward R and next state s'
Update: Q(s, a) <- Q(s, a) + alpha * [R + gamma * max_a' Q(s', a') - Q(s, a)]
s <- s'
Until s is terminal
For terminal states, the Q-value is simply set to the observed reward, since there is no future state to consider. In other words, when s' is a terminal state, the update becomes Q(s, a) <- Q(s, a) + alpha * [R - Q(s, a)].
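Putting the loop and the terminal-state rule together, a minimal training function might look like the following. It assumes a Gymnasium-style discrete environment, i.e., reset() returns (state, info) and step() returns (state, reward, terminated, truncated, info); the function name and hyperparameter defaults are assumptions for illustration:

```python
import numpy as np

def train_q_learning(env, episodes=5000, alpha=0.1, gamma=0.99,
                     eps_start=1.0, eps_min=0.05, eps_decay=0.999):
    """Tabular Q-learning with epsilon-greedy exploration."""
    q = np.zeros((env.observation_space.n, env.action_space.n))
    eps = eps_start
    rng = np.random.default_rng()

    for _ in range(episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: random action with probability eps, otherwise greedy
            if rng.random() < eps:
                action = int(rng.integers(env.action_space.n))
            else:
                action = int(q[state].argmax())

            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # The TD target reduces to the immediate reward when s' is terminal
            best_next = 0.0 if terminated else q[next_state].max()
            q[state, action] += alpha * (reward + gamma * best_next - q[state, action])
            state = next_state

        eps = max(eps_min, eps * eps_decay)
    return q
```

With Gymnasium installed, for instance, q = train_q_learning(gym.make("FrozenLake-v1")) would fill a 16 x 4 table, though the choice of environment here is only an example.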
A core challenge in reinforcement learning is the tension between exploration (trying new actions to discover their values) and exploitation (choosing the action with the highest known Q-value). Without sufficient exploration, the agent may converge to a suboptimal policy because it never discovers better actions. Several strategies address this tradeoff.
The epsilon-greedy strategy is the most common exploration method used with tabular Q-learning. At each step, the agent selects a random action with probability epsilon and selects the action with the highest Q-value with probability (1 - epsilon).
Epsilon is typically started high (near 1) and decayed as training progresses, so the agent explores broadly at first and relies increasingly on its improving estimates later. Common decay schedules include:
| Decay schedule | Formula | Behavior |
|---|---|---|
| Linear decay | epsilon_t = epsilon_0 - (epsilon_0 - epsilon_min) * t / T | Steady decrease over T steps |
| Exponential decay | epsilon_t = epsilon_0 * decay_rate^t | Rapid early decrease that slows over time |
| Inverse decay | epsilon_t = epsilon_0 / (1 + decay_rate * t) | Gradual decrease with diminishing rate |
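Illustrative implementations of the three schedules in the table (parameter names are assumptions):

```python
def linear_decay(eps0, eps_min, t, total_steps):
    """Steady decrease from eps0 toward eps_min over total_steps steps."""
    return max(eps_min, eps0 - (eps0 - eps_min) * t / total_steps)

def exponential_decay(eps0, decay_rate, t):
    """Rapid early decrease that slows over time: eps0 * decay_rate**t."""
    return eps0 * decay_rate ** t

def inverse_decay(eps0, decay_rate, t):
    """Gradual decrease with a diminishing rate: eps0 / (1 + decay_rate * t)."""
    return eps0 / (1.0 + decay_rate * t)
```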
Boltzmann exploration uses a softmax function over Q-values to assign action probabilities. Instead of choosing randomly with fixed probability, it selects actions with higher Q-values more frequently. The temperature parameter tau controls the degree of exploration: a high temperature produces near-uniform random selection, while a low temperature concentrates probability on the greedy action.
P(a|s) = exp(Q(s, a) / tau) / sum_a' exp(Q(s, a') / tau)
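A small sketch of Boltzmann action selection (the max-subtraction is a standard numerical-stability trick, not part of the formula above):

```python
import numpy as np

def boltzmann_action(q_values, tau, rng=None):
    """Sample an action index from a softmax distribution over one state's Q-values."""
    rng = rng or np.random.default_rng()
    # Subtracting the maximum before exponentiating avoids overflow
    # without changing the resulting probabilities
    prefs = (np.asarray(q_values) - np.max(q_values)) / tau
    probs = np.exp(prefs)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```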
UCB-based exploration adds a bonus to Q-values based on how infrequently a state-action pair has been visited. Actions that have been tried fewer times receive a larger exploration bonus, ensuring that under-explored actions are selected more often. The selection rule is:
a = argmax_a [Q(s, a) + c * sqrt(ln(t) / N(s, a))]
where N(s, a) is the visit count for the pair (s, a), t is the total number of steps taken, and c is a constant that controls the exploration weight.
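A possible implementation of the UCB rule, assuming per-pair visit counts are tracked alongside the Q-table (the handling of never-tried actions is a common convention, not specified above):

```python
import numpy as np

def ucb_action(q_values, visit_counts, t, c=1.0):
    """Select the action maximizing Q(s, a) plus a count-based exploration bonus."""
    counts = np.asarray(visit_counts, dtype=float)
    # Actions that have never been tried receive an infinite bonus,
    # which forces each action to be sampled at least once
    bonus = np.where(counts > 0,
                     c * np.sqrt(np.log(max(t, 1)) / np.maximum(counts, 1.0)),
                     np.inf)
    return int(np.argmax(np.asarray(q_values) + bonus))
```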
As mentioned earlier, setting initial Q-values to a high number naturally encourages exploration. The agent tries each action at least once because its initial optimistic estimate exceeds any real reward it can receive, prompting it to seek out new state-action pairs.
Tabular Q-learning has three primary hyperparameters that must be tuned for effective learning.
The learning rate alpha determines how quickly the agent incorporates new information. A value of alpha = 1 causes the agent to completely replace the old Q-value with the new estimate, which is optimal in fully deterministic environments. In stochastic environments, alpha should be decreased over time. The convergence theorem requires that the learning rates alpha_t satisfy sum_t alpha_t = infinity and sum_t alpha_t^2 < infinity.
These are standard Robbins-Monro conditions from stochastic approximation theory. In practice, a fixed learning rate (often between 0.1 and 0.5) is commonly used and works well, even though it technically violates the decreasing-rate condition.
The discount factor gamma determines the relative importance of future rewards. A value close to 1 (such as 0.99) makes the agent far-sighted, placing high importance on rewards that occur many steps in the future. A value close to 0 makes the agent myopic, focusing almost entirely on the next immediate reward.
| Gamma value | Behavior | Suitable for |
|---|---|---|
| 0.0 | Completely myopic; only cares about immediate reward | Tasks where only the next step matters |
| 0.5 | Moderate discounting; future rewards matter but decay quickly | Short-horizon tasks |
| 0.9 | Far-sighted; rewards 10 steps ahead still have significant weight | Medium-horizon tasks |
| 0.99 | Very far-sighted; considers long-term consequences | Long-horizon tasks with delayed rewards |
| >= 1.0 | Not used; can cause Q-values to diverge | Not applicable |
If gamma equals or exceeds 1 and there is no guaranteed terminal state, the cumulative return can become infinite, causing the algorithm to diverge.
The epsilon parameter in the epsilon-greedy strategy controls how often the agent explores versus exploits. As discussed in the exploration strategies section, epsilon is typically decayed over the course of training.
Watkins and Dayan (1992) proved that tabular Q-learning converges to the optimal Q-values with probability 1, provided the following conditions are met: every state-action pair is visited infinitely often, the learning rates satisfy the Robbins-Monro conditions described above, rewards are bounded, and the Q-values are stored exactly in a discrete table.
The proof relies on the fact that the Bellman optimality operator is a contraction mapping with contraction coefficient gamma. When applied repeatedly, the operator brings any initial Q-value estimate closer to the true optimal Q-values. Each Q-learning update is a noisy, sampled version of this operator, and stochastic approximation theory guarantees that the noise averages out under the learning rate conditions above.
In practice, convergence can be slow for large state spaces because every state-action pair must be visited many times. The rate of convergence depends on the size of the state-action space, the exploration strategy, the learning rate schedule, and the stochasticity of the environment.
A contraction mapping is a function that brings points closer together. If T is the Bellman operator and Q_1, Q_2 are any two Q-value tables, then:
||T(Q_1) - T(Q_2)|| <= gamma * ||Q_1 - Q_2||
Since gamma < 1, repeated application of T reduces the distance between any Q-value table and the fixed point (the optimal Q*) by a factor of gamma at each step. This guarantees that the iteration converges to a unique fixed point.
Tabular Q-learning belongs to a family of value-based reinforcement learning algorithms. Understanding how it relates to other methods clarifies its strengths and limitations.
SARSA (State-Action-Reward-State-Action) is an on-policy variant of temporal difference learning. While Q-learning updates using the maximum Q-value in the next state (regardless of what action the agent actually takes next), SARSA updates using the Q-value of the action the agent actually selects in the next state.
| Property | Q-learning | SARSA |
|---|---|---|
| Policy type | Off-policy | On-policy |
| Update target | max_a' Q(s', a') | Q(s', a') where a' is the action actually taken |
| Learned policy | Optimal (greedy) | The policy being followed (e.g., epsilon-greedy) |
| Exploration safety | May learn risky policies if exploring near dangerous states | Learns safer policies that account for exploration |
| Convergence | Converges to Q* regardless of behavior policy | Converges to Q^pi for the current policy pi |
| Use case | When the training policy differs from the deployment policy | When the agent must behave safely during training |
A classic illustration is the cliff-walking problem: Q-learning learns the optimal path along the cliff edge (shorter but risky during exploration), while SARSA learns a safer path farther from the edge because it accounts for its own exploratory actions.
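The difference between the two algorithms reduces to how the bootstrap target is computed. A minimal sketch, assuming a NumPy Q-table indexed as q[state, action]:

```python
def q_learning_target(q, reward, s_next, gamma):
    """Off-policy target: bootstrap from the best action in s',
    regardless of which action the behavior policy takes next."""
    return reward + gamma * q[s_next].max()

def sarsa_target(q, reward, s_next, a_next, gamma):
    """On-policy target: bootstrap from the action a_next actually taken in s'."""
    return reward + gamma * q[s_next, a_next]
```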
The core difference is how Q-values are stored and computed. Tabular Q-learning stores exact values in a table, while DQN uses a deep neural network to approximate Q-values from high-dimensional inputs.
| Property | Tabular Q-learning | Deep Q-Network (DQN) |
|---|---|---|
| Q-value storage | Lookup table | Neural network |
| State space | Must be small and discrete | Can be large, continuous, or high-dimensional (e.g., images) |
| Action space | Must be discrete | Must be discrete |
| Convergence guarantee | Yes (under standard conditions) | No formal guarantee |
| Techniques required | Basic table updates | Experience replay, target networks, gradient descent |
| Sample efficiency | Visits to each (s, a) pair needed | Generalizes across similar states via network weights |
| Scalability | Fails with large state spaces | Handles millions of states via function approximation |
| Training stability | Stable (simple table update) | Can be unstable; requires careful tuning |
| Implementation complexity | Simple | Complex |
DQN introduced two key techniques to stabilize training: experience replay (storing past transitions in a replay buffer and sampling random mini-batches for training) and target networks (using a separate, slowly updated copy of the network to compute the TD target). These techniques address the instability that arises when combining nonlinear function approximation with bootstrapped TD updates.
Standard Q-learning can overestimate Q-values because the max operation in the update rule introduces a positive bias. When Q-value estimates are noisy (as they always are during learning), taking the maximum of noisy estimates tends to select an overestimated value.
Double Q-learning, proposed by Hado van Hasselt in 2010, addresses this by maintaining two independent Q-tables, Q_A and Q_B. At each update step, one table is used to select the best action, and the other is used to evaluate that action's value:
Q_A(s, a) <- Q_A(s, a) + alpha * [R + gamma * Q_B(s', argmax_a' Q_A(s', a')) - Q_A(s, a)]
By decoupling action selection from action evaluation, double Q-learning reduces the overestimation bias. This idea was later extended to Deep Q-Networks as Double DQN by van Hasselt, Guez, and Silver in 2016.
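A tabular sketch of the double Q-learning update. Following van Hasselt's scheme, one of the two tables is chosen at random to be updated on each step; the function name and argument order are illustrative:

```python
import numpy as np

def double_q_update(q_a, q_b, s, a, reward, s_next, alpha, gamma, terminal, rng=None):
    """One double Q-learning step: one table selects the action, the other evaluates it."""
    rng = rng or np.random.default_rng()
    # With probability 0.5, swap the roles of the two tables
    if rng.random() < 0.5:
        q_a, q_b = q_b, q_a
    if terminal:
        target = reward
    else:
        best_action = int(q_a[s_next].argmax())              # selection uses q_a
        target = reward + gamma * q_b[s_next, best_action]   # evaluation uses q_b
    q_a[s, a] += alpha * (target - q_a[s, a])
```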
Delayed Q-learning, introduced by Strehl, Li, Wiewiora, Langford, and Littman in 2006, delays updates to Q-values until there is a statistically significant number of observations. Instead of updating Q-values after every transition, the algorithm waits until it has collected enough samples to be confident in the update. This approach provides probably approximately correct (PAC) learning guarantees, meaning the algorithm is guaranteed to find a near-optimal policy with high probability after a bounded number of steps.
Monte Carlo methods also estimate action values without a model of the environment, but they differ from Q-learning in a key way. Monte Carlo methods wait until the end of an episode to compute the actual return and then use that return to update the value estimate. Q-learning, by contrast, updates after every single step using a bootstrapped estimate of the return. This makes Q-learning more sample-efficient in practice and allows it to be used in continuing (non-episodic) tasks.
| Property | Tabular Q-learning | Monte Carlo |
|---|---|---|
| Update timing | After every step (online) | After the episode ends |
| Bootstrapping | Yes (uses estimated future value) | No (uses actual return) |
| Bias | Biased estimates (due to bootstrapping) | Unbiased estimates |
| Variance | Lower variance | Higher variance |
| Episodic requirement | Works in both episodic and continuing tasks | Requires episodic tasks |
Consider a simple 3x3 grid world where an agent starts in the top-left cell and must reach the bottom-right cell (the goal). The agent receives a reward of +10 for reaching the goal, -1 for each step taken, and -5 for entering a trap cell.
[S][ ][ ]
[ ][T][ ]
[ ][ ][G]
S = start, T = trap, G = goal. The agent can move up, down, left, or right. Moving into a wall keeps the agent in its current cell.
The Q-table has 9 states (one per cell) and 4 actions (up, down, left, right), resulting in 36 entries. Training proceeds roughly as follows: early episodes consist mostly of random moves, and the updates mainly record the -1 step penalty and the -5 trap penalty. The first time the agent happens to reach the goal, the Q-value of the final action into the goal cell jumps toward +10. Over subsequent episodes, that value propagates backward one cell per visit through the TD updates, until the entries near the start also point along a path to the goal. As exploration decreases, the greedy policy settles on a short path that avoids the trap.
This illustrates how the Q-table gradually encodes information about which actions are best in each state, even though the agent starts with no knowledge of the environment.
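A compact, self-contained sketch of this grid world and a training run. Details the text leaves open (whether the trap ends the episode, whether the goal reward replaces or adds to the step penalty, and the hyperparameters) are assumptions here:

```python
import numpy as np

N, START, TRAP, GOAL = 3, 0, 4, 8             # states 0..8, row-major
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    """Deterministic transition; moving into a wall leaves the agent in place."""
    row, col = divmod(state, N)
    dr, dc = ACTIONS[action]
    nr, nc = min(max(row + dr, 0), N - 1), min(max(col + dc, 0), N - 1)
    s_next = nr * N + nc
    if s_next == GOAL:
        return s_next, 10.0, True    # +10 for reaching the goal (episode ends)
    if s_next == TRAP:
        return s_next, -5.0, False   # -5 for entering the trap
    return s_next, -1.0, False       # -1 per step otherwise

q = np.zeros((N * N, len(ACTIONS)))
alpha, gamma, eps = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)

for _ in range(2000):
    s, done = START, False
    while not done:
        a = int(rng.integers(4)) if rng.random() < eps else int(q[s].argmax())
        s_next, r, done = step(s, a)
        best_next = 0.0 if done else q[s_next].max()
        q[s, a] += alpha * (r + gamma * best_next - q[s, a])
        s = s_next

# Print the greedy action learned for each cell
print(np.array(["U", "D", "L", "R"])[q.argmax(axis=1)].reshape(N, N))
```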
Tabular Q-learning has several well-known limitations that restrict its applicability.
The Q-table grows as the product of the number of states and the number of actions. For environments with high-dimensional state spaces, the table becomes too large to store in memory and too sparse for the agent to visit every entry a sufficient number of times. For example, a game like chess has an estimated 10^47 possible board positions, making a tabular approach completely impractical.
Tabular Q-learning requires discrete state and action spaces. Environments with continuous variables (such as the position and velocity of a robot arm) cannot be represented directly. Discretizing continuous spaces into bins is possible but leads to exponential growth in the number of cells, and the discretization introduces approximation errors. Function approximation methods (such as DQN) address this limitation.
Each entry in the Q-table is updated independently. The algorithm cannot recognize that two states are similar and transfer knowledge between them. If the agent learns that moving right is good in cell (0,0), that information does not help it in cell (0,1). Neural network-based approaches naturally generalize across similar states because similar inputs produce similar outputs through the network's learned weights.
Even when the state space is technically discrete and finite, convergence requires many visits to every state-action pair. In large environments, this translates to very long training times. The exploration problem becomes harder as the environment grows, because the probability of randomly stumbling upon rewarding transitions decreases.
Despite its limitations, tabular Q-learning remains useful in practice and as an educational tool.
Grid world problems are the most common testbed for tabular Q-learning. The agent navigates a grid, avoids obstacles or traps, and seeks a goal. These environments have small, discrete state and action spaces that are perfectly suited to a tabular approach.
Tabular Q-learning has been applied to simple games such as Tic-Tac-Toe, where the state space (possible board configurations) is small enough to store in a table. The agent learns to play optimally through self-play, updating Q-values after each game.
In robotics settings where the state space can be meaningfully discretized (for example, a robot navigating a small set of waypoints), tabular Q-learning provides a simple and reliable learning algorithm with convergence guarantees.
Tabular Q-learning has been applied to resource allocation problems where the set of possible resource configurations and allocation decisions is small and discrete, such as simple scheduling and inventory management tasks.
Tabular Q-learning is frequently used as an introductory algorithm in reinforcement learning courses and textbooks because it is simple to implement, easy to visualize (the Q-table can be inspected directly), and demonstrates the core ideas of value-based reinforcement learning without the complexity of function approximation.
When implementing tabular Q-learning, several practical details affect performance.
The Q-table can be implemented as a two-dimensional array (if states and actions are integers), a dictionary or hash map (if states are more complex but hashable), or a combination of these. For environments with sparse state spaces (where most states are never visited), a dictionary is more memory-efficient than a full array.
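A sketch of the dictionary-based approach, using a defaultdict so that rows are created lazily only for states the agent actually visits (the state tuple and the numbers below are purely illustrative):

```python
from collections import defaultdict
import numpy as np

n_actions = 4

# Every unseen state lazily receives a zero-initialized row of action values
q_table = defaultdict(lambda: np.zeros(n_actions))

state = (2, 3)                          # any hashable state representation works
action = int(q_table[state].argmax())   # greedy action (index 0 for a fresh state)
q_table[state][action] += 0.1 * (1.0 - q_table[state][action])  # example update toward reward 1
```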
The algorithm must handle terminal states correctly. When the agent reaches a terminal state, there is no next state to evaluate, so the TD target simplifies to just the immediate reward R. Failing to handle this case correctly can cause the Q-values to be incorrect.
The design of the reward function significantly affects how quickly and reliably the agent learns. Sparse rewards (where the agent only receives a reward at the goal) can make learning slow because the agent must randomly reach the goal before it receives any useful signal. Reward shaping adds intermediate rewards to guide the agent, but it must be done carefully to avoid changing the optimal policy.
In stochastic environments, using a fixed learning rate can prevent exact convergence to the optimal Q-values. A common compromise is to use a fixed learning rate that is small enough to produce stable estimates, even though it technically results in a non-zero steady-state error.
The Q-learning update rule is directly connected to the Bellman equation, which is a fundamental equation in dynamic programming and reinforcement learning. The Bellman optimality equation for the action-value function states:
Q*(s, a) = sum_s' P(s'|s, a) * [R(s, a, s') + gamma * max_a' Q*(s', a')]
This equation says that the true optimal Q-value for any state-action pair equals the expected immediate reward plus the discounted value of the best action in the next state, averaged over all possible next states.
Q-learning approximates this equation by replacing the expectation (the weighted sum over all next states) with a single sampled transition. Each time the agent takes action a in state s, it observes one specific next state s' and reward R, and uses that sample to nudge the Q-value estimate toward the TD target. Over many samples, these single-sample updates average out to approximate the true expectation, and the Q-values converge to the solution of the Bellman optimality equation.
This connection to the Bellman equation is also the basis for the convergence proof. The Bellman optimality operator T, defined as:
(TQ)(s, a) = E[R + gamma * max_a' Q(s', a')]
is a contraction mapping in the infinity norm with contraction factor gamma. The unique fixed point of T is the optimal Q-value function Q*. Q-learning performs stochastic fixed-point iteration on this operator, and the Robbins-Monro conditions on the learning rate ensure that the stochastic noise does not prevent convergence.
Several extensions to basic tabular Q-learning have been developed to address specific challenges.
Eligibility traces combine the single-step TD update of Q-learning with the multi-step returns of Monte Carlo methods. Instead of using only the one-step TD target, Q(lambda) uses a weighted combination of n-step returns, where the parameter lambda (between 0 and 1) controls the weighting. When lambda = 0, Q(lambda) reduces to standard Q-learning; when lambda = 1, it approximates Monte Carlo methods.
Speedy Q-learning modifies the standard update rule to achieve faster convergence rates in certain settings. It uses a different combination of the current and target Q-values that reduces the error at a faster rate, measured in terms of the number of iterations required.
Distributional Q-learning (also called distributional reinforcement learning) goes beyond estimating the expected return and instead models the full probability distribution of returns. Rather than a single Q-value per state-action pair, the agent maintains a distribution over possible returns. This provides richer information for decision-making and has been shown to improve performance in the deep reinforcement learning setting (for example, the C51 and QR-DQN algorithms).
When multiple agents interact in the same environment, single-agent Q-learning can be extended to the multi-agent setting. Minimax Q-learning, proposed by Michael Littman in 1994, adapts Q-learning for two-player zero-sum games. Nash Q-learning extends this to general-sum games by computing Nash equilibria at each update step.
| Property | Value |
|---|---|
| Introduced by | Chris Watkins (1989) |
| Type | Model-free, off-policy, value-based |
| Core data structure | Q-table (one entry per state-action pair) |
| Update method | Temporal difference (one-step bootstrap) |
| Convergence | Guaranteed (finite MDP, Robbins-Monro learning rates, infinite exploration) |
| State space requirement | Finite and discrete |
| Action space requirement | Finite and discrete |
| Key hyperparameters | Learning rate (alpha), discount factor (gamma), exploration rate (epsilon) |
| Common exploration strategy | Epsilon-greedy |
| Main limitation | Does not scale to large or continuous state spaces |