Q-Function
Last reviewed
Jun 2, 2026
Sources
24 citations
Review status
Source-backed
Revision
v5 ยท 4,955 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Jun 2, 2026
Sources
24 citations
Review status
Source-backed
Revision
v5 ยท 4,955 words
Add missing citations, update stale details, or suggest a clearer explanation.
The Q-function, also called the action-value function or state-action value function, is a core concept in reinforcement learning (RL). It maps a state-action pair to a scalar value representing the expected cumulative discounted reward an agent will receive by taking a particular action in a given state and then following a specific policy thereafter. Denoted Q(s, a), the Q-function serves as the foundation for a large family of RL algorithms, including Q-learning [1], Deep Q-Networks [4], and Soft Actor-Critic [9].
This article focuses on the Q-function as the engine behind value-based control algorithms: how it is learned from experience, how it is approximated at scale, and the algorithmic family it has spawned. For a complementary treatment that emphasizes the mathematical theory (Monte Carlo versus temporal-difference estimation, the contraction-mapping convergence argument, and the maximization-bias analysis), see the companion article on the state-action value function.
Imagine you are playing a board game, and in every square you land on, you can pick from a few different moves. The Q-function is like a cheat sheet that tells you, "If you are on this square and you pick this move, here is how many points you can expect to earn by the end of the game." At first, the cheat sheet is blank because you have never played before. Every time you play and see what happens after a move, you update the cheat sheet a little bit. After many games, the cheat sheet becomes very accurate, and you can just look at it to pick the best move every time.
Sometimes, the board is so big that writing down every single square and move would fill an entire library. In that case, you train a computer brain (a neural network) to memorize the patterns on the cheat sheet instead of writing everything down. That is what a Deep Q-Network does.
Given a Markov decision process (MDP) defined by a tuple (S, A, P, R, gamma), where S is the state space, A is the action space, P is the state-transition probability function, R is the reward function, and gamma is the discount factor (0 <= gamma < 1), the Q-function under a policy pi is defined as [11]:
Q^pi(s, a) = E_pi [ sum_{t=0}^{infinity} gamma^t * R(s_t, a_t) | s_0 = s, a_0 = a ]
Here, E_pi denotes the expectation over trajectories generated by following policy pi after taking action a in state s. The discount factor gamma controls the trade-off between immediate and future rewards. When gamma is close to 0, the agent behaves myopically, prioritizing immediate rewards. When gamma is close to 1, the agent gives nearly equal weight to distant future rewards.
The optimal Q-function, denoted Q*(s, a), corresponds to the Q-function under the optimal policy pi*:
Q*(s, a) = max_pi Q^pi(s, a)
Once Q* is known, an optimal policy can be recovered simply by choosing the action with the highest Q-value in each state:
pi*(s) = argmax_a Q*(s, a)
The Q-function is closely related to two other functions that appear throughout reinforcement learning theory.
| Function | Notation | Definition | Depends on |
|---|---|---|---|
| State-value function | V^pi(s) | Expected return starting from state s and following policy pi | State only |
| Action-value function (Q-function) | Q^pi(s, a) | Expected return starting from state s, taking action a, then following pi | State and action |
| Advantage function | A^pi(s, a) | How much better action a is compared to the average action under pi | State and action |
The state-value function V^pi(s) gives the expected return from state s under policy pi. It relates to the Q-function by marginalizing over actions:
V^pi(s) = sum_a pi(a|s) * Q^pi(s, a)
For the optimal value function, V*(s) = max_a Q*(s, a).
The advantage function measures the relative benefit of taking a specific action over the average:
A^pi(s, a) = Q^pi(s, a) - V^pi(s)
The advantage function is used in algorithms such as Advantage Actor-Critic (A2C) and Dueling DQN to reduce variance in value estimates. By construction, the advantage of the best action under pi is non-negative, and the expected advantage across all actions under the policy is zero.
The Q-function satisfies a recursive relationship known as the Bellman equation, first introduced by Richard Bellman in his 1957 work on dynamic programming [12], which decomposes the value of a state-action pair into an immediate reward plus the discounted expected future value.
For a given policy pi:
Q^pi(s, a) = R(s, a) + gamma * sum_{s'} P(s'|s, a) * sum_{a'} pi(a'|s') * Q^pi(s', a')
This equation states that the Q-value of taking action a in state s equals the immediate reward plus the discounted Q-value of the next state-action pair, averaged over the transition dynamics and the policy. When the state and action spaces are finite and the transition dynamics are known, this forms a system of linear equations that can be solved exactly.
The optimal Q-function satisfies a nonlinear version of the Bellman equation:
Q*(s, a) = R(s, a) + gamma * sum_{s'} P(s'|s, a) * max_{a'} Q*(s', a')
The key difference from the expectation equation is the max operator: instead of averaging over actions according to a policy, the agent selects the action that yields the highest Q-value. This nonlinear system requires iterative algorithms (such as value iteration or Q-learning) to solve.
Q-learning, introduced by Christopher Watkins in his 1989 PhD thesis "Learning from Delayed Rewards" at King's College, University of Cambridge [1], is a model-free, off-policy algorithm that learns the optimal Q-function directly from experience without requiring knowledge of the environment's transition dynamics. It belongs to the broader family of temporal difference control methods; the trade-offs between TD bootstrapping and Monte Carlo estimation of Q-values are discussed in detail in the state-action value function article.
The agent maintains a table of Q-values (one entry per state-action pair) and updates them after each transition using temporal difference (TD) learning:
Q(s, a) <- Q(s, a) + alpha * [ R + gamma * max_{a'} Q(s', a') - Q(s, a) ]
In this equation:
Watkins and Dayan (1992) proved that tabular Q-learning converges to the optimal Q-function Q* with probability 1, provided that [2]:
These conditions ensure that the algorithm explores enough of the state-action space and that the updates diminish over time, preventing oscillation. The proof rests on the fact that the Bellman optimality operator is a gamma-contraction in the supremum norm, so repeated application drives any initial estimate toward the unique fixed point Q* (this contraction argument is developed in the companion state-action value function article). It was among the first rigorous convergence guarantees for a reinforcement learning control algorithm, establishing Q-learning as a theoretically grounded method rather than a heuristic [2].
Initialize Q(s, a) arbitrarily for all s in S, a in A
For each episode:
Initialize state s
Repeat for each step:
Choose action a from s using policy derived from Q (e.g., epsilon-greedy)
Take action a, observe reward R and next state s'
Q(s, a) <- Q(s, a) + alpha * [R + gamma * max_{a'} Q(s', a') - Q(s, a)]
s <- s'
Until s is terminal
A central challenge in Q-learning is the exploration-exploitation trade-off: the agent must balance between exploiting its current best-known actions and exploring new actions that might yield higher returns. The Q-function is directly used by several exploration strategies to guide this balance.
| Strategy | Description | Advantages | Disadvantages |
|---|---|---|---|
| Epsilon-greedy | With probability epsilon, choose a random action; otherwise choose argmax_a Q(s, a) | Simple to implement and tune | Explores uniformly at random; does not prioritize promising actions |
| Boltzmann (softmax) | Sample actions from P(a) proportional to exp(Q(s, a) / tau), where tau is a temperature parameter | Assigns higher probability to actions with higher Q-values | Sensitive to temperature parameter tuning |
| Upper confidence bound (UCB) | Select argmax_a [Q(s, a) + c * sqrt(ln(t) / N(s, a))] | Explicitly accounts for uncertainty; efficient exploration | Requires visit counts; harder to apply with function approximation |
| Optimistic initialization | Initialize Q-values to high values, encouraging the agent to try all actions early | Simple; no extra computation | Effect diminishes over time |
The most commonly used strategy in practice is epsilon-greedy with a decaying epsilon, where epsilon starts high (more exploration) and gradually decreases toward zero (more exploitation) as the agent accumulates experience.
Q-learning is an off-policy algorithm because the update rule uses the maximum Q-value of the next state (corresponding to the greedy policy) regardless of what action the agent actually took. This means the agent can learn about the optimal policy while following a different, more exploratory behavior policy.
In contrast, SARSA (State-Action-Reward-State-Action), first described by Rummery and Niranjan in a 1994 Cambridge technical report under the name Modified Connectionist Q-Learning [16], is an on-policy TD algorithm that updates the Q-function based on the action the agent actually takes in the next state:
Q(s, a) <- Q(s, a) + alpha * [ R + gamma * Q(s', a') - Q(s, a) ]
Here, a' is the action actually selected by the current policy in state s', rather than the action that maximizes Q. This distinction has practical consequences.
| Property | Q-learning (off-policy) | SARSA (on-policy) |
|---|---|---|
| Update target | max_{a'} Q(s', a') | Q(s', a') where a' is the action actually taken |
| Learns about | Optimal policy | Current policy (including exploration) |
| Convergence speed | Generally faster to optimal policy | Slower; converges to a policy that accounts for exploration |
| Safety during learning | May learn risky policies if exploration causes dangerous states | Learns safer policies because it accounts for exploratory actions |
| Classic example | In cliff-walking, learns the optimal (risky) path along the cliff edge | In cliff-walking, learns a safer path away from the cliff edge |
Expected SARSA is a variant that takes the expectation over all possible next actions weighted by the policy, rather than using a single sample. Its update target is sum_{a'} pi(a'|s') * Q(s', a'). Expected SARSA reduces variance compared to standard SARSA and, when using a greedy target policy, becomes equivalent to Q-learning.
Tabular Q-learning stores one Q-value for every state-action pair, which becomes infeasible when the state or action spaces are large or continuous. Function approximation addresses this by representing the Q-function with a parameterized model Q(s, a; theta), where theta is a vector of learnable parameters.
In linear Q-function approximation, the Q-value is expressed as a linear combination of hand-crafted features:
Q(s, a; theta) = theta^T * phi(s, a)
where phi(s, a) is a feature vector. Linear approximation is attractive because the loss function is convex, guaranteeing convergence to a global optimum under appropriate conditions. However, the quality of the learned Q-function depends heavily on the choice of features.
Greedy-GQ is a variant of Q-learning designed for use with linear function approximation that guarantees convergence, addressing the well-known instability of combining off-policy learning with function approximation (the "deadly triad").
Neural networks can automatically learn useful feature representations from raw inputs, eliminating the need for manual feature engineering. However, using nonlinear function approximators introduces instability: Tsitsiklis and Van Roy (1997) showed that temporal-difference learning combined with a nonlinear function approximator can diverge [17]. Techniques such as experience replay, target networks, and gradient clipping are used to mitigate this instability.
The Deep Q-Network (DQN), introduced by Mnih et al. (2013) [3] and refined in their 2015 Nature paper [4], was the first algorithm to successfully combine deep neural networks with Q-learning at scale. DQN used a convolutional neural network to learn Q-values directly from raw pixel inputs, achieving human-level performance on a range of Atari 2600 games.
DQN introduced two techniques to stabilize training with neural network function approximation:
Experience replay. The agent stores transitions (s, a, r, s') in a replay buffer and trains on randomly sampled mini-batches. This breaks temporal correlations between consecutive samples and allows each transition to be reused multiple times, improving data efficiency. In the canonical Atari configuration the buffer held the most recent one million frames [4].
Target network. A separate copy of the Q-network, called the target network, is used to compute the TD target. The target network's parameters theta^- are held fixed and refreshed periodically (every C steps) by copying the parameters from the main network; the Nature agent used C = 10,000 [4]. This prevents the target from changing too rapidly, which would otherwise cause training instability.
The DQN loss function is:
L(theta) = E[ (R + gamma * max_{a'} Q(s', a'; theta^-) - Q(s, a; theta))^2 ]
where theta^- denotes the parameters of the target network.
In the 2015 Nature paper, DQN was evaluated on 49 Atari games using a single architecture and one set of hyperparameters. It outperformed the best prior reinforcement learning methods on 43 of the games without using any of the game-specific prior knowledge that earlier approaches relied on, and it reached more than 75 percent of a professional human games tester's score on 29 of the 49 games [4]. This demonstrated that a single RL agent could learn diverse skills directly from high-dimensional sensory input.
Following the original DQN, researchers developed several improvements that address specific weaknesses of the base algorithm.
The underlying idea predates deep learning. Hado van Hasselt's 2010 Double Q-learning paper showed that the single max operator produces a positive maximization bias, and that maintaining two independent estimators (one to select the maximizing action, the other to evaluate it) removes that systematic overestimation [18]. The paper's roulette illustration is striking: an action whose true expected value is negative was still estimated by ordinary Q-learning to be worth almost ten dollars even after 100,000 plays, whereas Double Q-learning tracked the correct sign [18].
Van Hasselt, Guez, and Silver (2016) carried this idea into deep networks [5]. They showed that DQN tends to overestimate Q-values because the max operator in the target simultaneously selects and evaluates actions, and Double DQN decouples these steps by reusing the existing online and target networks: the main network selects the best action, and the target network evaluates it:
Y = R + gamma * Q(s', argmax_{a'} Q(s', a'; theta); theta^-)
This simple change significantly reduces overestimation bias and improves performance across the Atari suite [5]. A deeper analysis of why the max operator overestimates (its connection to Jensen's inequality and the winner's curse) appears in the state-action value function article.
Wang et al. (2016) proposed the Dueling Network Architecture, which separates the Q-function into two streams [6]:
Q(s, a; theta, alpha, beta) = V(s; theta, beta) + A(s, a; theta, alpha) - mean_a A(s, a; theta, alpha)
One stream estimates the state-value function V(s), and the other estimates the advantage function A(s, a). This decomposition helps the network learn which states are valuable without needing to evaluate every action, particularly in states where the choice of action does not matter much.
Schaul et al. (2016) observed that sampling transitions uniformly from the replay buffer is suboptimal [7]. Prioritized experience replay assigns higher sampling probability to transitions with larger TD errors, allowing the agent to learn more from surprising or informative experiences. Importance-sampling weights are used to correct the bias this non-uniform sampling introduces. When added to DQN, it raised the median human-normalized score on the Atari benchmark from 48 percent to 106 percent and improved performance on 41 of the 49 games [7].
Bellemare, Dabney, and Munos (2017) proposed learning the full distribution of returns rather than just the expected value [13]. The C51 algorithm represents the return distribution using 51 fixed atoms and learns the probability of each atom. Dabney et al. (2018) improved on this with QR-DQN (Quantile Regression DQN), which uses fixed uniform probabilities with learnable quantile locations, directly minimizing a quantile regression loss [19].
Fortunato et al. (2018) replaced epsilon-greedy exploration with learned parametric noise injected into the network weights [20]. The noise parameters are trained by gradient descent alongside the ordinary weights, so the network learns how much exploration is appropriate for each state, resulting in more efficient state-dependent exploration. The paper uses a factorised Gaussian noise scheme to keep the number of extra random variables low (one per input and one per output of each noisy layer rather than one per weight) [20].
Hessel et al. (2018) combined six DQN extensions into a single agent called Rainbow [8]:
| Component | Purpose |
|---|---|
| Double Q-learning | Reduces overestimation bias |
| Prioritized experience replay | Focuses learning on informative transitions |
| Dueling networks | Separates state-value and advantage estimation |
| Multi-step learning | Uses n-step returns for faster reward propagation |
| Distributional RL (C51) | Learns the full return distribution |
| Noisy networks | Enables state-dependent exploration |
Rainbow achieved state-of-the-art performance on the Atari 2600 benchmark, outperforming each individual component by a large margin in both data efficiency and final performance. An ablation study in the same paper found that prioritized experience replay and multi-step learning were the two most important components: removing either caused the largest drop in median performance [8].
Standard Q-learning requires computing max_a Q(s, a), which is straightforward when the action space is discrete and finite but becomes an optimization problem when actions are continuous.
Lillicrap et al. (2016) proposed DDPG, an actor-critic algorithm that maintains a Q-network (the critic) alongside a separate policy network (the actor) [15]. The actor learns a deterministic policy mu(s) that approximates argmax_a Q(s, a), avoiding the need to explicitly maximize over a continuous action space. DDPG uses experience replay and target networks, similar to DQN.
Gu et al. (2016) proposed NAF as a simpler alternative to DDPG [21]. NAF parameterizes Q(s, a) as a quadratic function with respect to the action:
Q(s, a; theta) = V(s; theta) - (a - mu(s; theta))^T P(s; theta) (a - mu(s; theta))
where V(s) is the state value, mu(s) is the optimal action, and P(s) is a state-dependent positive-definite matrix. This quadratic form allows the maximum to be found in closed form: max_a Q(s, a) = V(s), achieved at a = mu(s). NAF showed faster and more stable learning than DDPG on several continuous control benchmarks.
Haarnoja et al. (2018) introduced the Soft Actor-Critic (SAC) algorithm, which uses a soft Q-function defined under the maximum entropy framework [9]:
Q^pi_soft(s, a) = R(s, a) + gamma * E_{s'} [ V^pi_soft(s') ]
where the soft value function includes an entropy bonus:
V^pi_soft(s) = E_{a ~ pi} [ Q^pi_soft(s, a) - alpha * log pi(a|s) ]
The temperature parameter alpha controls the trade-off between reward maximization and entropy (policy randomness). A distinctive feature of the maximum entropy setting is that the hard max in the ordinary Bellman optimality equation is replaced by a "soft" maximum, the log-sum-exp operator: V_soft(s) = alpha * log sum_a exp(Q_soft(s, a) / alpha). This soft maximum reduces to the ordinary max as the temperature alpha goes to zero, and it makes the optimal soft policy a Boltzmann distribution over Q-values. The soft Q-function was first developed in the soft Q-learning framework of Haarnoja et al. (2017), which expressed the optimal policy as an energy-based model defined by the Q-function [22]. SAC uses twin Q-networks and takes the minimum of the two estimates (the same mechanism as Double Q-learning and TD3) to reduce overestimation, together with a stochastic policy, making it well-suited for continuous control tasks [9]. SAC has become one of the most widely used algorithms for continuous control due to its stability and sample efficiency.
Fitted Q-iteration, introduced by Ernst, Geurts, and Wehenkel (2005), is a batch reinforcement learning algorithm that treats each iteration of Q-learning as a supervised regression problem [10]. Given a fixed dataset of transitions {(s_i, a_i, r_i, s'i)}, FQI iteratively computes regression targets y_i = r_i + gamma * max{a'} Q_k(s'i, a') and fits a new regressor Q{k+1} to these targets. Any regression method (decision trees, neural networks, kernel methods) can be used as the function approximator.
Neural Fitted Q-iteration (NFQ), proposed by Riedmiller (2005), is a variant that uses multilayer perceptrons as the regressor and updates them with the Rprop batch optimizer, representing an early precursor to DQN [23].
In offline (or batch) RL, the agent cannot interact with the environment and must learn entirely from a fixed dataset. Naive application of Q-learning to offline data suffers from distribution shift: the Q-function may assign high values to state-action pairs that are poorly represented in the dataset, and because the bootstrapped target maximizes over actions, these inflated estimates feed back into the update and compound. Algorithms such as Conservative Q-Learning (CQL), proposed by Kumar et al. (2020), address this by adding a regularizer that penalizes Q-values for out-of-distribution actions, producing a Q-function that lower-bounds the true value [14].
Several well-known challenges arise when learning or using Q-functions:
Curse of dimensionality. Tabular Q-learning requires storing and updating Q-values for every state-action pair. The number of entries grows exponentially with the dimensionality of the state and action spaces, making tabular methods impractical for most real-world problems.
Overestimation bias. The max operator in the Q-learning update systematically overestimates the true Q-values, especially in stochastic environments. Double Q-learning and related techniques mitigate this issue but do not eliminate it entirely [18].
The deadly triad. Sutton and Barto (2018) identified three factors that, when combined, can cause Q-learning to diverge: (1) function approximation, (2) bootstrapping (using estimated values in the update target), and (3) off-policy learning [11]. All three are present in standard DQN, which is why experience replay and target networks are necessary for stability.
Sample efficiency. Model-free Q-learning algorithms typically require millions of environment interactions to learn good policies. Model-based methods, which learn a model of the environment's dynamics, can be more sample-efficient but introduce model bias.
Continuous actions. Computing max_a Q(s, a) is straightforward for discrete actions but requires solving a potentially non-convex optimization problem for continuous actions, leading to the development of actor-critic methods.
Reward shaping. The quality of the learned Q-function depends on the design of the reward signal. Sparse rewards (e.g., +1 for winning, 0 otherwise) make learning difficult because the agent receives no feedback on intermediate states.
Q-function-based algorithms have been applied across a wide range of domains:
| Domain | Example application | Algorithm used |
|---|---|---|
| Video games | Playing Atari 2600 games from raw pixels | DQN, Rainbow |
| Board games | Learning to play Go, chess, and shogi | AlphaZero (uses MCTS with Q-value estimates) |
| Robotics | Object manipulation and grasping | DDPG, SAC |
| Autonomous driving | Lane keeping and decision-making at intersections | Double DQN, DDPG |
| Resource management | Data center cooling optimization (Google DeepMind reported a 40 percent cut in cooling energy in 2016) [24] | DQN |
| Finance | Portfolio optimization and trading strategies | Q-learning, DQN |
| Recommendation systems | Personalized content recommendations | Fitted Q-iteration |
| Natural language processing | Dialogue policy optimization | DQN |
| Year | Event |
|---|---|
| 1957 | Richard Bellman publishes the Bellman equation for dynamic programming |
| 1989 | Christopher Watkins introduces Q-learning in his PhD thesis at the University of Cambridge |
| 1992 | Watkins and Peter Dayan publish a convergence proof for Q-learning |
| 1994 | Rummery and Niranjan introduce SARSA (as Modified Connectionist Q-Learning) |
| 1997 | Tsitsiklis and Van Roy analyze TD learning with function approximation and show divergence is possible with nonlinear approximators |
| 2005 | Ernst et al. propose fitted Q-iteration; Riedmiller proposes Neural Fitted Q-iteration |
| 2010 | Van Hasselt proposes Double Q-learning |
| 2013 | Mnih et al. (DeepMind) publish the DQN paper, playing Atari from pixels |
| 2015 | Mnih et al. publish the refined DQN in Nature; Schaul et al. propose prioritized experience replay |
| 2016 | Van Hasselt et al. publish Double DQN; Wang et al. propose Dueling DQN; Lillicrap et al. propose DDPG; Gu et al. propose NAF |
| 2017 | Bellemare et al. introduce distributional RL with C51; Haarnoja et al. introduce soft Q-learning |
| 2018 | Hessel et al. combine six improvements in Rainbow; Haarnoja et al. propose SAC; Dabney et al. propose QR-DQN; Fortunato et al. propose noisy networks |
| 2020 | Kumar et al. propose Conservative Q-Learning (CQL) for offline RL |