# State-Action Value Function

> Source: https://aiwiki.ai/wiki/state-action_value_function
> Updated: 2026-06-26
> Categories: Machine Learning, Reinforcement Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

The **state-action value function**, written Q^π(s, a) and also called the **action-value function** or **Q-function**, gives the expected discounted return an agent obtains by taking action a in state s and thereafter following policy π. Formally, Q^π(s, a) = E_π[ G_t | S_t = s, A_t = a ], where G_t is the discounted sum of future rewards. It is the foundational object of value-based [reinforcement learning](/wiki/reinforcement_learning) (RL): the [Q-learning](/wiki/q_learning) algorithm and [Deep Q-Networks](/wiki/deep_q-network_dqn) (DQN) both learn a Q-function, and it satisfies the [Bellman equation](/wiki/bellman_equation) that underlies nearly all value-based methods. Sutton and Barto define it as "the expected return when starting from s, taking the action a, and thereafter following policy π" [1].

The **optimal action-value function** Q*(s, a) = max_π Q^π(s, a) is the single most useful version, because once it is known the optimal policy is just the greedy choice π*(s) = arg max_a Q*(s, a). An agent that knows Q* can therefore act optimally with a one-step lookahead and no model of the environment, which is what makes the Q-function so widely used. The optimal state value satisfies V*(s) = max_a Q*(s, a), so the Q-function generalizes the state-value function by also conditioning on the first action.

The concept originates from the theory of [Markov decision processes](/wiki/markov_decision_process_mdp) (MDPs) and dynamic programming, where it was formalized through the [Bellman equation](/wiki/bellman_equation) [15]. Chris Watkins introduced Q-learning in his 1989 doctoral thesis [2], and the convergence proof was later published by Watkins and Peter Dayan in 1992 [3]. Since then, the Q-function has become one of the most studied and applied constructs in both classical and deep reinforcement learning, with the deep Q-network of Mnih et al. (2015) demonstrating that a single learned Q-function could reach human-level performance directly from pixels [4].

## ELI5 (explain like I'm 5)

Imagine you are in a big room full of toys, and you want to collect as many gold stars as possible. Every time you pick up a toy or move to a different spot, you might get some stars. The Q-function is like a magic notebook that tells you: "If you are standing in this spot and you pick up that toy, here is roughly how many total stars you will end up with." By checking the notebook for every toy you could pick up, you can always choose the one that gives you the most stars. Over time, as you keep playing and learning, the notebook gets more and more accurate until it always tells you the best choice.

## What is the state-action value function (Q-function)?

The state-action value function is defined within the framework of a Markov decision process [1], which consists of:

- A set of states S
- A set of actions A
- A transition function P(s' | s, a) describing the probability of moving to state s' after taking action a in state s
- A reward function R(s, a) giving the immediate reward for taking action a in state s
- A discount factor γ (gamma), where 0 ≤ γ ≤ 1, which determines how much future rewards are weighted relative to immediate rewards

Sutton and Barto place the function at the center of the field: "Almost all reinforcement learning algorithms involve estimating value functions, functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state)" [1].

### On-policy action-value function

The on-policy action-value function, denoted Q^π(s, a), represents the expected return when an agent starts in state s, takes action a, and then follows policy π for all subsequent decisions:

**Q^π(s, a) = E_π[ G_t | s_t = s, a_t = a ]**

where G_t is the discounted return defined as:

**G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}**

Here, R_{t+1} is the immediate reward received after taking action a in state s, and subsequent rewards are discounted by increasing powers of γ.

### Optimal action-value function

The optimal action-value function, denoted Q*(s, a), represents the maximum expected return achievable from state s after taking action a, over all possible policies:

**Q*(s, a) = max_π Q^π(s, a)**

Once Q* is known, the optimal policy can be derived directly by selecting the action that maximizes Q* in each state:

**π*(s) = arg max_a Q*(s, a)**

This property makes Q* especially useful: an agent that knows Q* can act optimally without needing an explicit model of the environment.

## What are the Bellman equations for the Q-function?

The Bellman equations express the recursive relationship between the value of a state-action pair and the values of successor state-action pairs. These equations are the backbone of nearly all value-based RL algorithms.

### Bellman expectation equation

For a given policy π, the Bellman expectation equation for Q^π is:

**Q^π(s, a) = E[ R_{t+1} + γ Σ_{a'} π(a' | s') Q^π(s', a') | s_t = s, a_t = a ]**

This states that the value of taking action a in state s under policy π equals the expected immediate reward plus the discounted expected value of the next state-action pair, where the next action a' is chosen according to policy π.

### Bellman optimality equation

The Bellman optimality equation for Q* is:

**Q*(s, a) = E[ R_{t+1} + γ max_{a'} Q*(s', a') | s_t = s, a_t = a ]**

The key difference from the expectation equation is the max operator: instead of averaging over actions weighted by the policy, the optimal equation selects the action that yields the highest value. This equation forms the basis of Q-learning and value iteration algorithms [1].

Sutton and Barto emphasize a useful practical consequence of this formulation. Once Q* is available, any policy that is greedy with respect to Q*, meaning it always picks an action of maximal value in each state, is an optimal policy. Because the greedy choice already incorporates the long-term consequences of every action through the discounted sum of future rewards, optimal behavior reduces to a one-step lookahead over Q* with no further planning or search required [1].

## How is the Q-function different from the value function?

The Q-function is closely related to two other value constructs in reinforcement learning: the state value function and the advantage function. The essential difference is that the [state value function](/wiki/value_function) V^π(s) measures how good it is to be in a state, while the action-value function Q^π(s, a) measures how good it is to take a particular action from that state and then follow π. Conditioning on the action is exactly what lets an agent choose between actions without a model of the environment, which is why control algorithms learn Q rather than V.

| Function | Notation | Definition | What it measures |
|---|---|---|---|
| State value function | V^π(s) | E_π[ G_t \| s_t = s ] | Expected return from state s following policy π |
| Action-value function | Q^π(s, a) | E_π[ G_t \| s_t = s, a_t = a ] | Expected return from state s, taking action a, then following π |
| Optimal state value | V*(s) | max_a Q*(s, a) | Maximum expected return from state s under any policy |
| Optimal action-value | Q*(s, a) | max_π Q^π(s, a) | Maximum expected return from state s after taking action a |
| Advantage function | A^π(s, a) | Q^π(s, a) - V^π(s) | How much better action a is compared to the average action under π |

### Connection between V and Q

The state value function can be expressed in terms of the Q-function:

**V^π(s) = Σ_a π(a | s) Q^π(s, a)**

In other words, V^π(s) is the weighted average of Q^π(s, a) over all actions, where the weights come from the policy π. For the optimal case:

**V*(s) = max_a Q*(s, a)**

### Advantage function

The advantage function A^π(s, a) = Q^π(s, a) - V^π(s) measures how much better (or worse) a specific action is compared to the average action selected by the policy. By construction the advantage function has zero expectation under the policy, since Σ_a π(a | s) A^π(s, a) = Σ_a π(a | s) Q^π(s, a) - V^π(s) = 0. It plays a significant role in policy gradient methods such as A2C and A3C, where subtracting V^π(s) acts as a baseline that lowers the variance of the gradient estimate without changing its expected direction, as well as in the [Dueling DQN](/wiki/dueling_dqn) architecture [8].

## How are Q-values estimated?

There are several approaches for estimating or learning the Q-function, each with different trade-offs in terms of bias, variance, data efficiency, and computational cost.

### Monte Carlo estimation

Monte Carlo methods estimate Q(s, a) by averaging the actual returns observed after taking action a in state s across multiple complete episodes. The estimate is updated as:

**Q(s, a) ← Q(s, a) + α [ G_t - Q(s, a) ]**

where G_t is the actual observed return from that point forward and α is the [learning rate](/wiki/learning_rate).

Monte Carlo estimation is unbiased because it uses actual returns rather than estimates. However, it has high variance because returns can differ substantially from one episode to another. It also requires complete episodes, making it unsuitable for continuing (non-episodic) tasks [1].

### Temporal difference learning

[Temporal difference](/wiki/temporal_difference_learning) (TD) methods, formalized by Sutton in 1988 [16], update Q-values after every step, using the observed reward plus an estimate of future value (a technique called bootstrapping). The TD(0) update for the Q-function is:

**Q(s, a) ← Q(s, a) + α [ R_{t+1} + γ Q(s', a') - Q(s, a) ]**

The term R_{t+1} + γ Q(s', a') is called the **TD target**, and the difference between the TD target and the current estimate is called the **TD error** (δ). TD methods have lower variance than Monte Carlo methods because they bootstrap from existing estimates, but they introduce bias because those estimates may be inaccurate, especially early in training.

### Comparison of estimation methods

| Property | Monte Carlo | Temporal difference (TD) |
|---|---|---|
| Update frequency | End of episode | Every time step |
| Bootstrapping | No | Yes |
| Bias | Unbiased | Biased (from bootstrapping) |
| Variance | High | Lower |
| Requires complete episodes | Yes | No |
| Works for continuing tasks | No | Yes |
| Data efficiency | Lower | Higher |
| Sensitivity to initial values | Lower | Higher |

## How is the Q-function used in Q-learning and DQN?

Many of the most well-known reinforcement learning algorithms are built around learning or approximating the Q-function.

### Q-learning

Q-learning, introduced by Chris Watkins in 1989 [2], is an off-policy TD control algorithm. It updates Q-values using the greedy action in the next state, regardless of the action the agent actually takes:

**Q(s, a) ← Q(s, a) + α [ R_{t+1} + γ max_{a'} Q(s', a') - Q(s, a) ]**

Because the update uses max_{a'} Q(s', a') rather than the action chosen by the current policy, Q-learning learns the optimal Q-function Q* directly. Watkins's thesis introduced "a new algorithm, which was dubbed Q-learning, that could in principle learn optimal control directly without modelling the transition probabilities or expected rewards of the MDP" [2]. Watkins and Dayan (1992) proved that tabular Q-learning converges to Q* with probability 1, provided that all state-action pairs are visited infinitely often and the learning rate satisfies certain decay conditions [3]. Their proof was among the first rigorous convergence guarantees for a reinforcement learning control algorithm, and it established Q-learning as a theoretically grounded method rather than a heuristic.

### SARSA

SARSA (State-Action-Reward-State-Action) is an on-policy TD control algorithm. It was first described by G. A. Rummery and Mahesan Niranjan in a 1994 Cambridge University technical report, where they called it Modified Connectionist Q-Learning [17]. The now-standard name SARSA was proposed by Rich Sutton and originally appeared only as a footnote, reflecting the five elements (S, A, R, S, A) that make up each update. Unlike Q-learning, SARSA uses the action actually selected by the policy in the next state:

**Q(s, a) ← Q(s, a) + α [ R_{t+1} + γ Q(s', a') - Q(s, a) ]**

where a' is the action chosen by the current policy in state s'. Because SARSA evaluates the policy it is actually following (including exploration), it tends to learn safer policies in environments where exploratory actions can lead to bad outcomes. The classic illustration is the cliff-walking gridworld used as Example 6.6 in Sutton and Barto, an undiscounted episodic task where stepping off the cliff costs a large negative reward and resets the agent to the start. There SARSA learns the safer route along the top of the grid, while Q-learning learns the optimal policy that travels right along the cliff edge. Because Q-learning still acts epsilon-greedily during training, it occasionally falls off the cliff, so SARSA achieves higher online reward during learning even though Q-learning has learned the better target policy [1].

### Expected SARSA

[Expected SARSA](/wiki/expected_sarsa) replaces the sampled next-state value in the SARSA update with its expectation under the current policy, using Σ_{a'} π(a' | s') Q(s', a') in place of Q(s', a'). Because it averages over all next actions rather than relying on a single sampled action, the update has lower variance, which permits larger learning rates and faster learning; in deterministic environments the next-state term has zero variance and a learning rate of 1 can be used. Van Seijen et al. (2009) analyzed the method theoretically and empirically, proved that it converges under the same conditions as SARSA, and showed it can outperform both SARSA and Q-learning on several tasks [19]. Expected SARSA can be run on-policy, but if the expectation is taken over a greedy target policy it reduces exactly to Q-learning, which is why it is often described as spanning the gap between the two.

### Deep Q-Network (DQN)

The [Deep Q-Network](/wiki/deep_q-network_dqn) (DQN), introduced by Mnih et al. at DeepMind in 2013 [5] and refined in 2015 [4], approximates Q*(s, a) using a deep [neural network](/wiki/neural_network) instead of a lookup table. The network takes a state (such as raw pixel input from an Atari game) and outputs Q-values for each possible action. In the canonical Atari setup the input is a stack of four consecutive grayscale frames downsampled to 84 by 84 pixels, which gives the network a sense of motion, followed by three convolutional layers and two fully connected layers, with one output unit per valid action [4]. Two key innovations stabilize training:

- **Experience replay**: transitions are stored in a replay buffer and sampled randomly for training, breaking correlations between consecutive samples
- **Target network**: a separate, periodically updated copy of the Q-network provides stable TD targets during training

The 2013 workshop paper applied the method to seven Atari 2600 games, outperforming all prior approaches on six of them and a human expert on three [5]. The 2015 Nature paper scaled this to 49 games using, in the authors' words, "the same algorithm, network architecture and hyperparameters" on every game [4]. DQN outperformed the best prior reinforcement learning methods on 43 of the 49 games, scored above 75 percent of a professional human games tester's score on 29 of the 49 games, and reached a level the authors described as "comparable to that of a professional human games tester," marking a major milestone in deep reinforcement learning [4].

### Overestimation (maximization) bias

A subtle but important issue affects any algorithm that uses a maximum over estimated values, including standard Q-learning. The problem, first identified by Thrun and Schwartz in 1993, is that the max operator applied to noisy estimates tends to produce a value that is systematically too high [18]. Intuitively, the action whose estimate happens to be inflated by noise is exactly the one the max selects, so errors do not cancel out but accumulate upward. This effect is a consequence of Jensen's inequality, since the expectation of a maximum is at least as large as the maximum of the expectations, and it is closely related to the winner's curse in auction theory [6]. The bias is most damaging in environments with stochastic rewards and when function approximation introduces additional estimation error, where it can slow learning or steer the agent toward poor policies.

### Double Q-learning

Standard Q-learning can overestimate action values because it uses the same Q-function both to select the best action and to evaluate its value, which amplifies the maximization bias described above. Double Q-learning, proposed by Hado van Hasselt in 2010, addresses this by maintaining two independent estimators and decoupling action selection from value estimation: one estimator chooses the maximizing action while the other supplies the value used in the update [6]. Van Hasselt proved that the resulting algorithm still converges to the optimal policy and showed that, although it can sometimes underestimate values, it avoids the systematic overestimation of ordinary Q-learning [6]. The idea was later extended to deep networks as Double DQN by van Hasselt, Guez, and Silver in 2016, which reused DQN's online and target networks for the two roles and produced more accurate value estimates and better scores across the Atari suite [7].

### Dueling DQN

The Dueling DQN architecture, introduced by Wang et al. (2016), separates the Q-function into two components within the neural network [8]:

**Q(s, a) = V(s) + A(s, a)**

where V(s) is the state value stream and A(s, a) is the advantage stream. In practice the two streams are recombined with the mean advantage subtracted, so that Q(s, a) = V(s) + (A(s, a) - mean_{a'} A(s, a')), which removes an identifiability problem and keeps the decomposition stable during training [8]. This structure allows the network to learn which states are valuable independently of the actions available, improving performance in environments where many actions have similar effects.

### Distributional reinforcement learning

Distributional RL, introduced by Bellemare, Dabney, and Munos (2017), goes beyond estimating the expected Q-value and instead models the full probability distribution of returns [9]. Rather than learning a single number Q(s, a), the agent learns a distribution Z(s, a) whose expectation equals Q(s, a). Their C51 algorithm represents this distribution as a categorical distribution over 51 fixed, evenly spaced return values (the "atoms" that give the method its name) and is trained by minimizing a Kullback-Leibler divergence after a projection step. The authors also showed that the distributional Bellman operator is a gamma-contraction in the Wasserstein metric, providing theoretical support for the approach, which set a new state of the art on the Atari benchmark [9].

### Rainbow

Rainbow, introduced by Hessel et al. (2018), is not a new Q-function variant on its own but a combination of six independent improvements to DQN into a single agent [13]. The six components are Double Q-learning, prioritized experience replay, the dueling network architecture, multi-step (n-step) learning, distributional RL via C51, and noisy networks for exploration. On the Atari benchmark Rainbow substantially outperformed each of the individual extensions, and an ablation study found that prioritized replay and multi-step learning were the two most important contributors, since removing either caused the largest drop in performance [13].

### Summary of Q-function-based algorithms

| Algorithm | Type | On/Off-policy | Key feature |
|---|---|---|---|
| [Q-learning](/wiki/q_learning) | Tabular | Off-policy | Learns Q* directly via max operator |
| [SARSA](/wiki/sarsa) | Tabular | On-policy | Evaluates the policy being followed |
| [Expected SARSA](/wiki/expected_sarsa) | Tabular | Off-policy (can be on-policy) | Uses expected Q-value under policy instead of sampled action |
| [DQN](/wiki/deep_q-network_dqn) | Deep | Off-policy | Neural network Q-function with experience replay |
| [Double DQN](/wiki/double_dqn) | Deep | Off-policy | Decouples action selection and evaluation to reduce overestimation |
| [Dueling DQN](/wiki/dueling_dqn) | Deep | Off-policy | Separates V(s) and A(s, a) streams |
| [Distributional RL](/wiki/distributional_reinforcement_learning) | Deep | Off-policy | Learns return distributions (C51) rather than expectations |
| [Rainbow](/wiki/rainbow_dqn) | Deep | Off-policy | Combines six DQN improvements into one agent |

## How is the Q-function represented with function approximation?

In problems with small, finite state and action spaces, Q-values can be stored in a simple lookup table (tabular Q-learning). However, most real-world problems have state spaces that are far too large for tabular methods. Function approximation techniques allow the Q-function to generalize across similar states.

### Tabular representation

Tabular Q-learning maintains a table with one entry for every (state, action) pair. This approach is guaranteed to converge to Q* under the right conditions but becomes infeasible as the number of states or actions grows. A grid world with a few hundred states works well with tables, but a game with pixel-based observations has far too many possible states.

### Linear function approximation

Linear function approximation represents Q(s, a) as a weighted sum of features:

**Q(s, a; θ) ≈ θ^T φ(s, a)**

where φ(s, a) is a feature vector describing the state-action pair and θ is a weight vector learned during training. Linear methods benefit from a convex loss surface, which provides stronger convergence guarantees than nonlinear methods. However, they require careful manual feature engineering, and a linear function may fail to capture complex, nonlinear relationships in the data.

### Neural network approximation

Neural network function approximation, as used in DQN and its variants, represents Q(s, a) with a deep neural network parameterized by weights θ. The network automatically learns useful features from raw input (such as images), removing the need for hand-crafted features. The loss function for training is typically the mean squared TD error:

**L(θ) = E[ (R_{t+1} + γ max_{a'} Q(s', a'; θ^-) - Q(s, a; θ))² ]**

where θ^- denotes the parameters of the target network [4]. Neural network approximation can represent highly complex Q-functions but introduces challenges including training instability, lack of convergence guarantees, and sensitivity to hyperparameters.

### Comparison of representation methods

| Method | Scalability | Convergence guarantees | Feature engineering needed | Expressiveness |
|---|---|---|---|---|
| Tabular | Low (small state spaces only) | Strong (provably converges to Q*) | None | Exact (stores all values) |
| Linear | Moderate | Moderate (convex optimization) | Yes (requires hand-crafted features) | Limited to linear relationships |
| Neural network | High (handles images, continuous states) | Weak (no general guarantees) | No (learns features automatically) | High (can represent complex functions) |

## How does the Q-function handle continuous action spaces?

A significant challenge arises when the action space is continuous rather than discrete. In standard Q-learning and DQN, selecting the optimal action requires computing arg max_a Q(s, a), which is straightforward when there are finitely many actions but becomes an optimization problem in continuous spaces.

Several approaches address this challenge:

- **Normalized Advantage Functions (NAF)**: Introduced by Gu et al. (2016), NAF parameterizes the Q-function so that it is quadratic in the action variable. This makes the argmax analytically solvable while still using a neural network for the state-dependent components [14]. The trade-off is that the quadratic form may not capture the true Q-function shape in all domains.
- **Actor-critic methods**: Instead of computing the argmax of Q, actor-critic architectures maintain a separate policy network (the actor) that learns to output actions maximizing Q. Algorithms such as [DDPG](/wiki/ddpg) [10], [TD3](/wiki/td3) [11], and [SAC](/wiki/soft_actor_critic) [12] follow this approach, using the Q-function as a critic to evaluate the actor's actions.
- **CAQL (Continuous Action Q-Learning)**: Formulates the inner maximization as a mixed-integer optimization problem, allowing exact or approximate solutions for continuous action Q-learning.

## What role does the Q-function play in actor-critic methods?

Although the Q-function is most commonly associated with value-based methods, it also plays a central role in actor-critic algorithms. In these methods, the critic estimates Q^π(s, a) (or sometimes V^π(s)), and the actor uses this estimate to update the policy in a direction that increases expected return.

- **DDPG (Deep Deterministic Policy Gradient)**: Uses a neural network critic to estimate Q(s, a) and a deterministic actor network that outputs continuous actions. The actor is trained by computing the gradient of Q with respect to the action and backpropagating through the actor [10].
- **TD3 (Twin Delayed DDPG)**: Extends DDPG by using two Q-networks (critics) and taking the minimum of their estimates to reduce the same overestimation bias that motivated Double Q-learning. It also delays policy updates relative to value updates and adds noise to target actions to smooth the value estimate [11].
- **SAC (Soft Actor-Critic)**: Uses a soft Q-function that incorporates an entropy bonus, so the policy evaluation step backs up r(s, a) + γ E[V(s')] with V(s') including a -α log π(a' | s') term. This encourages the policy to remain stochastic and explore more broadly, turning the objective into maximizing expected return plus expected entropy. Like TD3, SAC uses twin Q-networks and can automatically tune the temperature α that trades off reward against entropy [12].

## Does Q-learning converge to the optimal Q-function?

The convergence of Q-learning to the optimal Q-function Q* is one of the most well-studied results in reinforcement learning theory.

### Tabular convergence

Watkins and Dayan (1992) proved that tabular Q-learning converges to Q* with probability 1, provided the following conditions hold [3]:

1. All state-action pairs are visited infinitely often
2. The learning rate α_t satisfies Σ α_t = ∞ and Σ α_t² < ∞ (the Robbins-Monro conditions)
3. The MDP has bounded rewards

This result follows from the theory of stochastic approximation and the contraction properties of the Bellman optimality operator. The optimality operator is a gamma-contraction in the max norm (the supremum norm over state-action pairs), meaning that applying it to any two action-value functions shrinks the largest difference between them by a factor of at least γ. By the Banach fixed-point theorem, a contraction on a complete space has exactly one fixed point, which here is Q*, and repeated application of the operator converges to it from any starting estimate [1]. This is the same property that guarantees the convergence of value iteration in dynamic programming [15].

### Function approximation convergence

With function approximation, the convergence picture becomes more complex. Linear Q-function approximation can diverge under off-policy learning (a phenomenon sometimes called the "deadly triad" when combining function approximation, bootstrapping, and off-policy training). Various methods have been developed to address this, including Gradient TD methods and the Greedy GQ algorithm. For neural network approximation, there are no general convergence guarantees, though techniques like experience replay and target networks provide empirical stability.

### The deadly triad

Sutton and Barto describe the "deadly triad" as the combination of three elements that can cause instability in value function learning [1]:

1. **Function approximation**: representing the value function with a parameterized function rather than a table
2. **Bootstrapping**: updating estimates based partly on other estimates (as in TD learning)
3. **Off-policy learning**: learning about a policy different from the one generating the data

When all three are present simultaneously, learning can become unstable or diverge. DQN's experience replay and target network mechanisms were designed in part to mitigate these issues.

## How does exploration interact with the Q-function?

Learning an accurate Q-function requires exploring the state-action space sufficiently. Several exploration strategies interact directly with Q-value estimates:

- **Epsilon-greedy**: With probability ε, the agent selects a random action; otherwise, it selects the action with the highest Q-value. This ensures all actions are tried occasionally.
- **Boltzmann (softmax) exploration**: Actions are selected with probability proportional to exp(Q(s, a) / τ), where τ is a temperature parameter. Higher Q-values are more likely to be selected, but all actions have nonzero probability.
- **Optimistic initialization**: Q-values are initialized to high values, so the agent is "disappointed" by early results and tries other actions, naturally encouraging exploration.
- **Upper confidence bound (UCB)**: Adds an exploration bonus to Q-values based on how rarely a state-action pair has been visited, balancing exploitation and exploration.

## What is the Q-function used for?

The Q-function and algorithms built around it have been applied across a wide range of domains:

- **Game playing**: DQN achieved human-level performance on Atari 2600 games (Mnih et al., 2015) [4]. Extensions have been applied to board games and real-time strategy games.
- **Robotics**: Q-learning-based methods are used for robotic manipulation, grasping, and locomotion, often trained in simulation and transferred to physical robots (sim-to-real transfer). A notable real-world example is QT-Opt (Kalashnikov et al., 2018), which trained a deep Q-function on more than 580,000 real robot grasp attempts and reached a 96 percent grasp success rate on previously unseen objects, learning closed-loop behaviors such as repositioning and regrasping [20].
- **Autonomous driving**: Deep Q-learning variants have been explored for lane-keeping, intersection navigation, and other driving tasks, though real-world deployment remains limited.
- **Resource management**: Q-learning has been applied to network routing, data center cooling (such as Google DeepMind's work on reducing data center energy consumption), and inventory management.
- **Finance**: Q-function-based methods have been used for portfolio optimization and algorithmic trading, where the state captures market conditions and actions correspond to trading decisions.
- **Natural language processing**: Reinforcement learning with Q-functions has been applied to dialogue systems, text summarization, and information extraction tasks.

## When was the Q-function developed?

The development of the state-action value function and related algorithms spans several decades.

| Year | Development | Contributors |
|---|---|---|
| 1957 | Dynamic programming and the Bellman equation introduced | Richard Bellman [15] |
| 1981 | Crossbar Adaptive Array (precursor to Q-learning) | Stevo Bozinovski [21] |
| 1988 | Temporal difference learning formalized | Richard Sutton [16] |
| 1989 | Q-learning introduced | Chris Watkins [2] |
| 1992 | Convergence proof for Q-learning published | Chris Watkins, Peter Dayan [3] |
| 1994 | SARSA (Modified Connectionist Q-Learning) described | G. A. Rummery, M. Niranjan [17] |
| 2009 | Expected SARSA analyzed | van Seijen et al. [19] |
| 2010 | Double Q-learning proposed | Hado van Hasselt [6] |
| 2013 | DQN applied to Atari games | Mnih et al. (DeepMind) [5] |
| 2015 | DQN published in Nature, achieving human-level performance | Mnih et al. (DeepMind) [4] |
| 2015 | DDPG introduced for continuous control | Lillicrap et al. [10] |
| 2016 | Double DQN, Dueling DQN, and Prioritized Experience Replay published | van Hasselt et al. [7], Wang et al. [8], Schaul et al. [22] |
| 2016 | NAF for continuous action Q-learning | Gu et al. [14] |
| 2017 | Distributional RL (C51) introduced | Bellemare, Dabney, Munos [9] |
| 2018 | Rainbow combines six DQN improvements | Hessel et al. [13] |
| 2018 | SAC and TD3 for continuous control | Haarnoja et al. [12], Fujimoto et al. [11] |

## See also

- [Reinforcement learning](/wiki/reinforcement_learning)
- [Bellman equation](/wiki/bellman_equation)
- [Q-learning](/wiki/q_learning)
- [Deep Q-Network (DQN)](/wiki/deep_q-network_dqn)
- [Policy](/wiki/policy)
- [Reward](/wiki/reward)
- [Markov decision process](/wiki/markov_decision_process_mdp)
- [Temporal difference learning](/wiki/temporal_difference_learning)
- [Experience replay](/wiki/experience_replay)
- [Target network](/wiki/target_network)

## References

1. Sutton, R. S., & Barto, A. G. (2018). *Reinforcement Learning: An Introduction* (2nd ed.). MIT Press. Available at: http://incompleteideas.net/book/the-book-2nd.html
2. Watkins, C. J. C. H. (1989). *Learning from Delayed Rewards*. PhD thesis, King's College, Cambridge.
3. Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. *Machine Learning*, 8(3-4), 279-292.
4. Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2015). Human-level control through deep reinforcement learning. *Nature*, 518(7540), 529-533.
5. Mnih, V., Kavukcuoglu, K., Silver, D., et al. (2013). Playing Atari with Deep Reinforcement Learning. *arXiv preprint arXiv:1312.5602*.
6. van Hasselt, H. (2010). Double Q-learning. *Advances in Neural Information Processing Systems*, 23.
7. van Hasselt, H., Guez, A., & Silver, D. (2016). Deep Reinforcement Learning with Double Q-learning. *Proceedings of the AAAI Conference on Artificial Intelligence*, 30(1).
8. Wang, Z., Schaul, T., Hessel, M., et al. (2016). Dueling Network Architectures for Deep Reinforcement Learning. *Proceedings of the 33rd International Conference on Machine Learning (ICML)*.
9. Bellemare, M. G., Dabney, W., & Munos, R. (2017). A Distributional Perspective on Reinforcement Learning. *Proceedings of the 34th International Conference on Machine Learning (ICML)*.
10. Lillicrap, T. P., Hunt, J. J., Pritzel, A., et al. (2015). Continuous control with deep reinforcement learning. *arXiv preprint arXiv:1509.02971*.
11. Fujimoto, S., van Hoof, H., & Meger, D. (2018). Addressing Function Approximation Error in Actor-Critic Methods. *Proceedings of the 35th International Conference on Machine Learning (ICML)*.
12. Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. *Proceedings of the 35th International Conference on Machine Learning (ICML)*.
13. Hessel, M., Modayil, J., van Hasselt, H., et al. (2018). Rainbow: Combining Improvements in Deep Reinforcement Learning. *Proceedings of the AAAI Conference on Artificial Intelligence*, 32(1).
14. Gu, S., Lillicrap, T., Sutskever, I., & Levine, S. (2016). Continuous Deep Q-Learning with Model-based Acceleration. *Proceedings of the 33rd International Conference on Machine Learning (ICML)*.
15. Bellman, R. (1957). *Dynamic Programming*. Princeton University Press.
16. Sutton, R. S. (1988). Learning to Predict by the Methods of Temporal Differences. *Machine Learning*, 3, 9-44. Available at: https://link.springer.com/article/10.1007/BF00115009
17. Rummery, G. A., & Niranjan, M. (1994). *On-line Q-learning Using Connectionist Systems*. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department. Available at: http://mi.eng.cam.ac.uk/reports/svr-ftp/auto-pdf/rummery_tr166.pdf
18. Thrun, S., & Schwartz, A. (1993). Issues in Using Function Approximation for Reinforcement Learning. *Proceedings of the Fourth Connectionist Models Summer School*. Available at: https://www.ri.cmu.edu/pub_files/pub1/thrun_sebastian_1993_1/thrun_sebastian_1993_1.pdf
19. van Seijen, H., van Hasselt, H., Whiteson, S., & Wiering, M. (2009). A Theoretical and Empirical Analysis of Expected Sarsa. *IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)*, 177-184. Available at: https://hadovanhasselt.com/wp-content/uploads/2015/12/a_theoretical_and_empirical_analysis_of_expected_sarsa.pdf
20. Kalashnikov, D., Irpan, A., Pastor, P., et al. (2018). QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation. *Conference on Robot Learning (CoRL)*. arXiv preprint arXiv:1806.10293. Available at: https://arxiv.org/abs/1806.10293
21. Bozinovski, S. (1982). A self-learning system using secondary reinforcement. In R. Trappl (Ed.), *Cybernetics and Systems Research*, 397-402. North-Holland.
22. Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). Prioritized Experience Replay. *International Conference on Learning Representations (ICLR)*. arXiv preprint arXiv:1511.05952. Available at: https://arxiv.org/abs/1511.05952

