A critic in machine learning is a component that evaluates the quality of actions or outputs produced by another component, typically called the actor or policy. The critic learns to estimate a value function that predicts the expected cumulative reward from a given state or state-action pair. This evaluation signal guides the learning process by telling the actor how good its decisions are relative to what was expected.
The critic concept is most prominent in reinforcement learning (RL), where it serves as the backbone of actor-critic architectures. However, critics also appear in generative adversarial networks (where the discriminator acts as a critic) and in reinforcement learning from human feedback (RLHF) pipelines used to fine-tune large language models.
Imagine you are learning to throw a basketball into a hoop. Every time you throw, your coach watches and says something like "that was pretty good" or "aim a little higher next time." The coach does not throw the ball for you; instead, the coach just watches and gives feedback.
In machine learning, the critic is like that coach. There is also a player (called the "actor") who actually makes decisions. The critic watches what the actor does and scores each decision. If the score is higher than expected, the actor learns to do more of that. If the score is lower than expected, the actor adjusts. Over time, the actor gets better and better because the critic keeps giving helpful feedback.
The idea of separating evaluation from action selection in learning systems dates back to the late 1970s. Ian Witten described an early form of adaptive controller with a distinct evaluative component in 1977. The terms "actor" and "critic" were introduced by Andrew Barto, Richard Sutton, and Charles Anderson in their 1983 paper "Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems." In that work, the actor was called the Associative Search Element (ASE) and the critic was called the Adaptive Critic Element (ACE). The two modules interacted so that the ACE learned to predict future reinforcement signals while the ASE used those predictions to improve its action selections.
Richard Sutton formalized temporal difference (TD) learning in 1988, providing the theoretical foundation for how critics learn value estimates from sequential experience without waiting for complete episodes to finish. This paper, "Learning to Predict by the Methods of Temporal Differences," showed that TD methods require less memory and computation than Monte Carlo methods while producing more accurate predictions for many real-world tasks.
The convergence properties of actor-critic algorithms were rigorously analyzed by Vijay Konda and John Tsitsiklis in their 1999 NeurIPS paper and subsequent 2003 journal article "On Actor-Critic Algorithms." They proved convergence for a class of two-timescale algorithms where the critic uses TD learning with linear function approximation and the actor updates follow an approximate policy gradient.
In reinforcement learning, an agent interacts with an environment by selecting actions according to a policy. After each action, the environment returns a reward and transitions to a new state. The goal is to learn a policy that maximizes the expected sum of discounted future rewards.
The critic's job is to estimate how much total reward the agent can expect from a given situation. This estimate takes one of several forms.
The state value function V(s) estimates the expected cumulative discounted reward starting from state s and following the current policy thereafter:
V^π(s) = E_π [ Σ_{t=0}^{∞} γ^t R_{t+1} | S_0 = s ]
Here, γ (gamma) is the discount factor between 0 and 1 that controls how much the agent values future rewards relative to immediate ones. A critic that learns V(s) can compute the advantage of any action by comparing the actual outcome to the baseline prediction.
The action value function Q(s, a) estimates the expected return after taking action a in state s and following the current policy afterward:
Q^π(s, a) = E_π [ Σ_{t=0}^{∞} γ^t R_{t+1} | S_0 = s, A_0 = a ]
Q-function critics are used in algorithms like DQN, DDPG, and SAC. They are especially useful in continuous action spaces where the critic must evaluate specific state-action pairs rather than just states.
The advantage function combines both of the above:
A^π(s, a) = Q^π(s, a) - V^π(s)
The advantage tells the agent how much better (or worse) a particular action is compared to the average action in that state. A positive advantage means the action is better than average; a negative advantage means it is worse. Using the advantage function reduces variance in policy gradient estimates, which speeds up and stabilizes learning.
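As a concrete illustration, the short sketch below (NumPy only, with made-up reward and value numbers) computes a discounted return and an advantage from hypothetical Q and V estimates:

```python
import numpy as np

gamma = 0.99                                 # discount factor
rewards = np.array([1.0, 0.0, 0.0, 5.0])     # rewards observed over one short rollout

# Discounted return from the first state: R_1 + gamma*R_2 + gamma^2*R_3 + ...
G = sum(gamma**t * r for t, r in enumerate(rewards))

# Hypothetical critic estimates for the starting state and state-action pair
V_s = 4.2    # V(s): expected return under the current policy
Q_sa = 4.9   # Q(s, a): expected return after committing to action a in s

advantage = Q_sa - V_s                       # A(s, a) = Q(s, a) - V(s)
print(f"return G = {G:.3f}, advantage = {advantage:.2f}")
```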
The most common way a critic learns is through temporal difference (TD) learning, introduced by Sutton in 1988. Unlike Monte Carlo methods that wait until an episode ends to update value estimates, TD methods update estimates after each step using the observed reward and the current estimate of the next state's value.
The one-step TD error (also called the TD residual) is defined as:
δ_t = R_{t+1} + γ V(S_{t+1}) - V(S_t)
This quantity represents the difference between what the critic predicted (V(S_t)) and a better estimate formed by combining the actual reward R_{t+1} with the discounted value of the next state γ V(S_{t+1}). When the TD error is positive, the outcome was better than expected; when negative, it was worse.
The critic updates its parameters to minimize the squared TD error, typically using stochastic gradient descent:
φ ← φ + α δ_t ∇_φ V_φ(S_t)
where α is the learning rate and φ represents the critic's parameters. The bootstrapped target R_{t+1} + γ V(S_{t+1}) is treated as a constant during differentiation, so this is the standard semi-gradient TD update rather than a true gradient of the squared error.
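A minimal sketch of this update for a tabular critic (a value table rather than a neural network, with an invented transition) looks as follows; for a table, the gradient ∇_φ V(S_t) is simply 1 for the visited entry:

```python
import numpy as np

n_states, gamma, alpha = 5, 0.99, 0.1
V = np.zeros(n_states)                  # tabular critic: one value estimate per state

def td0_update(s, r, s_next, done):
    """One TD(0) update of the value table."""
    target = r if done else r + gamma * V[s_next]
    delta = target - V[s]               # TD error delta_t
    V[s] += alpha * delta               # move V(s) toward the bootstrapped target
    return delta

# example transition: from state 0, reward 1.0, landing in state 2
print(td0_update(s=0, r=1.0, s_next=2, done=False), V)
```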
TD(0) uses only a single step of actual reward before bootstrapping from the value estimate. TD(λ) generalizes this by blending multi-step returns through eligibility traces. When λ = 0, the method reduces to standard one-step TD; when λ = 1, it becomes equivalent to Monte Carlo estimation. Intermediate values of λ trade off bias (from bootstrapping) against variance (from using longer sequences of actual rewards).
Generalized Advantage Estimation (GAE), proposed by John Schulman and colleagues in 2016, extends the idea of TD(λ) to advantage estimation. GAE computes an exponentially weighted sum of multi-step TD errors:
A_t^GAE(γ,λ) = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}
The hyperparameter λ controls the bias-variance tradeoff. Lower values of λ (closer to 0) produce lower variance but higher bias estimates, while higher values (closer to 1) give lower bias but higher variance. GAE is used extensively in modern on-policy algorithms such as PPO and TRPO.
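The following sketch implements the GAE recursion for a single trajectory segment (NumPy, hypothetical rewards and value estimates; episode-termination masking is omitted for brevity):

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory segment.

    values[t] is the critic's V(s_t); last_value bootstraps the final state.
    """
    values = np.append(values, last_value)
    advantages = np.zeros(len(rewards))
    running = 0.0
    # Work backwards: A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

adv = gae(rewards=np.array([1.0, 0.0, 2.0]),
          values=np.array([0.5, 0.4, 1.0]),
          last_value=0.0)
print(adv)
```

The backward recursion A_t = δ_t + γλ A_{t+1} is equivalent to the exponentially weighted sum above, truncated at the end of the segment.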
Actor-critic methods combine a policy (the actor) and a value function (the critic) into a single learning framework. The actor decides which actions to take, and the critic evaluates those decisions. This architecture addresses limitations of both pure policy gradient methods and pure value-based methods.
The training loop follows these steps:

1. The actor observes the current state S_t and samples an action A_t from its policy π_θ.
2. The environment returns a reward R_{t+1} and the next state S_{t+1}.
3. The critic computes the TD error δ_t = R_{t+1} + γ V_φ(S_{t+1}) - V_φ(S_t).
4. The critic updates its parameters φ to reduce the TD error.
5. The actor updates its parameters θ so that actions with positive TD error become more likely and actions with negative TD error become less likely.
The policy gradient update for the actor takes the form:
θ ← θ + α_actor δ_t ∇_θ log π_θ(A_t | S_t)
The TD error (or advantage) acts as a scaling factor. Actions that led to higher-than-expected returns get reinforced, while actions that led to lower-than-expected returns get suppressed.
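Putting the pieces together, the sketch below shows one online actor-critic update for a discrete-action problem (PyTorch assumed; the network sizes, learning rates, and the example transition are all hypothetical):

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
opt_actor = torch.optim.Adam(actor.parameters(), lr=3e-4)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)

def actor_critic_step(s, a, r, s_next, done):
    """One online actor-critic update from a single transition."""
    v = critic(s)
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * critic(s_next)   # bootstrapped target
        td_error = target - v                                 # delta_t

    critic_loss = (target - v).pow(2).mean()                  # fit V(s) toward the target
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    log_probs = torch.log_softmax(actor(s), dim=-1)           # log pi(.|s)
    actor_loss = -(td_error.squeeze() * log_probs[a])         # scale log pi(a|s) by delta_t
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

# one made-up transition: 4-dimensional observation, action 1, reward 1.0
actor_critic_step(torch.zeros(obs_dim), a=1, r=1.0, s_next=torch.ones(obs_dim), done=0.0)
```

The table below contrasts this hybrid approach with pure policy gradient and pure value-based methods.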
| Approach | Strengths | Weaknesses |
|---|---|---|
| Pure policy gradient (e.g., REINFORCE) | Directly optimizes the policy; works with continuous actions | High variance; requires complete episodes; slow convergence |
| Pure value-based (e.g., Q-learning) | Sample-efficient through off-policy learning; lower variance | Cannot handle continuous action spaces natively; policy is implicit |
| Actor-critic | Combines strengths of both; lower variance than REINFORCE; works with continuous actions; can update online (per step) | More complex; two sets of parameters to tune; potential instability from interacting updates |
In deep reinforcement learning, the actor and critic can be implemented as separate neural networks or as a single network with two output heads. Shared networks use a common feature extractor (such as a convolutional neural network for image inputs) and branch into separate fully connected layers for the policy output and value output. Sharing features reduces the total number of parameters and can improve learning speed, but it can also introduce conflicting gradient signals if the two objectives compete.
Separate networks avoid gradient interference at the cost of higher memory usage and slower feature learning. In practice, the choice depends on the problem: shared architectures are common in environments with pixel observations (e.g., Atari games), while separate networks are more common in continuous control tasks.
The following table summarizes the most widely used actor-critic algorithms and how their critics are designed.
| Algorithm | Year | Critic type | Action space | On/off-policy | Key idea |
|---|---|---|---|---|---|
| A2C (Advantage Actor-Critic) | 2016 | V(s) with advantage | Discrete or continuous | On-policy | Synchronous parallel workers; advantage-based updates |
| A3C (Asynchronous Advantage Actor-Critic) | 2016 | V(s) with advantage | Discrete or continuous | On-policy | Asynchronous parallel workers for decorrelated updates |
| DDPG (Deep Deterministic Policy Gradient) | 2016 | Q(s, a) | Continuous | Off-policy | Deterministic policy with experience replay and target networks |
| PPO (Proximal Policy Optimization) | 2017 | V(s) with GAE | Discrete or continuous | On-policy | Clipped surrogate objective prevents large policy updates |
| TD3 (Twin Delayed DDPG) | 2018 | Twin Q(s, a) | Continuous | Off-policy | Two critics; delayed policy updates; target smoothing |
| SAC (Soft Actor-Critic) | 2018 | Twin Q(s, a) | Continuous | Off-policy | Maximum entropy framework; stochastic policy |
| IMPALA | 2018 | V(s) with V-trace | Discrete or continuous | Off-policy corrected | V-trace importance weighting for distributed training |
| MADDPG | 2017 | Centralized Q(s, a) | Continuous | Off-policy | Centralized critic, decentralized actors for multi-agent settings |
The Advantage Actor-Critic (A2C) and Asynchronous Advantage Actor-Critic (A3C) algorithms were introduced by Volodymyr Mnih and colleagues at DeepMind in 2016. Both methods use a critic that estimates V(s) and compute the advantage from the TD error to update the actor.
A3C runs multiple worker agents in parallel, each interacting with its own copy of the environment. Workers asynchronously send gradient updates to a shared set of parameters. This parallelism serves two purposes: it speeds up data collection, and it decorrelates the training data (since different workers encounter different states), which improves stability compared to a single agent learning from correlated sequential experience.
A2C is the synchronous variant. All workers collect a batch of experience, then their gradients are averaged and applied in a single update. While A2C does not have the decorrelation benefit of asynchronous updates, it often performs comparably to A3C in practice and is simpler to implement. Interestingly, A2C can be viewed as a special case of PPO when PPO's number of optimization epochs per batch is set to one.
DDPG, introduced by Timothy Lillicrap and colleagues in 2016, extends the deterministic policy gradient (DPG) theorem to work with deep neural networks. DDPG uses a Q-function critic that evaluates state-action pairs. Because the policy is deterministic, the Q-function is differentiable with respect to the action, which allows efficient gradient-based policy updates.
DDPG borrows two techniques from DQN to stabilize learning. First, it uses an experience replay buffer to store past transitions and sample mini-batches for training, breaking the correlation between consecutive samples. Second, it uses target networks for both the actor and critic, which are soft-updated (Polyak averaging) to track the main networks slowly. These target networks provide stable targets for the TD error computation.
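The sketch below illustrates the deterministic actor update that this differentiability enables: the actor is trained to output actions the critic scores highly (PyTorch assumed; network shapes and the mini-batch are hypothetical, and replay buffer and target networks are omitted):

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-4)

def ddpg_actor_update(states):
    """Deterministic policy gradient: push the actor toward actions with higher Q(s, pi(s))."""
    actions = actor(states)                               # differentiable w.r.t. actor parameters
    q = critic(torch.cat([states, actions], dim=-1))      # gradient flows through the action input
    actor_loss = -q.mean()                                 # maximize Q  ==  minimize -Q
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
    return actor_loss.item()   # the critic's own TD-target update happens in a separate step

# a mini-batch of states as would be sampled from the replay buffer
print(ddpg_actor_update(torch.randn(32, obs_dim)))
```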
DDPG demonstrated strong performance on over 20 continuous control tasks, including cartpole swing-up, dexterous manipulation, and legged locomotion.
PPO, published by John Schulman and colleagues at OpenAI in 2017, uses a critic that estimates V(s) combined with GAE to compute advantages. The defining feature of PPO is its clipped surrogate objective, which prevents the policy from changing too much in a single update:
L^CLIP(θ) = E_t [ min( r_t(θ) A_t, clip(r_t(θ), 1-ε, 1+ε) A_t ) ]
where r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) is the probability ratio between the new and old policies, and ε is a small hyperparameter (typically 0.1 or 0.2).
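A minimal sketch of the clipped objective (PyTorch assumed, toy log-probabilities and advantages) is shown below; the result is negated so that gradient descent maximizes the surrogate:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective, returned as a loss to minimize.

    logp_new / logp_old are log-probabilities of the taken actions under the
    current and old policies; advantages would typically come from GAE.
    """
    ratio = torch.exp(logp_new - logp_old)                      # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()                # pessimistic bound, negated

# toy batch of three actions
logp_new = torch.tensor([-0.9, -1.2, -0.4], requires_grad=True)
logp_old = torch.tensor([-1.0, -1.0, -0.5])
adv = torch.tensor([1.5, -0.7, 0.3])
print(ppo_clip_loss(logp_new, logp_old, adv))
```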
PPO has become one of the most widely used RL algorithms due to its combination of simplicity, stability, and strong performance. It was the RL algorithm used in the RLHF pipeline behind InstructGPT and ChatGPT, and it remains a common choice for aligning language models.
TD3, proposed by Scott Fujimoto, Herke van Hoof, and David Meger in 2018, addresses overestimation bias in DDPG through three modifications, two of which directly involve the critic.
The first modification is clipped double Q-learning: TD3 trains two independent Q-function critics and uses the minimum of their predictions when computing the target value. This reduces the systematic overestimation that occurs when a single critic's errors compound through the Bellman backup. The target is computed as:
y = r + γ min_{i=1,2} Q_φ_targ,i(s', a')

where a' is the action proposed by the target policy for the next state s', perturbed by the smoothing noise described below.
The second modification is delayed policy updates: the actor is updated less frequently than the critics (typically once for every two critic updates). This gives the critics time to converge before the actor uses their estimates, reducing volatility.
The third modification is target policy smoothing: noise is added to the target action to smooth out Q-function estimates and prevent the policy from exploiting narrow peaks in the Q-function.
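The sketch below combines the first and third modifications into the critic-target computation (PyTorch assumed; the target networks, noise scales, and batch shapes are hypothetical):

```python
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 8, 2, 0.99
actor_targ = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
q1_targ = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
q2_targ = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

def td3_target(r, s_next, done, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """TD3 critic target: smoothed target action, minimum of the twin target critics."""
    with torch.no_grad():
        a_next = actor_targ(s_next)
        noise = (torch.randn_like(a_next) * noise_std).clamp(-noise_clip, noise_clip)
        a_next = (a_next + noise).clamp(-act_limit, act_limit)      # target policy smoothing
        sa = torch.cat([s_next, a_next], dim=-1)
        q_next = torch.min(q1_targ(sa), q2_targ(sa))                # clipped double Q
        return r + gamma * (1.0 - done) * q_next                    # y = r + gamma * min_i Q_i

# toy mini-batch
r, done = torch.ones(32, 1), torch.zeros(32, 1)
print(td3_target(r, torch.randn(32, obs_dim), done).shape)
```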
SAC, introduced by Tuomas Haarnoja and colleagues at UC Berkeley in 2018, operates within the maximum entropy reinforcement learning framework. The agent maximizes a modified objective that includes an entropy bonus:
J(π) = Σ_t E [ r(s_t, a_t) + α H(π(·|s_t)) ]
where H is the entropy of the policy and α is a temperature parameter controlling the exploration-exploitation tradeoff.
SAC uses two Q-function critics (similar to TD3) to reduce overestimation bias. The policy is stochastic rather than deterministic, which naturally encourages exploration. Because SAC is off-policy, it can reuse past experience from a replay buffer, making it more sample-efficient than on-policy methods like PPO.
There are two common variants of SAC: one with a fixed temperature α and one that automatically adjusts α by enforcing an entropy constraint. The entropy-constrained variant is generally preferred because it adapts the level of exploration over the course of training.
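A sketch of the SAC critic target (PyTorch assumed) makes the entropy bonus explicit; the Q-values and log-probabilities below are stand-ins for outputs of the target critics and the current policy:

```python
import torch

def sac_critic_target(r, done, q1_next, q2_next, logp_next, alpha=0.2, gamma=0.99):
    """SAC critic target: twin-Q minimum plus the entropy bonus for the next action.

    q1_next / q2_next are the target critics' values for (s', a') with a'
    sampled from the current policy, and logp_next is log pi(a'|s').
    """
    q_next = torch.min(q1_next, q2_next)
    soft_value = q_next - alpha * logp_next          # subtracting log pi adds the entropy bonus
    return r + gamma * (1.0 - done) * soft_value

# toy batch: more negative logp (higher entropy) raises the target
r, done = torch.ones(4, 1), torch.zeros(4, 1)
q1, q2 = torch.full((4, 1), 5.0), torch.full((4, 1), 4.5)
logp = torch.tensor([[-1.0], [-2.0], [-0.5], [-3.0]])
print(sac_critic_target(r, done, q1, q2, logp))
```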
IMPALA (Importance Weighted Actor-Learner Architectures), published by Lasse Espeholt and colleagues at DeepMind in 2018, is a distributed actor-critic architecture designed for scale. In IMPALA, many actors collect experience in parallel while a centralized learner updates the policy and value function.
Because the actors may run policies that lag behind the learner's current policy by several updates, the learning becomes off-policy. IMPALA corrects for this discrepancy using V-trace, an off-policy correction method based on importance sampling with truncated importance weights. V-trace allows IMPALA to maintain the stability of on-policy learning while achieving the throughput of massively parallel data collection. IMPALA achieves data throughput rates of 250,000 frames per second, over 30 times faster than single-machine A3C.
The design of the critic depends heavily on whether the action space is discrete or continuous.
| Feature | Discrete actions | Continuous actions |
|---|---|---|
| Typical critic output | V(s), or Q(s, a) for all actions simultaneously | Q(s, a) for a single (s, a) pair |
| Action selection | argmax over finite set of Q-values | Requires a separate actor network |
| Common algorithms | DQN, A2C, A3C | DDPG, TD3, SAC |
| Challenge | Scales poorly to large action spaces | Function approximation error; overestimation bias |
In discrete settings, a single forward pass through the critic can produce Q-values for all possible actions. The policy can then be derived by selecting the action with the highest Q-value (greedy) or by sampling from a softmax distribution over the Q-values.
In continuous settings, enumerating Q-values is infeasible because the action space is uncountably infinite. The critic must instead accept a specific action as input alongside the state and output a single scalar value. A separate actor network is needed to propose actions, and the critic evaluates them. This is why continuous-control actor-critic algorithms such as DDPG, TD3, and SAC use explicit actor and critic networks.
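The sketch below contrasts the two critic interfaces (PyTorch assumed, arbitrary dimensions): a discrete critic returns Q-values for every action at once, while a continuous critic scores one state-action pair per forward pass:

```python
import torch
import torch.nn as nn

obs_dim, n_actions, act_dim = 8, 4, 2

# Discrete: one forward pass yields Q(s, a) for every action at once.
q_discrete = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

# Continuous: the critic scores one specific (s, a) pair per forward pass.
q_continuous = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

s = torch.randn(1, obs_dim)
a = torch.randn(1, act_dim)

greedy_action = q_discrete(s).argmax(dim=-1)                   # argmax over a finite set
value_of_pair = q_continuous(torch.cat([s, a], dim=-1))        # needs an actor to propose a
print(greedy_action.item(), value_of_pair.item())
```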
In multi-agent settings, the design of the critic becomes more nuanced. The standard approach is centralized training with decentralized execution (CTDE), introduced in algorithms like MADDPG (Lowe et al., 2017) and MAPPO.
During training, each agent's critic has access to the observations and actions of all agents. This centralized critic can learn a more accurate value function because it accounts for the behavior of other agents, which would otherwise appear as non-stationarity from any single agent's perspective.
During execution, each agent uses only its own actor (which depends only on local observations), so no communication between agents is needed at inference time. The critic is only used during training and can be discarded afterward.
Other approaches to multi-agent critics include fully decentralized critics, where each agent learns its own value function from local observations only, and value-decomposition methods such as VDN and QMIX, which factor a joint action-value function into per-agent components.
Critics play a central role in reinforcement learning from human feedback (RLHF), the technique used to align large language models with human preferences. The standard RLHF pipeline proceeds in three stages:

1. A pretrained language model is fine-tuned with supervised learning on human-written demonstrations.
2. A reward model is trained on human comparisons of model outputs, learning to score responses the way human raters would.
3. The language model (the actor) is further fine-tuned with an RL algorithm, typically PPO, to maximize the scores assigned by the reward model.
The reward model (critic) in RLHF is typically initialized from the same pretrained language model and fine-tuned on comparison data. Its quality directly determines how well the fine-tuned model aligns with human intentions. Inaccurate reward models can lead to reward hacking, where the actor learns to produce outputs that score highly according to the critic but do not actually satisfy human preferences.
More recent approaches like Direct Preference Optimization (DPO) and Reinforcement Learning from AI Feedback (RLAIF) modify or replace the explicit critic in this pipeline, but the original RLHF formulation with PPO and a reward model critic remains widely studied.
Training critics in deep RL is subject to several well-known challenges.
Q-function critics tend to overestimate action values because the max operator in the Bellman update selectively propagates estimation errors upward. This problem was first identified in tabular Q-learning and becomes worse with function approximation. Double Q-learning, Double DQN, and the twin critics in TD3 and SAC all address this issue by decorrelating the action selection from the value estimation.
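A sketch of the two targets (PyTorch assumed, toy networks) makes the decorrelation concrete: Double DQN lets the online network choose the action while the target network evaluates it:

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 8, 4, 0.99
q_online = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
q_target = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

def dqn_target(r, s_next, done):
    """Standard DQN target: the same (target) network selects and evaluates the action."""
    with torch.no_grad():
        q_next = q_target(s_next).max(dim=-1, keepdim=True).values
        return r + gamma * (1.0 - done) * q_next

def double_dqn_target(r, s_next, done):
    """Double DQN target: online network selects the action, target network evaluates it."""
    with torch.no_grad():
        best_a = q_online(s_next).argmax(dim=-1, keepdim=True)   # selection
        q_next = q_target(s_next).gather(-1, best_a)             # evaluation
        return r + gamma * (1.0 - done) * q_next

r, done, s_next = torch.ones(16, 1), torch.zeros(16, 1), torch.randn(16, obs_dim)
print(dqn_target(r, s_next, done).mean().item(),
      double_dqn_target(r, s_next, done).mean().item())
```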
Richard Sutton identified the "deadly triad" as the combination of (1) function approximation, (2) bootstrapping (as in TD learning), and (3) off-policy learning. When all three are present, value estimates can diverge. This is directly relevant to critics because most deep RL critics use neural network function approximation with TD bootstrapping, and many algorithms (DDPG, TD3, SAC) are off-policy. Techniques like target networks, Polyak averaging, and gradient clipping help mitigate divergence but do not eliminate the risk entirely.
In actor-critic methods, the actor and critic depend on each other: the critic's value estimates are only accurate for the current policy, but the policy keeps changing based on the critic's feedback. An inaccurate critic step can mislead the actor, which in turn changes the data distribution, further degrading the critic's estimates. This feedback loop can lead to oscillation or divergence. Two-timescale learning (updating the critic faster than the actor) is the standard mitigation strategy, providing the critic with enough updates to track the changing policy.
When the agent cannot observe the full state of the environment (a partially observable Markov decision process, or POMDP), the critic must estimate values from incomplete information. This makes the value function harder to learn and can introduce systematic errors. Solutions include using recurrent neural networks (such as LSTM) in the critic to maintain a memory of past observations, or providing the critic with additional information during training that is unavailable during execution (the asymmetric actor-critic approach).
While not traditionally described using RL terminology, generative adversarial networks (GANs) contain a component that functions as a critic. In the original GAN formulation by Ian Goodfellow and colleagues (2014), the discriminator evaluates generated samples and provides a learning signal to the generator.
In the Wasserstein GAN (WGAN) variant introduced by Martin Arjovsky and colleagues in 2017, the discriminator is explicitly renamed the "critic." Instead of outputting a probability that a sample is real, the WGAN critic outputs an unbounded scalar score. The critic is trained to approximate the Wasserstein distance (earth mover's distance) between the real and generated distributions, providing a more stable training signal than the original GAN formulation.
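A minimal sketch of the WGAN critic objective (PyTorch assumed, toy dimensions) is shown below; note that a real implementation must also enforce a Lipschitz constraint on the critic, via weight clipping in the original WGAN or a gradient penalty in WGAN-GP:

```python
import torch
import torch.nn as nn

# WGAN critic: an unbounded scalar score, not a probability.
critic = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))

def wgan_critic_loss(real, fake):
    """Critic maximizes score(real) - score(fake), an estimate of the Wasserstein
    distance between real and generated samples; negated so it can be minimized."""
    return -(critic(real).mean() - critic(fake).mean())

def wgan_generator_loss(fake):
    """Generator tries to raise the critic's score on its samples."""
    return -critic(fake).mean()

real, fake = torch.randn(32, 16), torch.randn(32, 16)
print(wgan_critic_loss(real, fake).item())
```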
Target networks are copies of the critic (and sometimes the actor) that are updated slowly, either through periodic hard copies or continuous Polyak averaging. They provide stable targets for the TD error computation, preventing the instability that arises when the same network is used for both prediction and target computation. Target networks were introduced in DQN and are used in DDPG, TD3, SAC, and many other algorithms.
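A Polyak (soft) target update can be sketched in a few lines (PyTorch assumed; τ = 0.005 is a common but arbitrary choice here):

```python
import torch
import torch.nn as nn

critic = nn.Linear(8, 1)                              # stand-in for the online critic network
critic_target = nn.Linear(8, 1)
critic_target.load_state_dict(critic.state_dict())   # target starts as an exact copy

def polyak_update(net, target_net, tau=0.005):
    """Soft target update: target <- tau * online + (1 - tau) * target."""
    with torch.no_grad():
        for p, p_targ in zip(net.parameters(), target_net.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)

polyak_update(critic, critic_target)                  # called after every critic gradient step
```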
Off-policy critic training benefits from experience replay, where past transitions (s, a, r, s') are stored in a buffer and sampled randomly for training. This breaks the temporal correlation between consecutive samples and allows each transition to be used multiple times, improving sample efficiency. Prioritized experience replay further improves learning by sampling transitions with larger TD errors more frequently.
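A uniform replay buffer is simple to sketch (plain Python; the capacity and the toy transitions are arbitrary):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (s, a, r, s_next, done) transitions with uniform sampling."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)        # oldest transitions are evicted automatically

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)   # random draws break temporal correlation
        return list(zip(*batch))                          # one column per transition field

buf = ReplayBuffer()
for t in range(1000):
    buf.add(s=t, a=0, r=1.0, s_next=t + 1, done=False)
states, actions, rewards, next_states, dones = buf.sample(32)
```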
The critic's learning rate relative to the actor's is an important hyperparameter. If the critic learns too slowly, it provides stale or inaccurate feedback. If it learns too fast, its estimates may oscillate. A common practice is to set the critic learning rate equal to or slightly higher than the actor's learning rate, and to perform multiple critic updates per actor update (as in TD3 and SAC).
For low-dimensional state spaces (e.g., joint angles in robotics), critics typically use fully connected networks with 2 to 3 hidden layers of 256 or 512 units each. For image-based observations, convolutional feature extractors are used. When the critic takes both state and action as input (as in DDPG, TD3, SAC), many implementations simply concatenate the state and action at the input layer; the original DDPG architecture instead introduced the action after the first hidden layer, merging it with the state features partway through the network.
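A sketch of the latter, DDPG-style critic, with the action merged in after the first hidden layer (PyTorch assumed, hypothetical dimensions):

```python
import torch
import torch.nn as nn

class QCritic(nn.Module):
    """Q(s, a) critic in the style of the original DDPG paper: the state passes
    through one hidden layer before the action is concatenated in."""

    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.fc_state = nn.Linear(obs_dim, hidden)
        self.fc_joint = nn.Linear(hidden + act_dim, hidden)
        self.q_out = nn.Linear(hidden, 1)

    def forward(self, s, a):
        h = torch.relu(self.fc_state(s))
        h = torch.relu(self.fc_joint(torch.cat([h, a], dim=-1)))   # action enters here
        return self.q_out(h)

critic = QCritic(obs_dim=17, act_dim=6)      # e.g. a locomotion task's dimensions
print(critic(torch.randn(32, 17), torch.randn(32, 6)).shape)   # -> torch.Size([32, 1])
```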
The critic is one of the two fundamental components of modern actor-critic reinforcement learning. By learning to predict expected returns, the critic provides a low-variance training signal that guides the actor toward better policies. From the early adaptive critic elements of the 1980s to the twin Q-networks and entropy-regularized critics of contemporary algorithms, the design of the critic has been a consistent focus of RL research. The same concept extends beyond RL proper, appearing in GANs (where the discriminator acts as a critic) and in RLHF pipelines (where the reward model serves as a critic for language model alignment).