A critic in machine learning is a component that evaluates the quality of actions or outputs produced by another component, typically called the actor or policy. The critic learns to estimate a value function that predicts the expected cumulative reward from a given state or state-action pair. This evaluation signal guides the learning process by telling the actor how good its decisions are relative to what was expected.
The critic concept is most prominent in reinforcement learning (RL), where it serves as the backbone of actor-critic architectures. However, critics also appear in generative adversarial networks (where the discriminator acts as a critic) and in reinforcement learning from human feedback (RLHF) pipelines used to fine-tune large language models.
Imagine you are learning to throw a basketball into a hoop. Every time you throw, your coach watches and says something like "that was pretty good" or "aim a little higher next time." The coach does not throw the ball for you; instead, the coach just watches and gives feedback.
In machine learning, the critic is like that coach. There is also a player (called the "actor") who actually makes decisions. The critic watches what the actor does and scores each decision. If the score is higher than expected, the actor learns to do more of that. If the score is lower than expected, the actor adjusts. Over time, the actor gets better and better because the critic keeps giving helpful feedback.
The idea of separating evaluation from action selection in learning systems dates back to the late 1970s. Ian Witten described an early form of adaptive controller with a distinct evaluative component in 1977. The terms "actor" and "critic" were introduced by Andrew Barto, Richard Sutton, and Charles Anderson in their 1983 paper "Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems." In that work, the actor was called the Associative Search Element (ASE) and the critic was called the Adaptive Critic Element (ACE). The two modules interacted so that the ACE learned to predict future reinforcement signals while the ASE used those predictions to improve its action selections.
Richard Sutton formalized temporal difference (TD) learning in 1988, providing the theoretical foundation for how critics learn value estimates from sequential experience without waiting for complete episodes to finish. This paper, "Learning to Predict by the Methods of Temporal Differences," showed that TD methods require less memory and computation than Monte Carlo methods while producing more accurate predictions for many real-world tasks.
The convergence properties of actor-critic algorithms were rigorously analyzed by Vijay Konda and John Tsitsiklis in their 1999 NeurIPS paper and subsequent 2003 journal article "On Actor-Critic Algorithms." They proved convergence for a class of two-timescale algorithms where the critic uses TD learning with linear function approximation and the actor updates follow an approximate policy gradient.
In reinforcement learning, an agent interacts with an environment by selecting actions according to a policy. After each action, the environment returns a reward and transitions to a new state. The goal is to learn a policy that maximizes the expected sum of discounted future rewards.
The critic's job is to estimate how much total reward the agent can expect from a given situation. This estimate takes one of several forms.
The state value function V(s) estimates the expected cumulative discounted reward starting from state s and following the current policy thereafter:
V^π(s) = E_π [ Σ_{t=0}^{∞} γ^t R_{t+1} | S_0 = s ]
Here, γ (gamma) is the discount factor between 0 and 1 that controls how much the agent values future rewards relative to immediate ones. A critic that learns V(s) can compute the advantage of any action by comparing the actual outcome to the baseline prediction.
The action value function Q(s, a) estimates the expected return after taking action a in state s and following the current policy afterward:
Q^π(s, a) = E_π [ Σ_{t=0}^{∞} γ^t R_{t+1} | S_0 = s, A_0 = a ]
Q-function critics are used in algorithms like DQN, DDPG, and SAC. They are especially useful in continuous action spaces where the critic must evaluate specific state-action pairs rather than just states.
The advantage function combines both of the above:
A^π(s, a) = Q^π(s, a) - V^π(s)
The advantage tells the agent how much better (or worse) a particular action is compared to the average action in that state. A positive advantage means the action is better than average; a negative advantage means it is worse. Using the advantage function reduces variance in policy gradient estimates, which speeds up and stabilizes learning.
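As a concrete illustration, the short sketch below (NumPy only, with made-up reward and value numbers) computes a discounted return and an advantage from hypothetical Q and V estimates:

```python
import numpy as np

gamma = 0.99                                 # discount factor
rewards = np.array([1.0, 0.0, 0.0, 5.0])     # rewards observed over one short rollout

# Discounted return from the first state: R_1 + gamma*R_2 + gamma^2*R_3 + ...
G = sum(gamma**t * r for t, r in enumerate(rewards))

# Hypothetical critic estimates for the starting state and state-action pair
V_s = 4.2    # V(s): expected return under the current policy
Q_sa = 4.9   # Q(s, a): expected return after committing to action a in s

advantage = Q_sa - V_s                       # A(s, a) = Q(s, a) - V(s)
print(f"return G = {G:.3f}, advantage = {advantage:.2f}")
```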
The most common way a critic learns is through temporal difference (TD) learning, introduced by Sutton in 1988. Unlike Monte Carlo methods that wait until an episode ends to update value estimates, TD methods update estimates after each step using the observed reward and the current estimate of the next state's value.
The one-step TD error (also called the TD residual) is defined as:
δ_t = R_{t+1} + γ V(S_{t+1}) - V(S_t)
This quantity represents the difference between what the critic predicted (V(S_t)) and a better estimate formed by combining the actual reward R_{t+1} with the discounted value of the next state γ V(S_{t+1}). When the TD error is positive, the outcome was better than expected; when negative, it was worse.
The critic updates its parameters to minimize the squared TD error, typically using stochastic gradient descent:
φ ← φ + α δ_t ∇_φ V_φ(S_t)
where α is the learning rate and φ represents the critic's parameters. The bootstrapped target R_{t+1} + γ V(S_{t+1}) is treated as a constant during differentiation, so this is the standard semi-gradient TD update rather than a true gradient of the squared error.
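A minimal sketch of this update for a tabular critic (a value table rather than a neural network, with an invented transition) looks as follows; for a table, the gradient ∇_φ V(S_t) is simply 1 for the visited entry:

```python
import numpy as np

n_states, gamma, alpha = 5, 0.99, 0.1
V = np.zeros(n_states)                  # tabular critic: one value estimate per state

def td0_update(s, r, s_next, done):
    """One TD(0) update of the value table."""
    target = r if done else r + gamma * V[s_next]
    delta = target - V[s]               # TD error delta_t
    V[s] += alpha * delta               # move V(s) toward the bootstrapped target
    return delta

# example transition: from state 0, reward 1.0, landing in state 2
print(td0_update(s=0, r=1.0, s_next=2, done=False), V)
```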
TD(0) uses only a single step of actual reward before bootstrapping from the value estimate. TD(λ) generalizes this by blending multi-step returns through eligibility traces. When λ = 0, the method reduces to standard one-step TD; when λ = 1, it becomes equivalent to Monte Carlo estimation. Intermediate values of λ trade off bias (from bootstrapping) against variance (from using longer sequences of actual rewards).
Generalized Advantage Estimation (GAE), proposed by John Schulman and colleagues in 2016, extends the idea of TD(λ) to advantage estimation. GAE computes an exponentially weighted sum of multi-step TD errors:
A_t^GAE(γ,λ) = Σ_{l=0}^{∞} (γλ)^l δ_{t+l}
The hyperparameter λ controls the bias-variance tradeoff. Lower values of λ (closer to 0) produce lower variance but higher bias estimates, while higher values (closer to 1) give lower bias but higher variance. GAE is used extensively in modern on-policy algorithms such as PPO and TRPO.
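The following sketch implements the GAE recursion for a single trajectory segment (NumPy, hypothetical rewards and value estimates; episode-termination masking is omitted for brevity):

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory segment.

    values[t] is the critic's V(s_t); last_value bootstraps the final state.
    """
    values = np.append(values, last_value)
    advantages = np.zeros(len(rewards))
    running = 0.0
    # Work backwards: A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

adv = gae(rewards=np.array([1.0, 0.0, 2.0]),
          values=np.array([0.5, 0.4, 1.0]),
          last_value=0.0)
print(adv)
```

The backward recursion A_t = δ_t + γλ A_{t+1} is equivalent to the exponentially weighted sum above, truncated at the end of the segment.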
Actor-critic methods combine a policy (the actor) and a value function (the critic) into a single learning framework. The actor decides which actions to take, and the critic evaluates those decisions. This architecture addresses limitations of both pure policy gradient methods and pure value-based methods.
The training loop follows these steps:

1. The actor observes the current state S_t and samples an action A_t from its policy π_θ.
2. The environment returns a reward R_{t+1} and the next state S_{t+1}.
3. The critic computes the TD error δ_t = R_{t+1} + γ V_φ(S_{t+1}) - V_φ(S_t).
4. The critic updates its parameters φ to reduce the TD error.
5. The actor updates its parameters θ so that actions with positive TD error become more likely and actions with negative TD error become less likely.
The policy gradient update for the actor takes the form:
θ ← θ + α_actor δ_t ∇_θ log π_θ(A_t | S_t)
The TD error (or advantage) acts as a scaling factor. Actions that led to higher-than-expected returns get reinforced, while actions that led to lower-than-expected returns get suppressed.
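Putting the pieces together, the sketch below shows one online actor-critic update for a discrete-action problem (PyTorch assumed; the network sizes, learning rates, and the example transition are all hypothetical):

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
opt_actor = torch.optim.Adam(actor.parameters(), lr=3e-4)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)

def actor_critic_step(s, a, r, s_next, done):
    """One online actor-critic update from a single transition."""
    v = critic(s)
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * critic(s_next)   # bootstrapped target
        td_error = target - v                                 # delta_t

    critic_loss = (target - v).pow(2).mean()                  # fit V(s) toward the target
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    log_probs = torch.log_softmax(actor(s), dim=-1)           # log pi(.|s)
    actor_loss = -(td_error.squeeze() * log_probs[a])         # scale log pi(a|s) by delta_t
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

# one made-up transition: 4-dimensional observation, action 1, reward 1.0
actor_critic_step(torch.zeros(obs_dim), a=1, r=1.0, s_next=torch.ones(obs_dim), done=0.0)
```

The table below contrasts this hybrid approach with pure policy gradient and pure value-based methods.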
| Approach | Strengths | Weaknesses |
|---|---|---|
| Pure policy gradient (e.g., REINFORCE) | Directly optimizes the policy; works with continuous actions | High variance; requires complete episodes; slow convergence |
| Pure value-based (e.g., Q-learning) | Sample-efficient through off-policy learning; lower variance | Cannot handle continuous action spaces natively; policy is implicit |
| Actor-critic | Combines strengths of both; lower variance than REINFORCE; works with continuous actions; can update online (per step) | More complex; two sets of parameters to tune; potential instability from interacting updates |
In deep reinforcement learning, the actor and critic can be implemented as separate neural networks or as a single network with two output heads. Shared networks use a common feature extractor (such as a convolutional neural network for image inputs) and branch into separate fully connected layers for the policy output and value output. Sharing features reduces the total number of parameters and can improve learning speed, but it can also introduce conflicting gradient signals if the two objectives compete.
Separate networks avoid gradient interference at the cost of higher memory usage and slower feature learning. In practice, the choice depends on the problem: shared architectures are common in environments with pixel observations (e.g., Atari games), while separate networks are more common in continuous control tasks.
The following table summarizes the most widely used actor-critic algorithms and how their critics are designed.
| Algorithm | Year | Critic type | Action space | On/off-policy | Key idea |
|---|---|---|---|---|---|
| A2C (Advantage Actor-Critic) | 2016 | V(s) with advantage | Discrete or continuous | On-policy | Synchronous parallel workers; advantage-based updates |
| A3C (Asynchronous Advantage Actor-Critic) | 2016 | V(s) with advantage | Discrete or continuous | On-policy | Asynchronous parallel workers for decorrelated updates |
| DDPG (Deep Deterministic Policy Gradient) | 2016 | Q(s, a) | Continuous | Off-policy | Deterministic policy with experience replay and target networks |
| PPO (Proximal Policy Optimization) | 2017 | V(s) with GAE | Discrete or continuous | On-policy | Clipped surrogate objective prevents large policy updates |
| TD3 (Twin Delayed DDPG) | 2018 | Twin Q(s, a) | Continuous | Off-policy | Two critics; delayed policy updates; target smoothing |
| SAC (Soft Actor-Critic) | 2018 | Twin Q(s, a) | Continuous | Off-policy | Maximum entropy framework; stochastic policy |
| IMPALA | 2018 | V(s) with V-trace | Discrete or continuous | Off-policy corrected | V-trace importance weighting for distributed training |
| MADDPG | 2017 | Centralized Q(s, a) | Continuous | Off-policy | Centralized critic, decentralized actors for multi-agent settings |
The Advantage Actor-Critic (A2C) and Asynchronous Advantage Actor-Critic (A3C) algorithms were introduced by Volodymyr Mnih and colleagues at DeepMind in 2016. Both methods use a critic that estimates V(s) and compute the advantage from the TD error to update the actor.
A3C runs multiple worker agents in parallel, each interacting with its own copy of the environment. Workers asynchronously send gradient updates to a shared set of parameters. This parallelism serves two purposes: it speeds up data collection, and it decorrelates the training data (since different workers encounter different states), which improves stability compared to a single agent learning from correlated sequential experience.
A2C is the synchronous variant. All workers collect a batch of experience, then their gradients are averaged and applied in a single update. While A2C does not have the decorrelation benefit of asynchronous updates, it often performs comparably to A3C in practice and is simpler to implement. Interestingly, A2C can be viewed as a special case of PPO when PPO's number of optimization epochs per batch is set to one.
DDPG, introduced by Timothy Lillicrap and colleagues in 2016, extends the deterministic policy gradient (DPG) theorem to work with deep neural networks. DDPG uses a Q-function critic that evaluates state-action pairs. Because the policy is deterministic, the Q-function is differentiable with respect to the action, which allows efficient gradient-based policy updates.
DDPG borrows two techniques from DQN to stabilize learning. First, it uses an experience replay buffer to store past transitions and sample mini-batches for training, breaking the correlation between consecutive samples. Second, it uses target networks for both the actor and critic, which are soft-updated (Polyak averaging) to track the main networks slowly. These target networks provide stable targets for the TD error computation.
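The sketch below illustrates the deterministic actor update that this differentiability enables: the actor is trained to output actions the critic scores highly (PyTorch assumed; network shapes and the mini-batch are hypothetical, and replay buffer and target networks are omitted):

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-4)

def ddpg_actor_update(states):
    """Deterministic policy gradient: push the actor toward actions with higher Q(s, pi(s))."""
    actions = actor(states)                               # differentiable w.r.t. actor parameters
    q = critic(torch.cat([states, actions], dim=-1))      # gradient flows through the action input
    actor_loss = -q.mean()                                 # maximize Q  ==  minimize -Q
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
    return actor_loss.item()   # the critic's own TD-target update happens in a separate step

# a mini-batch of states as would be sampled from the replay buffer
print(ddpg_actor_update(torch.randn(32, obs_dim)))
```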
DDPG demonstrated strong performance on over 20 continuous control tasks, including cartpole swing-up, dexterous manipulation, and legged locomotion.
PPO, published by John Schulman and colleagues at OpenAI in 2017, uses a critic that estimates V(s) combined with GAE to compute advantages. The defining feature of PPO is its clipped surrogate objective, which prevents the policy from changing too much in a single update:
L^CLIP(θ) = E_t [ min( r_t(θ) A_t, clip(r_t(θ), 1-ε, 1+ε) A_t ) ]
where r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t) is the probability ratio between the new and old policies, and ε is a small hyperparameter (typically 0.1 or 0.2).
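A minimal sketch of the clipped objective (PyTorch assumed, toy log-probabilities and advantages) is shown below; the result is negated so that gradient descent maximizes the surrogate:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective, returned as a loss to minimize.

    logp_new / logp_old are log-probabilities of the taken actions under the
    current and old policies; advantages would typically come from GAE.
    """
    ratio = torch.exp(logp_new - logp_old)                      # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()                # pessimistic bound, negated

# toy batch of three actions
logp_new = torch.tensor([-0.9, -1.2, -0.4], requires_grad=True)
logp_old = torch.tensor([-1.0, -1.0, -0.5])
adv = torch.tensor([1.5, -0.7, 0.3])
print(ppo_clip_loss(logp_new, logp_old, adv))
```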
PPO has become one of the most widely used RL algorithms due to its combination of simplicity, stability, and strong performance. It was the RL algorithm used in the RLHF pipeline behind InstructGPT and ChatGPT, and it remains a common choice for aligning language models.
TD3, proposed by Scott Fujimoto, Herke van Hoof, and David Meger in 2018, addresses overestimation bias in DDPG through three modifications, two of which directly involve the critic.
The first modification is clipped double Q-learning: TD3 trains two independent Q-function critics and uses the minimum of their predictions when computing the target value. This reduces the systematic overestimation that occurs when a single critic's errors compound through the Bellman backup. The target is computed as:
y = r + γ min_{i=1,2} Q_φ_targ,i(s', a')

where a' is the action proposed by the target policy for the next state s', perturbed by the smoothing noise described below.
The second modification is delayed policy updates: the actor is updated less frequently than the critics (typically once for every two critic updates). This gives the critics time to converge before the actor uses their estimates, reducing volatility.
The third modification is target policy smoothing: noise is added to the target action to smooth out Q-function estimates and prevent the policy from exploiting narrow peaks in the Q-function.
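The sketch below combines the first and third modifications into the critic-target computation (PyTorch assumed; the target networks, noise scales, and batch shapes are hypothetical):

```python
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 8, 2, 0.99
actor_targ = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
q1_targ = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
q2_targ = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

def td3_target(r, s_next, done, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """TD3 critic target: smoothed target action, minimum of the twin target critics."""
    with torch.no_grad():
        a_next = actor_targ(s_next)
        noise = (torch.randn_like(a_next) * noise_std).clamp(-noise_clip, noise_clip)
        a_next = (a_next + noise).clamp(-act_limit, act_limit)      # target policy smoothing
        sa = torch.cat([s_next, a_next], dim=-1)
        q_next = torch.min(q1_targ(sa), q2_targ(sa))                # clipped double Q
        return r + gamma * (1.0 - done) * q_next                    # y = r + gamma * min_i Q_i

# toy mini-batch
r, done = torch.ones(32, 1), torch.zeros(32, 1)
print(td3_target(r, torch.randn(32, obs_dim), done).shape)
```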
SAC, introduced by Tuomas Haarnoja and colleagues at UC Berkeley in 2018, operates within the maximum entropy reinforcement learning framework. The agent maximizes a modified objective that includes an entropy bonus:
J(π) = Σ_t E [ r(s_t, a_t) + α H(π(·|s_t)) ]
where H is the entropy of the policy and α is a temperature parameter controlling the exploration-exploitation tradeoff.
SAC uses two Q-function critics (similar to TD3) to reduce overestimation bias. The policy is stochastic rather than deterministic, which naturally encourages exploration. Because SAC is off-policy, it can reuse past experience from a replay buffer, making it more sample-efficient than on-policy methods like PPO.
There are two common variants of SAC: one with a fixed temperature α and one that automatically adjusts α by enforcing an entropy constraint. The entropy-constrained variant is generally preferred because it adapts the level of exploration over the course of training.
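A sketch of the SAC critic target (PyTorch assumed) makes the entropy bonus explicit; the Q-values and log-probabilities below are stand-ins for outputs of the target critics and the current policy:

```python
import torch

def sac_critic_target(r, done, q1_next, q2_next, logp_next, alpha=0.2, gamma=0.99):
    """SAC critic target: twin-Q minimum plus the entropy bonus for the next action.

    q1_next / q2_next are the target critics' values for (s', a') with a'
    sampled from the current policy, and logp_next is log pi(a'|s').
    """
    q_next = torch.min(q1_next, q2_next)
    soft_value = q_next - alpha * logp_next          # subtracting log pi adds the entropy bonus
    return r + gamma * (1.0 - done) * soft_value

# toy batch: more negative logp (higher entropy) raises the target
r, done = torch.ones(4, 1), torch.zeros(4, 1)
q1, q2 = torch.full((4, 1), 5.0), torch.full((4, 1), 4.5)
logp = torch.tensor([[-1.0], [-2.0], [-0.5], [-3.0]])
print(sac_critic_target(r, done, q1, q2, logp))
```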
IMPALA (Importance Weighted Actor-Learner Architectures), published by Lasse Espeholt and colleagues at DeepMind in 2018, is a distributed actor-critic architecture designed for scale. In IMPALA, many actors collect experience in parallel while a centralized learner updates the policy and value function.
Because the actors may run policies that lag behind the learner's current policy by several updates, the learning becomes off-policy. IMPALA corrects for this discrepancy using V-trace, an off-policy correction method based on importance sampling with truncated importance weights. V-trace allows IMPALA to maintain the stability of on-policy learning while achieving the throughput of massively parallel data collection. IMPALA achieves data throughput rates of 250,000 frames per second, over 30 times faster than single-machine A3C.
The design of the critic depends heavily on whether the action space is discrete or continuous.
| Feature | Discrete actions | Continuous actions |
|---|---|---|
| Typical critic output | V(s), or Q(s, a) for all actions simultaneously | Q(s, a) for a single (s, a) pair |
| Action selection | argmax over finite set of Q-values | Requires a separate actor network |
| Common algorithms | DQN, A2C, A3C | DDPG, TD3, SAC |
| Challenge | Scales poorly to large action spaces | Function approximation error; overestimation bias |
In discrete settings, a single forward pass through the critic can produce Q-values for all possible actions. The policy can then be derived by selecting the action with the highest Q-value (greedy) or by sampling from a softmax distribution over the Q-values.
In continuous settings, enumerating Q-values is infeasible because the action space is uncountably infinite. The critic must instead accept a specific action as input alongside the state and output a single scalar value. A separate actor network is needed to propose actions, and the critic evaluates them. This is why continuous-control actor-critic algorithms such as DDPG, TD3, and SAC use explicit actor and critic networks.
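The sketch below contrasts the two critic interfaces (PyTorch assumed, arbitrary dimensions): a discrete critic returns Q-values for every action at once, while a continuous critic scores one state-action pair per forward pass:

```python
import torch
import torch.nn as nn

obs_dim, n_actions, act_dim = 8, 4, 2

# Discrete: one forward pass yields Q(s, a) for every action at once.
q_discrete = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

# Continuous: the critic scores one specific (s, a) pair per forward pass.
q_continuous = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

s = torch.randn(1, obs_dim)
a = torch.randn(1, act_dim)

greedy_action = q_discrete(s).argmax(dim=-1)                   # argmax over a finite set
value_of_pair = q_continuous(torch.cat([s, a], dim=-1))        # needs an actor to propose a
print(greedy_action.item(), value_of_pair.item())
```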
In multi-agent settings, the design of the critic becomes more nuanced. The standard approach is centralized training with decentralized execution (CTDE), introduced in algorithms like MADDPG (Lowe et al., 2017) and MAPPO.
During training, each agent's critic has access to the observations and actions of all agents. This centralized critic can learn a more accurate value function because it accounts for the behavior of other agents, which would otherwise appear as non-stationarity from any single agent's perspective.
During execution, each agent uses only its own actor (which depends only on local observations), so no communication between agents is needed at inference time. The critic is only used during training and can be discarded afterward.
Other approaches to multi-agent critics include fully decentralized critics, where each agent learns its own value function from local observations only, and value-decomposition methods such as VDN and QMIX, which factor a joint action-value function into per-agent components.
Critics play a central role in reinforcement learning from human feedback (RLHF), the technique used to align large language models with human preferences. The standard RLHF pipeline proceeds in three stages:

1. A pretrained language model is fine-tuned with supervised learning on human-written demonstrations.
2. A reward model is trained on human comparisons of model outputs, learning to score responses the way human raters would.
3. The language model (the actor) is further fine-tuned with an RL algorithm, typically PPO, to maximize the scores assigned by the reward model.
The reward model (critic) in RLHF is typically initialized from the same pretrained language model and fine-tuned on comparison data. Its quality directly determines how well the fine-tuned model aligns with human intentions. Inaccurate reward models can lead to reward hacking, where the actor learns to produce outputs that score highly according to the critic but do not actually satisfy human preferences.
More recent approaches like Direct Preference Optimization (DPO) and Reinforcement Learning from AI Feedback (RLAIF) modify or replace the explicit critic in this pipeline, but the original RLHF formulation with PPO and a reward model critic remains widely studied.
Training critics in deep RL is subject to several well-known challenges.
Q-function critics tend to overestimate action values because the max operator in the Bellman update selectively propagates estimation errors upward. This problem was first identified in tabular Q-learning and becomes worse with function approximation. Double Q-learning, Double DQN, and the twin critics in TD3 and SAC all address this issue by decorrelating the action selection from the value estimation.
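A sketch of the two targets (PyTorch assumed, toy networks) makes the decorrelation concrete: Double DQN lets the online network choose the action while the target network evaluates it:

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 8, 4, 0.99
q_online = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
q_target = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

def dqn_target(r, s_next, done):
    """Standard DQN target: the same (target) network selects and evaluates the action."""
    with torch.no_grad():
        q_next = q_target(s_next).max(dim=-1, keepdim=True).values
        return r + gamma * (1.0 - done) * q_next

def double_dqn_target(r, s_next, done):
    """Double DQN target: online network selects the action, target network evaluates it."""
    with torch.no_grad():
        best_a = q_online(s_next).argmax(dim=-1, keepdim=True)   # selection
        q_next = q_target(s_next).gather(-1, best_a)             # evaluation
        return r + gamma * (1.0 - done) * q_next

r, done, s_next = torch.ones(16, 1), torch.zeros(16, 1), torch.randn(16, obs_dim)
print(dqn_target(r, s_next, done).mean().item(),
      double_dqn_target(r, s_next, done).mean().item())
```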
Richard Sutton identified the "deadly triad" as the combination of (1) function approximation, (2) bootstrapping (as in TD learning), and (3) off-policy learning. When all three are present, value estimates can diverge. This is directly relevant to critics because most deep RL critics use neural network function approximation with TD bootstrapping, and many algorithms (DDPG, TD3, SAC) are off-policy. Techniques like target networks, Polyak averaging, and gradient clipping help mitigate divergence but do not eliminate the risk entirely.
In actor-critic methods, the actor and critic depend on each other: the critic's value estimates are only accurate for the current policy, but the policy keeps changing based on the critic's feedback. An inaccurate critic step can mislead the actor, which in turn changes the data distribution, further degrading the critic's estimates. This feedback loop can lead to oscillation or divergence. Two-timescale learning (updating the critic faster than the actor) is the standard mitigation strategy, providing the critic with enough updates to track the changing policy.
When the agent cannot observe the full state of the environment (a partially observable Markov decision process, or POMDP), the critic must estimate values from incomplete information. This makes the value function harder to learn and can introduce systematic errors. Solutions include using recurrent neural networks (such as LSTM) in the critic to maintain a memory of past observations, or providing the critic with additional information during training that is unavailable during execution (the asymmetric actor-critic approach).
While not traditionally described using RL terminology, generative adversarial networks (GANs) contain a component that functions as a critic. In the original GAN formulation by Ian Goodfellow and colleagues (2014), the discriminator evaluates generated samples and provides a learning signal to the generator.
In the Wasserstein GAN (WGAN) variant introduced by Martin Arjovsky and colleagues in 2017, the discriminator is explicitly renamed the "critic." Instead of outputting a probability that a sample is real, the WGAN critic outputs an unbounded scalar score. The critic is trained to approximate the Wasserstein distance (earth mover's distance) between the real and generated distributions, providing a more stable training signal than the original GAN formulation.
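A minimal sketch of the WGAN critic objective (PyTorch assumed, toy dimensions) is shown below; note that a real implementation must also enforce a Lipschitz constraint on the critic, via weight clipping in the original WGAN or a gradient penalty in WGAN-GP:

```python
import torch
import torch.nn as nn

# WGAN critic: an unbounded scalar score, not a probability.
critic = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))

def wgan_critic_loss(real, fake):
    """Critic maximizes score(real) - score(fake), an estimate of the Wasserstein
    distance between real and generated samples; negated so it can be minimized."""
    return -(critic(real).mean() - critic(fake).mean())

def wgan_generator_loss(fake):
    """Generator tries to raise the critic's score on its samples."""
    return -critic(fake).mean()

real, fake = torch.randn(32, 16), torch.randn(32, 16)
print(wgan_critic_loss(real, fake).item())
```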
Target networks are copies of the critic (and sometimes the actor) that are updated slowly, either through periodic hard copies or continuous Polyak averaging. They provide stable targets for the TD error computation, preventing the instability that arises when the same network is used for both prediction and target computation. Target networks were introduced in DQN and are used in DDPG, TD3, SAC, and many other algorithms.
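A Polyak (soft) target update can be sketched in a few lines (PyTorch assumed; τ = 0.005 is a common but arbitrary choice here):

```python
import torch
import torch.nn as nn

critic = nn.Linear(8, 1)                              # stand-in for the online critic network
critic_target = nn.Linear(8, 1)
critic_target.load_state_dict(critic.state_dict())   # target starts as an exact copy

def polyak_update(net, target_net, tau=0.005):
    """Soft target update: target <- tau * online + (1 - tau) * target."""
    with torch.no_grad():
        for p, p_targ in zip(net.parameters(), target_net.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)

polyak_update(critic, critic_target)                  # called after every critic gradient step
```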
Off-policy critic training benefits from experience replay, where past transitions (s, a, r, s') are stored in a buffer and sampled randomly for training. This breaks the temporal correlation between consecutive samples and allows each transition to be used multiple times, improving sample efficiency. Prioritized experience replay further improves learning by sampling transitions with larger TD errors more frequently.
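A uniform replay buffer is simple to sketch (plain Python; the capacity and the toy transitions are arbitrary):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (s, a, r, s_next, done) transitions with uniform sampling."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)        # oldest transitions are evicted automatically

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)   # random draws break temporal correlation
        return list(zip(*batch))                          # one column per transition field

buf = ReplayBuffer()
for t in range(1000):
    buf.add(s=t, a=0, r=1.0, s_next=t + 1, done=False)
states, actions, rewards, next_states, dones = buf.sample(32)
```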
The critic's learning rate relative to the actor's is an important hyperparameter. If the critic learns too slowly, it provides stale or inaccurate feedback. If it learns too fast, its estimates may oscillate. A common practice is to set the critic learning rate equal to or slightly higher than the actor's learning rate, and to perform multiple critic updates per actor update (as in TD3 and SAC).
For low-dimensional state spaces (e.g., joint angles in robotics), critics typically use fully connected networks with 2 to 3 hidden layers of 256 or 512 units each. For image-based observations, convolutional feature extractors are used. When the critic takes both state and action as input (as in DDPG, TD3, SAC), many implementations simply concatenate the state and action at the input layer; the original DDPG architecture instead introduced the action after the first hidden layer, merging it with the state features partway through the network.
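A sketch of the latter, DDPG-style critic, with the action merged in after the first hidden layer (PyTorch assumed, hypothetical dimensions):

```python
import torch
import torch.nn as nn

class QCritic(nn.Module):
    """Q(s, a) critic in the style of the original DDPG paper: the state passes
    through one hidden layer before the action is concatenated in."""

    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.fc_state = nn.Linear(obs_dim, hidden)
        self.fc_joint = nn.Linear(hidden + act_dim, hidden)
        self.q_out = nn.Linear(hidden, 1)

    def forward(self, s, a):
        h = torch.relu(self.fc_state(s))
        h = torch.relu(self.fc_joint(torch.cat([h, a], dim=-1)))   # action enters here
        return self.q_out(h)

critic = QCritic(obs_dim=17, act_dim=6)      # e.g. a locomotion task's dimensions
print(critic(torch.randn(32, 17), torch.randn(32, 6)).shape)   # -> torch.Size([32, 1])
```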
The critic is one of the two fundamental components of modern actor-critic reinforcement learning. By learning to predict expected returns, the critic provides a low-variance training signal that guides the actor toward better policies. From the early adaptive critic elements of the 1980s to the twin Q-networks and entropy-regularized critics of contemporary algorithms, the design of the critic has been a consistent focus of RL research. The same concept extends beyond RL proper, appearing in GANs (where the discriminator acts as a critic) and in RLHF pipelines (where the reward model serves as a critic for language model alignment).