Soft Actor-Critic
Last reviewed
May 2, 2026
Sources
12 citations
Review status
Source-backed
Revision
v1 · 2,862 words
Soft Actor-Critic (SAC) is an off-policy actor-critic deep reinforcement learning algorithm grounded in the maximum entropy reinforcement learning framework. It was introduced by Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine at the University of California, Berkeley, and presented at the ICML 2018 conference. SAC trains a stochastic policy that simultaneously maximizes expected return and policy entropy, which encourages exploration and prevents premature collapse to a deterministic strategy. The algorithm became one of the standard baselines for continuous control because it combines the sample efficiency of off-policy methods with the stability that has long been a weak point of older actor-critic approaches like DDPG.
| Field | Detail |
|---|---|
| Introduced | January 2018 (arXiv preprint), ICML 2018 (peer-reviewed) |
| Authors (v1) | Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine |
| Authors (v2) | Haarnoja, Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Abbeel, Levine |
| Original paper | "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor" (arXiv:1801.01290) |
| Follow-up paper | "Soft Actor-Critic Algorithms and Applications" (arXiv:1812.05905) |
| Algorithm type | Model-free, off-policy, actor-critic, maximum entropy RL |
| Action space | Continuous (original); discrete variant by Christodoulou (2019) |
| Reference code | github.com/haarnoja/sac, github.com/rail-berkeley/softlearning |
| Common implementations | Stable-Baselines3, RLlib, Tianshou, OpenAI Spinning Up, CleanRL |
In standard reinforcement learning, an agent searches for a policy that maximizes the expected sum of discounted rewards. The maximum entropy framework adds an extra term to the objective: the entropy of the policy at each state. The agent is rewarded both for completing the task and for keeping its action distribution as broad as possible while doing so.
Formally, instead of maximizing the standard return, the agent maximizes
J(π) = Σ_t E_(s_t, a_t) ~ ρ_π [ r(s_t, a_t) + α · H(π(·|s_t)) ]
where H(π(·|s_t)) is the Shannon entropy of the action distribution at state s_t, and α is a temperature coefficient that trades off reward against entropy. As α goes to zero the objective collapses back to ordinary RL.
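As a rough illustration of the objective above, the sketch below scores one short trajectory by adding α times the Shannon entropy of the action distribution to each step's reward. The function names and toy numbers are invented for illustration, and discrete two-action distributions are used only because their entropy is a finite sum.

```python
import math

def shannon_entropy(probs):
    """Entropy H(p) = -sum p log p of a discrete action distribution (in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def max_ent_objective(rewards, action_dists, alpha):
    """Sum over time of r_t + alpha * H(pi(.|s_t)) for a single trajectory."""
    return sum(r + alpha * shannon_entropy(p)
               for r, p in zip(rewards, action_dists))

rewards = [1.0, 0.0, 2.0]
uniform = [[0.5, 0.5]] * 3   # maximally random two-action policy at every step
greedy = [[1.0, 0.0]] * 3    # deterministic policy, zero entropy

j_uniform = max_ent_objective(rewards, uniform, alpha=0.2)  # reward plus entropy bonus
j_greedy = max_ent_objective(rewards, greedy, alpha=0.2)    # reward only
j_plain = max_ent_objective(rewards, uniform, alpha=0.0)    # alpha = 0: ordinary return
```

With the same rewards, the uniform policy scores higher than the deterministic one whenever α > 0, and setting α to zero removes the difference.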
This framing has practical consequences. A high-entropy policy keeps trying alternative actions even when one looks slightly better, which makes it less likely to commit to a brittle local optimum. It also tends to learn multiple ways of solving a task, useful when a primary strategy is blocked. The maximum entropy idea predates SAC, going back to work on relative-entropy and energy-based RL such as Ziebart's maximum entropy inverse RL and Haarnoja's earlier Soft Q-Learning paper, but SAC was the version that made it practical for high-dimensional continuous control.
SAC defines two value functions adapted to the entropy-augmented objective. The soft state-value function V(s_t) includes the entropy bonus from every step including the current one. The soft action-value function Q(s_t, a_t) includes entropy bonuses from every step after the current one. The relationship between them is
V(s_t) = E_(a_t ~ π)[ Q(s_t, a_t) - α log π(a_t | s_t) ]
The soft Bellman equation for Q is the usual recursion plus an entropy term:
Q(s_t, a_t) = r(s_t, a_t) + γ · E_(s_(t+1))[ V(s_(t+1)) ]
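The two definitions above can be checked numerically on a toy case. The sketch below assumes a discrete action set and an explicit next-state distribution so that both expectations become finite sums; the function names and numbers are illustrative only.

```python
import math

def soft_value(q_values, probs, alpha):
    """V(s) = E_{a~pi}[Q(s,a) - alpha * log pi(a|s)] for a discrete policy."""
    return sum(p * (q - alpha * math.log(p))
               for q, p in zip(q_values, probs) if p > 0.0)

def soft_q_backup(reward, gamma, next_state_probs, next_values):
    """Q(s,a) = r + gamma * E_{s'}[V(s')] with an explicit transition distribution."""
    expected_v = sum(p * v for p, v in zip(next_state_probs, next_values))
    return reward + gamma * expected_v

q = [1.0, 2.0]
pi = [0.5, 0.5]
v_soft = soft_value(q, pi, alpha=0.5)  # mean Q plus alpha times the policy entropy
v_hard = soft_value(q, pi, alpha=0.0)  # ordinary expected Q
q_sa = soft_q_backup(reward=1.0, gamma=0.9,
                     next_state_probs=[1.0], next_values=[v_soft])
```

The entropy term enters Q only through V of the next state, which is why Q carries the bonuses from later steps but not from the current one.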
The theoretical backbone of SAC is soft policy iteration, which alternates between policy evaluation (fitting Q to the current policy) and policy improvement (updating the policy to minimize the KL divergence between itself and the exponentiated soft Q-function). Haarnoja et al. proved that this scheme converges to the optimal maximum entropy policy in tabular settings, and SAC is the practical deep-network instantiation of that procedure.
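A minimal tabular sketch of soft policy iteration follows, on a made-up MDP with two states, two actions, and deterministic transitions. In the tabular case the KL-minimizing policy can be written in closed form as π(a|s) ∝ exp(Q(s,a)/α), which is what the improvement step uses. The MDP, the constants, and the function names are assumptions for illustration, not taken from the papers.

```python
import math

GAMMA, ALPHA = 0.9, 0.5
# Hypothetical MDP: NEXT[s][a] is the next state, REWARD[s][a] the reward.
NEXT = [[0, 1], [0, 1]]
REWARD = [[0.0, 1.0], [0.0, 2.0]]

def soft_v(q_row, pi_row):
    """Soft state value E_{a~pi}[Q - alpha * log pi] for one state."""
    return sum(p * (q - ALPHA * math.log(p)) for q, p in zip(q_row, pi_row))

def evaluate(pi, sweeps=500):
    """Policy evaluation: iterate the soft Bellman backup for a fixed policy."""
    q = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(sweeps):
        v = [soft_v(q[s], pi[s]) for s in range(2)]
        q = [[REWARD[s][a] + GAMMA * v[NEXT[s][a]] for a in range(2)]
             for s in range(2)]
    return q

def improve(q):
    """Policy improvement: pi(a|s) proportional to exp(Q(s,a) / alpha)."""
    pi = []
    for row in q:
        m = max(row)  # subtract the max before exponentiating, for stability
        w = [math.exp((x - m) / ALPHA) for x in row]
        pi.append([x / sum(w) for x in w])
    return pi

pi = [[0.5, 0.5], [0.5, 0.5]]  # start from the uniform policy
for _ in range(50):
    q = evaluate(pi)
    pi = improve(q)
```

After the loop the policy favors action 1 in both states but keeps nonzero probability on action 0, which is the behavior the entropy term is meant to produce.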
Following the trick popularized by TD3, the second SAC paper drops the explicit value network and trains two Q-networks Q_φ1 and Q_φ2 instead. When constructing targets and when computing the policy gradient, SAC uses the minimum of the two estimates. This clipped double-Q trick reduces the systematic overestimation bias that plagued earlier algorithms based on Q-learning.
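The target computation with the minimum of two estimates can be sketched for a single transition as follows. The inputs are plain numbers standing in for network outputs at the next state; the function name and the values are illustrative assumptions.

```python
def soft_td_target(reward, done, q1_next, q2_next, logp_next, alpha, gamma=0.99):
    """Target y = r + gamma * (1 - d) * (min(Q1', Q2') - alpha * log pi(a'|s'))."""
    min_q = min(q1_next, q2_next)            # take the more pessimistic estimate
    soft_v_next = min_q - alpha * logp_next  # one-sample estimate of the soft V(s')
    return reward + gamma * (1.0 - done) * soft_v_next

y = soft_td_target(reward=1.0, done=0.0, q1_next=10.0, q2_next=8.0,
                   logp_next=-1.5, alpha=0.2)
y_terminal = soft_td_target(reward=1.0, done=1.0, q1_next=10.0, q2_next=8.0,
                            logp_next=-1.5, alpha=0.2)
```

Here the optimistic estimate of 10.0 is ignored in favor of 8.0, and the terminal transition bootstraps nothing beyond the immediate reward.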
For each Q-network there is a target network with parameters that are an exponential moving average (Polyak average) of the online parameters,
φ_targ ← τ · φ + (1 - τ) · φ_targ
with a small τ, often around 0.005. Target networks mitigate the moving-target problem in temporal-difference learning, where the regression target shifts every time the network being trained is updated. The idea was borrowed from DQN-style algorithms.
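A minimal sketch of the Polyak update, with parameters represented as flat lists of floats rather than network tensors (an assumption made to keep the example dependency-free):

```python
def polyak_update(online, target, tau=0.005):
    """Return new target parameters: tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]

online = [1.0, -2.0]
target = [0.0, 0.0]
for _ in range(1000):
    target = polyak_update(online, target)  # target drifts slowly toward online
```

With τ = 0.005 the target closes only 0.5% of the gap per update, so it takes on the order of a thousand updates to catch up with a fixed set of online parameters.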
The actor in SAC is a squashed Gaussian. The network outputs a mean μ(s) and a log standard deviation log σ(s). An action is produced by sampling Gaussian noise ξ ~ N(0, I) and computing
a = tanh( μ(s) + σ(s) ⊙ ξ )
The tanh keeps actions bounded inside the action range. Because the random component ξ does not depend on the policy parameters, you can backpropagate through the action sample directly. This is the same reparameterization trick used in variational autoencoders. The policy loss minimizes
J_π(θ) = E_(s ~ D, ξ ~ N) [ α · log π_θ(a_θ(s, ξ) | s) - min_(i=1,2) Q_(φ_i)(s, a_θ(s, ξ)) ]
Minimizing this loss pushes the policy toward high-Q actions while keeping its entropy from collapsing.
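The sampling step and the log-probability it requires can be sketched without a deep learning framework. The code below draws a squashed Gaussian action from externally supplied noise ξ and applies the tanh change-of-variables correction to the log-density; it does not compute gradients, which in practice come from an autodiff library differentiating through μ and log σ. The function names and the numerically stable form of the correction are implementation choices assumed here, not prescribed by the papers.

```python
import math
import random

def squashed_gaussian(mu, log_std, xi):
    """Return (action, log_prob) for a = tanh(mu + sigma * xi), summed over dimensions."""
    action, log_prob = [], 0.0
    for m, ls, e in zip(mu, log_std, xi):
        u = m + math.exp(ls) * e  # pre-squash Gaussian sample
        # log N(u; m, sigma^2), using (u - m) / sigma == e
        log_prob += -0.5 * e * e - ls - 0.5 * math.log(2.0 * math.pi)
        # Change of variables for a = tanh(u): subtract log(1 - tanh(u)^2),
        # written in the stable form 2 * (log 2 - u - softplus(-2u)).
        softplus = max(-2.0 * u, 0.0) + math.log1p(math.exp(-abs(2.0 * u)))
        log_prob -= 2.0 * (math.log(2.0) - u - softplus)
        action.append(math.tanh(u))
    return action, log_prob

def policy_loss_sample(alpha, log_prob, q1, q2):
    """One-sample estimate of alpha * log pi(a|s) - min(Q1, Q2)."""
    return alpha * log_prob - min(q1, q2)

rng = random.Random(0)
xi = [rng.gauss(0.0, 1.0) for _ in range(2)]  # noise independent of policy parameters
a, logp = squashed_gaussian(mu=[0.3, -1.0], log_std=[-0.5, 0.0], xi=xi)
```

Omitting the correction term leaves the log-probability of the unsquashed Gaussian, which biases both the policy loss and the entropy estimate.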
The temperature α is the most important hyperparameter in SAC. Pick it too small and the policy turns near-deterministic and stops exploring. Pick it too large and the policy stays too random to ever exploit what it learns. The first paper kept α fixed and tuned it by hand for each environment, expressing it as a reward scale: multiplying rewards by a constant has the same effect on the objective as dividing α by that constant.
The second paper, "Soft Actor-Critic Algorithms and Applications," reformulated the problem as a constrained optimization where the policy maximizes return subject to a target entropy H_target (often set to -|A|, the negative of the action dimensionality). Solving the Lagrangian dual gives a simple loss for α:
J(α) = E_(a ~ π_t) [ -α · ( log π_t(a | s) + H_target ) ]
In practice you treat log α as a learnable parameter and update it with the same optimizer used for everything else. When the policy is too deterministic relative to the target entropy, α rises and pushes it back; when the policy is too random, α falls. The result is an algorithm with one fewer hyperparameter to tune, which is one of the main reasons practitioners reach for SAC over earlier off-policy methods.
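The direction of the temperature update can be seen in a small hand-derived sketch. It parameterizes α = exp(log α), estimates J(α) from a batch of log-probabilities, and takes one plain gradient-descent step. The learning rate, the batch values, and the function name are illustrative assumptions; real implementations use an autodiff optimizer such as Adam.

```python
import math

def temperature_step(log_alpha, log_probs, target_entropy, lr=0.01):
    """One gradient-descent step on J(alpha), with alpha = exp(log_alpha)."""
    alpha = math.exp(log_alpha)
    mean_term = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    # J = -alpha * mean_term, so dJ/d(log_alpha) = -alpha * mean_term.
    grad = -alpha * mean_term
    return log_alpha - lr * grad

target_entropy = -2.0                # -|A| for a 2-dimensional action space
too_deterministic = [3.0, 2.5, 3.5]  # entropy estimate about -3, below the target
too_random = [0.5, 1.0, 0.0]         # entropy estimate about -0.5, above the target

up = temperature_step(0.0, too_deterministic, target_entropy)  # log alpha increases
down = temperature_step(0.0, too_random, target_entropy)       # log alpha decreases
```

Optimizing log α rather than α keeps the temperature positive without any explicit constraint.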
The outline below follows OpenAI Spinning Up's presentation, which matches the version with twin Q-networks and automatic temperature tuning from the second paper:
Input: initial policy parameters θ, Q-function parameters φ1, φ2,
       initial temperature α, empty replay buffer D
φ_targ,1 ← φ1, φ_targ,2 ← φ2
repeat until convergence:
    # Collect experience
    observe state s
    sample action a ~ π_θ(· | s)
    execute a in the environment
    observe next state s', reward r, done signal d
    store (s, a, r, s', d) in replay buffer D
    if s' is terminal:
        reset environment
    # Update networks
    if it is time to update:
        for j in 1..K updates:
            sample minibatch B = {(s, a, r, s', d)} from D
            # Compute target for both Q-networks
            sample ã' ~ π_θ(· | s')
            y = r + γ (1 - d) ( min_(i=1,2) Q_(φ_targ,i)(s', ã')
                                - α · log π_θ(ã' | s') )
            # Update each Q-network by gradient descent on
            #   MSE( Q_φi(s, a), y )
            # Update policy by gradient descent on
            #   α · log π_θ(ã | s) - min_i Q_φi(s, ã)
            #   where ã is reparameterized from π_θ
            # Update temperature α by gradient descent on
            #   -α · ( log π_θ(ã | s) + H_target )
            # Soft-update target networks
            φ_targ,i ← τ · φ_i + (1 - τ) · φ_targ,i   for i in 1, 2
A few practical notes are worth knowing if you read implementations. The original paper usually sets K, the number of gradient steps per environment step, to 1, but research code often raises it. The temperature update can be skipped if α is fixed. The squashed Gaussian needs a Jacobian correction in the log-probability computation: because a = tanh(u), log π(a | s) is the Gaussian log-density of u minus Σ_i log(1 - tanh²(u_i)). This term is easy to get subtly wrong in a first implementation.
SAC sits in a small family of deep RL methods aimed at continuous control. The differences are easier to see side by side than in prose.
| Algorithm | Policy class | On/off policy | Action space | Sample efficiency | Key idea |
|---|---|---|---|---|---|
| DDPG | Deterministic | Off-policy | Continuous | Moderate | Deterministic policy gradient with target networks; brittle in practice |
| TD3 | Deterministic | Off-policy | Continuous | High | Adds twin Q-networks, target policy smoothing, delayed actor updates |
| PPO | Stochastic | On-policy | Continuous, discrete | Low | Clipped surrogate objective with trust-region intuition; very stable but data-hungry |
| SAC (v1) | Stochastic (squashed Gaussian) | Off-policy | Continuous | High | Maximum entropy objective, separate value and Q-networks |
| SAC (v2) | Stochastic (squashed Gaussian) | Off-policy | Continuous | High | Twin Q-networks, automatic temperature tuning, no separate V-network |
The practical reading is roughly: PPO is the safest first try when you can afford lots of environment steps; TD3 and SAC are the off-policy alternatives when sample efficiency matters; SAC tends to be slightly more forgiving about hyperparameters thanks to entropy regularization and automatic α tuning. DDPG still appears in textbooks but is rarely the right pick for a new project.
SAC has more reference implementations than almost any other continuous-control RL algorithm. The community has converged on a small set of high-quality libraries.
| Implementation | Language / framework | Notes | Repository |
|---|---|---|---|
| Original (haarnoja/sac) | Python, TensorFlow | Code accompanying the ICML 2018 paper | github.com/haarnoja/sac |
| Softlearning | Python, TensorFlow | Berkeley RAIL's broader maximum entropy RL framework, includes the v2 algorithm | github.com/rail-berkeley/softlearning |
| Stable-Baselines3 | Python, PyTorch | Production-quality SAC with auto entropy and gSDE exploration | stable-baselines3.readthedocs.io |
| RLlib | Python, PyTorch and TensorFlow | Distributed and multi-agent SAC with both continuous and discrete variants | docs.ray.io |
| Tianshou | Python, PyTorch | Modular research-friendly DRL library, ships continuous and discrete SAC | github.com/thu-ml/tianshou |
| OpenAI Spinning Up | Python, PyTorch and TensorFlow | Educational implementation with detailed documentation | spinningup.openai.com |
| CleanRL | Python, PyTorch | Single-file SAC implementations for continuous control and Atari, easy to fork | docs.cleanrl.dev |
For anyone learning SAC, the OpenAI Spinning Up writeup paired with the CleanRL sac_continuous_action.py file is a useful combination: one explains the math, the other is short enough to read in a sitting.
The original two papers tested SAC on the standard MuJoCo tasks distributed with OpenAI Gym, including Hopper, Walker2d, HalfCheetah, Ant, and the more difficult Humanoid environment. SAC matched or beat DDPG, TD3, and PPO on every task, with the gap widest on high-dimensional control like Humanoid where exploration matters most. The variance across random seeds was also notably smaller, addressing one of the most criticized weaknesses of deep RL benchmarks at the time.
The second paper went further and ran SAC on real hardware. The most-cited demo is the Minitaur quadruped from Ghost Robotics, which learned to walk on flat ground in about two hours of real-world training and then generalized to ramps, steps, and obstacles without retraining. The same paper showed a 9-DOF Dynamixel hand learning to rotate a valve from raw RGB images in around 20 hours, and a 7-DOF Sawyer arm learning to stack a Lego block on top of another in about two hours. These results convinced a lot of people that off-policy deep RL had finally crossed into the territory where it could train physical robots in tractable wall-clock time.
Since then SAC has shown up in a wide range of application papers, from learned controllers in autonomous driving simulators, to dexterous manipulation in research labs, to financial market making, to data-center cooling. It is a default choice for any continuous control problem where you have access to a fast simulator and need decent sample efficiency without spending weeks tuning hyperparameters.
The SAC framework spawned a small ecosystem of variants. A few of the more widely used ones:
| Variant | Authors / year | What it changes |
|---|---|---|
| Soft Q-Learning | Haarnoja et al., 2017 | The predecessor; stochastic policy via amortized Stein variational gradient descent |
| SAC v2 (with auto-α) | Haarnoja et al., 2018 | Twin Q-networks, automatic temperature tuning, drops the V-network |
| Discrete SAC | Christodoulou, 2019 | Adapts SAC to discrete action spaces by computing expectations directly over the categorical policy; benchmarked on Atari |
| SAC-X (Scheduled Auxiliary Control) | Riedmiller et al., 2018 | Shares the acronym but is a separate method rather than a Soft Actor-Critic derivative: learns auxiliary tasks alongside the main task, with a learned scheduler choosing which auxiliary policy to execute; targets sparse-reward problems |
| Distributional SAC (DSAC) | Duan et al., 2020 | Replaces the scalar Q-function with a distributional one in the spirit of C51 / QR-DQN |
| SAC with prioritized experience replay | Various | Plug-in replacement of the uniform replay buffer with PER; commonly used in implementations |
| REDQ | Chen et al., 2021 | An ensemble-of-Q approach building on SAC that pushes sample efficiency much further on MuJoCo |
SAC has also been combined with model-based RL ideas, for example as the policy optimizer inside MBPO (Model-Based Policy Optimization), and with offline RL methods like CQL that add a conservative penalty to the Q-loss.
SAC quickly displaced DDPG as the default off-policy baseline in continuous control benchmarks. It is now one of the most cited reinforcement learning algorithms of the late-2010s era, with the two original papers together accumulating tens of thousands of citations on Google Scholar. The ICML 2018 paper is among the most influential ICML papers from that year by citation count.
Several things contributed to that uptake. The combination of off-policy sample efficiency and the stability that comes with stochastic policies and entropy regularization produced an algorithm that simply worked across many tasks without much tuning. The follow-up paper added automatic temperature adjustment, which removed one of the most awkward hyperparameters. The Berkeley team released clean reference code and BAIR's blog post with real-robot videos gave people a vivid demonstration that the algorithm worked outside simulation. By 2020 SAC was a near-default choice in textbooks and courses on deep RL.
The algorithm is not without limits. Like most off-policy actor-critic methods, it can fail in environments with very sparse rewards where exploration cannot be solved by entropy alone. The auto-α mechanism, while a clear improvement, can still misbehave on environments with non-standard action ranges or unusual reward scales. And SAC remains a continuous-action method by design; the discrete variant by Christodoulou works but has not displaced PPO and DQN-style algorithms as the standard discrete-action choice. None of this has stopped SAC from becoming part of the basic toolbox for anyone doing continuous control.