Soft Actor-Critic

Soft Actor-Critic (SAC) is an off-policy actor-critic deep reinforcement learning algorithm grounded in the maximum entropy reinforcement learning framework. It was introduced by Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine at the University of California, Berkeley, and presented at the ICML 2018 conference. SAC trains a stochastic policy that simultaneously maximizes expected return and policy entropy, which encourages exploration and prevents premature collapse to a deterministic strategy. The algorithm became one of the standard baselines for continuous control because it combines the sample efficiency of off-policy methods with the training stability that older off-policy actor-critic approaches such as DDPG lacked.

Infobox

Field	Detail
Introduced	January 2018 (arXiv preprint), ICML 2018 (peer-reviewed)
Authors (v1)	Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine
Authors (v2)	Haarnoja, Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, Henry Zhu, Abhishek Gupta, Abbeel, Levine
Original paper	"Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor" (arXiv:1801.01290)
Follow-up paper	"Soft Actor-Critic Algorithms and Applications" (arXiv:1812.05905)
Algorithm type	Model-free, off-policy, actor-critic, maximum entropy RL
Action space	Continuous (original); discrete variant by Christodoulou (2019)
Reference code	github.com/haarnoja/sac, github.com/rail-berkeley/softlearning
Common implementations	Stable-Baselines3, RLlib, Tianshou, OpenAI Spinning Up, CleanRL

Background: maximum entropy reinforcement learning

In standard reinforcement learning, an agent searches for a policy that maximizes the expected sum of discounted rewards. The maximum entropy framework adds an extra term to the objective: the entropy of the policy at each state. The agent is rewarded both for completing the task and for keeping its action distribution as broad as possible while doing so.

Formally, instead of maximizing the standard return, the agent maximizes

J(π) = Σ_t E_(s_t, a_t) ~ ρ_π [ r(s_t, a_t) + α · H(π(·|s_t)) ]

where H(π(·|s_t)) is the Shannon entropy of the action distribution at state s_t, and α is a temperature coefficient that trades off reward against entropy. As α goes to zero the objective collapses back to ordinary RL.
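
As a rough illustration of the objective above, the following sketch evaluates the entropy-augmented return of a single trajectory for a discrete action distribution. The function names `entropy` and `soft_return` and the toy rewards are assumptions made for this example, and discounting is omitted to match the formula.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def soft_return(rewards, policies, alpha):
    """Sum over steps of r_t + alpha * H(pi(.|s_t)) along one trajectory."""
    return sum(r + alpha * entropy(pi) for r, pi in zip(rewards, policies))

rewards = [1.0, 0.0, 2.0]      # toy rewards, one per step
uniform = [[0.5, 0.5]] * 3     # maximally random two-action policy
greedy = [[1.0, 0.0]] * 3      # deterministic policy, zero entropy

print(soft_return(rewards, uniform, alpha=0.2))  # reward plus an entropy bonus
print(soft_return(rewards, greedy, alpha=0.2))   # reward only
print(soft_return(rewards, uniform, alpha=0.0))  # alpha = 0 recovers the plain return
```

With the same rewards, the uniform policy scores higher than the deterministic one whenever α is positive, and the two coincide at α = 0.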

This framing has practical consequences. A high-entropy policy keeps trying alternative actions even when one looks slightly better, which makes it less likely to commit to a brittle local optimum. It also tends to learn multiple ways of solving a task, useful when a primary strategy is blocked. The maximum entropy idea predates SAC, going back to work on relative-entropy and energy-based RL such as Ziebart's maximum entropy inverse RL and Haarnoja's earlier Soft Q-Learning paper, but SAC was the version that made it practical for high-dimensional continuous control.

The algorithm

Soft value functions

SAC defines two value functions adapted to the entropy-augmented objective. The soft state-value function V(s_t) includes the entropy bonus from every step including the current one. The soft action-value function Q(s_t, a_t) includes entropy bonuses from every step after the current one. The relationship between them is

V(s_t) = E_(a_t ~ π)[ Q(s_t, a_t) - α log π(a_t | s_t) ]

The soft Bellman equation for Q is the usual recursion, with the soft value V (which carries the entropy term) as the bootstrap target:

Q(s_t, a_t) = r(s_t, a_t) + γ · E_(s_(t+1))[ V(s_(t+1)) ]
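
The two relations above can be checked numerically on a discrete toy case. This is a minimal sketch under the assumption of a finite action set and a known transition distribution; `soft_value` and `soft_bellman_backup` are illustrative names, not part of any library.

```python
import math

def soft_value(q_values, probs, alpha):
    """V(s) = sum_a pi(a|s) * (Q(s, a) - alpha * log pi(a|s))."""
    return sum(p * (q - alpha * math.log(p))
               for q, p in zip(q_values, probs) if p > 0.0)

def soft_bellman_backup(reward, gamma, next_state_probs, next_values):
    """Q(s, a) = r(s, a) + gamma * E_{s'}[V(s')] for known transition probabilities."""
    expected_v = sum(p * v for p, v in zip(next_state_probs, next_values))
    return reward + gamma * expected_v

# Two actions with Q = [1, 0] under a uniform policy and alpha = 0.2.
v = soft_value([1.0, 0.0], [0.5, 0.5], alpha=0.2)
print(v)  # mean Q plus alpha times the policy entropy

# One backup where the next state is reached with probability 1.
print(soft_bellman_backup(reward=1.0, gamma=0.99, next_state_probs=[1.0], next_values=[v]))
```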

Soft policy iteration

The theoretical backbone of SAC is soft policy iteration, which alternates between policy evaluation (fitting Q to the current policy) and policy improvement (updating the policy to minimize the KL divergence between itself and the exponentiated soft Q-function). Haarnoja et al. proved that this scheme converges to the optimal maximum entropy policy in tabular settings, and SAC is the practical deep-network instantiation of that procedure.
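
In the tabular case the improvement step has a closed form: the policy that minimizes the KL divergence to the normalized exp(Q/α) is the softmax of Q/α. The sketch below runs this on a hypothetical two-state, two-action deterministic MDP. It is a simplified illustration, not the authors' procedure: each sweep performs a single soft Bellman backup rather than a full policy evaluation, and the MDP, `gamma`, `alpha`, and `sweeps` values are assumptions.

```python
import math

def soft_policy_iteration(rewards, next_state, gamma=0.9, alpha=0.5, sweeps=500):
    """Tabular soft policy iteration on a deterministic MDP.

    rewards[s][a] is the reward and next_state[s][a] the successor state.
    """
    n_states, n_actions = len(rewards), len(rewards[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    pi = [[1.0 / n_actions] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        # Evaluation (one backup): V = E_a[Q - alpha * log pi], Q = r + gamma * V(s').
        v = [sum(p * (qa - alpha * math.log(p))
                 for p, qa in zip(pi[s], q[s]) if p > 0.0)
             for s in range(n_states)]
        q = [[rewards[s][a] + gamma * v[next_state[s][a]] for a in range(n_actions)]
             for s in range(n_states)]
        # Improvement: pi(a|s) proportional to exp(Q(s, a) / alpha).
        for s in range(n_states):
            m = max(q[s])  # subtract the max for numerical stability
            w = [math.exp((qa - m) / alpha) for qa in q[s]]
            z = sum(w)
            pi[s] = [wi / z for wi in w]
    return q, pi

rewards = [[0.0, 1.0], [0.0, 2.0]]   # hypothetical toy MDP
next_state = [[0, 1], [0, 1]]        # action 0 leads to state 0, action 1 to state 1
q, pi = soft_policy_iteration(rewards, next_state)
for s, row in enumerate(pi):
    print(s, [round(p, 3) for p in row])
```

The resulting policy prefers the higher-reward action in each state but keeps nonzero probability on the other one, which is the behavior the entropy term is meant to produce.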

Twin Q-networks

Following the clipped double-Q trick popularized by TD3, SAC trains two Q-networks Q_φ1 and Q_φ2. The ICML version of the first paper already used two Q-functions alongside a separate value network; the second paper drops the explicit value network, so the minimum of the two Q estimates is used both when constructing targets and when computing the policy gradient. Taking the minimum reduces the systematic overestimation bias that plagued earlier algorithms based on Q-learning.
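
To see why the minimum helps, consider a toy setting where the true Q-value of every action is zero and each critic observes it through independent Gaussian noise. The sketch below is an illustration under those assumptions only (it uses a greedy maximum as a stand-in for a policy that chases high Q-values); `clipped_double_q` is a name chosen for this example.

```python
import random

def clipped_double_q(q1, q2):
    """Element-wise minimum of two Q estimates for the same (state, action) pairs."""
    return [min(a, b) for a, b in zip(q1, q2)]

rng = random.Random(0)
n_actions, n_trials = 10, 2000
single_bias = clipped_bias = 0.0
for _ in range(n_trials):
    # True Q is 0 for every action; each critic adds independent unit noise.
    q1 = [rng.gauss(0.0, 1.0) for _ in range(n_actions)]
    q2 = [rng.gauss(0.0, 1.0) for _ in range(n_actions)]
    single_bias += max(q1) / n_trials
    clipped_bias += max(clipped_double_q(q1, q2)) / n_trials

print(f"single critic:  {single_bias:+.3f}")
print(f"clipped double: {clipped_bias:+.3f}")
```

Picking the best-looking action under a single noisy critic overstates its value on average; applying the minimum first can only lower that estimate.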

Target networks

For each Q-network there is a target network with parameters that are an exponential moving average (Polyak average) of the online parameters,

φ_targ ← τ · φ + (1 - τ) · φ_targ

with a small τ, often around 0.005. Target networks mitigate the moving-target problem in temporal-difference learning, where the regression target shifts with every update of the network being trained. The idea was borrowed from DQN-style algorithms.
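
A minimal sketch of the Polyak update on plain lists of floats, assuming the parameters are flattened into one list per network. `polyak_update` is an illustrative helper, not a library function.

```python
def polyak_update(target, online, tau=0.005):
    """Return tau * online + (1 - tau) * target, element by element."""
    return [tau * w + (1.0 - tau) * w_targ for w, w_targ in zip(online, target)]

target = [0.0, 0.0]
online = [1.0, -2.0]   # held fixed here to show how slowly the target follows
for _ in range(1000):
    target = polyak_update(target, online)
print(target)
```

With the online parameters held fixed, the remaining gap shrinks by a factor of (1 - τ) per update, so after n updates the target has covered a fraction 1 - (1 - τ)^n of the distance.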

Policy update via reparameterization

The actor in SAC is a squashed Gaussian. The network outputs a mean μ(s) and a log standard deviation log σ(s). An action is produced by sampling Gaussian noise ξ ~ N(0, I) and computing

a = tanh( μ(s) + σ(s) ⊙ ξ )

The tanh squashes each action component into (-1, 1), which is then rescaled to the environment's action range. Because the random component ξ does not depend on the policy parameters, gradients can be backpropagated through the action sample directly. This is the same reparameterization trick used in variational autoencoders. The policy loss minimizes

J_π(θ) = E_(s ~ D, ξ ~ N) [ α · log π_θ(a_θ(s, ξ) | s) - min_(i=1,2) Q_(φ_i)(s, a_θ(s, ξ)) ]

Minimizing this loss pushes the policy toward high-Q actions while keeping its entropy from collapsing.
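
The sampling step and the log-probability it requires can be sketched without an autodiff library. The code below is an illustration under stated assumptions: a diagonal Gaussian, actions in (-1, 1) with no further rescaling, and the standard change-of-variables correction for tanh, log π(a|s) = log N(u; μ, σ²) - Σ log(1 - tanh(u)²), evaluated through a numerically stable identity. All function names are hypothetical.

```python
import math
import random

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sample_squashed_gaussian(mean, log_std, rng):
    """Return (action, log_prob) for a = tanh(mean + std * xi), xi ~ N(0, I)."""
    action, log_prob = [], 0.0
    for mu, ls in zip(mean, log_std):
        xi = rng.gauss(0.0, 1.0)        # noise independent of the policy parameters
        u = mu + math.exp(ls) * xi      # pre-squash Gaussian sample
        # Log-density of u under N(mu, std^2), written in terms of xi.
        log_prob += -0.5 * xi * xi - ls - 0.5 * math.log(2.0 * math.pi)
        # Change of variables for tanh: subtract log(1 - tanh(u)^2),
        # computed as 2 * (log 2 - u - softplus(-2u)) for stability.
        log_prob -= 2.0 * (math.log(2.0) - u - softplus(-2.0 * u))
        action.append(math.tanh(u))
    return action, log_prob

def policy_loss_term(alpha, log_prob, q1, q2):
    """Per-sample integrand of J_pi: alpha * log pi(a|s) - min(Q1, Q2)."""
    return alpha * log_prob - min(q1, q2)

rng = random.Random(0)
action, log_prob = sample_squashed_gaussian([0.1, -0.3], [-0.5, 0.2], rng)
print(action, log_prob)
print(policy_loss_term(0.2, log_prob, q1=1.5, q2=1.2))  # Q-values are made-up numbers
```

In a real implementation the same arithmetic runs inside an autodiff framework so that the gradient flows from the loss through `u` back to the mean and log standard deviation.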

Temperature alpha and automatic tuning

The temperature α is the most important hyperparameter in SAC. Pick it too small and the policy turns near-deterministic and stops exploring. Pick it too large and the policy stays too random to ever exploit what it learns. The first paper did not learn α: it kept the temperature fixed and instead hand-tuned a reward scale per environment, which plays the role of an inverse temperature.

The second paper, "Soft Actor-Critic Algorithms and Applications," reformulated the problem as a constrained optimization where the policy maximizes return subject to a target entropy H_target (often set to -|A|, the negative of the action dimensionality). Solving the Lagrangian dual gives a simple loss for α:

J(α) = E_(a ~ π_t) [ -α · ( log π_t(a | s) + H_target ) ]

In practice you treat log α as a learnable parameter and update it with the same optimizer used for everything else. When the policy is too deterministic relative to the target entropy, α rises and pushes it back; when the policy is too random, α falls. The result is an algorithm with one fewer hyperparameter to tune, and one of the main reasons practitioners reach for SAC over earlier off-policy methods.
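
A minimal sketch of one temperature step, assuming α = exp(log α) and plain gradient descent on the loss J(α) given above. The gradient with respect to log α is worked out by hand; the learning rate and the sample log-probabilities are made-up values, and some implementations use a slightly different surrogate loss.

```python
import math

def temperature_step(log_alpha, log_probs, target_entropy, lr=3e-4):
    """One gradient-descent step on J(alpha) = mean(-alpha * (log pi + H_target))."""
    alpha = math.exp(log_alpha)
    mean_gap = sum(lp + target_entropy for lp in log_probs) / len(log_probs)
    grad = -alpha * mean_gap            # dJ / d(log alpha), by the chain rule
    return log_alpha - lr * grad

target_entropy = -1.0                   # -|A| for a one-dimensional action space

# Policy too deterministic: log-probabilities are high, so entropy is below target.
print(temperature_step(0.0, [2.0, 3.0], target_entropy))    # log alpha moves up
# Policy too random: log-probabilities are low, so entropy is above target.
print(temperature_step(0.0, [-3.0, -2.0], target_entropy))  # log alpha moves down
```

Parameterizing by log α keeps the temperature positive without any clipping.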

Pseudocode

The outline below follows OpenAI Spinning Up's presentation, which matches the version with twin Q-networks and automatic temperature tuning from the second paper:

Input: initial policy parameters θ, Q-function parameters φ1, φ2,
       initial temperature α, empty replay buffer D
φ_targ,1 ← φ1,  φ_targ,2 ← φ2

repeat until convergence:
    # Collect experience
    observe state s
    sample action a ~ π_θ(· | s)
    execute a in the environment
    observe next state s', reward r, done signal d
    store (s, a, r, s', d) in replay buffer D
    if s' is terminal:
        reset environment

    # Update networks
    if it is time to update:
        for j in 1..K updates:
            sample minibatch B = {(s, a, r, s', d)} from D

            # Compute target for both Q-networks
            sample ã' ~ π_θ(· | s')
            y = r + γ (1 - d) ( min_(i=1,2) Q_(φ_targ,i)(s', ã')
                                  - α · log π_θ(ã' | s') )

            # Update each Q-network by gradient descent on
            # MSE( Q_φi(s, a), y )

            # Update policy by gradient descent on
            # α · log π_θ(ã | s) - min_i Q_φi(s, ã)
            # where ã is reparameterized from π_θ

            # Update temperature α by gradient descent on
            # -α · ( log π_θ(ã | s) + H_target )

            # Soft-update target networks
            φ_targ,i ← τ · φ_i + (1 - τ) · φ_targ,i  for i in 1, 2
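
The target computation in the outline above reduces to a few arithmetic operations per transition. The sketch below spells it out for scalars; `soft_td_target` is a hypothetical helper and the numbers are made up, with the log-probability evaluated at the freshly sampled next action ã'.

```python
def soft_td_target(reward, done, gamma, alpha, q1_targ_next, q2_targ_next, log_prob_next):
    """y = r + gamma * (1 - d) * (min(Q1', Q2') - alpha * log pi(a'|s'))."""
    soft_value_next = min(q1_targ_next, q2_targ_next) - alpha * log_prob_next
    return reward + gamma * (1.0 - done) * soft_value_next

# Non-terminal transition: bootstrap from the smaller target Q plus the entropy bonus.
print(soft_td_target(1.0, 0.0, 0.99, 0.2, q1_targ_next=5.0, q2_targ_next=4.0, log_prob_next=-1.0))
# Terminal transition: the done flag removes the bootstrap term entirely.
print(soft_td_target(1.0, 1.0, 0.99, 0.2, q1_targ_next=5.0, q2_targ_next=4.0, log_prob_next=-1.0))
```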

A few practical notes are worth knowing when reading implementations. The number of gradient steps K per environment step is usually 1 in the original paper but is often raised in research code. The temperature update can be skipped if α is fixed. The squashed Gaussian needs a Jacobian (change-of-variables) correction in the log-probability computation, a detail that is easy to get subtly wrong in a first implementation.

Comparison with related algorithms

SAC sits in a small family of deep RL methods aimed at continuous control. The differences are easier to see side by side than in prose.

Algorithm	Policy class	On/off policy	Action space	Sample efficiency	Key idea
DDPG	Deterministic	Off-policy	Continuous	Moderate	Deterministic policy gradient with target networks; brittle in practice
TD3	Deterministic	Off-policy	Continuous	High	Adds twin Q-networks, target policy smoothing, delayed actor updates
PPO	Stochastic	On-policy	Continuous, discrete	Low	Clipped surrogate objective with trust-region intuition; very stable but data-hungry
SAC (v1)	Stochastic (squashed Gaussian)	Off-policy	Continuous	High	Maximum entropy objective, separate value and Q-networks
SAC (v2)	Stochastic (squashed Gaussian)	Off-policy	Continuous	High	Twin Q-networks, automatic temperature tuning, no separate V-network

The practical reading is roughly: PPO is the safest first try when you can afford lots of environment steps; TD3 and SAC are the off-policy alternatives when sample efficiency matters; SAC tends to be slightly more forgiving about hyperparameters thanks to entropy regularization and automatic α tuning. DDPG still appears in textbooks but is rarely the right pick for a new project.

Implementations

SAC has more reference implementations than almost any other continuous-control RL algorithm. The community has converged on a small set of high-quality libraries.

Implementation	Language / framework	Notes	Repository
Original (`haarnoja/sac`)	Python, TensorFlow	Code accompanying the ICML 2018 paper	github.com/haarnoja/sac
Softlearning	Python, TensorFlow	Berkeley RAIL's broader maximum entropy RL framework, includes the v2 algorithm	github.com/rail-berkeley/softlearning
Stable-Baselines3	Python, PyTorch	Production-quality SAC with auto entropy and gSDE exploration	stable-baselines3.readthedocs.io
RLlib	Python, PyTorch and TensorFlow	Distributed and multi-agent SAC with both continuous and discrete variants	docs.ray.io
Tianshou	Python, PyTorch	Modular research-friendly DRL library, ships continuous and discrete SAC	github.com/thu-ml/tianshou
OpenAI Spinning Up	Python, PyTorch and TensorFlow	Educational implementation with detailed documentation	spinningup.openai.com
CleanRL	Python, PyTorch	Single-file SAC for continuous and Atari, easy to fork	docs.cleanrl.dev

For anyone learning SAC, the OpenAI Spinning Up writeup paired with the CleanRL sac_continuous_action.py file is a useful combination: one explains the math, the other is short enough to read in a sitting.

Applications and benchmarks

The original two papers tested SAC on the standard MuJoCo tasks distributed with OpenAI Gym, including Hopper, Walker2d, HalfCheetah, Ant, and the more difficult Humanoid environment. In the authors' experiments SAC matched baselines such as DDPG, TD3, and PPO on the easier tasks and outperformed them on the harder ones, with the gap widest on high-dimensional control like Humanoid where exploration matters most. The variance across random seeds was also notably smaller, addressing one of the most criticized weaknesses of deep RL benchmarks at the time.

The second paper went further and ran SAC on real hardware. The most-cited demo is the Minitaur quadruped from Ghost Robotics, which learned to walk on flat ground in about two hours of real-world training and then generalized to ramps, steps, and obstacles without retraining. The same paper showed a 9-DOF Dynamixel hand learning to rotate a valve from raw RGB images in around 20 hours, and a 7-DOF Sawyer arm learning to stack a Lego block on top of another in about two hours. These results convinced a lot of people that off-policy deep RL had finally crossed into the territory where it could train physical robots in tractable wall-clock time.

Since then SAC has shown up in a wide range of application papers, from learned controllers in autonomous driving simulators, to dexterous manipulation in research labs, to financial market making, to data-center cooling. It is a default choice for any continuous control problem where you have access to a fast simulator and need decent sample efficiency without spending weeks tuning hyperparameters.

Variants and extensions

The SAC framework sits at the center of a small ecosystem of variants and related methods. A few of the more widely used ones:

Variant	Authors / year	What it changes
Soft Q-Learning	Haarnoja et al., 2017	The predecessor; stochastic policy via amortized Stein variational gradient descent
SAC v2 (with auto-α)	Haarnoja et al., 2018	Twin Q-networks, automatic temperature tuning, drops the V-network
Discrete SAC	Christodoulou, 2019	Adapts SAC to discrete action spaces by computing expectations directly over the categorical policy; benchmarked on Atari
SAC-X (Scheduled Auxiliary Control)	Riedmiller et al., 2018	Shares the acronym but is not derived from Soft Actor-Critic; learns auxiliary tasks alongside the main task, with a learned scheduler choosing which auxiliary policy to execute; targets sparse-reward problems
Distributional SAC (DSAC)	Duan et al., 2020	Replaces the scalar Q-function with a distributional one in the spirit of C51 / QR-DQN
SAC with prioritized experience replay	Various	Plug-in replacement of the uniform replay buffer with PER; commonly used in implementations
REDQ	Chen et al., 2021	An ensemble-of-Q approach building on SAC that pushes sample efficiency much further on MuJoCo

SAC has also been combined with model-based RL ideas, for example as the policy optimizer inside MBPO (Model-Based Policy Optimization), and with offline RL methods like CQL that add a conservative penalty to the Q-loss.

Reception and impact

SAC quickly displaced DDPG as the default off-policy baseline in continuous control benchmarks. It is now one of the most cited reinforcement learning algorithms of the late-2010s era, with the two original papers together accumulating tens of thousands of citations on Google Scholar. The ICML 2018 paper is among the most influential ICML papers from that year by citation count.

Several things contributed to that uptake. The combination of off-policy sample efficiency with the stability of stochastic policies and entropy regularization produced an algorithm that worked across many tasks without much tuning. The follow-up paper added automatic temperature adjustment, which removed one of the most awkward hyperparameters. The Berkeley team released clean reference code, and BAIR's blog post with real-robot videos gave people a vivid demonstration that the algorithm worked outside simulation. By 2020 SAC was a near-default choice in textbooks and courses on deep RL.

The algorithm is not without limits. Like most off-policy actor-critic methods, it can fail in environments with very sparse rewards where exploration cannot be solved by entropy alone. The auto-α mechanism, while a clear improvement, can still misbehave on environments with non-standard action ranges or unusual reward scales. And SAC remains a continuous-action method by design; the discrete variant by Christodoulou works but has not displaced PPO and DQN-style algorithms as the standard discrete-action choice. None of this has stopped SAC from becoming part of the basic toolbox for anyone doing continuous control.

References

Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor." *Proceedings of the 35th International Conference on Machine Learning*, PMLR 80:1861-1870. arXiv:1801.01290.
Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., & Levine, S. (2018). "Soft Actor-Critic Algorithms and Applications." arXiv:1812.05905.
Haarnoja, T., Tang, H., Abbeel, P., & Levine, S. (2017). "Reinforcement Learning with Deep Energy-Based Policies." *Proceedings of the 34th International Conference on Machine Learning*, PMLR 70:1352-1361. arXiv:1702.08165.
Achiam, J. (2018). "Soft Actor-Critic." *OpenAI Spinning Up Documentation*. https://spinningup.openai.com/en/latest/algorithms/sac.html
Berkeley AI Research blog (2018). "Soft Actor Critic, Deep Reinforcement Learning with Real-World Robots." https://bair.berkeley.edu/blog/2018/12/14/sac/
Fujimoto, S., van Hoof, H., & Meger, D. (2018). "Addressing Function Approximation Error in Actor-Critic Methods" (TD3). *Proceedings of the 35th International Conference on Machine Learning*. arXiv:1802.09477.
Christodoulou, P. (2019). "Soft Actor-Critic for Discrete Action Settings." arXiv:1910.07207.
Riedmiller, M., Hafner, R., Lampe, T., Neunert, M., Degrave, J., Van de Wiele, T., Mnih, V., Heess, N., & Springenberg, J. T. (2018). "Learning by Playing: Solving Sparse-Reward Tasks from Scratch" (SAC-X). *Proceedings of the 35th International Conference on Machine Learning*. arXiv:1802.10567.
Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., & Dormann, N. (2021). "Stable-Baselines3: Reliable Reinforcement Learning Implementations." *Journal of Machine Learning Research*, 22(268):1-8.
Huang, S., Dossa, R. F. J., Ye, C., Braga, J., Chakraborty, D., Mehta, K., & Araujo, J. G. M. (2022). "CleanRL: High-quality Single-file Implementations of Deep Reinforcement Learning Algorithms." *Journal of Machine Learning Research*, 23(274):1-18.
Liang, E., Liaw, R., Nishihara, R., Moritz, P., Fox, R., Goldberg, K., Gonzalez, J., Jordan, M., & Stoica, I. (2018). "RLlib: Abstractions for Distributed Reinforcement Learning." *Proceedings of the 35th International Conference on Machine Learning*.
Weng, J., et al. (2022). "Tianshou: A Highly Modularized Deep Reinforcement Learning Library." *Journal of Machine Learning Research*, 23(267):1-6. arXiv:2107.14171.
