Tanh (hyperbolic tangent)
The hyperbolic tangent, written tanh, is a smooth, S-shaped activation function that maps any real number into the open interval (-1, 1), passing through the origin so that tanh(0) = 0 [7][8]. Unlike the logistic sigmoid, which is centred at 0.5 and outputs only positive values, tanh is zero-centred: its outputs are symmetric about zero (tanh(-x) = -tanh(x)), which tends to produce more balanced gradients and faster convergence in neural networks [1][8]. For most of the 1990s and 2000s tanh was the default nonlinearity in hidden layers of feedforward and recurrent models. Although it has been displaced from generic deep feedforward networks by the Rectified Linear Unit (ReLU) and its variants, tanh remains the standard squashing nonlinearity inside LSTM and GRU cells, and it is still used in output layers when targets are bounded between -1 and 1 [4][5].
What is the definition of tanh?
The hyperbolic tangent is defined for any real input x by the ratio of hyperbolic sine and hyperbolic cosine [7]:
tanh(x) = sinh(x) / cosh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Multiplying numerator and denominator by e^x gives an equivalent, often more numerically convenient form:
tanh(x) = (e^(2x) - 1) / (e^(2x) + 1)
The range is the open interval (-1, 1). The function passes through the origin with tanh(0) = 0, asymptotes to +1 as x grows positive, and asymptotes to -1 as x grows negative [7][8]. The Wikipedia article on hyperbolic functions lists these identities together with the half-angle formula tanh(x) = 2t / (1 + t^2) where t = tanh(x/2), and notes that tanh is the unique solution of the differential equation f' = 1 - f^2 with f(0) = 0 [7].
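As a quick sanity check, the three expressions above can be compared against the library routine. This is a minimal sketch using only the Python standard library; the helper names (tanh_ratio, tanh_exp2x, tanh_half_angle) are illustrative and not part of any library.

```python
import math


def tanh_ratio(x: float) -> float:
    """sinh(x) / cosh(x), written out with exponentials."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def tanh_exp2x(x: float) -> float:
    """Equivalent form after multiplying numerator and denominator by e^x."""
    e2x = math.exp(2.0 * x)
    return (e2x - 1.0) / (e2x + 1.0)


def tanh_half_angle(x: float) -> float:
    """tanh(x) = 2t / (1 + t^2) with t = tanh(x/2)."""
    t = math.tanh(x / 2.0)
    return 2.0 * t / (1.0 + t * t)


# All three forms agree with math.tanh on moderate inputs.
for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    ref = math.tanh(x)
    for form in (tanh_ratio, tanh_exp2x, tanh_half_angle):
        assert math.isclose(form(x), ref, rel_tol=1e-9, abs_tol=1e-12)
```

The explicit exponential forms are only safe for moderate |x|; large inputs need the guarded evaluation described under Numerical stability.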
Derivative
A convenient property for backpropagation is that the derivative of tanh can be written entirely in terms of the function's own value [7]:
d/dx tanh(x) = 1 - tanh^2(x) = sech^2(x)
The maximum derivative is 1, attained at x = 0, and it decays to 0 as |x| grows [8]. During the forward pass a network can store the value y = tanh(x); the backward pass then obtains the local gradient as 1 - y^2, which costs one multiplication and one subtraction and no further exponentials. This is one reason the function was attractive on early hardware.
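The reuse of the forward value can be sketched in a few lines. The function names below are hypothetical, and the central finite difference is included only as an independent check on the closed form.

```python
import math


def tanh_forward(x: float) -> float:
    """Forward pass: return y = tanh(x); the caller keeps y for later."""
    return math.tanh(x)


def tanh_backward(y: float, upstream: float = 1.0) -> float:
    """Backward pass: local gradient 1 - y^2 times the upstream gradient.

    Only the stored forward output y is needed, not the input x.
    """
    return upstream * (1.0 - y * y)


def numeric_derivative(x: float, h: float = 1e-6) -> float:
    """Central finite difference, used here purely as a cross-check."""
    return (math.tanh(x + h) - math.tanh(x - h)) / (2.0 * h)


for x in (-2.0, -0.3, 0.0, 0.7, 2.5):
    y = tanh_forward(x)
    assert math.isclose(tanh_backward(y), numeric_derivative(x), abs_tol=1e-7)
```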
How does tanh relate to the sigmoid?
Tanh and the logistic sigmoid are linearly related. Writing the logistic function as sigma(x) = 1 / (1 + e^(-x)), the identity
tanh(x) = 2 sigma(2x) - 1
holds for all real x [7]. So a tanh layer is, up to an affine reparameterisation of weights and biases, the same as a sigmoid layer with inputs and outputs rescaled. The practical difference is that tanh is centred at zero while the sigmoid is centred at 0.5, and the maximum derivative of tanh (one) is four times the maximum derivative of the sigmoid (one quarter); both maxima are attained at the origin [8]. That larger gradient near zero is one reason tanh units tend to learn faster than logistic units in the early phase of training.
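The identity and the reparameterisation it implies can be checked numerically. In the sketch below, a single tanh unit with hypothetical parameters (w1, b1, w2, b2) is rewritten as a sigmoid unit by doubling the input weight and bias, doubling the output weight, and shifting the output bias by -w2.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def tanh_via_sigmoid(x: float) -> float:
    """tanh(x) = 2 * sigma(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def tanh_unit(x, w1, b1, w2, b2):
    """One hidden tanh unit followed by an affine output."""
    return w2 * math.tanh(w1 * x + b1) + b2


def equivalent_sigmoid_unit(x, w1, b1, w2, b2):
    """Same function expressed with a sigmoid hidden unit.

    w2 * tanh(z) + b2 = (2 * w2) * sigma(2z) + (b2 - w2)
    """
    return (2.0 * w2) * sigmoid(2.0 * w1 * x + 2.0 * b1) + (b2 - w2)


params = dict(w1=0.8, b1=-0.2, w2=1.5, b2=0.3)  # illustrative values
for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    assert math.isclose(tanh_via_sigmoid(x), math.tanh(x), abs_tol=1e-12)
    assert math.isclose(
        tanh_unit(x, **params), equivalent_sigmoid_unit(x, **params), abs_tol=1e-12
    )
```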
History: when did tanh become the default?
Neural networks of the 1980s and early 1990s used either the logistic sigmoid or tanh as the nonlinear unit, with tanh becoming the more common choice in serious training work after the publication of LeCun, Bottou, Orr and Mueller's chapter "Efficient BackProp" in Neural Networks: Tricks of the Trade (Springer LNCS 1524, 1998) [1]. That chapter, published as a refined version of a 1996 NIPS workshop talk, recommended tanh over the standard logistic sigmoid for hidden units of multilayer perceptrons trained by stochastic gradient descent. Its headline recommendation is blunt: "Symmetric sigmoids such as hyperbolic tangent often converge faster than the standard logistic function." [1] The argument is that nonzero mean activations introduce a bias into the gradients of subsequent layers and slow down convergence; symmetric activations such as tanh keep the average activation closer to zero and so produce more balanced updates [1]. The same chapter also recommends a particular scaled form, 1.7159 * tanh(2x/3), chosen so that the function passes through (1, 1) and (-1, -1) and so that the second derivative is largest at the points used as desired outputs [1].
Tanh remained the default hidden-unit nonlinearity throughout the 2000s. It appears as the squashing function in the original Long Short-Term Memory network of Hochreiter and Schmidhuber (Neural Computation, 1997), where it is applied to the candidate cell input and again to the cell state before it is exposed through the output gate [4]. When Cho, van Merrienboer, Gulcehre, Bahdanau, Bougares, Schwenk and Bengio introduced the gated recurrent unit in 2014 (Cho et al., Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, EMNLP 2014, arXiv:1406.1078), they kept tanh as the candidate-state activation [5]. The Wikipedia article on the GRU writes the candidate hidden state as h-tilde_t = tanh(W_h x_t + U_h (r_t * h_{t-1}) + b_h), with tanh as the activation [10].
Two papers from the Bengio group then changed the default for feedforward networks. Glorot and Bengio's Understanding the difficulty of training deep feedforward neural networks (AISTATS 2010) studied tanh, sigmoid and softsign units, showed how the activations and gradients evolve through depth under standard random initialisation, and derived what is now called Xavier or Glorot initialisation as the variance schedule that keeps signals stable across layers when the activation is approximately linear near zero (true of tanh) [2]. The paper's striking empirical observation is that "the tanh layers saturate one after the other during training, starting from the top layer" under the older initialisation, which is exactly the dynamic the new scheme was designed to prevent [2]. The follow-up by Glorot, Bordes and Bengio, Deep Sparse Rectifier Neural Networks (AISTATS 2011), then showed that rectifier units (max(0, x)) reach equal or better accuracy than tanh networks while producing genuinely sparse representations, and they did so without unsupervised pretraining [3]. After 2011 the centre of gravity in feedforward deep learning moved to ReLU and its descendants, while tanh remained dominant inside recurrent cells.
Properties
Tanh has a small number of properties that almost completely determine why it is used where it is used:
- Range (-1, 1). Outputs are bounded, which keeps activations from exploding through layers and gives downstream gating operations a clean dynamic range. The Wikipedia article on activation functions classifies tanh as a saturating activation function with range (-1, 1) [8].
- Zero-centred. Because the average output is close to zero when inputs are roughly symmetric, tanh avoids the systematic positive bias that the logistic sigmoid imposes on subsequent layers. This is the property LeCun et al. 1998 highlight as the main reason to prefer tanh over logistic sigmoid in hidden units [1].
- Antisymmetric. tanh(-x) = -tanh(x). Combined with zero-centred inputs and weights initialised symmetrically, antisymmetry helps keep the distribution of pre-activations centred during training [1].
- Smooth and infinitely differentiable. Tanh is C-infinity, so any optimisation method that uses higher derivatives (Hessian-free, natural gradient, second-order methods) can be applied without special handling at the kinks that ReLU introduces.
- Monotone increasing. Strictly so on the whole real line, with a single maximum slope of 1 at x = 0 [8].
- Saturation. For large |x| the function is essentially constant, so its derivative collapses to nearly zero. This is the property that drives the vanishing gradient problem in deep tanh networks [8].
What is the vanishing gradient problem in tanh networks?
In a deep tanh network, the gradient that backpropagation computes for an early layer is a product of many local derivatives 1 - tanh^2(z). Each of those factors is at most one, and is much smaller than one whenever the pre-activation z is outside roughly the range (-2, 2). The product therefore tends to shrink exponentially with depth, leaving early layers with effectively no learning signal. This is the classical vanishing gradient problem, first identified by Sepp Hochreiter in his 1991 diploma thesis at the Technical University of Munich (Untersuchungen zu dynamischen neuronalen Netzen) and re-examined empirically by Glorot and Bengio in 2010 [2]. Their paper traces how the variance of the back-propagated gradient decays through depth in a tanh network with the older 1/sqrt(n) initialisation and motivates the Xavier scheme as a remedy that keeps the variance roughly constant in the linear regime around zero [2].
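The exponential shrinkage is easy to see numerically. The sketch below multiplies the local factors 1 - tanh^2(z) along a chain of layers whose pre-activations are fixed by hand; it deliberately ignores the weight matrices, which also enter the real gradient, so it illustrates the saturation effect only.

```python
import math


def local_derivative(z: float) -> float:
    """Local tanh derivative 1 - tanh^2(z) at pre-activation z."""
    t = math.tanh(z)
    return 1.0 - t * t


def gradient_scale(pre_activations) -> float:
    """Product of the local derivatives along a chain of tanh layers."""
    scale = 1.0
    for z in pre_activations:
        scale *= local_derivative(z)
    return scale


for depth in (1, 5, 10, 20):
    near_linear = gradient_scale([0.5] * depth)  # layers close to the linear regime
    saturated = gradient_scale([2.0] * depth)    # layers at the edge of saturation
    print(f"depth={depth:2d}  z=0.5: {near_linear:.3e}  z=2.0: {saturated:.3e}")
```

With every pre-activation at z = 2 each layer contributes a factor of roughly 0.07, so ten layers already scale the gradient by less than 1e-10.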
The Glorot, Bordes and Bengio 2011 paper went one step further and replaced the saturating tanh with the rectifier ReLU(x) = max(0, x), whose gradient is exactly one for any positive input [3]. The paper showed that on image and text classification benchmarks ReLU networks matched or beat tanh networks of the same shape, and they did so without the unsupervised layer-by-layer pretraining that had been needed to make deep tanh networks train at all [3]. After 2011 most new feedforward architectures defaulted to ReLU or one of its variants such as Leaky ReLU, ELU, GELU or SiLU/Swish.
Where is tanh still used?
Tanh is far from obsolete. The places where it is still the default in modern systems are precisely the places where the bounded, zero-centred, smooth shape buys something concrete:
Inside recurrent cells
The LSTM cell of Hochreiter and Schmidhuber (1997) uses tanh in two places [4]. First, the candidate cell input g_t = tanh(W_g x_t + U_g h_{t-1} + b_g) is squashed into (-1, 1) before being scaled by the input gate and added to the cell state. Second, the cell state c_t is itself passed through tanh before being scaled by the output gate, so that the exposed hidden state h_t = o_t * tanh(c_t) is bounded [4][9]. The Wikipedia article on LSTM gives these equations explicitly with tanh as the activation, citing Hochreiter and Schmidhuber 1997 [9]. The bounded range of tanh limits how fast the unbounded cell state can change in any single step, which is part of why LSTM trains stably over long sequences.
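A scalar version of the cell makes the two uses of tanh concrete. This is a minimal sketch with a hidden size of one and hand-picked, hypothetical parameters (all weights 1, all biases 0); the forget gate f is the now-standard formulation and the parameter names are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical scalar parameters for gates i, f, o and candidate g.
PARAMS = {name: 1.0 for name in ("W_i", "U_i", "W_f", "U_f", "W_o", "U_o", "W_g", "U_g")}
PARAMS.update({name: 0.0 for name in ("b_i", "b_f", "b_o", "b_g")})


def lstm_step(x, h_prev, c_prev, p=PARAMS):
    """One LSTM step for scalar input, hidden state and cell state."""
    i = sigmoid(p["W_i"] * x + p["U_i"] * h_prev + p["b_i"])    # input gate
    f = sigmoid(p["W_f"] * x + p["U_f"] * h_prev + p["b_f"])    # forget gate
    o = sigmoid(p["W_o"] * x + p["U_o"] * h_prev + p["b_o"])    # output gate
    g = math.tanh(p["W_g"] * x + p["U_g"] * h_prev + p["b_g"])  # first tanh: candidate
    c = f * c_prev + i * g                                      # cell state, unbounded
    h = o * math.tanh(c)                                        # second tanh: bounded output
    return h, c


def run_lstm(inputs):
    """Run the cell over a sequence, returning the (h, c) pair at each step."""
    h, c, trace = 0.0, 0.0, []
    for x in inputs:
        h, c = lstm_step(x, h, c)
        trace.append((h, c))
    return trace


trace = run_lstm([5.0, 5.0, 5.0, -3.0])
```

In this run the cell state grows past 1 after the second step, while every exposed hidden state stays strictly inside (-1, 1).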
The GRU of Cho et al. (2014) keeps the same convention [5]. The candidate hidden state is computed as h-tilde_t = tanh(W x_t + U (r_t * h_{t-1}) + b), with tanh as the activation, and the update gate then linearly interpolates between the previous hidden state and this candidate [5][10]. Replacing tanh with ReLU in either LSTM or GRU is unusual; the bounded range and the smooth derivative are useful precisely because the recurrence applies the function many times to the same hidden vector.
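The same computation in scalar form is sketched below. The parameter values are hypothetical, and the interpolation uses the convention h_t = (1 - z_t) * h_{t-1} + z_t * h-tilde_t; some presentations swap the roles of z_t and 1 - z_t, which does not change the bounded behaviour.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical scalar parameters for the update gate z, reset gate r
# and candidate state h.
PARAMS = {
    "W_z": 0.5, "U_z": 0.5, "b_z": 0.0,
    "W_r": 1.0, "U_r": -0.5, "b_r": 0.0,
    "W_h": 1.0, "U_h": 1.0, "b_h": 0.0,
}


def gru_step(x, h_prev, p=PARAMS):
    """One GRU step for a scalar input and scalar hidden state."""
    z = sigmoid(p["W_z"] * x + p["U_z"] * h_prev + p["b_z"])  # update gate
    r = sigmoid(p["W_r"] * x + p["U_r"] * h_prev + p["b_r"])  # reset gate
    h_tilde = math.tanh(p["W_h"] * x + p["U_h"] * (r * h_prev) + p["b_h"])
    # Convex combination of two values in (-1, 1) stays in (-1, 1).
    return (1.0 - z) * h_prev + z * h_tilde


def run_gru(inputs, h0=0.0):
    """Run the cell over a sequence and return every hidden state."""
    h, states = h0, []
    for x in inputs:
        h = gru_step(x, h)
        states.append(h)
    return states


states = run_gru([4.0, -4.0, 2.5, 0.0])
```

Because the new state is a convex combination of the previous state and a tanh output, a hidden state that starts inside (-1, 1) never leaves it, however long the sequence.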
Bounded output layers
When the target of a network is bounded, tanh is a natural output nonlinearity. Two common cases:
- Continuous-action reinforcement-learning policies. Many actor networks for continuous-control benchmarks (MuJoCo, robotics) parameterise the action distribution as a Gaussian whose mean is passed through tanh to squash it smoothly into the bounded action range expected by the environment. The Wasserstein imitation-learning paper by Xiao et al. (arXiv:1906.08113) describes its policy as a stochastic network with tanh activation layers feeding the parameters of a normal distribution, which is a typical pattern.
- Image generators and Wasserstein critics. The generator in many GAN architectures uses tanh as the final activation so that the synthesised image lies in [-1, 1], matching the convention of normalising real images to that range before training. Wasserstein critics, in contrast, are normally left linear at the output because the critic is meant to estimate an unbounded score, but other intermediate layers in WGAN systems sometimes use tanh.
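Both cases reduce to an affine rescaling around tanh. The sketch below uses hypothetical helper names: squash_to_range maps an unbounded network output into an action interval, and the pixel helpers show the [-1, 1] normalisation convention for 8-bit images.

```python
import math


def squash_to_range(u: float, low: float, high: float) -> float:
    """Map an unbounded value u into [low, high] using tanh."""
    return low + 0.5 * (math.tanh(u) + 1.0) * (high - low)


def normalise_pixel(p: float) -> float:
    """Map an 8-bit pixel value in [0, 255] to [-1, 1]."""
    return p / 127.5 - 1.0


def denormalise_pixel(y: float) -> float:
    """Map a tanh-range value in [-1, 1] back to [0, 255]."""
    return (y + 1.0) * 127.5


# A raw policy output of any magnitude lands inside the action range.
for raw in (-50.0, -1.0, 0.0, 1.0, 50.0):
    action = squash_to_range(raw, low=-2.0, high=2.0)
    assert -2.0 <= action <= 2.0
```

Unlike hard clipping, the tanh squashing stays differentiable everywhere, although its gradient becomes very small once the raw output is far from zero.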
Normalising flows and small models
Tanh shows up as the elementwise nonlinearity in some normalising-flow components and in small recurrent or control models where the bounded range simplifies stability analysis. Hardware-friendly inference engines on FPGAs and microcontrollers also continue to ship tanh as a primitive, partly because piecewise-linear approximations of tanh are easy to implement and partly because legacy LSTM models for keyword spotting and speech still use it.
How does tanh compare with other activation functions?
The table below summarises the headline properties of the most common scalar activations used in modern neural networks. Formulas are taken from the Wikipedia Activation function article and from the original papers cited in the references [8].
| Activation | Formula | Range | Derivative | Pros | Cons | Typical use |
|---|---|---|---|---|---|---|
| tanh | (e^x - e^(-x)) / (e^x + e^(-x)) | (-1, 1) | 1 - tanh^2(x) | Zero-centred, smooth, bounded, antisymmetric | Saturates; vanishing gradients in deep nets | LSTM/GRU candidate states, bounded outputs |
| Sigmoid | 1 / (1 + e^(-x)) | (0, 1) | sigma(x) (1 - sigma(x)) | Probabilistic interpretation, smooth | Not zero-centred, max derivative 0.25, saturates | Output of binary classifiers, gates in LSTM/GRU |
| ReLU | max(0, x) | [0, inf) | 1 if x > 0 else 0 | No saturation for positive inputs, sparse activations, cheap | Dead neurons for x < 0, not smooth at 0 | Default hidden activation in feedforward and CNNs since ~2012 |
| Leaky ReLU | x if x > 0 else alpha*x (alpha~0.01) | (-inf, inf) | 1 if x > 0 else alpha | Avoids dead-neuron problem of ReLU | Extra hyperparameter alpha | Generative models, some CNN backbones |
| ELU | x if x > 0 else alpha(e^x - 1) | (-alpha, inf) | 1 if x > 0 else ELU(x) + alpha | Smooth at zero, mean activation closer to zero than ReLU | More expensive than ReLU due to exp | Some classification CNNs |
| GELU | x * Phi(x) where Phi is the standard normal CDF | approx. [-0.17, inf) | Phi(x) + x * phi(x) | Smooth, used in modern Transformers | More expensive than ReLU | Transformer feedforward sublayers (BERT, GPT-2 family) |
| SiLU / Swish | x * sigma(x) | approx. [-0.28, inf) | sigma(x) (1 + x (1 - sigma(x))) | Smooth, self-gated, competitive accuracy | Slightly more expensive than ReLU | EfficientNet, modern CNNs and LLMs |
| Softplus | log(1 + e^x) | (0, inf) | sigma(x) | Smooth approximation of ReLU, always positive | Slower than ReLU, no exact zero | Sometimes used for variance parameters |
Seen this way, tanh is the natural choice when an activation has to be bounded and zero-centred and smooth all at once. ReLU and its variants give up boundedness; the sigmoid gives up zero-centredness; piecewise-linear functions give up smoothness. Tanh keeps all three at the cost of saturating gradients, which is acceptable in the recurrent setting where saturation also acts as an implicit regulariser.
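The derivative column of the table can be spot-checked against central finite differences. The sketch below is illustrative only: the ELU and Leaky ReLU constants are arbitrary choices, and the test points avoid x = 0, where ReLU and Leaky ReLU are not differentiable.

```python
import math

ELU_ALPHA = 1.0     # illustrative choice
LEAKY_ALPHA = 0.01  # value quoted in the table


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def elu(x):
    return x if x > 0 else ELU_ALPHA * (math.exp(x) - 1.0)


# name -> (activation, derivative as written in the table)
ACTIVATIONS = {
    "tanh": (math.tanh, lambda x: 1.0 - math.tanh(x) ** 2),
    "sigmoid": (sigmoid, lambda x: sigmoid(x) * (1.0 - sigmoid(x))),
    "relu": (lambda x: max(0.0, x), lambda x: 1.0 if x > 0 else 0.0),
    "leaky_relu": (lambda x: x if x > 0 else LEAKY_ALPHA * x,
                   lambda x: 1.0 if x > 0 else LEAKY_ALPHA),
    "elu": (elu, lambda x: 1.0 if x > 0 else elu(x) + ELU_ALPHA),
    "gelu": (lambda x: x * normal_cdf(x),
             lambda x: normal_cdf(x) + x * normal_pdf(x)),
    "silu": (lambda x: x * sigmoid(x),
             lambda x: sigmoid(x) * (1.0 + x * (1.0 - sigmoid(x)))),
    "softplus": (lambda x: math.log1p(math.exp(x)), sigmoid),
}


def max_derivative_error(f, df, points=(-2.0, -0.5, 0.3, 1.5), h=1e-6):
    """Largest gap between df and a central finite difference of f."""
    return max(abs((f(x + h) - f(x - h)) / (2.0 * h) - df(x)) for x in points)


errors = {name: max_derivative_error(f, df) for name, (f, df) in ACTIVATIONS.items()}
```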
Numerical stability
The naive evaluation tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)) overflows in IEEE single precision once |x| exceeds about 88, because e^x already overflows. Standard math libraries therefore special-case the function. A common implementation, used by numpy.tanh and most C99 tanh routines, branches on the sign and the magnitude of x [11]:
- For x >= 0 use the identity tanh(x) = 1 - 2 / (e^(2x) + 1). When x is large the second term becomes negligible (and exactly zero once e^(2x) overflows to infinity), so the result rounds to 1.
- For x < 0 use tanh(x) = -tanh(-x) and apply the same branch.
- For very small |x| use the Taylor expansion tanh(x) ~ x - x^3/3 + 2 x^5/15 to avoid catastrophic cancellation.
Compared with naive sigmoid implementations, which need a similar log-sum-exp style trick to avoid overflow when x is very negative, tanh is unusually friendly: it saturates symmetrically, the safe identity covers both tails, and the function never produces NaN for finite inputs. This is one reason numpy.tanh is documented as a vectorised ufunc with no special handling required by the caller (NumPy reference, numpy.tanh) [11].
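The branching scheme can be written out as a short sketch. It is an illustration of the three branches, not the code of any particular library, and the thresholds (1e-3 and 20) are assumptions chosen for double precision. One Python-specific detail: math.exp raises OverflowError instead of returning infinity, so the sketch returns 1.0 explicitly for large x rather than relying on overflow.

```python
import math


def naive_tanh(x: float) -> float:
    """Direct evaluation of the defining ratio; fails for large |x|."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def stable_tanh(x: float) -> float:
    """Branching evaluation following the three cases listed above."""
    if x < 0.0:
        return -stable_tanh(-x)  # antisymmetry handles the negative tail
    if x < 1e-3:
        # Taylor series avoids cancellation in 1 - 2 / (e^(2x) + 1).
        x2 = x * x
        return x * (1.0 - x2 / 3.0 + 2.0 * x2 * x2 / 15.0)
    if x > 20.0:
        # 2 / (e^(2x) + 1) is below double-precision resolution here.
        return 1.0
    return 1.0 - 2.0 / (math.exp(2.0 * x) + 1.0)


for x in (-800.0, -25.0, -1.5, -1e-5, 0.0, 1e-5, 0.5, 3.0, 25.0, 800.0):
    assert math.isclose(stable_tanh(x), math.tanh(x), rel_tol=1e-9, abs_tol=1e-15)

try:
    naive_tanh(800.0)
except OverflowError:
    print("naive_tanh(800.0) overflows; stable_tanh(800.0) =", stable_tanh(800.0))
```

Python floats are double precision, so the naive form survives until |x| is around 709 rather than 88, but the failure mode is the same.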
Hardware approximations
On FPGAs, microcontrollers and low-power inference accelerators, tanh is rarely computed by evaluating exponentials. Two classes of approximation dominate the literature:
- Piecewise-linear (PWL) approximations. The interval (-3, 3) is divided into several segments, on each of which tanh is replaced by a linear function ax + b chosen to minimise maximum error. Outside that interval the output is clamped to +/-1. Sahin and Tas (Microelectronics Journal, 2023) report a 16-bit PWL implementation of tanh that uses only shift-and-add logic and reduces delay, area and power compared with a polynomial approximation while staying within a small bounded error [14]. PWL approximations are popular because they map onto LUT-and-multiplier blocks naturally and because the maximum error is easy to certify.
- CORDIC implementations. The CORDIC algorithm computes hyperbolic functions iteratively using only shifts, adds and a small constant table; the same hardware can be reused for sinh, cosh, tan and arctan. Tiwari and Khare (Microprocessors and Microsystems, 2015) describe a CORDIC-based hardware implementation of sigmoidal activations including tanh [15], and a 2025 arXiv paper by Cardarilli et al. (arXiv:2503.11685) revisits CORDIC for activation functions in modern accelerators [16]. CORDIC tends to use less area than a large LUT but more clock cycles than a fixed PWL approximation, so the right choice depends on whether area or latency is the bottleneck.
For very tight memory budgets, look-up tables of around 256 entries plus linear interpolation also remain common in low-end DSPs and audio codecs.
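The look-up-table variant is the simplest of these to sketch in software. The example below assumes a 256-entry table covering [0, 4], linear interpolation between entries, antisymmetry for negative inputs and clamping to +/-1 beyond the table; the table range and the floating-point arithmetic are illustrative choices, whereas real hardware would use fixed-point values.

```python
import math

TABLE_SIZE = 256
X_MAX = 4.0                       # inputs beyond this are clamped to +/-1
STEP = X_MAX / (TABLE_SIZE - 1)   # spacing between table entries
TABLE = [math.tanh(i * STEP) for i in range(TABLE_SIZE)]


def tanh_lut(x: float) -> float:
    """Approximate tanh with a table look-up plus linear interpolation."""
    sign = -1.0 if x < 0.0 else 1.0
    ax = abs(x)
    if ax >= X_MAX:
        return sign               # saturated region
    pos = ax / STEP
    i = min(int(pos), TABLE_SIZE - 2)  # guard against rounding at the top edge
    frac = pos - i
    return sign * (TABLE[i] + frac * (TABLE[i + 1] - TABLE[i]))


def max_abs_error(lo: float, hi: float, n: int = 12001) -> float:
    """Largest absolute error of tanh_lut on an even grid over [lo, hi]."""
    grid = (lo + (hi - lo) * k / (n - 1) for k in range(n))
    return max(abs(tanh_lut(x) - math.tanh(x)) for x in grid)


print("max error on [-6, 6]:    ", max_abs_error(-6.0, 6.0))
print("max error on [-3.9, 3.9]:", max_abs_error(-3.9, 3.9))
```

Inside the table the interpolation error stays below 5e-5; the overall error is dominated by the clamp at |x| = 4, where 1 - tanh(4) is about 6.7e-4.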
How should a tanh layer be initialised?
The default initialiser for a tanh hidden layer is Xavier, also called Glorot, initialisation, introduced by Glorot and Bengio in their 2010 AISTATS paper [2]. The derivation assumes that the activation is approximately linear near zero, which is exactly the regime tanh operates in early in training, and asks for a weight scale that keeps the variance of activations and the variance of back-propagated gradients constant from layer to layer. The recommendation is to draw weights from a uniform distribution on [-r, r] with r = sqrt(6 / (fan_in + fan_out)), or equivalently from a normal distribution with variance 2 / (fan_in + fan_out) [2]. This scheme is implemented as torch.nn.init.xavier_uniform_ and xavier_normal_ in PyTorch and as tf.keras.initializers.GlorotUniform and GlorotNormal in TensorFlow.
For ReLU networks the analogous scheme is He initialisation (He, Zhang, Ren and Sun, 2015), which uses a variance of 2 / fan_in instead of 2 / (fan_in + fan_out) to compensate for the fact that ReLU zeroes out half of the inputs in expectation [17]. He initialisation is not the recommended default for tanh; using it tends to push pre-activations into the saturating tails and slow training.
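The uniform variant of the rule can be sketched with the standard library alone. The function name and layer sizes below are hypothetical; in practice the framework initialisers named above would be used instead.

```python
import math
import random


def xavier_uniform(fan_in: int, fan_out: int, rng: random.Random):
    """Sample a fan_out x fan_in weight matrix from U(-r, r),
    with r = sqrt(6 / (fan_in + fan_out))."""
    r = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-r, r) for _ in range(fan_in)] for _ in range(fan_out)]


fan_in, fan_out = 256, 128           # illustrative layer sizes
rng = random.Random(0)               # fixed seed for reproducibility
weights = xavier_uniform(fan_in, fan_out, rng)

flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
variance = sum((w - mean) ** 2 for w in flat) / len(flat)
target = 2.0 / (fan_in + fan_out)    # variance of U(-r, r) is r^2 / 3

print(f"empirical variance {variance:.6f}, target {target:.6f}")
```

The variance of a uniform distribution on [-r, r] is r^2 / 3, which is why the uniform bound sqrt(6 / (fan_in + fan_out)) and the normal variance 2 / (fan_in + fan_out) describe the same weight scale.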
Library APIs
Tanh is a primitive in essentially every numerical library. The conventional names are:
| Library | Function | Notes |
|---|---|---|
| Python standard library | math.tanh(x) | Scalar, follows the C99 tanh from math.h |
| NumPy | numpy.tanh(x) | Elementwise ufunc, supports broadcasting and the out= and where= arguments (NumPy reference) [11] |
| PyTorch | torch.tanh(input) and torch.nn.Tanh() module | Differentiable; the module form is meant for use inside nn.Sequential [12] |
| TensorFlow / Keras | tf.nn.tanh(x) and tf.keras.activations.tanh(x) | Used by passing activation='tanh' to a Dense or LSTM layer [13] |
| JAX | jax.numpy.tanh(x) | Same semantics as NumPy, with autodiff |
| C / C++ | tanh, tanhf, tanhl from <math.h> / <cmath> | Standardised by C99 and C++11 |
All of these libraries ship numerically robust implementations that saturate cleanly to +/-1 instead of overflowing, although the exact algorithm varies by library and platform, so callers do not need to special-case large or small inputs [11].
Teaching context
Tanh appears in essentially every introductory deep-learning course as the canonical example of a saturating activation function. The pedagogical sequence is usually: introduce the perceptron and the step function; soften the step into a logistic sigmoid so that backpropagation has a well-defined gradient; recentre the sigmoid to obtain tanh; observe that both functions saturate; motivate ReLU as a non-saturating alternative. Goodfellow, Bengio and Courville's Deep Learning (MIT Press, 2016) follows this arc in chapter 6 and notes that tanh is preferred to the logistic sigmoid as a hidden-unit activation because it behaves more like the identity near zero, so a network of small tanh layers can be trained much like a deep linear network in early epochs [6].
It is also the standard worked example for analytical exercises on backpropagation, since the derivative 1 - tanh^2(x) can be computed cheaply from the forward activation. This makes tanh useful as a teaching activation even in courses that recommend ReLU for production work.
See also
- Activation function
- Sigmoid function
- Rectified Linear Unit (ReLU)
- Long Short-Term Memory (LSTM)
- GRU
- Vanishing Gradient Problem
- Recurrent Neural Network
- Backpropagation
- Stochastic Gradient Descent (SGD)
References
1. Efficient BackProp, Yann LeCun, Leon Bottou, Genevieve B. Orr and Klaus-Robert Mueller, in *Neural Networks: Tricks of the Trade*, Lecture Notes in Computer Science 1524, Springer, 1998.
2. Understanding the difficulty of training deep feedforward neural networks, Xavier Glorot and Yoshua Bengio, *Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS)*, PMLR 9:249-256, 2010.
3. Deep Sparse Rectifier Neural Networks, Xavier Glorot, Antoine Bordes and Yoshua Bengio, *Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS)*, PMLR 15:315-323, 2011.
4. Long Short-Term Memory, Sepp Hochreiter and Jurgen Schmidhuber, *Neural Computation* 9(8):1735-1780, MIT Press, 1997.
5. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk and Yoshua Bengio, *Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)*, ACL, 2014.
6. Deep Learning, Ian Goodfellow, Yoshua Bengio and Aaron Courville, MIT Press, 2016.
7. Hyperbolic functions, Wikipedia contributors, Wikipedia, accessed 2026.
8. Activation function, Wikipedia contributors, Wikipedia, accessed 2026.
9. Long short-term memory, Wikipedia contributors, Wikipedia, accessed 2026.
10. Gated recurrent unit, Wikipedia contributors, Wikipedia, accessed 2026.
11. numpy.tanh reference, NumPy developers, NumPy documentation, accessed 2026.
12. torch.tanh reference, PyTorch contributors, PyTorch documentation, accessed 2026.
13. tf.keras.activations.tanh reference, TensorFlow authors, TensorFlow documentation, accessed 2026.
14. Cost effective Tanh activation function circuits based on fast piecewise linear logic, Sahin and Tas, *Microelectronics Journal*, Elsevier, 2023.
15. Hardware implementation of neural network with Sigmoidal activation functions using CORDIC, Tiwari and Khare, *Microprocessors and Microsystems*, Elsevier, 2015.
16. CORDIC Is All You Need, Cardarilli et al., arXiv:2503.11685, 2025.
17. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun, *Proceedings of the IEEE International Conference on Computer Vision (ICCV)*, 2015.