The hyperbolic tangent, written tanh, is a smooth, bounded, S-shaped function that maps the real line into the open interval (-1, 1). In machine learning it is one of the oldest activation functions used in neural networks, and for most of the 1990s and 2000s it was the default nonlinearity in hidden layers of feedforward and recurrent models. Although it has been displaced from generic deep feedforward networks by the Rectified Linear Unit (ReLU) and its variants, tanh remains the standard squashing nonlinearity inside LSTM and GRU cells, and it is still used in output layers when targets are bounded between -1 and 1.
The hyperbolic tangent is defined for any real input x by the ratio of hyperbolic sine and hyperbolic cosine:
tanh(x) = sinh(x) / cosh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Multiplying numerator and denominator by e^x gives an equivalent, often more numerically convenient form:
tanh(x) = (e^(2x) - 1) / (e^(2x) + 1)
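A quick numerical check that the two forms agree, using only the standard library:

```python
import math

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    a = (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))
    b = (math.exp(2.0 * x) - 1.0) / (math.exp(2.0 * x) + 1.0)
    assert math.isclose(a, b, rel_tol=1e-12, abs_tol=1e-12)
    assert math.isclose(a, math.tanh(x), rel_tol=1e-12, abs_tol=1e-12)
```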
The range is the open interval (-1, 1). The function passes through the origin with tanh(0) = 0, approaches +1 as x grows positive, and approaches -1 as x grows negative. The Wikipedia article on hyperbolic functions lists these properties together with the identity tanh(x) = 2t / (1 + t^2), where t = tanh(x/2), and notes that tanh is the unique solution of the differential equation f' = 1 - f^2 with f(0) = 0.
A convenient property for backpropagation is that the derivative of tanh can be written entirely in terms of the function's own value:
d/dx tanh(x) = 1 - tanh^2(x) = sech^2(x)
The maximum derivative is 1, attained at x = 0, and it decays to 0 as |x| grows. During the forward pass a network can store the value tanh(x); the backward pass then needs only one extra multiplication to obtain the local gradient, which is one reason the function was attractive on early hardware.
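A minimal sketch of this caching pattern (function names are illustrative, not any particular framework's API):

```python
import numpy as np

def tanh_forward(x):
    """Forward pass: compute the activation and keep it as the cache."""
    y = np.tanh(x)
    return y, y          # output, cache

def tanh_backward(grad_out, cache):
    """Backward pass: local gradient 1 - tanh^2(x) from the cached value,
    so no exponentials need to be re-evaluated."""
    y = cache
    return grad_out * (1.0 - y * y)
```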
Tanh and the logistic sigmoid are linearly related. Writing the logistic function as σ(x) = 1 / (1 + e^(-x)), the identity
tanh(x) = 2 σ(2x) - 1
holds for all real x. So a tanh layer is, up to an affine reparameterisation of weights and biases, the same as a sigmoid layer with inputs and outputs rescaled. The practical difference is that tanh is centred at zero while the sigmoid is centred at 0.5, and the maximum derivative of tanh (one) is four times larger than the maximum derivative of the sigmoid (one quarter). The Wikipedia article on hyperbolic functions states this identity explicitly.
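The identity is easy to confirm numerically:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for x in (-2.0, -0.1, 0.0, 0.7, 4.0):
    assert math.isclose(math.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0,
                        rel_tol=1e-12, abs_tol=1e-12)
```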
Neural networks of the 1980s and early 1990s used either the logistic sigmoid or tanh as the nonlinear unit, with tanh becoming the more common choice in serious training work after the publication of LeCun, Bottou, Orr and Müller's chapter "Efficient BackProp" in Neural Networks: Tricks of the Trade (Springer LNCS 1524, 1998). That chapter, published as a refined version of a 1996 NIPS workshop talk, recommended tanh over the standard logistic sigmoid for hidden units of multilayer perceptrons trained by stochastic gradient descent. The argument is that nonzero mean activations introduce a bias into the gradients of subsequent layers and slow down convergence; symmetric activations such as tanh keep the average activation closer to zero and so produce more balanced updates. The same chapter also recommends a particular scaled form, 1.7159 · tanh(2x/3), chosen so that the function passes through (1, 1) and (-1, -1) and so that the second derivative is largest at the points used as desired outputs (LeCun et al. 1998).
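A one-line sketch of the scaled form with the chapter's constants:

```python
import math

def lecun_tanh(x: float) -> float:
    """LeCun et al.'s scaled tanh, 1.7159 * tanh(2x/3), so that f(1) = 1."""
    return 1.7159 * math.tanh(2.0 * x / 3.0)

print(lecun_tanh(1.0), lecun_tanh(-1.0))  # ~1.0, ~-1.0
```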
Tanh remained the default hidden-unit nonlinearity throughout the 2000s. It appears as the squashing function in the original Long Short-Term Memory network of Hochreiter and Schmidhuber (Neural Computation, 1997), where it is applied to the candidate cell input and again to the cell state before it is exposed through the output gate. When Cho, van Merriënboer, Gulcehre, Bahdanau, Bougares, Schwenk and Bengio introduced the gated recurrent unit in 2014 (Cho et al., Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, EMNLP 2014, arXiv:1406.1078), they kept tanh as the candidate-state activation. The Wikipedia article on the GRU writes the candidate hidden state as ĥ_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t-1}) + b_h), with tanh as the activation.
Two papers from the Bengio group then changed the default for feedforward networks. Glorot and Bengio's Understanding the difficulty of training deep feedforward neural networks (AISTATS 2010) studied tanh, sigmoid and softsign units, showed how the activations and gradients evolve through depth under standard random initialisation, and derived what is now called Xavier or Glorot initialisation as the variance schedule that keeps signals stable across layers when the activation is approximately linear near zero (true of tanh). The follow-up by Glorot, Bordes and Bengio, Deep Sparse Rectifier Neural Networks (AISTATS 2011), then showed that rectifier units (max(0, x)) reach equal or better accuracy than tanh networks while producing genuinely sparse representations, and they did so without unsupervised pretraining. After 2011 the centre of gravity in feedforward deep learning moved to ReLU and its descendants, while tanh remained dominant inside recurrent cells.
A small number of properties of tanh largely determine where it is, and is not, used:
In a deep tanh network, the gradient that backpropagation computes for an early layer is a product of many local derivatives 1 - tanh^2(z). Each of those factors is at most one, and is much smaller than one whenever the pre-activation z is outside roughly the range (-2, 2). The product therefore tends to shrink exponentially with depth, leaving early layers with effectively no learning signal. This is the classical vanishing gradient problem identified by Hochreiter in his 1991 diploma thesis and re-examined empirically by Glorot and Bengio in 2010. Their paper traces how the variance of the back-propagated gradient decays through depth in a tanh network with the older 1/sqrt(n) initialisation and motivates the Xavier scheme as a remedy that keeps the variance roughly constant in the linear regime around zero.
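The effect is easy to reproduce with a toy calculation (a scalar chain rather than a real network, with pre-activations drawn wide enough to saturate):

```python
import numpy as np

rng = np.random.default_rng(0)
depth = 50
z = rng.normal(0.0, 2.0, size=depth)    # pre-activations of 50 stacked tanh layers
grad = 1.0
for zi in z:
    grad *= 1.0 - np.tanh(zi) ** 2      # chain rule picks up one factor per layer
print(grad)  # many orders of magnitude below 1: almost no signal reaches layer 1
```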
The Glorot, Bordes and Bengio 2011 paper went one step further and replaced the saturating tanh with the rectifier ReLU(x) = max(0, x), whose gradient is exactly one for any positive input. The paper showed that on image and text classification benchmarks ReLU networks matched or beat tanh networks of the same shape, and they did so without the unsupervised layer-by-layer pretraining that had been needed to make deep tanh networks train at all. After 2011 most new feedforward architectures defaulted to ReLU or one of its variants such as Leaky ReLU, ELU, GELU or SiLU/Swish.
Tanh is far from obsolete. The places where it is still the default in modern systems are precisely the places where the bounded, zero-centred, smooth shape buys something concrete:
The LSTM cell of Hochreiter and Schmidhuber (1997) uses tanh in two places. First, the candidate cell input g_t = tanh(W_g x_t + U_g h_{t-1} + b_g) is squashed into (-1, 1) before being scaled by the input gate and added to the cell state. Second, the cell state c_t is itself passed through tanh before being scaled by the output gate, so that the exposed hidden state h_t = o_t · tanh(c_t) is bounded. The Wikipedia article on LSTM gives these equations explicitly with tanh as the activation, citing Hochreiter and Schmidhuber 1997. The bounded range of tanh limits how fast the unbounded cell state can change in any single step, which is part of why LSTM trains stably over long sequences.
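A numpy sketch of one step of the standard modern LSTM cell (the stacked weight layout and names here are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b stack the input, forget, output and
    candidate parameters, each block of hidden size n."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[:n])              # input gate
    f = sigmoid(z[n:2 * n])         # forget gate
    o = sigmoid(z[2 * n:3 * n])     # output gate
    g = np.tanh(z[3 * n:])          # candidate cell input, squashed to (-1, 1)
    c = f * c_prev + i * g          # the cell state itself is unbounded
    h = o * np.tanh(c)              # second tanh bounds the exposed state
    return h, c
```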
The GRU of Cho et al. (2014) keeps the same convention. The candidate hidden state is computed as ĥ_t = tanh(W x_t + U (r_t ⊙ h_{t-1}) + b), with tanh as the activation, and the update gate then linearly interpolates between the previous hidden state and this candidate. Replacing tanh with ReLU in either LSTM or GRU is unusual; the bounded range and the smooth derivative are useful precisely because the recurrence applies the function many times to the same hidden vector.
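The corresponding GRU step, in the same style (conventions for z versus 1 - z differ between references; the form below follows the Wikipedia article cited above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh):
    """One GRU step: tanh candidate, gated interpolation with the past."""
    z = sigmoid(Wz @ x + Uz @ h_prev + bz)              # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev + br)              # reset gate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h_prev) + bh)   # candidate state
    return (1.0 - z) * h_prev + z * h_cand              # interpolate old vs. new
```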
When the target of a network is bounded, tanh is a natural output nonlinearity. Two common cases are image generators whose pixel values are scaled to [-1, 1], as in the DCGAN generator of Radford et al. (2015), and continuous-control policies whose actions must lie in a bounded interval, where a final tanh is rescaled to the action range.
Tanh shows up as the elementwise nonlinearity in some normalising-flow components and in small recurrent or control models where the bounded range simplifies stability analysis. Hardware-friendly inference engines on FPGAs and microcontrollers also continue to ship tanh as a primitive, partly because piecewise-linear approximations of tanh are easy to implement and partly because legacy LSTM models for keyword spotting and speech still use it.
The table below summarises the headline properties of the most common scalar activations used in modern neural networks. Formulas are taken from the Wikipedia Activation function article and from the original papers cited in the references.
| Activation | Formula | Range | Derivative | Pros | Cons | Typical use |
|---|---|---|---|---|---|---|
| tanh | (e^x - e^(-x)) / (e^x + e^(-x)) | (-1, 1) | 1 - tanh^2(x) | Zero-centred, smooth, bounded, antisymmetric | Saturates; vanishing gradients in deep nets | LSTM/GRU candidate states, bounded outputs |
| Sigmoid | 1 / (1 + e^(-x)) | (0, 1) | σ(x) (1 - σ(x)) | Probabilistic interpretation, smooth | Not zero-centred, max derivative 0.25, saturates | Output of binary classifiers, gates in LSTM/GRU |
| ReLU | max(0, x) | [0, ∞) | 1 if x > 0 else 0 | No saturation for positive inputs, sparse activations, cheap | Dead neurons for x < 0, not smooth at 0 | Default hidden activation in feedforward and CNNs since ~2012 |
| Leaky ReLU | x if x > 0 else αx (α≈0.01) | (-∞, ∞) | 1 if x > 0 else α | Avoids dead-neuron problem of ReLU | Extra hyperparameter α | Generative models, some CNN backbones |
| ELU | x if x > 0 else α(e^x - 1) | (-α, ∞) | 1 if x > 0 else ELU(x) + α | Smooth at zero, mean activation closer to zero than ReLU | More expensive than ReLU due to exp | Some classification CNNs |
| GELU | x · Φ(x) where Φ is the standard normal CDF | (-∞, ∞) | Φ(x) + x · φ(x) | Smooth, used in modern Transformers | More expensive than ReLU | Transformer feedforward sublayers (BERT, GPT-2 family) |
| SiLU / Swish | x · σ(x) | (-∞, ∞) | σ(x) (1 + x (1 - σ(x))) | Smooth, self-gated, competitive accuracy | Slightly more expensive than ReLU | EfficientNet, modern CNNs and LLMs |
| Softplus | log(1 + e^x) | (0, ∞) | σ(x) | Smooth approximation of ReLU, always positive | Slower than ReLU, no exact zero | Sometimes used for variance parameters |
Seen this way, tanh is the natural choice when an activation has to be bounded and zero-centred and smooth all at once. ReLU and its variants give up boundedness; the sigmoid gives up zero-centredness; piecewise-linear functions give up smoothness. Tanh keeps all three at the cost of saturating gradients, which is acceptable in the recurrent setting where saturation also acts as an implicit regulariser.
The naive evaluation tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)) overflows in IEEE single precision once |x| exceeds about 88, because e^x already overflows. Standard math libraries therefore special-case the function. A common implementation, used by numpy.tanh and most C99 tanh routines, branches on the sign and the magnitude of x:
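A minimal Python sketch of that branch structure (real libm implementations use tuned polynomial kernels, but the shape is the same):

```python
import math

def stable_tanh(x: float) -> float:
    """Overflow-free tanh: reduce to |x|, then use expm1 for accuracy."""
    if x > 20.0:                        # tanh(20) is 1.0 to double precision
        return 1.0
    if x < -20.0:
        return -1.0
    t = math.expm1(-2.0 * abs(x))       # e^(-2|x|) - 1, always in (-1, 0]
    y = -t / (t + 2.0)                  # equals (1 - e^(-2|x|)) / (1 + e^(-2|x|))
    return y if x >= 0.0 else -y
```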
Compared with naive sigmoid implementations, which need a similar branch on the sign of x to avoid overflow when x is very negative, tanh is unusually friendly: it saturates symmetrically, the safe identity covers both tails, and the function never produces NaN for finite inputs. This is one reason numpy.tanh is documented as a vectorised ufunc with no special handling required by the caller (NumPy reference, numpy.tanh).
On FPGAs, microcontrollers and low-power inference accelerators, tanh is rarely computed by evaluating exponentials. Two classes of approximation dominate the literature: piecewise-linear schemes, which split the input range into a handful of segments with precomputed slopes and intercepts and clamp both tails to ±1, and low-degree polynomial or rational approximations applied after range reduction.
For very tight memory budgets, look-up tables of around 256 entries plus linear interpolation also remain common in low-end DSPs and audio codecs.
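A sketch of the table-plus-interpolation idea (the table size and clamping range here are illustrative choices, not taken from any particular chip):

```python
import numpy as np

# Precompute a 256-entry table of tanh over [-4, 4]; outside that range
# tanh is within about 0.0007 of +/-1, so we simply clamp.
LO, HI, N = -4.0, 4.0, 256
TABLE = np.tanh(np.linspace(LO, HI, N))

def tanh_lut(x: float) -> float:
    """Approximate tanh by table look-up plus linear interpolation."""
    if x <= LO:
        return -1.0
    if x >= HI:
        return 1.0
    pos = (x - LO) / (HI - LO) * (N - 1)   # fractional table index
    i = int(pos)
    frac = pos - i
    return TABLE[i] * (1.0 - frac) + TABLE[min(i + 1, N - 1)] * frac
```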
The default initialiser for a tanh hidden layer is Xavier, also called Glorot, initialisation, introduced by Glorot and Bengio in their 2010 AISTATS paper. The derivation assumes that the activation is approximately linear near zero, which is exactly the regime tanh operates in early in training, and asks for a weight scale that keeps the variance of activations and the variance of back-propagated gradients constant from layer to layer. The recommendation is to draw weights from a uniform distribution on [-r, r] with r = sqrt(6 / (fan_in + fan_out)), or equivalently from a normal distribution with variance 2 / (fan_in + fan_out). This scheme is implemented as torch.nn.init.xavier_uniform_ and xavier_normal_ in PyTorch and as tf.keras.initializers.GlorotUniform and GlorotNormal in TensorFlow.
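A numpy sketch of the uniform variant:

```python
import numpy as np

def xavier_uniform(fan_in: int, fan_out: int, rng=None):
    """Glorot/Xavier uniform: Var(W) = 2 / (fan_in + fan_out),
    since a uniform on [-r, r] has variance r^2 / 3."""
    rng = rng or np.random.default_rng()
    r = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-r, r, size=(fan_out, fan_in))

W = xavier_uniform(256, 128)
print(W.var(), 2.0 / (256 + 128))   # empirical variance vs. target
```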
For ReLU networks the analogous scheme is He initialisation (He, Zhang, Ren and Sun, 2015), which uses a variance of 2 / fan_in instead of 2 / (fan_in + fan_out) to compensate for the fact that ReLU zeroes out half of the inputs in expectation. He initialisation is not the recommended default for tanh; using it tends to push pre-activations into the saturating tails and slow training.
Tanh is a primitive in essentially every numerical library. The conventional names are:
| Library | Function | Notes |
|---|---|---|
| Python standard library | math.tanh(x) | Scalar, follows the C99 tanh from math.h |
| NumPy | numpy.tanh(x) | Elementwise ufunc, supports broadcasting and the out= and where= arguments (NumPy reference) |
| PyTorch | torch.tanh(input) and torch.nn.Tanh() module | Differentiable; the module form is meant for use inside nn.Sequential |
| TensorFlow / Keras | tf.nn.tanh(x) and tf.keras.activations.tanh(x) | Used by passing activation='tanh' to a Dense or LSTM layer |
| JAX | jax.numpy.tanh(x) | Same semantics as NumPy, with autodiff |
| C / C++ | tanh, tanhf, tanhl from <math.h> / <cmath> | Standardised by C99 and C++11 |
All of these implementations handle large and small inputs with a stable reduction like the one discussed above, so callers do not need to special-case either tail.
Tanh appears in essentially every introductory deep-learning course as the canonical example of a saturating activation function. The pedagogical sequence is usually: introduce the perceptron and the step function; soften the step into a logistic sigmoid so that backpropagation has a well-defined gradient; recentre the sigmoid to obtain tanh; observe that both functions saturate; motivate ReLU as a non-saturating alternative. Goodfellow, Bengio and Courville's Deep Learning (MIT Press, 2016) follows this arc in chapter 6 and notes that tanh is preferred to the logistic sigmoid as a hidden-unit activation because it behaves more like the identity near zero, so a network of small tanh layers can be trained much like a deep linear network in early epochs.
It is also the standard worked example for analytical exercises on backpropagation, since the derivative 1 - tanh^2(x) can be computed cheaply from the forward activation. This makes tanh useful as a teaching activation even in courses that recommend ReLU for production work.
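The classic exercise is a finite-difference check of that identity:

```python
import math

x, eps = 0.8, 1e-6
numeric = (math.tanh(x + eps) - math.tanh(x - eps)) / (2.0 * eps)
analytic = 1.0 - math.tanh(x) ** 2
print(abs(numeric - analytic))  # around 1e-10: the two derivatives agree
```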