See also: Machine learning terms
In machine learning, the sigmoid function is a widely used mathematical function that maps any real-valued input into the open interval (0, 1), allowing its output to be interpreted as a probability. It is employed in many machine learning algorithms, particularly in artificial neural networks and logistic regression models, to convert continuous inputs into probabilities for binary classification tasks. The sigmoid function is characterized by its distinctive S-shaped curve, also known as the logistic curve. The function was originally studied in the context of population growth models by Pierre François Verhulst in the 1830s and 1840s, and it has since become one of the most recognizable functions in deep learning and statistics.
Beyond its role as an activation function, the sigmoid function appears throughout probability theory, signal processing, and information science. Its ability to compress any real-valued number into the (0, 1) interval makes it a natural choice whenever an output must be interpreted as a probability.
The sigmoid function, denoted as σ(x), is mathematically defined as:
σ(x) = 1 / (1 + e^(-x))
where x represents the input value and e is the base of the natural logarithm, approximately equal to 2.71828. The function compresses input values into a range between 0 and 1, with the output approaching 0 as x tends to negative infinity and approaching 1 as x tends to positive infinity.
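As a quick illustration, the definition translates directly into a few lines of NumPy (a minimal sketch; see the implementation section below for a numerically careful version):

```python
import numpy as np

def sigmoid(x):
    # Direct translation of the definition 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-2.0, 0.0, 2.0])))
# -> [0.11920292 0.5        0.88079708]
```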
The sigmoid function can also be expressed in terms of the hyperbolic tangent:
σ(x) = (1 + tanh(x/2)) / 2
This relationship shows that the sigmoid and tanh functions are closely related through a simple linear transformation.
An equivalent representation rewrites the function with a positive exponent:
σ(x) = e^x / (e^x + 1)
This alternative is sometimes preferred for numerical stability when dealing with large negative values of x, since it avoids computing e raised to a large positive power in the denominator.
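The difference matters in plain Python, where math.exp raises an OverflowError rather than returning infinity. A minimal sketch of the two forms:

```python
import math

def sigmoid_naive(x):
    return 1.0 / (1.0 + math.exp(-x))    # math.exp(1000) overflows

def sigmoid_alt(x):
    e = math.exp(x)                      # e^x is tiny and safe for x << 0
    return e / (e + 1.0)

print(sigmoid_alt(-1000.0))   # 0.0 (underflows to zero, the correct limit)
# sigmoid_naive(-1000.0)      # would raise OverflowError: math range error
```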
The sigmoid function possesses several key properties that make it suitable for machine learning applications:
| Property | Description |
|---|---|
| Differentiable | The sigmoid function is smooth and continuously differentiable at every point, which is essential for gradient-based optimization algorithms such as gradient descent. |
| Monotonic | The function is strictly increasing, meaning that the output will increase as the input increases. This property ensures that the function preserves the order of input values. |
| Bounded range | The function's output range is limited to the open interval (0, 1), making it ideal for representing probabilities or binary decisions. |
| Symmetry | The sigmoid function is symmetric around the point (0, 0.5). Specifically, σ(-x) = 1 - σ(x) for all x. |
| Asymptotic behavior | As x approaches positive infinity, σ(x) approaches 1. As x approaches negative infinity, σ(x) approaches 0. The function never actually reaches 0 or 1. |
| Fixed midpoint | The output at x = 0 is exactly 0.5, which serves as a natural decision boundary in classification tasks. |
| Log-odds relationship | The inverse of the sigmoid is the logit function: logit(p) = log(p / (1 - p)). This provides a direct connection to log-odds in statistics. |
The logit function is the mathematical inverse of the sigmoid. Given a probability p in the interval (0, 1), the logit maps it back to the real number line:
logit(p) = log(p / (1 - p))
This function converts a probability into log-odds and is central to logistic regression, where the model assumes that the log-odds of the outcome vary linearly with the input features. In statistics, the logit function is also called the log-odds function. The sigmoid and logit together form a bijection between the real line and the open interval (0, 1), which is why the sigmoid is sometimes called the "expit" function (the inverse of the logit).
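SciPy exposes both directions of this bijection, which makes the round trip easy to check:

```python
from scipy.special import expit, logit

p = expit(1.25)   # sigmoid: ~0.7773
z = logit(p)      # log-odds: recovers 1.25
```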
One of the most elegant properties of the sigmoid function is that its derivative can be expressed entirely in terms of the function itself:
σ'(x) = σ(x) * (1 - σ(x))
This compact form makes the derivative computationally efficient to evaluate during backpropagation, since the value of σ(x) has typically already been computed during the forward pass. There is no need for an additional exponential calculation.
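A short NumPy sketch makes the reuse explicit and sanity-checks the identity against a finite difference:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 1.0
s = sigmoid(x)          # forward-pass value, typically already cached
grad = s * (1.0 - s)    # derivative from the identity: ~0.1966

# Sanity check against a central finite difference
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)
assert abs(grad - numeric) < 1e-9
```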
The derivation proceeds as follows. Starting from σ(x) = 1 / (1 + e^(-x)), applying the quotient rule yields:
σ'(x) = e^(-x) / (1 + e^(-x))^2
This can be rewritten by noting that e^(-x) / (1 + e^(-x))^2 = [1 / (1 + e^(-x))] * [e^(-x) / (1 + e^(-x))] = σ(x) * [1 - σ(x)].
The derivative reaches its maximum value of 0.25 at x = 0, and it approaches 0 as x moves toward positive or negative infinity. This means the gradient is strongest when the input is near zero and weakest when the input has a large absolute value.
| Input (x) | σ(x) | σ'(x) |
|---|---|---|
| -5 | 0.0067 | 0.0066 |
| -3 | 0.0474 | 0.0452 |
| -1 | 0.2689 | 0.1966 |
| 0 | 0.5000 | 0.2500 |
| 1 | 0.7311 | 0.1966 |
| 3 | 0.9526 | 0.0452 |
| 5 | 0.9933 | 0.0066 |
The table above illustrates how the derivative shrinks rapidly for inputs far from zero. This behavior is directly related to the vanishing gradient problem discussed below.
The vanishing gradient problem is one of the most significant drawbacks of using the sigmoid function as an activation function in deep neural networks. Because the maximum value of the sigmoid derivative is only 0.25, gradients become progressively smaller as they are propagated backward through multiple layers during backpropagation.
To understand why this is problematic, consider a deep network with many layers. During backpropagation, the gradient at each layer is multiplied by the local derivative of the activation function. If each layer uses a sigmoid activation, and each local derivative is at most 0.25, then after passing through n layers, the gradient is scaled by at most (0.25)^n. For a network with 10 layers, this means the gradient reaching the first layer is at most (0.25)^10, which is roughly 0.00000095. The gradient effectively vanishes.
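This compounding is easy to observe directly. The following minimal PyTorch sketch chains ten sigmoids with no weights at all, purely to isolate the activation's effect, and prints the gradient that survives the round trip:

```python
import torch

x = torch.tensor(0.0, requires_grad=True)
y = x
for _ in range(10):
    y = torch.sigmoid(y)   # ten stacked sigmoid activations
y.backward()
print(x.grad)  # roughly 4e-7, even below the 0.25**10 upper bound,
               # because the inputs drift away from zero layer by layer
```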
The following table illustrates how quickly the maximum possible gradient shrinks with depth:
| Number of layers | Max gradient reaching first layer |
|---|---|
| 2 | 0.0625 |
| 4 | 0.00390625 |
| 6 | 0.000244 |
| 8 | 0.0000153 |
| 10 | 0.00000095 |
| 20 | 9.1 x 10^(-13) |
This vanishing gradient has several practical consequences: the layers closest to the input receive almost no learning signal and train far more slowly than later layers; overall convergence slows down or stalls entirely; and adding depth can hurt performance rather than help it.
The vanishing gradient problem was a major obstacle to training deep networks throughout the 1990s and early 2000s. It was one of the primary reasons researchers explored alternative activation functions. Xavier Glorot and Yoshua Bengio published an influential 2010 study demonstrating the severity of this issue and proposing improved weight initialization strategies (known as Xavier or Glorot initialization) to partially mitigate it.
The sigmoid function is one of several common activation functions used in neural networks. Each function has different properties that make it suited for different situations.
The hyperbolic tangent (tanh) function is closely related to the sigmoid function. It can be written as:
tanh(x) = 2 * σ(2x) - 1
The key difference is that tanh maps inputs to the range (-1, 1), making it zero-centered. This zero-centered property means that the outputs of tanh have both positive and negative values, which helps subsequent layers learn more efficiently because the gradients do not consistently push weights in one direction. The maximum derivative of tanh is 1.0 (compared to 0.25 for sigmoid), which somewhat mitigates the vanishing gradient problem, though tanh still suffers from saturation at extreme input values. In practice, tanh was the preferred activation function before ReLU gained popularity, and it continues to be used inside gating mechanisms alongside the sigmoid.
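The identity relating the two functions is straightforward to verify numerically (a quick NumPy check):

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 7)
lhs = np.tanh(x)
rhs = 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0   # 2 * sigmoid(2x) - 1
assert np.allclose(lhs, rhs)
```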
The Rectified Linear Unit (ReLU) function, defined as ReLU(x) = max(0, x), was introduced as a solution to many of the sigmoid's shortcomings. Nair and Hinton popularized ReLU in 2010, demonstrating that it allowed faster training of deep networks. ReLU does not saturate for positive inputs, which means it does not suffer from the vanishing gradient problem in the positive domain. Its derivative is either 0 (for negative inputs) or 1 (for positive inputs), allowing gradients to flow unchanged through active neurons. ReLU is also computationally cheaper than the sigmoid, since it involves only a simple thresholding operation rather than an exponential calculation.
However, ReLU introduces the "dying ReLU" problem, where neurons with negative inputs always output zero and receive zero gradient, making them permanently inactive. Variants such as Leaky ReLU, Parametric ReLU (PReLU), and ELU have been developed to address this issue. The landmark 2012 AlexNet paper by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton used ReLU activations throughout its hidden layers, marking a turning point in deep learning practice and firmly establishing ReLU as the default choice for hidden layers.
Several other activation functions have been proposed as alternatives to both sigmoid and ReLU:
| Property | Sigmoid | Tanh | ReLU | Swish | GELU |
|---|---|---|---|---|---|
| Output range | (0, 1) | (-1, 1) | [0, infinity) | Unbounded above (min ≈ -0.28) | Unbounded above (min ≈ -0.17) |
| Zero-centered | No | Yes | No | Approximately | Approximately |
| Max derivative | 0.25 | 1.0 | 1.0 | Varies | Varies |
| Vanishing gradient | Severe | Moderate | No (positive side) | Mild | Mild |
| Computational cost | High | High | Low | Moderate | Moderate |
| Common use case | Output layer, gates | LSTM gates, hidden layers | Hidden layers in deep networks | Hidden layers in deep networks | Transformer models |
| Dying neuron problem | No | No | Yes | No | No |
The sigmoid function is commonly used in machine learning across a variety of tasks and architectures.
Logistic regression is perhaps the most classical application of the sigmoid function. In logistic regression, the model computes a linear combination of input features (z = w^T * x + b) and then passes the result through the sigmoid function to produce a probability estimate:
P(y=1|x) = σ(w^T * x + b)
This probability estimate is then used for binary classification. The decision boundary is set at σ(z) = 0.5, which corresponds to z = 0. The loss function used for training logistic regression is the binary cross-entropy loss (also called the log loss), which is derived from maximum likelihood estimation under the assumption that the output follows a Bernoulli distribution parameterized by the sigmoid output.
The binary cross-entropy loss for a single sample is:
L = -[y * log(σ(z)) + (1 - y) * log(1 - σ(z))]
where y is the true label (0 or 1) and z is the linear output. The sigmoid and log functions interact gracefully here: the gradient of the loss with respect to z simplifies to σ(z) - y, which keeps optimization well-behaved.
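As a concrete illustration, here is a minimal NumPy sketch of logistic regression trained by gradient descent on made-up toy data, using the σ(z) - y gradient directly:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up toy data: 4 samples, 2 features
X = np.array([[0.5, 1.2], [1.0, -0.3], [-1.5, 0.4], [-0.7, -1.1]])
y = np.array([1.0, 1.0, 0.0, 0.0])

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(1000):
    p = sigmoid(X @ w + b)
    # Gradient of mean binary cross-entropy w.r.t. z is (p - y) / n
    w -= lr * X.T @ (p - y) / len(y)
    b -= lr * np.mean(p - y)

print(sigmoid(X @ w + b))   # probabilities pushed toward the 1/1/0/0 labels
```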
In modern neural networks, the sigmoid function is most commonly placed at the output layer of a network designed for binary classification or multi-label classification. While ReLU or its variants are preferred for hidden layers, the sigmoid remains the standard choice for output neurons that need to produce probability estimates in the (0, 1) range.
For multi-label classification, where each sample can belong to multiple classes simultaneously, a sigmoid activation is applied independently to each output neuron. This differs from softmax, which is used for mutually exclusive multi-class classification. In multi-label settings, the sigmoid allows each class probability to be computed independently, so multiple classes can have high probabilities at the same time.
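The contrast is visible in a two-line PyTorch comparison (illustrative logits):

```python
import torch

logits = torch.tensor([2.0, 1.5, -0.5])

multi_label = torch.sigmoid(logits)         # independent probabilities,
                                            # ~[0.881, 0.818, 0.378]; may sum > 1
multi_class = torch.softmax(logits, dim=0)  # mutually exclusive; sums to 1
```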
The sigmoid function plays a critical role in gating mechanisms within recurrent neural networks. In the Long Short-Term Memory (LSTM) architecture, sigmoid functions are used to control three gates:
| Gate | Purpose |
|---|---|
| Forget gate | Determines how much of the previous cell state to retain. A sigmoid output near 0 means "forget this information," and near 1 means "keep this information." |
| Input gate | Controls how much of the new candidate information to add to the cell state. |
| Output gate | Determines how much of the cell state to expose as the hidden state output. |
Similarly, in Gated Recurrent Units (GRU), sigmoid functions control the update gate and the reset gate. The bounded (0, 1) output of the sigmoid makes it a natural choice for these gating mechanisms, since the gates need to represent the degree to which information flows through. Notably, the vanishing gradient problem does not affect sigmoid gates in LSTMs and GRUs as severely as it does in plain feedforward networks. The additive structure of the cell state update in LSTMs creates a "gradient highway" that allows gradients to flow across many time steps without repeated multiplication through sigmoid derivatives.
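A minimal sketch of a forget-gate computation, with hypothetical untrained parameters, shows why the (0, 1) range is exactly what a gate needs:

```python
import torch

hidden_size = 8
# Hypothetical gate parameters; a real LSTM learns these during training
W_f = torch.randn(hidden_size, 2 * hidden_size)
b_f = torch.zeros(hidden_size)

h_prev = torch.randn(hidden_size)   # previous hidden state
x_t = torch.randn(hidden_size)      # current input
c_prev = torch.randn(hidden_size)   # previous cell state

# Forget gate: each element is squashed into (0, 1) by the sigmoid
f_t = torch.sigmoid(W_f @ torch.cat([h_prev, x_t]) + b_f)
c_partial = f_t * c_prev            # elementwise fraction of old state kept
```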
In generative adversarial networks (GANs), the discriminator network typically uses a sigmoid activation at its output layer. The discriminator's task is to distinguish real data samples from fake samples produced by the generator. The sigmoid maps the discriminator's raw output (a logit) to a probability between 0 and 1, where values near 1 indicate that the input is likely real and values near 0 indicate that it is likely fake.
The original GAN formulation by Goodfellow et al. (2014) trains the discriminator using binary cross-entropy loss with target labels of 1 for real images and 0 for generated images. The sigmoid output directly provides the estimated probability D(x) that a given sample x is real. In practice, many implementations use BCEWithLogitsLoss (in PyTorch) or sigmoid_cross_entropy_with_logits (in TensorFlow) to combine the sigmoid and the loss computation into a single numerically stable operation.
Some later GAN variants, such as Wasserstein GANs, removed the sigmoid from the discriminator (using a linear output instead), but the original sigmoid-based formulation remains widely used and studied.
In attention mechanisms, sigmoid activations are sometimes used to compute attention weights, particularly in scenarios where the attention is not mutually exclusive across positions. For example, in some transformer variants and in multi-head attention designs, sigmoid-based attention allows each position to attend to multiple other positions independently, rather than distributing a fixed amount of attention through softmax normalization. Recent research on "sigmoid attention" has shown promising results as an alternative to softmax-based attention in certain architectures.
The sigmoid function serves as the canonical link function in Bayesian logistic regression and in variational inference, where it maps real-valued parameters to probabilities. In variational autoencoders (VAEs), sigmoid activations are used in the decoder when the output is binary or bounded between 0 and 1 (for example, normalized pixel intensities in image generation).
Platt scaling is a post-processing technique that uses the sigmoid function to calibrate the outputs of a classifier into well-calibrated probabilities. The method was introduced by John Platt in 1999, originally in the context of support vector machines (SVMs).
The idea is straightforward: given a classifier that produces raw scores f(x), Platt scaling fits a logistic regression model on top of those scores. The calibrated probability is computed as:
P(y=1|f(x)) = 1 / (1 + exp(A * f(x) + B))
where A and B are scalar parameters learned by maximizing the likelihood on a held-out validation set. This is the same sigmoid function with a linear transformation of the classifier's output.
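In practice the two parameters can be fit with any logistic regression routine. The following scikit-learn sketch uses made-up validation scores, and omits one detail of Platt's original method (he slightly regularized the 0/1 targets):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical held-out scores from some base classifier, with true labels
scores = np.array([-2.1, -0.4, 0.3, 1.8, 2.5]).reshape(-1, 1)
y_val = np.array([0, 1, 0, 1, 1])

# A one-feature logistic regression on the scores learns the same two
# parameters as Platt scaling (up to sign conventions for A and B)
calibrator = LogisticRegression(C=1e6)   # large C: effectively unregularized
calibrator.fit(scores, y_val)
calibrated = calibrator.predict_proba(scores)[:, 1]
```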
Platt scaling works well when the distortion in predicted probabilities has a sigmoidal shape, which is common for max-margin classifiers like SVMs and for boosted tree models. For neural networks, a related technique called temperature scaling adjusts the softmax temperature to improve calibration. When calibration data is plentiful, non-parametric methods such as isotonic regression may outperform Platt scaling, but the sigmoid-based approach has the advantage of requiring relatively few samples to fit its two parameters.
The hard sigmoid is a piecewise linear approximation of the standard sigmoid function, designed to be computationally cheaper to evaluate. It replaces the exponential computation with simple comparisons and a linear function:
hard_sigmoid(x) = max(0, min(1, 0.2x + 0.5))
This function clips the output to [0, 1] and uses a straight line in between. It is nearly identical to the true sigmoid near x = 0 but diverges at the tails, where it saturates abruptly instead of approaching the asymptotes gradually.
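In code, the slope-0.2 variant shown above is a one-liner (a sketch; frameworks differ in the exact slope, as noted below):

```python
import numpy as np

def hard_sigmoid(x):
    # Linear ramp of slope 0.2 through (0, 0.5), clipped to [0, 1];
    # saturates exactly at x = -2.5 and x = 2.5
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)
```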
| Property | Standard sigmoid | Hard sigmoid |
|---|---|---|
| Formula | 1 / (1 + e^(-x)) | max(0, min(1, 0.2x + 0.5)) |
| Computation | Requires exponentiation | Only addition, multiplication, clipping |
| Speed | Slower | Significantly faster |
| Accuracy | Exact | Approximate |
| Differentiability | Smooth everywhere | Not differentiable at x = -2.5 and x = 2.5 |
| Use case | General purpose | Mobile and edge deployment |
The hard sigmoid gained attention through its use in MobileNetV3 (Howard et al., 2019), where it replaced the standard sigmoid in squeeze-and-excitation (SE) modules. By avoiding expensive exponentiation, the hard sigmoid reduces inference latency on mobile and embedded devices with limited computational resources. Keras and PyTorch both provide built-in hard sigmoid implementations, though the exact definitions vary slightly between frameworks (PyTorch's nn.Hardsigmoid, for example, uses a slope of 1/6 and saturates at x = ±3).
When implementing the sigmoid function in practice, numerical stability must be handled carefully. For large positive values of x, e^(-x) underflows harmlessly toward zero and the result simply rounds to 1. For large negative values of x, however, e^(-x) can overflow the floating-point range. A common stable implementation therefore branches on the sign of x; a minimal NumPy sketch of the idea:
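```python
import numpy as np

def stable_sigmoid(x):
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # For x >= 0, e^(-x) <= 1, so the textbook form is safe
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # For x < 0, use the e^x / (e^x + 1) form so the exponent stays negative
    e = np.exp(x[~pos])
    out[~pos] = e / (e + 1.0)
    return out
```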
Similarly, when computing the log of the sigmoid (which appears in the cross-entropy loss), a numerically stable form is:
log(σ(x)) = -softplus(-x) = -log(1 + e^(-x))
where softplus(x) = log(1 + e^x). Most deep learning frameworks provide dedicated log_sigmoid functions that handle these stability concerns internally.
In PyTorch, the BCEWithLogitsLoss function combines the sigmoid and binary cross-entropy loss into a single operation that uses the log-sum-exp trick to avoid numerical overflow. TensorFlow offers an equivalent function called sigmoid_cross_entropy_with_logits. These combined functions are strongly preferred over applying the sigmoid and loss separately, because the naive approach can produce NaN values when the sigmoid output is exactly 0 or 1 due to floating-point rounding.
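The failure mode is simple to reproduce. In the following PyTorch sketch, the sigmoid of a large logit rounds to exactly 1.0 in float32, so the separate loss computation returns infinity, while the fused version stays finite:

```python
import torch
import torch.nn.functional as F

logit = torch.tensor([20.0])
target = torch.tensor([0.0])

p = torch.sigmoid(logit)   # rounds to exactly 1.0 in float32
naive = -(target * torch.log(p) + (1 - target) * torch.log(1 - p))
# -> tensor([inf]): log(1 - 1.0) blows up

fused = F.binary_cross_entropy_with_logits(logit, target)
# -> tensor(20.): finite and essentially exact
```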
The sigmoid function is available as a built-in operation in all major deep learning frameworks and numerical libraries. The following table summarizes the most common ways to invoke it:
| Framework | Function call | Notes |
|---|---|---|
| PyTorch | torch.sigmoid(x) or torch.nn.Sigmoid() | Element-wise; supports autograd |
| TensorFlow | tf.math.sigmoid(x) or tf.keras.activations.sigmoid | Also available as a Keras layer activation |
| NumPy | 1 / (1 + np.exp(-x)) or scipy.special.expit(x) | SciPy's expit is numerically stable |
| JAX | jax.nn.sigmoid(x) | Supports JIT compilation and autodiff |
| ONNX | Sigmoid operator | Standard operator in the ONNX spec |
When using these implementations, it is generally best to let the framework handle the sigmoid computation rather than writing a custom version, because the built-in functions include numerical stability safeguards.
The logistic sigmoid described above is the most common member of a broader family of sigmoid functions. Any function with an S-shaped curve can technically be called a sigmoid. Other examples include:
| Function | Formula | Range |
|---|---|---|
| Logistic sigmoid | 1 / (1 + e^(-x)) | (0, 1) |
| Hyperbolic tangent | (e^x - e^(-x)) / (e^x + e^(-x)) | (-1, 1) |
| Arctangent | arctan(x) | (-π/2, π/2) |
| Error function (erf) | (2/sqrt(π)) * integral(e^(-t^2), 0, x) | (-1, 1) |
| Algebraic sigmoid | x / sqrt(1 + x^2) | (-1, 1) |
| Gudermannian function | 2 * arctan(tanh(x/2)) | (-π/2, π/2) |
All of these functions share the characteristic S-shape, with bounded output, monotonicity, and horizontal asymptotes. The logistic sigmoid is the most widely used in machine learning due to its probabilistic interpretation and its convenient derivative.
The sigmoid function appears in numerous scientific and engineering fields outside of machine learning.
The sigmoid's origins lie in population modeling. Verhulst's logistic equation describes how a population grows rapidly at first and then slows as it approaches a carrying capacity K:
P(t) = K / (1 + e^(-r(t - t_0)))
where r is the intrinsic growth rate and t_0 is the midpoint of the growth curve. This S-shaped growth pattern has been observed in bacteria colonies, animal populations, and the adoption of new technologies.
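Evaluating the curve with illustrative, made-up parameters (K = 1000, r = 0.5, t_0 = 10) shows the three phases of growth:

```python
import numpy as np

K, r, t0 = 1000.0, 0.5, 10.0
t = np.array([0.0, 5.0, 10.0, 15.0, 20.0])
P = K / (1.0 + np.exp(-r * (t - t0)))
# P ~ [6.7, 75.9, 500.0, 924.1, 993.3]: slow start, steep middle, saturation
```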
In pharmacology, a sigmoidal curve known as the Hill equation models the relationship between the dose of a drug and its physiological effect. The dose-response curve typically follows a sigmoidal shape: at low doses the effect is negligible, it increases steeply through a mid-range, and then plateaus at higher doses as receptors become saturated. The Hill coefficient controls the steepness of the curve, and the EC50 parameter represents the dose at which 50% of the maximum effect is achieved.
The logistic sigmoid also models the diffusion of innovations in economics and marketing. When a new product or technology enters a market, the adoption rate typically follows an S-curve: slow initial uptake by early adopters, rapid growth as the mainstream population adopts, and a tapering off as the market reaches saturation.
In audio engineering, sigmoid-like functions serve as waveshaper transfer functions used to emulate the soft clipping behavior of analog circuits, particularly vacuum tube amplifiers. This type of nonlinear distortion produces harmonically rich sounds that are preferred in music production.
The sigmoid function has a rich history in both mathematics and computing. Its origins trace back to Pierre François Verhulst (1804-1849), a Belgian mathematician who introduced the logistic function in a series of three papers between 1838 and 1847 while studying population dynamics. Verhulst proposed it as an improvement over the exponential growth model, incorporating a carrying capacity that limits population growth. The resulting logistic equation produces the characteristic S-curve. He coined the term "logistique" for the curve, though the exact reason for this name remains debated by historians.
The function was later adopted by statisticians for binary regression problems. Joseph Berkson introduced the term "logit" in the 1940s as a portmanteau of "logistic unit," establishing the foundation for logistic regression. David Cox formalized logistic regression as a statistical method in his influential 1958 paper, making the sigmoid function central to applied statistics.
In the context of neural networks, the sigmoid gained prominence in the 1980s when backpropagation was popularized by David Rumelhart, Geoffrey Hinton, and Ronald Williams in their 1986 paper "Learning representations by back-propagating errors." The sigmoid's differentiability made it a natural candidate for gradient-based training. For the next two decades, the sigmoid (and its cousin tanh) dominated as the activation function of choice.
The shift away from sigmoid activations in hidden layers began around 2010-2012, when researchers demonstrated that ReLU allowed much deeper networks to be trained effectively. The 2012 AlexNet results discussed earlier cemented this transition in deep learning practice.
Despite this shift, the sigmoid function remains indispensable in specific contexts, particularly in output layers, gating mechanisms, and any situation requiring a probabilistic output. It is one of the few activation functions that has maintained continuous relevance from the earliest days of neural network research to the present.
Imagine you have a long line of numbers, some positive and some negative. You want to squeeze these numbers into a smaller range, from 0 to 1, so that you can easily compare them. The sigmoid function does exactly that. It is like a magical slide that takes each number and slides it into the 0 to 1 range. Big positive numbers end up close to 1, and big negative numbers end up close to 0. Numbers near zero land right around 0.5, in the middle. This is very helpful in machine learning because it helps computers understand things like probabilities and make decisions based on them.