See also: Machine learning terms
In machine learning, the sigmoid function is a widely used mathematical function that maps any real-valued input into the open interval (0, 1), allowing its output to be interpreted as a probability. It is employed in many machine learning algorithms, particularly in artificial neural networks and logistic regression models, to convert continuous inputs into probabilities for binary classification tasks. The sigmoid function is characterized by its distinctive S-shaped curve, also known as the logistic curve. The function was originally studied in the context of population growth models by Pierre François Verhulst in the 1830s and 1840s, and it has since become one of the most recognizable functions in deep learning and statistics.
Beyond its role as an activation function, the sigmoid function appears throughout probability theory, signal processing, and information science. Its ability to compress any real-valued number into the (0, 1) interval makes it a natural choice whenever an output must be interpreted as a probability.
The sigmoid function, denoted as σ(x), is mathematically defined as:
σ(x) = 1 / (1 + e^(-x))
where x represents the input value and e is the base of the natural logarithm, approximately equal to 2.71828. The function compresses input values into a range between 0 and 1, with the output approaching 0 as x tends to negative infinity and approaching 1 as x tends to positive infinity.
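As a quick illustration, the definition translates directly into a few lines of NumPy (a minimal sketch; see the implementation section below for a numerically careful version):

```python
import numpy as np

def sigmoid(x):
    # Direct translation of the definition 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-2.0, 0.0, 2.0])))
# -> [0.11920292 0.5        0.88079708]
```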
The sigmoid function can also be expressed in terms of the hyperbolic tangent:
σ(x) = (1 + tanh(x/2)) / 2
This relationship shows that the sigmoid and tanh functions are closely related through a simple linear transformation.
An equivalent representation rewrites the function with a positive exponent:
σ(x) = e^x / (e^x + 1)
This alternative is sometimes preferred for numerical stability when dealing with large negative values of x, since it avoids computing e raised to a large positive power in the denominator.
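The difference matters in plain Python, where math.exp raises an OverflowError rather than returning infinity. A minimal sketch of the two forms:

```python
import math

def sigmoid_naive(x):
    return 1.0 / (1.0 + math.exp(-x))    # math.exp(1000) overflows

def sigmoid_alt(x):
    e = math.exp(x)                      # e^x is tiny and safe for x << 0
    return e / (e + 1.0)

print(sigmoid_alt(-1000.0))   # 0.0 (underflows to zero, the correct limit)
# sigmoid_naive(-1000.0)      # would raise OverflowError: math range error
```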
The sigmoid function possesses several key properties that make it suitable for machine learning applications:
| Property | Description |
|---|---|
| Differentiable | The sigmoid function is smooth and continuously differentiable at every point, which is essential for gradient-based optimization algorithms such as gradient descent. |
| Monotonic | The function is strictly increasing, meaning that the output will increase as the input increases. This property ensures that the function preserves the order of input values. |
| Bounded range | The function's output range is limited to the open interval (0, 1), making it ideal for representing probabilities or binary decisions. |
| Symmetry | The sigmoid function is symmetric around the point (0, 0.5). Specifically, σ(-x) = 1 - σ(x) for all x. |
| Asymptotic behavior | As x approaches positive infinity, σ(x) approaches 1. As x approaches negative infinity, σ(x) approaches 0. The function never actually reaches 0 or 1. |
| Fixed midpoint | The output at x = 0 is exactly 0.5, which serves as a natural decision boundary in classification tasks. |
| Log-odds relationship | The inverse of the sigmoid is the logit function: logit(p) = log(p / (1 - p)). This provides a direct connection to log-odds in statistics. |
The logit function is the mathematical inverse of the sigmoid. Given a probability p in the interval (0, 1), the logit maps it back to the real number line:
logit(p) = log(p / (1 - p))
This function converts a probability into log-odds and is central to logistic regression, where the model assumes that the log-odds of the outcome vary linearly with the input features. In statistics, the logit function is also called the log-odds function. The sigmoid and logit together form a bijection between the real line and the open interval (0, 1), which is why the sigmoid is sometimes called the "expit" function (the inverse of the logit).
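SciPy exposes both directions of this bijection, which makes the round trip easy to check:

```python
from scipy.special import expit, logit

p = expit(1.25)   # sigmoid: ~0.7773
z = logit(p)      # log-odds: recovers 1.25
```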
One of the most elegant properties of the sigmoid function is that its derivative can be expressed entirely in terms of the function itself:
σ'(x) = σ(x) * (1 - σ(x))
This compact form makes the derivative computationally efficient to evaluate during backpropagation, since the value of σ(x) has typically already been computed during the forward pass. There is no need for an additional exponential calculation.
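A short NumPy sketch makes the reuse explicit and sanity-checks the identity against a finite difference:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 1.0
s = sigmoid(x)          # forward-pass value, typically already cached
grad = s * (1.0 - s)    # derivative from the identity: ~0.1966

# Sanity check against a central finite difference
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)
assert abs(grad - numeric) < 1e-9
```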
The derivation proceeds as follows. Starting from σ(x) = 1 / (1 + e^(-x)), applying the quotient rule yields:
σ'(x) = e^(-x) / (1 + e^(-x))^2
This can be rewritten by noting that e^(-x) / (1 + e^(-x))^2 = [1 / (1 + e^(-x))] * [e^(-x) / (1 + e^(-x))] = σ(x) * [1 - σ(x)].
The derivative reaches its maximum value of 0.25 at x = 0, and it approaches 0 as x moves toward positive or negative infinity. This means the gradient is strongest when the input is near zero and weakest when the input has a large absolute value.
| Input (x) | σ(x) | σ'(x) |
|---|---|---|
| -5 | 0.0067 | 0.0066 |
| -3 | 0.0474 | 0.0452 |
| -1 | 0.2689 | 0.1966 |
| 0 | 0.5000 | 0.2500 |
| 1 | 0.7311 | 0.1966 |
| 3 | 0.9526 | 0.0452 |
| 5 | 0.9933 | 0.0066 |
The table above illustrates how the derivative shrinks rapidly for inputs far from zero. This behavior is directly related to the vanishing gradient problem discussed below.
The vanishing gradient problem is one of the most significant drawbacks of using the sigmoid function as an activation function in deep neural networks. Because the maximum value of the sigmoid derivative is only 0.25, gradients become progressively smaller as they are propagated backward through multiple layers during backpropagation.
To understand why this is problematic, consider a deep network with many layers. During backpropagation, the gradient at each layer is multiplied by the local derivative of the activation function. If each layer uses a sigmoid activation, and each local derivative is at most 0.25, then after passing through n layers, the gradient is scaled by at most (0.25)^n. For a network with 10 layers, this means the gradient reaching the first layer is at most (0.25)^10, which is roughly 0.00000095. The gradient effectively vanishes.
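This compounding is easy to observe directly. The following minimal PyTorch sketch chains ten sigmoids with no weights at all, purely to isolate the activation's effect, and prints the gradient that survives the round trip:

```python
import torch

x = torch.tensor(0.0, requires_grad=True)
y = x
for _ in range(10):
    y = torch.sigmoid(y)   # ten stacked sigmoid activations
y.backward()
print(x.grad)  # roughly 4e-7, even below the 0.25**10 upper bound,
               # because the inputs drift away from zero layer by layer
```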
The following table illustrates how quickly the maximum possible gradient shrinks with depth:
| Number of layers | Max gradient reaching first layer |
|---|---|
| 2 | 0.0625 |
| 4 | 0.00390625 |
| 6 | 0.000244 |
| 8 | 0.0000153 |
| 10 | 0.00000095 |
| 20 | 9.1 x 10^(-13) |
This vanishing gradient has several practical consequences: the layers closest to the input receive almost no learning signal and train far more slowly than later layers; overall convergence slows down or stalls entirely; and adding depth can hurt performance rather than help it.
The vanishing gradient problem was a major obstacle to training deep networks throughout the 1990s and early 2000s. It was one of the primary reasons researchers explored alternative activation functions. Xavier Glorot and Yoshua Bengio published an influential 2010 study demonstrating the severity of this issue and proposing improved weight initialization strategies (known as Xavier or Glorot initialization) to partially mitigate it.
The sigmoid function is one of several common activation functions used in neural networks. Each function has different properties that make it suited for different situations.
The hyperbolic tangent (tanh) function is closely related to the sigmoid function. It can be written as:
tanh(x) = 2 * σ(2x) - 1
The key difference is that tanh maps inputs to the range (-1, 1), making it zero-centered. This zero-centered property means that the outputs of tanh have both positive and negative values, which helps subsequent layers learn more efficiently because the gradients do not consistently push weights in one direction. The maximum derivative of tanh is 1.0 (compared to 0.25 for sigmoid), which somewhat mitigates the vanishing gradient problem, though tanh still suffers from saturation at extreme input values. In practice, tanh was the preferred activation function before ReLU gained popularity, and it continues to be used inside gating mechanisms alongside the sigmoid.
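The identity relating the two functions is straightforward to verify numerically (a quick NumPy check):

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 7)
lhs = np.tanh(x)
rhs = 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0   # 2 * sigmoid(2x) - 1
assert np.allclose(lhs, rhs)
```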
The Rectified Linear Unit (ReLU) function, defined as ReLU(x) = max(0, x), was introduced as a solution to many of the sigmoid's shortcomings. Nair and Hinton popularized ReLU in 2010, demonstrating that it allowed faster training of deep networks. ReLU does not saturate for positive inputs, which means it does not suffer from the vanishing gradient problem in the positive domain. Its derivative is either 0 (for negative inputs) or 1 (for positive inputs), allowing gradients to flow unchanged through active neurons. ReLU is also computationally cheaper than the sigmoid, since it involves only a simple thresholding operation rather than an exponential calculation.
However, ReLU introduces the "dying ReLU" problem, where neurons with negative inputs always output zero and receive zero gradient, making them permanently inactive. Variants such as Leaky ReLU, Parametric ReLU (PReLU), and ELU have been developed to address this issue. The landmark 2012 AlexNet paper by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton used ReLU activations throughout its hidden layers, marking a turning point in deep learning practice and firmly establishing ReLU as the default choice for hidden layers.
Several other activation functions have been proposed as alternatives to both sigmoid and ReLU:
| Property | Sigmoid | Tanh | ReLU | Swish | GELU |
|---|---|---|---|---|---|
| Output range | (0, 1) | (-1, 1) | [0, infinity) | Unbounded above (min ≈ -0.28) | Unbounded above (min ≈ -0.17) |
| Zero-centered | No | Yes | No | Approximately | Approximately |
| Max derivative | 0.25 | 1.0 | 1.0 | Varies | Varies |
| Vanishing gradient | Severe | Moderate | No (positive side) | Mild | Mild |
| Computational cost | High | High | Low | Moderate | Moderate |
| Common use case | Output layer, gates | LSTM gates, hidden layers | Hidden layers in deep networks | Hidden layers in deep networks | Transformer models |
| Dying neuron problem | No | No | Yes | No | No |
The sigmoid function is commonly used in machine learning across a variety of tasks and architectures.
Logistic regression is perhaps the most classical application of the sigmoid function. In logistic regression, the model computes a linear combination of input features (z = w^T * x + b) and then passes the result through the sigmoid function to produce a probability estimate:
P(y=1|x) = σ(w^T * x + b)
This probability estimate is then used for binary classification. The decision boundary is set at σ(z) = 0.5, which corresponds to z = 0. The loss function used for training logistic regression is the binary cross-entropy loss (also called the log loss), which is derived from maximum likelihood estimation under the assumption that the output follows a Bernoulli distribution parameterized by the sigmoid output.
The binary cross-entropy loss for a single sample is:
L = -[y * log(σ(z)) + (1 - y) * log(1 - σ(z))]
where y is the true label (0 or 1) and z is the linear output. The sigmoid and log functions interact gracefully here: the gradient of the loss with respect to z simplifies to σ(z) - y, which keeps optimization well-behaved.
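As a concrete illustration, here is a minimal NumPy sketch of logistic regression trained by gradient descent on made-up toy data, using the σ(z) - y gradient directly:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up toy data: 4 samples, 2 features
X = np.array([[0.5, 1.2], [1.0, -0.3], [-1.5, 0.4], [-0.7, -1.1]])
y = np.array([1.0, 1.0, 0.0, 0.0])

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(1000):
    p = sigmoid(X @ w + b)
    # Gradient of mean binary cross-entropy w.r.t. z is (p - y) / n
    w -= lr * X.T @ (p - y) / len(y)
    b -= lr * np.mean(p - y)

print(sigmoid(X @ w + b))   # probabilities pushed toward the 1/1/0/0 labels
```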
In modern neural networks, the sigmoid function is most commonly placed at the output layer of a network designed for binary classification or multi-label classification. While ReLU or its variants are preferred for hidden layers, the sigmoid remains the standard choice for output neurons that need to produce probability estimates in the (0, 1) range.
For multi-label classification, where each sample can belong to multiple classes simultaneously, a sigmoid activation is applied independently to each output neuron. This differs from softmax, which is used for mutually exclusive multi-class classification. In multi-label settings, the sigmoid allows each class probability to be computed independently, so multiple classes can have high probabilities at the same time.
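The contrast is visible in a two-line PyTorch comparison (illustrative logits):

```python
import torch

logits = torch.tensor([2.0, 1.5, -0.5])

multi_label = torch.sigmoid(logits)         # independent probabilities,
                                            # ~[0.881, 0.818, 0.378]; may sum > 1
multi_class = torch.softmax(logits, dim=0)  # mutually exclusive; sums to 1
```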
The sigmoid function plays a critical role in gating mechanisms within recurrent neural networks. In the Long Short-Term Memory (LSTM) architecture, sigmoid functions are used to control three gates:
| Gate | Purpose |
|---|---|
| Forget gate | Determines how much of the previous cell state to retain. A sigmoid output near 0 means "forget this information," and near 1 means "keep this information." |
| Input gate | Controls how much of the new candidate information to add to the cell state. |
| Output gate | Determines how much of the cell state to expose as the hidden state output. |
Similarly, in Gated Recurrent Units (GRU), sigmoid functions control the update gate and the reset gate. The bounded (0, 1) output of the sigmoid makes it a natural choice for these gating mechanisms, since the gates need to represent the degree to which information flows through. Notably, the vanishing gradient problem does not affect sigmoid gates in LSTMs and GRUs as severely as it does in plain feedforward networks. The additive structure of the cell state update in LSTMs creates a "gradient highway" that allows gradients to flow across many time steps without repeated multiplication through sigmoid derivatives.
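A minimal sketch of a forget-gate computation, with hypothetical untrained parameters, shows why the (0, 1) range is exactly what a gate needs:

```python
import torch

hidden_size = 8
# Hypothetical gate parameters; a real LSTM learns these during training
W_f = torch.randn(hidden_size, 2 * hidden_size)
b_f = torch.zeros(hidden_size)

h_prev = torch.randn(hidden_size)   # previous hidden state
x_t = torch.randn(hidden_size)      # current input
c_prev = torch.randn(hidden_size)   # previous cell state

# Forget gate: each element is squashed into (0, 1) by the sigmoid
f_t = torch.sigmoid(W_f @ torch.cat([h_prev, x_t]) + b_f)
c_partial = f_t * c_prev            # elementwise fraction of old state kept
```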
In generative adversarial networks (GANs), the discriminator network typically uses a sigmoid activation at its output layer. The discriminator's task is to distinguish real data samples from fake samples produced by the generator. The sigmoid maps the discriminator's raw output (a logit) to a probability between 0 and 1, where values near 1 indicate that the input is likely real and values near 0 indicate that it is likely fake.
The original GAN formulation by Goodfellow et al. (2014) trains the discriminator using binary cross-entropy loss with target labels of 1 for real images and 0 for generated images. The sigmoid output directly provides the estimated probability D(x) that a given sample x is real. In practice, many implementations use BCEWithLogitsLoss (in PyTorch) or sigmoid_cross_entropy_with_logits (in TensorFlow) to combine the sigmoid and the loss computation into a single numerically stable operation.
Some later GAN variants, such as Wasserstein GANs, removed the sigmoid from the discriminator (using a linear output instead), but the original sigmoid-based formulation remains widely used and studied.
In attention mechanisms, sigmoid activations are sometimes used to compute attention weights, particularly in scenarios where the attention is not mutually exclusive across positions. For example, in some transformer variants and in multi-head attention designs, sigmoid-based attention allows each position to attend to multiple other positions independently, rather than distributing a fixed amount of attention through softmax normalization. Recent research on "sigmoid attention" has shown promising results as an alternative to softmax-based attention in certain architectures.
The sigmoid function serves as the canonical link function in Bayesian logistic regression and in variational inference, where it maps real-valued parameters to probabilities. In variational autoencoders (VAEs), sigmoid activations are used in the decoder when the output is binary or bounded between 0 and 1 (for example, normalized pixel intensities in image generation).
Platt scaling is a post-processing technique that uses the sigmoid function to calibrate the outputs of a classifier into well-calibrated probabilities. The method was introduced by John Platt in 1999, originally in the context of support vector machines (SVMs).
The idea is straightforward: given a classifier that produces raw scores f(x), Platt scaling fits a logistic regression model on top of those scores. The calibrated probability is computed as:
P(y=1|f(x)) = 1 / (1 + exp(A * f(x) + B))
where A and B are scalar parameters learned by maximizing the likelihood on a held-out validation set. This is the same sigmoid function with a linear transformation of the classifier's output.
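In practice the two parameters can be fit with any logistic regression routine. The following scikit-learn sketch uses made-up validation scores, and omits one detail of Platt's original method (he slightly regularized the 0/1 targets):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical held-out scores from some base classifier, with true labels
scores = np.array([-2.1, -0.4, 0.3, 1.8, 2.5]).reshape(-1, 1)
y_val = np.array([0, 1, 0, 1, 1])

# A one-feature logistic regression on the scores learns the same two
# parameters as Platt scaling (up to sign conventions for A and B)
calibrator = LogisticRegression(C=1e6)   # large C: effectively unregularized
calibrator.fit(scores, y_val)
calibrated = calibrator.predict_proba(scores)[:, 1]
```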
Platt scaling works well when the distortion in predicted probabilities has a sigmoidal shape, which is common for max-margin classifiers like SVMs and for boosted tree models. For neural networks, a related technique called temperature scaling adjusts the softmax temperature to improve calibration. When calibration data is plentiful, non-parametric methods such as isotonic regression may outperform Platt scaling, but the sigmoid-based approach has the advantage of requiring relatively few samples to fit its two parameters.
The hard sigmoid is a piecewise linear approximation of the standard sigmoid function, designed to be computationally cheaper to evaluate. It replaces the exponential computation with simple comparisons and a linear function:
hard_sigmoid(x) = max(0, min(1, 0.2x + 0.5))
This function clips the output to [0, 1] and uses a straight line in between. It is nearly identical to the true sigmoid near x = 0 but diverges at the tails, where it saturates abruptly instead of approaching the asymptotes gradually.
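In code, the slope-0.2 variant shown above is a one-liner (a sketch; frameworks differ in the exact slope, as noted below):

```python
import numpy as np

def hard_sigmoid(x):
    # Linear ramp of slope 0.2 through (0, 0.5), clipped to [0, 1];
    # saturates exactly at x = -2.5 and x = 2.5
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)
```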
| Property | Standard sigmoid | Hard sigmoid |
|---|---|---|
| Formula | 1 / (1 + e^(-x)) | max(0, min(1, 0.2x + 0.5)) |
| Computation | Requires exponentiation | Only addition, multiplication, clipping |
| Speed | Slower | Significantly faster |
| Accuracy | Exact | Approximate |
| Differentiability | Smooth everywhere | Not differentiable at x = -2.5 and x = 2.5 |
| Use case | General purpose | Mobile and edge deployment |
The hard sigmoid gained attention through its use in MobileNetV3 (Howard et al., 2019), where it replaced the standard sigmoid in squeeze-and-excitation (SE) modules. By avoiding expensive exponentiation, the hard sigmoid reduces inference latency on mobile and embedded devices with limited computational resources. Keras and PyTorch both provide built-in hard sigmoid implementations, though the exact definitions vary slightly between frameworks (PyTorch's nn.Hardsigmoid, for example, uses a slope of 1/6 and saturates at x = ±3).
When implementing the sigmoid function in practice, numerical stability must be handled carefully. For large positive values of x, e^(-x) underflows harmlessly toward zero and the result simply rounds to 1. For large negative values of x, however, e^(-x) can overflow the floating-point range. A common stable implementation therefore branches on the sign of x; a minimal NumPy sketch of the idea:
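```python
import numpy as np

def stable_sigmoid(x):
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # For x >= 0, e^(-x) <= 1, so the textbook form is safe
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # For x < 0, use the e^x / (e^x + 1) form so the exponent stays negative
    e = np.exp(x[~pos])
    out[~pos] = e / (e + 1.0)
    return out
```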
Similarly, when computing the log of the sigmoid (which appears in the cross-entropy loss), a numerically stable form is:
log(σ(x)) = -softplus(-x) = -log(1 + e^(-x))
where softplus(x) = log(1 + e^x). Most deep learning frameworks provide dedicated log_sigmoid functions that handle these stability concerns internally.
In PyTorch, the BCEWithLogitsLoss function combines the sigmoid and binary cross-entropy loss into a single operation that uses the log-sum-exp trick to avoid numerical overflow. TensorFlow offers an equivalent function called sigmoid_cross_entropy_with_logits. These combined functions are strongly preferred over applying the sigmoid and loss separately, because the naive approach can produce NaN values when the sigmoid output is exactly 0 or 1 due to floating-point rounding.
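The failure mode is simple to reproduce. In the following PyTorch sketch, the sigmoid of a large logit rounds to exactly 1.0 in float32, so the separate loss computation returns infinity, while the fused version stays finite:

```python
import torch
import torch.nn.functional as F

logit = torch.tensor([20.0])
target = torch.tensor([0.0])

p = torch.sigmoid(logit)   # rounds to exactly 1.0 in float32
naive = -(target * torch.log(p) + (1 - target) * torch.log(1 - p))
# -> tensor([inf]): log(1 - 1.0) blows up

fused = F.binary_cross_entropy_with_logits(logit, target)
# -> tensor(20.): finite and essentially exact
```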
The sigmoid function is available as a built-in operation in all major deep learning frameworks and numerical libraries. The following table summarizes the most common ways to invoke it:
| Framework | Function call | Notes |
|---|---|---|
| PyTorch | torch.sigmoid(x) or torch.nn.Sigmoid() | Element-wise; supports autograd |
| TensorFlow | tf.math.sigmoid(x) or tf.keras.activations.sigmoid | Also available as a Keras layer activation |
| NumPy | 1 / (1 + np.exp(-x)) or scipy.special.expit(x) | SciPy's expit is numerically stable |
| JAX | jax.nn.sigmoid(x) | Supports JIT compilation and autodiff |
| ONNX | Sigmoid operator | Standard operator in the ONNX spec |
When using these implementations, it is generally best to let the framework handle the sigmoid computation rather than writing a custom version, because the built-in functions include numerical stability safeguards.
The logistic sigmoid described above is the most common member of a broader family of sigmoid functions. Any function with an S-shaped curve can technically be called a sigmoid. Other examples include:
| Function | Formula | Range |
|---|---|---|
| Logistic sigmoid | 1 / (1 + e^(-x)) | (0, 1) |
| Hyperbolic tangent | (e^x - e^(-x)) / (e^x + e^(-x)) | (-1, 1) |
| Arctangent | arctan(x) | (-π/2, π/2) |
| Error function (erf) | (2/sqrt(π)) * integral(e^(-t^2), 0, x) | (-1, 1) |
| Algebraic sigmoid | x / sqrt(1 + x^2) | (-1, 1) |
| Gudermannian function | 2 * arctan(tanh(x/2)) | (-π/2, π/2) |
All of these functions share the characteristic S-shape, with bounded output, monotonicity, and horizontal asymptotes. The logistic sigmoid is the most widely used in machine learning due to its probabilistic interpretation and its convenient derivative.
The sigmoid function appears in numerous scientific and engineering fields outside of machine learning.
The sigmoid's origins lie in population modeling. Verhulst's logistic equation describes how a population grows rapidly at first and then slows as it approaches a carrying capacity K:
P(t) = K / (1 + e^(-r(t - t_0)))
where r is the intrinsic growth rate and t_0 is the midpoint of the growth curve. This S-shaped growth pattern has been observed in bacteria colonies, animal populations, and the adoption of new technologies.
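Evaluating the curve with illustrative, made-up parameters (K = 1000, r = 0.5, t_0 = 10) shows the three phases of growth:

```python
import numpy as np

K, r, t0 = 1000.0, 0.5, 10.0
t = np.array([0.0, 5.0, 10.0, 15.0, 20.0])
P = K / (1.0 + np.exp(-r * (t - t0)))
# P ~ [6.7, 75.9, 500.0, 924.1, 993.3]: slow start, steep middle, saturation
```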
In pharmacology, a sigmoidal curve known as the Hill equation models the relationship between the dose of a drug and its physiological effect. The dose-response curve typically follows a sigmoidal shape: at low doses the effect is negligible, it increases steeply through a mid-range, and then plateaus at higher doses as receptors become saturated. The Hill coefficient controls the steepness of the curve, and the EC50 parameter represents the dose at which 50% of the maximum effect is achieved.
The logistic sigmoid also models the diffusion of innovations in economics and marketing. When a new product or technology enters a market, the adoption rate typically follows an S-curve: slow initial uptake by early adopters, rapid growth as the mainstream population adopts, and a tapering off as the market reaches saturation.
In audio engineering, sigmoid-like functions serve as waveshaper transfer functions used to emulate the soft clipping behavior of analog circuits, particularly vacuum tube amplifiers. This type of nonlinear distortion produces harmonically rich sounds that are preferred in music production.
The sigmoid function has a rich history in both mathematics and computing. Its origins trace back to Pierre François Verhulst (1804-1849), a Belgian mathematician who introduced the logistic function in a series of three papers between 1838 and 1847 while studying population dynamics. Verhulst proposed it as an improvement over the exponential growth model, incorporating a carrying capacity that limits population growth. The resulting logistic equation produces the characteristic S-curve. He coined the term "logistique" for the curve, though the exact reason for this name remains debated by historians.
The function was later adopted by statisticians for binary regression problems. Joseph Berkson introduced the term "logit" in the 1940s as a portmanteau of "logistic unit," establishing the foundation for logistic regression. David Cox formalized logistic regression as a statistical method in his influential 1958 paper, making the sigmoid function central to applied statistics.
In the context of neural networks, the sigmoid gained prominence in the 1980s when backpropagation was popularized by David Rumelhart, Geoffrey Hinton, and Ronald Williams in their 1986 paper "Learning representations by back-propagating errors." The sigmoid's differentiability made it a natural candidate for gradient-based training. For the next two decades, the sigmoid (and its cousin tanh) dominated as the activation function of choice.
The shift away from sigmoid activations in hidden layers began around 2010-2012, when researchers demonstrated that ReLU allowed much deeper networks to be trained effectively. The 2012 AlexNet results discussed earlier cemented this transition in deep learning practice.
Despite this shift, the sigmoid function remains indispensable in specific contexts, particularly in output layers, gating mechanisms, and any situation requiring a probabilistic output. It is one of the few activation functions that has maintained continuous relevance from the earliest days of neural network research to the present.
Imagine you have a long line of numbers, some positive and some negative. You want to squeeze these numbers into a smaller range, from 0 to 1, so that you can easily compare them. The sigmoid function does exactly that. It is like a magical slide that takes each number and slides it into the 0 to 1 range. Big positive numbers end up close to 1, and big negative numbers end up close to 0. Numbers near zero land right around 0.5, in the middle. This is very helpful in machine learning because it helps computers understand things like probabilities and make decisions based on them.