# Neuron

> Source: https://aiwiki.ai/wiki/neuron
> Updated: 2026-06-23
> Categories: Machine Learning, Neural Networks
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms), [Neural network](/wiki/neural_network), [Perceptron](/wiki/perceptron)*

A **neuron** (also called a node or unit) is the fundamental computational element of an [artificial neural network](/wiki/neural_network): it takes one or more numeric inputs, multiplies each by a learned [weight](/wiki/weight), sums them with a [bias](/wiki/bias_math_or_bias_term) term, and passes the result through a nonlinear [activation function](/wiki/activation_function) to produce a single output. The idea dates to 1943, when Warren McCulloch and Walter Pitts published the first mathematical model of a neural unit, and it remains the building block of every modern deep learning system, from image classifiers to [large language models](/wiki/large_language_model) with hundreds of billions of [parameters](/wiki/parameter).[1] Inspired loosely by biological neurons in the nervous system, artificial neurons are organized into [layers](/wiki/layer) and connected through weighted links, forming the architecture of a [neural network](/wiki/neural_network).

## Historical Background

### What was the McCulloch-Pitts neuron (1943)?

The concept of the artificial neuron dates back to 1943, when neurophysiologist Warren McCulloch and logician Walter Pitts published their landmark paper "A Logical Calculus of the Ideas Immanent in Nervous Activity" in the *Bulletin of Mathematical Biophysics*, volume 5, pages 115-133.[1] Their model, now known as the McCulloch-Pitts (MCP) neuron, was the first mathematical model of a neural unit. It operated as a simple binary threshold device: the neuron received binary inputs, computed their sum, and fired (output 1) if the sum met or exceeded a fixed threshold, or remained silent (output 0) otherwise. Because of the "all-or-none" character of nervous activity, McCulloch and Pitts argued, neural events and the relations among them could be treated with propositional logic.[1]

McCulloch and Pitts demonstrated that networks of these binary neurons could implement the basic logic gates (AND, OR, NOT) and, by combining them, any Boolean function, which made such networks capable in principle of general-purpose computation. John von Neumann cited their work as a significant result that influenced early computer design. However, the MCP neuron had important limitations: all weights were fixed (there was no learning rule), and a single unit could not solve problems like XOR, which is not linearly separable.[2]
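
The threshold behavior described above can be sketched in a few lines of Python. This is an illustrative simplification, not the paper's formalism: it uses fixed signed weights, whereas the 1943 model treats inhibition as an absolute veto, and the names `mcp_neuron`, `logic_and`, `logic_or`, and `logic_not` are hypothetical.

```python
def mcp_neuron(inputs, weights, threshold):
    """Fire (return 1) when the weighted sum of binary inputs reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0


def logic_and(a, b):
    # Both inputs must be active to reach a threshold of 2.
    return mcp_neuron([a, b], weights=[1, 1], threshold=2)


def logic_or(a, b):
    # A single active input is enough to reach a threshold of 1.
    return mcp_neuron([a, b], weights=[1, 1], threshold=1)


def logic_not(a):
    # Assumption: inhibition is modeled as a negative weight.
    return mcp_neuron([a], weights=[-1], threshold=0)
```

Nothing here is learned: every weight and threshold is chosen by hand, which is the limitation the perceptron later addressed.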

### When was the perceptron invented?

Frank Rosenblatt extended the MCP model by introducing the [perceptron](/wiki/perceptron) in 1958. The key innovation was that the perceptron had adjustable weights that could be trained from data using a learning algorithm. Rosenblatt's hardware implementation, the Mark I Perceptron, was built at the Cornell Aeronautical Laboratory under a U.S. Navy contract; it used 400 photocells arranged in a 20x20 grid as its "eye" and stored learned weights in motor-driven potentiometers.[3] Following a 1958 Navy press conference at which Rosenblatt presented the work, *The New York Times* reported the perceptron to be "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."[3]

Rosenblatt proved that the perceptron learning rule would converge to a correct solution for any linearly separable problem. Despite this advance, Marvin Minsky and Seymour Papert showed in their 1969 book *Perceptrons* that single-layer perceptrons could not learn non-linearly separable functions like XOR, which contributed to the first "AI winter."[4]
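
As a rough illustration of both results, the sketch below trains a single perceptron with the classic error-driven update on AND (linearly separable) and on XOR (not separable). The function names, the zero initialization, the learning rate, and the epoch limit are all assumptions made for the example.

```python
def step(z):
    """Hard threshold used by the classic perceptron."""
    return 1 if z >= 0 else 0


def predict(weights, bias, x):
    return step(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_perceptron(samples, epochs=50, lr=1.0):
    """Perceptron rule: nudge the weights toward each misclassified example."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, target in samples:
            error = target - predict(weights, bias, x)  # -1, 0, or +1
            if error != 0:
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
                mistakes += 1
        if mistakes == 0:  # every sample classified correctly: converged
            break
    return weights, bias


def count_correct(weights, bias, samples):
    return sum(predict(weights, bias, x) == t for x, t in samples)


AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_weights, and_bias = train_perceptron(AND_DATA)
xor_weights, xor_bias = train_perceptron(XOR_DATA)
```

On AND the loop stops as soon as an epoch passes with no mistakes. On XOR that never happens, so at least one of the four inputs remains misclassified however long training runs.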

### How did modern neurons and backpropagation change the picture (1980s onward)?

In the 1980s, research on neural networks regained momentum with the development of the [backpropagation](/wiki/backpropagation) algorithm for training multi-layer networks. This required neurons with smooth, differentiable [activation functions](/wiki/activation_function) (such as the sigmoid and tanh) rather than hard threshold steps. The transition to continuous activation functions enabled gradient-based optimization and opened the door to deep learning as it exists today.[5]

## Mathematical Model

An artificial neuron performs a straightforward computation that can be broken down into two stages: a linear transformation followed by a nonlinear activation.

### The Neuron Equation

Given a neuron with *n* inputs, the output *y* is computed as:

**y = f(w₁x₁ + w₂x₂ + ... + wₙxₙ + b)**

Or in more compact notation:

**y = f(w · x + b)**

where:

| Symbol | Meaning |
|---|---|
| x₁, x₂, ..., xₙ | Input values from the data or from previous [layer](/wiki/layer) neurons |
| w₁, w₂, ..., wₙ | [Weights](/wiki/weight) that scale each input, learned during training |
| b | [Bias](/wiki/bias_math_or_bias_term) term, an additive offset that shifts the activation threshold |
| w · x | Dot product (weighted sum) of inputs and weights |
| f(·) | [Activation function](/wiki/activation_function) that introduces nonlinearity |
| y | Output of the neuron, passed to subsequent layers or used as final output |

### Step-by-Step Computation

1. **Weighted sum**: Each input *xᵢ* is multiplied by its corresponding [weight](/wiki/weight) *wᵢ*, and the products are summed: s = Σ(wᵢ · xᵢ).
2. **Add bias**: The [bias](/wiki/bias_math_or_bias_term) term *b* is added to the weighted sum to give the pre-activation: z = s + b. The bias allows the neuron to shift its activation threshold independently of the input values.
3. **Apply activation function**: The result *z* is passed through a nonlinear [activation function](/wiki/activation_function) *f* to produce the final output: y = f(z).

The weights and bias are the learnable [parameters](/wiki/parameter) of the neuron. During [training](/wiki/training), they are adjusted iteratively using optimization algorithms (typically some variant of [gradient descent](/wiki/stochastic_gradient_descent_sgd)) to minimize a [loss function](/wiki/loss_function).
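
The computation can be written out directly. The sketch below is a minimal illustration with a sigmoid activation; the input values, weights, and bias are arbitrary numbers chosen for the example, and `neuron_output` is a hypothetical name.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Compute y = f(w . x + b) for a single neuron."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))  # w . x
    z = weighted_sum + bias                                     # pre-activation
    return activation(z)                                        # y = f(z)


x = [0.5, -1.0, 2.0]   # example inputs (arbitrary)
w = [0.4, 0.3, -0.2]   # example weights (arbitrary)
b = 0.1                # example bias (arbitrary)

# z = 0.2 - 0.3 - 0.4 + 0.1 = -0.4, so y is roughly 0.40
y = neuron_output(x, w, b)
```

Training changes only `w` and `b`; the structure of the computation stays the same.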

## Components of a Neuron

### Inputs

Inputs are the values fed into the neuron. In the first layer of a network, these come directly from the data (raw [features](/wiki/feature) such as pixel intensities or word [embeddings](/wiki/embeddings)). In deeper layers, the inputs are the outputs of neurons from the preceding [layer](/wiki/layer).

### Weights

[Weights](/wiki/weight) are scalar values that determine the strength and direction of the connection between an input and the neuron. A large positive weight means the corresponding input has a strong excitatory influence on the neuron's output. A large negative weight indicates an inhibitory influence. Weights are initialized (often randomly) before training begins and are updated through [backpropagation](/wiki/backpropagation).

### Bias

The [bias](/wiki/bias_math_or_bias_term) is an additional learnable parameter that allows the neuron to activate even when all inputs are zero. Geometrically, the bias shifts the decision boundary of the neuron away from the origin. Without a bias term, the neuron's decision hyperplane would always pass through the origin, limiting its flexibility.

### Activation Function

The [activation function](/wiki/activation_function) introduces nonlinearity into the neuron's computation. Without it, any stack of layers would collapse into a single linear (affine) transformation, regardless of the number of layers. Common activation functions include:

| Activation Function | Formula | Key Properties |
|---|---|---|
| Sigmoid | f(z) = 1 / (1 + e⁻ᶻ) | Outputs in (0, 1); historically popular; suffers from vanishing gradients |
| Tanh | f(z) = (eᶻ - e⁻ᶻ) / (eᶻ + e⁻ᶻ) | Outputs in (-1, 1); zero-centered; also suffers from vanishing gradients |
| [ReLU](/wiki/rectified_linear_unit_relu) | f(z) = max(0, z) | Simple and efficient; default choice in most modern networks; can cause dead neurons |
| Leaky ReLU | f(z) = max(αz, z), α small | Prevents dead neurons by allowing a small gradient for negative inputs |
| Swish | f(z) = z · sigmoid(z) | Smooth, non-monotonic; often outperforms ReLU in deep networks |
| [Softmax](/wiki/softmax) | f(zᵢ) = exp(zᵢ) / Σⱼ exp(zⱼ) | Used in output layer for multi-class classification; outputs sum to 1 |
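
The formulas in the table translate almost line for line into Python. The following is a plain standard-library sketch; the default `alpha=0.01` and the max-subtraction inside `softmax` (a common numerical-stability step) are assumptions that the table does not specify.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    return math.tanh(z)


def relu(z):
    return max(0.0, z)


def leaky_relu(z, alpha=0.01):
    # For 0 < alpha < 1 this returns z when z > 0 and alpha * z otherwise.
    return max(alpha * z, z)


def swish(z):
    return z * sigmoid(z)


def softmax(zs):
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    peak = max(zs)
    exps = [math.exp(z - peak) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]
```

Unlike the others, `softmax` acts on a whole vector of pre-activations, because each output depends on every input.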

## Single Neuron as a Linear Classifier

A single neuron with a step or sigmoid activation function can serve as a binary [linear classifier](/wiki/logistic_regression). The set of points where its pre-activation is zero, *w · x + b = 0*, forms a hyperplane (a line in 2D, a plane in 3D) that separates the input space into two regions. Points on one side of this boundary are assigned to one class, and points on the other side to the second class.

The [bias](/wiki/bias_math_or_bias_term) shifts the position of this boundary, while the [weights](/wiki/weight) determine its orientation. This is precisely what the original [perceptron](/wiki/perceptron) does. However, a single neuron can only learn linearly separable patterns. Problems like XOR, where no straight line can separate the two classes, require multiple neurons organized across multiple layers.
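
To make the XOR point concrete, here is a small sketch of a two-layer network of step neurons that computes XOR. The weights are set by hand rather than learned, and the decomposition into an OR unit and a NAND unit is one common choice among several.

```python
def step_neuron(inputs, weights, bias):
    """Single neuron with a hard threshold at zero."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z >= 0 else 0


def xor_network(a, b):
    # Hidden layer: two linear boundaries in the input plane.
    h_or = step_neuron([a, b], weights=[1, 1], bias=-0.5)      # a OR b
    h_nand = step_neuron([a, b], weights=[-1, -1], bias=1.5)   # NOT (a AND b)
    # Output layer: fires only when both hidden neurons fire.
    return step_neuron([h_or, h_nand], weights=[1, 1], bias=-1.5)
```

Each hidden neuron draws one straight line; the output neuron keeps only the region between the two lines, which no single line could isolate.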

## How does an artificial neuron differ from a biological neuron?

Although artificial neurons draw inspiration from biology, the two are fundamentally different in complexity and mechanism. The human brain contains roughly 86 billion neurons, each forming on the order of 7,000 synaptic connections on average, yet runs on about 20 watts of power, a level of efficiency that today's GPU-based networks cannot approach.[6]

| Aspect | Biological Neuron | Artificial Neuron |
|---|---|---|
| Structure | Cell body (soma), dendrites, axon, synapses | Mathematical function with weights, bias, activation |
| Signal type | Electrochemical (action potentials, neurotransmitters) | Numerical values (floating-point numbers) |
| Communication | Spikes transmitted along axons; analog/digital hybrid | Continuous values passed through weighted connections |
| Learning | Synaptic plasticity (Hebbian learning, long-term potentiation) | Weight updates via [backpropagation](/wiki/backpropagation) and gradient descent |
| Connectivity | ~7,000 synaptic connections per neuron on average | Typically tens to thousands of input connections per neuron |
| Speed | Slow firing (~1-100 Hz) but massively parallel | Fast computation but more sequential in practice |
| Energy | Extremely efficient (~20 watts for the entire brain) | Energy-intensive (GPUs consume hundreds of watts) |
| Scale | ~86 billion neurons in the human brain | Large models have billions of parameters, but far fewer distinct neuron units |
| Self-repair | Neurons can form new connections and partially compensate for damage | No intrinsic self-repair; requires retraining or manual intervention |

Artificial neurons are simplified "caricature models" of their biological counterparts. Real neurons exhibit complex behaviors such as dendritic computation, diverse neurotransmitter dynamics, and temporal coding that artificial neurons do not capture.[7]

## Types of Neurons by Role

Neurons are classified by their position and function within the network:

| Neuron Type | Location | Role |
|---|---|---|
| Input neuron | Input [layer](/wiki/layer) | Receives raw data features and passes them into the network |
| Hidden neuron | [Hidden layer](/wiki/hidden_layer)(s) | Processes intermediate representations; learns internal features |
| Output neuron | Output [layer](/wiki/layer) | Produces the network's final prediction (class probabilities, regression values) |

## Neurons in Different Layer Types

The behavior of a neuron depends heavily on the type of [layer](/wiki/layer) it belongs to:

### Dense (Fully Connected) Layers

In a dense layer, every neuron is connected to every neuron in the previous layer. Each neuron computes the full weighted sum over all its inputs. Dense layers are the most general type and are commonly used in the final classification stages of a network.
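
A dense layer is simply several neurons reading the same input vector. The sketch below assumes a ReLU activation and uses made-up weights; `dense_layer` is a hypothetical helper, not a library function.

```python
def relu(z):
    return max(0.0, z)


def dense_layer(inputs, weight_rows, biases, activation=relu):
    """Each row of weight_rows holds one neuron's weights over all inputs."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weight_rows, biases)
    ]


inputs = [2.0, 1.0, 3.0]
weight_rows = [
    [1.0, 0.0, -1.0],   # neuron 0: z = 2 - 3 = -1, ReLU gives 0
    [0.5, 0.5, 0.5],    # neuron 1: z = 3 - 1 = 2, ReLU gives 2
]
biases = [0.0, -1.0]

outputs = dense_layer(inputs, weight_rows, biases)

# One weight per (input, neuron) pair plus one bias per neuron.
n_parameters = len(inputs) * len(weight_rows) + len(biases)
```

With 3 inputs and 2 neurons the layer has 8 parameters, and the count grows with the product of the two layer sizes.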

### Convolutional Layers

In a [convolutional neural network](/wiki/convolutional_neural_network) (CNN), neurons are arranged in three dimensions (height, width, depth) and each neuron is connected to only a small local region (receptive field) of the previous layer. All neurons in a feature map share the same weights (a filter or kernel), which dramatically reduces the number of parameters and allows the network to detect spatial patterns like edges, textures, and objects regardless of their position in the input.
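
Weight sharing is easiest to see in one dimension. The sketch below is a simplified 1D analogue of the 3D arrangement described above: every output position applies the same two-weight kernel to its own local window. The kernel values and signals are invented for the example.

```python
def conv1d(signal, kernel, bias=0.0):
    """Slide one shared kernel over the signal (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k)) + bias
        for i in range(len(signal) - k + 1)
    ]


edge_kernel = [-1.0, 1.0]             # responds to a rise, negatively to a fall

signal = [0, 0, 1, 1, 1, 0]
shifted = [0, 1, 1, 1, 0, 0]          # same pattern, one step earlier

response = conv1d(signal, edge_kernel)
shifted_response = conv1d(shifted, edge_kernel)
```

The two responses contain the same peaks at shifted positions, and the layer needs only the kernel's two weights and one bias no matter how long the input is.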

### Recurrent Layers

In a [recurrent neural network](/wiki/recurrent_neural_network) (RNN), neurons have connections that loop back on themselves, giving the network a form of memory. At each time step, a recurrent neuron receives both a new input and its own previous output (or hidden state), allowing it to process sequential data such as text, speech, or time series. Variants like [LSTM](/wiki/long_short-term_memory_lstm) and GRU add gating mechanisms that control how information flows through recurrent neurons over time.
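
A single recurrent neuron can be sketched as follows. This is a minimal scalar illustration with a tanh activation and hand-picked weights; it does not include the gating found in LSTM or GRU units, and the names `rnn_step` and `run_rnn` are hypothetical.

```python
import math


def rnn_step(x_t, h_prev, w_in, w_rec, bias):
    """New hidden state from the current input and the previous state."""
    return math.tanh(w_in * x_t + w_rec * h_prev + bias)


def run_rnn(sequence, w_in=1.0, w_rec=0.5, bias=0.0):
    h = 0.0                      # initial hidden state
    states = []
    for x_t in sequence:
        h = rnn_step(x_t, h, w_in, w_rec, bias)
        states.append(h)
    return states


# A single pulse followed by silence.
states = run_rnn([1.0, 0.0, 0.0])
```

The hidden state stays positive after the input has returned to zero and fades over time, which is the "memory" carried by the recurrent connection.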

### Attention Layers

In [transformer](/wiki/transformer) architectures, the [attention](/wiki/attention) mechanism computes dynamic, data-dependent weighted sums rather than using fixed connections. While the underlying linear projections still use standard neuron-like units, the attention mechanism allows each position in a sequence to attend to all other positions, breaking free from the fixed connectivity patterns of dense and convolutional layers.

## Number of Neurons and Model Capacity

The number of neurons in a [neural network](/wiki/neural_network) is directly tied to the model's capacity, which describes the range of functions the model can learn.

- **Width** (number of neurons per layer): Increasing width allows a layer to represent more features simultaneously. In theory, a single [hidden layer](/wiki/hidden_layer) with enough neurons and a suitable nonlinear activation can approximate any continuous function on a bounded input region to arbitrary accuracy (the Universal Approximation Theorem). The theorem says nothing about how many neurons are needed or how to find the weights, and in practice very wide single-layer networks are difficult to train effectively.
- **Depth** (number of layers): Increasing depth allows the network to learn hierarchical, compositional features. Research has shown that the number of distinct linear regions a network can represent grows exponentially with depth but only polynomially with width, making depth a more efficient way to increase model complexity.[8]
- **Overfitting vs. underfitting**: A model with too few neurons may [underfit](/wiki/underfitting) (fail to capture the patterns in the data). A model with too many neurons may [overfit](/wiki/overfitting) (memorize training data rather than learning general patterns). Techniques like [dropout](/wiki/dropout_regularization), [regularization](/wiki/regularization), and [early stopping](/wiki/early_stopping) help manage this tradeoff.
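
To see how width and depth translate into learnable parameters, the following sketch counts the weights and biases of a fully connected network. The two layer configurations are arbitrary examples, and parameter count is only a rough proxy for capacity.

```python
def mlp_parameter_count(layer_sizes):
    """Weights plus biases for consecutive fully connected layers."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


wide_shallow = [10, 512, 1]          # one hidden layer of 512 neurons
narrow_deep = [10, 32, 32, 32, 1]    # three hidden layers of 32 neurons

wide_params = mlp_parameter_count(wide_shallow)
deep_params = mlp_parameter_count(narrow_deep)
```

Here the deeper network has 96 hidden neurons and fewer than half the parameters of the wide one, which is the kind of tradeoff the width-versus-depth comparison refers to.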

Modern large language models contain billions of [parameters](/wiki/parameter) distributed across millions of neuron-like units organized in deep architectures, giving them the capacity to model extremely complex language patterns.

## Dead Neurons and the Dying ReLU Problem

A **dead neuron** is one that always outputs zero regardless of its input, effectively contributing nothing to the network's computation. This phenomenon is most commonly associated with the [ReLU](/wiki/rectified_linear_unit_relu) activation function.

### How It Happens

ReLU outputs zero for any negative input: f(z) = max(0, z). If, during training, the [weights](/wiki/weight) and [bias](/wiki/bias_math_or_bias_term) of a neuron shift such that its pre-activation value *z* becomes negative for all training examples, the neuron will consistently output zero. Because the gradient of ReLU is also zero for negative inputs, the neuron receives no gradient signal during [backpropagation](/wiki/backpropagation), meaning its weights can never be updated. The neuron is effectively "dead."[9]
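
The mechanism can be checked numerically. In this sketch a neuron's bias has drifted to a large negative value (an invented scenario), so its pre-activation is negative on every example and the gradient reaching its weights is zero each time.

```python
def relu_grad(z):
    """Derivative of ReLU: 1 for positive z, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def parameter_gradients(inputs, weights, bias, upstream_grad=1.0):
    """Gradients of the loss with respect to one ReLU neuron's weights and bias."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    local = upstream_grad * relu_grad(z)
    return [local * x for x in inputs], local


weights = [0.5, 0.5]
bias = -5.0                                    # pushed far negative
examples = [[0.1, 0.9], [0.7, 0.3], [1.0, 1.0]]

all_grads = [parameter_gradients(x, weights, bias) for x in examples]
```

Every entry in `all_grads` is zero, so a gradient-descent update leaves the weights and bias exactly where they are.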

### Causes

- **Large learning rates** that cause weights to overshoot into regions where the neuron's input is always negative.
- **Poor weight initialization** that places neurons in permanently inactive regions from the start.
- **Unfortunate data distributions** that cause certain neurons to receive predominantly negative inputs.

### Solutions

| Solution | How It Works |
|---|---|
| Leaky ReLU | Outputs a small non-zero value (e.g., 0.01z) for negative inputs, so the gradient there is a small constant (0.01) rather than zero, preventing complete death |
| Parametric ReLU (PReLU) | Learns the slope for negative inputs as a trainable parameter |
| ELU | Uses an exponential curve for negative inputs, providing non-zero outputs and gradients |
| Swish / GELU | Smooth activation functions that avoid the sharp cutoff at zero |
| Lower learning rate | Reduces the chance of weights overshooting into dead zones |
| Careful initialization | Methods like He initialization set initial weights to appropriate scales for ReLU networks |

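A large share of dead neurons reduces a layer's effective capacity and can severely impair the network's ability to learn; with a learning rate set too high, as much as 40% of a network's neurons have been reported to end up dead.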
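
One simple diagnostic is to record a layer's outputs over a batch and count the neurons that never activate. The sketch below does this for a made-up activation table and also shows why Leaky ReLU avoids the problem; `dead_fraction` and `leaky_relu_grad` are hypothetical helper names.

```python
def dead_fraction(activations):
    """activations[i][j] is the output of neuron j on example i."""
    n_neurons = len(activations[0])
    dead = sum(
        all(row[j] == 0 for row in activations)
        for j in range(n_neurons)
    )
    return dead / n_neurons


def leaky_relu_grad(z, alpha=0.01):
    """Derivative of Leaky ReLU: never exactly zero."""
    return 1.0 if z > 0 else alpha


# Three examples, four ReLU neurons; neurons 0 and 2 never fire.
layer_outputs = [
    [0.0, 1.2, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.9],
    [0.0, 0.4, 0.0, 0.0],
]

fraction = dead_fraction(layer_outputs)
```

A neuron counted as dead on one batch may still fire on other data, so in practice the check is more reliable over a large sample.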

## What do individual neurons learn? (Interpretability)

Understanding what individual neurons learn is a central question in [neural network interpretability](/wiki/interpretability).

### Feature Visualization

Researchers use optimization techniques to generate synthetic inputs that maximally activate a specific neuron, revealing what "feature" or pattern the neuron has learned to detect. In image networks like Inception and [ResNet](/wiki/resnet), these visualizations show that early-layer neurons respond to simple patterns (edges, colors), middle-layer neurons detect textures and parts (eyes, wheels), and late-layer neurons respond to complex objects or scenes.[10]
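
The optimization idea can be shown on a toy scale. The sketch below performs gradient ascent on the input of a single linear neuron, keeping the input on the unit sphere. It is a deliberately tiny stand-in for feature visualization in a deep network, which adds regularization and image parameterizations; the weights, step size, and function names are assumptions.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def maximize_preactivation(weights, start, steps=100, lr=0.1):
    """Find the unit-length input that maximizes w . x by gradient ascent."""
    x = normalize(start)
    for _ in range(steps):
        # d(w . x + b)/dx = w, so the ascent direction is the weight vector.
        x = normalize([xi + lr * wi for xi, wi in zip(x, weights)])
    return x


best_input = maximize_preactivation(weights=[3.0, 4.0], start=[1.0, 0.0])
```

The result converges to the direction of the weight vector, roughly `[0.6, 0.8]`: for a single linear neuron, the "preferred pattern" is its own weights.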

### Dataset Examples

A complementary approach involves feeding a large dataset through the network and identifying which real inputs produce the highest activation for a given neuron. This provides intuitive evidence of what each neuron "looks for."
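
In code, this amounts to scoring every example and keeping the highest scorers. The sketch uses a single hand-built ReLU neuron and a tiny made-up dataset; in a real network `neuron_fn` would read one unit's activation from a forward pass.

```python
def top_activating_examples(dataset, neuron_fn, k=3):
    """Return the k (activation, example) pairs with the highest activation."""
    scored = [(neuron_fn(x), x) for x in dataset]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def toy_neuron(x):
    # Hypothetical neuron: ReLU(1.0 * x0 + 0.5 * x1 - 1.0)
    return max(0.0, 1.0 * x[0] + 0.5 * x[1] - 1.0)


dataset = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]

top_two = top_activating_examples(dataset, toy_neuron, k=2)
```

Inspecting the top-ranked inputs gives a quick, if incomplete, picture of what the neuron responds to.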

### What is a polysemantic neuron?

Many neurons turn out to be **polysemantic**, meaning they respond to multiple unrelated concepts. A well-documented example is neuron 4e:55 in the InceptionV1 vision model, which responds to cat faces, cat legs, and the fronts of cars: feature visualization shows it is not detecting some subtle shared feature but genuinely responding to all three.[10] In language models, a single neuron might respond to academic citations, dialogue, HTTP requests, and Korean text simultaneously. This makes interpreting individual neurons difficult.

### What is the superposition hypothesis?

The **superposition hypothesis**, articulated by researchers at Anthropic in their 2022 paper "Toy Models of Superposition," offers an explanation for polysemantic neurons. The paper investigates "how and when models represent more features than they have dimensions," a phenomenon the authors call superposition.[11] The core idea is that neural networks need to represent more features than they have neurons, so they exploit the geometry of high-dimensional spaces to encode multiple features as overlapping directions in activation space. Each neuron participates in encoding many features simultaneously, and each feature is distributed across many neurons. The paper shows that superposition lets a model store extra features at the cost of "interference" that requires nonlinear filtering, and that features organize into geometric structures such as digons, triangles, and pentagons.[11]
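
A very small example gives the flavor of the idea. The sketch below stores three features in a two-dimensional hidden space by assigning them directions 120 degrees apart, then reads them back with a ReLU. It is a hand-constructed illustration in the spirit of the paper's toy setup, not a reproduction of its trained models.

```python
import math


def feature_directions(n_features):
    """Evenly spaced unit vectors in the plane, one per feature."""
    return [
        (math.cos(2 * math.pi * i / n_features),
         math.sin(2 * math.pi * i / n_features))
        for i in range(n_features)
    ]


def encode(features, directions):
    """Project n features down into 2 hidden dimensions."""
    hx = sum(f * d[0] for f, d in zip(features, directions))
    hy = sum(f * d[1] for f, d in zip(features, directions))
    return (hx, hy)


def decode(hidden, directions):
    """Read each feature back; ReLU filters out negative interference."""
    return [max(0.0, hidden[0] * d[0] + hidden[1] * d[1]) for d in directions]


directions = feature_directions(3)

sparse_recovered = decode(encode([1.0, 0.0, 0.0], directions), directions)
dense_recovered = decode(encode([1.0, 1.0, 0.0], directions), directions)
```

With one active feature the readout is essentially exact, about `[1, 0, 0]`. With two active features the overlapping directions interfere and each comes back at about half strength, which illustrates why superposition pays off mainly when features are sparse.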

This insight has driven the development of techniques like **sparse autoencoders**, which decompose polysemantic neuron activations into more interpretable "monosemantic" features. Anthropic's 2023 follow-up, "Towards Monosemanticity," used a sparse autoencoder to extract thousands of interpretable features from a one-layer transformer.[12] Understanding superposition is considered one of the most important open problems in AI interpretability and safety research.

## Explain Like I'm 5 (ELI5)

Imagine a neuron in a neural network is like a tiny worker in a huge factory. The worker gets messages from many other workers. Some messages are important (they have a big "weight"), and some are less important (small "weight"). The worker adds up all the messages, paying more attention to the important ones. Then the worker checks: "Is this total big enough for me to care?" If it is, the worker gets excited and passes along a signal to the next group of workers. If not, the worker stays quiet.

When you stack thousands of these workers in rows (layers), the first row notices simple things (like edges in a picture), the next row combines those simple things into medium things (like eyes or wheels), and the deeper rows recognize whole objects (like a cat or a car). The factory learns by adjusting how much each worker pays attention to each message, trying to get the right answer more often.

## References

1. McCulloch, W. S., & Pitts, W. (1943). "A Logical Calculus of the Ideas Immanent in Nervous Activity." *Bulletin of Mathematical Biophysics*, 5(4), 115-133. https://link.springer.com/article/10.1007/BF02478259
2. "Artificial neuron." *Wikipedia*. https://en.wikipedia.org/wiki/Artificial_neuron
3. "Perceptron." *Wikipedia*. https://en.wikipedia.org/wiki/Perceptron
4. Minsky, M., & Papert, S. (1969). *Perceptrons: An Introduction to Computational Geometry*. MIT Press.
5. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). "Learning representations by back-propagating errors." *Nature*, 323(6088), 533-536.
6. "Billions of neurons, trillions of synapses." *UCLA Brain Research Institute*. https://bri.ucla.edu/brain-fact/billions-of-neurons-trillions-of-synapses/
7. "Biological Neurons vs Artificial Neural Networks." *Sophos Blog*. https://www.sophos.com/en-us/blog/man-vs-machine-comparing-artificial-and-biological-neural-networks
8. Montufar, G. F., Pascanu, R., Cho, K., & Bengio, Y. (2014). "On the Number of Linear Regions of Deep Neural Networks." *Advances in Neural Information Processing Systems (NeurIPS)*.
9. Lu, L., Shin, Y., Su, Y., & Karniadakis, G. E. (2019). "Dying ReLU and Initialization: Theory and Numerical Examples." *arXiv:1903.06733*.
10. Olah, C., Mordvintsev, A., & Schubert, L. (2017). "Feature Visualization." *Distill*. https://distill.pub/2017/feature-visualization/
11. Elhage, N., et al. (2022). "Toy Models of Superposition." *Transformer Circuits Thread*. https://transformer-circuits.pub/2022/toy_model/index.html
12. Bricken, T., et al. (2023). "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning." *Transformer Circuits Thread*. https://transformer-circuits.pub/2023/monosemantic-features

