# Node (neural network)

> Source: https://aiwiki.ai/wiki/node_neural_network
> Updated: 2026-07-07
> Categories: Model Architecture, Neural Networks
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

A **node** in a [neural network](/wiki/neural_network) is the basic computational element, an artificial neuron, that receives one or more inputs, multiplies each by a learned weight, sums them, adds a bias, and passes the result through a nonlinear [activation function](/wiki/activation_function) to produce a single output. The same object goes by several other names: [neuron](/wiki/neuron), unit, processing element, and (historically) [perceptron](/wiki/perceptron). All of these refer to the same simple compute primitive that, when stacked into layers and wired together by weighted connections, yields the [deep learning](/wiki/deep_learning) systems that power modern artificial intelligence. The primitive itself has barely changed since 1943; what changed is scale, from the single threshold unit of the McCulloch-Pitts model to the tens of thousands of nodes in each layer of a modern [large language model](/wiki/large_language_model).

The word "node" can be ambiguous. In the context of feedforward layers, it almost always means a single unit. In a [graph neural network](/wiki/graph_neural_network) or a TensorFlow computation graph it means something quite different. The disambiguation section below covers those cases.

## How does a node compute its output?

A single node receives a vector of inputs `x = (x_1, x_2, ..., x_n)` along with a corresponding vector of learned [weights](/wiki/weights) `w = (w_1, w_2, ..., w_n)` and a scalar [bias](/wiki/bias_math_or_bias_term) term `b`. The node's output `y` is given by:

```
y = sigma(sum_i w_i * x_i + b)
```

where `sigma` is the activation function. The inner sum (the pre-activation) is sometimes called the logit or the net input. Geometrically, the pre-activation defines a hyperplane in input space; the activation function then bends that linear decision surface into something the layer above can use.
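The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a framework implementation: the function names `sigmoid` and `node_output` and the example numbers are assumptions chosen for this sketch, and the logistic sigmoid stands in for `sigma`.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic activation, used here as a stand-in for sigma."""
    return 1.0 / (1.0 + math.exp(-z))

def node_output(x, w, b, activation=sigmoid):
    """Compute activation(sum_i w_i * x_i + b) for a single node."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    pre_activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return activation(pre_activation)

# Illustrative values: the pre-activation is 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0.0
y = node_output(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0)
print(y)  # sigmoid(0.0) = 0.5
```

Passing a different `activation` (for example `lambda z: z`) exposes the raw pre-activation, which is the quantity that defines the hyperplane.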

A layer of `m` nodes connected to `n` inputs is conventionally written in matrix form as `y = sigma(W x + b)`, where `W` is an `m x n` weight matrix and `b` is a length-`m` bias vector. The affine part `W x + b` is what `nn.Linear(n, m)` builds in [PyTorch](/wiki/pytorch), where the activation is applied as a separate module. `tf.keras.layers.Dense(m)` builds the same thing in [TensorFlow](/wiki/tensorflow) and can optionally apply `sigma` through its `activation` argument. The argument `m` is the number of output nodes, also known as the layer's width.
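The matrix form can be written out without any framework to show that a layer is just `m` nodes evaluated on the same input. The sketch below assumes ReLU as the activation and uses a hypothetical helper `dense_layer`; the weights are made-up illustrative values.

```python
def relu(z: float) -> float:
    return max(0.0, z)

def dense_layer(x, W, b, activation=relu):
    """Return activation(W x + b), with W given as a list of m rows of length n."""
    if len(W) != len(b):
        raise ValueError("W needs one row per bias entry")
    outputs = []
    for row, bias in zip(W, b):
        if len(row) != len(x):
            raise ValueError("each row of W must match the input length")
        # Each row of W is the weight vector of one node in the layer.
        outputs.append(activation(sum(w * xi for w, xi in zip(row, x)) + bias))
    return outputs

# A layer with n = 3 inputs and m = 2 nodes (width 2).
W = [[1.0, -1.0, 0.0],
     [0.5, 0.5, 0.5]]
b = [0.0, -1.0]
y = dense_layer([2.0, 3.0, 4.0], W, b)
print(y)  # [0.0, 3.5]
```

The first node's pre-activation is -1.0, which ReLU clips to 0.0; the second node's is 4.5 - 1.0 = 3.5.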

## When was the artificial neuron invented?

The modern node has a long lineage. The table below lists the main milestones.

| Year | Model | Contribution |
| --- | --- | --- |
| 1943 | McCulloch-Pitts neuron | First mathematical model of a neuron. Inputs are binary, the unit fires if the weighted sum crosses a threshold, and a single inhibitory input can veto the output. McCulloch and Pitts showed that networks of these units could compute any logical function [1]. |
| 1958 | Rosenblatt's perceptron | First learnable single-layer node. Introduced an iterative weight-update rule that converges on linearly separable problems. Published in *Psychological Review* under the title "The perceptron: a probabilistic model for information storage and organization in the brain" [2]. |
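| 1969 | Minsky and Papert's *Perceptrons* | Proved that a single-layer perceptron cannot represent XOR or any other function that is not linearly separable. The book is widely credited with contributing to a collapse in neural network research funding that lasted over a decade [3]. |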
| 1982 | [Hopfield network](/wiki/hopfield_network) | Recurrent net of binary nodes with symmetric weights, demonstrating that simple equivalent units can produce content-addressable memory through emergent collective behavior [4]. |
| 1986 | [Backpropagation](/wiki/backpropagation) rediscovered | Rumelhart, Hinton, and Williams popularized training multi-layer networks of differentiable nodes by [gradient descent](/wiki/gradient_descent). This unlocked the hidden layer [5]. |
| 1989-1991 | Universal approximation | Cybenko (1989), Hornik, Stinchcombe and White (1989), and Hornik (1991) proved that a single hidden layer of enough nodes with a suitable nonlinear activation (sigmoidal in Cybenko's proof; continuous, bounded, and nonconstant in Hornik's) can approximate any continuous function on a compact set [6][7][8]. |
| 2012 | [AlexNet](/wiki/alexnet) | Demonstrated that wide networks of [ReLU](/wiki/rectified_linear_unit_relu) nodes trained on GPUs could win [ImageNet](/wiki/imagenet) by a wide margin, kicking off the deep learning era [9]. |
| 2017 | Transformer | The attention block reframed where the action lives: nodes sit inside the position-wise feedforward sublayers, while [attention heads](/wiki/attention_head) handle mixing across positions [12]. |

The McCulloch-Pitts paper ("A Logical Calculus of the Ideas Immanent in Nervous Activity", *Bulletin of Mathematical Biophysics*, 1943) is usually cited as the origin of the artificial neuron [1]. Its opening line states the core idea: "Because of the 'all-or-none' character of nervous activity, neural events and the relations among them can be treated by means of propositional logic" [1]. From that premise the authors proved that a network of their threshold units, given enough units, can compute any expression of propositional logic.

Rosenblatt's perceptron added the missing ingredient, learning, and was demonstrated on the Mark I Perceptron, a custom analog machine built at Cornell Aeronautical Laboratory whose input "retina" was a 20x20 grid of 400 photocells [2][22]. At a 1958 Navy press conference, Rosenblatt's claims for the device were extravagant: The New York Times reported it as "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence" [22]. Minsky and Papert's 1969 proof that a single node cannot compute XOR helped end this first wave of enthusiasm [3].

## What are the other names for a node?

Different communities and time periods favor different words for the same object. The choice of word is largely cosmetic.

| Term | Where you see it |
| --- | --- |
| Neuron | Most common in deep learning textbooks and biology-flavored writing. |
| Unit | Common in older connectionist literature and in the [hidden layer](/wiki/hidden_layer) / output layer naming conventions. |
| Node | Common when describing network topology, especially in diagrams. |
| Perceptron | Historical. Today usually refers to a single-node binary classifier or to a [multilayer perceptron](/wiki/perceptron) (MLP). |
| Processing element (PE) | Older systems engineering literature, especially around analog implementations. |
| Cell | Used for nodes with internal state, such as the LSTM cell. |

In this article, "node" and "neuron" are used interchangeably.

## What roles do nodes play in a neural network?

Nodes are organized into layers. Their role depends on the layer they sit in.

| Role | Description |
| --- | --- |
| Input node | Holds a single component of the input vector. Has no weights and no activation; it just delivers a feature value to the next layer. |
| Hidden node | Sits in a [hidden layer](/wiki/hidden_layer) and learns to combine inputs from the layer below into intermediate representations. The expressive power of a network comes from its hidden nodes. |
| Output node | Produces the final prediction. The activation differs by task: linear for regression, sigmoid for binary classification, [softmax](/wiki/softmax) for multiclass. |
| Bias node | A constant input (almost always equal to 1) connected through its own learned weight. Lets the activation surface shift away from the origin. Most modern frameworks fold the bias into the linear layer rather than drawing it as a separate node. |
| Convolutional filter | A node-like unit in a [convolutional neural network](/wiki/convolutional_neural_network) whose weights are shared across all spatial positions. One filter produces one channel of the output feature map. |
| LSTM cell | A composite node with internal memory, used in a [long short-term memory](/wiki/long_short-term_memory_lstm) network. Each cell contains its own input, [forget](/wiki/forget_gate), and output gates, each of which is itself made of standard sigmoid nodes. |
| Attention head | The [transformer](/wiki/transformer) analog. Not a single scalar node, but a block that attends across positions and produces a vector. Heads are sometimes called "node-like" components when describing model width. |

## What activation functions do nodes use?

Without a nonlinear activation, a stack of nodes collapses into a single linear (affine) map, so extra layers add no expressive power beyond what one linear layer already has. Different activations have dominated different eras.
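The collapse described above can be checked numerically. The sketch below uses made-up 2x2 weights and hypothetical helper names (`matvec`, `matmul`, `affine`); it shows that two stacked activation-free layers equal one merged layer, and that inserting a ReLU between them breaks the equivalence.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def affine(W, b, x):
    """Compute W x + b."""
    return [z + bi for z, bi in zip(matvec(W, x), b)]

def relu_vec(v):
    return [max(0.0, z) for z in v]

# Two layers with illustrative weights and no activation.
W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [1.0, 0.0]
W2, b2 = [[3.0, 1.0], [1.0, 1.0]], [0.0, 2.0]

# W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2): a single affine layer.
W_merged = matmul(W2, W1)
b_merged = affine(W2, b2, b1)

x = [2.0, -1.0]
stacked = affine(W2, b2, affine(W1, b1, x))
merged = affine(W_merged, b_merged, x)
print(stacked, merged)  # [4.0, 4.0] [4.0, 4.0]

# With a ReLU in between, the merged layer no longer reproduces the stack.
x2 = [-2.0, 1.0]
with_relu = affine(W2, b2, relu_vec(affine(W1, b1, x2)))
print(with_relu, affine(W_merged, b_merged, x2))  # [3.0, 3.0] [2.0, 2.0]
```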

| Activation | Formula | Range | Notes |
| --- | --- | --- | --- |
| Step / threshold | `1 if z > 0 else 0` | {0, 1} | The original McCulloch-Pitts and perceptron activation. Its gradient is zero everywhere except at the jump, where it is undefined, so it is unusable with gradient descent. |
| [Sigmoid](/wiki/sigmoid_function) | `1 / (1 + exp(-z))` | (0, 1) | Smooth and bounded. Standard in shallow nets through the 1990s. Saturates and causes the [vanishing gradient problem](/wiki/vanishing_gradient_problem) in deep nets. |
| Tanh | `(exp(z) - exp(-z)) / (exp(z) + exp(-z))` | (-1, 1) | Zero-centered version of sigmoid. Preferred in older hidden layers. |
| [ReLU](/wiki/relu) | `max(0, z)` | [0, inf) | Cheap, doesn't saturate for positive inputs, made deep networks practical from 2012 onward. The default for hidden layers in CNNs and MLPs. |
| Leaky ReLU | `max(alpha * z, z)`, small `alpha` | (-inf, inf) | Fixes the "dying ReLU" problem by keeping a small slope for negative inputs. |
| PReLU | Same as Leaky ReLU but `alpha` is learned | (-inf, inf) | Introduced by He et al. (2015) for ImageNet [26]. |
| GELU | `z * Phi(z)` where `Phi` is the standard Gaussian CDF | approx. [-0.17, inf) | Smooth ReLU variant that weights inputs by their value rather than gating by sign. Dips slightly below zero for negative inputs. Used in [BERT](/wiki/bert), GPT-2, [GPT-3](/wiki/gpt-3), and many other transformers [23]. |
| Swish / SiLU | `z * sigmoid(z)` | approx. [-0.28, inf) | Found by an automated search over candidate activation functions (Swish; Ramachandran et al. 2017); the same `z * sigmoid(z)` form was introduced independently as SiLU (Elfwing et al. 2018). Used in [EfficientNet](/wiki/efficientnet) and several LLMs [24][25]. |
| Softmax | `exp(z_i) / sum_j exp(z_j)` | (0, 1) per output, sums to 1 | Output activation for multiclass classification. Operates on a whole vector of nodes, not on each one independently. |
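The formulas in the table translate almost directly into code. The following is a plain-Python sketch for scalar inputs, not a framework API: the default `alpha=0.01` for Leaky ReLU is an assumed illustrative value, and the sigmoid and softmax are written in numerically stable forms that are algebraically equal to the table's formulas.

```python
import math

def step(z):
    return 1.0 if z > 0 else 0.0

def sigmoid(z):
    # Equivalent to 1 / (1 + exp(-z)), arranged to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def tanh(z):
    return math.tanh(z)

def relu(z):
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):  # alpha value is an assumption for illustration
    return max(alpha * z, z)

def gelu(z):
    # z * Phi(z), with the Gaussian CDF expressed through the error function.
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def silu(z):
    return z * sigmoid(z)

def softmax(zs):
    # Subtracting the maximum leaves the result unchanged and avoids overflow.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

for name, fn in [("relu", relu), ("gelu", gelu), ("silu", silu)]:
    print(name, [round(fn(z), 3) for z in (-2.0, -1.0, 0.0, 1.0, 2.0)])
print("softmax", [round(p, 3) for p in softmax([1.0, 2.0, 3.0])])
```

Unlike the other functions, `softmax` takes the whole vector of pre-activations, which reflects the table's note that it does not act on each node independently.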

The choice of activation tends to affect training dynamics, such as gradient flow and convergence speed, more than it affects final accuracy. The shift from sigmoid to ReLU around 2010-2012 is widely credited as one of the practical breakthroughs that enabled deep learning [9]. Later refinements are incremental by comparison: replacing every ReLU with Swish raised top-1 ImageNet accuracy by 0.9% on Mobile NASNet-A and 0.6% on Inception-ResNet-v2, the margins that motivated its adoption [24].

## What are layer width and depth?

The number of nodes in a layer is called the layer's width. The number of layers is the network's depth. Both matter, and they are not interchangeable.

Common widths in modern architectures cluster around powers of two: 64, 128, 256, 512, 1024, 2048, 4096. Hidden layers in BERT base have width 768. The hidden state in GPT-3 has width 12,288, with feedforward sublayers expanded to 4 times that, or 49,152, before being projected back down [16]. Width determines how many features can be represented in parallel at each layer.

The **[universal approximation theorem](/wiki/universal_approximation_theorem)** (Cybenko, 1989; Hornik, 1991) says that a feedforward network with one hidden layer of finite but possibly very large width and a suitable nonlinear activation (sigmoidal in Cybenko's proof; continuous, bounded, and nonconstant in Hornik's) can approximate any continuous function on a compact set to arbitrary accuracy [6][8]. The theorem is an existence proof. It does not say the required width is small or that the weights are easy to find. In practice, depth turns out to be far more parameter-efficient than width: deep narrow networks routinely outperform shallow wide ones with the same total parameter count.

Lu et al. (2017), in "The Expressive Power of Neural Networks: A View from the Width", established a complementary result for ReLU networks. Width-(n+4) ReLU networks, where n is the input dimension, are universal approximators, and there exist functions representable by wide shallow networks that any narrow network of polynomial depth cannot match. Their broader conclusion is that depth is more effective than width, but width is not negligible [13].
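To make the role of width concrete, the sketch below hand-constructs (rather than trains) a one-hidden-layer ReLU network that linearly interpolates a one-dimensional target on an interval. It is only an illustration of how more hidden nodes buy a tighter fit, not the construction used in any of the cited proofs; the function names and the target `x * x` are assumptions of this example.

```python
def relu(z):
    return max(0.0, z)

def build_relu_interpolator(f, a, b, width):
    """Return (knots, out_weights, out_bias) for a network with `width` hidden
    ReLU nodes that interpolates f at width + 1 evenly spaced points on [a, b]."""
    h = (b - a) / width
    knots = [a + k * h for k in range(width)]  # hidden node k computes relu(x - knot_k)
    slopes = [(f(a + (k + 1) * h) - f(a + k * h)) / h for k in range(width)]
    # Each output weight is the change in slope contributed at that knot.
    out_weights = [slopes[0]] + [slopes[k] - slopes[k - 1] for k in range(1, width)]
    return knots, out_weights, f(a)

def network(x, knots, out_weights, out_bias):
    return out_bias + sum(c * relu(x - t) for c, t in zip(out_weights, knots))

def max_error(f, a, b, width, samples=1000):
    params = build_relu_interpolator(f, a, b, width)
    xs = [a + (b - a) * i / samples for i in range(samples + 1)]
    return max(abs(f(x) - network(x, *params)) for x in xs)

target = lambda x: x * x
for width in (4, 16, 64):
    # The worst-case error shrinks as hidden nodes are added.
    print(width, max_error(target, 0.0, 1.0, width))
```

The error falls as width grows, but each halving of the spacing doubles the node count, which hints at why the theorem guarantees existence without promising efficiency.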

Practical considerations:

- Too few nodes per layer leads to underfitting: the model lacks the capacity to represent the target function.
- Too many nodes can lead to overfitting, wastes compute, and slows training. Regularization techniques such as [dropout](/wiki/dropout) (Srivastava et al., 2014) deliberately disable random subsets of nodes during training to reduce co-adaptation [10].
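The dropout idea mentioned above can be sketched as follows. This example assumes the "inverted" variant common in frameworks, which rescales surviving activations during training; the original paper instead rescales weights at test time. The function name `dropout` and its arguments are choices made for this sketch.

```python
import random

def dropout(activations, p_drop, rng, training=True):
    """Inverted dropout: zero each node's activation with probability p_drop
    and scale the survivors by 1 / (1 - p_drop) so the expected value is unchanged."""
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1)")
    if not training or p_drop == 0.0:
        return list(activations)  # at inference time every node stays active
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # seeded only so the sketch is repeatable
hidden = [1.0] * 10
print(dropout(hidden, 0.5, rng))                  # a mix of 0.0 and 2.0
print(dropout(hidden, 0.5, rng, training=False))  # unchanged
```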

## How are nodes connected? (network topology)

How nodes connect to each other defines the network topology. The dominant patterns are:

- Fully connected (also called [dense](/wiki/dense_layer) or [fully connected](/wiki/fully_connected_layer)). Every node in layer L is connected to every node in layer L+1. With n nodes in layer L and m nodes in layer L+1, the cost is O(n*m) parameters per layer pair.
- Sparse. Only a subset of possible connections exists. Used in efficient architectures and in [mixture-of-experts](/wiki/mixture_of_experts) layers.
- Convolutional. Nodes in the next layer connect only to a small spatial neighborhood of the previous layer, with weights shared across positions. Drastically reduces parameter count and makes the layer translation-equivariant: shifting the input shifts the output in the same way.
- Recurrent. Nodes have connections back to earlier time steps, giving the network memory of past inputs.
- Skip connections (residual). Nodes in layer L+k receive a shortcut copy of activations from layer L. [ResNet](/wiki/resnet) (He et al., 2016) showed this lets networks scale to hundreds of layers without losing trainability [11].
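The parameter savings from weight sharing are easy to quantify. The sketch below compares a fully connected layer with a convolutional layer using standard parameter-count formulas; the image size, channel counts, and kernel size are hypothetical values picked for illustration.

```python
def dense_params(n_in, n_out, bias=True):
    """One weight per (input node, output node) pair, plus one bias per output node."""
    return n_in * n_out + (n_out if bias else 0)

def conv2d_params(c_in, c_out, kernel, bias=True):
    """One kernel x kernel x c_in weight block per filter, shared across all positions."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)

print(dense_params(784, 512))  # 401920

# Hypothetical case: map a 32x32 RGB image to a 32x32 map with 16 channels.
h = w = 32
print(dense_params(h * w * 3, h * w * 16))  # 50348032
print(conv2d_params(3, 16, kernel=3))       # 448
```

The convolutional count does not depend on the image size at all, because the same filter weights are reused at every spatial position.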

## How many nodes do modern models have?

Modern foundation models contain enormous numbers of nodes.

| Model | Hidden width | Total parameters | Source |
| --- | --- | --- | --- |
| MNIST classifier (1998) | 100-300 | ~200K | LeCun et al. |
| AlexNet (2012) | 4096 in FC layers | 60M | Krizhevsky et al. |
| BERT base (2018) | 768 | 110M | Devlin et al. |
| BERT large (2018) | 1024 | 340M | Devlin et al. |
| GPT-3 (2020) | 12,288 | 175B | Brown et al. |
| GPT-3 FFN inner | 49,152 | (4x hidden) | Brown et al. |

The number of nodes inside a transformer FFN block is typically four times the model's hidden dimension. Attention heads can be counted as analogous "nodes" too, although that tally is much smaller: GPT-3 has 96 heads per layer over 96 layers, or 9,216 attention heads in total.
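The GPT-3 figures above can be reproduced with simple arithmetic. The sketch below uses only the numbers quoted in this section, plus one assumption that is not from this article: the common rule of thumb of roughly `12 * d_model**2` weights per transformer layer (attention plus FFN, ignoring embeddings, biases, and layer norms).

```python
d_model = 12_288       # hidden width
n_layers = 96
heads_per_layer = 96
ffn_expansion = 4      # FFN inner width as a multiple of d_model

ffn_width = ffn_expansion * d_model        # nodes in one FFN block
total_ffn_nodes = n_layers * ffn_width     # FFN nodes across the whole model
total_heads = n_layers * heads_per_layer
head_dim = d_model // heads_per_layer

# Rule-of-thumb estimate (an assumption of this sketch, not an exact count).
approx_params = 12 * n_layers * d_model ** 2

print(ffn_width)        # 49152
print(total_ffn_nodes)  # 4718592
print(total_heads)      # 9216
print(head_dim)         # 128
print(approx_params)    # 173946175488, close to the quoted 175B
```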

## Does one node represent one concept?

A long-standing puzzle is whether a single node represents a single human-meaningful concept. The answer turned out to be no, in general. Most nodes in trained language models are **[polysemantic](/wiki/polysemanticity)**: a given neuron will activate for several unrelated concepts, such as quotes from US presidents and the Hebrew alphabet. This is a consequence of **[superposition](/wiki/superposition)**, the idea that networks pack more features than they have neurons by storing them in nearly orthogonal directions in activation space.

Elhage et al. (2022), in "Toy Models of Superposition" at Anthropic, formalized this. As the paper puts it, "Neural networks often pack many unrelated concepts into a single neuron," a puzzle that arises from "models storing additional sparse features in 'superposition'" [17]. Superposition is thus a strategy the network learns in order to use its available width efficiently. Bricken et al. (2023), in "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning", trained a [sparse autoencoder](/wiki/sparse_autoencoder) on the activations of a one-layer transformer and decomposed a 512-node layer into more than 4,000 separate features, each of which is much more interpretable than the underlying neurons [18]. A follow-up from the same group, "Scaling Monosemanticity" (Templeton et al., 2024), extended the technique to [Claude 3](/wiki/claude_3) Sonnet, recovering features ranging from concrete (the Golden Gate Bridge) to abstract (deception, sycophancy) [19].

This line of work has practical implications. Pruning and editing whole nodes is much less precise than editing the underlying features, which is one reason [mechanistic interpretability](/wiki/mechanistic_interpretability) has become a major research direction.
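Superposition can be illustrated with a tiny hand-built example that stores three features in a layer of only two nodes. This is a sketch loosely inspired by the ReLU read-out used in toy models of superposition; the feature directions are chosen by hand (120 degrees apart) rather than learned, and the names `encode` and `decode` are assumptions of this example.

```python
import math

def relu(z):
    return max(0.0, z)

# Three feature directions in a 2-node activation space, 120 degrees apart.
angles = [0.0, 2 * math.pi / 3, 4 * math.pi / 3]
W = [(math.cos(t), math.sin(t)) for t in angles]

def encode(features):
    """Project a 3-feature vector down to 2 node activations."""
    return [sum(f * W[i][d] for i, f in enumerate(features)) for d in range(2)]

def decode(hidden):
    """Read each feature back out with a dot product followed by ReLU."""
    return [relu(sum(h * wd for h, wd in zip(hidden, W[i]))) for i in range(3)]

# One active feature: ReLU removes the negative interference, so recovery is clean.
print([round(v, 3) for v in decode(encode([0.0, 1.0, 0.0]))])  # approximately [0, 1, 0]

# Two active features at once: they interfere and each is read back at about half strength.
print([round(v, 3) for v in decode(encode([1.0, 1.0, 0.0]))])  # approximately [0.5, 0.5, 0]
```

Both nodes respond to all three features, so each node is polysemantic, yet the features remain recoverable as long as they are rarely active together.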

## Can nodes be removed? (pruning)

Not every node in a trained network earns its keep. **Structured pruning** removes entire nodes (or filters or attention heads) along with their weights, producing a smaller dense network that runs faster on standard hardware. Head pruning (Michel et al., 2019) [15] is a well-studied structured variant. Magnitude-based pruning and lottery-ticket pruning (Frankle and Carbin, 2019) [14] are equally well studied, but in their usual form they remove individual weights rather than whole nodes, leaving a sparse network of the original shape. Empirically, transformer models often retain most of their accuracy after 30-50% of attention heads are removed, which suggests substantial redundancy at the head level; the title of Michel et al.'s study, "Are Sixteen Heads Really Better than One?", captures the finding [15].
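Removing a whole node means deleting one row of the incoming weight matrix, its bias, and the matching column of the outgoing weight matrix. The sketch below ranks hidden nodes by the L2 norm of their incoming weights, which is one simple heuristic assumed here for illustration and not the criterion of the cited papers; the function name and weights are hypothetical.

```python
import math

def prune_hidden_nodes(W1, b1, W2, keep):
    """Keep the `keep` hidden nodes with the largest incoming-weight L2 norm.

    W1: hidden x inputs, b1: one bias per hidden node, W2: outputs x hidden.
    Returns the smaller (W1, b1, W2) and the indices of the surviving nodes.
    """
    norms = [math.sqrt(sum(w * w for w in row)) for row in W1]
    ranked = sorted(range(len(W1)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original node order
    W1_pruned = [W1[i] for i in kept]                     # drop rows
    b1_pruned = [b1[i] for i in kept]
    W2_pruned = [[row[i] for i in kept] for row in W2]    # drop matching columns
    return W1_pruned, b1_pruned, W2_pruned, kept

W1 = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.4]]  # 3 hidden nodes, 2 inputs
b1 = [0.1, 0.0, -0.2]
W2 = [[1.0, 2.0, 3.0]]                        # 1 output node

W1_p, b1_p, W2_p, kept = prune_hidden_nodes(W1, b1, W2, keep=2)
print(kept)   # [0, 2]
print(W2_p)   # [[1.0, 3.0]]
```

The result is still an ordinary dense network, only narrower, which is why structured pruning speeds up inference on standard hardware.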

## What else does "node" mean in machine learning?

The word "node" appears elsewhere in machine learning with different meanings, which causes regular confusion.

| Context | What "node" means |
| --- | --- |
| Feedforward layer | A single unit with weights, bias, and activation. The meaning used throughout this article. |
| TensorFlow computation graph | An [operation](/wiki/operation) (op) in the [computational graph](/wiki/computational_graph), such as `MatMul` or `Relu`. Edges carry [tensors](/wiki/tensor) between op nodes. |
| PyTorch autograd graph | A `Function` object that produced a tensor. Used during the backward pass. |
| Graph neural network (GNN) | A vertex in the input graph (a person in a social network, an atom in a molecule). The node feature vector is updated by aggregating messages from neighboring nodes. Not the same as a hidden unit. |
| Decision tree | A branching point in the [decision tree](/wiki/decision_tree), also called a split node. |
| Distributed training | A physical machine in a cluster (e.g., a TPU node or a compute node in a Kubernetes cluster). |

When reading a paper or codebase, it is worth checking which of these meanings is intended.

## How do you create nodes in PyTorch and TensorFlow?

In PyTorch, the line

```python
import torch
layer = torch.nn.Linear(in_features=784, out_features=512)
```

creates a layer with 784 input nodes and 512 output nodes, allocating a 512x784 weight matrix and a length-512 bias vector [20]. Stacking another `nn.Linear(512, 10)` on top makes the 512 outputs become the 512 inputs of the next layer, and yields 10 output nodes.

In Keras / TensorFlow:

```python
import tensorflow as tf
layer = tf.keras.layers.Dense(units=512, activation='relu')
```

Here, `units=512` is the number of output nodes; the input width is inferred from the previous layer [21].

A convolutional layer specifies the number of output filters, which plays the same role as `units` for fully connected layers. Each filter corresponds to one output channel and behaves like a node with shared spatial weights.
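The sense in which a filter "behaves like a node with shared spatial weights" can be shown with a one-dimensional simplification in plain Python. The helper `conv1d_filter` and the example signal are assumptions of this sketch; it uses no padding and stride 1, and computes cross-correlation, as deep learning frameworks conventionally do.

```python
def conv1d_filter(signal, kernel, bias=0.0):
    """Apply one filter at every position of a 1-D signal.

    The same weights (kernel) and bias are reused at each position, so the
    filter acts like a single node replicated across the input.
    """
    k = len(kernel)
    return [
        sum(w * signal[i + j] for j, w in enumerate(kernel)) + bias
        for i in range(len(signal) - k + 1)
    ]

signal = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
edge_filter = [-1.0, 1.0]  # responds to a rising edge, wherever it occurs
print(conv1d_filter(signal, edge_filter))  # [0.0, 1.0, 0.0, 0.0, -1.0]
```

The output list is the one-dimensional analog of a single output channel: one filter, two weights, five positions.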

## Summary

A node in a neural network is the workhorse of the whole system: a small linear function followed by a nonlinearity. Everything from the McCulloch-Pitts threshold unit of 1943 to the 49,152-wide feedforward sublayers of GPT-3 is built from this primitive. The interesting questions today are no longer how a single node works, but how millions of them combine: which features they store, how those features superpose, and how to read them back out. The basic compute element is unchanged, but its scale and the techniques used to interpret it have transformed beyond anything Rosenblatt would have recognized.

## References

1. McCulloch, W. S., & Pitts, W. (1943). "A Logical Calculus of the Ideas Immanent in Nervous Activity." *Bulletin of Mathematical Biophysics*, 5(4), 115-133. https://link.springer.com/article/10.1007/BF02478259
2. Rosenblatt, F. (1958). "The perceptron: a probabilistic model for information storage and organization in the brain." *Psychological Review*, 65(6), 386-408. https://psycnet.apa.org/record/1959-09865-001
3. Minsky, M., & Papert, S. (1969). *Perceptrons: An Introduction to Computational Geometry*. MIT Press.
4. Hopfield, J. J. (1982). "Neural networks and physical systems with emergent collective computational abilities." *Proceedings of the National Academy of Sciences*, 79(8), 2554-2558. https://www.pnas.org/doi/10.1073/pnas.79.8.2554
5. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). "Learning representations by back-propagating errors." *Nature*, 323(6088), 533-536.
6. Cybenko, G. (1989). "Approximation by superpositions of a sigmoidal function." *Mathematics of Control, Signals, and Systems*, 2(4), 303-314. https://link.springer.com/article/10.1007/BF02551274
7. Hornik, K., Stinchcombe, M., & White, H. (1989). "Multilayer feedforward networks are universal approximators." *Neural Networks*, 2(5), 359-366.
8. Hornik, K. (1991). "Approximation capabilities of multilayer feedforward networks." *Neural Networks*, 4(2), 251-257.
9. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). "ImageNet Classification with Deep Convolutional Neural Networks." *NeurIPS*.
10. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." *Journal of Machine Learning Research*, 15, 1929-1958.
11. He, K., Zhang, X., Ren, S., & Sun, J. (2016). "Deep Residual Learning for Image Recognition." *CVPR*.
12. Vaswani, A. et al. (2017). "Attention Is All You Need." *NeurIPS*.
13. Lu, Z., Pu, H., Wang, F., Hu, Z., & Wang, L. (2017). "The Expressive Power of Neural Networks: A View from the Width." *NeurIPS*. https://arxiv.org/abs/1709.02540
14. Frankle, J., & Carbin, M. (2019). "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks." *ICLR*.
15. Michel, P., Levy, O., & Neubig, G. (2019). "Are Sixteen Heads Really Better than One?" *NeurIPS*.
16. Brown, T. B. et al. (2020). "Language Models are Few-Shot Learners." *NeurIPS*. (GPT-3 paper, hidden dim 12,288.)
17. Elhage, N. et al. (2022). "Toy Models of Superposition." Anthropic. https://transformer-circuits.pub/2022/toy_model/
18. Bricken, T. et al. (2023). "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning." Anthropic. https://transformer-circuits.pub/2023/monosemantic-features
19. Templeton, A. et al. (2024). "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." Anthropic. https://transformer-circuits.pub/2024/scaling-monosemanticity/
20. PyTorch documentation, `torch.nn.Linear`. https://docs.pytorch.org/docs/stable/generated/torch.nn.Linear.html
21. TensorFlow documentation, `tf.keras.layers.Dense`. https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense
22. New York Times. (1958, July 8). "New Navy Device Learns By Doing; Psychologist Shows Embryo of Computer Designed to Read and Grow Wiser." https://www.nytimes.com/1958/07/08/archives/new-navy-device-learns-by-doing-psychologist-shows-embryo-of.html
23. Hendrycks, D., & Gimpel, K. (2016). "Gaussian Error Linear Units (GELUs)." arXiv:1606.08415. https://arxiv.org/abs/1606.08415
24. Ramachandran, P., Zoph, B., & Le, Q. V. (2017). "Searching for Activation Functions." arXiv:1710.05941. https://arxiv.org/abs/1710.05941
25. Elfwing, S., Uchibe, E., & Doya, K. (2018). "Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning." *Neural Networks*, 107, 3-11. https://arxiv.org/abs/1702.03118
26. He, K., Zhang, X., Ren, S., & Sun, J. (2015). "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." *ICCV*. https://arxiv.org/abs/1502.01852