# Node (neural network)

> Source: https://aiwiki.ai/wiki/node_neural_network
> Updated: 2026-06-02
> Categories: Model Architecture, Neural Networks
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

In a [neural network](/wiki/neural_network), a **node** is the basic computational element that receives one or more inputs, computes a weighted sum, adds a bias, and passes the result through a nonlinear [activation function](/wiki/activation_function) to produce an output. The same object goes by several other names: [neuron](/wiki/neuron), unit, processing element, and (historically) [perceptron](/wiki/perceptron). All of these refer to the same simple compute primitive that, when stacked into layers and wired together by weighted connections, yields the deep learning systems that power modern artificial intelligence.

The word "node" can be ambiguous. In the context of feedforward layers, it almost always means a single unit. In a graph neural network or a TensorFlow computation graph it means something quite different. The disambiguation section below covers those cases.

## mathematical definition

A single node receives a vector of inputs `x = (x_1, x_2, ..., x_n)` along with a corresponding vector of learned weights `w = (w_1, w_2, ..., w_n)` and a scalar [bias](/wiki/bias_math_or_bias_term) term `b`. The node's output `y` is given by:

```
y = sigma(sum_i w_i * x_i + b)
```

where `sigma` is the activation function. The inner sum (the pre-activation) is also called the net input; at an output layer it is often called the logit. Geometrically, the set of inputs where the pre-activation equals zero is a hyperplane in input space; the activation function then turns that linear decision surface into a nonlinear response the layer above can use.
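
The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a framework implementation: the function names `sigmoid` and `node_output` and the numeric values are assumptions chosen so the arithmetic is easy to follow.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic activation: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def node_output(x, w, b, activation=sigmoid):
    """Compute activation(sum_i w_i * x_i + b) for a single node."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    pre_activation = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return activation(pre_activation)


# Illustrative values only (not taken from any trained model).
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.0]
b = 0.0

# Pre-activation: 0.5*1 - 0.25*2 + 0*3 + 0 = 0, and sigmoid(0) = 0.5.
y = node_output(x, w, b)
```

Swapping `activation` for another function changes the node's nonlinearity without touching the weighted sum.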

A layer of `m` nodes connected to `n` inputs is conventionally written in matrix form as `y = sigma(W x + b)`, where `W` is an `m x n` weight matrix and `b` is a length-`m` bias vector. The affine part, `W x + b`, is what `nn.Linear(n, m)` builds in [PyTorch](/wiki/pytorch) and what `tf.keras.layers.Dense(m)` builds in [TensorFlow](/wiki/tensorflow); the activation is applied as a separate module in PyTorch or through the `activation` argument of `Dense` in Keras. The argument `m` is the number of output nodes, also known as the layer's width.
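
As a rough sketch of the matrix form, the following plain-Python function applies `m` nodes to the same input vector. The name `dense_forward`, the choice of ReLU, and the small weight matrix are illustrative assumptions; real frameworks use optimized tensor kernels instead of nested loops.

```python
def relu(z: float) -> float:
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)


def dense_forward(W, b, x, activation=relu):
    """Apply y = activation(W x + b), with W given as m rows of length n."""
    if len(W) != len(b):
        raise ValueError("W must have one row per bias entry")
    outputs = []
    for row, b_i in zip(W, b):
        if len(row) != len(x):
            raise ValueError("each row of W must match the input length")
        outputs.append(activation(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i))
    return outputs


# Illustrative 3-node layer on a 2-dimensional input (m = 3, n = 2).
W = [[1.0, -1.0],
     [0.5, 0.5],
     [-2.0, 0.0]]
b = [0.0, 1.0, 1.0]
x = [2.0, 1.0]

# Pre-activations are 1.0, 2.5 and -3.0; ReLU zeroes the last one.
y = dense_forward(W, b, x)
```

Each row of `W` together with one entry of `b` corresponds to one node, so the layer's width equals the number of rows.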

## historical origins

The modern node has a long lineage. The table below lists the main milestones.

| Year | Model | Contribution |
| --- | --- | --- |
| 1943 | McCulloch-Pitts neuron | First mathematical model of a neuron. Inputs are binary, the unit fires if the weighted sum crosses a threshold, and a single inhibitory input can veto the output. McCulloch and Pitts showed that networks of these units could compute any logical function. |
| 1958 | Rosenblatt's perceptron | First learnable single-layer node. Introduced an iterative weight-update rule that converges on linearly separable problems. Published in *Psychological Review* under the title "The perceptron: a probabilistic model for information storage and organization in the brain". |
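| 1969 | Minsky and Papert's *Perceptrons* | Proved that a single-layer perceptron cannot represent XOR or any other function that is not linearly separable. The book is widely associated with a sharp decline in neural network research funding that lasted over a decade. |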
| 1982 | [Hopfield network](/wiki/hopfield_network) | Recurrent net of binary nodes with symmetric weights, demonstrating that simple identical units can produce content-addressable memory through emergent collective behavior. |
| 1986 | Backpropagation rediscovered | Rumelhart, Hinton, and Williams popularized training multi-layer networks of differentiable nodes by gradient descent. This unlocked the hidden layer. |
| 1989-1991 | Universal approximation | Cybenko (1989), Hornik, Stinchcombe and White (1989), and Hornik (1991) proved that a single hidden layer of enough nodes with a suitable nonlinear activation (sigmoidal in Cybenko's proof, bounded and non-constant in Hornik's 1991 result) can approximate any continuous function on a compact set. |
| 2012 | AlexNet | Demonstrated that deep convolutional networks of [ReLU](/wiki/rectified_linear_unit_relu) nodes trained on GPUs could win ImageNet by a wide margin, kicking off the deep learning era. |
| 2017 | Transformer | The attention block reframed where the action lives: nodes sit inside the position-wise feedforward sublayers, while [attention heads](/wiki/attention_head) handle mixing across positions. |

The McCulloch-Pitts paper ("A Logical Calculus of the Ideas Immanent in Nervous Activity", *Bulletin of Mathematical Biophysics*, 1943) is usually cited as the origin of the artificial neuron [1]. Rosenblatt's perceptron added the missing ingredient, learning, and was demonstrated on the Mark I Perceptron, a custom analog machine built at Cornell Aeronautical Laboratory [2].

## terminology and synonyms

Different communities and time periods favor different words for the same object. The choice of word is largely cosmetic.

| Term | Where you see it |
| --- | --- |
| Neuron | Most common in deep learning textbooks and biology-flavored writing. |
| Unit | Common in older connectionist literature and in the [hidden layer](/wiki/hidden_layer) / output layer naming conventions. |
| Node | Common when describing network topology, especially in diagrams. |
| Perceptron | Historical. Today usually refers to a single-node binary classifier or to a [multilayer perceptron](/wiki/perceptron) (MLP). |
| Processing element (PE) | Older systems engineering literature, especially around analog implementations. |
| Cell | Used for nodes with internal state, such as the LSTM cell. |

In this article, "node" and "neuron" are used interchangeably.

## roles within a network

Nodes are organized into layers. Their role depends on the layer they sit in.

| Role | Description |
| --- | --- |
| Input node | Holds a single component of the input vector. Has no weights and no activation; it just delivers a feature value to the next layer. |
| Hidden node | Sits in a [hidden layer](/wiki/hidden_layer) and learns to combine inputs from the layer below into intermediate representations. The expressive power of a network comes from its hidden nodes. |
| Output node | Produces the final prediction. The activation differs by task: linear for regression, sigmoid for binary classification, [softmax](/wiki/softmax) for multiclass. |
| Bias node | A constant input (almost always equal to 1) connected through its own learned weight. Lets the activation surface shift away from the origin. Most modern frameworks fold the bias into the linear layer rather than drawing it as a separate node. |
| Convolutional filter | A node-like unit in a [convolutional neural network](/wiki/convolutional_neural_network) whose weights are shared across all spatial positions. One filter produces one channel of the output feature map. |
| LSTM cell | A composite node with internal memory, used in a [long short-term memory](/wiki/long_short-term_memory_lstm) network. Each cell contains its own input, [forget](/wiki/forget_gate), and output gates, each of which is itself made of standard sigmoid nodes. |
| Attention head | The [transformer](/wiki/transformer) analog. Not a single scalar node, but a block that attends across positions and produces a vector. Heads are sometimes called "node-like" components when describing model width. |

## activation functions

Without a nonlinear activation, a stack of nodes collapses into a single linear (affine) map, so the network can represent nothing beyond what one linear layer already can. Different activations have dominated different eras.

| Activation | Formula | Range | Notes |
| --- | --- | --- | --- |
| Step / threshold | `1 if z > 0 else 0` | {0, 1} | The original McCulloch-Pitts and perceptron activation. Its derivative is zero everywhere except at the threshold, where it is undefined, so it gives gradient descent no signal. |
| [Sigmoid](/wiki/sigmoid_function) | `1 / (1 + exp(-z))` | (0, 1) | Smooth and bounded. Standard in shallow nets through the 1990s. Saturates and causes the vanishing gradient problem in deep nets. |
| Tanh | `(exp(z) - exp(-z)) / (exp(z) + exp(-z))` | (-1, 1) | Zero-centered version of sigmoid. Preferred in older hidden layers. |
| [ReLU](/wiki/relu) | `max(0, z)` | [0, inf) | Cheap, doesn't saturate for positive inputs, made deep networks practical from 2012 onward. The default for hidden layers in CNNs and MLPs. |
| Leaky ReLU | `max(alpha * z, z)`, small `alpha` | (-inf, inf) | Fixes the "dying ReLU" problem by keeping a small slope for negative inputs. |
| PReLU | Same as Leaky ReLU but `alpha` is learned | (-inf, inf) | Introduced by He et al. 2015 for ImageNet. |
| GELU | `z * Phi(z)` where `Phi` is the standard Gaussian CDF | approx. [-0.17, inf) | Smooth ReLU variant. Used in BERT, GPT-2, GPT-3, and many later transformers. |
| Swish / SiLU | `z * sigmoid(z)` | approx. [-0.28, inf) | Identified by an automated search over activation functions. Used in EfficientNet and several LLMs. |
| Softmax | `exp(z_i) / sum_j exp(z_j)` | (0, 1) per output, sums to 1 | Output activation for multiclass classification. Operates on a whole vector of nodes, not on each one independently. |
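
The formulas in the table translate directly into code. The sketch below uses only the Python standard library; the function names, the default `alpha=0.01` for Leaky ReLU, and the numerically stable forms of sigmoid and softmax are choices made for this illustration, not something the table prescribes.

```python
import math


def step(z: float) -> float:
    return 1.0 if z > 0 else 0.0


def sigmoid(z: float) -> float:
    # Branch on the sign so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def tanh(z: float) -> float:
    return math.tanh(z)


def relu(z: float) -> float:
    return max(0.0, z)


def leaky_relu(z: float, alpha: float = 0.01) -> float:
    # alpha = 0.01 is an assumed default; any small positive slope works.
    return max(alpha * z, z)


def gelu(z: float) -> float:
    # Phi(z), the standard Gaussian CDF, written with the error function.
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def silu(z: float) -> float:
    return z * sigmoid(z)


def softmax(zs):
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # one probability per output node
```

Unlike the other functions, `softmax` takes the whole vector of pre-activations, which reflects the table's note that it does not act on each node independently.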

The choice of activation often affects training dynamics (how fast and how reliably the network converges) more than it affects final accuracy. The shift from sigmoid to ReLU around 2010-2012 is widely credited as one of the practical breakthroughs that enabled deep learning [9].
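
The claim at the start of this section, that nodes without a nonlinearity collapse into a single linear map, can be checked numerically. The sketch below is a minimal illustration with made-up weights and hypothetical helper names: it stacks two activation-free layers and shows that one merged layer, with weights `W2 W1` and bias `W2 b1 + b2`, gives the same output.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * c for a, c in zip(row, col)) for col in cols] for row in A]


def affine(W, b, x):
    """Compute W x + b with no activation function."""
    return [s + b_i for s, b_i in zip(matvec(W, x), b)]


# Two layers with illustrative weights and no nonlinearity between them.
W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [1.0, 0.0]
W2, b2 = [[2.0, 1.0]], [-1.0]

# The equivalent single layer: W = W2 W1 and b = W2 b1 + b2.
W_merged = matmul(W2, W1)
b_merged = affine(W2, b2, b1)

x = [3.0, -2.0]
stacked = affine(W2, b2, affine(W1, b1, x))
merged = affine(W_merged, b_merged, x)
```

Here `stacked` and `merged` agree for any input `x`, which is why depth only adds expressive power once a nonlinear activation sits between the layers.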

## layer width and depth

The number of nodes in a layer is called the layer's width. The number of layers is the network's depth. Both matter, and they are not interchangeable.

Common widths in modern architectures cluster around powers of two: 64, 128, 256, 512, 1024, 2048, 4096. Hidden layers in BERT base have width 768. The hidden state in GPT-3 has width 12,288, with feedforward sublayers expanded to 4 times that, or 49,152, before being projected back down [16]. Width determines how many features can be represented in parallel at each layer.

The **universal approximation theorem** (Cybenko, 1989; Hornik, 1991) says that a feedforward network with one hidden layer of finite but possibly very large width and a suitable nonlinear activation (sigmoidal in Cybenko's proof, bounded and non-constant in Hornik's; later work relaxed the condition to any non-polynomial activation) can approximate any continuous function on a compact set to arbitrary accuracy [6][8]. The theorem is an existence proof. It does not say the required width is small or that the weights are easy to find. In practice, depth is often more parameter-efficient than width: for many tasks, deeper networks outperform shallow wide ones with the same total parameter count.

Lu et al. (2017), in "The Expressive Power of Neural Networks: A View from the Width", established a complementary result for ReLU networks. Width-(n+4) ReLU networks, where n is the input dimension, are universal approximators, and there exist functions representable by wide shallow networks that any narrow network of polynomial depth cannot match. Their broader conclusion is that depth is more effective than width, but width is not negligible [13].

Practical considerations:

- Too few nodes per layer underfits. The model lacks the capacity to represent the target function.
- Too many nodes can overfit, waste compute, and slow training. Regularization techniques such as [dropout](/wiki/dropout) (Srivastava et al., 2014) deliberately disable random subsets of nodes during training to reduce co-adaptation [10].
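
Dropout, mentioned in the last point, can be sketched as follows. This is an illustrative version, not the paper's exact procedure: it uses the "inverted" variant common in modern frameworks, which rescales the surviving activations during training, whereas the original paper rescales weights at test time. The function name, the drop probability of 0.5, and the fixed seed are assumptions.

```python
import random


def dropout(activations, p_drop, rng, training=True):
    """Zero each activation with probability p_drop; rescale the survivors.

    Inverted-dropout sketch: dividing kept values by (1 - p_drop) keeps the
    expected activation unchanged, so nothing needs rescaling at inference.
    """
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1)")
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)          # fixed seed so the example is repeatable
hidden = [1.0] * 1000           # stand-in for one layer's activations

dropped = dropout(hidden, p_drop=0.5, rng=rng)                     # training mode
unchanged = dropout(hidden, p_drop=0.5, rng=rng, training=False)   # inference mode
```

In training mode roughly half of the nodes output zero on each call and the rest are doubled; in inference mode the activations pass through untouched.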

## network topology

How nodes connect to each other defines the network topology. The dominant patterns are:

- [Fully connected](/wiki/fully_connected_layer) (also called [dense](/wiki/dense_layer)). Every node in layer L is connected to every node in layer L+1. With n nodes in layer L and m nodes in layer L+1, the cost is O(n*m) parameters per layer pair.
- Sparse. Only a subset of possible connections exists. Used in efficient architectures and in mixture-of-experts layers.
- Convolutional. Nodes in the next layer connect only to a small spatial neighborhood of the previous layer, with weights shared across positions. Drastically reduces parameter count and makes the layer translation-equivariant: shifting the input shifts the feature map by the same amount.
- Recurrent. Nodes have connections back to earlier time steps, giving the network memory of past inputs.
- Skip connections (residual). Nodes in layer L+k receive a shortcut copy of activations from layer L. ResNet (He et al., 2016) showed this lets networks scale to hundreds of layers without losing trainability [11].
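
To make the parameter savings of weight sharing concrete, the sketch below counts parameters for a fully connected layer and a 2D convolutional layer. The helper names and the example sizes (a 28x28 single-channel input, 512 dense nodes, 32 filters of size 3x3) are assumptions picked for illustration.

```python
def dense_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """One weight per (input, output) pair, plus one bias per output node."""
    return n_in * n_out + (n_out if bias else 0)


def conv2d_params(c_in: int, c_out: int, k_h: int, k_w: int, bias: bool = True) -> int:
    """One k_h x k_w kernel per (input channel, filter) pair, shared across
    every spatial position, plus one bias per filter."""
    return c_in * c_out * k_h * k_w + (c_out if bias else 0)


# A 28x28 grayscale image flattened to 784 inputs, feeding 512 dense nodes.
fc = dense_params(784, 512)

# The same image fed to 32 convolutional filters of size 3x3.
conv = conv2d_params(c_in=1, c_out=32, k_h=3, k_w=3)
```

With these numbers the dense layer needs 401,920 parameters while the convolutional layer needs 320, because the convolutional count does not depend on the image size.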

## node count in modern systems

Modern foundation models contain enormous numbers of nodes.

| Model | Hidden width | Total parameters | Source |
| --- | --- | --- | --- |
| MNIST classifier (1998) | 100-300 | ~200K | LeCun et al. |
| AlexNet (2012) | 4096 in FC layers | 60M | Krizhevsky et al. |
| BERT base (2018) | 768 | 110M | Devlin et al. |
| BERT large (2018) | 1024 | 340M | Devlin et al. |
| GPT-3 (2020) | 12,288 | 175B | Brown et al. |
| GPT-3 FFN inner | 49,152 | (4x hidden) | Brown et al. |

The number of nodes inside a transformer FFN block is typically four times the model's hidden dimension. Attention heads can be counted separately as analog "nodes": GPT-3 has 96 heads per layer over 96 layers, or 9,216 attention heads in total.
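
The GPT-3 figures quoted above follow from a few multiplications, reproduced here as a sketch. The constants come from the table and text; the variable names are made up, the per-head width assumes the hidden dimension is split evenly across heads, and the weight count ignores biases.

```python
D_MODEL = 12_288    # hidden width
N_LAYERS = 96
N_HEADS = 96        # attention heads per layer
FFN_MULT = 4        # FFN inner width as a multiple of the hidden width

ffn_inner = FFN_MULT * D_MODEL              # nodes in one FFN block
total_ffn_nodes = ffn_inner * N_LAYERS      # FFN nodes across all layers
total_heads = N_HEADS * N_LAYERS            # attention heads across all layers
head_dim = D_MODEL // N_HEADS               # width per head, assuming an even split

# Weights in one FFN block: an up-projection and a down-projection.
ffn_weights_per_block = 2 * D_MODEL * ffn_inner

print(f"FFN inner width:      {ffn_inner:,}")
print(f"FFN nodes, all layers: {total_ffn_nodes:,}")
print(f"Attention heads:       {total_heads:,}")
print(f"FFN weights per block: {ffn_weights_per_block:,}")
```

The head count (9,216) is far smaller than the FFN node count summed over all layers (about 4.7 million), so the two are best read as separate measures of width.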

## interpretability and superposition

A long-standing puzzle is whether a single node represents a single human-meaningful concept. The answer turned out to be no, in general. Most nodes in trained language models are **polysemantic**: a given neuron will activate for several unrelated concepts, such as quotes from US presidents and the Hebrew alphabet. This is a consequence of **superposition**, the idea that networks pack more features than they have neurons by storing them in nearly orthogonal directions in activation space.

Elhage et al. (2022), in "Toy Models of Superposition" at Anthropic, formalized this and showed that superposition is a strategy networks learn in order to use available width efficiently [17]. Bricken et al. (2023), in "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning", trained a sparse autoencoder on the activations of a one-layer transformer and decomposed a 512-node layer into more than 4,000 separate features, many of which are much more interpretable than the underlying neurons [18]. The follow-up from the same group, "Scaling Monosemanticity" (Templeton et al., 2024), extended the technique to Claude 3 Sonnet, recovering features ranging from concrete (the Golden Gate Bridge) to abstract (deception, sycophancy) [19].
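
A hand-built toy can illustrate how more features than nodes fit into one layer. The sketch below is a simplified construction in the spirit of the toy-model setting, not the trained models from the cited papers: three features are assigned directions 120 degrees apart in a 2-node hidden layer, and each feature is read back with a dot product followed by ReLU. All names and the geometry are assumptions.

```python
import math


def relu(z: float) -> float:
    return max(0.0, z)


N_FEATURES = 3
N_DIMS = 2  # fewer hidden nodes than features

# One unit-length direction per feature, spaced 120 degrees apart.
directions = [
    (math.cos(2 * math.pi * k / N_FEATURES), math.sin(2 * math.pi * k / N_FEATURES))
    for k in range(N_FEATURES)
]


def encode(features):
    """Project a 3-feature vector onto the 2 hidden nodes."""
    return [sum(f * d[i] for f, d in zip(features, directions)) for i in range(N_DIMS)]


def decode(hidden):
    """Read each feature back: dot product with its direction, then ReLU."""
    return [relu(sum(h * d_i for h, d_i in zip(hidden, d))) for d in directions]


# One active feature: the other two see a dot product of -0.5, which ReLU removes.
sparse = decode(encode([1.0, 0.0, 0.0]))

# Two active features at once: they interfere and each is recovered at half strength.
dense = decode(encode([1.0, 1.0, 0.0]))
```

Recovery is clean only when few features are active at a time, which matches the intuition that superposition relies on sparsity; neither hidden node alone corresponds to a single feature.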

This line of work has practical implications. Pruning and editing whole nodes is much less precise than editing the underlying features, which is one reason mechanistic interpretability has become a major research direction.

## node pruning

Not every node in a trained network earns its keep. **Structured pruning** removes entire nodes (or filters or attention heads) along with their weights, producing a smaller dense network that runs faster on standard hardware. Head pruning (Michel et al., 2019) [15] is a well-studied structured variant; magnitude-based pruning and lottery-ticket pruning (Frankle and Carbin, 2019) [14] are closely related methods that usually remove individual weights rather than whole nodes. Empirically, transformer models often retain most of their accuracy after a substantial fraction of attention heads is removed (roughly 20-40% in the experiments of Michel et al.), which suggests substantial redundancy at the head level [15].
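
A minimal sketch of structured pruning on one hidden layer is shown below. The scoring rule (L2 norm of each node's incoming weights) is only one simple magnitude-based criterion, chosen here as an assumption; the function name and the tiny weight matrices are likewise illustrative.

```python
import math


def prune_hidden_nodes(W1, b1, W2, n_keep):
    """Keep the n_keep hidden nodes with the largest incoming-weight norm.

    W1: hidden x input weights, b1: hidden biases, W2: output x hidden weights.
    Removing a node deletes its row in W1, its bias, and its column in W2.
    """
    n_hidden = len(W1)
    if not 0 < n_keep <= n_hidden:
        raise ValueError("n_keep must be between 1 and the number of hidden nodes")
    scores = [math.sqrt(sum(w * w for w in row)) for row in W1]
    ranked = sorted(range(n_hidden), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:n_keep])  # preserve the original node order
    W1_pruned = [W1[i] for i in keep]
    b1_pruned = [b1[i] for i in keep]
    W2_pruned = [[row[i] for i in keep] for row in W2]
    return W1_pruned, b1_pruned, W2_pruned, keep


# Illustrative network: 2 inputs, 3 hidden nodes, 1 output.
W1 = [[0.9, -1.2], [0.01, 0.02], [-0.7, 0.4]]
b1 = [0.1, 0.0, -0.3]
W2 = [[1.0, 0.5, -2.0]]

W1_p, b1_p, W2_p, kept = prune_hidden_nodes(W1, b1, W2, n_keep=2)
```

The result is still a dense network, just with a narrower hidden layer, which is why structured pruning speeds up inference without special sparse kernels.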

## disambiguation: other meanings of "node"

The word "node" appears elsewhere in machine learning with different meanings, which causes regular confusion.

| Context | What "node" means |
| --- | --- |
| Feedforward layer | A single unit with weights, bias, and activation. The meaning used throughout this article. |
| TensorFlow computation graph | An [operation](/wiki/operation) (op) in the dataflow graph, such as `MatMul` or `Relu`. Edges carry tensors between op nodes. |
| PyTorch autograd graph | A `Function` object that produced a tensor. Used during the backward pass. |
| Graph neural network (GNN) | A vertex in the input graph (a person in a social network, an atom in a molecule). The node feature vector is updated by aggregating messages from neighboring nodes. Not the same as a hidden unit. |
| Decision tree | A branching point in the tree, also called a split node. |
| Distributed training | A physical machine in a cluster (e.g., a TPU node or a compute node in a Kubernetes cluster). |

When reading a paper or codebase, it is worth checking which of these meanings is intended.

## implementation in popular frameworks

In PyTorch, the line

```python
import torch
layer = torch.nn.Linear(in_features=784, out_features=512)
```

creates a layer with 784 input nodes and 512 output nodes, allocating a 512x784 weight matrix and a length-512 bias vector [20]. Stacking another `nn.Linear(512, 10)` on top makes the 512 outputs become the 512 inputs of the next layer, and yields 10 output nodes.

In Keras / TensorFlow:

```python
import tensorflow as tf
layer = tf.keras.layers.Dense(units=512, activation='relu')
```

Here `units=512` is the number of output nodes; the input width is inferred from the previous layer [21].

A convolutional layer specifies the number of output filters, which plays the same role as `units` for fully connected layers. Each filter corresponds to one output channel and behaves like a node with shared spatial weights.

## summary

A node in a neural network is the workhorse of the whole system: a small linear function followed by a nonlinearity. Everything from the McCulloch-Pitts threshold unit of 1943 to the 49,152-wide feedforward sublayers of GPT-3 is built from this primitive. The interesting questions today are no longer how a single node works, but how millions of them combine: which features they store, how those features superpose, and how to read them back out. The basic compute element is unchanged, but its scale and the techniques used to interpret it have transformed beyond anything Rosenblatt would have recognized.

## references

1. McCulloch, W. S., & Pitts, W. (1943). "A Logical Calculus of the Ideas Immanent in Nervous Activity." *Bulletin of Mathematical Biophysics*, 5(4), 115-133. https://link.springer.com/article/10.1007/BF02478259
2. Rosenblatt, F. (1958). "The perceptron: a probabilistic model for information storage and organization in the brain." *Psychological Review*, 65(6), 386-408. https://psycnet.apa.org/record/1959-09865-001
3. Minsky, M., & Papert, S. (1969). *Perceptrons: An Introduction to Computational Geometry*. MIT Press.
4. Hopfield, J. J. (1982). "Neural networks and physical systems with emergent collective computational abilities." *Proceedings of the National Academy of Sciences*, 79(8), 2554-2558. https://www.pnas.org/doi/10.1073/pnas.79.8.2554
5. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). "Learning representations by back-propagating errors." *Nature*, 323(6088), 533-536.
6. Cybenko, G. (1989). "Approximation by superpositions of a sigmoidal function." *Mathematics of Control, Signals, and Systems*, 2(4), 303-314. https://link.springer.com/article/10.1007/BF02551274
7. Hornik, K., Stinchcombe, M., & White, H. (1989). "Multilayer feedforward networks are universal approximators." *Neural Networks*, 2(5), 359-366.
8. Hornik, K. (1991). "Approximation capabilities of multilayer feedforward networks." *Neural Networks*, 4(2), 251-257.
9. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). "ImageNet Classification with Deep Convolutional Neural Networks." *NeurIPS*.
10. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." *Journal of Machine Learning Research*, 15, 1929-1958.
11. He, K., Zhang, X., Ren, S., & Sun, J. (2016). "Deep Residual Learning for Image Recognition." *CVPR*.
12. Vaswani, A. et al. (2017). "Attention Is All You Need." *NeurIPS*.
13. Lu, Z., Pu, H., Wang, F., Hu, Z., & Wang, L. (2017). "The Expressive Power of Neural Networks: A View from the Width." *NeurIPS*. https://arxiv.org/abs/1709.02540
14. Frankle, J., & Carbin, M. (2019). "The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks." *ICLR*.
15. Michel, P., Levy, O., & Neubig, G. (2019). "Are Sixteen Heads Really Better than One?" *NeurIPS*.
16. Brown, T. B. et al. (2020). "Language Models are Few-Shot Learners." *NeurIPS*. (GPT-3 paper, hidden dim 12,288.)
17. Elhage, N. et al. (2022). "Toy Models of Superposition." Anthropic. https://transformer-circuits.pub/2022/toy_model/
18. Bricken, T. et al. (2023). "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning." Anthropic. https://transformer-circuits.pub/2023/monosemantic-features
19. Templeton, A. et al. (2024). "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet." Anthropic. https://transformer-circuits.pub/2024/scaling-monosemanticity/
20. PyTorch documentation, `torch.nn.Linear`. https://docs.pytorch.org/docs/stable/generated/torch.nn.Linear.html
21. TensorFlow documentation, `tf.keras.layers.Dense`. https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense

