See also: Machine learning terms
In a neural network, a node is the basic computational element that receives one or more inputs, computes a weighted sum, adds a bias, and passes the result through a nonlinear activation function to produce an output. The same object goes by several other names: neuron, unit, processing element, and (historically) perceptron. All of these refer to the same simple compute primitive that, when stacked into layers and wired together by weighted connections, yields the deep learning systems that power modern artificial intelligence.
The word "node" can be ambiguous. In the context of feedforward layers, it almost always means a single unit. In a graph neural network or a TensorFlow computation graph it means something quite different. The disambiguation section below covers those cases.
A single node receives a vector of inputs x = (x_1, x_2, ..., x_n) along with a corresponding vector of learned weights w = (w_1, w_2, ..., w_n) and a scalar bias term b. The node's output y is given by:
y = sigma(sum_i w_i * x_i + b)
where sigma is the activation function. The inner sum (the pre-activation) is sometimes called the logit or the net input. Geometrically, the pre-activation defines a hyperplane in input space; the activation function then bends that linear decision surface into something the layer above can use.
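As a concrete illustration, here is a minimal NumPy sketch of a single node. The input, weight, and bias values are arbitrary numbers chosen for the example, and the activation is the sigmoid defined in the snippet.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary example values for a node with three inputs.
x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.8, 0.1, -0.4])   # learned weights
b = 0.25                          # learned bias

pre_activation = np.dot(w, x) + b   # the "net input" / logit
y = sigmoid(pre_activation)         # the node's output
print(pre_activation, y)
```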
A layer of m nodes connected to n inputs is conventionally written in matrix form as y = sigma(W x + b), where W is an m x n weight matrix and b is a length-m bias vector. This is exactly what nn.Linear(n, m) builds in PyTorch and what tf.keras.layers.Dense(m) builds in TensorFlow. The argument m is the number of output nodes, also known as the layer's width.
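A quick PyTorch check of that correspondence (the particular n and m below are arbitrary): the layer stores an m x n weight matrix and a length-m bias, and its output matches W x + b computed by hand.

```python
import torch

n, m = 4, 3                      # input width, number of nodes in the layer
layer = torch.nn.Linear(n, m)

print(layer.weight.shape)        # torch.Size([3, 4])  -- the m x n matrix W
print(layer.bias.shape)          # torch.Size([3])     -- the length-m bias b

x = torch.randn(n)
# nn.Linear computes W x + b; the activation sigma is applied separately.
assert torch.allclose(layer(x), layer.weight @ x + layer.bias, atol=1e-6)
```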
The modern node has a long lineage. The table below lists the main milestones.
| Year | Model | Contribution |
|---|---|---|
| 1943 | McCulloch-Pitts neuron | First mathematical model of a neuron. Inputs are binary, the unit fires if the weighted sum crosses a threshold, and a single inhibitory input can veto the output. McCulloch and Pitts showed that networks of these units could compute any logical function. |
| 1958 | Rosenblatt's perceptron | First learnable single-layer node. Introduced an iterative weight-update rule that converges on linearly separable problems. Published in Psychological Review under the title "The perceptron: a probabilistic model for information storage and organization in the brain". |
| 1969 | Minsky and Papert's Perceptrons | Proved that a single-layer perceptron cannot represent XOR or any other function that is not linearly separable. Funding for neural network research collapsed for over a decade. |
| 1982 | Hopfield network | Recurrent net of binary nodes with symmetric weights, demonstrating that simple equivalent units can produce content-addressable memory through emergent collective behavior. |
| 1986 | Backpropagation rediscovered | Rumelhart, Hinton, and Williams popularized training multi-layer networks of differentiable nodes by gradient descent. This unlocked the hidden layer. |
| 1989-1991 | Universal approximation | Cybenko (1989), Hornik, Stinchcombe and White (1989), and Hornik (1991) proved that a single hidden layer with enough nodes and a suitable nonlinear activation (sigmoidal in the original proofs, later generalized to any non-polynomial activation) can approximate any continuous function on a compact set. |
| 2012 | AlexNet | Demonstrated that wide networks of ReLU nodes trained on GPUs could win ImageNet by a wide margin, kicking off the deep learning era. |
| 2017 | Transformer | The attention block reframed where the action lives: nodes sit inside the position-wise feedforward sublayers, while attention heads handle mixing across positions. |
The McCulloch-Pitts paper ("A Logical Calculus of the Ideas Immanent in Nervous Activity", Bulletin of Mathematical Biophysics, 1943) is usually cited as the origin of the artificial neuron. Rosenblatt's perceptron added the missing ingredient, learning, and was demonstrated on the Mark I Perceptron, a custom analog machine built at Cornell Aeronautical Laboratory.
Different communities and time periods favor different words for the same object. The choice of word is largely cosmetic.
| Term | Where you see it |
|---|---|
| Neuron | Most common in deep learning textbooks and biology-flavored writing. |
| Unit | Common in older connectionist literature and in phrases such as "hidden units" and "output units". |
| Node | Common when describing network topology, especially in diagrams. |
| Perceptron | Historical. Today usually refers to a single-node binary classifier or to a multilayer perceptron (MLP). |
| Processing element (PE) | Older systems engineering literature, especially around analog implementations. |
| Cell | Used for nodes with internal state, such as the LSTM cell. |
In this article, "node" and "neuron" are used interchangeably.
Nodes are organized into layers. Their role depends on the layer they sit in.
| Role | Description |
|---|---|
| Input node | Holds a single component of the input vector. Has no weights and no activation; it just delivers a feature value to the next layer. |
| Hidden node | Sits in a hidden layer and learns to combine inputs from the layer below into intermediate representations. The expressive power of a network comes from its hidden nodes. |
| Output node | Produces the final prediction. The activation differs by task: linear for regression, sigmoid for binary classification, softmax for multiclass. |
| Bias node | A constant input (almost always equal to 1) connected through its own learned weight. Lets the activation surface shift away from the origin. Most modern frameworks fold the bias into the linear layer rather than drawing it as a separate node. |
| Convolutional filter | A node-like unit in a convolutional neural network whose weights are shared across all spatial positions. One filter produces one channel of the output feature map. |
| LSTM cell | A composite node with internal memory, used in a long short-term memory network. Each cell contains its own input, forget, and output gates, each of which is itself made of standard sigmoid nodes. |
| Attention head | The transformer analog. Not a single scalar node, but a block that attends across positions and produces a vector. Heads are sometimes called "node-like" components when describing model width. |
Without a nonlinear activation, a stack of nodes collapses into a single linear map and the network loses all expressive power. Different activations have dominated different eras.
| Activation | Formula | Range | Notes |
|---|---|---|---|
| Step / threshold | 1 if z > 0 else 0 | {0, 1} | The original McCulloch-Pitts and perceptron activation. Not differentiable, so unusable with gradient descent. |
| Sigmoid | 1 / (1 + exp(-z)) | (0, 1) | Smooth and bounded. Standard in shallow nets through the 1990s. Saturates and causes the vanishing gradient problem in deep nets. |
| Tanh | (exp(z) - exp(-z)) / (exp(z) + exp(-z)) | (-1, 1) | Zero-centered version of sigmoid. Preferred in older hidden layers. |
| ReLU | max(0, z) | [0, inf) | Cheap, doesn't saturate for positive inputs, made deep networks practical from 2012 onward. The default for hidden layers in CNNs and MLPs. |
| Leaky ReLU | max(alpha * z, z), small alpha | (-inf, inf) | Fixes the "dying ReLU" problem by keeping a small slope for negative inputs. |
| PReLU | Same as Leaky ReLU but alpha is learned | (-inf, inf) | Introduced by He et al. 2015 for ImageNet. |
| GELU | z * Phi(z) where Phi is the standard Gaussian CDF | (-inf, inf) | Smooth ReLU variant. Used in BERT, GPT-2, GPT-3, and most modern transformers. |
| Swish / SiLU | z * sigmoid(z) | (-inf, inf) | Found by neural architecture search. Used in EfficientNet and several LLMs. |
| Softmax | exp(z_i) / sum_j exp(z_j) | (0, 1) per output, sums to 1 | Output activation for multiclass classification. Operates on a whole vector of nodes, not on each one independently. |
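The formulas in the table translate directly into a few lines of NumPy. The sketch below implements several of them; note that the GELU shown uses the common tanh approximation of z * Phi(z) rather than the exact Gaussian CDF.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def gelu(z):
    # tanh approximation of z * Phi(z), used by several implementations
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def softmax(z):
    # operates on a whole vector of pre-activations, not node by node
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), gelu(z), softmax(z))
```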
The choice of activation is one of the few hyperparameters that affect training dynamics (whether gradients vanish, whether units die) more than final accuracy. The shift from sigmoid to ReLU around 2010-2012 is widely credited as one of the practical breakthroughs that enabled deep learning.
The number of nodes in a layer is called the layer's width. The number of layers is the network's depth. Both matter, and they are not interchangeable.
Common widths in modern architectures cluster around powers of two: 64, 128, 256, 512, 1024, 2048, 4096. Hidden layers in BERT base have width 768. The hidden state in GPT-3 has width 12,288, with feedforward sublayers expanded to 4 times that, or 49,152, before being projected back down. Width determines how many features can be represented in parallel at each layer.
The universal approximation theorem (Cybenko, 1989; Hornik, 1991) says that a feedforward network with one hidden layer of finite but possibly very large width and a suitable nonlinear activation can approximate any continuous function on a compact set to arbitrary accuracy. The theorem is an existence proof. It does not say the required width is small or that the weights are easy to find. In practice, depth turns out to be far more parameter-efficient than width: deep narrow networks routinely outperform shallow wide ones with the same total parameter count.
Lu et al. (2017), in "The Expressive Power of Neural Networks: A View from the Width", established a complementary result for ReLU networks. Width-(n+4) ReLU networks, where n is the input dimension, are universal approximators, and there exist functions representable by wide shallow networks that any narrow network of polynomial depth cannot match. Their broader conclusion is that depth is more effective than width, but width is not negligible.
In practice, a few considerations guide the choice of width: powers of two map cleanly onto GPU memory layouts, a layer that is much narrower than its neighbors becomes an information bottleneck, and because a dense layer's parameter count is the product of the adjacent widths, doubling two neighboring widths quadruples their connection count.
How nodes connect to each other defines the network topology. The dominant patterns are full connectivity between adjacent layers (the dense layers of an MLP), local connectivity with shared weights (convolutional layers), recurrent connections that feed a node's output back in at the next time step, and attention, which lets every position exchange information with every other position.
Modern foundation models contain enormous numbers of nodes.
| Model | Hidden width | Total parameters | Source |
|---|---|---|---|
| MNIST classifier (1998) | 100-300 | ~200K | LeCun et al. |
| AlexNet (2012) | 4096 in FC layers | 60M | Krizhevsky et al. |
| BERT base (2018) | 768 | 110M | Devlin et al. |
| BERT large (2018) | 1024 | 340M | Devlin et al. |
| GPT-3 (2020) | 12,288 | 175B | Brown et al. |
| GPT-3 FFN inner (2020) | 49,152 (4x the hidden width) | — | Brown et al. |
The number of nodes inside a transformer FFN block is typically four times the model's hidden dimension. Counting attention heads as node-like units gives even larger numbers: GPT-3 has 96 heads per layer across 96 layers, or 9,216 attention heads in total.
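A back-of-the-envelope sketch of how those node counts turn into parameter counts, assuming the standard two-matrix FFN block (one expansion and one contraction projection, biases ignored):

```python
d_model = 12_288            # GPT-3 hidden width
d_ff = 4 * d_model          # 49,152 nodes in the FFN sublayer
n_layers = 96

# up-projection (d_model -> d_ff) plus down-projection (d_ff -> d_model)
ffn_params_per_layer = d_model * d_ff + d_ff * d_model
print(f"{ffn_params_per_layer:,}")              # 1,207,959,552 per layer
print(f"{ffn_params_per_layer * n_layers:,}")   # ~116 billion across layers
```

Roughly 1.2 billion parameters per layer, or about 116 billion across all 96 layers, which is consistent with the bulk of GPT-3's 175B parameters sitting in the feedforward sublayers.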
A long-standing puzzle is whether a single node represents a single human-meaningful concept. The answer turned out to be no, in general. Most nodes in trained language models are polysemantic: a given neuron will activate for several unrelated concepts, such as quotes from US presidents and the Hebrew alphabet. This is a consequence of superposition, the idea that networks pack more features than they have neurons by storing them in nearly orthogonal directions in activation space.
Elhage et al. (2022), in "Toy Models of Superposition" at Anthropic, formalized this and showed that superposition is a deliberate strategy the network learns to use available width efficiently. Bricken et al. (2023), in "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning", trained a sparse autoencoder on the activations of a one-layer transformer and decomposed a 512-node layer into more than 4,000 separate features, each of which is much more interpretable than the underlying neurons. Their follow-up, "Scaling Monosemanticity" (2024), extended the technique to Claude 3 Sonnet, recovering features ranging from concrete (the Golden Gate Bridge) to abstract (deception, sycophancy).
This line of work has practical implications. Pruning and editing whole nodes is much less precise than editing the underlying features, which is one reason mechanistic interpretability has become a major research direction.
Not every node in a trained network earns its keep. Structured pruning removes entire nodes (or filters or attention heads) along with their weights, producing a smaller dense network that runs faster on standard hardware. Magnitude-based pruning, lottery-ticket pruning (Frankle and Carbin, 2019), and head pruning (Michel et al., 2019) are well-studied variants. Empirically, transformer models often retain most of their accuracy after 30-50% of attention heads are removed, which suggests substantial redundancy at the head level.
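As a sketch of what structured pruning looks like in code, PyTorch's torch.nn.utils.prune module can zero out whole weight rows of a linear layer (i.e., whole output nodes) by their L2 norm; the layer sizes and the 30% pruning fraction below are arbitrary example values.

```python
import torch
from torch.nn.utils import prune

layer = torch.nn.Linear(256, 128)

# Zero out the 30% of output nodes whose weight rows have the smallest L2 norm.
# dim=0 selects whole rows of the (128 x 256) weight matrix, i.e. whole nodes.
prune.ln_structured(layer, name="weight", amount=0.3, n=2, dim=0)

rows_kept = (layer.weight.abs().sum(dim=1) > 0).sum().item()
print(rows_kept)   # roughly 70% of the 128 output nodes remain nonzero
```

Note that the mask only zeroes the pruned rows in place; obtaining the smaller dense network described above requires rebuilding the layer without them.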
The word "node" appears elsewhere in machine learning with different meanings, which causes regular confusion.
| Context | What "node" means |
|---|---|
| Feedforward layer | A single unit with weights, bias, and activation. The meaning used throughout this article. |
| TensorFlow computation graph | An operation (op) in the dataflow graph, such as MatMul or Relu. Edges carry tensors between op nodes. |
| PyTorch autograd graph | A Function object that produced a tensor. Used during the backward pass. |
| Graph neural network (GNN) | A vertex in the input graph (a person in a social network, an atom in a molecule). The node feature vector is updated by aggregating messages from neighboring nodes. Not the same as a hidden unit. |
| Decision tree | A branching point in the tree, also called a split node. |
| Distributed training | A physical machine in a cluster (e.g., a TPU node or a compute node in a Kubernetes cluster). |
When reading a paper or codebase, it is worth checking which of these meanings is intended.
In PyTorch, the line
```python
layer = torch.nn.Linear(in_features=784, out_features=512)
```
creates a layer with 784 input nodes and 512 output nodes, allocating a 512x784 weight matrix and a length-512 bias vector. Stacking another nn.Linear(512, 10) on top turns those 512 outputs into the inputs of the next layer and yields 10 output nodes.
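A minimal sketch of that two-layer stack, with a ReLU activation between the linear maps (the widths are the ones from the example above; the batch size is arbitrary):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(784, 512),   # 784 input nodes -> 512 hidden nodes
    torch.nn.ReLU(),             # activation applied to each hidden node
    torch.nn.Linear(512, 10),    # 512 hidden nodes -> 10 output nodes
)

x = torch.randn(32, 784)         # a batch of 32 flattened 28x28 images
print(model(x).shape)            # torch.Size([32, 10])
```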
In Keras / TensorFlow:
```python
layer = tf.keras.layers.Dense(units=512, activation='relu')
```
Here, units=512 is the number of output nodes; the input width is inferred from the previous layer.
A convolutional layer specifies the number of output filters, which plays the same role as units for fully connected layers. Each filter corresponds to one output channel and behaves like a node with shared spatial weights.
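A PyTorch sketch of the analogy (channel counts and kernel size are arbitrary example values): each of the 16 filters below is a node-like unit whose weights are shared across spatial positions, and each produces one output channel.

```python
import torch

conv = torch.nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
# 16 filters = 16 node-like units, one per output channel.
print(conv.weight.shape)          # torch.Size([16, 3, 3, 3])

x = torch.randn(1, 3, 28, 28)     # one RGB image
print(conv(x).shape)              # torch.Size([1, 16, 28, 28])
```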
A node in a neural network is the workhorse of the whole system: a small linear function followed by a nonlinearity. Everything from the McCulloch-Pitts threshold unit of 1943 to the 49,152-wide feedforward sublayers of GPT-3 is built from this primitive. The interesting questions today are no longer how a single node works, but how millions of them combine: which features they store, how those features superpose, and how to read them back out. The basic compute element is unchanged, but its scale and the techniques used to interpret it have transformed beyond anything Rosenblatt would have recognized.
torch.nn.Linear. https://docs.pytorch.org/docs/stable/generated/torch.nn.Linear.html
tf.keras.layers.Dense. https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense