See also: Machine learning terms
In a neural network, a node is the basic computational element that receives one or more inputs, computes a weighted sum, adds a bias, and passes the result through a nonlinear activation function to produce an output. The same object goes by several other names: neuron, unit, processing element, and (historically) perceptron. All of these refer to the same simple compute primitive that, when stacked into layers and wired together by weighted connections, yields the deep learning systems that power modern artificial intelligence.
The word "node" can be ambiguous. In the context of feedforward layers, it almost always means a single unit. In a graph neural network or a TensorFlow computation graph it means something quite different. The disambiguation section below covers those cases.
A single node receives a vector of inputs x = (x_1, x_2, ..., x_n) along with a corresponding vector of learned weights w = (w_1, w_2, ..., w_n) and a scalar bias term b. The node's output y is given by:
y = sigma(sum_i w_i * x_i + b)
where sigma is the activation function. The inner sum (the pre-activation) is sometimes called the logit or the net input. Geometrically, the pre-activation defines a hyperplane in input space; the activation function then bends that linear decision surface into something the layer above can use.
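As a concrete illustration, here is a minimal NumPy sketch of a single node. The input, weight, and bias values are arbitrary numbers chosen for the example, and the activation is the sigmoid defined in the snippet.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary example values for a node with three inputs.
x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.8, 0.1, -0.4])   # learned weights
b = 0.25                          # learned bias

pre_activation = np.dot(w, x) + b   # the "net input" / logit
y = sigmoid(pre_activation)         # the node's output
print(pre_activation, y)
```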
A layer of m nodes connected to n inputs is conventionally written in matrix form as y = sigma(W x + b), where W is an m x n weight matrix and b is a length-m bias vector. This is exactly what nn.Linear(n, m) builds in PyTorch and what tf.keras.layers.Dense(m) builds in TensorFlow. The argument m is the number of output nodes, also known as the layer's width.
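A quick PyTorch check of that correspondence (the particular n and m below are arbitrary): the layer stores an m x n weight matrix and a length-m bias, and its output matches W x + b computed by hand.

```python
import torch

n, m = 4, 3                      # input width, number of nodes in the layer
layer = torch.nn.Linear(n, m)

print(layer.weight.shape)        # torch.Size([3, 4])  -- the m x n matrix W
print(layer.bias.shape)          # torch.Size([3])     -- the length-m bias b

x = torch.randn(n)
# nn.Linear computes W x + b; the activation sigma is applied separately.
assert torch.allclose(layer(x), layer.weight @ x + layer.bias, atol=1e-6)
```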
The modern node has a long lineage. The table below lists the main milestones.
| Year | Model | Contribution |
|---|---|---|
| 1943 | McCulloch-Pitts neuron | First mathematical model of a neuron. Inputs are binary, the unit fires if the weighted sum crosses a threshold, and a single inhibitory input can veto the output. McCulloch and Pitts showed that networks of these units could compute any logical function. |
| 1958 | Rosenblatt's perceptron | First learnable single-layer node. Introduced an iterative weight-update rule that converges on linearly separable problems. Published in Psychological Review under the title "The perceptron: a probabilistic model for information storage and organization in the brain". |
| 1969 | Minsky and Papert's Perceptrons | Proved that a single-layer perceptron cannot represent XOR or any other function that is not linearly separable. Funding for neural network research collapsed for over a decade. |
| 1982 | Hopfield network | Recurrent net of binary nodes with symmetric weights, demonstrating that simple equivalent units can produce content-addressable memory through emergent collective behavior. |
| 1986 | Backpropagation rediscovered | Rumelhart, Hinton, and Williams popularized training multi-layer networks of differentiable nodes by gradient descent. This unlocked the hidden layer. |
| 1989-1991 | Universal approximation | Cybenko (1989), Hornik, Stinchcombe and White (1989), and Hornik (1991) proved that a single hidden layer with enough nodes and a suitable nonlinear activation (sigmoidal in the original proofs, later generalized to any non-polynomial activation) can approximate any continuous function on a compact set. |
| 2012 | AlexNet | Demonstrated that wide networks of ReLU nodes trained on GPUs could win ImageNet by a wide margin, kicking off the deep learning era. |
| 2017 | Transformer | The attention block reframed where the action lives: nodes sit inside the position-wise feedforward sublayers, while attention heads handle mixing across positions. |
The McCulloch-Pitts paper ("A Logical Calculus of the Ideas Immanent in Nervous Activity", Bulletin of Mathematical Biophysics, 1943) is usually cited as the origin of the artificial neuron. Rosenblatt's perceptron added the missing ingredient, learning, and was demonstrated on the Mark I Perceptron, a custom analog machine built at Cornell Aeronautical Laboratory.
Different communities and time periods favor different words for the same object. The choice of word is largely cosmetic.
| Term | Where you see it |
|---|---|
| Neuron | Most common in deep learning textbooks and biology-flavored writing. |
| Unit | Common in older connectionist literature and in phrases such as "hidden units" and "output units". |
| Node | Common when describing network topology, especially in diagrams. |
| Perceptron | Historical. Today usually refers to a single-node binary classifier or to a multilayer perceptron (MLP). |
| Processing element (PE) | Older systems engineering literature, especially around analog implementations. |
| Cell | Used for nodes with internal state, such as the LSTM cell. |
In this article, "node" and "neuron" are used interchangeably.
Nodes are organized into layers. Their role depends on the layer they sit in.
| Role | Description |
|---|---|
| Input node | Holds a single component of the input vector. Has no weights and no activation; it just delivers a feature value to the next layer. |
| Hidden node | Sits in a hidden layer and learns to combine inputs from the layer below into intermediate representations. The expressive power of a network comes from its hidden nodes. |
| Output node | Produces the final prediction. The activation differs by task: linear for regression, sigmoid for binary classification, softmax for multiclass. |
| Bias node | A constant input (almost always equal to 1) connected through its own learned weight. Lets the activation surface shift away from the origin. Most modern frameworks fold the bias into the linear layer rather than drawing it as a separate node. |
| Convolutional filter | A node-like unit in a convolutional neural network whose weights are shared across all spatial positions. One filter produces one channel of the output feature map. |
| LSTM cell | A composite node with internal memory, used in a long short-term memory network. Each cell contains its own input, forget, and output gates, each of which is itself made of standard sigmoid nodes. |
| Attention head | The transformer analog. Not a single scalar node, but a block that attends across positions and produces a vector. Heads are sometimes called "node-like" components when describing model width. |
Without a nonlinear activation, a stack of nodes collapses into a single linear map and the network loses all expressive power. Different activations have dominated different eras.
| Activation | Formula | Range | Notes |
|---|---|---|---|
| Step / threshold | 1 if z > 0 else 0 | {0, 1} | The original McCulloch-Pitts and perceptron activation. Not differentiable, so unusable with gradient descent. |
| Sigmoid | 1 / (1 + exp(-z)) | (0, 1) | Smooth and bounded. Standard in shallow nets through the 1990s. Saturates and causes the vanishing gradient problem in deep nets. |
| Tanh | (exp(z) - exp(-z)) / (exp(z) + exp(-z)) | (-1, 1) | Zero-centered version of sigmoid. Preferred in older hidden layers. |
| ReLU | max(0, z) | [0, inf) | Cheap, doesn't saturate for positive inputs, made deep networks practical from 2012 onward. The default for hidden layers in CNNs and MLPs. |
| Leaky ReLU | max(alpha * z, z), small alpha | (-inf, inf) | Fixes the "dying ReLU" problem by keeping a small slope for negative inputs. |
| PReLU | Same as Leaky ReLU but alpha is learned | (-inf, inf) | Introduced by He et al. 2015 for ImageNet. |
| GELU | z * Phi(z) where Phi is the standard Gaussian CDF | (-inf, inf) | Smooth ReLU variant. Used in BERT, GPT-2, GPT-3, and most modern transformers. |
| Swish / SiLU | z * sigmoid(z) | (-inf, inf) | Found by neural architecture search. Used in EfficientNet and several LLMs. |
| Softmax | exp(z_i) / sum_j exp(z_j) | (0, 1) per output, sums to 1 | Output activation for multiclass classification. Operates on a whole vector of nodes, not on each one independently. |
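The formulas in the table translate directly into a few lines of NumPy. The sketch below implements several of them; note that the GELU shown uses the common tanh approximation of z * Phi(z) rather than the exact Gaussian CDF.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def gelu(z):
    # tanh approximation of z * Phi(z), used by several implementations
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def softmax(z):
    # operates on a whole vector of pre-activations, not node by node
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), gelu(z), softmax(z))
```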
The choice of activation is one of the few hyperparameters that affect training dynamics (whether gradients vanish, whether units die) more than final accuracy. The shift from sigmoid to ReLU around 2010-2012 is widely credited as one of the practical breakthroughs that enabled deep learning.
The number of nodes in a layer is called the layer's width. The number of layers is the network's depth. Both matter, and they are not interchangeable.
Common widths in modern architectures cluster around powers of two: 64, 128, 256, 512, 1024, 2048, 4096. Hidden layers in BERT base have width 768. The hidden state in GPT-3 has width 12,288, with feedforward sublayers expanded to 4 times that, or 49,152, before being projected back down. Width determines how many features can be represented in parallel at each layer.
The universal approximation theorem (Cybenko, 1989; Hornik, 1991) says that a feedforward network with one hidden layer of finite but possibly very large width and a suitable nonlinear activation can approximate any continuous function on a compact set to arbitrary accuracy. The theorem is an existence proof. It does not say the required width is small or that the weights are easy to find. In practice, depth turns out to be far more parameter-efficient than width: deep narrow networks routinely outperform shallow wide ones with the same total parameter count.
Lu et al. (2017), in "The Expressive Power of Neural Networks: A View from the Width", established a complementary result for ReLU networks. Width-(n+4) ReLU networks, where n is the input dimension, are universal approximators, and there exist functions representable by wide shallow networks that any narrow network of polynomial depth cannot match. Their broader conclusion is that depth is more effective than width, but width is not negligible.
In practice, a few considerations guide the choice of width: powers of two map cleanly onto GPU memory layouts, a layer that is much narrower than its neighbors becomes an information bottleneck, and because a dense layer's parameter count is the product of the adjacent widths, doubling two neighboring widths quadruples their connection count.
How nodes connect to each other defines the network topology. The dominant patterns are full connectivity between adjacent layers (the dense layers of an MLP), local connectivity with shared weights (convolutional layers), recurrent connections that feed a node's output back in at the next time step, and attention, which lets every position exchange information with every other position.
Modern foundation models contain enormous numbers of nodes.
| Model | Hidden width | Total parameters | Source |
|---|---|---|---|
| MNIST classifier (1998) | 100-300 | ~200K | LeCun et al. |
| AlexNet (2012) | 4096 in FC layers | 60M | Krizhevsky et al. |
| BERT base (2018) | 768 | 110M | Devlin et al. |
| BERT large (2018) | 1024 | 340M | Devlin et al. |
| GPT-3 (2020) | 12,288 | 175B | Brown et al. |
| GPT-3 FFN inner (2020) | 49,152 (4x the hidden width) | — | Brown et al. |
The number of nodes inside a transformer FFN block is typically four times the model's hidden dimension. Counting attention heads as node-like units gives even larger numbers: GPT-3 has 96 heads per layer across 96 layers, or 9,216 attention heads in total.
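A back-of-the-envelope sketch of how those node counts turn into parameter counts, assuming the standard two-matrix FFN block (one expansion and one contraction projection, biases ignored):

```python
d_model = 12_288            # GPT-3 hidden width
d_ff = 4 * d_model          # 49,152 nodes in the FFN sublayer
n_layers = 96

# up-projection (d_model -> d_ff) plus down-projection (d_ff -> d_model)
ffn_params_per_layer = d_model * d_ff + d_ff * d_model
print(f"{ffn_params_per_layer:,}")              # 1,207,959,552 per layer
print(f"{ffn_params_per_layer * n_layers:,}")   # ~116 billion across layers
```

Roughly 1.2 billion parameters per layer, or about 116 billion across all 96 layers, which is consistent with the bulk of GPT-3's 175B parameters sitting in the feedforward sublayers.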
A long-standing puzzle is whether a single node represents a single human-meaningful concept. The answer turned out to be no, in general. Most nodes in trained language models are polysemantic: a given neuron will activate for several unrelated concepts, such as quotes from US presidents and the Hebrew alphabet. This is a consequence of superposition, the idea that networks pack more features than they have neurons by storing them in nearly orthogonal directions in activation space.
Elhage et al. (2022), in "Toy Models of Superposition" at Anthropic, formalized this and showed that superposition is a deliberate strategy the network learns to use available width efficiently. Bricken et al. (2023), in "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning", trained a sparse autoencoder on the activations of a one-layer transformer and decomposed a 512-node layer into more than 4,000 separate features, each of which is much more interpretable than the underlying neurons. Their follow-up, "Scaling Monosemanticity" (2024), extended the technique to Claude 3 Sonnet, recovering features ranging from concrete (the Golden Gate Bridge) to abstract (deception, sycophancy).
This line of work has practical implications. Pruning and editing whole nodes is much less precise than editing the underlying features, which is one reason mechanistic interpretability has become a major research direction.
Not every node in a trained network earns its keep. Structured pruning removes entire nodes (or filters or attention heads) along with their weights, producing a smaller dense network that runs faster on standard hardware. Magnitude-based pruning, lottery-ticket pruning (Frankle and Carbin, 2019), and head pruning (Michel et al., 2019) are well-studied variants. Empirically, transformer models often retain most of their accuracy after 30-50% of attention heads are removed, which suggests substantial redundancy at the head level.
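As a sketch of what structured pruning looks like in code, PyTorch's torch.nn.utils.prune module can zero out whole weight rows of a linear layer (i.e., whole output nodes) by their L2 norm; the layer sizes and the 30% pruning fraction below are arbitrary example values.

```python
import torch
from torch.nn.utils import prune

layer = torch.nn.Linear(256, 128)

# Zero out the 30% of output nodes whose weight rows have the smallest L2 norm.
# dim=0 selects whole rows of the (128 x 256) weight matrix, i.e. whole nodes.
prune.ln_structured(layer, name="weight", amount=0.3, n=2, dim=0)

rows_kept = (layer.weight.abs().sum(dim=1) > 0).sum().item()
print(rows_kept)   # roughly 70% of the 128 output nodes remain nonzero
```

Note that the mask only zeroes the pruned rows in place; obtaining the smaller dense network described above requires rebuilding the layer without them.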
The word "node" appears elsewhere in machine learning with different meanings, which causes regular confusion.
| Context | What "node" means |
|---|---|
| Feedforward layer | A single unit with weights, bias, and activation. The meaning used throughout this article. |
| TensorFlow computation graph | An operation (op) in the dataflow graph, such as MatMul or Relu. Edges carry tensors between op nodes. |
| PyTorch autograd graph | A Function object that produced a tensor. Used during the backward pass. |
| Graph neural network (GNN) | A vertex in the input graph (a person in a social network, an atom in a molecule). The node feature vector is updated by aggregating messages from neighboring nodes. Not the same as a hidden unit. |
| Decision tree | A branching point in the tree, also called a split node. |
| Distributed training | A physical machine in a cluster (e.g., a TPU node or a compute node in a Kubernetes cluster). |
When reading a paper or codebase, it is worth checking which of these meanings is intended.
In PyTorch, the line
```python
layer = torch.nn.Linear(in_features=784, out_features=512)
```
creates a layer with 784 input nodes and 512 output nodes, allocating a 512x784 weight matrix and a length-512 bias vector. Stacking another nn.Linear(512, 10) on top turns those 512 outputs into the inputs of the next layer and yields 10 output nodes.
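A minimal sketch of that two-layer stack, with a ReLU activation between the linear maps (the widths are the ones from the example above; the batch size is arbitrary):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(784, 512),   # 784 input nodes -> 512 hidden nodes
    torch.nn.ReLU(),             # activation applied to each hidden node
    torch.nn.Linear(512, 10),    # 512 hidden nodes -> 10 output nodes
)

x = torch.randn(32, 784)         # a batch of 32 flattened 28x28 images
print(model(x).shape)            # torch.Size([32, 10])
```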
In Keras / TensorFlow:
```python
layer = tf.keras.layers.Dense(units=512, activation='relu')
```
Here, units=512 is the number of output nodes; the input width is inferred from the previous layer.
A convolutional layer specifies the number of output filters, which plays the same role as units for fully connected layers. Each filter corresponds to one output channel and behaves like a node with shared spatial weights.
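A PyTorch sketch of the analogy (channel counts and kernel size are arbitrary example values): each of the 16 filters below is a node-like unit whose weights are shared across spatial positions, and each produces one output channel.

```python
import torch

conv = torch.nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
# 16 filters = 16 node-like units, one per output channel.
print(conv.weight.shape)          # torch.Size([16, 3, 3, 3])

x = torch.randn(1, 3, 28, 28)     # one RGB image
print(conv(x).shape)              # torch.Size([1, 16, 28, 28])
```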
A node in a neural network is the workhorse of the whole system: a small linear function followed by a nonlinearity. Everything from the McCulloch-Pitts threshold unit of 1943 to the 49,152-wide feedforward sublayers of GPT-3 is built from this primitive. The interesting questions today are no longer how a single node works, but how millions of them combine: which features they store, how those features superpose, and how to read them back out. The basic compute element is unchanged, but its scale and the techniques used to interpret it have transformed beyond anything Rosenblatt would have recognized.
torch.nn.Linear. https://docs.pytorch.org/docs/stable/generated/torch.nn.Linear.html
tf.keras.layers.Dense. https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense