Broadcasting
Last reviewed
Sources
11 citations
Review status
Source-backed
Revision
v3 · 2,816 words
See also: Machine learning terms
Broadcasting is the set of rules that lets element-wise operations (addition, subtraction, multiplication, division) act on arrays or tensors of different but compatible shapes by virtually stretching the smaller array to match the larger one, without copying data in memory. It originated in NumPy, the foundational numerical computing library for Python, and the same semantics have been adopted by every major deep learning framework, including PyTorch and TensorFlow. The official NumPy documentation defines it as follows: "The term broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is 'broadcast' across the larger array so that they have compatible shapes." [1]
The two governing rules are simple: NumPy compares array shapes element-wise starting from the trailing (rightmost) dimension and working left, and two dimensions are compatible when they are equal or when one of them is 1. [1] A size-1 dimension is then conceptually stretched to match its partner, but the stretching is virtual: NumPy reads the original values repeatedly rather than allocating an expanded copy. This makes broadcasting both faster and more memory efficient than manual replication, and it is one of the reasons vectorized machine learning code can be written so compactly.
Broadcasting refers to the implicit expansion of arrays with smaller or missing dimensions so that they become shape-compatible for element-wise arithmetic. When two arrays are involved in an operation such as addition, subtraction, multiplication, or division, broadcasting stretches the smaller array along its size-1 or missing dimensions to match the shape of the larger array. This stretching is virtual: the underlying data is not duplicated in memory. Instead, the computation engine (for example, NumPy's C-level loops or a GPU kernel) reads the same value repeatedly as needed. The NumPy documentation describes this directly: "NumPy is smart enough to use the original scalar value without actually making copies so that broadcasting operations are as memory and computationally efficient as possible." [1]
For example, adding a scalar value 5 to a 1-D array [1, 2, 3] produces [6, 7, 8]. The scalar is "broadcast" across every element of the array. Similarly, adding a 1-D array of shape (3,) to a 2-D array of shape (4, 3) broadcasts the 1-D array across all four rows, producing a (4, 3) result. As a compact illustration, a column vector of shape (3, 1) added to a row vector of shape (1, 4) produces a (3, 4) grid, because each size-1 axis is stretched to meet the other operand.
The concept traces its roots to array programming languages such as APL (A Programming Language), which introduced "scalar extension" in the 1960s. In APL, a scalar could be automatically applied to every element of an array. NumPy generalized this idea into a full set of broadcasting rules that handle arrays of arbitrary rank, and the term "broadcasting" became the standard name for this mechanism. NumPy 1.0 was released in 2006, and its broadcasting rules were documented the same year in Travis Oliphant's Guide to NumPy. [5][2] Because the scientific Python ecosystem standardized on these rules, modern frameworks for neural networks, data preprocessing, and scientific computing all share essentially the same broadcasting behavior.
NumPy's broadcasting rules have become the de facto standard across the scientific Python ecosystem and deep learning frameworks. The official rule is stated as: "It starts with the trailing (i.e. rightmost) dimension and works its way left. Two dimensions are compatible when they are equal, or one of them is 1." [1] In practice the rules can be expanded into three steps.
If two arrays have a different number of dimensions, the shape of the array with fewer dimensions is padded with ones on its leading (left) side until both shapes have equal length.
| Array | Original Shape | Padded Shape |
|---|---|---|
| A (2-D) | (4, 3) | (4, 3) |
| B (1-D) | (3,) | (1, 3) |
After padding, each pair of corresponding dimensions is compared. Two dimensions are compatible if they are equal or if one of them is 1. When a dimension has size 1, it is conceptually stretched to match the size of the other dimension.
| Dimension Pair | A Size | B Size | Compatible? | Result Size |
|---|---|---|---|---|
| Axis 0 | 4 | 1 | Yes | 4 |
| Axis 1 | 3 | 3 | Yes | 3 |
If any pair of corresponding dimensions is neither equal nor has one value of 1, broadcasting fails and the operation raises an error. As the NumPy documentation states, "If these conditions are not met, a ValueError: operands could not be broadcast together exception is thrown, indicating that the arrays have incompatible shapes." [1]
The following table shows several shape combinations and whether they are compatible for broadcasting.
| Array A Shape | Array B Shape | Compatible? | Result Shape |
|---|---|---|---|
| (5, 4) | (1,) | Yes | (5, 4) |
| (5, 4) | (4,) | Yes | (5, 4) |
| (15, 3, 5) | (15, 1, 5) | Yes | (15, 3, 5) |
| (15, 3, 5) | (3, 5) | Yes | (15, 3, 5) |
| (8, 1, 6, 1) | (7, 1, 5) | Yes | (8, 7, 6, 5) |
| (256, 256, 3) | (3,) | Yes | (256, 256, 3) |
| (3,) | (4,) | No | Error |
| (2, 1) | (8, 4, 3) | No | Error |
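The three steps above can be sketched as a small pure-Python function. The helper name `broadcast_shape` is hypothetical and exists only for illustration; NumPy's own `np.broadcast_shapes` (assumed available, NumPy 1.20 or later) is used as a cross-check.

```python
import numpy as np


def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or raise ValueError if incompatible."""
    # Step 1: left-pad the shorter shape with ones so both have equal length.
    ndim = max(len(shape_a), len(shape_b))
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)

    result = []
    for dim_a, dim_b in zip(a, b):
        # Step 2: a pair is compatible when the sizes are equal or either is 1.
        if dim_a == dim_b or dim_a == 1 or dim_b == 1:
            result.append(dim_b if dim_a == 1 else dim_a)
        else:
            # Step 3: one incompatible pair makes the whole operation fail.
            raise ValueError(f"cannot broadcast {shape_a} with {shape_b}")
    return tuple(result)


print(broadcast_shape((8, 1, 6, 1), (7, 1, 5)))      # (8, 7, 6, 5)
print(np.broadcast_shapes((8, 1, 6, 1), (7, 1, 5)))  # (8, 7, 6, 5)
```

Calling `broadcast_shape((3,), (4,))` raises `ValueError`, matching the incompatible rows of the table.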
The following examples illustrate how broadcasting works in practice with NumPy code.
The simplest case of broadcasting occurs when a scalar is combined with an array. The scalar is broadcast to every element.
import numpy as np
a = np.array([1.0, 2.0, 3.0])
b = 2.0
result = a * b
# result: array([2., 4., 6.])
Shape alignment: the scalar () is treated as (1,) and then stretched to (3,) to match a.
A 1-D array can be broadcast across the rows of a 2-D array when the trailing dimension matches.
a = np.array([[0, 0, 0],
              [10, 10, 10],
              [20, 20, 20],
              [30, 30, 30]])  # shape (4, 3)
b = np.array([1, 2, 3])       # shape (3,)
result = a + b
# result:
# array([[ 1,  2,  3],
#        [11, 12, 13],
#        [21, 22, 23],
#        [31, 32, 33]])
Here, b with shape (3,) is padded to (1, 3) and then stretched along axis 0 to become (4, 3).
By reshaping arrays so that their non-trivial dimensions do not overlap, broadcasting produces an outer-product-like result.
a = np.array([0, 10, 20, 30]) # shape (4,)
b = np.array([1, 2, 3]) # shape (3,)
result = a[:, np.newaxis] + b # shapes (4, 1) and (3,)
# result shape: (4, 3)
# array([[ 1,  2,  3],
#        [11, 12, 13],
#        [21, 22, 23],
#        [31, 32, 33]])
Using np.newaxis converts a from shape (4,) to (4, 1), allowing it to broadcast with b of shape (3,). The same trick generalizes the (3, 1) plus (1, 4) case mentioned above: any pair of orthogonal size-1 axes broadcasts to their full grid.
Broadcasting raises an error when trailing dimensions are neither equal nor 1.
a = np.array([[1, 2, 3],
              [4, 5, 6]])  # shape (2, 3)
b = np.array([1, 2, 3, 4]) # shape (4,)
result = a + b
# ValueError: operands could not be broadcast together
# with shapes (2,3) (4,)
The trailing dimensions are 3 and 4, which are neither equal nor 1, so the operation is invalid.
PyTorch follows the same broadcasting rules as NumPy. The PyTorch documentation states that two tensors are "broadcastable" if "when iterating over the dimension sizes, starting at the trailing dimension, the dimension sizes must either be equal, one of them is 1, or one of them does not exist." [3] It further notes that "if a PyTorch operation supports broadcast, then its Tensor arguments can be automatically expanded to be of equal sizes (without making copies of the data)." [3]
import torch
x = torch.empty(5, 1, 4, 1)
y = torch.empty(3, 1, 1)
result = x + y
# result shape: torch.Size([5, 3, 4, 1])
One important constraint in PyTorch is that in-place operations (such as add_()) do not allow the in-place tensor to change shape as a result of broadcasting. [3] For instance, if x has shape (1, 3, 1) and y has shape (3, 1, 7), the broadcast result has shape (3, 3, 7), so calling x.add_(y) raises an error: x would need to expand from size 1 to size 3 in the first dimension and from size 1 to size 7 in the last.
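NumPy enforces an analogous restriction on in-place arithmetic, which makes the idea easy to try without PyTorch installed. The sketch below is a NumPy analogue, not PyTorch code: it reuses the shapes from the example above and shows that the out-of-place sum succeeds while the in-place version fails because the target array cannot grow.

```python
import numpy as np

x = np.zeros((1, 3, 1))
y = np.zeros((3, 1, 7))

# Out of place: a new array with the broadcast shape is allocated.
out_of_place = x + y
print(out_of_place.shape)  # (3, 3, 7)

# In place: x would have to grow to (3, 3, 7), which is not allowed.
inplace_failed = False
try:
    x += y
except ValueError as err:
    inplace_failed = True
    print("in-place add failed:", err)

# The reverse direction works, because z already has the full broadcast shape.
z = np.zeros((3, 3, 7))
z += x
```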
TensorFlow also implements NumPy-style broadcasting for its tensor operations. The tf.broadcast_to function can be used to explicitly broadcast a tensor to a target shape. TensorFlow's XLA (Accelerated Linear Algebra) compiler takes a more explicit approach by requiring a broadcast_dimensions parameter when combining arrays of different ranks, which specifies exactly which dimensions of the higher-rank array correspond to the lower-rank array. [4]
import tensorflow as tf
a = tf.constant([[1], [2], [3]]) # shape (3, 1)
b = tf.constant([10, 20, 30]) # shape (3,)
result = a + b
# result shape: (3, 3)
JAX, Google's library for high-performance numerical computing and automatic differentiation, also follows NumPy broadcasting conventions. Since JAX is designed as a drop-in replacement for NumPy in many scenarios, its broadcasting behavior is essentially identical.
One of the most important properties of broadcasting is that it does not create copies of data. When NumPy broadcasts a scalar across a million-element array, it does not allocate a million-element temporary array filled with that scalar. Instead, it reads the single scalar value repeatedly during the element-wise loop. The NumPy documentation summarizes the benefit: broadcasting "does this without making needless copies of data and usually leads to efficient algorithm implementations." [1] This is achieved internally through NumPy's stride mechanism: a broadcast dimension has a stride of zero, meaning the pointer does not advance along that axis.
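The zero-stride behavior can be observed directly with `np.broadcast_to`, which returns a read-only broadcast view. This is a minimal sketch; the `int64` dtype is chosen explicitly so that the byte counts in the comments hold on any platform.

```python
import numpy as np

b = np.array([1, 2, 3], dtype=np.int64)  # shape (3,), 8 bytes per element
view = np.broadcast_to(b, (4, 3))        # read-only view with shape (4, 3)

print(view.strides)               # (0, 8): stride 0 along the broadcast axis
print(np.shares_memory(view, b))  # True: no expanded copy was allocated

# Explicit replication, for comparison, really does allocate four rows.
tiled = np.tile(b, (4, 1))
print(tiled.strides)              # (24, 8)
print(b.nbytes, tiled.nbytes)     # 24 96
```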
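This design provides two major benefits: lower memory use, because no expanded copy of the smaller operand is allocated, and faster execution, because the element-wise loop runs in compiled code rather than in the Python interpreter.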
However, broadcasting is not always optimal. When arrays are broadcast across many dimensions to produce very large intermediate results, the expanded computation can exceed available memory or slow down due to cache pressure. In such cases, a Python loop over smaller slices may actually be more efficient. NumPy's documentation explicitly warns that for large datasets with complex broadcasting, a hybrid approach (Python loops around lower-dimensional vectorized operations) can outperform pure high-dimensional broadcasting. [1]
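One way to picture the hybrid approach is a pairwise squared-distance computation. Fully broadcast, it builds an (n, k, d) temporary; looping over row chunks keeps that temporary bounded. The function name `pairwise_sq_dists_chunked`, the array sizes, and the `chunk_size` values are illustrative assumptions, not a recommendation from the NumPy documentation.

```python
import numpy as np


def pairwise_sq_dists_chunked(points, centers, chunk_size=256):
    """Squared Euclidean distances of shape (n, k), computed in row chunks."""
    n, k = points.shape[0], centers.shape[0]
    out = np.empty((n, k))
    for start in range(0, n, chunk_size):
        block = points[start:start + chunk_size]  # (c, d)
        # (c, 1, d) - (1, k, d) -> (c, k, d); the temporary is bounded by chunk_size.
        diff = block[:, np.newaxis, :] - centers[np.newaxis, :, :]
        out[start:start + chunk_size] = (diff ** 2).sum(axis=-1)
    return out


rng = np.random.default_rng(0)
points = rng.standard_normal((1000, 16))
centers = rng.standard_normal((50, 16))

# Pure broadcasting: one (1000, 50, 16) temporary.
full = ((points[:, np.newaxis, :] - centers[np.newaxis, :, :]) ** 2).sum(axis=-1)
chunked = pairwise_sq_dists_chunked(points, centers, chunk_size=128)
print(np.allclose(full, chunked))  # True
```

Whether the chunked version is actually faster depends on array sizes and hardware, so it is worth measuring before adopting it.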
Broadcasting is used extensively throughout neural network training and inference. Several of the most common operations in deep learning depend on it.
In a fully connected layer, the output before activation is computed as y = Wx + b, where W is the weight matrix, x is the input, and b is the bias vector. When processing a batch of inputs, Wx produces a 2-D array of shape (batch_size, num_units), while b has shape (num_units,). Broadcasting automatically adds the bias vector to every sample in the batch without requiring explicit replication of b. More generally, adding a (d,) bias to an (n, d) batch is the canonical broadcasting pattern: the (d,) vector is padded to (1, d) and stretched across all n rows.
# batch_size = 32, num_units = 128
Wx = np.random.randn(32, 128) # shape (32, 128)
b = np.random.randn(128) # shape (128,)
output = Wx + b # b broadcast to (32, 128)
Batch normalization computes the mean and variance of each feature across the batch dimension, then normalizes and scales. The mean and variance arrays have shape (num_features,) while the data has shape (batch_size, num_features). Broadcasting handles the subtraction and division across the batch dimension.
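A minimal NumPy sketch of the forward pass follows. The scale and shift parameters `gamma` and `beta`, their initial values, and the constant `eps` are assumptions added for illustration; running statistics and the backward pass are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 128))  # (batch_size, num_features)

mean = x.mean(axis=0)               # shape (128,)
var = x.var(axis=0)                 # shape (128,)
gamma = np.ones(128)                # assumed learnable scale
beta = np.zeros(128)                # assumed learnable shift
eps = 1e-5                          # assumed small constant for numerical stability

# Each (128,) statistic is broadcast across the 32 rows of x.
x_hat = (x - mean) / np.sqrt(var + eps)
out = gamma * x_hat + beta
print(out.shape)                    # (32, 128)
```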
Many loss functions involve comparing predictions with ground truth labels. When predictions have shape (batch_size, num_classes) and labels are one-hot encoded with the same shape, element-wise operations proceed directly. When labels are integers of shape (batch_size,), broadcasting and indexing work together to select the correct class probabilities.
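The integer-label case can be sketched in two equivalent ways: integer-array indexing, or a one-hot mask built by broadcasting a comparison. The probabilities and labels below are made-up illustrative values.

```python
import numpy as np

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.3, 0.5],
                  [0.6, 0.3, 0.1]])  # (batch_size=4, num_classes=3)
labels = np.array([0, 1, 2, 1])      # (batch_size,)

# Indexing route: pick probs[i, labels[i]] for every row i.
picked = probs[np.arange(len(labels)), labels]  # [0.7, 0.8, 0.5, 0.3]

# Broadcasting route: (4, 1) == (3,) -> (4, 3) boolean one-hot mask.
one_hot = labels[:, np.newaxis] == np.arange(probs.shape[1])
picked_via_mask = (probs * one_hot).sum(axis=1)

# Mean negative log-likelihood of the correct classes.
loss = -np.log(picked).mean()
print(np.allclose(picked, picked_via_mask))  # True
```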
In transformer models, attention scores are computed as the dot product of query and key matrices, producing a tensor of shape (batch_size, num_heads, seq_len, seq_len). Masking operations, where a mask of shape (1, 1, seq_len, seq_len) or (seq_len, seq_len) is added to the scores, rely on broadcasting to apply the mask across all batches and heads.
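As a rough illustration, the sketch below adds a causal mask of shape (seq_len, seq_len) to random scores and applies a softmax. The sizes and the choice of a causal mask are assumptions made for the example.

```python
import numpy as np

batch_size, num_heads, seq_len = 2, 4, 5
rng = np.random.default_rng(0)
scores = rng.standard_normal((batch_size, num_heads, seq_len, seq_len))

# Causal mask: 0 on and below the diagonal, -inf strictly above it.
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)  # (5, 5)

# (2, 4, 5, 5) + (5, 5): the mask is reused for every batch and head.
masked = scores + mask

# Softmax over the last axis; keepdims=True yields (2, 4, 5, 1) arrays,
# which broadcast back against the (2, 4, 5, 5) scores.
shifted = masked - masked.max(axis=-1, keepdims=True)
weights = np.exp(shifted)
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.shape)  # (2, 4, 5, 5)
```

Masked positions receive zero weight, and each row of `weights` still sums to 1.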
In convolutional neural networks, per-channel normalization is a common preprocessing step. An image batch with shape (batch_size, height, width, channels) can be normalized by subtracting a mean array of shape (3,) (one value per RGB channel). Broadcasting stretches the mean across all spatial positions and all images in the batch.
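The sketch below normalizes a random channels-last batch and then repeats the operation for a channels-first layout, where the per-channel statistics must be reshaped to (3, 1, 1) so that they line up with the channel axis. The array sizes and the channels-first variant are illustrative additions.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((8, 32, 32, 3))         # (batch, height, width, channels)

channel_mean = images.mean(axis=(0, 1, 2))  # shape (3,)
channel_std = images.std(axis=(0, 1, 2))    # shape (3,)

# Channels-last: (3,) aligns with the trailing channel axis.
normalized = (images - channel_mean) / channel_std

# Channels-first (batch, channels, height, width): reshape to (3, 1, 1).
images_nchw = images.transpose(0, 3, 1, 2)
mean_nchw = channel_mean[:, np.newaxis, np.newaxis]
std_nchw = channel_std[:, np.newaxis, np.newaxis]
normalized_nchw = (images_nchw - mean_nchw) / std_nchw
print(normalized.shape, normalized_nchw.shape)  # (8, 32, 32, 3) (8, 3, 32, 32)
```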
Broadcasting and vectorization are closely related concepts that together enable high-performance numerical computing. Vectorization refers to expressing computations as operations on entire arrays rather than writing explicit Python for loops. Broadcasting extends vectorization by making it possible to vectorize operations even when the operand arrays differ in shape. The NumPy documentation frames broadcasting precisely as "a means of vectorizing array operations so that looping occurs in C instead of Python." [1]
Without broadcasting, a programmer who wanted to add a bias vector to every row of a matrix would need to either write an explicit loop or manually tile the vector into a full matrix using np.tile() or np.repeat(). Broadcasting eliminates this overhead, producing code that is both shorter and faster. Vectorized operations with broadcasting execute in optimized, compiled C or Fortran routines rather than in the Python interpreter, and can therefore run orders of magnitude faster than equivalent Python loops.
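The three approaches can be compared side by side. This small sketch uses made-up data and only checks that the results agree; it does not measure speed.

```python
import numpy as np

rng = np.random.default_rng(0)
matrix = rng.standard_normal((4, 3))
bias = np.array([1.0, 2.0, 3.0])

# 1. Explicit Python loop over rows.
looped = np.empty_like(matrix)
for i in range(matrix.shape[0]):
    looped[i] = matrix[i] + bias

# 2. Manual tiling: materialize a full (4, 3) copy of the bias first.
tiled = matrix + np.tile(bias, (matrix.shape[0], 1))

# 3. Broadcasting: no loop and no explicit copy.
broadcasted = matrix + bias

print(np.array_equal(looped, tiled) and np.array_equal(tiled, broadcasted))  # True
```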
Several mistakes frequently arise when working with broadcasting.
- An array with shape (4, 1) added to one with shape (1, 3) produces a (4, 3) result. If the programmer expected plain element-wise addition of two vectors, the result is an unintended outer addition. Careful attention to array shapes is essential.
- Broadcasting a (1000, 1) array against a (1, 1000) array creates a (1000, 1000) result, consuming 1000 times more memory than either input. With higher dimensions, this can quickly exhaust available RAM.
- In PyTorch, an in-place operation such as tensor.add_() cannot change the tensor's shape via broadcasting. Attempting this raises a runtime error. [3]
- Matrix multiplication (np.matmul or the @ operator) follows different rules for combining dimensions.

Imagine you have a coloring book page with a big grid of empty squares, four rows and three columns. You also have just three crayons: red, blue, and green. You want to color every row the same way: red in column one, blue in column two, green in column three. Instead of picking up and putting down each crayon 12 times (once for every square), broadcasting lets you say "use these three colors" and the computer automatically fills in every row for you. You only needed three crayons, but the computer treated them as if you had 12, one for each square. The clever part is that the computer never actually made extra crayons; it just reused the same three over and over. That is broadcasting: taking a small set of values and automatically applying them across a bigger grid without wasting space by making copies.