See also: Machine learning terms
In machine learning, a tensor is a multi-dimensional array of numbers that serves as the fundamental data structure for representing and manipulating data. Tensors generalize the familiar concepts of scalars (single numbers), vectors (lists of numbers), and matrices (grids of numbers) to arbitrarily many dimensions. Every piece of data flowing through a neural network, from raw inputs to final predictions, is represented as a tensor. The name "TensorFlow" itself reflects how central tensors are to deep learning: the framework is named after the flow of tensors through computational graphs.
While the term "tensor" originates in mathematics and physics, its usage in machine learning is somewhat different. In physics, a tensor is a geometric object that transforms in specific ways under changes of coordinate systems. In machine learning, the term refers more loosely to any multi-dimensional array of numerical data. This article focuses primarily on the machine learning usage, while also noting the connections to the mathematical concept.
Tensors are classified by their number of dimensions, also called their rank or order. The table below summarizes the hierarchy.
| Rank | Name | Description | Example Shape | Example Use Case |
|---|---|---|---|---|
| 0 | Scalar | A single number with no axes | () | A loss value, learning rate |
| 1 | Vector | A one-dimensional array of numbers | (n,) | A word embedding, bias term |
| 2 | Matrix | A two-dimensional grid of numbers arranged in rows and columns | (m, n) | A weight matrix in a fully connected layer |
| 3 | 3D Tensor | A "cube" or stack of matrices | (d1, d2, d3) | A batch of text sequences, a color image (H, W, C) |
| 4 | 4D Tensor | A stack of 3D tensors | (d1, d2, d3, d4) | A batch of color images (N, C, H, W) |
| 5+ | 5D+ Tensor | Higher-dimensional arrays | (d1, ..., dn) | Video data (N, T, C, H, W) |
The tensor shape describes the size along each dimension. For example, a tensor of shape (32, 3, 224, 224) holds a batch of 32 color images, each with 3 color channels and a spatial resolution of 224 by 224 pixels. The tensor rank (also called the number of dimensions or ndim) counts how many axes the tensor has. The total number of elements in a tensor equals the product of all dimension sizes.
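To make these properties concrete, here is a small illustrative PyTorch snippet (NumPy and TensorFlow expose equivalent attributes):

```python
import torch

# A hypothetical batch of 32 RGB images at 224x224 resolution.
images = torch.randn(32, 3, 224, 224)

print(images.shape)    # torch.Size([32, 3, 224, 224])
print(images.ndim)     # 4 -- the rank: number of axes
print(images.numel())  # 4816896 = 32 * 3 * 224 * 224
```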
Tensors support a rich set of mathematical operations that form the backbone of deep learning computations.
Element-wise (pointwise) operations apply a function independently to each element or to each pair of corresponding elements in tensors of the same shape. Common element-wise operations include addition, subtraction, multiplication, division, and the application of activation functions like ReLU or sigmoid.
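A brief PyTorch sketch of element-wise behavior:

```python
import torch

a = torch.tensor([1.0, -2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])

print(a + b)          # tensor([5., 3., 9.])
print(a * b)          # tensor([4., -10., 18.])
print(torch.relu(a))  # tensor([1., 0., 3.]) -- negatives clamped to zero
```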
Matrix multiplication is one of the most critical operations in neural networks. It is used in every fully connected layer, attention mechanism, and many other components. For two-dimensional tensors, if tensor A has shape (m, n) and tensor B has shape (n, p), their matrix product has shape (m, p). Higher-dimensional tensors support batched matrix multiplication, where the operation is applied across batch dimensions.
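The shape rules can be checked directly; this sketch uses PyTorch's matmul (the @ operator):

```python
import torch

A = torch.randn(4, 5)         # shape (m, n) = (4, 5)
B = torch.randn(5, 2)         # shape (n, p) = (5, 2)
print((A @ B).shape)          # torch.Size([4, 2]) -- (m, p)

# Batched matrix multiplication: the leading batch dimension is shared.
A_batch = torch.randn(10, 4, 5)
B_batch = torch.randn(10, 5, 2)
print(torch.matmul(A_batch, B_batch).shape)  # torch.Size([10, 4, 2])
```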
Reshaping changes the tensor shape without altering the underlying data. For example, a tensor of shape (2, 3, 4) can be reshaped to (6, 4) or (2, 12), as long as the total number of elements remains the same (24 in this case). Reshaping is commonly used to prepare data for specific network layers.
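For example, in PyTorch:

```python
import torch

x = torch.arange(24).reshape(2, 3, 4)  # 24 elements total

print(x.reshape(6, 4).shape)   # torch.Size([6, 4])
print(x.reshape(2, 12).shape)  # torch.Size([2, 12])
# x.reshape(5, 5) would raise an error: 25 != 24 elements
```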
Slicing extracts a contiguous subset of a tensor along one or more axes. For example, given an image tensor of shape (3, 224, 224), slicing tensor[0, :, :] extracts the first color channel as a 2D matrix. Advanced indexing allows selecting non-contiguous elements or using boolean masks to filter elements based on conditions.
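A short PyTorch sketch of basic slicing and boolean-mask indexing:

```python
import torch

image = torch.randn(3, 224, 224)   # (channels, height, width)

red = image[0, :, :]               # first channel -> shape (224, 224)
crop = image[:, 50:100, 50:100]    # spatial crop -> shape (3, 50, 50)

mask = image > 0                   # boolean mask, same shape as image
positives = image[mask]            # 1D tensor of the positive elements
```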
Broadcasting enables arithmetic operations between tensors of different shapes without explicitly copying data. When two tensors have different shapes, the smaller tensor is "broadcast" (virtually expanded) to match the larger tensor's shape. For broadcasting to work, dimensions must either match or one of them must be 1. For example, adding a bias vector of shape (n,) to a batch of vectors of shape (batch, n) works because the bias is broadcast along the batch dimension. Broadcasting is essential for memory efficiency, since the data is not physically duplicated.
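A minimal PyTorch illustration:

```python
import torch

batch = torch.randn(32, 8)   # (batch, n)
bias = torch.randn(8)        # (n,)

out = batch + bias           # bias is broadcast along the batch dimension
print(out.shape)             # torch.Size([32, 8])

# Dimensions of size 1 are virtually stretched to match:
col = torch.randn(32, 1)
print((batch * col).shape)   # torch.Size([32, 8])
```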
Reduction operations collapse one or more dimensions of a tensor by applying an aggregation function. Common reductions include sum, mean, max, and min. For example, computing the mean across the batch dimension of a loss tensor produces a single scalar value for backpropagation.
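For example:

```python
import torch

losses = torch.tensor([[0.5, 1.5], [2.0, 4.0]])  # per-example losses

print(losses.sum())        # tensor(8.)
print(losses.mean())       # tensor(2.) -- scalar used for backpropagation
print(losses.mean(dim=0))  # tensor([1.2500, 2.7500]) -- collapses dim 0
print(losses.max(dim=1))   # values and indices of the max along dim 1
```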
Concatenation joins tensors along an existing dimension, while stacking joins them along a new dimension. These operations are used frequently when assembling batches of data or combining outputs from multiple network branches.
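The difference is easy to see in the resulting shapes:

```python
import torch

a = torch.randn(2, 3)
b = torch.randn(2, 3)

print(torch.cat([a, b], dim=0).shape)    # torch.Size([4, 3]) -- existing dim grows
print(torch.stack([a, b], dim=0).shape)  # torch.Size([2, 2, 3]) -- new dim added
```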
The data type (dtype) of a tensor determines how each element is stored in memory and the precision of computations. Choosing the right data type balances numerical accuracy against memory usage and computational speed.
| Data Type | Bits | Range / Precision | Typical Use |
|---|---|---|---|
| float32 (FP32) | 32 | ~7 decimal digits, range up to ~3.4 x 10^38 | Default training dtype; high precision |
| float16 (FP16) | 16 | ~3.3 decimal digits, range up to ~65,504 | Mixed-precision training; inference |
| bfloat16 (BF16) | 16 | ~2-3 decimal digits, same exponent range as FP32 | Training on TPUs and newer NVIDIA GPUs |
| float64 (FP64) | 64 | ~15 decimal digits | Scientific computing (rarely used in ML) |
| int8 | 8 | -128 to 127 | Post-training quantization |
| int32 | 32 | -2^31 to 2^31 - 1 | Index tensors, token IDs |
| bool | 8 (1 byte per element) | True / False | Masks, conditions |
float32 is the default data type in most frameworks. It provides good numerical stability for training.
float16 uses half the memory and can double throughput on GPUs with dedicated half-precision hardware (such as NVIDIA Tensor Cores). However, its limited range can cause overflow or underflow during training, which is why it is typically used with a technique called mixed-precision training, where a master copy of weights is kept in float32.
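A minimal sketch of one mixed-precision training step in PyTorch, assuming a CUDA device is available; the tiny linear model and random data here are purely illustrative:

```python
import torch

device = torch.device('cuda')
model = torch.nn.Linear(16, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()  # rescales gradients to avoid FP16 underflow

inputs = torch.randn(8, 16, device=device)
targets = torch.randn(8, 1, device=device)

optimizer.zero_grad()
with torch.autocast(device_type='cuda', dtype=torch.float16):
    loss = torch.nn.functional.mse_loss(model(inputs), targets)  # forward in FP16
scaler.scale(loss).backward()  # scale the loss so small gradients survive FP16
scaler.step(optimizer)         # unscale gradients, then update FP32 master weights
scaler.update()                # adjust the scale factor for the next step
```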
bfloat16 was developed by Google for use on TPUs. It keeps the same exponent range as float32 (8 exponent bits) but sacrifices significand precision (7 mantissa bits instead of 23). This makes it more numerically stable than float16 for training, because it can represent the same range of magnitudes as float32.
int8 quantization reduces model size and speeds up inference by representing weights and activations as 8-bit integers. This is widely used for deploying models on edge devices and in production environments.
The word "tensor" is shared between physics/mathematics and machine learning, but the two meanings are quite different.
| Aspect | Physics / Mathematics | Machine Learning |
|---|---|---|
| Definition | A geometric object that transforms according to specific rules under coordinate changes | A multi-dimensional array of numbers |
| Key property | Obeys coordinate transformation laws (covariance/contravariance) | No transformation rules required |
| Rank meaning | Number of indices, each associated with a vector space or its dual | Number of array dimensions (axes) |
| Example | Stress tensor, electromagnetic field tensor, Riemann curvature tensor | A batch of images stored as a 4D array |
| Framework | Differential geometry, general relativity, continuum mechanics | PyTorch, TensorFlow, NumPy |
In physics, a rank-2 tensor is not just any matrix; it is a linear map that transforms in a specific way when the basis vectors change. In machine learning, a rank-2 tensor is simply a 2D array of numbers with no transformation rules attached. The ML usage is more informal but is now the dominant meaning in the software engineering and AI communities.
PyTorch represents all data as torch.Tensor objects. Key characteristics include:
- Tensors can be moved between devices with tensor.to('cuda') or tensor.cuda(). All tensors participating in an operation must reside on the same device.
- Setting requires_grad=True on a tensor tells PyTorch to track all operations on it, building a computational graph for automatic differentiation.
- In-place operations exist (suffixed with _, such as tensor.add_()), but these must be used carefully because they can interfere with gradient computation.

TensorFlow uses tf.Tensor as its core data structure. Key characteristics include:

- Eager execution is the default, with tf.function available for converting Python functions into optimized computational graphs.
- tf.Tensor objects are immutable; mutable state such as model weights is held in tf.Variable.
- Device placement is automatic in most cases but can be controlled explicitly with tf.device('/GPU:0').

NumPy provides ndarray, the predecessor of and inspiration for both PyTorch and TensorFlow tensors. While NumPy arrays run only on the CPU and lack automatic differentiation, they remain widely used for data preprocessing and are interoperable with both frameworks through conversion functions like torch.from_numpy() (zero-copy) and tf.constant().
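A small sketch of that interoperability; torch.from_numpy() shares memory with the source array rather than copying it:

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0], dtype=np.float32)
t = torch.from_numpy(arr)   # shares memory with arr, no copy

arr[0] = 99.0
print(t[0])                 # tensor(99.) -- the change is visible in the tensor

back = t.numpy()            # also zero-copy (CPU tensors only)
```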
Modern deep learning workloads run on specialized hardware to accelerate tensor operations.
| Device | Description | Strengths |
|---|---|---|
| CPU | General-purpose processor | Flexible; good for small tensors and data preprocessing |
| GPU (CUDA) | Massively parallel processor with thousands of cores | Excellent for large matrix multiplications and convolutions |
| TPU | Google's custom ASIC designed for tensor operations | Optimized for large-scale training with bfloat16 |
| Apple Silicon (MPS) | Apple's GPU framework for M-series chips | Enables local GPU training on Mac hardware |
Transferring tensors between devices (for example, from CPU to GPU) involves a data copy across the memory bus, which can become a bottleneck if done too frequently. Best practice is to move data to the target device once and perform all computations there before moving results back.
In PyTorch, a common pattern looks like:
```python
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tensor = tensor.to(device)
model = model.to(device)
```
In TensorFlow, GPU placement is typically automatic, but can be controlled explicitly:
```python
with tf.device('/GPU:0'):
    tensor = tf.constant([1.0, 2.0, 3.0])
```
Automatic differentiation (autograd) is the mechanism that enables gradient descent optimization in neural networks by computing gradients of a loss function with respect to model parameters. Tensors play a central role in this process.
When a tensor has requires_grad=True in PyTorch (or is wrapped in a tf.Variable in TensorFlow), the framework records every operation applied to it in a directed acyclic graph (DAG). In this graph:

- Leaf nodes are the input tensors and trainable parameters.
- Interior nodes represent the operations applied to them.
- The root is the final output, typically the scalar loss.
Calling .backward() on the loss tensor traverses this graph from root to leaves using the chain rule, computing the gradient of the loss with respect to every leaf tensor. These gradients are stored in each tensor's .grad attribute and are then consumed by the optimizer to update the weights.
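A minimal PyTorch example of this flow:

```python
import torch

w = torch.tensor([2.0, 3.0], requires_grad=True)  # leaf tensor (a "parameter")
x = torch.tensor([1.0, 4.0])                      # input, no gradient tracking

loss = (w * x).sum()   # loss = 2*1 + 3*4 = 14
loss.backward()        # traverse the graph from the loss back to the leaves

print(w.grad)          # tensor([1., 4.]) -- d(loss)/dw equals x
```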
PyTorch uses a dynamic computational graph (define-by-run), meaning the graph is rebuilt on every forward pass. This allows Python control flow (if statements, loops) to affect the graph structure naturally. TensorFlow 2.x also supports dynamic graphs through eager execution, though tf.function can capture a static graph for performance optimization.
To prevent unnecessary gradient tracking during inference, PyTorch provides torch.no_grad() and TensorFlow provides tf.stop_gradient(). Disabling gradient tracking reduces memory usage and speeds up computation.
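For example, in PyTorch:

```python
import torch

model = torch.nn.Linear(4, 2)

with torch.no_grad():               # no graph is recorded inside this block
    out = model(torch.randn(1, 4))
print(out.requires_grad)            # False -- nothing to backpropagate through
```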
Many real-world datasets contain tensors where the vast majority of elements are zero. Storing all these zeros wastes memory and computation. Sparse tensor representations store only the non-zero values and their coordinates, dramatically reducing memory usage.
Common sparse storage formats include:
- COO (coordinate) format, which stores each non-zero value alongside its index coordinates; this is the format exposed by PyTorch (torch.sparse_coo_tensor) and TensorFlow (tf.sparse.SparseTensor).
- CSR (compressed sparse row) format, which compresses the row indices for efficient row slicing and matrix-vector products; PyTorch exposes this as torch.sparse_csr_tensor.

Sparse tensors are particularly useful for:

- One-hot or bag-of-words text features in natural language processing
- User-item interaction matrices in recommender systems
- Adjacency matrices of large graphs
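As an illustration of the COO format mentioned above, a small sparse tensor in PyTorch:

```python
import torch

# A 3x3 matrix with only two non-zero entries, stored in COO format.
indices = torch.tensor([[0, 2],    # row coordinates
                        [1, 0]])   # column coordinates
values = torch.tensor([5.0, 7.0])
sparse = torch.sparse_coo_tensor(indices, values, size=(3, 3))

print(sparse.to_dense())
# tensor([[0., 5., 0.],
#         [0., 0., 0.],
#         [7., 0., 0.]])
```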
Traditional tensor operations refer to dimensions by numerical index (0, 1, 2, ...), which can lead to subtle bugs when dimensions are reordered or when code becomes complex. Named tensors attach meaningful names to dimensions, making code more readable and less error-prone.
PyTorch introduced experimental named tensor support, allowing dimensions to be labeled:
```python
images = torch.randn(32, 3, 224, 224, names=('batch', 'channels', 'height', 'width'))
```
With named dimensions, operations can reference axes by name rather than index, which prevents common errors like summing over the wrong dimension. While PyTorch's named tensor API remains experimental, libraries like einops and the einsum notation provide alternative ways to write dimension-aware tensor operations clearly.
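As a sketch of the einsum style, here the letters b, c, h, w name the axes of an image batch, making the intended reduction explicit:

```python
import torch

images = torch.randn(32, 3, 224, 224)  # (batch, channels, height, width)

# Sum over the spatial axes h and w, keep batch and channels,
# then divide by the number of pixels to get a spatial mean.
pooled = torch.einsum('bchw->bc', images) / (224 * 224)
print(pooled.shape)  # torch.Size([32, 3])
```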
Different neural network architectures expect input tensors in specific shapes. Understanding these conventions is essential for building and debugging models.
| Architecture | Typical Input Shape | Dimension Meanings |
|---|---|---|
| Fully connected (MLP) | (batch, features) | Batch of flat feature vectors |
| CNN (PyTorch) | (batch, channels, height, width) | NCHW format |
| CNN (TensorFlow) | (batch, height, width, channels) | NHWC format |
| RNN / LSTM | (batch, sequence_length, features) | Batch of variable-length sequences |
| Transformer | (batch, sequence_length, d_model) | Batch of token embeddings |
| 3D CNN (video) | (batch, channels, depth, height, width) | Volumetric or temporal data |
The two dominant memory layout conventions for image data are NCHW (batch, channels, height, width) and NHWC (batch, height, width, channels). PyTorch defaults to NCHW, while TensorFlow defaults to NHWC. NVIDIA Tensor Cores perform best with NHWC, but automatic layout conversions handle this transparently in most cases. The choice of layout can affect performance by roughly 10-30%, depending on the hardware and operation.
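Converting between the two layouts is a simple axis permutation; a PyTorch sketch:

```python
import torch

nchw = torch.randn(32, 3, 224, 224)  # PyTorch's default layout

nhwc = nchw.permute(0, 2, 3, 1)      # reorder axes to NHWC
print(nhwc.shape)                    # torch.Size([32, 224, 224, 3])

# permute returns a view; .contiguous() materializes the new memory layout.
nhwc = nhwc.contiguous()
```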
The mathematical concept of tensors was formalized in the 19th century by Gregorio Ricci-Curbastro and Tullio Levi-Civita as part of their work on differential geometry. Albert Einstein later used tensor calculus extensively in his general theory of relativity.
The adoption of tensors in machine learning began in the early 2000s, when researchers like M. Alex O. Vasilescu and Demetri Terzopoulos introduced multilinear tensor methods into computer vision. The modern usage of tensors as multi-dimensional arrays became widespread with the development of NumPy in 2006, the Theano library in 2007, and eventually TensorFlow in 2015 and PyTorch in 2016.
Specialized tensor hardware followed: NVIDIA released cuDNN in 2014, Google developed TPUs between 2015 and 2017, and NVIDIA introduced Tensor Cores with its Volta GPU architecture in 2017. These hardware advances enabled training models with billions of parameters.
Imagine you have different ways to organize your toys. A single toy car sitting on the floor is like a scalar: just one number. Now line up five toy cars in a row, and that is like a vector: a list of numbers. Arrange the cars in rows and columns on a table, and you have a matrix: a grid of numbers. A tensor is what you get when you stack multiple grids on top of each other, or organize them in even more complicated patterns. It is basically a container that can hold numbers in any arrangement, no matter how many directions or layers you need.
Computers use tensors to work with all sorts of data. A color photo, for instance, is stored as three grids layered together (one for red, one for green, one for blue). A whole album of photos is a stack of those layers. When a computer learns to recognize cats in photos, it reads these tensors of numbers, does a lot of math on them, and gradually figures out which patterns mean "cat." Tensors are the building blocks that make all of this possible.