See also: Machine learning terms
In machine learning, a tensor is a multi-dimensional array of numbers that serves as the fundamental data structure for representing and manipulating data. Tensors generalize the familiar concepts of scalars (single numbers), vectors (lists of numbers), and matrices (grids of numbers) to arbitrarily many dimensions. Every piece of data flowing through a neural network, from raw inputs to final predictions, is represented as a tensor. The name "TensorFlow" itself reflects how central tensors are to deep learning: the framework is named after the flow of tensors through computational graphs.
While the term "tensor" originates in mathematics and physics, its usage in machine learning is somewhat different. In physics, a tensor is a geometric object that transforms in specific ways under changes of coordinate systems. In machine learning, the term refers more loosely to any multi-dimensional array of numerical data. This article focuses primarily on the machine learning usage, while also noting the connections to the mathematical concept.
Tensors are classified by their number of dimensions, also called their rank or order. The table below summarizes the hierarchy.
| Rank | Name | Description | Example Shape | Example Use Case |
|---|---|---|---|---|
| 0 | Scalar | A single number with no axes | () | A loss value, learning rate |
| 1 | Vector | A one-dimensional array of numbers | (n,) | A word embedding, bias term |
| 2 | Matrix | A two-dimensional grid of numbers arranged in rows and columns | (m, n) | A weight matrix in a fully connected layer |
| 3 | 3D Tensor | A "cube" or stack of matrices | (d1, d2, d3) | A batch of text sequences, a color image (H, W, C) |
| 4 | 4D Tensor | A stack of 3D tensors | (d1, d2, d3, d4) | A batch of color images (N, C, H, W) |
| 5+ | 5D+ Tensor | Higher-dimensional arrays | (d1, ..., dn) | Video data (N, T, C, H, W) |
The tensor shape describes the size along each dimension. For example, a tensor of shape (32, 3, 224, 224) holds a batch of 32 color images, each with 3 color channels and a spatial resolution of 224 by 224 pixels. The tensor rank (also called the number of dimensions or ndim) counts how many axes the tensor has. The total number of elements in a tensor equals the product of all dimension sizes.
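To make these properties concrete, here is a small illustrative PyTorch snippet (NumPy and TensorFlow expose equivalent attributes):

```python
import torch

# A hypothetical batch of 32 RGB images at 224x224 resolution.
images = torch.randn(32, 3, 224, 224)

print(images.shape)    # torch.Size([32, 3, 224, 224])
print(images.ndim)     # 4 -- the rank: number of axes
print(images.numel())  # 4816896 = 32 * 3 * 224 * 224
```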
Tensors support a rich set of mathematical operations that form the backbone of deep learning computations.
Element-wise (pointwise) operations apply a function independently to each element or to each pair of corresponding elements in tensors of the same shape. Common element-wise operations include addition, subtraction, multiplication, division, and the application of activation functions like ReLU or sigmoid.
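A brief PyTorch sketch of element-wise behavior:

```python
import torch

a = torch.tensor([1.0, -2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])

print(a + b)          # tensor([5., 3., 9.])
print(a * b)          # tensor([4., -10., 18.])
print(torch.relu(a))  # tensor([1., 0., 3.]) -- negatives clamped to zero
```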
Matrix multiplication is one of the most critical operations in neural networks. It is used in every fully connected layer, attention mechanism, and many other components. For two-dimensional tensors, if tensor A has shape (m, n) and tensor B has shape (n, p), their matrix product has shape (m, p). Higher-dimensional tensors support batched matrix multiplication, where the operation is applied across batch dimensions.
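The shape rules can be checked directly; this sketch uses PyTorch's matmul (the @ operator):

```python
import torch

A = torch.randn(4, 5)         # shape (m, n) = (4, 5)
B = torch.randn(5, 2)         # shape (n, p) = (5, 2)
print((A @ B).shape)          # torch.Size([4, 2]) -- (m, p)

# Batched matrix multiplication: the leading batch dimension is shared.
A_batch = torch.randn(10, 4, 5)
B_batch = torch.randn(10, 5, 2)
print(torch.matmul(A_batch, B_batch).shape)  # torch.Size([10, 4, 2])
```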
Reshaping changes the tensor shape without altering the underlying data. For example, a tensor of shape (2, 3, 4) can be reshaped to (6, 4) or (2, 12), as long as the total number of elements remains the same (24 in this case). Reshaping is commonly used to prepare data for specific network layers.
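For example, in PyTorch:

```python
import torch

x = torch.arange(24).reshape(2, 3, 4)  # 24 elements total

print(x.reshape(6, 4).shape)   # torch.Size([6, 4])
print(x.reshape(2, 12).shape)  # torch.Size([2, 12])
# x.reshape(5, 5) would raise an error: 25 != 24 elements
```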
Slicing extracts a contiguous subset of a tensor along one or more axes. For example, given an image tensor of shape (3, 224, 224), slicing tensor[0, :, :] extracts the first color channel as a 2D matrix. Advanced indexing allows selecting non-contiguous elements or using boolean masks to filter elements based on conditions.
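A short PyTorch sketch of basic slicing and boolean-mask indexing:

```python
import torch

image = torch.randn(3, 224, 224)   # (channels, height, width)

red = image[0, :, :]               # first channel -> shape (224, 224)
crop = image[:, 50:100, 50:100]    # spatial crop -> shape (3, 50, 50)

mask = image > 0                   # boolean mask, same shape as image
positives = image[mask]            # 1D tensor of the positive elements
```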
Broadcasting enables arithmetic operations between tensors of different shapes without explicitly copying data. When two tensors have different shapes, the smaller tensor is "broadcast" (virtually expanded) to match the larger tensor's shape. For broadcasting to work, dimensions must either match or one of them must be 1. For example, adding a bias vector of shape (n,) to a batch of vectors of shape (batch, n) works because the bias is broadcast along the batch dimension. Broadcasting is essential for memory efficiency, since the data is not physically duplicated.
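A minimal PyTorch illustration:

```python
import torch

batch = torch.randn(32, 8)   # (batch, n)
bias = torch.randn(8)        # (n,)

out = batch + bias           # bias is broadcast along the batch dimension
print(out.shape)             # torch.Size([32, 8])

# Dimensions of size 1 are virtually stretched to match:
col = torch.randn(32, 1)
print((batch * col).shape)   # torch.Size([32, 8])
```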
Reduction operations collapse one or more dimensions of a tensor by applying an aggregation function. Common reductions include sum, mean, max, and min. For example, computing the mean across the batch dimension of a loss tensor produces a single scalar value for backpropagation.
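For example:

```python
import torch

losses = torch.tensor([[0.5, 1.5], [2.0, 4.0]])  # per-example losses

print(losses.sum())        # tensor(8.)
print(losses.mean())       # tensor(2.) -- scalar used for backpropagation
print(losses.mean(dim=0))  # tensor([1.2500, 2.7500]) -- collapses dim 0
print(losses.max(dim=1))   # values and indices of the max along dim 1
```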
Concatenation joins tensors along an existing dimension, while stacking joins them along a new dimension. These operations are used frequently when assembling batches of data or combining outputs from multiple network branches.
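The difference is easy to see in the resulting shapes:

```python
import torch

a = torch.randn(2, 3)
b = torch.randn(2, 3)

print(torch.cat([a, b], dim=0).shape)    # torch.Size([4, 3]) -- existing dim grows
print(torch.stack([a, b], dim=0).shape)  # torch.Size([2, 2, 3]) -- new dim added
```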
The data type (dtype) of a tensor determines how each element is stored in memory and the precision of computations. Choosing the right data type balances numerical accuracy against memory usage and computational speed.
| Data Type | Bits | Range / Precision | Typical Use |
|---|---|---|---|
| float32 (FP32) | 32 | ~7 decimal digits, range up to ~3.4 x 10^38 | Default training dtype; high precision |
| float16 (FP16) | 16 | ~3.3 decimal digits, range up to ~65,504 | Mixed-precision training; inference |
| bfloat16 (BF16) | 16 | ~2-3 decimal digits, same exponent range as FP32 | Training on TPUs and newer NVIDIA GPUs |
| float64 (FP64) | 64 | ~15 decimal digits | Scientific computing (rarely used in ML) |
| int8 | 8 | -128 to 127 | Post-training quantization |
| int32 | 32 | -2^31 to 2^31 - 1 | Index tensors, token IDs |
| bool | 8 (1 byte per element) | True / False | Masks, conditions |
float32 is the default data type in most frameworks. It provides good numerical stability for training.
float16 uses half the memory and can double throughput on GPUs with dedicated half-precision hardware (such as NVIDIA Tensor Cores). However, its limited range can cause overflow or underflow during training, which is why it is typically used with a technique called mixed-precision training, where a master copy of weights is kept in float32.
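A minimal sketch of one mixed-precision training step in PyTorch, assuming a CUDA device is available; the tiny linear model and random data here are purely illustrative:

```python
import torch

device = torch.device('cuda')
model = torch.nn.Linear(16, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()  # rescales gradients to avoid FP16 underflow

inputs = torch.randn(8, 16, device=device)
targets = torch.randn(8, 1, device=device)

optimizer.zero_grad()
with torch.autocast(device_type='cuda', dtype=torch.float16):
    loss = torch.nn.functional.mse_loss(model(inputs), targets)  # forward in FP16
scaler.scale(loss).backward()  # scale the loss so small gradients survive FP16
scaler.step(optimizer)         # unscale gradients, then update FP32 master weights
scaler.update()                # adjust the scale factor for the next step
```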
bfloat16 was developed by Google for use on TPUs. It keeps the same exponent range as float32 (8 exponent bits) but sacrifices significand precision (7 mantissa bits instead of 23). This makes it more numerically stable than float16 for training, because it can represent the same range of magnitudes as float32.
int8 quantization reduces model size and speeds up inference by representing weights and activations as 8-bit integers. This is widely used for deploying models on edge devices and in production environments.
The word "tensor" is shared between physics/mathematics and machine learning, but the two meanings are quite different.
| Aspect | Physics / Mathematics | Machine Learning |
|---|---|---|
| Definition | A geometric object that transforms according to specific rules under coordinate changes | A multi-dimensional array of numbers |
| Key property | Obeys coordinate transformation laws (covariance/contravariance) | No transformation rules required |
| Rank meaning | Number of indices, each associated with a vector space or its dual | Number of array dimensions (axes) |
| Example | Stress tensor, electromagnetic field tensor, Riemann curvature tensor | A batch of images stored as a 4D array |
| Framework | Differential geometry, general relativity, continuum mechanics | PyTorch, TensorFlow, NumPy |
In physics, a rank-2 tensor is not just any matrix; it is a linear map that transforms in a specific way when the basis vectors change. In machine learning, a rank-2 tensor is simply a 2D array of numbers with no transformation rules attached. The ML usage is more informal but is now the dominant meaning in the software engineering and AI communities.
PyTorch represents all data as torch.Tensor objects. Key characteristics include:
- Tensors can be moved between devices with tensor.to('cuda') or tensor.cuda(). All tensors participating in an operation must reside on the same device.
- Setting requires_grad=True on a tensor tells PyTorch to track all operations on it, building a computational graph for automatic differentiation.
- In-place operations exist (suffixed with _, such as tensor.add_()), but these must be used carefully because they can interfere with gradient computation.

TensorFlow uses tf.Tensor as its core data structure. Key characteristics include:

- Eager execution is the default, with tf.function available for converting Python functions into optimized computational graphs.
- tf.Tensor objects are immutable; mutable state such as model weights is held in tf.Variable.
- Device placement is automatic in most cases but can be controlled explicitly with tf.device('/GPU:0').

NumPy provides ndarray, the predecessor of and inspiration for both PyTorch and TensorFlow tensors. While NumPy arrays run only on the CPU and lack automatic differentiation, they remain widely used for data preprocessing and are interoperable with both frameworks through conversion functions like torch.from_numpy() (zero-copy) and tf.constant().
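A small sketch of that interoperability; torch.from_numpy() shares memory with the source array rather than copying it:

```python
import numpy as np
import torch

arr = np.array([1.0, 2.0, 3.0], dtype=np.float32)
t = torch.from_numpy(arr)   # shares memory with arr, no copy

arr[0] = 99.0
print(t[0])                 # tensor(99.) -- the change is visible in the tensor

back = t.numpy()            # also zero-copy (CPU tensors only)
```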
Modern deep learning workloads run on specialized hardware to accelerate tensor operations.
| Device | Description | Strengths |
|---|---|---|
| CPU | General-purpose processor | Flexible; good for small tensors and data preprocessing |
| GPU (CUDA) | Massively parallel processor with thousands of cores | Excellent for large matrix multiplications and convolutions |
| TPU | Google's custom ASIC designed for tensor operations | Optimized for large-scale training with bfloat16 |
| Apple Silicon (MPS) | Apple's GPU framework for M-series chips | Enables local GPU training on Mac hardware |
Transferring tensors between devices (for example, from CPU to GPU) involves a data copy across the memory bus, which can become a bottleneck if done too frequently. Best practice is to move data to the target device once and perform all computations there before moving results back.
In PyTorch, a common pattern looks like:
```python
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tensor = tensor.to(device)
model = model.to(device)
```
In TensorFlow, GPU placement is typically automatic, but can be controlled explicitly:
```python
with tf.device('/GPU:0'):
    tensor = tf.constant([1.0, 2.0, 3.0])
```
Automatic differentiation (autograd) is the mechanism that enables gradient descent optimization in neural networks by computing gradients of a loss function with respect to model parameters. Tensors play a central role in this process.
When a tensor has requires_grad=True in PyTorch (or is wrapped in a tf.Variable in TensorFlow), the framework records every operation applied to it in a directed acyclic graph (DAG). In this graph:

- Leaf nodes are the input tensors and trainable parameters.
- Interior nodes represent the operations applied to them.
- The root is the final output, typically the scalar loss.
Calling .backward() on the loss tensor traverses this graph from root to leaves using the chain rule, computing the gradient of the loss with respect to every leaf tensor. These gradients are stored in each tensor's .grad attribute and are then consumed by the optimizer to update the weights.
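A minimal PyTorch example of this flow:

```python
import torch

w = torch.tensor([2.0, 3.0], requires_grad=True)  # leaf tensor (a "parameter")
x = torch.tensor([1.0, 4.0])                      # input, no gradient tracking

loss = (w * x).sum()   # loss = 2*1 + 3*4 = 14
loss.backward()        # traverse the graph from the loss back to the leaves

print(w.grad)          # tensor([1., 4.]) -- d(loss)/dw equals x
```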
PyTorch uses a dynamic computational graph (define-by-run), meaning the graph is rebuilt on every forward pass. This allows Python control flow (if statements, loops) to affect the graph structure naturally. TensorFlow 2.x also supports dynamic graphs through eager execution, though tf.function can capture a static graph for performance optimization.
To prevent unnecessary gradient tracking during inference, PyTorch provides torch.no_grad() and TensorFlow provides tf.stop_gradient(). Disabling gradient tracking reduces memory usage and speeds up computation.
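For example, in PyTorch:

```python
import torch

model = torch.nn.Linear(4, 2)

with torch.no_grad():               # no graph is recorded inside this block
    out = model(torch.randn(1, 4))
print(out.requires_grad)            # False -- nothing to backpropagate through
```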
Many real-world datasets contain tensors where the vast majority of elements are zero. Storing all these zeros wastes memory and computation. Sparse tensor representations store only the non-zero values and their coordinates, dramatically reducing memory usage.
Common sparse storage formats include:
- COO (coordinate) format, which stores each non-zero value alongside its index coordinates; this is the format exposed by PyTorch (torch.sparse_coo_tensor) and TensorFlow (tf.sparse.SparseTensor).
- CSR (compressed sparse row) format, which compresses the row indices for efficient row slicing and matrix-vector products; PyTorch exposes this as torch.sparse_csr_tensor.

Sparse tensors are particularly useful for:

- One-hot or bag-of-words text features in natural language processing
- User-item interaction matrices in recommender systems
- Adjacency matrices of large graphs
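As an illustration of the COO format mentioned above, a small sparse tensor in PyTorch:

```python
import torch

# A 3x3 matrix with only two non-zero entries, stored in COO format.
indices = torch.tensor([[0, 2],    # row coordinates
                        [1, 0]])   # column coordinates
values = torch.tensor([5.0, 7.0])
sparse = torch.sparse_coo_tensor(indices, values, size=(3, 3))

print(sparse.to_dense())
# tensor([[0., 5., 0.],
#         [0., 0., 0.],
#         [7., 0., 0.]])
```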
Traditional tensor operations refer to dimensions by numerical index (0, 1, 2, ...), which can lead to subtle bugs when dimensions are reordered or when code becomes complex. Named tensors attach meaningful names to dimensions, making code more readable and less error-prone.
PyTorch introduced experimental named tensor support, allowing dimensions to be labeled:
```python
images = torch.randn(32, 3, 224, 224, names=('batch', 'channels', 'height', 'width'))
```
With named dimensions, operations can reference axes by name rather than index, which prevents common errors like summing over the wrong dimension. While PyTorch's named tensor API remains experimental, libraries like einops and the einsum notation provide alternative ways to write dimension-aware tensor operations clearly.
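As a sketch of the einsum style, here the letters b, c, h, w name the axes of an image batch, making the intended reduction explicit:

```python
import torch

images = torch.randn(32, 3, 224, 224)  # (batch, channels, height, width)

# Sum over the spatial axes h and w, keep batch and channels,
# then divide by the number of pixels to get a spatial mean.
pooled = torch.einsum('bchw->bc', images) / (224 * 224)
print(pooled.shape)  # torch.Size([32, 3])
```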
Different neural network architectures expect input tensors in specific shapes. Understanding these conventions is essential for building and debugging models.
| Architecture | Typical Input Shape | Dimension Meanings |
|---|---|---|
| Fully connected (MLP) | (batch, features) | Batch of flat feature vectors |
| CNN (PyTorch) | (batch, channels, height, width) | NCHW format |
| CNN (TensorFlow) | (batch, height, width, channels) | NHWC format |
| RNN / LSTM | (batch, sequence_length, features) | Batch of variable-length sequences |
| Transformer | (batch, sequence_length, d_model) | Batch of token embeddings |
| 3D CNN (video) | (batch, channels, depth, height, width) | Volumetric or temporal data |
The two dominant memory layout conventions for image data are NCHW (batch, channels, height, width) and NHWC (batch, height, width, channels). PyTorch defaults to NCHW, while TensorFlow defaults to NHWC. NVIDIA Tensor Cores perform best with NHWC, but automatic layout conversions handle this transparently in most cases. The choice of layout can affect performance by roughly 10-30%, depending on the hardware and operation.
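Converting between the two layouts is a simple axis permutation; a PyTorch sketch:

```python
import torch

nchw = torch.randn(32, 3, 224, 224)  # PyTorch's default layout

nhwc = nchw.permute(0, 2, 3, 1)      # reorder axes to NHWC
print(nhwc.shape)                    # torch.Size([32, 224, 224, 3])

# permute returns a view; .contiguous() materializes the new memory layout.
nhwc = nhwc.contiguous()
```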
The mathematical concept of tensors was formalized in the 19th century by Gregorio Ricci-Curbastro and Tullio Levi-Civita as part of their work on differential geometry. Albert Einstein later used tensor calculus extensively in his general theory of relativity.
The adoption of tensors in machine learning began in the early 2000s, when researchers like M. Alex O. Vasilescu and Demetri Terzopoulos introduced multilinear tensor methods into computer vision. The modern usage of tensors as multi-dimensional arrays became widespread with the development of NumPy in 2006, the Theano library in 2007, and eventually TensorFlow in 2015 and PyTorch in 2016.
Specialized tensor hardware followed: NVIDIA released cuDNN in 2014, Google developed TPUs between 2015 and 2017, and NVIDIA introduced Tensor Cores with its Volta GPU architecture in 2017. These hardware advances enabled training models with billions of parameters.
Imagine you have different ways to organize your toys. A single toy car sitting on the floor is like a scalar: just one number. Now line up five toy cars in a row, and that is like a vector: a list of numbers. Arrange the cars in rows and columns on a table, and you have a matrix: a grid of numbers. A tensor is what you get when you stack multiple grids on top of each other, or organize them in even more complicated patterns. It is basically a container that can hold numbers in any arrangement, no matter how many directions or layers you need.
Computers use tensors to work with all sorts of data. A color photo, for instance, is stored as three grids layered together (one for red, one for green, one for blue). A whole album of photos is a stack of those layers. When a computer learns to recognize cats in photos, it reads these tensors of numbers, does a lot of math on them, and gradually figures out which patterns mean "cat." Tensors are the building blocks that make all of this possible.