# Tensor

> Source: https://aiwiki.ai/wiki/tensor
> Updated: 2026-06-21
> Categories: Deep Learning, Machine Learning, Mathematics
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

In machine learning, a **tensor** is a multi-dimensional array of numbers that serves as the fundamental data structure for representing and manipulating data. Tensors generalize the familiar concepts of [scalars](/wiki/scalar) (single numbers), vectors (lists of numbers), and matrices (grids of numbers) to arbitrarily many dimensions. Every piece of data flowing through a [neural network](/wiki/neural_network), from raw inputs to final predictions, is represented as a tensor. The original TensorFlow paper defines the term concisely: tensors are "arrays of arbitrary dimensionality where the underlying element type is specified or inferred at graph-construction time."[3] The name "TensorFlow" itself reflects how central tensors are to [deep learning](/wiki/deep_model): the framework is named after the flow of tensors through computational graphs.

While the term "tensor" originates in mathematics and physics, its usage in machine learning is somewhat different. In physics, a tensor is a geometric object that transforms in specific ways under changes of coordinate systems. In machine learning, the term refers more loosely to any multi-dimensional array of numerical data. This article focuses primarily on the machine learning usage, while also noting the connections to the mathematical concept.

## What are the ranks of a tensor?

Tensors are classified by their number of dimensions, also called their rank or order.[1] The table below summarizes the hierarchy.

| Rank | Name | Description | Example Shape | Example Use Case |
|------|------|-------------|---------------|------------------|
| 0 | [Scalar](/wiki/scalar) | A single number with no axes | `()` | A loss value, learning rate |
| 1 | Vector | A one-dimensional array of numbers | `(n,)` | A word [embedding](/wiki/embedding_vector), bias term |
| 2 | Matrix | A two-dimensional grid of numbers arranged in rows and columns | `(m, n)` | A weight matrix in a fully connected [layer](/wiki/layer) |
| 3 | 3D Tensor | A "cube" or stack of matrices | `(d1, d2, d3)` | A batch of text sequences, a color image (H, W, C) |
| 4 | 4D Tensor | A stack of 3D tensors | `(d1, d2, d3, d4)` | A batch of color images (N, C, H, W) |
| 5+ | 5D+ Tensor | Higher-dimensional arrays | `(d1, ..., dn)` | Video data (N, T, C, H, W) |

The **[tensor shape](/wiki/tensor_shape)** describes the size along each dimension. For example, a tensor of shape `(32, 3, 224, 224)` holds a batch of 32 color images, each with 3 color channels and a spatial resolution of 224 by 224 pixels. The **[tensor rank](/wiki/tensor_rank)** (also called the number of dimensions or `ndim`) counts how many axes the tensor has. The total number of elements in a tensor equals the product of all dimension sizes.[9]
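
As a small illustration of shape, rank, and element count, the sketch below uses NumPy (chosen here only for convenience; PyTorch and TensorFlow expose comparable attributes). The variable names are illustrative.

```python
import numpy as np

# A batch of 32 RGB images at 224x224 resolution, filled with zeros
images = np.zeros((32, 3, 224, 224), dtype=np.float32)

rank = images.ndim          # number of axes
shape = images.shape        # size along each axis
num_elements = images.size  # product of all dimension sizes

print(rank)          # 4
print(shape)         # (32, 3, 224, 224)
print(num_elements)  # 4816896
```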

## What operations can you perform on tensors?

Tensors support a rich set of mathematical operations that form the backbone of [deep learning](/wiki/deep_model) computations.

### Element-wise Operations

Element-wise (pointwise) operations apply a function independently to each element or to each pair of corresponding elements in tensors of the same shape. Common element-wise operations include addition, subtraction, multiplication, division, and the application of activation functions like ReLU or sigmoid.
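
A minimal NumPy sketch of element-wise operations follows; the input values are arbitrary, and the ReLU and sigmoid are written out by hand rather than taken from any framework.

```python
import numpy as np

a = np.array([[1.0, -2.0], [3.0, -4.0]])
b = np.array([[10.0, 20.0], [30.0, 40.0]])

summed = a + b                       # pairs up corresponding elements
product = a * b                      # element-wise product, not matrix multiplication
relu = np.maximum(a, 0.0)            # ReLU applied to each element independently
sigmoid = 1.0 / (1.0 + np.exp(-a))   # sigmoid applied to each element independently

print(product)  # [[  10.  -40.] [  90. -160.]]
print(relu)     # [[1. 0.] [3. 0.]]
```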

### Matrix Multiplication

Matrix multiplication is one of the most critical operations in neural networks. It is used in every fully connected layer, [attention](/wiki/attention) mechanism, and many other components. For two-dimensional tensors, if tensor A has shape `(m, n)` and tensor B has shape `(n, p)`, their matrix product has shape `(m, p)`. Higher-dimensional tensors support batched matrix multiplication, where the operation is applied across batch dimensions.
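
The shape rule can be checked directly. The following NumPy sketch uses arbitrary sizes and all-ones inputs so the result is easy to verify by hand.

```python
import numpy as np

m, n, p = 4, 3, 5
A = np.ones((m, n))
B = np.ones((n, p))
C = A @ B                      # shape (4, 5); each entry sums n = 3 products

# Batched case: the leading axis is treated as a batch dimension
batch = 8
A_batch = np.ones((batch, m, n))
B_batch = np.ones((batch, n, p))
C_batch = A_batch @ B_batch    # shape (8, 4, 5)

print(C.shape, C_batch.shape)
```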

### Reshaping

Reshaping changes the [tensor shape](/wiki/tensor_shape) without altering the underlying data. For example, a tensor of shape `(2, 3, 4)` can be reshaped to `(6, 4)` or `(2, 12)`, as long as the total number of elements remains the same (24 in this case). Reshaping is commonly used to prepare data for specific network layers.
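
The same example in NumPy, as a sketch; the failing reshape is included only to show that the element count must be preserved.

```python
import numpy as np

x = np.arange(24).reshape(2, 3, 4)   # 24 elements
y = x.reshape(6, 4)                  # same 24 elements, new shape
z = x.reshape(2, -1)                 # -1 lets NumPy infer the remaining size (12)

# A target shape with a different element count is rejected
try:
    x.reshape(5, 5)
    reshape_failed = False
except ValueError:
    reshape_failed = True

print(y.shape, z.shape, reshape_failed)  # (6, 4) (2, 12) True
```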

### Slicing and Indexing

Slicing extracts a regular sub-region of a tensor (a range of indices, optionally with a step) along one or more axes. For example, given an image tensor of shape `(3, 224, 224)`, slicing `tensor[0, :, :]` extracts the first color channel as a 2D matrix. Advanced indexing allows selecting arbitrary, irregularly spaced elements or using boolean masks to filter elements based on conditions.
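
A short NumPy sketch of both styles of indexing follows. A small `(3, 4, 4)` array stands in for the image tensor so the results are easy to inspect; the threshold of 40 is arbitrary.

```python
import numpy as np

# Small stand-in for a (channels, height, width) image, holding the values 0..47
image = np.arange(3 * 4 * 4, dtype=np.float32).reshape(3, 4, 4)

first_channel = image[0, :, :]   # basic slicing: one channel, shape (4, 4)
top_left = image[:, :2, :2]      # basic slicing: 2x2 corner of every channel

mask = image > 40                # boolean mask with the same shape as image
bright = image[mask]             # boolean indexing returns a 1-D array of matches

print(first_channel.shape, top_left.shape, bright.shape)  # (4, 4) (3, 2, 2) (7,)
```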

### Broadcasting

Broadcasting enables arithmetic operations between tensors of different shapes without explicitly copying data. When two tensors have different shapes, the smaller tensor is "broadcast" (virtually expanded) to match the larger tensor's shape. For broadcasting to work, the shapes are compared axis by axis starting from the trailing (rightmost) dimension, and each pair of sizes must either match or one of them must be 1; missing leading dimensions are treated as size 1.[7] For example, adding a bias vector of shape `(n,)` to a batch of vectors of shape `(batch, n)` works because the bias is broadcast along the batch dimension. Broadcasting is essential for memory efficiency, since the data is not physically duplicated.
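
The bias example can be sketched in NumPy as follows. The second half shows a pair of shapes that does not broadcast, because shapes are aligned from the trailing axis; the sizes are arbitrary.

```python
import numpy as np

batch, n = 4, 3
activations = np.zeros((batch, n))
bias = np.array([0.1, 0.2, 0.3])     # shape (3,)

out = activations + bias             # bias is virtually expanded to (4, 3)

# (4, 3) and (4,) are aligned from the right: 3 vs 4 neither match nor equal 1
try:
    activations + np.ones(batch)
    incompatible = False
except ValueError:
    incompatible = True

print(out.shape, incompatible)  # (4, 3) True
```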

### Reduction Operations

Reduction operations collapse one or more dimensions of a tensor by applying an aggregation function. Common reductions include sum, mean, max, and min. For example, computing the mean across the batch dimension of a loss tensor produces a single [scalar](/wiki/scalar) value for backpropagation.
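
A brief NumPy sketch of reductions over all elements and over a single axis; the loss and score values are made up for illustration.

```python
import numpy as np

per_example_loss = np.array([0.5, 1.5, 1.0, 3.0])   # one loss value per batch item
mean_loss = per_example_loss.mean()                 # collapses the batch axis to a scalar

scores = np.array([[1.0, 5.0, 2.0],
                   [4.0, 0.0, 6.0]])                # shape (2, 3)
row_max = scores.max(axis=1)                        # reduce over columns, shape (2,)
col_sum = scores.sum(axis=0)                        # reduce over rows, shape (3,)
row_sum_kept = scores.sum(axis=1, keepdims=True)    # reduced axis kept as size 1

print(mean_loss)           # 1.5
print(row_max, col_sum)    # [5. 6.] [5. 5. 8.]
print(row_sum_kept.shape)  # (2, 1)
```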

### Concatenation and Stacking

Concatenation joins tensors along an existing dimension, while stacking joins them along a new dimension. These operations are used frequently when assembling batches of data or combining outputs from multiple network branches.
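
The difference shows up in the resulting shapes, as in this NumPy sketch with two arbitrary `(2, 3)` arrays.

```python
import numpy as np

a = np.zeros((2, 3))
b = np.ones((2, 3))

joined_rows = np.concatenate([a, b], axis=0)   # existing axis 0 grows: (4, 3)
joined_cols = np.concatenate([a, b], axis=1)   # existing axis 1 grows: (2, 6)
stacked = np.stack([a, b], axis=0)             # new leading axis: (2, 2, 3)

print(joined_rows.shape, joined_cols.shape, stacked.shape)
```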

## What data types do tensors use?

The data type (dtype) of a tensor determines how each element is stored in memory and the precision of computations. Choosing the right data type balances numerical accuracy against memory usage and computational speed.

| Data Type | Bits | Range / Precision | Typical Use |
|-----------|------|-------------------|-------------|
| float32 (FP32) | 32 | ~7 decimal digits, range up to ~3.4 x 10^38 | Default training dtype; high precision |
| float16 (FP16) | 16 | ~3.3 decimal digits, range up to ~65,504 | Mixed-precision training; inference |
| bfloat16 (BF16) | 16 | ~2-3 decimal digits, same exponent range as FP32 | Training on TPUs and newer NVIDIA GPUs |
| float64 (FP64) | 64 | ~15 decimal digits | Scientific computing (rarely used in ML) |
| int8 | 8 | -128 to 127 | Post-training [quantization](/wiki/quantization) |
| int32 | 32 | -2^31 to 2^31 - 1 | Index tensors, token IDs |
| bool | 8 (one byte per element in NumPy, PyTorch, and TensorFlow) | True / False | Masks, conditions |

**float32** is the default data type in most frameworks. It provides good numerical stability for training.

**float16** allocates 5 exponent bits and 10 mantissa bits, uses half the memory, and can double throughput on GPUs with dedicated half-precision hardware (such as NVIDIA Tensor Cores). However, its limited range (a maximum representable value of about 65,504) can cause overflow or underflow during training, which is why it is typically used with a technique called mixed-precision training, where a master copy of weights is kept in float32.[5]
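
The memory saving and the range limits of float16 can be observed with NumPy, as in the sketch below. The test values 70000 and 1e-8 are arbitrary choices that fall outside the float16 range; this illustrates the number format only, not a mixed-precision training loop.

```python
import numpy as np

x32 = np.zeros(1000, dtype=np.float32)
x16 = x32.astype(np.float16)
bytes_fp32, bytes_fp16 = x32.nbytes, x16.nbytes    # 4000 and 2000 bytes

fp16_max = float(np.finfo(np.float16).max)         # largest finite float16 value

# Silence NumPy's floating-point warnings for the deliberate out-of-range casts
with np.errstate(over="ignore", under="ignore"):
    too_big = np.float32(70000.0).astype(np.float16)   # above the maximum: becomes inf
    too_small = np.float32(1e-8).astype(np.float16)    # below the smallest subnormal: becomes 0

print(bytes_fp32, bytes_fp16, fp16_max)  # 4000 2000 65504.0
print(too_big, too_small)                # inf 0.0
```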

**bfloat16** was developed by Google for use on [TPUs](/wiki/tpu).[6] It keeps the same exponent range as float32 (8 exponent bits) but sacrifices significand precision (7 mantissa bits instead of 23). This makes it more numerically stable than float16 for training, because it can represent the same range of magnitudes as float32, roughly 1.18 x 10^-38 to 3.39 x 10^38.[11]

**int8** quantization reduces model size and speeds up inference by representing weights and activations as 8-bit integers. This is widely used for deploying models on edge devices and in production environments.
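
As a rough sketch of the idea, the code below applies symmetric per-tensor quantization with a single scale factor. This particular scheme is an assumption chosen for simplicity (production toolchains offer several variants, such as per-channel scales and zero points), and the weight values are made up.

```python
import numpy as np

weights = np.array([-1.0, -0.5, 0.0, 0.25, 1.0], dtype=np.float32)

# One scale for the whole tensor, mapping the largest magnitude to 127
scale = float(np.abs(weights).max()) / 127.0

# Quantize: divide by the scale, round to the nearest integer, store as int8
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize: multiply back by the scale to get approximate float values
dequantized = q.astype(np.float32) * scale
max_error = float(np.abs(dequantized - weights).max())

print(weights.nbytes, q.nbytes)   # 20 5
```

The int8 copy takes a quarter of the memory of the float32 original, at the cost of a rounding error bounded by roughly half the scale.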

## How is a tensor in machine learning different from a tensor in physics?

The word "tensor" is shared between physics/mathematics and machine learning, but the two meanings are quite different.

| Aspect | Physics / Mathematics | Machine Learning |
|--------|----------------------|------------------|
| Definition | A geometric object that transforms according to specific rules under coordinate changes | A multi-dimensional array of numbers |
| Key property | Obeys coordinate transformation laws (covariance/contravariance) | No transformation rules required |
| Rank meaning | Number of indices, each associated with a vector space or its dual | Number of array dimensions (axes) |
| Example | Stress tensor, electromagnetic field tensor, Riemann curvature tensor | A batch of images stored as a 4D array |
| Framework | Differential geometry, general relativity, continuum mechanics | [PyTorch](/wiki/pytorch), [TensorFlow](/wiki/tensorflow), [NumPy](/wiki/numpy) |

In physics, a rank-2 tensor is not just any matrix; it is a linear map that transforms in a specific way when the basis vectors change.[1] In machine learning, a rank-2 tensor is simply a 2D array of numbers with no transformation rules attached. The ML usage is more informal but is now the dominant meaning in the software engineering and AI communities.
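
The transformation rule is the part that has no counterpart in the array view. As a rough numerical sketch (NumPy is used for convenience, the rotation angle and component values are arbitrary, and the formula follows one common convention for a rotation of an orthonormal basis), the components of a rank-2 tensor change as T' = R T Rᵀ, while coordinate-independent quantities such as the trace and determinant stay the same:

```python
import numpy as np

theta = np.pi / 6                                # rotate the basis by 30 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

T = np.array([[2.0, 1.0],
              [1.0, 3.0]])                       # components in the original basis

T_rotated = R @ T @ R.T                          # components in the rotated basis

components_changed = not np.allclose(T, T_rotated)
trace_preserved = bool(np.isclose(np.trace(T), np.trace(T_rotated)))
det_preserved = bool(np.isclose(np.linalg.det(T), np.linalg.det(T_rotated)))

print(components_changed, trace_preserved, det_preserved)  # True True True
```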

## How do tensors work in major frameworks?

### PyTorch (torch.Tensor)

[PyTorch](/wiki/pytorch) represents all data as `torch.Tensor` objects. Key characteristics include:

- **Dynamic computational graph**: PyTorch builds the computation graph on the fly during the forward pass, which makes debugging straightforward since standard Python debugging tools work normally.[2]
- **Device placement**: Tensors can live on CPU or GPU. Moving a tensor to GPU is done with `tensor.to('cuda')` or `tensor.cuda()`. All tensors participating in an operation must reside on the same device.
- **Autograd integration**: Setting `requires_grad=True` on a tensor tells PyTorch to track all operations on it, building a computational graph for automatic differentiation.
- **In-place operations**: PyTorch supports in-place operations (suffixed with `_`, such as `tensor.add_()`) but these must be used carefully because they can interfere with gradient computation.

### TensorFlow (tf.Tensor)

[TensorFlow](/wiki/tensorflow) uses `tf.Tensor` as its core data structure. Key characteristics include:

- **Eager and graph modes**: TensorFlow 2.x defaults to eager execution (similar to PyTorch), but provides `tf.function` for converting Python functions into optimized computational graphs.
- **Immutability**: Unlike PyTorch tensors, TensorFlow tensors are immutable. To modify values, you must create a new tensor or use `tf.Variable`.
- **Device placement**: TensorFlow automatically places operations on available GPUs when possible. Explicit placement uses `with tf.device('/GPU:0')`.
- **XLA compilation**: TensorFlow integrates with XLA (Accelerated Linear Algebra), a compiler that fuses multiple operations into optimized GPU/TPU kernels.[3]

### NumPy (ndarray)

[NumPy](/wiki/numpy) provides `ndarray`, which is the predecessor and inspiration for both PyTorch and TensorFlow tensors.[7] While NumPy arrays run only on CPU and lack automatic differentiation, they remain widely used for data preprocessing and are interoperable with both frameworks through conversion functions such as `torch.from_numpy()`, which shares memory with the source array, and `tf.constant()` or `tf.convert_to_tensor()`, which typically copy the data.

## What hardware runs tensor operations?

Modern [deep learning](/wiki/deep_model) workloads run on specialized hardware to accelerate tensor operations.

| Device | Description | Strengths |
|--------|-------------|----------|
| CPU | General-purpose processor | Flexible; good for small tensors and data preprocessing |
| GPU (CUDA) | Massively parallel processor with thousands of cores | Excellent for large matrix multiplications and convolutions |
| [TPU](/wiki/tpu) | Google's custom ASIC designed for tensor operations | Optimized for large-scale training with bfloat16 |
| Apple Silicon (MPS) | Apple's GPU framework for M-series chips | Enables local GPU training on Mac hardware |

NVIDIA's Tensor Cores are dedicated hardware units for matrix math. Each Tensor Core performs a 4x4x4 matrix multiply-accumulate operation of the form D = A x B + C per clock cycle, taking FP16 inputs and accumulating in FP32.[12] The Tesla V100, NVIDIA's first Tensor Core GPU, contained 640 Tensor Cores and delivered up to 125 teraFLOPS of deep learning performance, which NVIDIA reported as roughly 12 times the training throughput of the prior Pascal generation.[12]
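
The multiply-accumulate arithmetic can be imitated numerically. The sketch below uses NumPy only to mirror the data types described above (FP16 inputs, FP32 accumulation); it is an emulation of the arithmetic, not of how the hardware schedules the work, and the random inputs are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 4)).astype(np.float16)   # FP16 input
B = rng.random((4, 4)).astype(np.float16)   # FP16 input
C = rng.random((4, 4)).astype(np.float32)   # FP32 accumulator input

# Promote the FP16 inputs so products and sums are accumulated in FP32
D = A.astype(np.float32) @ B.astype(np.float32) + C

print(D.dtype, D.shape)  # float32 (4, 4)
```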

Transferring tensors between devices (for example, from CPU to GPU) involves a data copy across the memory bus, which can become a bottleneck if done too frequently. Best practice is to move data to the target device once and perform all computations there before moving results back.

In PyTorch, a common pattern looks like:

```python
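import torch

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tensor = torch.randn(8, 16).to(device)      # example input batch
model = torch.nn.Linear(16, 4).to(device)   # example model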
```

In TensorFlow, GPU placement is typically automatic, but can be controlled explicitly:

```python
import tensorflow as tf

# Pin the tensor to the first GPU (requires a visible GPU device)
with tf.device('/GPU:0'):
    tensor = tf.constant([1.0, 2.0, 3.0])
```

## How do tensors enable automatic differentiation?

Automatic differentiation (autograd) is the mechanism that enables [gradient descent](/wiki/gradient_descent) optimization in neural networks by computing gradients of a loss function with respect to model parameters. Tensors play a central role in this process.

When a tensor has `requires_grad=True` in PyTorch (or is a `tf.Variable` used inside a `tf.GradientTape` context in TensorFlow), the framework records every operation applied to it in a directed acyclic graph (DAG). In this graph:

- **Leaf nodes** are the input tensors (typically model weights)
- **Interior nodes** represent operations (addition, multiplication, activation functions)
- **Root node** is typically the loss value

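In PyTorch, calling `.backward()` on the loss tensor traverses this graph from root to leaves using the chain rule, computing the gradient of the loss with respect to every leaf tensor. These gradients are stored in each tensor's `.grad` attribute and are then consumed by the optimizer to update the weights.[8] In TensorFlow, the corresponding call is `tape.gradient(loss, variables)`, which returns the gradients rather than storing them on the tensors.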
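
To make the graph traversal concrete, here is a toy reverse-mode differentiation sketch in plain Python. The `Value` class and its methods are hypothetical names invented for this illustration; the code works on scalars only and is not how PyTorch or TensorFlow implement autograd internally.

```python
class Value:
    """Toy scalar graph node (hypothetical; not a framework API)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents           # nodes this value was computed from
        self.local_grads = local_grads   # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Build a topological order so that, when reversed, every node is
        # visited before the nodes it was computed from
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0                  # d(root)/d(root)
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad   # chain rule


w = Value(3.0)       # leaf: a weight
x = Value(2.0)       # leaf: an input
b = Value(1.0)       # leaf: a bias
loss = w * x + b     # root of the graph
loss.backward()

print(loss.data)                # 7.0
print(w.grad, x.grad, b.grad)   # 2.0 3.0 1.0
```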

PyTorch uses a **dynamic computational graph** (define-by-run), meaning the graph is rebuilt on every forward pass. This allows Python control flow (if statements, loops) to affect the graph structure naturally. TensorFlow 2.x also supports dynamic graphs through eager execution, though `tf.function` can capture a static graph for performance optimization.

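To prevent unnecessary gradient tracking during inference, PyTorch provides the `torch.no_grad()` context manager. In TensorFlow, operations are recorded only inside a `tf.GradientTape`, so inference code run outside a tape is not tracked; `tf.stop_gradient()` serves a narrower purpose, blocking gradient flow through a specific tensor within a recorded computation. Disabling gradient tracking reduces memory usage and speeds up computation.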

## What are sparse tensors?

Many real-world datasets contain tensors where the vast majority of elements are zero. Storing all these zeros wastes memory and computation. Sparse tensor representations store only the non-zero values and their coordinates, dramatically reducing memory usage.

Common sparse storage formats include:

- **COO (Coordinate)**: Stores a list of (index, value) pairs. Simple and flexible, used by both PyTorch (`torch.sparse_coo_tensor`) and TensorFlow (`tf.sparse.SparseTensor`).
- **CSR (Compressed Sparse Row)**: Stores row pointers, column indices, and values. More efficient for row-slicing and matrix-vector products.
- **CSC (Compressed Sparse Column)**: Similar to CSR but optimized for column operations.
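
The COO and CSR layouts can be built by hand for a small matrix. The sketch below uses plain NumPy arrays to show what each format stores; the variable names are illustrative, and real code would normally rely on a framework's or library's own sparse types instead.

```python
import numpy as np

dense = np.array([[0, 0, 3, 0],
                  [0, 0, 0, 0],
                  [4, 0, 0, 5]])

# COO: one (row, col) coordinate per non-zero entry, plus the value
rows, cols = np.nonzero(dense)        # returned in row-major order
values = dense[rows, cols]

# CSR: keep cols and values, but compress the row indices into pointers.
# Row i's entries are values[row_ptr[i]:row_ptr[i + 1]].
row_counts = np.bincount(rows, minlength=dense.shape[0])
row_ptr = np.concatenate([[0], np.cumsum(row_counts)])

# Rebuild the dense matrix from the COO triplets as a consistency check
rebuilt = np.zeros_like(dense)
rebuilt[rows, cols] = values

print(rows.tolist(), cols.tolist(), values.tolist())  # [0, 2, 2] [2, 0, 3] [3, 4, 5]
print(row_ptr.tolist())                               # [0, 1, 1, 3]
```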

Sparse tensors are particularly useful for:

- Natural language processing, where word-document matrices or TF-IDF representations are extremely sparse
- [Recommendation systems](/wiki/recommender_system), where user-item interaction matrices have very few non-zero entries
- Graph neural networks, where adjacency matrices are typically sparse

## What are named tensors?

Traditional tensor operations refer to dimensions by numerical index (0, 1, 2, ...), which can lead to subtle bugs when dimensions are reordered or when code becomes complex. Named tensors attach meaningful names to dimensions, making code more readable and less error-prone.

PyTorch introduced experimental named tensor support, allowing dimensions to be labeled:

```python
import torch
images = torch.randn(32, 3, 224, 224, names=('batch', 'channels', 'height', 'width'))
```

With named dimensions, operations can reference axes by name rather than index, which prevents common errors like summing over the wrong dimension. While PyTorch's named tensor API remains experimental, libraries like `einops` and the `einsum` notation provide alternative ways to write dimension-aware tensor operations clearly.[10]
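
As an illustration of the `einsum` style, the sketch below uses NumPy's `np.einsum`, where each axis is given a letter and the output pattern states which axes survive. The sizes and the letter choices (`b`, `c`, `h`, `w`, `q`, `k`, `d`) are arbitrary.

```python
import numpy as np

batch, channels, height, width = 2, 3, 4, 5
images = np.ones((batch, channels, height, width))

# Sum over the spatial axes h and w; b and c are kept because they appear in the output
spatial_sum = np.einsum('bchw->bc', images)          # shape (2, 3)

# Batched dot products between every query row and every key row
queries = np.ones((batch, 6, 8))
keys = np.ones((batch, 7, 8))
scores = np.einsum('bqd,bkd->bqk', queries, keys)    # shape (2, 6, 7)

print(spatial_sum.shape, scores.shape)
```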

## What tensor shapes do common architectures use?

Different [neural network](/wiki/neural_network) architectures expect input tensors in specific shapes. Understanding these conventions is essential for building and debugging models.

| Architecture | Typical Input Shape | Dimension Meanings |
|-------------|--------------------|-----------------|
| Fully connected (MLP) | `(batch, features)` | Batch of flat feature vectors |
| [CNN](/wiki/convolutional_neural_network) (PyTorch) | `(batch, channels, height, width)` | NCHW format |
| CNN (TensorFlow) | `(batch, height, width, channels)` | NHWC format |
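| [RNN](/wiki/recurrent_neural_network) / [LSTM](/wiki/long_short-term_memory_lstm) | `(batch, sequence_length, features)` | Batch of sequences padded to a common length (PyTorch RNN layers expect `(sequence_length, batch, features)` unless `batch_first=True`) |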
| [Transformer](/wiki/transformer) | `(batch, sequence_length, d_model)` | Batch of token embeddings |
| 3D CNN (video) | `(batch, channels, depth, height, width)` | Volumetric or temporal data |

The two dominant memory layout conventions for image data are **NCHW** (batch, channels, height, width) and **NHWC** (batch, height, width, channels). PyTorch defaults to NCHW, while TensorFlow defaults to NHWC. NVIDIA Tensor Cores generally perform best with NHWC, and automatic layout conversions handle this transparently in most cases. Depending on the hardware and operation, the choice of layout can change performance by roughly 10-30%.
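
Converting between the two conventions is an axis permutation. The NumPy sketch below shows the permutation and the difference between a transposed view and a physically reordered copy; the tiny sizes are arbitrary.

```python
import numpy as np

nchw = np.arange(2 * 3 * 4 * 5).reshape(2, 3, 4, 5)   # (N, C, H, W)

nhwc = nchw.transpose(0, 2, 3, 1)    # (N, H, W, C): a view with permuted strides
back = nhwc.transpose(0, 3, 1, 2)    # permute back to (N, C, H, W)

# The transposed view still has the NCHW memory order underneath;
# ascontiguousarray copies the data into true NHWC order.
nhwc_contiguous = np.ascontiguousarray(nhwc)

print(nhwc.shape)                              # (2, 4, 5, 3)
print(nhwc.flags['C_CONTIGUOUS'])              # False
print(nhwc_contiguous.flags['C_CONTIGUOUS'])   # True
```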

## When did tensors become central to machine learning?

The mathematical concept of tensors was formalized in the 19th century by Gregorio Ricci-Curbastro and Tullio Levi-Civita as part of their work on differential geometry. Albert Einstein later used tensor calculus extensively in his general theory of relativity.

The adoption of tensors in machine learning began in the early 2000s, when researchers like M. Alex O. Vasilescu and Demetri Terzopoulos introduced multilinear tensor methods into computer vision.[4] The modern usage of tensors as multi-dimensional arrays became widespread with the development of [NumPy](/wiki/numpy) in 2006, the Theano library in 2007, and eventually [TensorFlow](/wiki/tensorflow) in 2015[3] and [PyTorch](/wiki/pytorch) in 2016.[2]

Specialized software and hardware for tensor computation followed: NVIDIA released the cuDNN library in 2014, Google developed [TPUs](/wiki/tpu) between 2015 and 2017,[6] and NVIDIA introduced Tensor Cores with its Volta GPU architecture, announced on May 10, 2017.[12] These advances enabled training models with billions of parameters.

## Explain Like I'm 5 (ELI5)

Imagine you have different ways to organize your toys. A single toy car sitting on the floor is like a [scalar](/wiki/scalar): just one number. Now line up five toy cars in a row, and that is like a vector: a list of numbers. Arrange the cars in rows and columns on a table, and you have a matrix: a grid of numbers. A tensor is what you get when you stack multiple grids on top of each other, or even organize them in even more complicated patterns. It is basically a container that can hold numbers in any arrangement, no matter how many directions or layers you need.

Computers use tensors to work with all sorts of data. A color photo, for instance, is stored as three grids layered together (one for red, one for green, one for blue). A whole album of photos is a stack of those layers. When a computer learns to recognize cats in photos, it reads these tensors of numbers, does a lot of math on them, and gradually figures out which patterns mean "cat." Tensors are the building blocks that make all of this possible.

## References

1. Kolda, T. G., & Bader, B. W. (2009). "Tensor Decompositions and Applications." *SIAM Review*, 51(3), 455-500.
2. Paszke, A., et al. (2019). "PyTorch: An Imperative Style, High-Performance Deep Learning Library." *Advances in Neural Information Processing Systems*, 32.
3. Abadi, M., et al. (2016). "TensorFlow: A System for Large-Scale Machine Learning." *Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI)*, 265-283.
4. Vasilescu, M. A. O., & Terzopoulos, D. (2002). "Multilinear Analysis of Image Ensembles: TensorFaces." *Proceedings of the European Conference on Computer Vision (ECCV)*, 447-460.
5. Micikevicius, P., et al. (2018). "Mixed Precision Training." *Proceedings of the International Conference on Learning Representations (ICLR)*.
6. Jouppi, N. P., et al. (2017). "In-Datacenter Performance Analysis of a Tensor Processing Unit." *Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA)*, 1-12.
7. Harris, C. R., et al. (2020). "Array Programming with NumPy." *Nature*, 585, 357-362.
8. PyTorch Documentation. "Automatic Differentiation with torch.autograd." https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html
9. TensorFlow Documentation. "Introduction to Tensors." https://www.tensorflow.org/guide/tensor
10. Rush, A. (2019). "Tensor Considered Harmful." Harvard NLP. https://nlp.seas.harvard.edu/NamedTensor
11. Google Cloud. "BFloat16: The secret to high performance on Cloud TPUs." https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus
12. NVIDIA (2017). "NVIDIA Tesla V100 GPU Architecture" (whitepaper WP-08608-001). https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf