# Singular value decomposition

> Source: https://aiwiki.ai/wiki/singular_value_decomposition
> Updated: 2026-06-23
> Categories: Machine Learning, Mathematics
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [principal component analysis](/wiki/principal_component_analysis_pca), [eigenvalue decomposition](/wiki/eigenvalue_decomposition), linear algebra, [matrix factorization](/wiki/matrix_factorization), [pseudoinverse](/wiki/pseudoinverse), [low-rank adaptation](/wiki/lora), [Eckart-Young theorem](/wiki/eckart_young_theorem)*

## Introduction

**Singular value decomposition** (SVD) is a matrix factorization that writes any real or complex m x n matrix **A** as the product **A** = **U** **Sigma** **V**^T, where **U** and **V** are orthogonal matrices (the left and right singular vectors) and **Sigma** is a diagonal matrix of non-negative numbers called the singular values. Geometrically the three factors correspond to a rotation, a non-negative axis-aligned scaling, and a second rotation. Unlike eigendecomposition, the SVD exists for every matrix, including rectangular and rank-deficient ones.[9]

For a real matrix **A** of size m x n, the SVD writes

**A** = **U** **Sigma** **V**^T

where **U** is an m x m orthogonal matrix, **V** is an n x n orthogonal matrix, and **Sigma** is an m x n diagonal matrix whose diagonal entries sigma_1 >= sigma_2 >= ... >= sigma_min(m,n) >= 0 are called the singular values of **A**. The columns of **U** are the left singular vectors, and the columns of **V** are the right singular vectors. The number of strictly positive singular values equals the rank of **A**.[9]

SVD is one of the most important matrix factorizations in all of applied [mathematics](/wiki/mathematics). It underlies [principal component analysis](/wiki/principal_component_analysis_pca), the computation of the [Moore-Penrose pseudoinverse](/wiki/pseudoinverse), the solution of rank-deficient [least squares problems](/wiki/least_squares_regression), [latent semantic analysis](/wiki/latent_semantic_analysis), many [recommender system](/wiki/recommender_system) algorithms, [image compression](/wiki/image_compression), [noise reduction](/wiki/denoising), and the analysis of neural network weight matrices. Gilbert Strang describes the SVD as "absolutely a high point of linear algebra," and numerical analysts routinely call it the most useful decomposition in the toolbox of [numerical linear algebra](/wiki/numerical_linear_algebra).[11][10]

Beyond its computational uses, SVD has deep theoretical significance. The Eckart-Young-Mirsky theorem, first proved by Carl Eckart and Gale Young in 1936, shows that the truncated SVD gives the best low-rank approximation of a matrix with respect to the [Frobenius norm](/wiki/frobenius_norm) and the [spectral norm](/wiki/spectral_norm), providing the mathematical foundation for [dimensionality reduction](/wiki/dimensionality_reduction).[5][6] In modern [deep learning](/wiki/deep_learning), the low-rank structure revealed by SVD motivates techniques such as [LoRA](/wiki/lora) (low-rank adaptation), which reduced the trainable parameters of GPT-3 175B by roughly 10,000 times, and various [model compression](/wiki/model_compression) schemes.[16]

## ELI5 (Explain like I'm 5)

Imagine you have a big table of numbers, like a spreadsheet. Maybe each row is a person and each column is a movie, and each cell says how much that person liked that movie on a scale of one to five. The table has lots of numbers in it and looks complicated.

SVD is a way of splitting that big complicated table into three simpler pieces that, when multiplied back together, give you the original table. The first piece is a list of "people types" (for example, action lovers, romance lovers, comedy lovers). The second piece is a short list of how strong each type is, sorted from most important to least important. The third piece is a list of "movie types" that matches up with the people types.

The magic is that most of the interesting information is usually captured by only a few of the biggest numbers in the middle piece, not all of them. So you can throw away the small numbers and still reconstruct almost the whole table using very little data. That is how SVD helps shrink images, clean up noisy data, and even guess what movie you might like next. It finds the hidden patterns in a pile of numbers and tells you which patterns matter most.

## When was SVD invented?

SVD has a long history that predates its modern applications by almost a century. Its theoretical foundations were laid by several 19th-century mathematicians working on bilinear forms, and its computational aspects were developed in the 20th century as digital computers became available.[18]

### 19th century origins

The SVD was independently discovered multiple times in the late 19th century. Eugenio Beltrami published the first proof for real square matrices in 1873, working in the context of bilinear forms.[1] Camille Jordan arrived at the same result one year later in 1874, in his *Memoire sur les formes bilineaires*.[2] James Joseph Sylvester rediscovered the decomposition in 1889 for real square matrices, calling the singular values "canonical multipliers."[3] Erhard Schmidt generalized the theory to integral operators in 1907, which is the infinite-dimensional analogue.[4] Hermann Weyl extended and refined the theory in the early 20th century, proving perturbation bounds for the singular values that still bear his name.[17]

### The Eckart-Young theorem

A landmark result was proved by Carl Eckart and Gale Young in their 1936 paper "The Approximation of One Matrix by Another of Lower Rank," published in *Psychometrika* (volume 1, issue 3, pages 211-218).[5] They showed that truncating the SVD gives the best low-rank approximation of a matrix in the Frobenius norm. Leon Mirsky later generalized the result to all unitarily invariant norms, including the spectral norm.[6] This theorem is the mathematical foundation of modern [dimensionality reduction](/wiki/dimensionality_reduction) and low-rank approximation.

### Computational era

The SVD remained primarily a theoretical tool until the computational era. The first practical numerical algorithm for computing SVD was developed by Gene Golub and William Kahan in 1965, based on bidiagonalization.[7] Golub and Christian Reinsch published their refined and widely adopted algorithm in a 1970 paper in *Numerische Mathematik*, which became a standard part of the LINPACK and later LAPACK numerical libraries.[8] This made SVD a practical tool in scientific computing and signal processing.

### The machine learning era

Starting in the 1990s and accelerating through the 2000s, SVD became central to many machine learning methods. Scott Deerwester and colleagues introduced latent semantic analysis in 1990, applying SVD to term-document matrices for information retrieval.[12] During the Netflix Prize competition (2006 to 2009), Simon Funk popularized a matrix-factorization approach colloquially called "SVD" (although it is a gradient-descent approximation rather than the exact decomposition) for collaborative filtering.[14] Nathan Halko, Per-Gunnar Martinsson, and Joel Tropp published the randomized SVD algorithm in 2011, extending SVD to datasets too large for classical algorithms.[15] In the 2020s, SVD-inspired methods such as [LoRA](/wiki/lora) brought low-rank ideas to the fine-tuning of [large language models](/wiki/large_language_model).[16]

## What does the SVD theorem state?

This section states the SVD theorem precisely, describes the components **U**, **Sigma**, and **V**, explains the full, thin, and truncated versions, and gives the geometric interpretation.

### Statement of the theorem

**Theorem (Singular value decomposition).** Let **A** be a real m x n matrix. Then there exist orthogonal matrices **U** in R^{m x m} and **V** in R^{n x n}, and an m x n matrix **Sigma** that is zero except on its main diagonal, such that

**A** = **U** **Sigma** **V**^T

The diagonal entries of **Sigma** are non-negative real numbers sigma_1 >= sigma_2 >= ... >= sigma_p >= 0, where p = min(m, n). The number of strictly positive singular values equals the rank of **A**.[9]

For complex matrices, the same theorem holds with **U** and **V** being unitary (complex orthogonal) matrices and **V**^T replaced by the conjugate transpose **V**^*.

### Proof sketch

The standard existence proof starts with the symmetric positive semidefinite matrix **A**^T **A**. Since **A**^T **A** is symmetric, the spectral theorem guarantees it has an orthonormal basis of eigenvectors with real eigenvalues. All eigenvalues are non-negative: if **A**^T **A** **v** = lambda **v** with ||**v**|| = 1, then lambda = lambda ||**v**||^2 = **v**^T (lambda **v**) = **v**^T **A**^T **A** **v** = ||**A** **v**||^2 >= 0. Define the singular values as sigma_k = sqrt(lambda_k), ordered in decreasing order. The right singular vectors are the eigenvectors **v**_k of **A**^T **A**. For the nonzero singular values, define the left singular vectors by **u**_k = **A** **v**_k / sigma_k; one verifies these are orthonormal by direct computation. The left singular vectors for zero singular values are chosen to complete an orthonormal basis of R^m. Assembling these vectors into matrices and verifying the identity gives **A** = **U** **Sigma** **V**^T.[10]

### Components U, Sigma, V

| Component | Size | Role | Key property |
|---|---|---|---|
| **U** | m x m | Left singular vectors (columns) | Orthogonal: **U**^T **U** = **U** **U**^T = **I**_m |
| **Sigma** | m x n | Non-negative diagonal entries (singular values) | Zero off the main diagonal; diagonal entries sorted in decreasing order |
| **V** | n x n | Right singular vectors (columns) | Orthogonal: **V**^T **V** = **V** **V**^T = **I**_n |

The columns of **V** form an orthonormal basis of the input space R^n. The columns of **U** form an orthonormal basis of the output space R^m. The singular values sigma_k describe how much the transformation **A** stretches or compresses along the direction **v**_k. Concretely, **A** **v**_k = sigma_k **u**_k, so the unit vector **v**_k is mapped to sigma_k times the unit vector **u**_k.

The left singular vectors corresponding to nonzero singular values span the column space (range) of **A**. The remaining left singular vectors span the left null space of **A**. Similarly, the right singular vectors corresponding to nonzero singular values span the row space of **A**, and the remaining right singular vectors span the null space (kernel) of **A**. This gives an elegant description of the four fundamental subspaces of a matrix.[11]

### What is the difference between full, thin, and truncated SVD?

In practice, three different forms of SVD are used depending on the application.

| Form | U size | Sigma size | V size | Storage | Use case |
|---|---|---|---|---|---|
| Full SVD | m x m | m x n | n x n | O(m^2 + n^2 + mn) | Theoretical work, four fundamental subspaces |
| Thin (economy) SVD | m x p | p x p (square diagonal) | n x p | O(mp + np + p) | Standard numerical computation; p = min(m, n) |
| Truncated (rank-k) SVD | m x k | k x k | n x k | O(mk + nk + k) | Low-rank approximation, compression, dimensionality reduction |

The **full SVD** includes the entire orthogonal matrices **U** and **V**, with a rectangular **Sigma** padded with zero rows or columns. It is important for theoretical work because it gives orthonormal bases for all four fundamental subspaces, but it is wasteful in storage when the matrix is very rectangular.

The **thin SVD** (also called **economy SVD** or reduced SVD) keeps only the columns of **U** and **V** corresponding to non-zero singular values (or, more precisely, the first p = min(m, n) columns). The resulting **Sigma** is square and contains only the singular values on its diagonal. For a tall matrix (m > n), this means **U** becomes m x n and **Sigma** becomes n x n. The thin SVD is what numerical libraries typically return when full orthogonality of the excess columns is not needed.

The **truncated SVD** keeps only the top k singular values and corresponding singular vectors:

**A**_k = **U**_k **Sigma**_k **V**_k^T = sum_{i=1}^{k} sigma_i **u**_i **v**_i^T

This is a rank-k matrix, and by the Eckart-Young-Mirsky theorem it is the best rank-k approximation of **A** in the Frobenius and spectral norms.[5] Truncated SVD is the workhorse of dimensionality reduction and low-rank approximation.

### Geometric interpretation

The SVD has a striking geometric interpretation. Any linear transformation **A**: R^n -> R^m can be decomposed into three successive steps:

1. A rotation (or reflection) of the input space R^n, implemented by **V**^T.
2. A non-negative scaling along the coordinate axes, implemented by **Sigma**, which may also project onto a subspace or embed into a higher-dimensional space.
3. A rotation (or reflection) of the output space R^m, implemented by **U**.

Geometrically, the unit sphere in R^n is first rotated by **V**^T (still a unit sphere), then each coordinate axis is stretched by the corresponding singular value (turning the sphere into an axis-aligned ellipsoid), then the result is rotated by **U**. The final image of the unit sphere is an ellipsoid in R^m whose principal semi-axes have lengths equal to the singular values of **A**, oriented along the left singular vectors. This gives a visual explanation for many of the facts about SVD: the largest singular value sigma_1 equals the operator 2-norm ||**A**||_2 because it is the longest semi-axis of the ellipsoid, and the ratio sigma_1 / sigma_min equals the aspect ratio of the ellipsoid, which is the [condition number](/wiki/condition_number).[10]

## How does SVD differ from eigendecomposition?

SVD and eigendecomposition are closely related but not identical. Understanding the precise relationship is important because the two are often confused.

### Eigendecomposition

The eigendecomposition of a square matrix **B** is **B** = **P** **D** **P**^{-1}, where **D** is a diagonal matrix of eigenvalues and **P** contains the corresponding eigenvectors as columns. Eigendecomposition exists only for square matrices and only for those that are diagonalizable. When **B** is real symmetric (or complex Hermitian), **P** can be chosen orthogonal (or unitary) and the eigenvalues are real, giving the spectral theorem: **B** = **Q** **Lambda** **Q**^T.

### SVD as eigendecomposition of symmetric matrices

For any real matrix **A**, the matrices **A**^T **A** and **A** **A**^T are symmetric and positive semidefinite, so they admit spectral decompositions. These are directly related to the SVD of **A**:

| Symmetric matrix | Eigendecomposition | Meaning |
|---|---|---|
| **A**^T **A** (n x n) | **V** **Sigma**^T **Sigma** **V**^T | Eigenvectors are right singular vectors; eigenvalues are sigma_i^2 |
| **A** **A**^T (m x m) | **U** **Sigma** **Sigma**^T **U**^T | Eigenvectors are left singular vectors; eigenvalues are sigma_i^2 |

This relationship is the basis of the standard existence proof of SVD: the singular values are the square roots of the [eigenvalues](/wiki/eigenvalue) of **A**^T **A** (or equivalently **A** **A**^T).[9]

### Key differences

| Property | Eigendecomposition | SVD |
|---|---|---|
| Applies to | Square diagonalizable matrices only | Any real or complex rectangular matrix |
| Always exists | No | Yes |
| Bases | May not be orthogonal | Always orthogonal |
| Scalars | Possibly complex eigenvalues | Non-negative real singular values |
| Input and output bases | Same (**P** on both sides) | Different (**U** and **V**) |

For a real symmetric positive semidefinite matrix **S**, the SVD and the spectral eigendecomposition coincide: **S** = **Q** **Lambda** **Q**^T with non-negative eigenvalues, which is exactly the SVD with **U** = **V** = **Q**. For a general square matrix that is diagonalizable but not symmetric, the [eigenvectors](/wiki/eigenvector) need not be orthogonal and the eigenvalues may be negative or complex, so the eigendecomposition and the SVD are genuinely different factorizations.

## Properties

Several fundamental properties follow directly from the definition of SVD.

### Uniqueness

The singular values of **A** are uniquely determined. They are an intrinsic property of the matrix. The singular vectors, on the other hand, are only determined up to certain sign and rotation freedoms.

- If all positive singular values are distinct, then each corresponding pair (**u**_k, **v**_k) is unique up to simultaneous sign flip: replacing **u**_k with -**u**_k and **v**_k with -**v**_k leaves the product **u**_k sigma_k **v**_k^T unchanged.
- If a singular value has multiplicity r > 1, then the corresponding left and right singular vectors are only determined up to an orthogonal rotation within the r-dimensional subspace they span.
- Zero singular values do not uniquely determine singular vectors at all: any orthonormal basis of the kernel (for **V**) or left kernel (for **U**) will do.

The sign ambiguity in particular has practical implications. Different software libraries may return SVDs with singular vectors that differ in sign, even though both are mathematically valid. Applications that depend on a specific sign convention (for example, aligning principal components across datasets) must handle this explicitly.

### Orthogonality and ordering

By convention, singular values are listed in decreasing order sigma_1 >= sigma_2 >= ... >= 0. This convention matters because the truncated SVD (which drops the smallest singular values) only gives the best low-rank approximation when the singular values are ordered.[5] Numerical libraries consistently enforce this ordering.

The singular vectors corresponding to distinct singular values are orthogonal. The columns of **U** and **V** are orthonormal by construction.

### Norms and rank

The SVD gives direct expressions for several fundamental matrix quantities.

| Quantity | Formula in terms of singular values |
|---|---|
| Rank | Number of nonzero singular values |
| [Frobenius norm](/wiki/frobenius_norm) ||**A**||_F | sqrt(sum_i sigma_i^2) |
| [Spectral norm](/wiki/spectral_norm) ||**A**||_2 | sigma_1 (largest singular value) |
| Nuclear norm ||**A**||_* | sum_i sigma_i |
| [Condition number](/wiki/condition_number) kappa_2(**A**) | sigma_1 / sigma_min (for full rank A) |
| Determinant (square A) | product_i sigma_i times sign from orthogonal matrices |

The spectral norm equals the largest singular value because it measures the maximum stretch the matrix applies to any unit vector, which the geometric interpretation of SVD makes transparent. The Frobenius norm is related to the squared singular values because ||**A**||_F^2 = trace(**A**^T **A**) = sum of eigenvalues of **A**^T **A** = sum of squared singular values.[9]

### Invariance under orthogonal transformations

The singular values of **A** are invariant under left or right multiplication by orthogonal matrices. If **Q** and **R** are orthogonal, then **Q** **A** **R**^T has the same singular values as **A**, though generally different singular vectors. This invariance is the reason unitarily invariant norms (such as the Frobenius and spectral norms) can be expressed entirely in terms of singular values.[6]

## How is the SVD computed?

A wide variety of algorithms exist for computing the SVD, trading off precision, speed, and suitability for particular matrix structures.

### Golub-Reinsch algorithm

The classical algorithm for computing a dense SVD is the Golub-Reinsch algorithm, published in 1970.[8] It proceeds in two stages:

1. **Bidiagonalization.** The matrix **A** is reduced to upper bidiagonal form using a sequence of Householder reflections applied from the left and right. After this step, **A** = **U**_1 **B** **V**_1^T, where **B** is bidiagonal and **U**_1, **V**_1 are orthogonal. This step dominates the cost for large matrices.
2. **Diagonalization by implicit QR.** The bidiagonal matrix **B** is then diagonalized using an implicitly shifted QR iteration that preserves the bidiagonal structure. Each iteration chases the bulge caused by the shift down the diagonal, gradually driving the off-diagonal entries to zero.

The algorithm is backward stable and is the basis of the SVD routines in LAPACK, which are called by numerical libraries such as NumPy, SciPy, MATLAB, Julia, and R.[19]

### Jacobi SVD

The one-sided Jacobi SVD algorithm applies a sequence of plane rotations to orthogonalize the columns of **A**. It is slower than Golub-Reinsch for large matrices but can achieve higher relative accuracy, particularly for the small singular values of matrices with a wide range of singular value magnitudes. It is sometimes used when high accuracy is needed for all singular values.

### Divide and conquer

The divide-and-conquer method for bidiagonal SVD recursively splits the bidiagonal matrix into smaller pieces, solves the smaller problems, and combines the results via a secular equation. It has better computational complexity than the QR-based approach for large matrices and is implemented in LAPACK as `gesdd`.[19]

### Randomized SVD

For very large matrices, and particularly when only a low-rank approximation is needed, classical dense SVD is too expensive. The [randomized SVD](/wiki/randomized_svd) algorithm by Halko, Martinsson, and Tropp (2011) produces a near-optimal rank-k approximation by sampling random linear combinations of the columns of **A** and then doing a small dense SVD on the reduced matrix.[15]

The basic idea is:

1. Draw a random matrix **Omega** of size n x (k + p), where p is a small oversampling parameter (typically p = 5 or p = 10).
2. Compute **Y** = **A** **Omega**. The columns of **Y** approximately span the range of **A**.
3. Orthonormalize **Y** via QR decomposition: **Y** = **Q** **R**.
4. Form **B** = **Q**^T **A**, a small (k + p) x n matrix.
5. Compute the SVD of **B** = **U**_B **Sigma**_B **V**_B^T directly using a classical dense algorithm.
6. The approximate SVD of **A** is **A** approx (**Q** **U**_B) **Sigma**_B **V**_B^T.

For increased accuracy, one or two subspace iterations may be added. The randomized algorithm is fast, parallelizes well, needs only a few passes over the matrix (which matters enormously for out-of-core or streaming computations), and has rigorous error bounds.[15] It has become the standard approach for SVD of very large or very sparse matrices, and it is used under the hood in large-scale PCA implementations such as scikit-learn's `TruncatedSVD`.

### Complexity summary

| Algorithm | Complexity | Notes |
|---|---|---|
| Golub-Reinsch (QR iteration) | O(min(m n^2, m^2 n)) flops; roughly 17 n^3 for square n x n | Standard dense SVD; backward stable |
| Divide and conquer (`gesdd`) | Similar leading order; smaller constant, roughly 9 n^3 for square n x n | Faster than QR iteration in practice for large matrices |
| One-sided Jacobi | O(n^3) with larger constant | Higher relative accuracy for small singular values |
| Lanczos / ARPACK iteration | O(m n k) for k leading singular triplets | Good for sparse matrices and few dominant components |
| Randomized SVD | O(m n k) for dense; O(m n log k) with structured random matrices | Approximate; near-optimal for low-rank structure |

For a general dense m x n matrix the overall cost of a full SVD is O(m n^2) if m >= n, and O(m^2 n) if n > m; these bounds reflect the cost of the bidiagonalization step, which dominates.[9]

## Why is truncated SVD the best low-rank approximation?

The truncated SVD is not just *a* low-rank approximation of **A**; it is provably the *best* low-rank approximation in a strong sense.

### Statement

**Theorem (Eckart-Young-Mirsky).** Let **A** be an m x n matrix with SVD **A** = **U** **Sigma** **V**^T, and let **A**_k = sum_{i=1}^{k} sigma_i **u**_i **v**_i^T be its truncation to rank k. Then for any matrix **B** of rank at most k,

||**A** - **A**_k||_F <= ||**A** - **B**||_F

and

||**A** - **A**_k||_2 <= ||**A** - **B**||_2

Moreover, the errors are given exactly by the discarded singular values:

||**A** - **A**_k||_F^2 = sigma_{k+1}^2 + sigma_{k+2}^2 + ... + sigma_p^2

||**A** - **A**_k||_2 = sigma_{k+1}

Mirsky (1960) extended the statement to any unitarily invariant norm.[6] The result is sometimes called the Schmidt-Eckart-Young-Mirsky theorem, acknowledging Erhard Schmidt's earlier work in the infinite-dimensional setting.[4]

### Proof sketch (Frobenius case)

Let **B** be any rank-k matrix. Write the SVD of **A** as **A** = **U** **Sigma** **V**^T and consider the transformed error **U**^T (**A** - **B**) **V** = **Sigma** - **U**^T **B** **V**. The Frobenius norm is unitarily invariant, so ||**A** - **B**||_F = ||**Sigma** - **U**^T **B** **V**||_F. The matrix **U**^T **B** **V** has rank at most k. The problem therefore reduces to: among all rank-k matrices **C**, which minimizes ||**Sigma** - **C**||_F when **Sigma** is diagonal with decreasing entries? Using the interlacing inequalities of Weyl for singular values of sums of matrices (or a direct argument), one shows the minimum is achieved by **C** = diag(sigma_1, ..., sigma_k, 0, ..., 0), yielding the error sum_{i > k} sigma_i^2.[17] This corresponds precisely to the truncation **A**_k.

### Interpretation and consequences

The Eckart-Young theorem provides the mathematical justification for almost every use of SVD as a dimensionality reduction tool.[5] It says that if you want to compress a matrix to rank k, throwing away all but the top k singular values and their vectors is the best you can do with respect to the Frobenius or spectral norm. The quality of the approximation is determined by how fast the singular values decay: if they decay rapidly, a low rank k suffices for high accuracy; if they decay slowly, any low-rank approximation will be poor. Many real-world matrices (natural images, text corpora, user-item rating matrices) have rapidly decaying singular values, which is why low-rank methods work so well in practice.

## Pseudoinverse and least squares

SVD provides the most stable and general way to compute the Moore-Penrose pseudoinverse of a matrix and to solve least squares problems.

### Moore-Penrose pseudoinverse via SVD

The [Moore-Penrose pseudoinverse](/wiki/pseudoinverse) of a matrix **A** with SVD **A** = **U** **Sigma** **V**^T is

**A**^+ = **V** **Sigma**^+ **U**^T

where **Sigma**^+ is obtained from **Sigma** by transposing it and then replacing each nonzero singular value sigma_i by its reciprocal 1/sigma_i (zero singular values remain zero). Equivalently,

**A**^+ = sum_{sigma_i > 0} (1 / sigma_i) **v**_i **u**_i^T

This formula always makes sense, even for rank-deficient or non-square matrices. The pseudoinverse satisfies the four Moore-Penrose conditions and reduces to the ordinary inverse when **A** is square and invertible.[7]

### Least squares solution

For an overdetermined linear system **A** **x** = **b** with **A** of size m x n and m > n, the minimum-norm least squares solution is

**x**^* = **A**^+ **b** = **V** **Sigma**^+ **U**^T **b**

This **x**^* minimizes ||**A** **x** - **b**||_2 over all **x**, and among all minimizers it has the smallest Euclidean norm. When **A** is full column rank, the solution is unique and equals (**A**^T **A**)^{-1} **A**^T **b**, but for rank-deficient **A** the normal equations are singular while the SVD-based formula still works.[8]

Expanded in terms of singular values, the solution can be written as

**x**^* = sum_{sigma_i > 0} ((**u**_i^T **b**) / sigma_i) **v**_i

This expansion shows why small singular values cause numerical trouble: dividing by a very small sigma_i amplifies noise in the corresponding projection **u**_i^T **b**. The condition number of the least squares problem is essentially sigma_1 / sigma_min. When the matrix is ill-conditioned, it is common to drop all singular values below a threshold (the truncated SVD regularization) or apply Tikhonov regularization, which replaces 1/sigma_i with sigma_i / (sigma_i^2 + lambda^2) for a damping parameter lambda.[10]

### Total least squares

The total least squares problem, which allows errors in both **A** and **b**, also has a clean SVD-based solution. Augmenting **A** with **b** to form [**A** | **b**] and computing the SVD, the solution comes from the right singular vector corresponding to the smallest singular value. This is known as the Golub-Van Loan solution to total least squares.[9]

## What is SVD used for in machine learning?

The applications of SVD span nearly every quantitative discipline. This section highlights some of the most important, with an emphasis on machine learning and data analysis.

### Principal component analysis

[Principal component analysis](/wiki/principal_component_analysis_pca) (PCA) is essentially SVD applied to a centered data matrix. Given an n x p data matrix **X** where rows are observations and columns are features, PCA first centers **X** by subtracting the column means and then computes the SVD of the centered matrix **X**_c = **U** **Sigma** **V**^T. The right singular vectors (columns of **V**) are the principal component directions, the scores **U** **Sigma** give the coordinates of the data in the new basis, and the squared singular values divided by (n - 1) give the variances captured by each component. Computing PCA via SVD of **X**_c is numerically preferable to forming and diagonalizing the covariance matrix **X**_c^T **X**_c / (n - 1), because squaring the matrix doubles the condition number.[10] Almost all modern PCA implementations (scikit-learn, MATLAB's `pca`, R's `prcomp`) use SVD under the hood.

### Latent semantic analysis

[Latent semantic analysis](/wiki/latent_semantic_analysis) (LSA), also called latent semantic indexing, was introduced by Deerwester et al. in their 1990 paper "Indexing by Latent Semantic Analysis," published in the *Journal of the American Society for Information Science*.[12] LSA applies truncated SVD to a term-document matrix (often weighted by TF-IDF or log-entropy) where rows are terms and columns are documents. The truncation to a small number of components (typically 100 to 300) projects terms and documents into a shared low-dimensional "semantic" space, where terms with similar meanings and documents about similar topics end up close together. LSA was one of the first successful methods for discovering latent topics in text and for retrieving documents by concept rather than exact keyword match. It is a direct ancestor of modern topic models (such as LDA) and [word embedding](/wiki/word_embedding) methods.

### Recommender systems

In [recommender systems](/wiki/recommender_system), users and items can be arranged in a large sparse rating matrix whose entries represent observed ratings or interactions. Low-rank matrix factorization methods (often loosely called "SVD" in the recommender literature) factor this matrix into a user-factor matrix and an item-factor matrix, treating unobserved entries as unknowns to be predicted.[13] The canonical example is the Netflix Prize competition (2006 to 2009), where matrix factorization became one of the core techniques.[14] Simon Funk popularized a stochastic gradient descent method that trains low-rank factors to minimize squared error on observed ratings. Strictly speaking, this is not the classical SVD because it handles missing entries, but the underlying intuition comes directly from the low-rank structure revealed by SVD.

The winning team BellKor's Pragmatic Chaos used an ensemble that included matrix factorization as a central component.[14] Yehuda Koren and collaborators published influential papers on the approach, including "Matrix Factorization Techniques for Recommender Systems" in *IEEE Computer* (2009).[13]

### Image compression

An image can be represented as a matrix of pixel intensities (or three matrices for color channels). Computing the SVD and keeping only the top k singular triplets reconstructs an approximation **A**_k = **U**_k **Sigma**_k **V**_k^T whose storage requirement is (m + n + 1) * k instead of m * n. For a 1000 x 1000 image, keeping 50 singular values requires storing 100050 numbers instead of 1000000, a tenfold reduction. Because natural images tend to have rapidly decaying singular value spectra, the reconstructed image is often visually indistinguishable from the original even at moderate truncation. While SVD-based compression is not competitive with modern codecs such as JPEG or WebP in terms of perceptual quality per bit, it remains a standard teaching example that illustrates low-rank approximation vividly.

### Noise reduction

When a matrix of signal measurements is corrupted by independent additive noise, the true signal often lies in a low-dimensional subspace while the noise is spread across all dimensions. Truncating the SVD to the top few singular values keeps the signal subspace and discards most of the noise. This idea underlies a wide range of denoising techniques in signal processing, image processing, and bioinformatics, including subspace methods for direction-of-arrival estimation, spectral methods for microarray data, and low-rank plus sparse decompositions (robust PCA) for video background subtraction.

### Numerical linear algebra

SVD is a workhorse of [numerical linear algebra](/wiki/numerical_linear_algebra). It provides the most reliable tool for determining the numerical rank of a matrix (counting how many singular values are above a noise threshold), for solving ill-conditioned or rank-deficient linear systems, for computing the null space and range of a matrix, and for estimating condition numbers.[10] Matrix equation solvers, orthogonal Procrustes problems (finding the best rotation that aligns two sets of points), and many constrained optimization problems rely on the SVD.

### Other domains

| Domain | SVD-based application |
|---|---|
| Control theory | Balanced model reduction, system identification, H-infinity optimization |
| Signal processing | MUSIC and ESPRIT for direction-of-arrival, subspace tracking |
| Computer vision | Essential matrix decomposition, structure from motion, homography estimation |
| Genomics | Gene expression analysis, population structure inference |
| Chemometrics | Multivariate calibration, partial least squares |
| Information retrieval | LSI, document clustering, cross-language retrieval |
| Network science | Spectral clustering, community detection, link prediction |

## How is SVD used in deep learning?

SVD has become newly relevant in the era of large [neural networks](/wiki/neural_network). The weight matrices of modern deep models contain billions of parameters, and low-rank structure offers a path to understanding, compressing, and fine-tuning them efficiently.

### Model compression

One direct application is [model compression](/wiki/model_compression) by low-rank weight decomposition. A dense weight matrix **W** in R^{d x k} can be replaced by a product **W** approx **B** **A** where **B** is in R^{d x r} and **A** is in R^{r x k}, with r much smaller than min(d, k). When the original **W** has fast-decaying singular values, the approximation is accurate and the parameter count drops from d*k to (d + k)*r. This method has been used to compress fully connected layers, convolutional kernels, and self-attention projections, often with modest accuracy loss. Structured SVD variants (Tucker decomposition, CP decomposition for tensors) extend the idea to convolutional filters with multiple dimensions.

### LoRA: low-rank adaptation

[LoRA](/wiki/lora) (low-rank adaptation of large language models), introduced by Hu et al. in 2021, leverages low-rank structure to fine-tune massive [language models](/wiki/large_language_model) efficiently.[16] The authors state their central hypothesis directly: "We hypothesize that the change in weights during model adaptation also has a low 'intrinsic rank', leading to our proposed Low-Rank Adaptation (LoRA) approach."[16] Instead of updating the full pretrained weight matrix **W**_0 in R^{d x k}, LoRA keeps **W**_0 frozen and adds a trainable low-rank update **Delta** **W** = **B** **A**, where **B** is in R^{d x r} and **A** is in R^{r x k} with r much smaller than min(d, k). Only **A** and **B** are trained.

The practical payoff is large. Compared to full fine-tuning of GPT-3 175B with Adam, LoRA reduces the number of trainable parameters by roughly 10,000 times and the GPU memory requirement by roughly 3 times, while matching or exceeding the quality of full fine-tuning on several benchmarks.[16] Because the frozen base model is shared, multiple LoRA adapters can be stored and swapped cheaply, which has made LoRA a default technique for serving many specialized variants of one base model.

LoRA is not literally an SVD of the update, but it is directly inspired by the low-rank structure that SVD reveals, and several follow-up methods (such as AdaLoRA and SVD-LoRA) explicitly use SVD to initialize, allocate, or adapt the rank budget across layers. Similar ideas appear in QLoRA, DoRA, and other parameter-efficient fine-tuning techniques.

### Weight matrix analysis

Researchers have used SVD to probe the structure of trained neural networks. Studying the spectra of weight matrices reveals patterns that relate to generalization, training dynamics, and task information. Random matrix theory analysis of deep networks has shown that before training the singular value spectra match predictions of random matrix models (such as the Marchenko-Pastur distribution), while after training the largest singular values deviate and appear to encode task-specific information.[22] More recent work has also found that the smallest singular values deviate from random predictions, suggesting that the entire singular value spectrum carries meaningful signal and that compression methods should not blindly discard the smallest singular values without considering the task.

### Spectral methods and implicit bias

Gradient descent on linear and deep networks exhibits an implicit bias toward low-rank and low-spectral-norm solutions in certain regimes. Spectral properties of weight matrices (such as sigma_1 and the ratio sigma_1 / sigma_min) are used in generalization bounds, stability analysis, and Lipschitz-constant estimation. Spectral normalization, a technique used in generative adversarial networks, constrains the spectral norm (largest singular value) of each weight matrix to be at most 1 and requires computing or approximating sigma_1 during training, typically via power iteration.[20]

### Neural tangent kernels

In the theoretical analysis of wide neural networks, the neural tangent kernel (NTK) describes the linearized training dynamics and its spectral decomposition governs convergence rates and generalization.[21] SVD (or eigendecomposition) of the NTK Gram matrix provides the principal modes of learning: the network converges fastest along directions corresponding to the largest eigenvalues, a phenomenon known as spectral bias. This gives a quantitative explanation for why deep networks tend to fit low-frequency components of functions first.

## Numerical considerations

Real-world SVD computations require care with respect to numerical stability, scaling, and accuracy targets.

### Condition number

The [condition number](/wiki/condition_number) of **A** with respect to the 2-norm is kappa_2(**A**) = sigma_1 / sigma_min. A matrix is called well-conditioned if this ratio is small (close to 1) and ill-conditioned if it is large. The condition number quantifies how much small perturbations in the input of a linear system **A** **x** = **b** can be amplified in the solution. Roughly speaking, if kappa_2(**A**) approx 10^k and the input has relative error at the machine precision epsilon, the computed solution can have relative error up to 10^k * epsilon. SVD is the gold standard for computing kappa_2 because no other method gives such a direct handle on sigma_min.[10]

### Rank determination

In floating-point arithmetic, a theoretically zero singular value typically computes as a very small number. Determining the numerical rank of a matrix amounts to choosing a threshold below which singular values are treated as zero. A common choice is sigma_1 * max(m, n) * epsilon, where epsilon is the machine precision. The SVD's ability to provide a stable rank-revealing factorization is one of its most practically valuable features.

### Backward stability

The Golub-Reinsch and divide-and-conquer SVD algorithms are backward stable: the computed singular values and vectors are exact for a matrix that is close to the input matrix (within relative error on the order of epsilon).[8] This means the accuracy of the large singular values is limited only by machine precision. The relative accuracy of very small singular values may be worse, which is where Jacobi SVD can offer advantages.

### Storage and access patterns

For dense matrices that fit in memory, Golub-Reinsch or divide-and-conquer are the methods of choice. For sparse matrices, Lanczos-based iterative methods are preferred because they access the matrix only through matrix-vector products and never form dense intermediate results. For very large matrices that do not fit in memory or that are accessed through expensive communication, randomized SVD minimizes the number of passes over the data, often getting a rank-k approximation from just two or three passes.[15]

### Regularization

When using SVD to solve ill-conditioned least squares problems, it is common to regularize by discarding or damping small singular values. Two standard approaches are truncated SVD (setting 1/sigma_i to zero for sigma_i below a threshold) and Tikhonov regularization (replacing 1/sigma_i with sigma_i / (sigma_i^2 + lambda^2)). Both have a natural interpretation in the SVD basis: they filter the contributions of noisy directions to the solution.

## Software implementations

| Library / Tool | Language | Function | Notes |
|---|---|---|---|
| NumPy | Python | `numpy.linalg.svd` | Full or thin SVD via LAPACK `gesdd` |
| SciPy | Python | `scipy.linalg.svd`, `scipy.sparse.linalg.svds` | Dense and sparse (ARPACK) SVD |
| [scikit-learn](/wiki/scikit_learn) | Python | `sklearn.decomposition.TruncatedSVD`, `PCA` | Randomized and classical variants |
| [PyTorch](/wiki/pytorch) | Python | `torch.linalg.svd`, `torch.svd_lowrank` | GPU-accelerated, differentiable |
| [TensorFlow](/wiki/tensorflow) | Python | `tf.linalg.svd` | GPU-accelerated |
| MATLAB | MATLAB | `svd`, `svds` | Dense and sparse |
| R | [R](/wiki/r_programming_language) | `svd`, `irlba` (for truncated) | Built-in and package-based |
| LAPACK | Fortran / C | `gesdd`, `gesvd`, `gejsv` | Low-level reference implementation |
| Julia | Julia | `svd`, `svdvals` | Thin or full via LAPACK |
| Apache Spark MLlib | Scala / Python | `RowMatrix.computeSVD` | Distributed for big data |

For very large or sparse matrices, specialized packages such as ARPACK, PROPACK, SLEPc, and fbpca (randomized SVD in Python) are widely used. GPU-accelerated SVDs are available in cuSOLVER (NVIDIA), MAGMA, and as built-ins in PyTorch and TensorFlow.

## See also

- [Principal component analysis](/wiki/principal_component_analysis_pca)
- [Eigenvalue decomposition](/wiki/eigenvalue_decomposition)
- [Matrix factorization](/wiki/matrix_factorization)
- [Pseudoinverse](/wiki/pseudoinverse)
- [LoRA](/wiki/lora)
- [Latent semantic analysis](/wiki/latent_semantic_analysis)
- [Recommender system](/wiki/recommender_system)
- [Randomized SVD](/wiki/randomized_svd)
- [Dimensionality reduction](/wiki/dimensionality_reduction)
- Linear algebra
- [Numerical linear algebra](/wiki/numerical_linear_algebra)
- [Condition number](/wiki/condition_number)
- [Frobenius norm](/wiki/frobenius_norm)
- [Spectral norm](/wiki/spectral_norm)
- [Eckart-Young theorem](/wiki/eckart_young_theorem)
- [Autoencoder](/wiki/autoencoder)

## References

1. Beltrami, E. (1873). "Sulle funzioni bilineari." *Giornale di Matematiche*, 11, 98-106.

2. Jordan, C. (1874). "Memoire sur les formes bilineaires." *Journal de Mathematiques Pures et Appliquees*, 19, 35-54.

3. Sylvester, J.J. (1889). "A new proof that a general quadric may be reduced to its canonical form (that is, a linear function of squares) by means of a real orthogonal substitution." *Messenger of Mathematics*, 19, 1-5.

4. Schmidt, E. (1907). "Zur Theorie der linearen und nichtlinearen Integralgleichungen. I. Teil: Entwicklung willkurlicher Funktionen nach Systemen vorgeschriebener." *Mathematische Annalen*, 63(4), 433-476.

5. Eckart, C. and Young, G. (1936). "The Approximation of One Matrix by Another of Lower Rank." *Psychometrika*, 1(3), 211-218.

6. Mirsky, L. (1960). "Symmetric gauge functions and unitarily invariant norms." *Quarterly Journal of Mathematics*, 11(1), 50-59.

7. Golub, G.H. and Kahan, W. (1965). "Calculating the Singular Values and Pseudo-Inverse of a Matrix." *Journal of the Society for Industrial and Applied Mathematics, Series B: Numerical Analysis*, 2(2), 205-224.

8. Golub, G.H. and Reinsch, C. (1970). "Singular value decomposition and least squares solutions." *Numerische Mathematik*, 14(5), 403-420.

9. Golub, G.H. and Van Loan, C.F. (2013). *Matrix Computations*, 4th edition. Johns Hopkins University Press.

10. Trefethen, L.N. and Bau, D. III (1997). *Numerical Linear Algebra*. SIAM.

11. Strang, G. (2016). *Introduction to Linear Algebra*, 5th edition. Wellesley-Cambridge Press.

12. Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., and Harshman, R. (1990). "Indexing by Latent Semantic Analysis." *Journal of the American Society for Information Science*, 41(6), 391-407.

13. Koren, Y., Bell, R., and Volinsky, C. (2009). "Matrix Factorization Techniques for Recommender Systems." *IEEE Computer*, 42(8), 30-37.

14. Bell, R.M. and Koren, Y. (2007). "Lessons from the Netflix Prize Challenge." *SIGKDD Explorations*, 9(2), 75-79.

15. Halko, N., Martinsson, P.G., and Tropp, J.A. (2011). "Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions." *SIAM Review*, 53(2), 217-288.

16. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." *arXiv preprint arXiv:2106.09685*.

17. Weyl, H. (1912). "Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen." *Mathematische Annalen*, 71(4), 441-479.

18. Stewart, G.W. (1993). "On the Early History of the Singular Value Decomposition." *SIAM Review*, 35(4), 551-566.

19. Anderson, E. et al. (1999). *LAPACK Users' Guide*, 3rd edition. SIAM.

20. Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. (2018). "Spectral Normalization for Generative Adversarial Networks." *International Conference on Learning Representations (ICLR)*.

21. Jacot, A., Gabriel, F., and Hongler, C. (2018). "Neural Tangent Kernel: Convergence and Generalization in Neural Networks." *Advances in Neural Information Processing Systems (NeurIPS)*.

22. Martin, C.H. and Mahoney, M.W. (2021). "Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning." *Journal of Machine Learning Research*, 22(165), 1-73.

