See also: principal component analysis, eigenvalue decomposition, linear algebra, matrix factorization, pseudoinverse, low-rank adaptation, Eckart-Young theorem
Singular value decomposition (SVD) is a factorization of a real or complex matrix that expresses any rectangular matrix as a product of three structured matrices: a rotation (or reflection), a non-negative diagonal scaling, and another rotation. For any real matrix A of size m x n, the SVD writes
A = U Sigma V^T
where U is an m x m orthogonal matrix, V is an n x n orthogonal matrix, and Sigma is an m x n diagonal matrix whose diagonal entries sigma_1 >= sigma_2 >= ... >= sigma_min(m,n) >= 0 are called the singular values of A. The columns of U are the left singular vectors, and the columns of V are the right singular vectors.
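The decomposition is available in every mainstream numerical library. A minimal sketch in NumPy (the matrix here is arbitrary illustrative data) shows the shapes and the reconstruction identity:

```python
import numpy as np

# An arbitrary 4 x 3 matrix used purely for illustration.
A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0],
              [2.0, 0.0, 4.0],
              [1.0, 1.0, 0.0]])

# Full SVD: U is 4 x 4, s holds the singular values, Vt is 3 x 3.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Embed the singular values in a rectangular 4 x 3 Sigma and reconstruct A.
Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, s)
assert np.allclose(A, U @ Sigma @ Vt)

# Singular values come back sorted in decreasing order.
assert np.all(np.diff(s) <= 0)
```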
SVD is one of the most important matrix factorizations in all of applied mathematics. It underlies principal component analysis, the computation of the Moore-Penrose pseudoinverse, the solution of rank-deficient least squares problems, latent semantic analysis, many recommender system algorithms, image compression, noise reduction, and the analysis of neural network weight matrices. Gilbert Strang has called SVD "the climax of linear algebra," and numerical analysts routinely describe it as the most useful decomposition in the toolbox of numerical linear algebra.
Beyond its computational uses, SVD has deep theoretical significance. The Eckart-Young-Mirsky theorem shows that truncated SVD gives the best low-rank approximation of a matrix with respect to the Frobenius norm and the spectral norm, providing the mathematical foundation for dimensionality reduction. In modern deep learning, low-rank structure revealed by SVD motivates techniques such as LoRA (low-rank adaptation) and various model compression schemes.
Imagine you have a big table of numbers, like a spreadsheet. Maybe each row is a person and each column is a movie, and each cell says how much that person liked that movie on a scale of one to five. The table has lots of numbers in it and looks complicated.
SVD is a way of splitting that big complicated table into three simpler pieces that, when multiplied back together, give you the original table. The first piece is a list of "people types" (for example, action lovers, romance lovers, comedy lovers). The second piece is a short list of how strong each type is, sorted from most important to least important. The third piece is a list of "movie types" that matches up with the people types.
The magic is that most of the interesting information is usually captured by only a few of the biggest numbers in the middle piece, not all of them. So you can throw away the small numbers and still reconstruct almost the whole table using very little data. That is how SVD helps shrink images, clean up noisy data, and even guess what movie you might like next. It finds the hidden patterns in a pile of numbers and tells you which patterns matter most.
SVD has a long history that predates its modern applications by almost a century. Its theoretical foundations were laid by several 19th-century mathematicians working on bilinear forms, and its computational aspects were developed in the 20th century as digital computers became available.
The SVD was independently discovered multiple times in the late 19th century. Eugenio Beltrami published the first proof for real square matrices in 1873, working in the context of bilinear forms. Camille Jordan arrived at the same result one year later in 1874, in his Memoire sur les formes bilineaires. James Joseph Sylvester rediscovered the decomposition in 1889 for real square matrices, calling the singular values "canonical multipliers." Erhard Schmidt generalized the theory to integral operators in 1907, which is the infinite-dimensional analogue. Hermann Weyl extended and refined the theory in the early 20th century, proving perturbation bounds for the singular values that still bear his name.
A landmark result was proved by Carl Eckart and Gale Young in their 1936 paper "The Approximation of One Matrix by Another of Lower Rank," published in Psychometrika. They showed that truncating the SVD gives the best low-rank approximation of a matrix in the Frobenius norm. Leon Mirsky later generalized the result to all unitarily invariant norms, including the spectral norm. This theorem is the mathematical foundation of modern dimensionality reduction and low-rank approximation.
The SVD remained primarily a theoretical tool until the computational era. The first practical numerical algorithm for computing SVD was developed by Gene Golub and William Kahan in 1965, based on bidiagonalization. Golub and Christian Reinsch published their refined and widely adopted algorithm in a 1970 paper in Numerische Mathematik, which became a standard part of the LINPACK and later LAPACK numerical libraries. This made SVD a practical tool in scientific computing and signal processing.
Starting in the 1990s and accelerating through the 2000s, SVD became central to many machine learning methods. Scott Deerwester and colleagues introduced latent semantic analysis in 1990, applying SVD to term-document matrices for information retrieval. During the Netflix Prize competition (2006 to 2009), Simon Funk popularized a matrix-factorization approach colloquially called "SVD" (although it is a gradient-descent approximation rather than the exact decomposition) for collaborative filtering. Nathan Halko, Per-Gunnar Martinsson, and Joel Tropp published the randomized SVD algorithm in 2011, extending SVD to datasets too large for classical algorithms. In the 2020s, SVD-inspired methods such as LoRA brought low-rank ideas to the fine-tuning of large language models.
This section states the SVD theorem precisely, describes the components U, Sigma, and V, explains the full, thin, and truncated versions, and gives the geometric interpretation.
Theorem (Singular value decomposition). Let A be a real m x n matrix. Then there exist orthogonal matrices U in R^{m x m} and V in R^{n x n}, and an m x n matrix Sigma that is zero except on its main diagonal, such that
A = U Sigma V^T
The diagonal entries of Sigma are non-negative real numbers sigma_1 >= sigma_2 >= ... >= sigma_p >= 0, where p = min(m, n). The number of strictly positive singular values equals the rank of A.
For complex matrices, the same theorem holds with U and V being unitary matrices (the complex analogue of orthogonal matrices) and V^T replaced by the conjugate transpose V^*.
The standard existence proof starts with the symmetric positive semidefinite matrix A^T A. Since A^T A is symmetric, the spectral theorem guarantees it has an orthonormal basis of eigenvectors with real eigenvalues. All eigenvalues are non-negative: if A^T A v = lambda v with ||v|| = 1, then lambda = lambda ||v||^2 = v^T (lambda v) = v^T A^T A v = ||A v||^2 >= 0. Define the singular values as sigma_k = sqrt(lambda_k), ordered in decreasing order. The right singular vectors are the eigenvectors v_k of A^T A. For the nonzero singular values, define the left singular vectors by u_k = A v_k / sigma_k; one verifies these are orthonormal by direct computation. The left singular vectors for zero singular values are chosen to complete an orthonormal basis of R^m. Assembling these vectors into matrices and verifying the identity gives A = U Sigma V^T.
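The constructive steps of this proof can be mirrored numerically. A sketch, assuming a full-rank test matrix so that no zero singular values need completing:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))  # tall matrix, full rank with probability 1

# Spectral decomposition of the symmetric PSD matrix A^T A.
lam, V = np.linalg.eigh(A.T @ A)   # eigh returns eigenvalues in increasing order
order = np.argsort(lam)[::-1]      # reorder to decreasing
lam, V = lam[order], V[:, order]

sigma = np.sqrt(np.clip(lam, 0.0, None))  # singular values sigma_k = sqrt(lambda_k)
U = (A @ V) / sigma                       # left singular vectors u_k = A v_k / sigma_k

# U has orthonormal columns and A = U diag(sigma) V^T (the thin SVD).
assert np.allclose(U.T @ U, np.eye(3), atol=1e-10)
assert np.allclose(A, U @ np.diag(sigma) @ V.T)
```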
| Component | Size | Role | Key property |
|---|---|---|---|
| U | m x m | Left singular vectors (columns) | Orthogonal: U^T U = U U^T = I_m |
| Sigma | m x n | Non-negative diagonal entries (singular values) | Zero off the main diagonal; diagonal entries sorted in decreasing order |
| V | n x n | Right singular vectors (columns) | Orthogonal: V^T V = V V^T = I_n |
The columns of V form an orthonormal basis of the input space R^n. The columns of U form an orthonormal basis of the output space R^m. The singular values sigma_k describe how much the transformation A stretches or compresses along the direction v_k. Concretely, A v_k = sigma_k u_k, so the unit vector v_k is mapped to sigma_k times the unit vector u_k.
The left singular vectors corresponding to nonzero singular values span the column space (range) of A. The remaining left singular vectors span the left null space of A. Similarly, the right singular vectors corresponding to nonzero singular values span the row space of A, and the remaining right singular vectors span the null space (kernel) of A. This gives an elegant description of the four fundamental subspaces of a matrix.
In practice, three different forms of SVD are used depending on the application.
| Form | U size | Sigma size | V size | Storage | Use case |
|---|---|---|---|---|---|
| Full SVD | m x m | m x n | n x n | O(m^2 + n^2 + mn) | Theoretical work, four fundamental subspaces |
| Thin (economy) SVD | m x p | p x p (square diagonal) | n x p | O(mp + np + p) | Standard numerical computation; p = min(m, n) |
| Truncated (rank-k) SVD | m x k | k x k | n x k | O(mk + nk + k) | Low-rank approximation, compression, dimensionality reduction |
The full SVD includes the entire orthogonal matrices U and V, with a rectangular Sigma padded with zero rows or columns. It is important for theoretical work because it gives orthonormal bases for all four fundamental subspaces, but it is wasteful in storage when the matrix is very rectangular.
The thin SVD (also called economy SVD or reduced SVD) keeps only the columns of U and V corresponding to non-zero singular values (or, more precisely, the first p = min(m, n) columns). The resulting Sigma is square and contains only the singular values on its diagonal. For a tall matrix (m > n), this means U becomes m x n and Sigma becomes n x n. The thin SVD is what numerical libraries typically return when full orthogonality of the excess columns is not needed.
The truncated SVD keeps only the top k singular values and corresponding singular vectors:
A_k = U_k Sigma_k V_k^T = sum_{i=1}^{k} sigma_i u_i v_i^T
This is a rank-k matrix, and by the Eckart-Young-Mirsky theorem it is the best rank-k approximation of A in the Frobenius and spectral norms. Truncated SVD is the workhorse of dimensionality reduction and low-rank approximation.
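In NumPy, the thin SVD is requested with full_matrices=False and a truncated SVD is obtained by slicing the thin factors. A brief sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 20))

# Thin (economy) SVD: U is 100 x 20, s has 20 entries, Vt is 20 x 20.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Truncated rank-k SVD: keep only the top k singular triplets.
k = 5
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
assert np.linalg.matrix_rank(A_k) == k
```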
The SVD has a striking geometric interpretation. Any linear transformation A: R^n -> R^m can be decomposed into three successive steps: a rotation (or reflection) of the input space by V^T, an axis-aligned scaling of each coordinate by the corresponding singular value, and a rotation (or reflection) into the output space by U.
Geometrically, the unit sphere in R^n is first rotated by V^T (still a unit sphere), then each coordinate axis is stretched by the corresponding singular value (turning the sphere into an axis-aligned ellipsoid), then the result is rotated by U. The final image of the unit sphere is an ellipsoid in R^m whose principal semi-axes have lengths equal to the singular values of A, oriented along the left singular vectors. This gives a visual explanation for many of the facts about SVD: the largest singular value sigma_1 equals the operator 2-norm ||A||_2 because it is the longest semi-axis of the ellipsoid, and the ratio sigma_1 / sigma_min equals the aspect ratio of the ellipsoid, which is the condition number.
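The ellipsoid picture can be checked directly in code: the operator 2-norm reported by a library equals the largest singular value, and the condition number equals the ratio of extreme singular values. A small sketch:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])
s = np.linalg.svd(A, compute_uv=False)

# Longest semi-axis of the image ellipse = operator 2-norm.
assert np.isclose(np.linalg.norm(A, 2), s[0])
# Aspect ratio of the ellipse = condition number.
assert np.isclose(np.linalg.cond(A, 2), s[0] / s[-1])
```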
SVD and eigendecomposition are closely related but not identical. Understanding the precise relationship is important because the two are often confused.
The eigendecomposition of a square matrix B is B = P D P^{-1}, where D is a diagonal matrix of eigenvalues and P contains the corresponding eigenvectors as columns. Eigendecomposition exists only for square matrices and only for those that are diagonalizable. When B is real symmetric (or complex Hermitian), P can be chosen orthogonal (or unitary) and the eigenvalues are real, giving the spectral theorem: B = Q Lambda Q^T.
For any real matrix A, the matrices A^T A and A A^T are symmetric and positive semidefinite, so they admit spectral decompositions. These are directly related to the SVD of A:
| Symmetric matrix | Eigendecomposition | Meaning |
|---|---|---|
| A^T A (n x n) | V Sigma^T Sigma V^T | Eigenvectors are right singular vectors; eigenvalues are sigma_i^2 |
| A A^T (m x m) | U Sigma Sigma^T U^T | Eigenvectors are left singular vectors; eigenvalues are sigma_i^2 |
This relationship is the basis of the standard existence proof of SVD: the singular values are the square roots of the eigenvalues of A^T A (or equivalently A A^T).
| Property | Eigendecomposition | SVD |
|---|---|---|
| Applies to | Square diagonalizable matrices only | Any real or complex rectangular matrix |
| Always exists | No | Yes |
| Bases | May not be orthogonal | Always orthogonal |
| Scalars | Possibly complex eigenvalues | Non-negative real singular values |
| Input and output bases | Same (P on both sides) | Different (U and V) |
For a real symmetric positive semidefinite matrix S, the SVD and the spectral eigendecomposition coincide: S = Q Lambda Q^T with non-negative eigenvalues, which is exactly the SVD with U = V = Q. For a general square matrix that is diagonalizable but not symmetric, the eigenvectors need not be orthogonal and the eigenvalues may be negative or complex, so the eigendecomposition and the SVD are genuinely different factorizations.
Several fundamental properties follow directly from the definition of SVD.
The singular values of A are uniquely determined. They are an intrinsic property of the matrix. The singular vectors, on the other hand, are only determined up to certain sign and rotation freedoms.
The sign ambiguity in particular has practical implications. Different software libraries may return SVDs with singular vectors that differ in sign, even though both are mathematically valid. Applications that depend on a specific sign convention (for example, aligning principal components across datasets) must handle this explicitly.
By convention, singular values are listed in decreasing order sigma_1 >= sigma_2 >= ... >= 0. This convention matters because the truncated SVD (which drops the smallest singular values) only gives the best low-rank approximation when the singular values are ordered. Numerical libraries consistently enforce this ordering.
The singular vectors corresponding to distinct singular values are orthogonal. The columns of U and V are orthonormal by construction.
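The sign freedom is easy to exhibit: flipping the sign of a matched pair of singular vectors leaves the product unchanged. A sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4))
U, s, Vt = np.linalg.svd(A)

# Flip the sign of the first left singular vector and, to compensate,
# the first right singular vector: the reconstruction is unaffected.
U2, Vt2 = U.copy(), Vt.copy()
U2[:, 0] *= -1
Vt2[0, :] *= -1
assert np.allclose(U2 @ np.diag(s) @ Vt2, A)
```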
The SVD gives direct expressions for several fundamental matrix quantities.
| Quantity | Formula in terms of singular values |
|---|---|
| Rank | Number of nonzero singular values |
| Frobenius norm | sqrt(sigma_1^2 + sigma_2^2 + ... + sigma_p^2) |
| Spectral norm | sigma_1 (the largest singular value) |
| Nuclear norm | sigma_1 + sigma_2 + ... + sigma_p |
| Condition number kappa_2(A) | sigma_1 / sigma_min (for full rank A) |
| Determinant (square A) | product_i sigma_i times sign from orthogonal matrices |
The spectral norm equals the largest singular value because it measures the maximum stretch the matrix applies to any unit vector, which the geometric interpretation of SVD makes transparent. The Frobenius norm is related to the squared singular values because ||A||_F^2 = trace(A^T A) = sum of eigenvalues of A^T A = sum of squared singular values.
The singular values of A are invariant under left or right multiplication by orthogonal matrices. If Q and R are orthogonal, then Q A R^T has the same singular values as A, though generally different singular vectors. This invariance is the reason unitarily invariant norms (such as the Frobenius and spectral norms) can be expressed entirely in terms of singular values.
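These identities are straightforward to confirm numerically; the orthogonal factors below are generated by QR decomposition of random matrices:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 4))
s = np.linalg.svd(A, compute_uv=False)

assert np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(s**2)))  # Frobenius
assert np.isclose(np.linalg.norm(A, 2), s[0])                       # spectral
assert np.isclose(np.linalg.norm(A, 'nuc'), np.sum(s))              # nuclear

# Orthogonal invariance: Q A R^T has the same singular values as A.
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
R, _ = np.linalg.qr(rng.standard_normal((4, 4)))
s2 = np.linalg.svd(Q @ A @ R.T, compute_uv=False)
assert np.allclose(s, s2)
```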
A wide variety of algorithms exist for computing the SVD, trading off precision, speed, and suitability for particular matrix structures.
The classical algorithm for computing a dense SVD is the Golub-Reinsch algorithm, published in 1970. It proceeds in two stages: first, A is reduced to bidiagonal form by orthogonal Householder transformations applied from the left and right; second, an implicit-shift QR iteration is applied to the bidiagonal matrix, driving its off-diagonal entries to zero and leaving the singular values on the diagonal.
The algorithm is backward stable and is the basis of the SVD routines in LAPACK, which are called by numerical libraries such as NumPy, SciPy, MATLAB, Julia, and R.
The one-sided Jacobi SVD algorithm applies a sequence of plane rotations to orthogonalize the columns of A. It is slower than Golub-Reinsch for large matrices but can achieve higher relative accuracy, particularly for the small singular values of matrices with a wide range of singular value magnitudes. It is sometimes used when high accuracy is needed for all singular values.
The divide-and-conquer method for bidiagonal SVD recursively splits the bidiagonal matrix into smaller pieces, solves the smaller problems, and combines the results via a secular equation. It has better computational complexity than the QR-based approach for large matrices and is implemented in LAPACK as gesdd.
For very large matrices, and particularly when only a low-rank approximation is needed, classical dense SVD is too expensive. The randomized SVD algorithm by Halko, Martinsson, and Tropp (2011) produces a near-optimal rank-k approximation by sampling random linear combinations of the columns of A and then doing a small dense SVD on the reduced matrix.
The basic idea is:

1. Draw a random test matrix Omega of size n x (k + p), where p is a small oversampling parameter.
2. Form the sketch Y = A Omega and orthonormalize its columns to obtain Q, whose range approximates the range of A.
3. Compute the small matrix B = Q^T A and its dense SVD B = U_B Sigma V^T.
4. Set U = Q U_B, giving the approximate factorization A approx U Sigma V^T.
For increased accuracy, one or two subspace iterations may be added. The randomized algorithm is fast, parallelizes well, needs only a few passes over the matrix (which matters enormously for out-of-core or streaming computations), and has rigorous error bounds. It has become the standard approach for SVD of very large or very sparse matrices, and it is used under the hood in large-scale PCA implementations such as scikit-learn's TruncatedSVD.
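A minimal implementation of this scheme (without the optional subspace iterations) fits in a few lines; the oversampling parameter p = 10 below is an illustrative default, not a value prescribed by the paper:

```python
import numpy as np

def randomized_svd(A, k, p=10, rng=None):
    """Sketch of a rank-k randomized SVD in the style of Halko et al."""
    rng = np.random.default_rng(rng)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + p))   # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)            # orthonormal basis for the sketch
    B = Q.T @ A                               # small (k + p) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k, :]

# Usage on a synthetic matrix with exact low-rank structure.
rng = np.random.default_rng(4)
A = rng.standard_normal((2000, 50)) @ rng.standard_normal((50, 500))
U, s, Vt = randomized_svd(A, k=20)
```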
| Algorithm | Complexity | Notes |
|---|---|---|
| Golub-Reinsch (QR iteration) | O(min(m n^2, m^2 n)) flops; roughly 17 n^3 for square n x n | Standard dense SVD; backward stable |
| Divide and conquer (gesdd) | Similar leading order; smaller constant, roughly 9 n^3 for square n x n | Faster than QR iteration in practice for large matrices |
| One-sided Jacobi | O(n^3) with larger constant | Higher relative accuracy for small singular values |
| Lanczos / ARPACK iteration | O(m n k) for k leading singular triplets | Good for sparse matrices and few dominant components |
| Randomized SVD | O(m n k) for dense; O(m n log k) with structured random matrices | Approximate; near-optimal for low-rank structure |
For a general dense m x n matrix the overall cost of a full SVD is O(m n^2) if m >= n, and O(m^2 n) if n > m; these bounds reflect the cost of the bidiagonalization step, which dominates.
The truncated SVD is not just a low-rank approximation of A; it is provably the best low-rank approximation in a strong sense.
Theorem (Eckart-Young-Mirsky). Let A be an m x n matrix with SVD A = U Sigma V^T, and let A_k = sum_{i=1}^{k} sigma_i u_i v_i^T be its truncation to rank k. Then for any matrix B of rank at most k,
||A - A_k||_F <= ||A - B||_F
and
||A - A_k||_2 <= ||A - B||_2
Moreover, the errors are given exactly by the discarded singular values:
||A - A_k||_F^2 = sigma_{k+1}^2 + sigma_{k+2}^2 + ... + sigma_p^2
||A - A_k||_2 = sigma_{k+1}
Mirsky (1960) extended the statement to any unitarily invariant norm. The result is sometimes called the Schmidt-Eckart-Young-Mirsky theorem, acknowledging Erhard Schmidt's earlier work in the infinite-dimensional setting.
Let B be any rank-k matrix. Write the SVD of A as A = U Sigma V^T and consider the transformed error U^T (A - B) V = Sigma - U^T B V. The Frobenius norm is unitarily invariant, so ||A - B||_F = ||Sigma - U^T B V||_F. The matrix U^T B V has rank at most k. The problem therefore reduces to: among all rank-k matrices C, which minimizes ||Sigma - C||_F when Sigma is diagonal with decreasing entries? Using the interlacing inequalities of Weyl for singular values of sums of matrices (or a direct argument), one shows the minimum is achieved by C = diag(sigma_1, ..., sigma_k, 0, ..., 0), yielding the error sum_{i > k} sigma_i^2. This corresponds precisely to the truncation A_k.
The Eckart-Young theorem provides the mathematical justification for almost every use of SVD as a dimensionality reduction tool. It says that if you want to compress a matrix to rank k, throwing away all but the top k singular values and their vectors is the best you can do with respect to the Frobenius or spectral norm. The quality of the approximation is determined by how fast the singular values decay: if they decay rapidly, a low rank k suffices for high accuracy; if they decay slowly, any low-rank approximation will be poor. Many real-world matrices (natural images, text corpora, user-item rating matrices) have rapidly decaying singular values, which is why low-rank methods work so well in practice.
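The error identities of the theorem can be verified directly on a random test matrix; a sketch:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((30, 20))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 5
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius error is the root-sum-of-squares of the discarded singular values.
assert np.isclose(np.linalg.norm(A - A_k, 'fro'), np.sqrt(np.sum(s[k:]**2)))
# Spectral error is the first discarded singular value.
assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])
```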
SVD provides the most stable and general way to compute the Moore-Penrose pseudoinverse of a matrix and to solve least squares problems.
The Moore-Penrose pseudoinverse of a matrix A with SVD A = U Sigma V^T is
A^+ = V Sigma^+ U^T
where Sigma^+ is obtained from Sigma by transposing it and then replacing each nonzero singular value sigma_i by its reciprocal 1/sigma_i (zero singular values remain zero). Equivalently,
A^+ = sum_{sigma_i > 0} (1 / sigma_i) v_i u_i^T
This formula always makes sense, even for rank-deficient or non-square matrices. The pseudoinverse satisfies the four Moore-Penrose conditions and reduces to the ordinary inverse when A is square and invertible.
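A sketch of the formula in code, checked against NumPy's built-in pinv; the tolerance below follows the numerical-rank convention discussed later in this article:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((5, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Invert only singular values above a small threshold; leave "zeros" at zero.
tol = s[0] * max(A.shape) * np.finfo(A.dtype).eps
s_inv = np.where(s > tol, 1.0 / s, 0.0)
A_pinv = Vt.T @ np.diag(s_inv) @ U.T

assert np.allclose(A_pinv, np.linalg.pinv(A))
```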
For an overdetermined linear system A x = b with A of size m x n and m > n, the minimum-norm least squares solution is
x^* = A^+ b = V Sigma^+ U^T b
This x^* minimizes ||A x - b||_2 over all x, and among all minimizers it has the smallest Euclidean norm. When A is full column rank, the solution is unique and equals (A^T A)^{-1} A^T b, but for rank-deficient A the normal equations are singular while the SVD-based formula still works.
Expanded in terms of singular values, the solution can be written as
x^* = sum_{sigma_i > 0} ((u_i^T b) / sigma_i) v_i
This expansion shows why small singular values cause numerical trouble: dividing by a very small sigma_i amplifies noise in the corresponding projection u_i^T b. The condition number of the least squares problem is essentially sigma_1 / sigma_min. When the matrix is ill-conditioned, it is common to drop all singular values below a threshold (the truncated SVD regularization) or apply Tikhonov regularization, which replaces 1/sigma_i with sigma_i / (sigma_i^2 + lambda^2) for a damping parameter lambda.
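Both filters are one-liners in the SVD basis. A sketch comparing the truncated-SVD and Tikhonov-damped solutions on an ill-conditioned system (the threshold and damping parameter lambda are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((50, 10))
A[:, -1] = A[:, 0] + 1e-8 * rng.standard_normal(50)  # nearly dependent column
b = rng.standard_normal(50)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
coeffs = U.T @ b                        # projections u_i^T b

# Truncated SVD: drop directions with tiny singular values.
keep = s > 1e-6 * s[0]
x_tsvd = Vt.T @ np.where(keep, coeffs / s, 0.0)

# Tikhonov: damp instead of dropping, filter factor sigma / (sigma^2 + lambda^2).
lam = 1e-3
x_tik = Vt.T @ (coeffs * s / (s**2 + lam**2))
```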
The total least squares problem, which allows errors in both A and b, also has a clean SVD-based solution. Augmenting A with b to form [A | b] and computing the SVD, the solution comes from the right singular vector corresponding to the smallest singular value. This is known as the Golub-Van Loan solution to total least squares.
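A sketch of that construction, assuming the smallest singular value of [A | b] is simple and the last component of its right singular vector is nonzero (the generic case):

```python
import numpy as np

def total_least_squares(A, b):
    """TLS solution from the SVD of the augmented matrix [A | b]."""
    n = A.shape[1]
    _, _, Vt = np.linalg.svd(np.column_stack([A, b]))
    v = Vt[-1]              # right singular vector for the smallest singular value
    return -v[:n] / v[n]    # requires v[n] != 0
```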
The applications of SVD span nearly every quantitative discipline. This section highlights some of the most important.
Principal component analysis (PCA) is essentially SVD applied to a centered data matrix. Given an n x p data matrix X where rows are observations and columns are features, PCA first centers X by subtracting the column means and then computes the SVD of the centered matrix X_c = U Sigma V^T. The right singular vectors (columns of V) are the principal component directions, the scores U Sigma give the coordinates of the data in the new basis, and the squared singular values divided by (n - 1) give the variances captured by each component. Computing PCA via SVD of X_c is numerically preferable to forming and diagonalizing the covariance matrix X_c^T X_c / (n - 1), because squaring the matrix doubles the condition number. Almost all modern PCA implementations (scikit-learn, MATLAB's pca, R's prcomp) use SVD under the hood.
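A sketch of PCA computed this way, checked against the covariance-matrix route it replaces:

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))  # correlated features
Xc = X - X.mean(axis=0)                    # center the columns

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                             # coordinates in the principal basis
explained_var = s**2 / (X.shape[0] - 1)    # variance captured per component

# Same variances as the eigenvalues of the sample covariance matrix.
cov_eigvals = np.linalg.eigvalsh(Xc.T @ Xc / (X.shape[0] - 1))[::-1]
assert np.allclose(explained_var, cov_eigvals)
```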
Latent semantic analysis (LSA), also called latent semantic indexing, was introduced by Deerwester et al. in their 1990 paper "Indexing by Latent Semantic Analysis," published in the Journal of the American Society for Information Science. LSA applies truncated SVD to a term-document matrix (often weighted by TF-IDF or log-entropy) where rows are terms and columns are documents. The truncation to a small number of components (typically 100 to 300) projects terms and documents into a shared low-dimensional "semantic" space, where terms with similar meanings and documents about similar topics end up close together. LSA was one of the first successful methods for discovering latent topics in text and for retrieving documents by concept rather than exact keyword match. It is a direct ancestor of modern topic models (such as LDA) and word embedding methods.
In recommender systems, users and items can be arranged in a large sparse rating matrix whose entries represent observed ratings or interactions. Low-rank matrix factorization methods (often loosely called "SVD" in the recommender literature) factor this matrix into a user-factor matrix and an item-factor matrix, treating unobserved entries as unknowns to be predicted. The canonical example is the Netflix Prize competition (2006 to 2009), where matrix factorization became one of the core techniques. Simon Funk popularized a stochastic gradient descent method that trains low-rank factors to minimize squared error on observed ratings. Strictly speaking, this is not the classical SVD because it handles missing entries, but the underlying intuition comes directly from the low-rank structure revealed by SVD.
The winning team BellKor's Pragmatic Chaos used an ensemble that included matrix factorization as a central component. Yehuda Koren and collaborators published influential papers on the approach, including "Matrix Factorization Techniques for Recommender Systems" in IEEE Computer (2009).
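A minimal sketch of that SGD scheme on a toy ratings set; the rank, learning rate, and regularization below are illustrative choices, not values from the competition:

```python
import numpy as np

rng = np.random.default_rng(9)
n_users, n_items, r = 100, 50, 8
observed = [(rng.integers(n_users), rng.integers(n_items), rng.uniform(1, 5))
            for _ in range(2000)]              # (user, item, rating) triples

P = 0.1 * rng.standard_normal((n_users, r))    # user factors
Q = 0.1 * rng.standard_normal((n_items, r))    # item factors
lr, reg = 0.01, 0.05

for epoch in range(20):
    for u, i, rating in observed:
        err = rating - P[u] @ Q[i]             # error on one observed entry
        P[u], Q[i] = (P[u] + lr * (err * Q[i] - reg * P[u]),
                      Q[i] + lr * (err * P[u] - reg * Q[i]))

predict = lambda u, i: P[u] @ Q[i]  # prediction for any (user, item) pair
```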
An image can be represented as a matrix of pixel intensities (or three matrices for color channels). Computing the SVD and keeping only the top k singular triplets reconstructs an approximation A_k = U_k Sigma_k V_k^T whose storage requirement is (m + n + 1) * k instead of m * n. For a 1000 x 1000 image, keeping 50 singular values requires storing 100050 numbers instead of 1000000, a tenfold reduction. Because natural images tend to have rapidly decaying singular value spectra, the reconstructed image is often visually indistinguishable from the original even at moderate truncation. While SVD-based compression is not competitive with modern codecs such as JPEG or WebP in terms of perceptual quality per bit, it remains a standard teaching example that illustrates low-rank approximation vividly.
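The storage arithmetic above translates directly into code; a sketch using a random array as a stand-in for a grayscale image:

```python
import numpy as np

img = np.random.default_rng(10).uniform(0, 255, (1000, 1000))  # stand-in image
U, s, Vt = np.linalg.svd(img, full_matrices=False)

k = 50
img_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

stored = (img.shape[0] + img.shape[1] + 1) * k   # 100050 numbers
original = img.size                              # 1000000 numbers
print(f"compression ratio: {original / stored:.1f}x")  # about 10x
```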
When a matrix of signal measurements is corrupted by independent additive noise, the true signal often lies in a low-dimensional subspace while the noise is spread across all dimensions. Truncating the SVD to the top few singular values keeps the signal subspace and discards most of the noise. This idea underlies a wide range of denoising techniques in signal processing, image processing, and bioinformatics, including subspace methods for direction-of-arrival estimation, spectral methods for microarray data, and low-rank plus sparse decompositions (robust PCA) for video background subtraction.
SVD is a workhorse of numerical linear algebra. It provides the most reliable tool for determining the numerical rank of a matrix (counting how many singular values are above a noise threshold), for solving ill-conditioned or rank-deficient linear systems, for computing the null space and range of a matrix, and for estimating condition numbers. Matrix equation solvers, orthogonal Procrustes problems (finding the best rotation that aligns two sets of points), and many constrained optimization problems rely on the SVD.
| Domain | SVD-based application |
|---|---|
| Control theory | Balanced model reduction, system identification, H-infinity optimization |
| Signal processing | MUSIC and ESPRIT for direction-of-arrival, subspace tracking |
| Computer vision | Essential matrix decomposition, structure from motion, homography estimation |
| Genomics | Gene expression analysis, population structure inference |
| Chemometrics | Multivariate calibration, partial least squares |
| Information retrieval | LSI, document clustering, cross-language retrieval |
| Network science | Spectral clustering, community detection, link prediction |
SVD has become newly relevant in the era of large neural networks. The weight matrices of modern deep models contain billions of parameters, and low-rank structure offers a path to understanding, compressing, and fine-tuning them efficiently.
One direct application is model compression by low-rank weight decomposition. A dense weight matrix W in R^{d x k} can be replaced by a product W approx B A where B is in R^{d x r} and A is in R^{r x k}, with r much smaller than min(d, k). When the original W has fast-decaying singular values, the approximation is accurate and the parameter count drops from d*k to (d + k)*r. This method has been used to compress fully connected layers, convolutional kernels, and self-attention projections, often with modest accuracy loss. Structured SVD variants (Tucker decomposition, CP decomposition for tensors) extend the idea to convolutional filters with multiple dimensions.
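In code, the replacement amounts to splitting the truncated SVD between the two factors. A sketch with illustrative dimensions, on a weight matrix constructed to be genuinely low rank:

```python
import numpy as np

rng = np.random.default_rng(11)
d, k, r = 1024, 512, 32
W = rng.standard_normal((d, r)) @ rng.standard_normal((r, k))  # exactly rank r

U, s, Vt = np.linalg.svd(W, full_matrices=False)
B = U[:, :r] * s[:r]     # d x r, absorbs the singular values
A = Vt[:r, :]            # r x k
assert np.allclose(W, B @ A)

params_before = d * k         # 524288
params_after = (d + k) * r    # 49152, roughly a 10x reduction
```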
LoRA (low-rank adaptation of large language models), introduced by Hu et al. in 2021, leverages low-rank structure to fine-tune massive language models efficiently. Instead of updating the full pretrained weight matrix W_0 in R^{d x k}, LoRA keeps W_0 frozen and adds a trainable low-rank update Delta W = B A, where B is in R^{d x r} and A is in R^{r x k} with r much smaller than min(d, k). Only A and B are trained. The rationale, backed by empirical and SVD-based analyses in the original paper, is that the effective update needed for downstream fine-tuning has low "intrinsic rank," so a low-rank parameterization captures most of the benefit at a small fraction of the cost. Compared to full fine-tuning of GPT-3 175B with Adam, LoRA reduces the number of trainable parameters by roughly 10000 times and the GPU memory requirement by roughly 3 times.
LoRA is not literally an SVD of the update, but it is directly inspired by the low-rank structure that SVD reveals, and several follow-up methods (such as AdaLoRA and SVD-LoRA) explicitly use SVD to initialize, allocate, or adapt the rank budget across layers. Similar ideas appear in QLoRA, DoRA, and other parameter-efficient fine-tuning techniques.
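A minimal sketch of a LoRA-style linear layer in PyTorch; the class and attribute names are illustrative, not taken from the LoRA reference implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained weight W0 plus a trainable low-rank update B @ A."""
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.W0 = nn.Linear(d_in, d_out, bias=False)
        self.W0.weight.requires_grad_(False)          # pretrained weight stays frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))  # zero init: Delta W = 0 at start
        self.scale = alpha / r

    def forward(self, x):
        return self.W0(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(768, 768, r=8)
out = layer(torch.randn(4, 768))   # only A and B receive gradients
```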
Researchers have used SVD to probe the structure of trained neural networks. Studying the spectra of weight matrices reveals patterns that relate to generalization, training dynamics, and task information. Random matrix theory analysis of deep networks has shown that before training the singular value spectra match predictions of random matrix models (such as the Marchenko-Pastur distribution), while after training the largest singular values deviate and appear to encode task-specific information. More recent work has also found that the smallest singular values deviate from random predictions, suggesting that the entire singular value spectrum carries meaningful signal and that compression methods should not blindly discard the smallest singular values without considering the task.
Gradient descent on linear and deep networks exhibits an implicit bias toward low-rank and low-spectral-norm solutions in certain regimes. Spectral properties of weight matrices (such as sigma_1 and the ratio sigma_1 / sigma_min) are used in generalization bounds, stability analysis, and Lipschitz-constant estimation. Spectral normalization, a technique used in generative adversarial networks, constrains the spectral norm (largest singular value) of each weight matrix to be at most 1 and requires computing or approximating sigma_1 during training, typically via power iteration.
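The power iteration used by spectral normalization takes only a few lines; a sketch estimating sigma_1 of a weight matrix:

```python
import numpy as np

def largest_singular_value(W, n_iters=50, rng=None):
    """Estimate sigma_1 by alternating power iteration, as in spectral normalization."""
    rng = np.random.default_rng(rng)
    v = rng.standard_normal(W.shape[1])
    for _ in range(n_iters):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return u @ W @ v   # converges to sigma_1

W = np.random.default_rng(12).standard_normal((64, 32))
assert np.isclose(largest_singular_value(W), np.linalg.norm(W, 2), rtol=1e-3)
```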
In the theoretical analysis of wide neural networks, the neural tangent kernel (NTK) describes the linearized training dynamics and its spectral decomposition governs convergence rates and generalization. SVD (or eigendecomposition) of the NTK Gram matrix provides the principal modes of learning: the network converges fastest along directions corresponding to the largest eigenvalues, a phenomenon known as spectral bias. This gives a quantitative explanation for why deep networks tend to fit low-frequency components of functions first.
Real-world SVD computations require care with respect to numerical stability, scaling, and accuracy targets.
The condition number of A with respect to the 2-norm is kappa_2(A) = sigma_1 / sigma_min. A matrix is called well-conditioned if this ratio is small (close to 1) and ill-conditioned if it is large. The condition number quantifies how much small perturbations in the input of a linear system A x = b can be amplified in the solution. Roughly speaking, if kappa_2(A) approx 10^k and the input has relative error at the machine precision epsilon, the computed solution can have relative error up to 10^k * epsilon. SVD is the gold standard for computing kappa_2 because no other method gives such a direct handle on sigma_min.
In floating-point arithmetic, a theoretically zero singular value typically computes as a very small number. Determining the numerical rank of a matrix amounts to choosing a threshold below which singular values are treated as zero. A common choice is sigma_1 * max(m, n) * epsilon, where epsilon is the machine precision. The SVD's ability to provide a stable rank-revealing factorization is one of its most practically valuable features.
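That threshold rule is a one-liner; a sketch matching the convention used by numpy.linalg.matrix_rank:

```python
import numpy as np

def numerical_rank(A):
    """Count singular values above the sigma_1 * max(m, n) * eps threshold."""
    s = np.linalg.svd(A, compute_uv=False)
    tol = s[0] * max(A.shape) * np.finfo(A.dtype).eps
    return int(np.sum(s > tol))

# A matrix that is rank 2 in exact arithmetic (row 2 is twice row 1).
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 0.0, 1.0]])
assert numerical_rank(A) == np.linalg.matrix_rank(A) == 2
```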
The Golub-Reinsch and divide-and-conquer SVD algorithms are backward stable: the computed singular values and vectors are exact for a matrix that is close to the input matrix (within relative error on the order of epsilon). This means the accuracy of the large singular values is limited only by machine precision. The relative accuracy of very small singular values may be worse, which is where Jacobi SVD can offer advantages.
For dense matrices that fit in memory, Golub-Reinsch or divide-and-conquer are the methods of choice. For sparse matrices, Lanczos-based iterative methods are preferred because they access the matrix only through matrix-vector products and never form dense intermediate results. For very large matrices that do not fit in memory or that are accessed through expensive communication, randomized SVD minimizes the number of passes over the data, often getting a rank-k approximation from just two or three passes.
When using SVD to solve ill-conditioned least squares problems, it is common to regularize by discarding or damping small singular values. Two standard approaches are truncated SVD (setting 1/sigma_i to zero for sigma_i below a threshold) and Tikhonov regularization (replacing 1/sigma_i with sigma_i / (sigma_i^2 + lambda^2)). Both have a natural interpretation in the SVD basis: they filter the contributions of noisy directions to the solution.
| Library / Tool | Language | Function | Notes |
|---|---|---|---|
| NumPy | Python | numpy.linalg.svd | Full or thin SVD via LAPACK gesdd |
| SciPy | Python | scipy.linalg.svd, scipy.sparse.linalg.svds | Dense and sparse (ARPACK) SVD |
| scikit-learn | Python | sklearn.decomposition.TruncatedSVD, PCA | Randomized and classical variants |
| PyTorch | Python | torch.linalg.svd, torch.svd_lowrank | GPU-accelerated, differentiable |
| TensorFlow | Python | tf.linalg.svd | GPU-accelerated |
| MATLAB | MATLAB | svd, svds | Dense and sparse |
| R | R | svd, irlba (for truncated) | Built-in and package-based |
| LAPACK | Fortran / C | gesdd, gesvd, gejsv | Low-level reference implementation |
| Julia | Julia | svd, svdvals | Thin or full via LAPACK |
| Apache Spark MLlib | Scala / Python | RowMatrix.computeSVD | Distributed for big data |
For very large or sparse matrices, specialized packages such as ARPACK, PROPACK, SLEPc, and fbpca (randomized SVD in Python) are widely used. GPU-accelerated SVDs are available in cuSOLVER (NVIDIA), MAGMA, and as built-ins in PyTorch and TensorFlow.