Matryoshka representation learning
Last reviewed
May 16, 2026
Sources
20 citations
Review status
Source-backed
Revision
v1 · 3,500 words
Matryoshka Representation Learning (MRL) is a representation learning technique that trains a single neural model to produce embedding vectors which remain useful when truncated to many smaller dimensionalities. Named after Russian nesting dolls, the method packs coarse-to-fine information into a single output vector so that the first k coordinates form a self-contained representation for many values of k. MRL was introduced in May 2022 by Aditya Kusupati and colleagues at the University of Washington, Google Research, and the Allen Institute for AI in the paper "Matryoshka Representation Learning" (arXiv:2205.13147), published at NeurIPS 2022.
The technique adds a sum of losses computed on nested prefixes of the embedding to the standard training pipeline, with no architecture changes and no measurable cost at inference. Downstream systems can shorten the model's output embeddings at will, trading a small amount of accuracy for large savings in memory, storage, and search latency without retraining. MRL became one of the most widely adopted techniques in modern embedding stacks. It powers OpenAI's text-embedding-3 family (January 2024), Nomic Embed Text v1.5 (February 2024), Jina Embeddings v3 (September 2024), voyage-code-3 (December 2024), and several Mixedbread (mxbai) releases. The coarse-to-fine principle has also been extended to transformer depth, image resolution, and key-value caches.
Deep learning systems typically learn fixed-dimensional vector representations whose size is set at training time. For information retrieval, large-scale classification, and approximate nearest-neighbor search, embeddings often need to be small to fit in memory and allow fast distance computations, but smaller embeddings tend to be less accurate. Practitioners historically resolved this trade-off in one of three ways: train multiple separate models at different dimensions; apply post hoc dimensionality reduction such as principal component analysis (PCA), random projection, or product quantization; or distill a smaller student from a large teacher. Each approach has drawbacks. Training many models is expensive. Linear projections discard information the encoder spent compute learning, and are fitted over a corpus rather than jointly with the encoder, so they typically underperform a model trained natively at the target dimension. Distillation requires a separate student per target size.
MRL instead makes the model itself responsible for producing embeddings simultaneously usable at many dimensionalities. Because the same embedding is read at multiple granularities during training, the optimizer is forced to place the most important information at the beginning of the vector, with later coordinates progressively refining the representation. Each smaller embedding is literally a prefix of the larger one, in the same way each smaller matryoshka doll fits inside the next.
Let $f_\theta(x) \in \mathbb{R}^d$ be the $d$-dimensional output of an encoder $f$ with parameters $\theta$, and let $M = \{m_1, m_2, \ldots, m_L\}$ be a set of nested dimensions with $m_1 < m_2 < \cdots < m_L = d$. For each chosen sub-dimension $m \in M$, the first $m$ coordinates of $f_\theta(x)$ are passed through a task head and an associated loss. The MRL training loss is a weighted sum of these per-dimension losses:
$$\mathcal{L}_{\mathrm{MRL}}(\theta) = \sum_{m \in M} c_m \, \mathcal{L}_m\big(f_\theta(x)_{1:m},\; y\big)$$
where $\mathcal{L}_m$ is the task loss (for example, cross-entropy for classification or a contrastive loss for retrieval), $f_\theta(x)_{1:m}$ denotes the first $m$ coordinates, $y$ is the target, and $c_m$ is an optional weight (the original paper uses uniform $c_m = 1$). The original paper evaluated nested sets such as {8, 16, 32, 64, 128, 256, 512, 1024, 2048} for ImageNet-1K with a ResNet-50, and similar geometric series for vision transformers, ALIGN-style vision-language models, and BERT-style text encoders. A lighter variant, MRL-Efficient (MRL-E), shares classifier weight matrices across dimensions through nesting, saving parameters on very large heads such as the 21,000-class ImageNet-21K head with virtually identical accuracy.
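The weighted sum above can be illustrated with a minimal sketch in plain Python. It assumes a classification task with cross-entropy as the per-dimension loss and one separate linear head per nested dimension; the function names (`mrl_loss`, `linear_head`, `softmax_cross_entropy`) and the toy numbers are hypothetical and are not taken from the reference implementation.

```python
import math

def softmax_cross_entropy(logits, label):
    """Cross-entropy of one example given raw logits and an integer label."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[label]

def linear_head(prefix, weights):
    """One logit per class: dot product of the prefix with that class's weight row."""
    return [sum(w * v for w, v in zip(row, prefix)) for row in weights]

def mrl_loss(embedding, label, heads, dim_weights=None):
    """Weighted sum of task losses over nested prefixes of a single embedding.

    heads maps each nested dimension m to a weight matrix with one row per
    class and m columns (a separate head per m; this is an assumption of the
    sketch, and MRL-E would instead reuse columns of one shared matrix).
    """
    total = 0.0
    for m, weights in heads.items():
        c_m = 1.0 if dim_weights is None else dim_weights[m]  # uniform by default
        logits = linear_head(embedding[:m], weights)          # read only the first m coordinates
        total += c_m * softmax_cross_entropy(logits, label)
    return total

# Toy example (hypothetical values): d = 4, nested dimensions {2, 4}, two classes.
embedding = [2.0, -1.0, 0.5, 0.25]
heads = {
    2: [[1.0, 0.0], [0.0, 1.0]],
    4: [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
}
loss = mrl_loss(embedding, label=0, heads=heads)
```

In a real training loop the same sum would be computed on a batch inside an automatic-differentiation framework, so that gradients from every prefix flow back into the shared encoder.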
At inference time the encoder is unchanged. A consumer of the embedding decides how many leading coordinates to keep and simply truncates the vector. There is no separate sub-encoder, no additional forward pass, and truncation can happen on the client side, so a single vector index built at full dimension can also serve queries from clients that store only the truncated form. The authors call this "adaptive deployment."
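Client-side truncation amounts to slicing the vector. The sketch below is a minimal illustration with hypothetical helper names; the optional rescaling to unit length is an assumption added here because dot-product search generally expects normalized vectors, and it has no effect on cosine similarity.

```python
import math

def truncate_embedding(vector, k, renormalize=True):
    """Keep the first k coordinates; optionally rescale the prefix to unit length."""
    if not 1 <= k <= len(vector):
        raise ValueError("k must be between 1 and the full dimension")
    prefix = list(vector[:k])
    if renormalize:
        norm = math.sqrt(sum(v * v for v in prefix))
        if norm > 0.0:
            prefix = [v / norm for v in prefix]
    return prefix

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

full = [0.6, 0.0, 0.8, 0.0]           # toy unit-length "full" embedding
short = truncate_embedding(full, 2)   # first two coordinates, rescaled
```

No model call is involved: the same stored full-dimensional vector can be cut to any supported length at read time.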
MRL pairs naturally with Adaptive Retrieval, in which a search system uses a short prefix (for example, the first 16 to 64 coordinates) to perform an inexpensive shortlist retrieval, then re-ranks the shortlist with longer prefixes or the full vector. On ImageNet-1K and ImageNet-4K the authors report up to a 128-fold reduction in theoretical FLOPs per query and up to a 14-fold real-world wall-clock speedup, while matching the Top-1 retrieval accuracy of the unshortened d = 2048 embedding.
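The two-stage idea can be sketched as follows. This is a simplified illustration, assuming an exhaustive linear scan with dot-product scoring rather than the approximate nearest-neighbor index a production system would use; the function name `adaptive_retrieval`, its parameters, and the toy corpus are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def adaptive_retrieval(query, corpus, shortlist_dim, shortlist_size, top_k):
    """Two-stage search: cheap prefix scoring, then full-vector re-ranking.

    corpus is a list of full-dimensional vectors; the return value is the
    list of indices of the top_k documents after re-ranking the shortlist.
    """
    q_short = query[:shortlist_dim]
    # Stage 1: score every document using only the leading coordinates.
    coarse = sorted(
        range(len(corpus)),
        key=lambda i: dot(q_short, corpus[i][:shortlist_dim]),
        reverse=True,
    )[:shortlist_size]
    # Stage 2: re-score only the shortlisted documents with the full vectors.
    reranked = sorted(coarse, key=lambda i: dot(query, corpus[i]), reverse=True)
    return reranked[:top_k]

corpus = [
    [0.9, 0.1, 0.0, 0.4],   # doc 0: best match on the 2-dim prefix
    [0.8, 0.2, 0.9, 0.0],   # doc 1: best match on the full vector
    [0.1, 0.9, 0.0, 0.0],   # doc 2
    [0.0, 0.1, 0.2, 0.9],   # doc 3
]
query = [1.0, 0.0, 1.0, 0.0]
result = adaptive_retrieval(query, corpus, shortlist_dim=2, shortlist_size=2, top_k=1)
```

In this toy corpus the prefix pass ranks doc 0 first, but doc 1 is also shortlisted and wins once the full vectors are compared, which is the behavior the re-ranking stage is meant to recover.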
The nested-loss formulation creates a strict ordering on the embedding coordinates. The first m_1 coordinates must solve the task on their own, so the encoder is pressured to put the most informative signal there. Each later block must add information that improves the loss when concatenated with what comes before. Capacity spent on redundant directions does not reduce the loss at larger dimensions, so the optimizer is pressured to allocate orthogonal, finer-grained features to later coordinates. Empirically, Kusupati and colleagues report that MRL representations match or slightly exceed independently trained low-dimensional baselines at every m in M, with no penalty at the full dimension.
The paper was submitted to arXiv as 2205.13147 on 26 May 2022 under the original title "Matryoshka Representations for Adaptive Deployment." It was accepted to NeurIPS 2022 and presented in November 2022, after which the title was shortened to "Matryoshka Representation Learning." The eleven authors are Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, and Ali Farhadi, drawn primarily from the Paul G. Allen School of Computer Science and Engineering at the University of Washington, with additional affiliations at Google Research, the Allen Institute for AI, and Apple. First author Aditya Kusupati was a PhD student at the University of Washington advised by Ali Farhadi and Sham Kakade, with prior research at Microsoft Research India under Prateek Jain. Reference code is hosted at github.com/RAIVNLab/MRL by the Reasoning and Interactive Vision Lab (RAIVNLab).
The original paper evaluates MRL across image classification, image retrieval, vision-language alignment, and language modeling. The headline result is that a single MRL model matches the accuracy of separately trained low-dimensional models at every tested dimension on ImageNet-1K, while also enabling substantial retrieval speedups through adaptive retrieval.
| Setting | Architecture / dataset | MRL claim from the paper |
|---|---|---|
| Image classification | ResNet-50 on ImageNet-1K | Up to 14x smaller embeddings at equal classification accuracy, evaluated at d in {8, 16, 32, ..., 2048} |
| Image retrieval | ResNet-50 on ImageNet-1K, 1.28 million-image corpus | Up to 14x real-world wall-clock speedup at equal Top-1 accuracy versus unshortened d = 2048 baseline using Adaptive Retrieval |
| Image retrieval | ResNet-50 on ImageNet-4K, 4.2 million-image corpus | Adaptive retrieval with first 16-dim shortlist and full re-rank matches full-d Top-1 accuracy |
| Few-shot classification | Long-tail few-shot ImageNet | Up to roughly 2% accuracy improvement at low dimensions versus independently trained baselines |
| Vision-language alignment | ALIGN-style two-tower model | MRL preserves zero-shot ImageNet classification accuracy across nested dimensions |
| Language modeling | BERT-style encoder | Sentence embeddings trained with MRL retain accuracy across truncated dimensions |
| Robustness | ImageNet-A, ImageNet-R, ImageNet-V2, ImageNet-Sketch | MRL representations are as robust as the full-dim baseline across all out-of-distribution sets |
The FLOPs reduction in the retrieval experiments comes from two compounding effects. An inner-product comparison between two d-dimensional vectors costs O(d) operations, so a 16-dim shortlist instead of a 2048-dim one reduces per-comparison cost by 128x. Approximate nearest-neighbor indexes such as ScaNN or HNSW also benefit because their inverted-list scoring and graph traversal touch fewer floating-point coordinates per visit.
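A back-of-the-envelope calculation makes the per-comparison saving concrete. The sketch assumes exhaustive scoring and counts one multiply-add per coordinate; the shortlist size of 200 is a hypothetical value chosen for illustration, and the function names are not from the paper.

```python
def brute_force_cost(num_docs, dim):
    """Multiply-adds for scoring one query against every document."""
    return num_docs * dim

def adaptive_cost(num_docs, shortlist_dim, shortlist_size, full_dim):
    """Multiply-adds for a prefix shortlist pass plus full-vector re-ranking."""
    return num_docs * shortlist_dim + shortlist_size * full_dim

num_docs = 1_280_000  # roughly the ImageNet-1K corpus size cited above
full = brute_force_cost(num_docs, 2048)
two_stage = adaptive_cost(num_docs, shortlist_dim=16, shortlist_size=200, full_dim=2048)

per_comparison_ratio = 2048 // 16     # 128
end_to_end_ratio = full / two_stage   # slightly below 128 because of re-ranking
```

Under these assumptions the re-ranking pass costs far less than the shortlist pass, so the end-to-end ratio stays close to the per-comparison ratio of 128.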
MRL went from an academic technique to a default building block of commercial embedding models in roughly 18 months. The earliest large-scale public adoption was OpenAI's text-embedding-3 family in January 2024, followed by a wave of open-weight releases from Nomic, Mixedbread, Jina AI, and Voyage AI through 2024.
On 25 January 2024 OpenAI released text-embedding-3-small and text-embedding-3-large, a new generation of text embedding models that replaced text-embedding-ada-002. Both models were trained with MRL and exposed to API users through a dimensions parameter that lets the caller specify how many coordinates to return. text-embedding-3-small defaults to 1,536 dimensions and can be truncated to 512; text-embedding-3-large defaults to 3,072 and can be truncated to 256. OpenAI reported that text-embedding-3-large truncated to 256 dimensions still outperforms text-embedding-ada-002 at its native 1,536 dimensions on the MTEB benchmark. The launch post described the technique in plain language: "developers can shorten embeddings (i.e. remove some numbers from the end of the sequence) without the embedding losing its concept-representing properties." The change was widely interpreted as the first mainstream commercial deployment of MRL.
Nomic AI followed with nomic-embed-text-v1.5, released on 14 February 2024 as an open-weights model under Apache 2.0. Where the original nomic-embed-text-v1 produced 768-dim vectors at a single size, v1.5 was retrained with MRL and supports any dimension between 64 and 768. Nomic recommends canonical sizes of 768, 512, 256, 128, and 64, and reports that MTEB scores degrade only slightly between 768 and 256 dimensions.
Mixedbread, a German embedding startup, has shipped multiple MRL-trained models since early 2024. Their flagship mxbai-embed-large-v1 (335 million parameters) combines Matryoshka loss with quantization-aware training, allowing users to truncate and optionally binarize for a stacked compression factor of 64x relative to float32 full-dimensional baselines. Recommended dimensions are 1024, 768, 512, 256, 128, and 64.
Jina AI released jina-embeddings-v3 on 18 September 2024. The model covers 89 languages and supports MRL output sizes from 1024 down to 32 dimensions, with task-specific LoRA adapters for retrieval, separation, classification, and STS. Jina reports that the model ranked 2nd on the MTEB English leaderboard among models under 1 billion parameters at release. Voyage AI shipped voyage-code-3 on 4 December 2024, a code-retrieval embedding model trained with MRL that supports output dimensions of 256, 512, 1024, and 2048, with optional int8 and binary quantization. Voyage reports that voyage-code-3 at 256 dimensions matches or exceeds voyage-code-2 at 1536 dimensions on internal code search benchmarks.
The Sentence Transformers library, maintained by Tom Aarsen at Hugging Face, added native MRL support in version 2.5 in early 2024. The library exposes a MatryoshkaLoss class that wraps any inner loss (such as MultipleNegativesRankingLoss or CoSENTLoss) and evaluates it at a user-specified set of nested dimensions:
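```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss
# Illustrative choices: any encoder with a 768-dim output and any supported inner loss.
model = SentenceTransformer("microsoft/mpnet-base")
base_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(
    model=model,
    loss=base_loss,
    matryoshka_dims=[768, 512, 256, 128, 64],
)
```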
Sentence Transformers also ships Matryoshka2dLoss, which combines MatryoshkaLoss with AdaptiveLayerLoss to trade off both the number of output dimensions and the number of transformer layers used at inference. Hugging Face documentation reports retention of 98 percent of full-dimensional STSBenchmark performance at 8.3 percent of the original embedding size.
| Model | Provider | Release date | Native dim | Truncatable down to | License |
|---|---|---|---|---|---|
| text-embedding-3-small | OpenAI | January 25, 2024 | 1,536 | 512 | Proprietary, API only |
| text-embedding-3-large | OpenAI | January 25, 2024 | 3,072 | 256 | Proprietary, API only |
| nomic-embed-text-v1.5 | Nomic AI | February 14, 2024 | 768 | 64 | Apache 2.0 |
| mxbai-embed-large-v1 | Mixedbread | March 2024 | 1,024 | 64 | Apache 2.0 |
| mxbai-embed-2d-large-v1 | Mixedbread | April 2024 | 1,024 | 64 (+ layer truncation) | Apache 2.0 |
| jina-embeddings-v3 | Jina AI | September 18, 2024 | 1,024 | 32 | CC BY-NC 4.0 |
| voyage-code-3 | Voyage AI | December 4, 2024 | 2,048 | 256 | Proprietary, API only |
| Snowflake Arctic-Embed 2 | Snowflake | December 2024 | 1,024 | 256 | Apache 2.0 |
| GTE-Qwen2 family | Alibaba | 2024 | 1,536 to 4,096 | varies | Apache 2.0 |
Several open-weight models on the list pair Matryoshka loss with quantization-aware training (int8 or 1-bit binary), so a fully compressed vector can be a few dozen bytes per document while still supporting useful semantic search.
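The "truncate, then binarize" step can be sketched as below. This is a minimal illustration, assuming sign-based 1-bit quantization with a threshold of zero and Hamming distance for comparison; the helper names and toy vectors are hypothetical, and real models may calibrate thresholds differently.

```python
def binarize(vector, k):
    """Truncate to the first k coordinates, then keep one sign bit per coordinate."""
    bits = 0
    for value in vector[:k]:
        bits = (bits << 1) | (1 if value > 0 else 0)
    # Pack into bytes; if k is not a multiple of 8 the leading bits are zero padding.
    return bits.to_bytes((k + 7) // 8, "big")

def hamming_distance(a, b):
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

doc = [0.3, -0.2, 0.7, 0.1, -0.5, 0.4, -0.1, 0.9, 0.2, -0.3]
query = [0.2, -0.1, 0.6, -0.4, -0.3, 0.5, 0.1, 0.8, -0.6, 0.4]

doc_code = binarize(doc, 8)       # 8 coordinates packed into 1 byte
query_code = binarize(query, 8)
distance = hamming_distance(doc_code, query_code)  # signs differ at 2 positions
```

With the same packing, a 512-coordinate prefix occupies 64 bytes, which is the "few dozen bytes per document" regime described above.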
MRL is one of several methods for fitting embedding systems into limited memory or compute budgets, best understood by comparison with the alternatives it competes with or complements.
| Technique | Where it acts | Trained jointly with encoder? | Storage savings | Quality vs natively trained low-d | Notes |
|---|---|---|---|---|---|
| Matryoshka Representation Learning | Output embedding, by truncation | Yes | Linear in number of dropped coordinates | Matches or exceeds independently trained low-d models | Adds a few percent training cost; no inference overhead |
| PCA truncation | Output embedding, post hoc linear projection | No, fitted on encoded corpus | Linear, requires storing projection matrix | Typically below natively trained low-d at the same target size | Easy to apply to any pretrained encoder |
| Random projection (Johnson-Lindenstrauss) | Output embedding, post hoc | No | Linear | Worse than PCA at same size | Useful when corpus is too large to fit PCA |
| Product Quantization (PQ) and OPQ | Vector codebooks for ANN search | No (codebooks fitted on encoded corpus) | Logarithmic in number of codewords | Lossy but acceptable for ANN | Standard in FAISS and ScaNN; combines well with MRL |
| Scalar / binary quantization | Per-coordinate precision (float32 to int8 / 1-bit) | Optionally yes (quantization-aware training) | 4x to 32x | Small accuracy drop with QAT | Stacks cleanly with MRL: truncate then quantize |
| Distillation to a smaller student | Whole encoder | Yes (separate run) | Linear in student size | Excellent but requires retraining per target | Often used together with MRL, not as a replacement |
| Pruning / low-rank factorization | Encoder weights | Sometimes | Reduces inference cost too | Variable | Targets compute rather than embedding size |
| 2D Matryoshka (Matryoshka2d) | Output embedding and layer count | Yes | Linear in both axes | Slightly below 1D MRL at extremes | Reduces both storage and compute |
The practical recipe used by most 2024 embedding releases combines MRL with quantization. Because each technique acts on a different axis (coordinate count, bits per coordinate, ANN codebook packing), the savings multiply: a 1024-dim float32 vector occupying 4 kilobytes can be reduced to a 256-dim int8 vector occupying 256 bytes (16x compression) with only a few MTEB points of accuracy loss. MRL is also distinct from hierarchical embeddings such as Poincare or hyperbolic embeddings, which impose structure on the geometry of the embedding space; MRL imposes structure on the ordering of coordinates within a Euclidean embedding.
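The storage arithmetic can be checked with a short calculation. The sketch assumes vectors are stored densely with no index overhead; the helper names are hypothetical.

```python
BITS_PER_COORDINATE = {"float32": 32, "float16": 16, "int8": 8, "binary": 1}

def bytes_per_vector(dim, dtype):
    """Storage for one embedding, ignoring index overhead (dim * bits must divide by 8)."""
    return dim * BITS_PER_COORDINATE[dtype] // 8

def compression_factor(full_dim, full_dtype, dim, dtype):
    """How many times smaller the compressed vector is than the baseline."""
    return bytes_per_vector(full_dim, full_dtype) / bytes_per_vector(dim, dtype)

baseline = bytes_per_vector(1024, "float32")                      # 4096 bytes
int8_256 = bytes_per_vector(256, "int8")                          # 256 bytes
factor_int8 = compression_factor(1024, "float32", 256, "int8")    # 16.0
factor_bin = compression_factor(1024, "float32", 512, "binary")   # 64.0
```

The factors multiply because truncation reduces the coordinate count (4x from 1024 to 256) while quantization reduces the bits per coordinate (4x from float32 to int8).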
Three complementary perspectives help explain why nested losses produce useful prefix representations. Viewed as multi-task learning, each sub-dimension defines a separate task and the shared encoder must satisfy all of them simultaneously; the regularization effect of shared training tends to improve, rather than hurt, full-dimensional accuracy. Viewed as an approximate spectral decomposition (a view developed in follow-up work by Aniket Rege and collaborators in 2024), the optimal coordinates of a Matryoshka embedding align with the top singular directions of the data distribution, in the same way PCA components are ordered by explained variance. This view also suggests why MRL outperforms PCA in practice: the encoder is nonlinear and jointly optimized for the downstream task, rather than minimizing a fixed reconstruction objective. From an information-theoretic angle, early coordinates carry the most mutual information with the target while later coordinates carry refinements with diminishing returns, consistent with the empirical observation that the marginal accuracy gain from each additional doubling of dimension shrinks monotonically.
Since 2022 the matryoshka principle has been extended along several axes beyond the embedding dimension. 2D Matryoshka Sentence Embeddings, proposed in early 2024 by Xianming Li, Zongxi Li, Jing Li, and Haoran Xie (arXiv:2402.14776), generalize MRL to also nest along the layer dimension of a transformer, so any prefix of the layers and any prefix of the output dimensions yields a usable embedding. Mixedbread's mxbai-embed-2d-large-v1 implements this approach. Apple researchers published Matryoshka Diffusion Models (arXiv:2310.15111) in October 2023, which apply nested resolution training to image diffusion so a single denoiser generates consistent outputs at many resolutions simultaneously. MatryoshkaKV (arXiv:2410.14731), from October 2024, applies the matryoshka idea to KV-cache compression in large language model inference: trainable orthogonal projections are learned per layer and per head, and a matryoshka loss is applied across nested target ranks so the same projection can be truncated at inference time. The authors report retaining more than 90 percent of baseline accuracy at a 60 percent average cache compression rate on LLaMA-2 7B and Mistral 7B. Further 2024 and 2025 extensions mix matryoshka training with quantization-aware training to produce single models that serve embeddings at many (dimension, bit-width) operating points.
In practice the matryoshka recipe has matured into a small set of conventions: nested dimensions follow a roughly geometric series, usually doubling at each step (typically {32 or 64, 128, 256, 512, 768 or 1024}); per-dimension loss weights are uniform; the smallest target is 32 or 64 (below 16, accuracy drops sharply on most tasks); classifier weights are usually tied across dimensions when label counts are large (MRL-E); MRL is increasingly paired with int8 or binary quantization; and per-dimension MTEB or Top-1 accuracy is reported at each evaluation checkpoint. A common engineering pitfall is to train a model at a single dimension and then truncate its embeddings after the fact, which almost always destroys quality because the model was never trained to keep important information at the front of the vector. MRL must be present in the training objective; it is not a post hoc transformation.
MRL has had unusually rapid uptake for a methodology paper. By the end of 2024, every major commercial provider of off-the-shelf text embeddings (OpenAI, Cohere, Voyage AI, Nomic, Mixedbread, Jina AI, Snowflake, and Alibaba's GTE family) had shipped at least one MRL-trained model. Open-source vector databases including Weaviate, Vespa, Qdrant, Milvus, and pgvector published guides on ingesting matryoshka embeddings, storing them at full precision in primary indexes, and serving queries against truncated representations for low-latency lookups. Independent academic citations have run into the thousands by 2026 according to Google Scholar, and the phrase "matryoshka embeddings" is now in common use to describe any embedding that supports prefix-based truncation. The technique has been less disruptive in image and video retrieval, where product quantization and learned binary codes remain dominant, but it has shaped how multimodal systems are deployed under memory pressure.
MRL trades a small amount of accuracy for flexibility. At very small dimensions (typically 16 or below), MRL models can fall behind natively trained low-dimensional baselines, with classification tolerating aggressive truncation better than fine-grained retrieval. The nested-loss formulation also interacts non-trivially with very large batch sizes and heavy negative-sampling regimes, requiring careful tuning of the smallest dimensions. MRL controls only the dimensional axis; it does not reduce encoder compute, latency, or parameter count by itself, so it is typically paired with distillation, pruning, layer-skipping, or 2D Matryoshka. Theoretical understanding remains partial: the link between matryoshka prefixes and spectral decomposition is informal, and sharper guarantees on the optimality of nested prefixes remain an active topic.