Matryoshka representation learning

Updated Jun 24, 2026

Matryoshka Representation Learning (MRL) is a representation learning technique that trains a single neural model to produce embedding vectors that remain useful when truncated to many smaller dimensionalities. Named after Russian nesting dolls, the method packs coarse-to-fine information into a single output vector so that the first k coordinates form a self-contained representation for many values of k. On ImageNet-1K its authors report up to a 14x smaller embedding at the same classification accuracy and up to a 14x real-world speedup (128x in theoretical FLOPs) for large-scale retrieval, while the truncated vectors stay "at least as accurate and rich as independently trained low-dimensional representations." ^[1] MRL was introduced in May 2022 by Aditya Kusupati and colleagues at the University of Washington, Google Research, and the Allen Institute for AI in the paper "Matryoshka Representation Learning" (arXiv:2205.13147), published at NeurIPS 2022. ^[1]^[2]

The technique adds a sum of losses computed on nested prefixes of the embedding to the standard training pipeline, with no architecture changes and no measurable cost at inference. Downstream systems can shorten the model's output embeddings at will, trading a small amount of accuracy for large savings in memory, storage, and search latency without retraining. MRL became one of the most widely adopted techniques in modern embedding stacks. It powers OpenAI's text-embedding-3 family (January 2024), Nomic Embed Text v1.5 (February 2024), Jina Embeddings v3 (September 2024), voyage-code-3 (December 2024), and several Mixedbread (mxbai) releases. ^[4]^[5]^[8]^[9] The coarse-to-fine principle has also been extended to transformer depth, image resolution, and key-value caches.

What problem does MRL solve?

Deep learning systems typically learn fixed-dimensional vector representations whose size is set at training time. For information retrieval, large-scale classification, and approximate nearest-neighbor search, embeddings often need to be small to fit in memory and allow fast distance computations, but smaller embeddings tend to be less accurate. Practitioners historically resolved this trade-off in one of three ways: train multiple separate models at different dimensions; apply post hoc dimensionality reduction such as principal component analysis (PCA), random projection, or product quantization; or distill a smaller student from a large teacher. Each approach has drawbacks. Training many models is expensive. Linear projections discard information the encoder spent compute learning, and are fitted over a corpus rather than jointly with the encoder, so they typically underperform a model trained natively at the target dimension. Distillation requires a separate student per target size.

The paper frames the goal as a single question: "can we design a flexible representation that can adapt to multiple downstream tasks with varying computational resources?" ^[1] MRL answers it by making the model itself responsible for producing embeddings simultaneously usable at many dimensionalities. Because the same embedding is read at multiple granularities during training, the optimizer is forced to place the most important information at the beginning of the vector, with later coordinates progressively refining the representation. Each smaller embedding is literally a prefix of the larger one, in the same way each smaller matryoshka doll fits inside the next.

How does the MRL method work?

Training objective

Let f_theta(x) in R^d be the d-dimensional output of an encoder f with parameters theta, and let M = {m_1, m_2, ..., m_L} be a set of nested dimensions with m_1 < m_2 < ... < m_L = d. For each chosen sub-dimension m in M, the first m coordinates of f_theta(x) are passed through a task head and an associated loss. The MRL training loss is a weighted sum of these per-dimension losses:

L_MRL(theta) = sum over m in M of c_m * L_m(f_theta(x)[1:m], y)

where L_m is the task loss (for example, cross-entropy for classification or a contrastive loss for retrieval), f_theta(x)[1:m] denotes the first m coordinates, y is the target, and c_m is an optional weight (the original paper uses uniform c_m = 1). The original paper evaluated nested sets such as {8, 16, 32, 64, 128, 256, 512, 1024, 2048} for ImageNet-1K with a ResNet-50, and similar geometric series for vision transformers, ALIGN-style vision-language models, and BERT-style text encoders. ^[1] A lighter variant, MRL-Efficient (MRL-E), shares classifier weight matrices across dimensions through nesting, saving parameters on very large heads such as the 21,000-class ImageNet-21K head with virtually identical accuracy.
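The weighted sum above can be sketched in a few lines of plain Python. This is a minimal illustration for a single example with cross-entropy as the task loss, not the authors' implementation; the function names and the `heads` layout (one separate classifier matrix per nested dimension, as in plain MRL rather than MRL-E) are assumptions made for this sketch.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of one example given raw class logits."""
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_z - logits[label]


def mrl_loss(embedding, label, heads, weights=None):
    """Weighted sum of task losses over nested prefixes of one embedding.

    heads:   dict mapping each nested dimension m to a classifier matrix
             with one row of m weights per class (hypothetical layout).
    weights: optional dict of c_m values; uniform c_m = 1 when omitted.
    """
    total = 0.0
    for m, class_rows in heads.items():
        prefix = embedding[:m]  # the first m coordinates only
        logits = [sum(w * z for w, z in zip(row, prefix)) for row in class_rows]
        c_m = 1.0 if weights is None else weights[m]
        total += c_m * softmax_cross_entropy(logits, label)
    return total
```

In a real training pipeline the same sum would be computed over batches inside an automatic-differentiation framework, so that gradients from every prefix flow back into the shared encoder.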

Inference and adaptive deployment

At inference time the encoder is unchanged. A consumer of the embedding decides how many leading coordinates to keep and simply truncates the vector. There is no separate sub-encoder, no additional forward pass, and truncation can happen on the client side, so a single vector index built at full dimension can also serve queries from clients that store only the truncated form. The authors call this "adaptive deployment." ^[1]
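Client-side truncation can be illustrated with a short helper. This is a sketch under assumptions: the function name is hypothetical, and the optional L2 re-normalization step is not described in the paper summary above but is a common practice when the shortened vectors are compared with cosine similarity or dot products.

```python
import math


def truncate_embedding(vec, keep_dims, renormalize=True):
    """Keep the first keep_dims coordinates of an MRL-trained embedding."""
    if not 0 < keep_dims <= len(vec):
        raise ValueError("keep_dims must be between 1 and len(vec)")
    prefix = list(vec[:keep_dims])
    if renormalize:
        # Assumption: rescale to unit length so dot products act as cosines.
        norm = math.sqrt(sum(v * v for v in prefix))
        if norm > 0:
            prefix = [v / norm for v in prefix]
    return prefix
```

The helper only makes sense for vectors produced by a model trained with a nested objective; applied to an ordinary embedding it will still run but the prefix is not expected to be a good representation.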

MRL pairs naturally with Adaptive Retrieval, in which a vector database or search system uses a short prefix (for example, the first 16 to 64 coordinates) to perform an inexpensive shortlist retrieval, then re-ranks the shortlist with longer prefixes or the full vector. On ImageNet-1K and ImageNet-4K the authors report up to a 128-fold reduction in theoretical FLOPs per query and up to a 14-fold real-world wall-clock speedup, while matching the Top-1 retrieval accuracy of the unshortened d = 2048 embedding. ^[1]^[2]
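The shortlist-then-rerank pattern described above can be sketched as follows. The sketch assumes exhaustive dot-product scoring over an in-memory list instead of an approximate nearest-neighbor index, and the function name, parameter names, and default values are illustrative rather than taken from the paper.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def adaptive_retrieve(query, corpus, shortlist_dim=16, shortlist_size=200, top_k=10):
    """Two-stage retrieval over MRL embeddings; returns corpus indices."""
    # Stage 1: score every document using only the leading coordinates.
    q_short = query[:shortlist_dim]
    shortlist = sorted(
        range(len(corpus)),
        key=lambda i: dot(q_short, corpus[i][:shortlist_dim]),
        reverse=True,
    )[:shortlist_size]
    # Stage 2: re-rank just the shortlist with the full-length vectors.
    reranked = sorted(shortlist, key=lambda i: dot(query, corpus[i]), reverse=True)
    return reranked[:top_k]
```

Only the first stage touches the whole corpus, which is why shrinking `shortlist_dim` dominates the cost savings; the second stage restores the ordering that the full vector would have produced, provided the true neighbors survive the shortlist.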

Why nested losses produce coarse-to-fine codes

The nested-loss formulation creates a strict ordering on the embedding coordinates. The first m_1 coordinates must solve the task on their own, so the encoder is pressured to put the most informative signal there. Each later block must add information that improves the loss when concatenated with what comes before. Capacity spent on redundant directions does not reduce the loss at larger dimensions, so the optimizer is pressured to allocate orthogonal, finer-grained features to later coordinates. Empirically, Kusupati and colleagues report that MRL representations match or slightly exceed independently trained low-dimensional baselines at every m in M, with no penalty at the full dimension. ^[1]

Who created MRL and when was it published?

The paper was submitted to arXiv as 2205.13147 on 26 May 2022 under the original title "Matryoshka Representations for Adaptive Deployment." ^[1] It was accepted to NeurIPS 2022 and presented in November 2022, after which the title was shortened to "Matryoshka Representation Learning." The eleven authors are Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, and Ali Farhadi, drawn primarily from the Paul G. Allen School of Computer Science and Engineering at the University of Washington, with additional affiliations at Google Research, the Allen Institute for AI, and Apple. ^[1] First author Aditya Kusupati was a PhD student at the University of Washington advised by Ali Farhadi and Sham Kakade, with prior research at Microsoft Research India under Prateek Jain. Reference code is hosted at github.com/RAIVNLab/MRL by the Reasoning and Interactive Vision Lab (RAIVNLab). ^[3]

What benchmark results did the original paper report?

The original paper evaluates MRL across image classification, image retrieval, vision-language alignment, and language modeling. The headline result is that a single MRL model matches the accuracy of separately trained low-dimensional models at every tested dimension on ImageNet-1K, while also enabling substantial retrieval speedups through adaptive retrieval. ^[1]

Setting	Architecture / dataset	MRL claim from the paper
Image classification	ResNet-50 on ImageNet-1K	Up to 14x smaller embeddings at equal classification accuracy, evaluated at d in {8, 16, 32, ..., 2048}
Image retrieval	ResNet-50 on ImageNet-1K, 1.28 million-image corpus	Up to 14x real-world wall-clock speedup at equal Top-1 accuracy versus unshortened d = 2048 baseline using Adaptive Retrieval
Image retrieval	ResNet-50 on ImageNet-4K, 4.2 million-image corpus	Adaptive retrieval with first 16-dim shortlist and full re-rank matches full-d Top-1 accuracy
Few-shot classification	Long-tail few-shot ImageNet	Up to roughly 2% accuracy improvement at low dimensions versus independently trained baselines
Vision-language alignment	ALIGN-style two-tower model	MRL preserves zero-shot ImageNet classification accuracy across nested dimensions
Language modeling	BERT-style encoder	Sentence embeddings trained with MRL retain accuracy across truncated dimensions
Robustness	ImageNet-A, ImageNet-R, ImageNet-V2, ImageNet-Sketch	MRL representations are as robust as the full-dim baseline across all out-of-distribution sets

The FLOPs reduction in the retrieval experiments comes from two compounding effects. First, an inner-product comparison between two d-dimensional vectors costs O(d) operations, so scoring a shortlist with 16-dim prefixes instead of 2048-dim vectors reduces per-comparison cost by 2048 / 16 = 128x. ^[1] Second, approximate nearest-neighbor indexes such as ScaNN or HNSW benefit because their inverted-list scoring and graph traversal touch fewer floating-point coordinates per visit.
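The arithmetic behind the 128x figure can be checked with a small calculation. The corpus size and dimensions come from the ImageNet-1K setting in the table above; the shortlist size of 200 and the exhaustive-scan cost model are illustrative assumptions, and the helper name is hypothetical.

```python
def per_query_cost(corpus_size, full_dim, shortlist_dim, shortlist_size):
    """Multiply-add counts for exhaustive scoring: full scan vs. two stages."""
    full_scan = corpus_size * full_dim
    # Two-stage cost: short prefixes over everything, full vectors on the shortlist.
    two_stage = corpus_size * shortlist_dim + shortlist_size * full_dim
    return full_scan, two_stage


full_scan, two_stage = per_query_cost(1_280_000, 2048, 16, 200)
per_comparison_ratio = 2048 / 16            # cost ratio of a single comparison
end_to_end_ratio = full_scan / two_stage    # a little lower: re-ranking is not free
```

Under these assumptions the end-to-end ratio lands slightly below the per-comparison ratio, because the re-ranking pass adds a small fixed cost that the per-comparison figure ignores.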

Which industry embedding models use MRL?

MRL went from an academic technique to a default building block of commercial embedding models in under two years. The earliest large-scale public adoption was OpenAI's text-embedding-3 family in January 2024, followed by a wave of open-weight releases from Nomic, Mixedbread, Jina AI, and Voyage AI through 2024.

OpenAI text-embedding-3

On 25 January 2024 OpenAI released text-embedding-3-small and text-embedding-3-large, a new generation of text embedding models that replaced text-embedding-ada-002. ^[4] Both models were trained with MRL and exposed to API users through a dimensions parameter that lets the caller specify how many coordinates to return. text-embedding-3-small defaults to 1,536 dimensions and can be truncated to 512; text-embedding-3-large defaults to 3,072 and can be truncated to 256. ^[4]^[18] OpenAI reported that text-embedding-3-large truncated to 256 dimensions still outperforms text-embedding-ada-002 at its native 1,536 dimensions on the MTEB benchmark. ^[4] The launch post described the technique in plain language: "developers can shorten embeddings (i.e. remove some numbers from the end of the sequence) without the embedding losing its concept-representing properties." ^[4] The change was widely interpreted as the first mainstream commercial deployment of MRL.

Nomic Embed v1.5

Nomic AI followed with nomic-embed-text-v1.5, released on 14 February 2024 as an open-weights model under Apache 2.0. ^[5]^[6] Where the original nomic-embed-text-v1 produced 768-dim vectors at a single size, v1.5 was retrained with MRL and supports any dimension between 64 and 768. Nomic recommends canonical sizes of 768, 512, 256, 128, and 64, and reports that MTEB scores degrade only slightly between 768 and 256 dimensions. ^[5]

Mixedbread mxbai models

Mixedbread, a German embedding startup, has shipped multiple MRL-trained models since early 2024. Their flagship mxbai-embed-large-v1 (335 million parameters) combines Matryoshka loss with quantization-aware training, allowing users to truncate and optionally binarize for a stacked compression factor of 64x relative to float32 full-dimensional baselines. ^[10] Recommended dimensions are 1024, 768, 512, 256, 128, and 64.
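The truncate-then-binarize idea can be illustrated with a short sketch. The sign threshold at zero, the most-significant-bit-first packing order, and the function names are assumptions chosen for illustration; they are not taken from Mixedbread's documentation.

```python
def truncate_and_binarize(vec, keep_dims):
    """Keep the first keep_dims coordinates and pack their signs into bytes."""
    bits = [1 if v > 0 else 0 for v in vec[:keep_dims]]
    packed = bytearray()
    for start in range(0, len(bits), 8):
        chunk = bits[start:start + 8]
        byte = 0
        for bit in chunk:
            byte = (byte << 1) | bit
        byte <<= 8 - len(chunk)  # left-align a final partial chunk
        packed.append(byte)
    return bytes(packed)


def hamming_distance(a, b):
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))
```

For example, a 1,024-dim float32 vector (4,096 bytes) truncated to 512 coordinates and binarized this way occupies 64 bytes, a 64x reduction, and candidates can then be compared with Hamming distance instead of floating-point dot products.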

Jina Embeddings v3 and Voyage AI

Jina Embeddings v3 was released on 18 September 2024 by Jina AI. It supports MRL output sizes from 32 to 1,024 dimensions and covers 89 languages, with task-specific LoRA adapters for retrieval, separation, classification, and STS. Jina reports the model ranked 2nd on the MTEB English leaderboard among models under 1 billion parameters at release. ^[8] Voyage AI shipped voyage-code-3 on 4 December 2024, a code-retrieval embedding trained with MRL and supporting output dimensions of 256, 512, 1024, and 2048, with optional int8 and binary quantization. Voyage reports voyage-code-3 at 256 dimensions matches or exceeds voyage-code-2 at 1536 dimensions on internal code search benchmarks. ^[9]

Sentence transformers ecosystem

The Sentence Transformers library, maintained by Tom Aarsen at Hugging Face, added native MRL support in version 2.4 in February 2024. The library exposes a MatryoshkaLoss class that wraps any inner loss (such as MultipleNegativesRankingLoss or CoSENTLoss) and evaluates it at a user-specified set of nested dimensions:

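from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss
model = SentenceTransformer("microsoft/mpnet-base")  # placeholder 768-dim encoder; substitute your own
base_loss = MultipleNegativesRankingLoss(model)  # any inner loss can be wrapped
loss = MatryoshkaLoss(
    model=model,
    loss=base_loss,
    matryoshka_dims=[768, 512, 256, 128, 64],
)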

Sentence Transformers also ships Matryoshka2dLoss, which combines MatryoshkaLoss with AdaptiveLayerLoss to trade off both the number of output dimensions and the number of transformer layers used at inference. In a Hugging Face reference experiment on the STSBenchmark test set, at 8.3 percent of the original embedding size (64 of 768 dimensions) a Matryoshka-trained model preserved 98.37 percent of full-dimensional performance, versus 96.46 percent for a model trained the standard way. ^[7]

Summary table of industry adoption

Model	Provider	Release date	Native dim	Truncatable down to	License
text-embedding-3-small	OpenAI	January 25, 2024	1,536	512	Proprietary, API only
text-embedding-3-large	OpenAI	January 25, 2024	3,072	256	Proprietary, API only
nomic-embed-text-v1.5	Nomic AI	February 14, 2024	768	64	Apache 2.0
mxbai-embed-large-v1	Mixedbread	March 2024	1,024	64	Apache 2.0
mxbai-embed-2d-large-v1	Mixedbread	April 2024	1,024	64 (+ layer truncation)	Apache 2.0
jina-embeddings-v3	Jina AI	September 18, 2024	1,024	32	CC BY-NC 4.0
voyage-code-3	Voyage AI	December 4, 2024	2,048	256	Proprietary, API only
Snowflake Arctic-Embed 2	Snowflake	December 2024	1,024	256	Apache 2.0
GTE-Qwen2 family	Alibaba	2024	1,536 to 4,096	varies	Apache 2.0

Several open-weight models on the list pair Matryoshka loss with quantization-aware training (int8 or 1-bit binary), so a fully compressed vector can be a few dozen bytes per document while still supporting useful semantic search.

How does MRL differ from PCA and other compression techniques?

MRL is one of several methods for fitting embedding systems into limited memory or compute budgets, best understood by comparison with the alternatives it competes with or complements.

Technique	Where it acts	Trained jointly with encoder?	Storage savings	Quality vs natively trained low-d	Notes
Matryoshka Representation Learning	Output embedding, by truncation	Yes	Linear in number of dropped coordinates	Matches or exceeds independently trained low-d models	Adds a few percent training cost; no inference overhead
PCA truncation	Output embedding, post hoc linear projection	No, fitted on encoded corpus	Linear, requires storing projection matrix	Typically below natively trained low-d at the same target size	Easy to apply to any pretrained encoder
Random projection (Johnson-Lindenstrauss)	Output embedding, post hoc	No	Linear	Worse than PCA at same size	Useful when corpus is too large to fit PCA
Product Quantization (PQ) and OPQ	Vector codebooks for ANN search	No (codebooks fitted on encoded corpus)	Logarithmic in number of codewords	Lossy but acceptable for ANN	Standard in FAISS and ScaNN; combines well with MRL
Scalar / binary quantization	Per-coordinate precision (float32 to int8 / 1-bit)	Optionally yes (quantization-aware training)	4x to 32x	Small accuracy drop with QAT	Stacks cleanly with MRL: truncate then quantize
Distillation to a smaller student	Whole encoder	Yes (separate run)	Linear in student size	Excellent but requires retraining per target	Often used together with MRL, not as a replacement
Pruning / low-rank factorization	Encoder weights	Sometimes	Reduces inference cost too	Variable	Targets compute rather than embedding size
2D Matryoshka (Matryoshka2d)	Output embedding and layer count	Yes	Linear in both axes	Slightly below 1D MRL at extremes	Reduces both storage and compute

The practical recipe used by most 2024 embedding releases combines MRL with quantization. Because each technique acts on a different axis (coordinate count, bits per coordinate, ANN codebook packing), the savings multiply: a 1024-dim float32 vector occupying 4 kilobytes can be reduced to a 256-dim int8 vector occupying 256 bytes (16x compression) with only a few MTEB points of accuracy loss. MRL is also distinct from hierarchical embeddings such as Poincare or hyperbolic embeddings, which impose structure on the geometry of the embedding space; MRL imposes structure on the ordering of coordinates within a Euclidean embedding.
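The storage figures in this paragraph follow from a one-line formula, sketched below. The helper name is hypothetical, and the binary row is an extra illustrative operating point not quoted in the text.

```python
def bytes_per_vector(dims, bits_per_coord):
    """Storage for one embedding, rounded up to whole bytes."""
    return (dims * bits_per_coord + 7) // 8


full = bytes_per_vector(1024, 32)    # float32 at full dimension: 4096 bytes
small = bytes_per_vector(256, 8)     # truncated to 256 dims, int8: 256 bytes
binary = bytes_per_vector(256, 1)    # truncated to 256 dims, 1-bit: 32 bytes

compression_int8 = full // small     # 16x, matching the figure in the text
compression_binary = full // binary  # 128x when binarized instead
```

Because the dimension factor (4x here) and the precision factor (4x for int8, 32x for binary) act on different axes, the overall ratio is their product.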

Theoretical interpretations

Three complementary perspectives help explain why nested losses produce useful prefix representations. As multi-task learning, each sub-dimension defines a separate task and the shared encoder must satisfy all of them simultaneously; the regularization effect of shared training tends to improve, rather than hurt, the full-dimensional accuracy. As an approximate spectral decomposition (a view developed in follow-up work by Aniket Rege and collaborators in 2024), the optimal coordinates of a Matryoshka embedding align with the top singular directions of the data distribution, in the same way PCA components are ordered by explained variance. ^[14] This view also suggests why MRL outperforms PCA in practice: the encoder is nonlinear and jointly optimized for the downstream task, rather than minimizing a fixed reconstruction objective. From an information-theoretic angle, early coordinates carry information that has high mutual information with the target, while later coordinates carry refinements with diminishing returns, consistent with the empirical observation that the marginal accuracy gain from each additional doubling shrinks monotonically.

Extensions and follow-up work

Since 2022 the matryoshka principle has been extended along several axes beyond the embedding dimension. 2D Matryoshka Sentence Embeddings, proposed in early 2024 by Xianming Li, Zongxi Li, Jing Li, and Haoran Xie (arXiv:2402.14776), generalize MRL to also nest along the layer dimension of a transformer, so any prefix of the layers and any prefix of the output dimensions yields a usable embedding. ^[11] Mixedbread's mxbai-embed-2d-large-v1 implements this approach. Apple researchers published Matryoshka Diffusion Models (arXiv:2310.15111) in October 2023, which apply nested resolution training to image diffusion so a single denoiser generates consistent outputs at many resolutions simultaneously. ^[12] MatryoshkaKV (arXiv:2410.14731), from October 2024, applies the matryoshka idea to KV-cache compression in large language model inference: trainable orthogonal projections are learned per layer and per head, and a matryoshka loss is applied across nested target ranks so the same projection can be truncated at inference time. The authors report retaining more than 90 percent of baseline accuracy at a 60 percent average cache compression rate on LLaMA-2 7B and Mistral 7B. ^[13] Further 2024 and 2025 extensions mix matryoshka training with quantization-aware training to produce single models that serve embeddings at many (dimension, bit-width) operating points.

Practical guidance

In practice the matryoshka recipe has matured into a small set of conventions: nested dimensions follow a geometric series with doubling steps (typically {32 or 64, 128, 256, 512, 768 or 1024}); per-dimension loss weights are uniform; the smallest target is 32 or 64 (below 16, accuracy drops sharply on most tasks); classifier weights are usually tied across dimensions when label counts are large (MRL-E); MRL is increasingly paired with int8 or binary quantization; and per-dimension MTEB or Top-1 accuracy is reported at each evaluation checkpoint. A common engineering pitfall is to train a model at a single dimension and then truncate after the fact, which almost always destroys quality because the model was never trained to keep important information at the front of the vector. MRL must be present in the training objective; it is not a post hoc transformation.
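The doubling convention for choosing nested dimensions can be written as a small helper. This is a sketch of one plausible rule (double from the smallest size, then append the full dimension), with a hypothetical function name; individual model releases may pick their sizes differently.

```python
def nested_dims(full_dim, smallest=64):
    """Doubling schedule from `smallest` up to, and always including, full_dim."""
    if smallest <= 0 or smallest > full_dim:
        raise ValueError("smallest must be in the range 1..full_dim")
    dims = []
    m = smallest
    while m < full_dim:
        dims.append(m)
        m *= 2
    dims.append(full_dim)  # the full dimension is always the last target
    return dims
```

Calling `nested_dims(2048, 8)` reproduces the ResNet-50 set listed in the training-objective section, and `nested_dims(768, 64)` gives the five sizes commonly recommended for 768-dim text encoders.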

Reception and impact

MRL has had unusually rapid uptake for a methodology paper. By the end of 2024, every major commercial provider of off-the-shelf text embeddings (OpenAI, Cohere, Voyage AI, Nomic, Mixedbread, Jina AI, Snowflake, and Alibaba's GTE family) had shipped at least one MRL-trained model. Open-source vector databases including Weaviate, Vespa, Qdrant, Milvus, and pgvector published guides on ingesting matryoshka embeddings, storing them at full precision in primary indexes, and serving queries against truncated representations for low-latency lookups. ^[15]^[16] Independent academic citations have run into the thousands by 2026 according to Google Scholar, and the phrase "matryoshka embeddings" is now in common use to describe any embedding that supports prefix-based truncation. The technique has been less disruptive in image and video retrieval, where product quantization and learned binary codes remain dominant, but it has shaped how multimodal systems are deployed under memory pressure.

Limitations and open problems

MRL trades a small amount of accuracy for flexibility. At very small dimensions (typically 16 or below), MRL models can fall behind natively trained low-dimensional baselines, with classification tolerating aggressive truncation better than fine-grained retrieval. The nested-loss formulation also interacts non-trivially with very large batch sizes and heavy negative-sampling regimes, requiring careful tuning of the smallest dimensions. MRL controls only the dimensional axis; it does not reduce encoder compute, latency, or parameter count by itself, so it is typically paired with distillation, pruning, layer-skipping, or 2D Matryoshka. Theoretical understanding remains partial: the link between matryoshka prefixes and spectral decomposition is informal, and sharper guarantees on the optimality of nested prefixes remain an active topic.

References

1. Kusupati, A. et al. "Matryoshka Representation Learning." arXiv:2205.13147, NeurIPS 2022. arxiv.org/abs/2205.13147.
2. Kusupati, A. et al. "Matryoshka Representation Learning." NeurIPS 35 (2022). proceedings.neurips.cc/paper_files/paper/2022/file/c32319f4868da7613d78af9993100e42-Paper-Conference.pdf.
3. RAIVNLab. "MRL code repository." github.com/RAIVNLab/MRL.
4. OpenAI. "New embedding models and API updates." 25 January 2024. openai.com/index/new-embedding-models-and-api-updates.
5. Nomic AI. "Unboxing Nomic Embed v1.5: Resizable Embeddings with MRL." February 2024. blog.nomic.ai/posts/nomic-embed-matryoshka.
6. Nomic AI. "nomic-embed-text-v1.5 model card." huggingface.co/nomic-ai/nomic-embed-text-v1.5.
7. Sentence Transformers. "Matryoshka Embeddings documentation." sbert.net/examples/sentence_transformer/training/matryoshka/README.html.
8. Jina AI. "Jina Embeddings v3: A Frontier Multilingual Embedding Model." 18 September 2024. jina.ai/news/jina-embeddings-v3-a-frontier-multilingual-embedding-model.
9. Voyage AI. "voyage-code-3 release." 4 December 2024. blog.voyageai.com/2024/12/04/voyage-code-3.
10. Mixedbread. "mxbai-embed-large-v1 model card." mixedbread.com/blog.
11. Li, X. et al. "2D Matryoshka Sentence Embeddings." arXiv:2402.14776, February 2024.
12. Gu, J. et al. "Matryoshka Diffusion Models." arXiv:2310.15111, October 2023.
13. Lin, H. et al. "MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection." arXiv:2410.14731, October 2024.
14. Rege, A. "Matryoshka Representation Learning (MRL) from the Ground Up." 2024. aniketrege.github.io/blog/2024/mrl.
15. Weaviate. "OpenAI's Matryoshka Embeddings in Weaviate." weaviate.io/blog/openais-matryoshka-embeddings-in-weaviate.
16. Vespa Engine. "OpenAI Matryoshka embeddings with Vespa." blog.vespa.ai/matryoshka-embeddings-in-vespa.
17. Zilliz. "MRL Explained: The Method Behind OpenAI's Efficient Text Embeddings." zilliz.com/blog/matryoshka-representation-learning-method-behind-openai-text-embeddings.
18. OpenAI. "Embeddings API documentation." platform.openai.com/docs/guides/embeddings.
19. Snowflake. "Arctic Embed 2.0 release." December 2024.
20. MTEB benchmark leaderboard. huggingface.co/spaces/mteb/leaderboard.
