Sentence-transformers/all-MiniLM-L6-v2 model
| Field | Value |
|---|---|
| Platform | Hugging Face |
| Repository | sentence-transformers/all-MiniLM-L6-v2 |
| Name | all-MiniLM-L6-v2 |
| User / Organization | sentence-transformers |
| Type | Natural Language Processing, Sentence Embedding |
| Task | Sentence Similarity, Feature Extraction |
| Library | PyTorch, TensorFlow, Rust, Sentence Transformers, Transformers |
| Architecture | BERT (6 layers, 384 hidden, 12 heads), ~22.7M parameters |
| Output dimension | 384 |
| Max sequence length | 256 word pieces |
| Base model | nreimers/MiniLM-L6-H384-uncased (layer-pruned from microsoft/MiniLM-L12-H384-uncased) |
| Released | August 2021 |
| Language | English |
| Paper | arxiv:2002.10957 (MiniLM), arxiv:1908.10084 (Sentence-BERT), arxiv:2104.08727 |
| License | Apache 2.0 |
| Related | BERT, MiniLM, Sentence-BERT, all-mpnet-base-v2 |
The sentence-transformers/all-MiniLM-L6-v2 model is a compact sentence-embedding model published on the Hugging Face Hub by the sentence-transformers project. It maps English sentences and short paragraphs into a 384-dimensional dense vector space, where semantically similar inputs lie close together under cosine similarity. The model is built on a six-layer MiniLM backbone derived from work by Microsoft Research, then fine-tuned on more than 1.17 billion sentence pairs using contrastive learning. It has roughly 22.7 million parameters and a model file size of about 80 MB.
Despite being released in 2021, all-MiniLM-L6-v2 has remained the de facto default embedding model for semantic search and retrieval-augmented generation tutorials across libraries such as LangChain, LlamaIndex, Haystack, and Chroma. Hugging Face reports more than 200 million downloads per month for the repository, putting it among the most downloaded models of any kind on the Hub. The combination of small size, fast CPU inference, permissive Apache 2.0 license, and competitive quality on general English text explains why so many production systems and prototypes still ship it as the first embedding model to try.
The model is the smallest member of the "all-*" family of Sentence-BERT style encoders that came out of a community project organized by Hugging Face in the summer of 2021. The project, titled "Train the Best Sentence Embedding Model Ever with 1B Training Pairs," was part of the JAX/Flax community sprint and aimed to combine a very large mixture of public sentence-pair datasets with a single contrastive training recipe so that the resulting models would generalize across many downstream tasks rather than excelling at one benchmark. The three most widely used siblings from that effort are all-MiniLM-L6-v2 (smallest, fastest), all-MiniLM-L12-v2 (slightly larger), and all-mpnet-base-v2 (largest, highest quality).
All three were trained with essentially the same data and loss; the only thing that differs is the underlying transformer backbone. The L6 variant uses a six-layer MiniLM checkpoint that traces its lineage back to Microsoft Research's 2020 MiniLM paper, which introduced a knowledge distillation technique called "deep self-attention distillation."
Under the hood, all-MiniLM-L6-v2 is a standard BERT-style encoder with reduced depth and width, followed by a mean-pooling layer that converts per-token contextual vectors into a single fixed-length sentence embedding.
| Property | Value |
|---|---|
| Model type | bert (Hugging Face transformers) |
| Hidden layers | 6 |
| Hidden size | 384 |
| Attention heads | 12 |
| Intermediate size (FFN) | 1,536 |
| Hidden activation | GELU |
| Vocabulary | WordPiece, 30,522 tokens (uncased) |
| Max position embeddings | 512 |
| Default max sequence length | 256 word pieces (tokenizer truncation) |
| Output embedding dimension | 384 |
| Pooling | Mean pooling over non-padded tokens |
| Output normalization | L2-normalized embeddings |
| Parameter count | ~22.7 million |
| Model file size | ~80 MB (FP32) |
A sentence is tokenized with the BERT uncased WordPiece tokenizer (30,522 tokens), truncated to 256 word pieces, and then passed through six transformer encoder layers. The token embeddings from the final layer are averaged using the attention mask so that padding tokens are ignored, and the resulting 384-dimensional vector is L2-normalized so that dot products on the output are equivalent to cosine similarities.
Although the underlying BERT config sets max_position_embeddings = 512, the sentence-transformers wrapper truncates inputs to 256 tokens by default. Longer documents need to be split into chunks; the model is not intended for full long-document encoding.
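When the model is loaded through the sentence-transformers library, this truncation length is exposed as the model's max_seq_length attribute. The following sketch (assuming a recent sentence-transformers release) shows how to inspect it and, cautiously, raise it toward the 512-position ceiling of the underlying BERT config; since the model was fine-tuned on short inputs, quality on longer text is not guaranteed.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
print(model.max_seq_length)  # 256 by default for this checkpoint

# Can be raised up to the 512 positions supported by the BERT config,
# but the checkpoint was trained on short inputs, so treat with care.
model.max_seq_length = 512
```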
The backbone is nreimers/MiniLM-L6-H384-uncased, a six-layer model created by Nils Reimers (the maintainer of the sentence-transformers library) by keeping every other layer of Microsoft's microsoft/MiniLM-L12-H384-uncased. That 12-layer model was the original "general purpose" MiniLM checkpoint released by Microsoft.
The 12-layer MiniLM was itself distilled from a BERT-base-sized teacher (specifically Microsoft's UniLM v2) using deep self-attention distillation, the technique introduced by Wang et al. (2020). Rather than matching hidden states or output logits, deep self-attention distillation trains the student to mimic two things in the teacher's last transformer layer: the attention probability distribution over keys, and the relation matrix formed by the scaled dot-product between value vectors. Because both targets are agnostic to the number of student layers and heads, the student can have a different shape from the teacher while still learning the teacher's attention behavior.
The practical chain therefore looks like:
| Stage | Model | Source |
|---|---|---|
| Teacher | UniLM v2 (BERT-base size, ~110M params) | Microsoft Research |
| Distilled student | microsoft/MiniLM-L12-H384-uncased (12 layers, 33M params) | Wang et al. 2020 |
| Layer-pruned variant | nreimers/MiniLM-L6-H384-uncased (6 layers, ~22.7M params) | Reimers |
| Sentence-embedding fine-tune | sentence-transformers/all-MiniLM-L6-v2 | Hugging Face JAX/Flax sprint, 2021 |
This is why the embedding model inherits the speed advantage of MiniLM (small width, few layers) without inheriting the masked language modeling head or any task-specific head from BERT.
The model was fine-tuned during the 2021 JAX/Flax community sprint using a single contrastive learning objective on a very large mixture of sentence-pair datasets. The training corpus and the loss are shared with the other "all-*" siblings.
The full training mixture contains 1,170,060,424 sentence pairs sampled from roughly 30 datasets. The largest contributors dominate the mix. The following table lists the biggest sources, taken from the model card:
| Dataset | Approximate pairs | What it provides |
|---|---|---|
| Reddit comments (2015 to 2018) | 726,484,430 | Conversational pairs (comment, reply) |
| S2ORC citation pairs (abstracts) | 116,288,806 | Scientific citation context |
| WikiAnswers duplicate questions | 77,427,422 | Paraphrased question pairs |
| PAQ (question, answer) | 64,371,441 | Generated QA pairs over Wikipedia |
| S2ORC citation pairs (titles) | 52,603,982 | Scientific paper titles in citation graph |
| S2ORC (title, abstract) | 41,769,185 | Scientific title to abstract |
| Stack Exchange (title, body) | 25,316,456 | Programming and technical Q&A |
| Stack Exchange (title+body, answer) | 21,396,559 | Question to accepted answer |
| Stack Exchange (title, answer) | 21,396,559 | Title to answer |
| MS MARCO triplets | 9,144,553 | Web passage retrieval triples |
| SNLI, MultiNLI, QQP, Flickr30k captions, SimpleWiki, Yahoo Answers, AltLex, NQ, TriviaQA, ELI5, WikiHow, SearchQA, GooAQ, CodeSearchNet, SPECTER and others | ~14M combined | Diverse smaller corpora |
The heavy Reddit weighting is a deliberate choice; Reddit comment threads supply enormous amounts of weakly aligned sentence pairs, and the resulting mixture is far more domain-diverse than any single retrieval dataset.
Training uses a contrastive objective known in the sentence-transformers library as Multiple Negatives Ranking Loss. For each batch of N sentence pairs (a_i, b_i), the model computes embeddings for all 2N sentences, builds the N x N matrix of cosine similarities between every a_i and every b_j, divides by a temperature, and applies a cross-entropy loss whose target is the diagonal. In other words, every other sentence in the batch acts as an in-batch negative for a_i, and the model is rewarded for ranking the true partner above all of them. This loss avoids the need to mine hard negatives explicitly and scales well with batch size. A minimal PyTorch sketch of the objective as described above follows; the function and variable names (in_batch_negatives_loss, emb_a, emb_b, scale) are illustrative assumptions, not the library's actual implementation.
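```python
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(emb_a, emb_b, scale=20.0):
    """Contrastive loss over a batch of N (a_i, b_i) embedding pairs.

    emb_a, emb_b: (N, d) tensors of sentence embeddings.
    scale: inverse temperature applied to the cosine similarities.
    """
    emb_a = F.normalize(emb_a, p=2, dim=1)
    emb_b = F.normalize(emb_b, p=2, dim=1)
    # N x N matrix: cosine similarity of every a_i with every b_j.
    sims = emb_a @ emb_b.T * scale
    # The true partner of a_i is b_i, i.e. the diagonal entry.
    targets = torch.arange(emb_a.size(0), device=emb_a.device)
    return F.cross_entropy(sims, targets)
```

The key hyperparameters of the actual training run are listed below.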
| Setting | Value |
|---|---|
| Hardware | 7 TPU v3-8 devices (Google TPU Research Cloud) |
| Framework | JAX / Flax |
| Steps | 100,000 |
| Global batch size | 1,024 (128 per TPU core) |
| Sequence length during training | 128 tokens |
| Optimizer | AdamW |
| Learning rate | 2e-5 |
| Warmup steps | 500 |
| Weighted dataset sampling | Yes (datasets sampled in proportion to a fixed weight per source) |
The Google TPU Research Cloud donated the hardware, which is why the training run was feasible despite the size of the data mixture.
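The original run used custom JAX/Flax code, but a rough equivalent of the recipe can be sketched with the sentence-transformers training API. The snippet below is illustrative only: the two placeholder pairs stand in for the ~1.17B-pair mixture, and the hyperparameters simply echo the table above.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the pruned 6-layer backbone rather than the finished embedding model.
model = SentenceTransformer("nreimers/MiniLM-L6-H384-uncased")

# Placeholder pairs standing in for the full sentence-pair mixture.
train_examples = [
    InputExample(texts=["How do I bake bread?", "Steps for baking a simple loaf"]),
    InputExample(texts=["What is the capital of France?", "Paris is the capital of France."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives contrastive objective.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=500,
    optimizer_params={"lr": 2e-5},
)
```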
The "all-*" family was published as a coherent set so users could pick a point on the size/speed/quality curve. The three core models are:
| Model | Backbone | Layers | Hidden | Output dim | Approx params | Relative speed | Quality |
|---|---|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | MiniLM | 6 | 384 | 384 | ~22.7M | Fastest (~5x faster than mpnet on CPU) | Good |
| all-MiniLM-L12-v2 | MiniLM | 12 | 384 | 384 | ~33M | Faster than mpnet | Slightly higher than L6 |
| all-mpnet-base-v2 | MPNet | 12 | 768 | 768 | ~110M | Slowest of the three | Highest |
For multilingual workloads, the recommended sibling is paraphrase-multilingual-MiniLM-L12-v2, which applies the same compact-MiniLM recipe with a 12-layer multilingual backbone trained on parallel data covering 50+ languages; the all-MiniLM-L6-v2 model itself is English-only.
On the Massive Text Embedding Benchmark (MTEB), all-MiniLM-L6-v2 averages roughly 56 across the 56 English tasks, compared to roughly 58 for all-mpnet-base-v2. The gap is small in absolute terms; the gap on speed and memory is much larger.
| Model | MTEB English avg (approx) | Embedding dim | Disk size |
|---|---|---|---|
| all-MiniLM-L6-v2 | ~56 | 384 | ~80 MB |
| all-MiniLM-L12-v2 | ~57 | 384 | ~120 MB |
| all-mpnet-base-v2 | ~58 | 768 | ~420 MB |
| BGE-large-en-v1.5 (BAAI, 2023) | ~64 | 1,024 | ~1.3 GB |
| BGE-M3 (BAAI, 2024) | ~66 (multilingual) | 1,024 | ~2.3 GB |
| NV-Embed-v2 (NVIDIA, 2024) | ~72 | 4,096 | ~14 GB |
The top of the MTEB leaderboard has moved well beyond all-MiniLM-L6-v2 since 2023; modern open-source models such as BGE, GTE, E5-Mistral, NV-Embed, and Qwen3-Embedding score 8 to 16 points higher on average. They are also one to two orders of magnitude larger and slower, which is exactly why the small MiniLM model has not been displaced from default-pick status in tutorial code.
The canonical way to use the model is through the sentence-transformers Python library:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

embeddings = model.encode([
    "That is a happy person",
    "That is a happy dog",
    "Today is a sunny day",
])
print(embeddings.shape)  # (3, 384)
```
The model can also be loaded directly through the Hugging Face transformers library, in which case the user is responsible for applying mean pooling and L2 normalization manually:
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def mean_pool(last_hidden, mask):
    # Average the final-layer token embeddings, ignoring padding positions.
    mask = mask.unsqueeze(-1).float()
    summed = (last_hidden * mask).sum(1)
    counts = mask.sum(1).clamp(min=1e-9)
    return summed / counts

batch = tok(["hello world", "good morning"], padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    out = model(**batch)
emb = F.normalize(mean_pool(out.last_hidden_state, batch["attention_mask"]), p=2, dim=1)
```
Because the embeddings are L2-normalized, cosine similarity reduces to a simple dot product, which is what most vector databases compute internally.
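As a small illustration of that property, the embeddings from the earlier sentence-transformers snippet can be compared with a plain matrix product; since the model's output is already normalized, this yields the same numbers as an explicit cosine-similarity computation.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode([
    "That is a happy person",
    "That is a happy dog",
    "Today is a sunny day",
])

# Embeddings are L2-normalized, so the dot product equals cosine similarity.
scores = embeddings @ embeddings.T
print(np.round(scores, 2))  # diagonal is 1.0; the two "happy" sentences score highest
```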
On a modern x86 CPU, the model encodes thousands of short sentences per second; on a single GPU it scales to tens of thousands per second with adequate batching. ONNX, OpenVINO, GGUF, Rust (Candle), and JavaScript (Transformers.js) ports are widely available, which is why the model is often embedded inside browser extensions, edge devices, mobile apps, and serverless functions where loading a 400 MB model would be impractical.
Most real deployments of all-MiniLM-L6-v2 fall into one of the following categories:
| Use case | Why this model fits |
|---|---|
| Default retriever for retrieval-augmented generation | Cheap to host, fast to encode, good enough for most chunked text |
| Semantic search over English document collections | 384-dim vectors keep index size small at scale |
| Clustering and topic discovery | Pairwise cosine similarity is fast, embeddings cluster well |
| Near-duplicate detection (FAQs, support tickets, news headlines) | Robust to paraphrase due to the diverse training mixture |
| Reranker baselines and bi-encoder candidates for cross-encoder rerankers | Strong recall@k for first-stage retrieval |
| Embedding caches and feature inputs for downstream classifiers | 384-dim features can be fed into logistic regression or small MLPs |
| Edge inference, browser-side semantic search | Quantized variants run in WebAssembly via Transformers.js |
The model is the default embedding choice in many tutorials and quickstarts, including LangChain, LlamaIndex, Haystack, Chroma, and Qdrant, and it is also packaged for local use in stacks such as Ollama (all-minilm:l6-v2).
A few limitations are worth knowing before using the model in production.
The model is monolingual English. Inputs in other languages will produce embeddings, but quality drops sharply; use paraphrase-multilingual-MiniLM-L12-v2 or BGE-M3 for cross-lingual workloads.
The maximum sequence length is 256 word pieces. Longer text is silently truncated, which means a 4,000-word document encoded as a single string is represented entirely by its first few hundred tokens. Real systems chunk documents before encoding.
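A minimal chunking sketch is shown below; it uses a naive fixed-size word window with overlap, whereas production systems typically chunk by tokens or sentence boundaries. The helper name chunk_words and the window sizes are illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer

def chunk_words(text, max_words=180, overlap=30):
    """Split a long document into overlapping word windows sized to stay
    comfortably under the 256-word-piece truncation limit."""
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), step)]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
long_document = "..."  # placeholder for a multi-thousand-word document
chunks = chunk_words(long_document)
chunk_embeddings = model.encode(chunks)  # one 384-dim vector per chunk
```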
Quality on 2024-2026 benchmarks is no longer state of the art. On MTEB, models such as BGE-large-en, GTE-large, E5-Mistral, NV-Embed-v2, and Qwen3-Embedding outperform all-MiniLM-L6-v2 by 8 to 16 average points. For workloads where the cost of a larger model is acceptable, the quality gap matters; for workloads where p95 latency or memory dominates, the small MiniLM model is still hard to beat.
The training data ends in 2021, and the Reddit-heavy mixture biases the model toward conversational English. Domain-specific corpora (legal, biomedical, financial) often benefit from fine-tuning or from a domain-specialized embedding model.
Finally, because the embeddings are only 384-dimensional, they have less representational headroom than 768- or 1,024-dimensional embeddings. For very large indexes (tens of millions of documents), the quality ceiling can become a real constraint, and switching to all-mpnet-base-v2 or a modern BGE/GTE/E5 model is usually the right move.
Four years after release, all-MiniLM-L6-v2 remains the most-downloaded embedding model on Hugging Face by a wide margin, with reported download counts exceeding 200 million per month. Hundreds of community fine-tunes and dozens of quantized variants have been published, including ONNX, OpenVINO, GGUF, and Candle ports. The model is used as a baseline in most embedding papers published since 2022, and it appears in the default code path of effectively every popular open-source RAG framework.
The persistence of a 22.7M-parameter 2021 model at the top of a fast-moving field is partly an artifact of inertia and partly a real efficiency story. For a developer prototyping a vector search, the choice between a model that fits in 80 MB and runs on a laptop CPU and a model that needs a multi-gigabyte download and a GPU is not a close call until quality numbers actually matter. all-MiniLM-L6-v2 happens to sit at the price point where almost nothing else competes.