Sentence-transformers/all-MiniLM-L6-v2 model
| Field | Value |
|---|---|
| Platform | Hugging Face |
| Repository | sentence-transformers/all-MiniLM-L6-v2 |
| Name | all-MiniLM-L6-v2 |
| User / Organization | sentence-transformers |
| Type | Natural Language Processing, Sentence Embedding |
| Task | Sentence Similarity, Feature Extraction |
| Library | PyTorch, TensorFlow, Rust, Sentence Transformers, Transformers |
| Architecture | BERT (6 layers, 384 hidden, 12 heads), ~22.7M parameters |
| Output dimension | 384 |
| Max sequence length | 256 word pieces |
| Base model | nreimers/MiniLM-L6-H384-uncased (layer-pruned from microsoft/MiniLM-L12-H384-uncased) |
| Released | August 2021 |
| Language | English |
| Paper | arxiv:2002.10957 (MiniLM), arxiv:1908.10084 (Sentence-BERT), arxiv:2104.08727 |
| License | Apache 2.0 |
| Related | BERT, MiniLM, Sentence-BERT, all-mpnet-base-v2 |
The sentence-transformers/all-MiniLM-L6-v2 model is a compact sentence-embedding model published on the Hugging Face Hub by the sentence-transformers project. It maps English sentences and short paragraphs into a 384-dimensional dense vector space, where semantically similar inputs lie close together under cosine similarity. The model is built on a six-layer MiniLM backbone derived from work by Microsoft Research, then fine-tuned on more than 1.17 billion sentence pairs using contrastive learning. It has roughly 22.7 million parameters and a model file size of about 80 MB.
Despite being released in 2021, all-MiniLM-L6-v2 has remained the de facto default embedding model for semantic search and retrieval-augmented generation tutorials across libraries such as LangChain, LlamaIndex, Haystack, and Chroma. Hugging Face reports more than 200 million downloads per month for the repository, putting it among the most downloaded models of any kind on the Hub. The combination of small size, fast CPU inference, permissive Apache 2.0 license, and competitive quality on general English text explains why so many production systems and prototypes still ship it as the first embedding model to try.
The model is the smallest member of the "all-*" family of Sentence-BERT style encoders that came out of a community project organized by Hugging Face in the summer of 2021. The project, titled "Train the Best Sentence Embedding Model Ever with 1B Training Pairs," was part of the JAX/Flax community sprint and aimed to combine a very large mixture of public sentence-pair datasets with a single contrastive training recipe so that the resulting models would generalize across many downstream tasks rather than excelling at one benchmark. The three most widely used siblings from that effort are all-MiniLM-L6-v2 (smallest, fastest), all-MiniLM-L12-v2 (slightly larger), and all-mpnet-base-v2 (largest, highest quality).
All three were trained with essentially the same data and loss; the only thing that differs is the underlying transformer backbone. The L6 variant uses a six-layer MiniLM checkpoint that traces its lineage back to Microsoft Research's 2020 MiniLM paper, which introduced a knowledge distillation technique called "deep self-attention distillation."
Under the hood, all-MiniLM-L6-v2 is a standard BERT-style encoder with reduced depth and width, followed by a mean-pooling layer that converts per-token contextual vectors into a single fixed-length sentence embedding.
| Property | Value |
|---|---|
| Model type | bert (Hugging Face transformers) |
| Hidden layers | 6 |
| Hidden size | 384 |
| Attention heads | 12 |
| Intermediate size (FFN) | 1,536 |
| Hidden activation | GELU |
| Vocabulary | WordPiece, 30,522 tokens (uncased) |
| Max position embeddings | 512 |
| Default max sequence length | 256 word pieces (tokenizer truncation) |
| Output embedding dimension | 384 |
| Pooling | Mean pooling over non-padded tokens |
| Output normalization | L2-normalized embeddings |
| Parameter count | ~22.7 million |
| Model file size | ~80 MB (FP32) |
A sentence is tokenized with the BERT uncased WordPiece tokenizer (30,522 tokens), truncated to 256 word pieces, and then passed through six transformer encoder layers. The token embeddings from the final layer are averaged using the attention mask so that padding tokens are ignored, and the resulting 384-dimensional vector is L2-normalized so that dot products on the output are equivalent to cosine similarities.
Although the underlying BERT config sets max_position_embeddings = 512, the sentence-transformers wrapper truncates inputs to 256 tokens by default. Longer documents need to be split into chunks; the model is not intended for full long-document encoding.
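When the model is loaded through the sentence-transformers library, this truncation length is exposed as the model's max_seq_length attribute. The following sketch (assuming a recent sentence-transformers release) shows how to inspect it and, cautiously, raise it toward the 512-position ceiling of the underlying BERT config; since the model was fine-tuned on short inputs, quality on longer text is not guaranteed.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
print(model.max_seq_length)  # 256 by default for this checkpoint

# Can be raised up to the 512 positions supported by the BERT config,
# but the checkpoint was trained on short inputs, so treat with care.
model.max_seq_length = 512
```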
The backbone is nreimers/MiniLM-L6-H384-uncased, a six-layer model created by Nils Reimers (the maintainer of the sentence-transformers library) by keeping every other layer of Microsoft's microsoft/MiniLM-L12-H384-uncased. That 12-layer model was the original "general purpose" MiniLM checkpoint released by Microsoft.
The 12-layer MiniLM was itself distilled from a BERT-base-sized teacher (specifically Microsoft's UniLM v2) using deep self-attention distillation, the technique introduced by Wang et al. (2020). Rather than matching hidden states or output logits, deep self-attention distillation trains the student to mimic two things in the teacher's last transformer layer: the attention probability distribution over keys, and the relation matrix formed by the scaled dot-product between value vectors. Because both targets are agnostic to the number of student layers and heads, the student can have a different shape from the teacher while still learning the teacher's attention behavior.
The practical chain therefore looks like:
| Stage | Model | Source |
|---|---|---|
| Teacher | UniLM v2 (BERT-base size, ~110M params) | Microsoft Research |
| Distilled student | microsoft/MiniLM-L12-H384-uncased (12 layers, 33M params) | Wang et al. 2020 |
| Layer-pruned variant | nreimers/MiniLM-L6-H384-uncased (6 layers, ~22.7M params) | Reimers |
| Sentence-embedding fine-tune | sentence-transformers/all-MiniLM-L6-v2 | Hugging Face JAX/Flax sprint, 2021 |
This is why the embedding model inherits the speed advantage of MiniLM (small width, few layers) without inheriting the masked language modeling head or any task-specific head from BERT.
The model was fine-tuned during the 2021 JAX/Flax community sprint using a single contrastive learning objective on a very large mixture of sentence-pair datasets. The training corpus and the loss are shared with the other "all-*" siblings.
The full training mixture contains 1,170,060,424 sentence pairs sampled from roughly 30 datasets. The largest contributors dominate the mix. The following table lists the biggest sources, taken from the model card:
| Dataset | Approximate pairs | What it provides |
|---|---|---|
| Reddit comments (2015 to 2018) | 726,484,430 | Conversational pairs (comment, reply) |
| S2ORC citation pairs (abstracts) | 116,288,806 | Scientific citation context |
| WikiAnswers duplicate questions | 77,427,422 | Paraphrased question pairs |
| PAQ (question, answer) | 64,371,441 | Generated QA pairs over Wikipedia |
| S2ORC citation pairs (titles) | 52,603,982 | Scientific paper titles in citation graph |
| S2ORC (title, abstract) | 41,769,185 | Scientific title to abstract |
| Stack Exchange (title, body) | 25,316,456 | Programming and technical Q&A |
| Stack Exchange (title+body, answer) | 21,396,559 | Question to accepted answer |
| Stack Exchange (title, answer) | 21,396,559 | Title to answer |
| MS MARCO triplets | 9,144,553 | Web passage retrieval triples |
| SNLI, MultiNLI, QQP, Flickr30k captions, SimpleWiki, Yahoo Answers, AltLex, NQ, TriviaQA, ELI5, WikiHow, SearchQA, GooAQ, CodeSearchNet, SPECTER and others | ~14M combined | Diverse smaller corpora |
The heavy Reddit weighting is a deliberate choice; Reddit comment threads supply enormous amounts of weakly aligned sentence pairs, and the resulting mixture is far more domain-diverse than any single retrieval dataset.
Training uses a contrastive objective known in the sentence-transformers library as Multiple Negatives Ranking Loss. For each batch of N sentence pairs (a_i, b_i), the model computes embeddings for all 2N sentences, builds the N x N matrix of cosine similarities between every a_i and every b_j, divides by a temperature, and applies a cross-entropy loss whose target is the diagonal. In other words, every other sentence in the batch acts as an in-batch negative for a_i, and the model is rewarded for ranking the true partner above all of them. This loss avoids the need to mine hard negatives explicitly and scales well with batch size. A minimal PyTorch sketch of the objective as described above follows; the function and variable names (in_batch_negatives_loss, emb_a, emb_b, scale) are illustrative assumptions, not the library's actual implementation.
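```python
import torch
import torch.nn.functional as F

def in_batch_negatives_loss(emb_a, emb_b, scale=20.0):
    """Contrastive loss over a batch of N (a_i, b_i) embedding pairs.

    emb_a, emb_b: (N, d) tensors of sentence embeddings.
    scale: inverse temperature applied to the cosine similarities.
    """
    emb_a = F.normalize(emb_a, p=2, dim=1)
    emb_b = F.normalize(emb_b, p=2, dim=1)
    # N x N matrix: cosine similarity of every a_i with every b_j.
    sims = emb_a @ emb_b.T * scale
    # The true partner of a_i is b_i, i.e. the diagonal entry.
    targets = torch.arange(emb_a.size(0), device=emb_a.device)
    return F.cross_entropy(sims, targets)
```

The key hyperparameters of the actual training run are listed below.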
| Setting | Value |
|---|---|
| Hardware | 7 TPU v3-8 devices (Google TPU Research Cloud) |
| Framework | JAX / Flax |
| Steps | 100,000 |
| Global batch size | 1,024 (128 per TPU core) |
| Sequence length during training | 128 tokens |
| Optimizer | AdamW |
| Learning rate | 2e-5 |
| Warmup steps | 500 |
| Weighted dataset sampling | Yes (datasets sampled in proportion to a fixed weight per source) |
The Google TPU Research Cloud donated the hardware, which is why the training run was feasible despite the size of the data mixture.
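The original run used custom JAX/Flax code, but a rough equivalent of the recipe can be sketched with the sentence-transformers training API. The snippet below is illustrative only: the two placeholder pairs stand in for the ~1.17B-pair mixture, and the hyperparameters simply echo the table above.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the pruned 6-layer backbone rather than the finished embedding model.
model = SentenceTransformer("nreimers/MiniLM-L6-H384-uncased")

# Placeholder pairs standing in for the full sentence-pair mixture.
train_examples = [
    InputExample(texts=["How do I bake bread?", "Steps for baking a simple loaf"]),
    InputExample(texts=["What is the capital of France?", "Paris is the capital of France."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives contrastive objective.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=500,
    optimizer_params={"lr": 2e-5},
)
```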
The "all-*" family was published as a coherent set so users could pick a point on the size/speed/quality curve. The three core models are:
| Model | Backbone | Layers | Hidden | Output dim | Approx params | Relative speed | Quality |
|---|---|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | MiniLM | 6 | 384 | 384 | ~22.7M | Fastest (~5x faster than mpnet on CPU) | Good |
| all-MiniLM-L12-v2 | MiniLM | 12 | 384 | 384 | ~33M | Faster than mpnet | Slightly higher than L6 |
| all-mpnet-base-v2 | MPNet | 12 | 768 | 768 | ~110M | Slowest of the three | Highest |
For multilingual workloads, the recommended sibling is paraphrase-multilingual-MiniLM-L12-v2, which applies the same compact-MiniLM recipe with a 12-layer multilingual backbone trained on parallel data covering 50+ languages; the all-MiniLM-L6-v2 model itself is English-only.
On the Massive Text Embedding Benchmark (MTEB), all-MiniLM-L6-v2 averages roughly 56 across the 56 English tasks, compared to roughly 58 for all-mpnet-base-v2. The gap is small in absolute terms; the gap on speed and memory is much larger.
| Model | MTEB English avg (approx) | Embedding dim | Disk size |
|---|---|---|---|
| all-MiniLM-L6-v2 | ~56 | 384 | ~80 MB |
| all-MiniLM-L12-v2 | ~57 | 384 | ~120 MB |
| all-mpnet-base-v2 | ~58 | 768 | ~420 MB |
| BGE-large-en-v1.5 (BAAI, 2023) | ~64 | 1,024 | ~1.3 GB |
| BGE-M3 (BAAI, 2024) | ~66 (multilingual) | 1,024 | ~2.3 GB |
| NV-Embed-v2 (NVIDIA, 2024) | ~72 | 4,096 | ~14 GB |
The top of the MTEB leaderboard has moved well beyond all-MiniLM-L6-v2 since 2023; modern open-source models such as BGE, GTE, E5-Mistral, NV-Embed, and Qwen3-Embedding score 8 to 16 points higher on average. They are also one to two orders of magnitude larger and slower, which is exactly why the small MiniLM model has not been displaced from default-pick status in tutorial code.
The canonical way to use the model is through the sentence-transformers Python library:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

embeddings = model.encode([
    "That is a happy person",
    "That is a happy dog",
    "Today is a sunny day",
])
print(embeddings.shape)  # (3, 384)
```
The model can also be loaded directly through the Hugging Face transformers library, in which case the user is responsible for applying mean pooling and L2 normalization manually:
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def mean_pool(last_hidden, mask):
    # Average the final-layer token embeddings, ignoring padding positions.
    mask = mask.unsqueeze(-1).float()
    summed = (last_hidden * mask).sum(1)
    counts = mask.sum(1).clamp(min=1e-9)
    return summed / counts

batch = tok(["hello world", "good morning"], padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    out = model(**batch)
emb = F.normalize(mean_pool(out.last_hidden_state, batch["attention_mask"]), p=2, dim=1)
```
Because the embeddings are L2-normalized, cosine similarity reduces to a simple dot product, which is what most vector databases compute internally.
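As a small illustration of that property, the embeddings from the earlier sentence-transformers snippet can be compared with a plain matrix product; since the model's output is already normalized, this yields the same numbers as an explicit cosine-similarity computation.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode([
    "That is a happy person",
    "That is a happy dog",
    "Today is a sunny day",
])

# Embeddings are L2-normalized, so the dot product equals cosine similarity.
scores = embeddings @ embeddings.T
print(np.round(scores, 2))  # diagonal is 1.0; the two "happy" sentences score highest
```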
On a modern x86 CPU, the model encodes thousands of short sentences per second; on a single GPU it scales to tens of thousands per second with adequate batching. ONNX, OpenVINO, GGUF, Rust (Candle), and JavaScript (Transformers.js) ports are widely available, which is why the model is often embedded inside browser extensions, edge devices, mobile apps, and serverless functions where loading a 400 MB model would be impractical.
Most real deployments of all-MiniLM-L6-v2 fall into one of the following categories:
| Use case | Why this model fits |
|---|---|
| Default retriever for retrieval-augmented generation | Cheap to host, fast to encode, good enough for most chunked text |
| Semantic search over English document collections | 384-dim vectors keep index size small at scale |
| Clustering and topic discovery | Pairwise cosine similarity is fast, embeddings cluster well |
| Near-duplicate detection (FAQs, support tickets, news headlines) | Robust to paraphrase due to the diverse training mixture |
| Reranker baselines and bi-encoder candidates for cross-encoder rerankers | Strong recall@k for first-stage retrieval |
| Embedding caches and feature inputs for downstream classifiers | 384-dim features can be fed into logistic regression or small MLPs |
| Edge inference, browser-side semantic search | Quantized variants run in WebAssembly via Transformers.js |
The model is the default embedding choice in many tutorials and quickstarts, including LangChain, LlamaIndex, Haystack, Chroma, and Qdrant, and it is also packaged for local use in stacks such as Ollama (all-minilm:l6-v2).
A few limitations are worth knowing before using the model in production.
The model is monolingual English. Inputs in other languages will produce embeddings, but quality drops sharply; use paraphrase-multilingual-MiniLM-L12-v2 or BGE-M3 for cross-lingual workloads.
The maximum sequence length is 256 word pieces. Longer text is silently truncated, which means a 4,000-word document encoded as a single string is represented entirely by its first few hundred tokens. Real systems chunk documents before encoding.
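A minimal chunking sketch is shown below; it uses a naive fixed-size word window with overlap, whereas production systems typically chunk by tokens or sentence boundaries. The helper name chunk_words and the window sizes are illustrative assumptions.

```python
from sentence_transformers import SentenceTransformer

def chunk_words(text, max_words=180, overlap=30):
    """Split a long document into overlapping word windows sized to stay
    comfortably under the 256-word-piece truncation limit."""
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), step)]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
long_document = "..."  # placeholder for a multi-thousand-word document
chunks = chunk_words(long_document)
chunk_embeddings = model.encode(chunks)  # one 384-dim vector per chunk
```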
Quality on 2024-2026 benchmarks is no longer state of the art. On MTEB, models such as BGE-large-en, GTE-large, E5-Mistral, NV-Embed-v2, and Qwen3-Embedding outperform all-MiniLM-L6-v2 by 8 to 16 average points. For workloads where the cost of a larger model is acceptable, the quality gap matters; for workloads where p95 latency or memory dominates, the small MiniLM model is still hard to beat.
The training data ends in 2021, and the Reddit-heavy mixture biases the model toward conversational English. Domain-specific corpora (legal, biomedical, financial) often benefit from fine-tuning or from a domain-specialized embedding model.
Finally, because the embeddings are only 384-dimensional, they have less representational headroom than 768- or 1,024-dimensional embeddings. For very large indexes (tens of millions of documents), the quality ceiling can become a real constraint, and switching to all-mpnet-base-v2 or a modern BGE/GTE/E5 model is usually the right move.
Four years after release, all-MiniLM-L6-v2 remains the most-downloaded embedding model on Hugging Face by a wide margin, with reported download counts exceeding 200 million per month. Hundreds of community fine-tunes and dozens of quantized variants have been published, including ONNX, OpenVINO, GGUF, and Candle ports. The model is used as a baseline in most embedding papers published since 2022, and it appears in the default code path of effectively every popular open-source RAG framework.
The persistence of a 22.7M-parameter 2021 model at the top of a fast-moving field is partly an artifact of inertia and partly a real efficiency story. For a developer prototyping a vector search, the choice between a model that fits in 80 MB and runs on a laptop CPU and a model that needs a multi-gigabyte download and a GPU is not a close call until quality numbers actually matter. all-MiniLM-L6-v2 happens to sit at the price point where almost nothing else competes.