SPLADE
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,350 words
SPLADE (Sparse Lexical and Expansion Model) is a family of neural information retrieval models that produce sparse, vocabulary-sized representations of queries and documents by exploiting the masked language modeling (MLM) head of a BERT-style encoder. Introduced by Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant of Naver Labs Europe at SIGIR 2021, SPLADE projects each input into the WordPiece vocabulary space and learns nonzero weights for both the literal tokens of the text and a set of expansion terms, yielding representations that score documents through dot products on a traditional inverted index.[1] The model is trained end-to-end with a ranking objective combined with an explicit FLOPS regularizer that penalizes the expected per-dimension cost of posting-list traversal, producing very sparse outputs that remain competitive with dense bi-encoders.[2][1] Successive releases (SPLADE v2, SPLADE++, Efficient SPLADE, and SPLADE-v3) extended the design with knowledge distillation, hard-negative mining, separated query and document encoders, and improved training recipes.[3][4][5][6] SPLADE is a widely used learned sparse retriever; it is integrated into open-source toolkits such as Pyserini and supported by search platforms such as Vespa, through its wand operator, and Pinecone, through its sparse-dense indexes.[7][8][9]
| Property | Value |
|---|---|
| Original paper | "SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking"[1] |
| Venue | SIGIR 2021 (44th ACM SIGIR Conference), Virtual Event, Canada[10] |
| Original authors | Thibault Formal, Benjamin Piwowarski, Stéphane Clinchant (Naver Labs Europe)[1][10] |
| First arXiv version | 2107.05720, 12 July 2021[1] |
| Successor papers | SPLADE v2 (arXiv:2109.10086, 21 Sep 2021); SPLADE++ (arXiv:2205.04733, 10 May 2022); Efficient SPLADE (arXiv:2207.03834, SIGIR 2022); SPLADE-v3 (arXiv:2403.06789, 11 Mar 2024)[2][4][5][6] |
| Reference implementation | github.com/naver/splade[3] |
| License (code) | Creative Commons Attribution-NonCommercial-ShareAlike 4.0[3] |
| Representation size | 30,522 sparse dimensions (BERT WordPiece vocabulary)[11] |
| Pooling | max over input positions of log(1 + ReLU(logit)); sum in the original paper[11] |
| Regularizer | FLOPS (sum over vocabulary terms of the squared mean weight across the batch)[11][12] |
For decades, first-stage information retrieval over large document collections relied on bag-of-words term-matching scores such as TF-IDF and BM25, evaluated efficiently through inverted indexes that store posting lists keyed by vocabulary terms.[7] Such scores exhibit two well-known weaknesses: they fail when relevant documents use different surface forms than the query (the vocabulary mismatch problem), and they assign weights using unsupervised heuristics rather than task-specific signal. Two strands of work attempted to close that gap. Dense bi-encoders, exemplified by Dense Passage Retrieval (DPR), embed queries and documents into a low-dimensional vector space and retrieve with approximate nearest-neighbor search, but they discard the interpretability of explicit term matching and require specialized vector indexes.[13] Learned sparse retrieval kept the inverted-index pipeline and instead trained neural models to predict per-term weights, with DeepCT, doc2query, uniCOIL, DeepImpact, and SparTerm as early examples.[14][13]
SparTerm proposed predicting term importance directly from the logits of BERT's MLM head over the WordPiece vocabulary, multiplied by a learned binary gating mask that controls expansion.[15] The resulting representation reuses the inverted-index machinery while allowing the model to assign nonzero weight to vocabulary terms not literally present in the input, effectively learning a per-document expansion.[15][11] SparTerm's training did not directly optimize for sparsity beyond the gating mask, and its end-to-end behavior was difficult to tune.[1]
Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant introduced SPLADE in a short paper at SIGIR 2021, "SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking", published in the proceedings of the 44th International ACM SIGIR Conference and posted to arXiv on 12 July 2021.[1][10] The paper replaced SparTerm's binary gate with a log-saturation activation, w_j = sum over input tokens i of log(1 + ReLU(s_{ij})), where s_{ij} is the MLM logit for vocabulary term j at input position i, and trained the model end-to-end with a ranking loss plus an explicit FLOPS regularizer on the sparsity of the representation.[11][1] The combination allowed SPLADE to outperform BM25, doc2query-T5, and previous sparse neural baselines on MS MARCO Passage Ranking while remaining indexable with standard inverted-file engines.[1][11]
Two months after SIGIR, Formal, Carlos Lassance, Piwowarski, and Clinchant posted a longer technical report on arXiv as SPLADE v2 (arXiv:2109.10086, 21 September 2021).[2][16] The v2 paper changed the pooling strategy from sum to max, introduced a document-only expansion variant (SPLADE-doc) that drops query expansion to lower query-time cost, and replaced the original cross-entropy training with a distillation objective using a cross-encoder teacher.[2][16] On TREC DL 2019, v2 reported "more than 9% gains on NDCG@10" over the SIGIR baseline, and it posted state-of-the-art zero-shot results on the BEIR benchmark of 18 datasets.[2]
At SIGIR 2022, the same group published "From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective" (arXiv:2205.04733, 10 May 2022).[4][17] The paper is the canonical reference for the SPLADE++ checkpoints, which combine knowledge distillation from a cross-encoder, hard-negative mining over a larger negative pool, and a co-condenser pre-trained language model initialization.[4][17] The released checkpoints naver/splade-cocondenser-selfdistil and naver/splade-cocondenser-ensembledistil reached MRR@10 of 37.6 and 38.3 respectively on the MS MARCO development set, and the EnsembleDistil variant achieved an average nDCG@10 of 50.7 on the BEIR-13 subset reported by the authors.[3][18]
Also at SIGIR 2022, Carlos Lassance and Stéphane Clinchant published "An Efficiency Study for SPLADE Models" (arXiv:2207.03834, 8 July 2022), focused on production latency rather than effectiveness ceilings.[5][19] The paper proposes L1 regularization specifically on the query side, separation of the query and document encoders so each can be tuned independently, a FLOPS-regularized middle-training stage, and faster query-side architectures.[5][19] The released Efficient SPLADE models, naver/efficient-splade-V-large-doc plus its query encoder, reach MRR@10 of 38.8 on MS MARCO dev while serving queries within a few milliseconds of BM25 on the same hardware.[3][5]
Carlos Lassance, Hervé Déjean, Thibault Formal, and Stéphane Clinchant released "SPLADE-v3: New baselines for SPLADE" on arXiv on 11 March 2024 (arXiv:2403.06789), accompanying a refreshed release of the official library and Hugging Face checkpoints.[6][20] SPLADE-v3 keeps the architecture of SPLADE++ but changes the training mix: it warm-starts from the SPLADE++ SelfDistil checkpoint, samples 100 negatives per query (50 hard negatives from the top 50 and 50 drawn at random from the top 1,000), and combines a KL-divergence loss with a MarginMSE loss weighted 0.05.[20][21] The base model reaches MRR@10 of 40.2 on MS MARCO and an average nDCG@10 of 51.7 on BEIR-13, about one point above the 50.7 reported for SPLADE++ EnsembleDistil.[21][20] The release also includes variant models: SPLADE-v3-DistilBERT for lower-footprint deployment, SPLADE-v3-Lexical, which drops query expansion, and SPLADE-v3-Doc, which drops query-side computation entirely.[20]
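The combination of a KL-divergence loss and a MarginMSE loss can be illustrated with a minimal sketch in plain Python. It is not the official training code: the function names, the direction of the KL term (teacher relative to student), the weight of 1.0 on the KL term, and the toy scores are all assumptions made for illustration. Only the 0.05 weight on MarginMSE comes from the description above.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def kl_div_loss(student_scores, teacher_scores):
    """KL(teacher || student) over the softmax of the candidate scores."""
    p = softmax(teacher_scores)
    q = softmax(student_scores)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def margin_mse_loss(student_scores, teacher_scores):
    """Mean squared error between student and teacher (pos - neg) margins.

    Index 0 holds the positive passage; the remaining entries are negatives.
    """
    errors = []
    for s_neg, t_neg in zip(student_scores[1:], teacher_scores[1:]):
        s_margin = student_scores[0] - s_neg
        t_margin = teacher_scores[0] - t_neg
        errors.append((s_margin - t_margin) ** 2)
    return sum(errors) / len(errors)

def combined_loss(student_scores, teacher_scores, kl_weight=1.0, mse_weight=0.05):
    """Weighted sum of both losses; kl_weight=1.0 is an assumed default."""
    kl = kl_div_loss(student_scores, teacher_scores)
    mse = margin_mse_loss(student_scores, teacher_scores)
    return kl_weight * kl + mse_weight * mse

# Hypothetical scores for one query: one positive followed by two negatives.
teacher = [5.0, 1.0, 0.0]
student = [3.0, 2.0, 1.0]
loss = combined_loss(student, teacher)
```

In this toy setup the loss is zero when the student reproduces the teacher's scores exactly and grows as the student's ranking margins drift from the teacher's.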
Given a sequence of input WordPiece tokens, SPLADE first encodes the sequence with a BERT (or DistilBERT) backbone and then runs the language modeling head, which produces a logit s_{ij} for every vocabulary term j at every input position i.[11][22] These are the same logits that BERT uses for masked-token prediction during pre-training; SPLADE repurposes them as a per-position relevance signal over the 30,522-entry vocabulary.[22][11] The MLM head does not have to be trained from scratch: fine-tuning adapts the existing weights so that the logits encode retrieval relevance rather than masked-token likelihood.[11]
A nonlinearity is applied to convert logits into nonnegative term weights before pooling across positions. SPLADE uses the saturation function log(1 + ReLU(s_{ij})), which has two effects: ReLU zeros out negative logits, which is what makes exact zeros (and hence sparsity) possible, and the logarithm dampens the impact of any single dominant logit, mimicking the saturating term-frequency curves used in BM25.[11][22] The v1 paper aggregated across input positions with a sum pool; v2 changed this to a max pool, which empirically improved both in-domain and zero-shot effectiveness and remains the default for SPLADE++ and SPLADE-v3.[2][21]
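The saturation and pooling step can be sketched in a few lines of plain Python. This is an illustrative toy, not the reference implementation: the function name `splade_pool`, the nested-list input format, and the logit values are assumptions, and a real model would apply the same arithmetic to tensors of shape (sequence length, 30,522).

```python
import math

def splade_pool(logits, mode="max"):
    """Pool per-position MLM logits into one weight per vocabulary term.

    logits: list of rows, one per input position; each row holds one
            logit per vocabulary term (a toy stand-in for the MLM head).
    mode:   "max" (v2 and later) or "sum" (original SIGIR 2021 paper).
    """
    if mode not in ("max", "sum"):
        raise ValueError("mode must be 'max' or 'sum'")
    vocab_size = len(logits[0])
    weights = []
    for j in range(vocab_size):
        # log(1 + ReLU(s_ij)): negative logits map to exactly 0.0.
        saturated = [math.log1p(max(0.0, row[j])) for row in logits]
        weights.append(max(saturated) if mode == "max" else sum(saturated))
    return weights

# Toy input: 3 input positions, 4 vocabulary terms (hypothetical values).
toy_logits = [
    [2.0, -1.0, 0.0, 0.5],
    [1.0, -3.0, 0.0, 4.0],
    [-0.5, -2.0, 0.0, 0.5],
]
max_weights = splade_pool(toy_logits, "max")
sum_weights = splade_pool(toy_logits, "sum")
# Term 1 has only negative logits, so its weight is 0.0 under both modes.
```

Sum pooling lets a term accumulate weight from every position where it is activated, while max pooling keeps only the strongest position, so the sum-pooled weights are never smaller than the max-pooled ones.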
End-to-end sparsity is enforced by an auxiliary loss term known as the FLOPS regularizer.[12][11] For a batch of B representations, the regularizer is the sum over vocabulary dimensions j of the squared mean activation, sum_j ((1/B) sum_b w_{b,j})^2.[12] The mean activation of dimension j estimates how often that vocabulary entry is used, and therefore how long its posting list will be. Because the squared mean penalizes a dimension that is activated consistently across the batch more heavily than one that is activated only occasionally, the regularizer pushes the model toward representations whose nonzero entries are spread across the vocabulary rather than concentrated on a few frequently used terms. This approximates the expected cost of scoring a query against an inverted index: a posting list is short when only a few documents activate that vocabulary entry.[12][11] Separate regularization weights lambda_q and lambda_d are applied to queries and documents, with a warmup schedule that ramps the coefficients up over the early training steps so the model first learns to retrieve and then learns to be sparse.[11][16]
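A small numeric sketch may help make the FLOPS formula concrete. The function name `flops_regularizer` and the two toy batches are assumptions for illustration; the arithmetic follows the formula given above, the sum over dimensions of the squared mean activation.

```python
def flops_regularizer(batch):
    """Sum over vocabulary dimensions of the squared mean activation.

    batch: list of B representations, each a list of nonnegative weights
           of equal length (one entry per vocabulary term).
    """
    batch_size = len(batch)
    vocab_size = len(batch[0])
    total = 0.0
    for j in range(vocab_size):
        mean_activation = sum(rep[j] for rep in batch) / batch_size
        total += mean_activation ** 2
    return total

# Two toy batches with the same total weight mass (hypothetical values).
# "shared": every representation activates the same dimension.
shared = [[1.0, 0.0, 0.0, 0.0]] * 4
# "spread": each representation activates a different dimension.
spread = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
flops_shared = flops_regularizer(shared)  # (4/4)^2 = 1.0
flops_spread = flops_regularizer(spread)  # 4 * (1/4)^2 = 0.25
```

Both batches contain four nonzero weights of 1.0, yet the batch in which every item activates the same dimension is penalized four times as much. That dimension corresponds to a posting list containing every document, which is the expensive case for inverted-index traversal.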
Because SPLADE outputs a 30,522-dimensional sparse vector whose dimensions correspond directly to vocabulary tokens, every nonzero entry is human-readable.[7][22] A SPLADE encoding of a document therefore decomposes into two intuitive components: the in-document terms, which behave like a learned variant of TF weights, and the expansion terms, which are vocabulary entries that did not appear in the document but received nonzero weight because the MLM head deemed them semantically relevant.[7][22] The same holds for queries: a query SPLADE vector is itself a bag of literal query words plus expansion words, so the standard inverted-index scoring formula, the sum over shared vocabulary entries j of w_{q,j} · w_{d,j}, gives the SPLADE relevance score.[7][11] This makes SPLADE results inspectable in a way that dense bi-encoders are not: failures can be traced to specific over- or under-weighted vocabulary entries.[22]
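The scoring formula and the split into literal and expansion terms can be sketched with ordinary dictionaries. The helper names `sparse_dot` and `split_terms`, the token strings, and all weights below are hypothetical values chosen for illustration, not output from a real SPLADE checkpoint.

```python
def sparse_dot(query_vec, doc_vec):
    """Dot product over the vocabulary entries the two vectors share."""
    # Iterate over the smaller dict to keep the loop short.
    small, large = sorted((query_vec, doc_vec), key=len)
    return sum(w * large[t] for t, w in small.items() if t in large)

def split_terms(sparse_vec, surface_tokens):
    """Separate weights into literal terms and expansion terms."""
    literal = {t: w for t, w in sparse_vec.items() if t in surface_tokens}
    expansion = {t: w for t, w in sparse_vec.items() if t not in surface_tokens}
    return literal, expansion

# Hypothetical document "cat sat mat" and its sparse encoding.
doc_tokens = {"cat", "sat", "mat"}
doc_vec = {"cat": 1.5, "mat": 0.5, "feline": 1.0, "rug": 0.25}
# Hypothetical query encoding with one expansion term in common.
query_vec = {"feline": 2.0, "cat": 1.0, "pet": 0.5}

score = sparse_dot(query_vec, doc_vec)  # 2.0*1.0 + 1.0*1.5 = 3.5
literal, expansion = split_terms(doc_vec, doc_tokens)
```

Here "feline" contributes to the score even though it never appears in the document text, which is the kind of match a purely lexical scorer would miss, and the contribution of each shared term can be read off directly.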
A SPLADE document representation is, by construction, a sparse weighted multiset over the BERT vocabulary, which fits exactly the data structure that classical search engines use for postings.[7][11] At indexing time, documents are encoded once with the document encoder, their nonzero entries are written into Lucene-compatible impact-weighted postings, and the resulting index supports the same disjunctive top-k traversal that classical BM25 indexes use, including WAND and BlockMaxWAND skipping for sublinear retrieval.[7][8] At query time, only the query encoder runs (a smaller model in the Efficient SPLADE configurations), the query SPLADE vector is treated as a weighted disjunction of terms, and the standard impact-ranking machinery returns the top-k.[8][7] This compatibility is the main practical reason SPLADE has been adopted: existing search infrastructure based on Lucene, Tantivy, or commercial inverted-file engines can serve SPLADE without redesign.[8][7]
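The indexing and query-time flow can be sketched with a toy in-memory inverted index. This is a simplification under stated assumptions: the function names, document identifiers, and weights are hypothetical, scores are accumulated exhaustively, and the WAND or BlockMaxWAND skipping that production engines use is deliberately left out.

```python
import heapq
from collections import defaultdict

def build_index(doc_vectors):
    """Map each term to a posting list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        for term, weight in vec.items():
            if weight > 0:
                index[term].append((doc_id, weight))
    return index

def search(index, query_vec, k=10):
    """Score documents by traversing only the query terms' posting lists."""
    scores = defaultdict(float)
    for term, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(term, ()):
            scores[doc_id] += q_weight * d_weight
    # Exhaustive accumulation; WAND-style skipping is intentionally omitted.
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

# Hypothetical sparse document encodings.
docs = {
    "d1": {"cat": 1.5, "feline": 1.0},
    "d2": {"dog": 2.0, "pet": 0.5},
    "d3": {"cat": 0.5, "pet": 1.0},
}
index = build_index(docs)
top2 = search(index, {"cat": 1.0, "pet": 0.5}, k=2)
```

Only the posting lists for "cat" and "pet" are touched, so the query cost depends on the lengths of those lists rather than on the collection size. This is the quantity that the FLOPS regularizer is designed to keep small.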
The naver/splade GitHub repository and the corresponding Hugging Face organization host multiple checkpoints; the table below records the headline scores documented in the official README and the per-model Hugging Face cards.
| Model | Backbone | MRR@10 (MS MARCO dev) | Notes |
|---|---|---|---|
| splade_v2_max | DistilBERT | 34.0 | v2 max-pool checkpoint without distillation[3] |
| splade_v2_distil (DistilSPLADE-max) | DistilBERT | 36.8 | v2 with distillation[3] |
| splade-cocondenser-selfdistil (SPLADE++ SD) | co-condenser | 37.6 | SIGIR 2022 self-distillation[3] |
| splade-cocondenser-ensembledistil (SPLADE++ ED) | co-condenser | 38.3 | Ensemble teacher distillation[3] |
| efficient-splade-V-large (doc + query) | BERT | 38.8 | Separated encoders, L1 query reg[3] |
| efficient-splade-VI-BT-large (doc + query) | BERT | 38.0 | Lower-latency configuration[3] |
| splade-v3 | BERT (warm from ++SD) | 40.2 | KL-Div + MarginMSE, 100 negatives[21] |
| splade-v3-distilbert | DistilBERT | 38.7 | Distilled variant of v3[20] |
| splade-v3-lexical | BERT | 40.0 | No query expansion[20] |
| splade-v3-doc | BERT | 37.8 | No query-side neural model[20] |
The repository is released under CC-BY-NC-SA 4.0; it has accumulated nearly a thousand GitHub stars and is actively maintained.[3]
BEIR is a heterogeneous zero-shot retrieval benchmark spanning 18 datasets, of which a 13-dataset publicly available subset is typically used for evaluation. SPLADE has consistently set strong baselines on this benchmark: the SPLADE v2 paper introduced BEIR evaluation for sparse models and reported then-state-of-the-art zero-shot scores; SPLADE++ EnsembleDistil reached an average nDCG@10 of 50.7 on the BEIR-13 evaluation reported by Formal and colleagues; and SPLADE-v3 raised the BEIR-13 average to 51.7 nDCG@10 while exceeding 40 MRR@10 on MS MARCO dev.[2][4][21] The SPLADE-v3 paper conducts a meta-analysis across more than 40 query sets, including MS MARCO, BEIR, LoTTE and TREC collections, and reports that the v3 base model is statistically significantly more effective than both BM25 and SPLADE++ while remaining competitive with cross-encoder rerankers on those collections.[20]
The official repository at github.com/naver/splade provides training, indexing, and evaluation pipelines, integrated with Hugging Face Transformers and PyTorch.[3] The README documents training recipes for SPLADE, SPLADE++, Efficient SPLADE, and SPLADE-v3, end-to-end indexing via Pyserini-compatible JSON output, and reproduction scripts for MS MARCO and BEIR.[3]
Pyserini, the Lucene/Anserini-based Python toolkit for reproducible IR research from the University of Waterloo, supports SPLADE through its LuceneImpactSearcher class.[7][23] The toolkit ships pre-tokenized SPLADE indexes for MS MARCO and provides documentation that reproduces SPLADE v2's reported MRR@10 of ~0.368 on the development set, either by encoding queries on the fly with the distill-splade-max model or by using pre-computed query impacts.[7][23] Pyserini's SPLADE integration uses the --impact --pretokenized flags to bypass BM25 score normalization and treat the SPLADE-emitted weights as direct posting-list impacts.[7]
Pinecone announced general support for sparse-dense vectors on 23 February 2023, allowing users to upload SPLADE document vectors alongside dense embeddings in the same vector store and to retrieve with a tunable alpha blend between the two scores.[9] The Pinecone Text Client library ships a SpladeEncoder class that wraps the naver/splade-cocondenser-ensembledistil checkpoint and emits the sparse component in the format expected by Pinecone's sparse-dense index, enabling hybrid lexical-semantic retrieval without operating a separate sparse engine.[24][9]
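The idea of an alpha blend between dense and sparse scores can be sketched as a convex combination. This is a generic illustration, not Pinecone's client API: the function names, the convention that alpha weights the dense side, and the candidate scores are all assumptions.

```python
def hybrid_score(dense_score, sparse_score, alpha):
    """Convex combination: alpha weights the dense side (assumed convention)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * dense_score + (1.0 - alpha) * sparse_score

def rank(candidates, alpha):
    """Order candidate ids by blended score, best first."""
    return sorted(
        candidates,
        key=lambda doc_id: hybrid_score(*candidates[doc_id], alpha),
        reverse=True,
    )

# Hypothetical (dense_score, sparse_score) pairs for two candidates.
candidates = {"d1": (0.9, 1.0), "d2": (0.2, 4.0)}

dense_only = rank(candidates, alpha=1.0)   # ordering by dense score alone
sparse_only = rank(candidates, alpha=0.0)  # ordering by sparse score alone
```

Because dense similarities and SPLADE dot products usually live on different scales, as the toy values suggest, the useful range of alpha tends to be found empirically on held-out queries rather than fixed in advance.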
The Vespa search engine supports SPLADE via its weightedset<int> document field and the wand query operator, which implements top-k disjunctive retrieval over weighted posting lists.[8] Because Vespa's WAND operator requires integer term identifiers, the recommended pattern is to map the 30,522 BERT WordPiece IDs into the Vespa field directly, allowing dense and sparse retrieval to be combined within a single query expression that mixes wand with Vespa's nearestNeighbor operator for dense vector embeddings.[8]
Beyond the headline integrations, the SPLADE checkpoints are usable from Sentence Transformers via its SparseEncoder API, allowing SPLADE document and query encoding from the same library that handles dense bi-encoders.[18] Downstream production users include open-source vector database stacks that surface sparse-dense hybrid retrieval, and academic systems that compare learned sparse retrieval against dense retrieval baselines across MS MARCO and BEIR.[13][7]
SPLADE established a viable middle path between BM25-style lexical retrieval and dense bi-encoder retrieval.[13][7] Its three principal practical advantages are: compatibility with existing inverted-file infrastructure, which removes the operational cost of running a separate vector index for first-stage retrieval; inherent interpretability of the per-dimension vocabulary weights, which simplifies debugging and explanation in production retrieval systems; and competitive zero-shot generalization on out-of-domain benchmarks such as BEIR, where dense bi-encoders historically degraded sharply outside their training distribution.[7][22][2] These properties have made SPLADE a frequent first-stage choice in semantic search pipelines, retrieval-augmented question answering, and as the sparse component of sparse-dense hybrid retrieval stacks.[9][7]
A SPLADE-encoded document can have many more nonzero vocabulary entries than its surface tokens, which increases inverted-index size and the average length of posting lists relative to BM25.[12][7] Aggressive FLOPS regularization controls this at training time, but at some cost in effectiveness; this trade-off is the subject of the Efficient SPLADE paper and of subsequent work on DF-FLOPS, which specifically penalizes the use of high document-frequency terms.[12][5]
Naive SPLADE serving runs a full transformer pass at query time before the inverted-index lookup, which adds latency relative to BM25's keyword-only processing path.[5] The Efficient SPLADE work narrowed the gap to under four milliseconds on standard hardware by using smaller query-side encoders and stronger L1 regularization on the query representation, but a residual cost relative to BM25 remains, especially under high concurrency.[5][19]
Because the representation space is fixed to the BERT WordPiece vocabulary, SPLADE inherits its tokenizer's strengths and weaknesses. Out-of-vocabulary terms must be split into subword units, and the WordPiece units may fit non-English collections poorly without retraining or multilingual variants.[11][22]
The reference SPLADE code and weights are released under CC-BY-NC-SA 4.0, a noncommercial license, which restricts certain production uses without a separate commercial agreement.[3]
| Approach | Representation | Index | Interpretability | Notes |
|---|---|---|---|---|
| BM25 | Unsupervised TF/IDF, sparse | Inverted | High | Strong unsupervised baseline[7] |
| DeepCT, doc2query, uniCOIL, DeepImpact | Learned per-term weights, sparse | Inverted | High | Earlier learned sparse retrievers[13][14] |
| SparTerm | MLM logits with binary gate, sparse | Inverted | High | Direct antecedent of SPLADE[15] |
| SPLADE family | MLM logits with log-saturation and FLOPS regularization, sparse | Inverted | High | Subject of this article[1][2][4][5][6] |
| DPR and successor dense bi-encoders | Dense vector, low-dimensional | ANN/HNSW | Low | Strong in-domain, can degrade zero-shot[13] |
| Cross-encoder rerankers | Pairwise transformer score | None (rerank) | Medium | Used as teachers in SPLADE distillation[4][20] |