fastText

Updated Jun 22, 2026

fastText is an open-source library for learning word embeddings and performing text classification, developed by Facebook AI Research (FAIR) and released to the public on August 18, 2016 ^[1]^[2]. Its defining idea is to represent every word as a bag of character n-grams whose vectors are summed to produce the word representation, rather than treating each word as an atomic unit as word2vec does. This subword approach lets the model generate embeddings for words it never encountered during training, captures regular morphology in inflected languages, and underpins a complementary classification mode that, in FAIR's own benchmark, can "train fastText on more than one billion words in less than ten minutes using a standard multicore CPU, and classify half a million sentences among 312K classes in less than a minute" ^[4].

fastText was authored by Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomáš Mikolov, the same Mikolov who led the original word2vec work at Google in 2013 before joining FAIR around 2014 ^[3]^[5]. The project consists of two foundational papers from 2016 and 2017, a C++ command-line tool, official Python bindings, and a series of pretrained model releases. Its most widely cited contribution, "Enriching Word Vectors with Subword Information," has accumulated more than ten thousand citations ^[3]. Although the GitHub repository was archived on March 19, 2024, fastText models remain in active production use in industrial pipelines, search systems, multilingual NLP, and as a fast baseline for classification tasks ^[6].

What problem do word embeddings solve?

A word embedding is a dense, low-dimensional vector representation of a word, typically with 100 to 300 dimensions, learned from large unlabeled text corpora. The central premise, often summarized as the distributional hypothesis, is that words appearing in similar contexts tend to have similar meanings. By placing those words near one another in a continuous vector space, embeddings expose semantic and syntactic regularities that are difficult to capture with one-hot indicators or sparse co-occurrence matrices.

The modern wave of embedding methods began with word2vec by Mikolov and colleagues at Google in 2013 ^[5]. Word2vec offered two efficient log-linear architectures: continuous bag of words (CBOW), which predicts a target word from its surrounding context, and skip-gram, which predicts surrounding context words from a single target. Both models were trained at scale using negative sampling or hierarchical softmax to avoid the cost of a full softmax over a large vocabulary. The Stanford GloVe model, introduced by Pennington, Socher, and Manning in 2014, took a complementary route by factorizing a global word co-occurrence matrix while preserving similar regularities in the resulting vector space ^[7].

Both word2vec and GloVe shared a structural limitation: each word in the training vocabulary received a single, atomic vector, and any token outside that vocabulary, an out-of-vocabulary word, had no representation at all. The atomic-vector treatment also discards information about a word's internal structure, so morphologically related words such as "play," "plays," "playing," and "played" sit in the model independently rather than sharing parameters that reflect their shared root. For morphologically rich languages such as Finnish, Turkish, Czech, or Arabic, where a single root can produce dozens or hundreds of surface forms, this is a serious source of data sparsity. fastText was designed to address exactly this gap.

How does the subword representation work?

The core innovation of fastText is subword embedding, introduced in Bojanowski, Grave, Joulin, and Mikolov's 2017 TACL paper "Enriching Word Vectors with Subword Information" ^[3]. As the authors describe it, "we propose a new approach based on the skipgram model, where each word is represented as a bag of character n-grams. A vector representation is associated to each character n-gram; words being represented as the sum of these representations" ^[3]. Instead of associating each word in the vocabulary with a single vector, fastText associates a vector with every character n-gram appearing in the corpus, and represents each word as the sum of the vectors of its constituent n-grams plus a special vector for the word itself.

To extract the n-grams, the model first wraps each word in special boundary tokens. The angle brackets `<` and `>` are prepended and appended so that prefixes and suffixes can be distinguished from interior n-grams. fastText then enumerates all character n-grams of length between a minimum `minn` and a maximum `maxn`. The default settings in the library use n-grams from 3 to 6 characters ^[6]^[8].

For example, the word "apple" with the default range produces the following set:

n	n-grams generated for "apple"
3	`<ap`, `app`, `ppl`, `ple`, `le>`
4	`<app`, `appl`, `pple`, `ple>`
5	`<appl`, `apple`, `pple>`
6	`<apple`, `apple>`

In addition to those n-gram vectors, the model maintains a separate vector for the word with the boundary tokens, written <apple>, which lets fastText distinguish the full word from any of its substrings. The embedding for "apple" is then the elementwise sum of all these vectors. Because the n-gram vectors are shared across every word that contains those substrings, two morphologically related words such as "played" and "playing" automatically share many of the same parameters through their common stems and affixes, encouraging similar representations even when the two surface forms appear in different distributions.
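
The enumeration described above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the library's own code: the function name `char_ngrams` is hypothetical, and it works on Python characters rather than the raw bytes the C++ tool processes.

```python
def char_ngrams(word: str, minn: int = 3, maxn: int = 6) -> list[str]:
    """Return the character n-grams of `word` after wrapping it in < and >."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        # Slide a window of width n across the wrapped word.
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


grams = char_ngrams("apple")
print(len(grams))   # 14 n-grams, matching the rows of the table above
print(grams[:5])    # the five 3-grams: <ap, app, ppl, ple, le>
```

The full-word token `<apple>` is not produced by this loop at the default range because it is seven characters long; in the description above it is a separate vector that is added to the sum.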

This subword decomposition is what gives fastText its out-of-vocabulary capability. When asked for a vector for a word that never appeared in training, the model still has vectors for the word's constituent n-grams (most of which will have been seen in other words), and can compose a usable embedding by summing those subword vectors. Static word2vec or GloVe models cannot do this at all; they simply have no entry for the unseen token. The price is model size: fastText must learn many more parameters than a comparable word-level model, because the number of distinct n-grams in a large corpus typically exceeds the number of distinct words.

To control the resulting memory footprint, fastText applies the hashing trick. Each n-gram string is mapped to an integer in a fixed range using the FNV-1a variant of the Fowler-Noll-Vo hash function, and the model maintains an embedding table indexed by that hashed value ^[8]^[9]. The default bucket size is 2,000,000 ^[8]. Hash collisions cause distinct n-grams to share an embedding, but in practice this is acceptable because the most informative n-grams tend to occur frequently enough to dominate the shared parameter, and uncommon collisions act as a mild form of regularization. Crucially, the hashing trick decouples model size from vocabulary growth, so adding more text never enlarges the n-gram parameter table.
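
A minimal sketch of the hashing trick, assuming the standard 32-bit FNV-1a constants and the default bucket count quoted above. The helper names are hypothetical, and details of the library's own implementation (for example how it treats non-ASCII bytes, or how n-gram rows are laid out next to word rows) are not reproduced here.

```python
FNV_OFFSET_BASIS = 2166136261  # standard 32-bit FNV offset basis
FNV_PRIME = 16777619           # standard 32-bit FNV prime


def fnv1a_32(text: str) -> int:
    """Standard 32-bit FNV-1a over the UTF-8 bytes of `text`."""
    h = FNV_OFFSET_BASIS
    for byte in text.encode("utf-8"):
        h ^= byte                          # FNV-1a: XOR first...
        h = (h * FNV_PRIME) & 0xFFFFFFFF   # ...then multiply, kept to 32 bits
    return h


def bucket_index(ngram: str, bucket: int = 2_000_000) -> int:
    """Map an n-gram string to a row of a fixed-size embedding table."""
    return fnv1a_32(ngram) % bucket


for gram in ["<ap", "app", "ple>"]:
    print(gram, bucket_index(gram))
```

Because the table has a fixed number of rows, two different n-grams can land on the same index; that is the collision behaviour discussed above.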

How is fastText trained?

For unsupervised word representation learning, fastText extends the skip-gram model from word2vec. Given a center word, the network is trained to predict surrounding context words within a fixed window. The change introduced by fastText is that the input representation for the center word is no longer a single vector lookup but the sum of the lookup vectors for its character n-grams (plus the full-word vector) ^[3].

In matrix terms, if a word w is represented as a set of n-gram indices G_w, and z_g is the embedding vector for n-gram g, then the center-word vector used in scoring is s_w = sum over g in G_w of z_g. The score between a center word w and a candidate context word c, with context-side vector v_c, is the dot product s_w . v_c. Training maximizes this score for true (word, context) pairs and minimizes it for randomly sampled negatives.
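
The scoring rule can be traced on a toy example. The two-dimensional vectors below are invented purely for illustration; only the structure (sum the n-gram vectors, then take a dot product with the context vector) comes from the description above.

```python
def sum_vectors(vectors):
    """Elementwise sum of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical embeddings z_g for the n-grams of one word, plus its full-word vector.
z = {
    "<ca": [0.1, 0.2],
    "cat": [0.3, -0.1],
    "at>": [0.0, 0.4],
    "<cat>": [0.2, 0.2],
}
G_w = ["<ca", "cat", "at>", "<cat>"]   # n-gram set of the center word
v_c = [1.0, 0.5]                        # hypothetical context-side vector

s_w = sum_vectors([z[g] for g in G_w])  # center-word vector, close to [0.6, 0.7]
score = dot(s_w, v_c)                   # close to 0.95
print(s_w, score)
```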

fastText supports three loss functions, controlled by the `-loss` flag: `ns` (negative sampling), `hs` (hierarchical softmax), and `softmax` (the full softmax) ^[10]. Negative sampling is the default for the `skipgram` and `cbow` modes and is recommended for large vocabularies. With negative sampling, every positive (word, context) pair is paired with a fixed number of negative samples (5 by default) drawn from a noise distribution proportional to a smoothed unigram frequency, so the model learns to separate observed pairs from random ones. Hierarchical softmax replaces the full softmax with a binary tree over the vocabulary and is useful when the vocabulary is very large but a normalized probability over all outputs is still needed. The plain softmax option is mainly used for small output sets, especially in supervised classification with a modest number of labels.
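
As a rough sketch of the negative-sampling objective for a single training pair, assuming the binary logistic loss log(1 + exp(-x)) used in the subword paper ^[3]. The function names are hypothetical, and the scores are passed in directly rather than computed from embeddings.

```python
import math


def logistic_loss(x: float) -> float:
    """log(1 + exp(-x)), written in a form that avoids overflow for large |x|."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def negative_sampling_loss(pos_score: float, neg_scores: list[float]) -> float:
    """Loss for one true (word, context) pair and its sampled negatives.

    The true pair is pushed toward a high score, each negative toward a low one.
    """
    return logistic_loss(pos_score) + sum(logistic_loss(-s) for s in neg_scores)


# A well-separated example has a lower loss than a poorly separated one.
print(negative_sampling_loss(3.0, [-2.0, -1.5]))
print(negative_sampling_loss(-1.0, [2.0, 0.5]))
```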

The library also supports CBOW training in unsupervised mode, where the architecture is reversed: the model uses the average of context-word representations to predict the center word. The pretrained vectors released for 157 languages in 2018 use a position-weighted CBOW variant rather than skip-gram ^[11]^[13], which the project found gave better quality on those large multilingual training sets. Optimization uses stochastic gradient descent with a linearly decaying learning rate. The default vector dimension is 100 for command-line training, while the publicly released pretrained vectors use 300 ^[10]^[11].

How does the text classification mode work?

The second foundational paper, "Bag of Tricks for Efficient Text Classification" by Joulin, Grave, Bojanowski, and Mikolov, was presented at EACL 2017 in Valencia, Spain ^[4]. It describes fastText's supervised mode, a deliberately simple architecture that nonetheless reaches accuracy comparable to deep convolutional and recurrent baselines on a wide range of text classification tasks while being many orders of magnitude faster. As the abstract puts it, "our fast text classifier fastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation" ^[4].

The classifier maps an input text to a single embedding by averaging the vectors of all its features. The features can be word embeddings, word n-gram embeddings (bigrams or higher), or character n-gram embeddings, depending on configuration. That averaged document vector is then passed through a single linear layer followed by a softmax (or hierarchical softmax) over the label set. There are no recurrent layers, no convolutions, and no nonlinearity between the average and the classifier output, which is exactly why training and inference are so fast.
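
The forward pass described above is small enough to write out in full. The following is a toy sketch with invented two-dimensional feature vectors and weights; it omits training and uses a plain softmax, and the names `average`, `softmax`, and `classify` are hypothetical.

```python
import math


def average(vectors):
    """Elementwise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(feature_vectors, weights, labels):
    """Average the features, apply one linear layer (one weight row per label), then softmax."""
    doc = average(feature_vectors)
    logits = [sum(w * d for w, d in zip(row, doc)) for row in weights]
    return dict(zip(labels, softmax(logits)))


# Hypothetical embeddings for the two features of a short text.
features = [[1.0, 0.0], [0.0, 1.0]]
weights = [[2.0, 2.0], [-2.0, -2.0]]   # rows for "positive" and "negative"
probs = classify(features, weights, ["positive", "negative"])
print(probs)
```

Nothing in this pipeline depends on the order of the features, which is why word n-gram features are the only way this architecture sees local word order.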

The paper benchmarks fastText against character-level convolutional networks (Zhang et al. 2015) and other deep models on eight standard datasets: AG News, Sogou News, DBpedia, Yelp Review Polarity, Yelp Review Full, Yahoo Answers, Amazon Review Full, and Amazon Review Polarity ^[4]. fastText matches or beats most of the deep baselines while training in seconds to minutes on a single multicore CPU rather than hours on a GPU. The headline numbers reported in the paper are striking: training on more than one billion words in less than ten minutes on a standard multicore CPU, and classifying 500,000 sentences across 312,000 categories in under a minute ^[2]^[4].

When the label set is large, fastText uses hierarchical softmax based on a Huffman coding tree to make training and inference logarithmic rather than linear in the number of classes ^[4]. This is what makes the 312,000-class throughput feasible. For tasks with only a handful of labels, the plain softmax loss is typically faster and slightly more accurate. The classifier also accepts word n-gram features through the `-wordNgrams` flag, which captures local word order at low computational cost; bigrams in particular often deliver a meaningful accuracy bump on sentiment tasks at the price of a larger feature table.
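
To see why a Huffman tree helps, the sketch below builds one over hypothetical label counts and reports how many binary decisions each label needs. It is a generic Huffman construction using `heapq`, not the library's implementation; the function name and the counts are invented for illustration.

```python
import heapq
from itertools import count


def huffman_depths(label_counts: dict[str, int]) -> dict[str, int]:
    """Return the tree depth (number of binary decisions) for each label."""
    tie = count()  # tie-breaker so the heap never has to compare label lists
    heap = [(freq, next(tie), [label]) for label, freq in label_counts.items()]
    heapq.heapify(heap)
    depths = {label: 0 for label in label_counts}
    while len(heap) > 1:
        f1, _, labels1 = heapq.heappop(heap)
        f2, _, labels2 = heapq.heappop(heap)
        for label in labels1 + labels2:
            depths[label] += 1  # each merge pushes these labels one level deeper
        heapq.heappush(heap, (f1 + f2, next(tie), labels1 + labels2))
    return depths


print(huffman_depths({"a": 5, "b": 2, "c": 1, "d": 1}))
# {'a': 1, 'b': 2, 'c': 3, 'd': 3}
```

Frequent labels end up near the root, so the average number of decisions per example is lower than it would be in a balanced tree.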

What pretrained models does fastText provide?

FAIR has released several large pretrained model packages that are widely used as drop-in resources.

Wikipedia vectors for 294 languages. The earliest large multilingual release shipped 300-dimensional word vectors trained on Wikipedia for 294 languages, obtained with the skip-gram model from Bojanowski et al. (2016) at default parameters and distributed under the Creative Commons Attribution-Share-Alike 3.0 license ^[17]. This 294-language Wikipedia set was later superseded by the larger Common Crawl plus Wikipedia release covering 157 languages, but it remains available and is still cited for lower-resource languages absent from the newer set.

English vectors trained on Common Crawl and Wikipedia. Following the 2018 LREC paper "Advances in Pre-Training Distributed Word Representations" by Mikolov, Grave, Bojanowski, Puhrsch, and Joulin, fastText released 300-dimensional English word vectors trained on a Common Crawl snapshot containing 600 billion tokens, with a 2,000,000-word vocabulary ^[11]^[12]. A second set covers 1,000,000 words trained on Wikipedia 2017, the UMBC webbase corpus, and the statmt.org news corpus, totaling 16 billion tokens. Variants are distributed both with and without subword information, the former being noticeably more robust to rare or out-of-vocabulary inputs.

Word vectors for 157 languages. The companion 2018 LREC paper "Learning Word Vectors for 157 Languages" by Grave, Bojanowski, Gupta, Joulin, and Mikolov accompanies a release of 300-dimensional vectors trained on Common Crawl and Wikipedia for 157 languages ^[13]. These models use position-weighted CBOW with a window of 5 and 10 negative samples, and they apply language-appropriate tokenizers, including the Stanford segmenter for Chinese, MeCab for Japanese, UETsegmenter for Vietnamese, and ICU tokenization for many other scripts ^[13]. The vectors are distributed under the Creative Commons Attribution-Share-Alike 3.0 license.

Aligned multilingual vectors (MUSE). The MUSE project ("Multilingual Unsupervised or Supervised word Embeddings") aligned fastText Wikipedia vectors for 30 languages into a shared vector space, using either a small bilingual dictionary or fully unsupervised adversarial alignment. The accompanying paper "Word Translation Without Parallel Data" by Conneau, Lample, Ranzato, Denoyer, and Jégou demonstrated that good cross-lingual word translation could be obtained without any parallel corpora when starting from monolingual fastText vectors ^[14].

Language identification. fastText also distributes two compact supervised language identification models, `lid.176.bin` and the quantized `lid.176.ftz`, trained on Wikipedia, Tatoeba, and SETimes data and capable of recognizing 176 languages ^[15]. These models, used widely as preprocessing steps in multilingual pipelines, illustrate the supervised classifier's main practical strengths: tiny models, very fast inference, and accuracy that holds up well on real text.
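
A small wrapper shows how such a model is typically queried from Python. The helper `top_language` is hypothetical; it assumes `model` behaves like the object returned by `fasttext.load_model`, whose `predict(text, k=1)` returns a pair of label and probability sequences with labels of the form `__label__en`.

```python
def top_language(model, text: str) -> tuple[str, float]:
    """Return (language code, probability) for the most likely language."""
    # predict handles one line at a time, so newlines are flattened first.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    return labels[0].removeprefix("__label__"), float(probs[0])


# Usage, assuming the fasttext package is installed and lid.176.ftz was downloaded:
#   import fasttext
#   model = fasttext.load_model("lid.176.ftz")
#   print(top_language(model, "Bonjour tout le monde"))
```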

Pretrained release	Languages	Tokens / corpus	Dim
Wikipedia skip-gram vectors	294	Wikipedia	300
English Common Crawl vectors	1	600B (Common Crawl)	300
English Wikipedia + UMBC + news	1	16B	300
157 languages crawl+wiki	157	Common Crawl + Wikipedia	300
MUSE aligned multilingual	30	Wikipedia	300
`lid.176` language identification	176	Wikipedia, Tatoeba, SETimes	n/a

How small can a fastText model be compressed?

A December 2016 paper, "FastText.zip: Compressing Text Classification Models" by Joulin, Grave, Bojanowski, Douze, Jégou, and Mikolov, introduced model quantization techniques that ship in the main library as the `quantize` subcommand ^[16]. The technique combines product quantization for the embedding tables, a feature pruning step that drops infrequent features, and an entropy-coded representation for the resulting compact codes.

Product quantization splits each embedding vector into several subvectors and clusters each subvector independently with k-means, replacing the original floating-point storage with a few small integer codebook indices per vector. Decoding is a codebook lookup, so the compressed vectors remain cheap to use at inference time. The paper reports compression by two orders of magnitude (around 100x smaller models) with only modest accuracy loss on the standard text classification benchmarks ^[16]. The result is a model that fits comfortably on a mobile phone, which was an explicit motivation for the work.
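
The encode and decode steps of product quantization can be sketched as follows. To keep the example short, the codebooks are hand-written rather than learned with k-means, and all names and values are hypothetical; this illustrates the general technique, not the library's `quantize` code.

```python
def split(vector, num_subvectors):
    """Cut a vector into equally sized consecutive subvectors."""
    size = len(vector) // num_subvectors
    return [vector[i * size:(i + 1) * size] for i in range(num_subvectors)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vector, codebooks):
    """Replace each subvector by the index of its nearest centroid."""
    codes = []
    for sub, centroids in zip(split(vector, len(codebooks)), codebooks):
        distances = [squared_distance(sub, c) for c in centroids]
        codes.append(distances.index(min(distances)))
    return codes


def pq_decode(codes, codebooks):
    """Rebuild an approximate vector by concatenating the chosen centroids."""
    approx = []
    for code, centroids in zip(codes, codebooks):
        approx.extend(centroids[code])
    return approx


# Two codebooks (one per subvector), each with two hypothetical centroids.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[-1.0, 0.0], [1.0, 0.0]],
]
codes = pq_encode([0.9, 1.1, -1.0, 0.1], codebooks)
print(codes)                        # [1, 0]
print(pq_decode(codes, codebooks))  # [1.0, 1.0, -1.0, 0.0]
```

Only the small integer codes need to be stored per vector, while the codebooks are shared across the whole table.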

How does fastText compare to word2vec, GloVe, and BERT?

The value proposition of fastText is most visible against the contemporaries it was designed to compete with: word2vec and GloVe for embeddings, and convolutional or recurrent neural networks for classification.

Method	Embedding type	OOV handling	Polysemy	Typical training cost
word2vec	Word-level static	None	None	CPU minutes to hours
GloVe	Word-level static (matrix factorization)	None	None	CPU hours
fastText	Subword-aware static	Yes (sum of n-grams)	None	CPU minutes to hours
ELMo	Contextual (BiLSTM)	Yes (character CNN)	Yes	GPU days
BERT	Contextual (Transformer)	Yes (subword tokenization)	Yes	GPU/TPU weeks

On intrinsic word similarity and analogy tasks, fastText matched or improved on word2vec's skip-gram for English, and consistently improved over it for morphologically richer languages such as German, Russian, Czech, and Arabic, where the subword vectors carry meaningful linguistic information that word-level models cannot exploit ^[3]. The improvements were largest on rare and inflected words, exactly the regions where word-level models suffer the most.

For text classification, the EACL 2017 paper showed that an averaged-bag-of-features representation followed by a linear classifier reached competitive accuracy on AG News, DBpedia, and the various Yelp and Amazon review datasets while training in seconds on a CPU rather than hours on a GPU ^[4]. The combination of accuracy parity and dramatic speed advantage made fastText a popular choice for production classifiers and as a strong baseline against which to evaluate more elaborate models.

The most consequential limitation of fastText is that it assigns a single vector per word, so it cannot represent polysemy. The word "bank" gets one embedding regardless of whether it appears in a financial or geographic context. Contextual models such as ELMo (2018) and BERT (2018) explicitly compute a different vector for each occurrence of a word given its surrounding sentence, and they routinely outperform fastText embeddings on downstream NLP tasks where context is decisive. fastText's classification mode is also a bag-of-features model: it averages over input features rather than reading them in order, which means it cannot model long-range syntactic structure the way a Transformer can.

What is fastText's influence and where does it fall short?

The Bojanowski et al. 2017 paper has been cited more than ten thousand times and is one of the most influential static-embedding papers in modern NLP ^[3]. The conceptual point that subword units are useful for representing rare and morphologically complex words has carried into a much broader trend toward subword tokenization in modern language models. Byte pair encoding, WordPiece, and SentencePiece, used by GPT, BERT, and most contemporary large language models, all rest on the principle that breaking words into subword pieces yields a vocabulary that gracefully covers rare words and morphology. Strictly speaking, BPE-style tokenization differs from fastText's character n-gram embeddings: BPE chooses one specific segmentation per word at preprocessing time, while fastText sums embeddings over all n-grams in a fixed range. But the underlying intuition that subword sharing pays off in both representation quality and out-of-vocabulary coverage is the same.

fastText has been largely superseded for state-of-the-art accuracy by contextual encoders. Where ELMo, BERT, and the broader transformer family compute fresh, context-sensitive representations every time a word appears, fastText offers only a fixed lookup. For tasks with limited compute, low-resource languages, or where extreme inference speed and small footprint are decisive, fastText nonetheless remains attractive. It is widely used as a fast classification baseline, as initialization for downstream models in low-resource settings, in document deduplication and retrieval pipelines (especially via cosine similarity over averaged embeddings), and as the language identification component in multilingual data preparation, including web-scale corpora used to train modern large language models.

The library has visible limitations beyond the absence of contextualization. Quality degrades when the corpus is small, since each n-gram vector is then estimated from only a few occurrences. Hyperparameter sensitivity is real: minimum and maximum n-gram length, bucket size, vector dimension, learning rate, and number of negative samples all interact with corpus size and language. The supervised mode is essentially a linear classifier over averaged features, so it cannot model interactions or long-range structure beyond what word n-gram features can capture. Finally, while Meta AI (the rebranded successor of Facebook AI Research) maintained the project for several years, the GitHub repository has been archived since March 19, 2024 ^[6], so users should expect the codebase to be feature-frozen.

How do you use fastText in practice?

fastText is written in C++ with a thin command-line interface and an officially supported Python binding distributed on PyPI. The library is released under the MIT License ^[6]. The most common workflows are described below in prose rather than as literal commands.

Unsupervised training of skip-gram word vectors uses the `skipgram` subcommand and produces both a binary `.bin` model file and a text `.vec` file containing the vocabulary vectors. The `cbow` subcommand offers the alternative architecture. Supervised classification is invoked through the `supervised` subcommand and consumes a labeled file where each line begins with a label prefix such as `__label__` followed by the input text. After training, the `predict` and `predict-prob` subcommands produce labels (and probabilities) for new inputs, and the `quantize` subcommand compresses an existing supervised model.

The Python package is installed with `pip install fasttext`. It exposes the same functionality through `fasttext.train_unsupervised`, `fasttext.train_supervised`, and methods on the resulting model object, including `get_word_vector`, `get_sentence_vector`, `predict`, and `quantize`. There are also widely used third-party integrations: Gensim provides a fastText implementation in Python, and bindings exist for R, Java, and Node.js. Hugging Face hosts the official Meta-released vectors and the language identification models in convenient downloadable form.
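
A minimal sketch of the supervised workflow in Python. The helper names and file path are hypothetical; `fasttext.train_supervised` and `predict` are the calls named above, and the `fasttext` import is deferred so that the formatting helpers work without the package installed.

```python
def to_training_line(label: str, text: str) -> str:
    """Format one example as a label prefix followed by the text on a single line."""
    cleaned = " ".join(text.split())  # collapse whitespace, including newlines
    return f"__label__{label} {cleaned}"


def write_training_file(examples, path: str) -> None:
    """Write (label, text) pairs in the one-example-per-line format."""
    with open(path, "w", encoding="utf-8") as handle:
        for label, text in examples:
            handle.write(to_training_line(label, text) + "\n")


def train_and_predict(train_path: str, query: str):
    """Train a classifier with default settings and label one query string."""
    import fasttext  # requires `pip install fasttext`

    model = fasttext.train_supervised(input=train_path)
    return model.predict(query)


print(to_training_line("pos", "A great,  great movie\n"))
# __label__pos A great, great movie
```

A real run would call `write_training_file` on a labeled dataset and then `train_and_predict("train.txt", "some new text")`; a useful model needs far more than a handful of examples.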

Default hyperparameters in fastText follow the values reported in the original papers. For unsupervised training, the typical defaults include vector dimension 100, learning rate 0.05, window size 5, 5 epochs, 5 negative samples, and character n-grams of length 3 to 6 ^[10]. For supervised classification, the defaults differ: learning rate 0.1, vector dimension 100, 5 epochs, no character n-grams by default, and word n-grams of size 1 (unigrams), with bigrams or trigrams enabled through `-wordNgrams 2` or higher ^[10]. The bucket size is 2,000,000 in both modes ^[10].
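
For reference, the defaults listed above can be written as keyword arguments in the style of the Python bindings. The dictionaries simply restate the values in the text; representing "no character n-grams" as `minn=0, maxn=0` is an assumption of this sketch, and `with_overrides` is a hypothetical helper.

```python
# Values restated from the paragraph above; keyword names follow the Python bindings.
UNSUPERVISED_DEFAULTS = dict(
    dim=100, lr=0.05, ws=5, epoch=5, neg=5, minn=3, maxn=6, bucket=2_000_000
)
SUPERVISED_DEFAULTS = dict(
    dim=100, lr=0.1, epoch=5, minn=0, maxn=0, wordNgrams=1, bucket=2_000_000
)


def with_overrides(defaults: dict, **overrides) -> dict:
    """Return a copy of `defaults` with selected settings replaced."""
    return {**defaults, **overrides}


# Example: enable bigram features for a supervised model.
bigram_config = with_overrides(SUPERVISED_DEFAULTS, wordNgrams=2)
print(bigram_config)

# With the fasttext package installed, the settings could be passed on as:
#   fasttext.train_supervised(input="train.txt", **bigram_config)
```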

References

1. Joulin, A. (2016). "fastText". Meta Research blog (Facebook AI Research). https://research.facebook.com/blog/2016/08/fasttext/
2. TechCrunch (August 18, 2016). "Facebook's Artificial Intelligence Research lab releases open source fastText on GitHub". https://techcrunch.com/2016/08/18/facebooks-artificial-intelligence-research-lab-releases-open-source-fasttext-on-github/
3. Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). "Enriching Word Vectors with Subword Information". *Transactions of the Association for Computational Linguistics*, 5, 135-146. arXiv:1607.04606. https://aclanthology.org/Q17-1010/
4. Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2017). "Bag of Tricks for Efficient Text Classification". *Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017)*, Volume 2, pages 427-431, Valencia, Spain. arXiv:1607.01759. https://aclanthology.org/E17-2068/
5. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). "Efficient Estimation of Word Representations in Vector Space". arXiv:1301.3781. https://arxiv.org/abs/1301.3781
6. facebookresearch/fastText. GitHub repository (archived March 19, 2024). https://github.com/facebookresearch/fastText
7. Pennington, J., Socher, R., & Manning, C. D. (2014). "GloVe: Global Vectors for Word Representation". *Proceedings of EMNLP 2014*. https://aclanthology.org/D14-1162/
8. fastText official documentation. "FAQ". https://fasttext.cc/docs/en/faqs.html
9. Fowler, G., Noll, L. C., & Vo, K.-P. "FNV Hash". http://www.isthe.com/chongo/tech/comp/fnv/
10. fastText official documentation. "List of options". https://fasttext.cc/docs/en/options.html
11. Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., & Joulin, A. (2018). "Advances in Pre-Training Distributed Word Representations". *Proceedings of LREC 2018*. arXiv:1712.09405. https://aclanthology.org/L18-1008/
12. fastText official documentation. "English word vectors". https://fasttext.cc/docs/en/english-vectors.html
13. Grave, E., Bojanowski, P., Gupta, P., Joulin, A., & Mikolov, T. (2018). "Learning Word Vectors for 157 Languages". *Proceedings of LREC 2018*. http://www.lrec-conf.org/proceedings/lrec2018/pdf/721.pdf
14. Conneau, A., Lample, G., Ranzato, M., Denoyer, L., & Jégou, H. (2018). "Word Translation Without Parallel Data". *ICLR 2018*. arXiv:1710.04087. https://arxiv.org/abs/1710.04087
15. fastText official documentation. "Language identification". https://fasttext.cc/docs/en/language-identification.html
16. Joulin, A., Grave, E., Bojanowski, P., Douze, M., Jégou, H., & Mikolov, T. (2016). "FastText.zip: Compressing Text Classification Models". arXiv:1612.03651. https://arxiv.org/abs/1612.03651
17. fastText official documentation. "Wiki word vectors" (pretrained vectors for 294 languages). https://fasttext.cc/docs/en/pretrained-vectors.html
18. fastText official website. https://fasttext.cc/
