fastText
Last reviewed
Apr 30, 2026
Sources
18 citations
Review status
Source-backed
Revision
v2 · 3,666 words
fastText is an open-source library for learning word embeddings and performing text classification, developed by Facebook AI Research (FAIR) and released to the public on August 18, 2016 [1][2]. The library builds on the principles of word2vec but extends them in a key way: instead of treating each word as an atomic unit with its own vector, fastText represents every word as a bag of character n-grams whose vectors are summed to produce the word representation. This subword approach lets the model generate embeddings for words it never encountered during training, captures regular morphology in inflected languages, and underpins a complementary classification mode that runs orders of magnitude faster than contemporary deep learning baselines [3][4].
fastText was authored by Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomáš Mikolov, the same Mikolov who led the original word2vec work at Google in 2013 before joining FAIR around 2014 [3][5]. The project consists of two foundational papers from 2016 and 2017, a C++ command-line tool, official Python bindings, and a series of pretrained model releases. Its most widely cited contribution, "Enriching Word Vectors with Subword Information," has accumulated more than ten thousand citations [3]. Although the GitHub repository was archived on March 19, 2024, fastText models remain in active production use in industrial pipelines, search systems, multilingual NLP, and as a fast baseline for classification tasks [6].
A word embedding is a dense, low-dimensional vector representation of a word, typically with 100 to 300 dimensions, learned from large unlabeled text corpora. The central premise, often summarized as the distributional hypothesis, is that words appearing in similar contexts tend to have similar meanings. By placing those words near one another in a continuous vector space, embeddings expose semantic and syntactic regularities that are difficult to capture with one-hot indicators or sparse co-occurrence matrices.
The modern wave of embedding methods began with word2vec by Mikolov and colleagues at Google in 2013 [5]. Word2vec offered two efficient log-linear architectures: continuous bag of words (CBOW), which predicts a target word from its surrounding context, and skip-gram, which predicts surrounding context words from a single target. Both models were trained at scale using negative sampling or hierarchical softmax to avoid the cost of a full softmax over a large vocabulary. The Stanford GloVe model, introduced by Pennington, Socher, and Manning in 2014, took a complementary route by factorizing a global word co-occurrence matrix while preserving similar regularities in the resulting vector space [7].
Both word2vec and GloVe shared a structural limitation: each word in the training vocabulary received a single, atomic vector, and any token outside that vocabulary, an out-of-vocabulary word, had no representation at all. The atomic-vector treatment also discards information about a word's internal structure, so morphologically related words such as "play," "plays," "playing," and "played" sit in the model independently rather than sharing parameters that reflect their shared root. For morphologically rich languages such as Finnish, Turkish, Czech, or Arabic, where a single root can produce dozens or hundreds of surface forms, this is a serious source of data sparsity. fastText was designed to address exactly this gap.
The core innovation of fastText is subword embedding, introduced in Bojanowski, Grave, Joulin, and Mikolov's 2017 TACL paper "Enriching Word Vectors with Subword Information" [3]. Instead of associating each word in the vocabulary with a single vector, fastText associates a vector with every character n-gram appearing in the corpus, and represents each word as the sum of the vectors of its constituent n-grams plus a special vector for the word itself.
To extract the n-grams, the model first wraps each word in special boundary tokens. The angle brackets < and > are prepended and appended so that prefixes and suffixes can be distinguished from interior n-grams. fastText then enumerates all character n-grams of length between a minimum minn and a maximum maxn. The default settings in the library use n-grams from 3 to 6 characters [6][8].
For example, the word "apple" with the default range produces the following set:
| n | n-grams generated for "apple" |
|---|---|
| 3 | <ap, app, ppl, ple, le> |
| 4 | <app, appl, pple, ple> |
| 5 | <appl, apple, pple> |
| 6 | <apple, apple> |
In addition to those n-gram vectors, the model maintains a separate vector for the word with the boundary tokens, written <apple>, which lets fastText distinguish the full word from any of its substrings. The embedding for "apple" is then the elementwise sum of all these vectors. Because the n-gram vectors are shared across every word that contains those substrings, two morphologically related words such as "played" and "playing" automatically share many of the same parameters through their common stems and affixes, encouraging similar representations even when the two surface forms appear in different distributions.
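The enumeration described above can be sketched in a few lines of Python. This is a minimal illustration, not the library's own code: the function names `char_ngrams` and `subword_units` are hypothetical, and the sketch slices Python characters, whereas a production implementation would have to deal with byte-level encoding details.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word` wrapped in < and > boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


def subword_units(word, minn=3, maxn=6):
    """N-grams plus the full-word token, deduplicated while keeping order."""
    units = char_ngrams(word, minn, maxn) + [f"<{word}>"]
    return list(dict.fromkeys(units))


# Lists the n-grams from the table, followed by the full-word token <apple>.
print(subword_units("apple"))
```

For short words the full-word token can coincide with one of the n-grams (for example `<the>` is itself a 5-gram), which is why the sketch deduplicates the list.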
This subword decomposition is what gives fastText its out-of-vocabulary capability. When asked for a vector for a word that never appeared in training, the model still has vectors for the word's constituent n-grams (most of which will have been seen in other words), and can compose a usable embedding by summing those subword vectors. Static word2vec or GloVe models cannot do this at all; they simply have no entry for the unseen token. The price is a larger model: fastText must learn many more parameters than a comparable word-level model, because the number of distinct n-grams in a large corpus typically exceeds the number of distinct words.
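As a rough illustration of that composition step, the sketch below sums n-gram vectors from a small dictionary. The table `toy_ngram_vectors`, its values, and the helper `compose_vector` are invented for the example; a real model would look n-grams up in a learned (and hashed) table rather than a Python dict.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of the word wrapped in < and > boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(wrapped) - n + 1)]


def compose_vector(word, ngram_vectors, dim):
    """Sum the vectors of every known n-gram of `word`; unknown n-grams add nothing."""
    total = [0.0] * dim
    for gram in char_ngrams(word):
        vec = ngram_vectors.get(gram)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total


# Toy 2-dimensional table, as if learned from "played" and "playing" (values are made up).
toy_ngram_vectors = {
    "<pl": [1.0, 0.0],
    "pla": [0.5, 0.5],
    "lay": [0.0, 1.0],
}

# "replay" is treated as unseen, yet it shares "pla" and "lay" with the seen words.
oov_vector = compose_vector("replay", toy_ngram_vectors, dim=2)
print(oov_vector)
```

Only the n-grams that "replay" shares with the toy table contribute, so the unseen word still receives a non-zero vector built from familiar pieces.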
To control the resulting memory footprint, fastText applies the hashing trick. Each n-gram string is mapped to an integer in a fixed range using the FNV-1a variant of the Fowler-Noll-Vo hash function, and the model maintains an embedding table indexed by that hashed value [8][9]. The default bucket size is 2,000,000 [8]. Hash collisions cause distinct n-grams to share an embedding, but in practice this is acceptable because the most informative n-grams tend to occur frequently enough to dominate the shared parameter, and uncommon collisions act as a mild form of regularization. Crucially, the hashing trick decouples model size from vocabulary growth, so adding more text never enlarges the n-gram parameter table.
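The mapping from an n-gram string to a table row can be sketched as follows. This is a textbook 32-bit FNV-1a over UTF-8 bytes followed by a modulo; the helper names are hypothetical, and the library's exact treatment of non-ASCII bytes and of row offsets inside its parameter matrix may differ from this simplified version.

```python
FNV_OFFSET_BASIS = 2166136261  # 32-bit FNV offset basis
FNV_PRIME = 16777619           # 32-bit FNV prime


def fnv1a_32(text):
    """32-bit FNV-1a hash of the UTF-8 bytes of `text`."""
    h = FNV_OFFSET_BASIS
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF  # keep the value within 32 bits
    return h


def bucket_index(ngram, bucket=2_000_000):
    """Row of the shared n-gram table that this n-gram is mapped to."""
    return fnv1a_32(ngram) % bucket


for gram in ["<ap", "app", "ple", "le>"]:
    print(gram, bucket_index(gram))
```

Because the index is always reduced modulo the bucket count, the table has a fixed number of rows no matter how many distinct n-grams the corpus contains; two n-grams that land on the same row simply share a vector.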
For unsupervised word representation learning, fastText extends the skip-gram model from word2vec. Given a center word, the network is trained to predict surrounding context words within a fixed window. The change introduced by fastText is that the input representation for the center word is no longer a single vector lookup but the sum of the lookup vectors for its character n-grams (plus the full-word vector) [3].
In matrix terms, if a word w is represented as a set of n-gram indices G_w, and z_g is the embedding vector for n-gram g, then the center-word vector used in scoring is s_w = sum over g in G_w of z_g. The score between a center word w and a candidate context word c, with context-side vector v_c, is the dot product s_w . v_c. Training maximizes this score for true (word, context) pairs and minimizes it for randomly sampled negatives.
The -loss flag selects the loss function; the three options discussed here are ns (negative sampling), hs (hierarchical softmax), and softmax (the full softmax) [10]. Negative sampling is the default for skipgram and cbow modes and is recommended for large vocabularies. With negative sampling, every positive (word, context) pair is paired with a fixed number of negative samples (commonly 5) drawn from a noise distribution proportional to a smoothed unigram frequency, so each update contrasts one observed pair against a handful of unlikely ones instead of the whole vocabulary. Hierarchical softmax replaces the full softmax with a binary tree over the vocabulary and is useful when the vocabulary is very large but a properly normalized probability distribution is still wanted. The plain softmax option is mainly used for small vocabularies, especially in supervised classification with a modest number of labels.
The library also supports CBOW training in unsupervised mode, where the architecture is reversed: the model uses the average of context-word representations to predict the center word. The pretrained vectors released for 157 languages in 2018 use a position-weighted CBOW variant rather than skip-gram [11], which the project found gave better quality on those large multilingual training sets. Default optimization is stochastic gradient descent with a linearly decaying learning rate. The default vector dimension is 100 for command-line training, while the publicly released pretrained vectors use 300 dimensions [10][11].
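The scoring function and the negative-sampling objective can be written out as a small sketch. It assumes the standard logistic formulation (log-sigmoid of the score for the true pair, log-sigmoid of the negated score for each negative); the function names are hypothetical, and the sampling of negatives from the noise distribution is left out.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def center_vector(ngram_vectors):
    """s_w: elementwise sum of the n-gram vectors z_g of the center word."""
    return [sum(column) for column in zip(*ngram_vectors)]


def negative_sampling_loss(s_w, v_context, v_negatives):
    """Low when s_w . v_c is large for the true context and small for negatives."""
    loss = -log_sigmoid(dot(s_w, v_context))
    for v_neg in v_negatives:
        loss -= log_sigmoid(-dot(s_w, v_neg))
    return loss


# Toy vectors (made up): the context vector points the same way as s_w,
# the single negative points the opposite way, so the loss is small.
s_w = center_vector([[0.5, 0.0], [0.5, 0.0]])
print(negative_sampling_loss(s_w, [5.0, 0.0], [[-5.0, 0.0]]))
```

Each training step touches only the true context vector and the few sampled negatives, which is why the cost per update does not grow with the vocabulary size.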
The second foundational paper, "Bag of Tricks for Efficient Text Classification" by Joulin, Grave, Bojanowski, and Mikolov, was presented at EACL 2017 in Valencia, Spain [4]. It describes fastText's supervised mode, a deliberately simple architecture that nonetheless reaches accuracy comparable to deep convolutional and recurrent baselines on a wide range of text classification tasks while being many orders of magnitude faster.
The classifier maps an input text to a single embedding by averaging the vectors of all its features. The features can be word embeddings, word n-gram embeddings (bigrams or higher), or character n-gram embeddings, depending on configuration. That averaged document vector is then passed through a single linear layer followed by a softmax (or hierarchical softmax) over the label set. There are no recurrent layers, no convolutions, and no nonlinearity between the average and the classifier output, which is exactly why training and inference are so fast.
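A minimal sketch of that forward pass, using plain Python lists, might look like the following. The embeddings, weights, and label order are toy values invented for the example, and the sketch assumes at least one input token has an embedding.

```python
import math


def average(vectors):
    """Mean of equal-length feature vectors: the document representation."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(tokens, embeddings, weights):
    """Average the token embeddings, apply one linear layer, then softmax."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    doc = average(known)
    scores = [sum(w * x for w, x in zip(row, doc)) for row in weights]
    return softmax(scores)


# Toy parameters (made up): 2-dimensional embeddings, 2 labels.
toy_embeddings = {"great": [1.0, 0.0], "awful": [0.0, 1.0], "movie": [0.5, 0.5]}
toy_weights = [[2.0, -2.0],   # row for label 0, e.g. "positive"
               [-2.0, 2.0]]   # row for label 1, e.g. "negative"

probs = classify("great movie".split(), toy_embeddings, toy_weights)
print(probs)
```

The whole prediction is one average and one matrix-vector product, with no step that depends on token order unless word n-gram features are added to the input.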
The paper benchmarks fastText against character-level convolutional networks (Zhang et al. 2015) and other deep models on eight standard datasets: AG News, Sogou News, DBpedia, Yelp Review Polarity, Yelp Review Full, Yahoo Answers, Amazon Review Full, and Amazon Review Polarity [4]. fastText matches or beats most of the deep baselines while training in seconds to minutes on a single multicore CPU rather than hours on a GPU. The headline numbers reported in the paper are striking: training on more than one billion words in less than ten minutes on a standard multicore CPU, and classifying 500,000 sentences across 312,000 categories in under a minute [2][4].
When the label set is large, fastText uses hierarchical softmax based on a Huffman coding tree to make training and inference logarithmic rather than linear in the number of classes [4]. This is what makes the 312,000-class throughput feasible. For tasks with only a handful of labels, the plain softmax loss is typically faster and slightly more accurate. The classifier also accepts word n-gram features through the -wordNgrams flag, which captures local word order at low computational cost; bigrams in particular often deliver a meaningful accuracy bump on sentiment tasks at the price of a larger feature table.
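To see why a Huffman tree keeps the work per prediction small, the sketch below computes how deep each label sits in such a tree, which equals the number of binary decisions needed to reach it. It is a generic Huffman construction using the standard library, with made-up label counts; it does not reproduce the library's internal tree layout.

```python
import heapq
import itertools


def huffman_code_lengths(label_counts):
    """Return {label: depth in the Huffman tree} for a dict of label frequencies."""
    tie_breaker = itertools.count()  # unique ints, so the dicts are never compared
    heap = [(count, next(tie_breaker), {label: 0})
            for label, count in label_counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        count_a, _, depths_a = heapq.heappop(heap)
        count_b, _, depths_b = heapq.heappop(heap)
        # Merging two subtrees pushes every label in them one level deeper.
        merged = {label: depth + 1
                  for label, depth in {**depths_a, **depths_b}.items()}
        heapq.heappush(heap, (count_a + count_b, next(tie_breaker), merged))
    return heap[0][2]


lengths = huffman_code_lengths({"sports": 50, "politics": 25, "tech": 15, "chess": 10})
print(lengths)
```

Frequent labels end up near the root and rare labels deeper, so the expected number of decisions per example stays far below the number of classes.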
FAIR has released several large pretrained model packages that are widely used as drop-in resources.
English vectors trained on Common Crawl and Wikipedia. Following the 2018 LREC paper "Advances in Pre-Training Distributed Word Representations" by Mikolov, Grave, Bojanowski, Puhrsch, and Joulin, fastText released 300-dimensional English word vectors trained on a Common Crawl snapshot containing 600 billion tokens, with a 2,000,000-word vocabulary [11][12]. A second set covers 1,000,000 words trained on Wikipedia 2017, the UMBC webbase corpus, and the statmt.org news corpus, totaling 16 billion tokens. Variants are distributed both with and without subword information, the former being noticeably more robust to rare or out-of-vocabulary inputs.
Word vectors for 157 languages. The companion 2018 LREC paper "Learning Word Vectors for 157 Languages" by Grave, Bojanowski, Gupta, Joulin, and Mikolov accompanies a release of 300-dimensional vectors trained on Common Crawl and Wikipedia for 157 languages [13]. These models use position-weighted CBOW with a window of 5 and 10 negative samples, and they apply language-appropriate tokenizers, including the Stanford segmenter for Chinese, MeCab for Japanese, UETsegmenter for Vietnamese, and ICU tokenization for many other scripts [13]. The vectors are distributed under the Creative Commons Attribution-Share-Alike 3.0 license.
Aligned multilingual vectors (MUSE). The MUSE project ("Multilingual Unsupervised or Supervised word Embeddings") aligned fastText Wikipedia vectors for 30 languages into a shared vector space, using either a small bilingual dictionary or fully unsupervised adversarial alignment. The accompanying paper "Word Translation Without Parallel Data" by Conneau, Lample, Ranzato, Denoyer, and Jégou demonstrated that good cross-lingual word translation could be obtained without any parallel corpora when starting from monolingual fastText vectors [14].
Language identification. fastText also distributes two compact supervised language identification models, lid.176.bin and the quantized lid.176.ftz, trained on Wikipedia, Tatoeba, and SETimes data and capable of recognizing 176 languages [15]. These models, used widely as preprocessing steps in multilingual pipelines, illustrate the supervised classifier's main practical strengths: tiny models, very fast inference, and accuracy that holds up well on real text.
| Pretrained release | Languages | Tokens / corpus | Dim |
|---|---|---|---|
| English Common Crawl vectors | 1 | 600B (Common Crawl) | 300 |
| English Wikipedia + UMBC + news | 1 | 16B | 300 |
| 157 languages crawl+wiki | 157 | Common Crawl + Wikipedia | 300 |
| MUSE aligned multilingual | 30 | Wikipedia | 300 |
| lid.176 language identification | 176 | Wikipedia, Tatoeba, SETimes | n/a |
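Calling one of the language identification models from Python could look roughly like the sketch below. It assumes the `fasttext` package is installed and that `lid.176.ftz` has already been downloaded; the local path and the helper names are hypothetical.

```python
def strip_label_prefix(label, prefix="__label__"):
    """Turn '__label__en' into 'en'."""
    return label[len(prefix):] if label.startswith(prefix) else label


def identify_language(text, model_path="lid.176.ftz"):
    """Return (language code, probability) for one piece of text.

    Assumes the `fasttext` package is installed and that the model file
    exists at `model_path` (a hypothetical local path).
    """
    import fasttext  # imported lazily so the helper above works without it

    model = fasttext.load_model(model_path)
    # predict() expects a single line, so newlines are replaced first.
    labels, probabilities = model.predict(text.replace("\n", " "), k=1)
    return strip_label_prefix(labels[0]), float(probabilities[0])
```

In a real pipeline the model would be loaded once and reused across many inputs rather than reloaded on every call as in this sketch.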
A December 2016 paper, "FastText.zip: Compressing Text Classification Models" by Joulin, Grave, Bojanowski, Douze, Jégou, and Mikolov, introduced model quantization techniques that ship in the main library as the quantize subcommand [16]. The technique combines product quantization for the embedding tables, a feature pruning step that drops infrequent features, and an entropy-coded representation for the resulting compact codes.
Product quantization splits each embedding vector into several subvectors and clusters each subvector independently with k-means, replacing the original floating-point storage with a few small integer codebook indices per vector. Decoding is a lookup, and the cost of the lookup at inference time is far smaller than the original dot product against a dense matrix. The paper reports compression by two orders of magnitude (around 100x smaller models) with only modest accuracy loss on the standard text classification benchmarks [16]. The result is a model that fits comfortably on a mobile phone, which was an explicit motivation for the work.
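The encode and decode steps of product quantization can be illustrated with hand-written codebooks. In the sketch below the codebooks are toy values rather than k-means output, the function names are hypothetical, and the pruning step described in the paper is not covered.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vector, codebooks):
    """Replace each subvector by the index of its nearest centroid."""
    sub_dim = len(vector) // len(codebooks)
    codes = []
    for m, centroids in enumerate(codebooks):
        sub = vector[m * sub_dim:(m + 1) * sub_dim]
        nearest = min(range(len(centroids)),
                      key=lambda k: squared_distance(sub, centroids[k]))
        codes.append(nearest)
    return codes


def pq_decode(codes, codebooks):
    """Approximate the original vector by concatenating the chosen centroids."""
    approx = []
    for code, centroids in zip(codes, codebooks):
        approx.extend(centroids[code])
    return approx


# Two hand-written codebooks (in practice k-means would learn them), each
# covering one 2-dimensional half of a 4-dimensional vector.
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
codes = pq_encode([0.9, 1.1, 0.2, 0.8], toy_codebooks)
print(codes, pq_decode(codes, toy_codebooks))
```

The stored form of the vector is just the list of small integers in `codes`; the floating-point values live only once, in the shared codebooks.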
The value proposition of fastText is most visible against the contemporaries it was designed to compete with: word2vec and GloVe for embeddings, and convolutional or recurrent neural networks for classification.
| Method | Embedding type | OOV handling | Polysemy | Typical training cost |
|---|---|---|---|---|
| word2vec | Word-level static | None | None | CPU minutes to hours |
| GloVe | Word-level static (matrix factorization) | None | None | CPU hours |
| fastText | Subword-aware static | Yes (sum of n-grams) | None | CPU minutes to hours |
| ELMo | Contextual (BiLSTM) | Yes (character CNN) | Yes | GPU days |
| BERT | Contextual (Transformer) | Yes (subword tokenization) | Yes | GPU/TPU weeks |
On intrinsic word similarity and analogy tasks, fastText matched or improved on word2vec's skip-gram for English, and consistently improved over it for morphologically richer languages such as German, Russian, Czech, and Arabic, where the subword vectors carry meaningful linguistic information that word-level models cannot exploit [3]. The improvements were largest on rare and inflected words, exactly the regions where word-level models suffer the most.
For text classification, the EACL 2017 paper showed that an averaged-bag-of-features representation followed by a linear classifier reached competitive accuracy on AG News, DBpedia, and the various Yelp and Amazon review datasets while training in seconds on a CPU rather than hours on a GPU [4]. The combination of accuracy parity and dramatic speed advantage made fastText a popular choice for production classifiers and as a strong baseline against which to evaluate more elaborate models.
The most consequential limitation of fastText is that it assigns a single vector per word, so it cannot represent polysemy. The word "bank" gets one embedding regardless of whether it appears in a financial or geographic context. Contextual models such as ELMo (2018) and BERT (2018) explicitly compute a different vector for each occurrence of a word given its surrounding sentence, and they routinely outperform fastText embeddings on downstream NLP tasks where context is decisive. fastText's classification mode is also a bag-of-features model: it averages over input features rather than reading them in order, which means it cannot model long-range syntactic structure the way a Transformer can.
The Bojanowski et al. 2017 paper has been cited more than ten thousand times and is one of the most influential static-embedding papers in modern NLP [3]. The conceptual point that subword units are useful for representing rare and morphologically complex words has carried into a much broader trend toward subword tokenization in modern language models. Byte pair encoding, WordPiece, and SentencePiece, used by GPT, BERT, and most contemporary large language models, all rest on the principle that breaking words into subword pieces yields a vocabulary that gracefully covers rare words and morphology. Strictly speaking, BPE-style tokenization differs from fastText's character n-gram embeddings: BPE chooses one specific segmentation per word at preprocessing time, while fastText sums embeddings over all n-grams in a fixed range. But the underlying intuition that subword sharing pays off in both representation quality and out-of-vocabulary coverage is the same.
fastText has been largely superseded for state-of-the-art accuracy by contextual encoders. Where ELMo, BERT, and the broader transformer family compute fresh, context-sensitive representations every time a word appears, fastText offers only a fixed lookup. For tasks with limited compute, low-resource languages, or where extreme inference speed and small footprint are decisive, fastText nonetheless remains attractive. It is widely used as a fast classification baseline, as initialization for downstream models in low-resource settings, in document deduplication and retrieval pipelines (especially via cosine similarity over averaged embeddings), and as the language identification component in multilingual data preparation, including web-scale corpora used to train modern large language models.
The library has visible limitations beyond the absence of contextualization. Quality degrades when the corpus is small, since many n-gram vectors then receive too few updates to be estimated reliably. Hyperparameter sensitivity is real: minimum and maximum n-gram length, bucket size, vector dimension, learning rate, and number of negative samples all interact with corpus size and language. The supervised mode is essentially a linear classifier over averaged features, so it cannot model interactions or long-range structure beyond what word n-gram features can capture. Finally, while Meta AI (the rebranded successor of Facebook AI Research) maintained the project for several years, the GitHub repository has been archived since March 19, 2024 [6], so users should expect the codebase to be feature-frozen.
fastText is written in C++ with a thin command-line interface and an officially supported Python binding distributed on PyPI. The library is released under the MIT License [6]. The most common interactions are described below in prose rather than as literal commands.
Unsupervised training of skip-gram word vectors uses the skipgram subcommand and produces both a binary .bin model file and a text .vec file containing the vocabulary vectors. The cbow subcommand offers the alternative architecture. Supervised classification is invoked through the supervised subcommand and consumes a labeled file where each line begins with a label prefix such as __label__ followed by the input text. After training, the predict and predict-prob subcommands produce labels (and probabilities) for new inputs, and the quantize subcommand compresses an existing supervised model.
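Preparing a training file in the labeled format described above can be sketched as follows. The helper names are hypothetical, and the sketch assumes that several labels may be listed at the start of a line, each carrying the `__label__` prefix.

```python
def to_training_line(labels, text, prefix="__label__"):
    """Format one example as '__label__a __label__b some text' on a single line."""
    flat_text = " ".join(text.split())  # collapse newlines and repeated spaces
    tagged = " ".join(f"{prefix}{label}" for label in labels)
    return f"{tagged} {flat_text}"


def write_training_file(examples, path):
    """Write (labels, text) pairs to `path`, one example per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for labels, text in examples:
            handle.write(to_training_line(labels, text) + "\n")


print(to_training_line(["positive"], "Great  phone,\nfast delivery"))
```

Flattening the text matters because the trainer reads one example per line; an embedded newline would split a single document into two examples, the second without a label.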
The Python package is installed with pip install fasttext. It exposes the same functionality through fasttext.train_unsupervised, fasttext.train_supervised, and methods on the resulting model object, including get_word_vector, get_sentence_vector, predict, and quantize. There are also widely used third-party integrations: Gensim provides a fastText implementation in Python, and bindings exist for R, Java, and Node.js. Hugging Face hosts the official Meta-released vectors and the language identification models in convenient downloadable form.
Default hyperparameters in fastText follow the values reported in the original papers. For unsupervised training, the typical defaults include vector dimension 100, learning rate 0.05, window size 5, 5 epochs, 5 negative samples, and character n-grams of length 3 to 6 [10]. For supervised classification, the defaults differ: learning rate 0.1, vector dimension 100, 5 epochs, no character n-grams by default, and word n-grams of size 1 (unigrams), with bigrams or trigrams enabled through -wordNgrams 2 or higher [10]. The bucket size is 2,000,000 in both modes [10].
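Put together, the Python entry points and the defaults listed above might be used as in the sketch below. The wrapper functions and the two dictionaries are hypothetical conveniences; the keyword names are those accepted by `fasttext.train_unsupervised` and `fasttext.train_supervised`, and the file names in the usage comments are placeholders.

```python
# Defaults as listed in the text; passing them explicitly only makes them visible.
UNSUPERVISED_DEFAULTS = dict(model="skipgram", dim=100, lr=0.05, ws=5, epoch=5,
                             neg=5, minn=3, maxn=6, bucket=2_000_000)
SUPERVISED_DEFAULTS = dict(dim=100, lr=0.1, epoch=5, minn=0, maxn=0,
                           wordNgrams=1, bucket=2_000_000)


def train_word_vectors(corpus_path, **overrides):
    """Train subword-aware word vectors on a plain-text corpus file."""
    import fasttext  # lazy import: requires `pip install fasttext`

    params = {**UNSUPERVISED_DEFAULTS, **overrides}
    return fasttext.train_unsupervised(corpus_path, **params)


def train_classifier(train_path, **overrides):
    """Train a classifier on a file of '__label__x text' lines."""
    import fasttext  # lazy import, as above

    params = {**SUPERVISED_DEFAULTS, **overrides}
    return fasttext.train_supervised(train_path, **params)


# Hypothetical usage, assuming corpus.txt and train.txt exist:
#   vectors = train_word_vectors("corpus.txt")
#   vectors.get_word_vector("unseenword")    # also works for out-of-vocabulary words
#   clf = train_classifier("train.txt", wordNgrams=2)
#   clf.predict("which baking dish is best") # returns labels and probabilities
```

Overrides such as `wordNgrams=2` replace only the named setting, so the remaining values stay at the defaults described in the text.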