Text Classification Models
Last reviewed
May 11, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 2,487 words
See also: Natural Language Processing Models and Tasks
Text classification models are machine learning systems that assign one or more categorical labels to a span of natural language text. Classification is one of the oldest and most widely deployed NLP tasks: nearly every email service, content platform, and customer-support system runs a classifier somewhere in its pipeline. Common label spaces include positive vs. negative (sentiment analysis), topic (topic classification), user intent, spam vs. ham (spam detection), toxic vs. acceptable (toxicity classification), and language ID (language identification). This page surveys the architectures, datasets, metrics, and applications that define modern text classification.
Text classification has cycled through five technical eras over three decades. Each transition lowered the labeled-data requirement and raised accuracy on standard benchmarks.
| Era | Years | Dominant approach | Representative work |
|---|---|---|---|
| Counting and Bayes | 1990s to early 2000s | Bag of words plus Naive Bayes or logistic regression | McCallum and Nigam (1998) on Reuters and 20 Newsgroups |
| Margin and kernels | early 2000s to early 2010s | TF-IDF plus support vector machines | Joachims (1998) showed linear SVMs dominate Reuters-21578 |
| Word vectors and shallow networks | 2014 to 2017 | Word2Vec and GloVe embeddings feeding CNNs or BiLSTMs | Kim (2014) CNN for sentence classification; fastText (2016) |
| Transfer learning from Transformers | 2018 to 2021 | Pre-trained Transformer encoders fine-tuned with a classification head | BERT, RoBERTa, DeBERTa |
| Prompted and few-shot LLMs | 2020 to present | Zero-shot prompting or few-shot prompting of large language models; SetFit and contrastive few-shot | GPT-3, GPT-4, Claude, SetFit |
The Transformer paper (Vaswani et al., 2017) replaced recurrence with self-attention, and the encoder-only BERT released by Google in October 2018 made transfer learning the default recipe for classification. Most production classifiers built between 2019 and 2024 are BERT-family encoders with a softmax head. Since 2022 a growing share of low-volume or rapidly changing label spaces is served by prompting an LLM instead of training a dedicated model.
A text classifier is fundamentally a function that maps a string to a probability distribution over labels. Five architectural families dominate.
Encoder Transformer with a classification head. The canonical recipe since 2018. The tokenizer prepends a special [CLS] token to the input, and a pre-trained encoder such as BERT returns a contextual representation for every position. The final hidden state of [CLS] is fed into a small linear-plus-softmax head. Fine-tuning typically updates all encoder weights, but LoRA and adapter modules can often match that accuracy while training only a fraction of the parameters.
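The head described above is small enough to sketch in plain Python. This is a minimal illustration, assuming the encoder has already produced the [CLS] hidden state; the random vector, the 8-dimensional hidden size, and the names `classification_head`, `weights`, and `bias` are illustrative stand-ins, not part of any library.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classification_head(cls_vector, weights, bias):
    """Linear layer plus softmax applied to the [CLS] hidden state.

    weights holds one row per label; each row has len(cls_vector) entries.
    """
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Stand-in for the encoder output: a random 8-dimensional [CLS] vector.
rng = random.Random(0)
hidden_size, num_labels = 8, 3
cls_vector = [rng.gauss(0, 1) for _ in range(hidden_size)]
weights = [[rng.gauss(0, 0.1) for _ in range(hidden_size)] for _ in range(num_labels)]
bias = [0.0] * num_labels

probs = classification_head(cls_vector, weights, bias)
print(probs)  # three probabilities that sum to 1
```

In a real fine-tuning run the head's weights are learned jointly with the encoder; here they are random, so the probabilities carry no meaning beyond showing the shape of the computation.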
CNN over word embeddings. Yoon Kim's 2014 EMNLP paper showed that a single convolutional layer with several filter widths over pre-trained word embeddings, followed by max-over-time pooling and a softmax, is competitive across seven sentence-level benchmarks and improves on the prior state of the art on four of them. The architecture is cheap, parallelizable, and still useful when latency budgets are tight.
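The convolution-and-pooling step can be sketched without a deep learning framework. The following is a simplified illustration, assuming one output value per filter and no padding; `conv_max_pool` and `sentence_features` are hypothetical helper names, and the tiny embeddings are made up for the example.

```python
def conv_max_pool(embeddings, filt):
    """Slide one filter over the token sequence, apply ReLU, keep the max activation.

    embeddings: list of token vectors; filt: list of `width` weight vectors.
    """
    width = len(filt)
    best = 0.0  # ReLU output is never below zero
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        activation = sum(
            w * x
            for f_row, e_row in zip(filt, window)
            for w, x in zip(f_row, e_row)
        )
        best = max(best, activation)
    return best


def sentence_features(embeddings, filters):
    """One pooled value per filter; filters may have different widths."""
    return [conv_max_pool(embeddings, f) for f in filters]


# Three tokens with 2-dimensional embeddings (toy values).
tokens = [[1, 0], [0, 1], [1, 1]]
filters = [
    [[1, 0], [0, 1]],          # width-2 filter
    [[1, 1], [1, 1], [1, 1]],  # width-3 filter
]
features = sentence_features(tokens, filters)
print(features)
```

The resulting feature vector would then feed a linear-plus-softmax layer, which is omitted here.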
BiLSTM with attention. A bidirectional LSTM reads the embedding sequence in both directions and an attention layer pools the hidden states into a single document vector. Hierarchical Attention Networks (Yang et al., 2016) were the strongest non-Transformer document classifiers before BERT.
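Attention pooling reduces a sequence of hidden states to one vector by taking a softmax-weighted average. The sketch below is a simplification: it scores each state by a dot product with a context vector, whereas Hierarchical Attention Networks first pass each state through a learned tanh projection. The name `attention_pool` and the toy values are assumptions for illustration.

```python
import math


def attention_pool(hidden_states, context):
    """Pool hidden states into one vector using dot-product attention weights."""
    scores = [sum(h * c for h, c in zip(state, context)) for state in hidden_states]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    pooled = [
        sum(w * state[i] for w, state in zip(weights, hidden_states))
        for i in range(dim)
    ]
    return pooled, weights


# Toy BiLSTM outputs for three tokens; the context vector favors the second state.
states = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1]]
context = [2.0, 2.0]
doc_vector, attn = attention_pool(states, context)
print(attn)
```

The context vector is learned during training; positions whose states align with it receive more weight in the document vector.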
FastText and linear models on subword n-grams. FastText (Joulin et al., 2016) averages embedding vectors for word and character n-grams and feeds the average to a linear classifier. The authors report training on more than a billion words in under ten minutes on a standard multicore CPU, and it remains a default tool for language identification and very large label sets.
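The averaging step can be illustrated with a hashed embedding table. This is a rough sketch rather than the fastText implementation: it hashes whole words and character n-grams into the same table with `zlib.crc32` (chosen only because it is deterministic), and `char_ngrams`, `text_vector`, and the table sizes are hypothetical.

```python
import random
import zlib


def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of a word wrapped in boundary markers < and >."""
    token = f"<{word}>"
    return [
        token[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(token) - n + 1)
    ]


def text_vector(text, table, num_buckets):
    """Average the hashed embeddings of every word and character n-gram."""
    features = []
    for word in text.lower().split():
        features.append(word)
        features.extend(char_ngrams(word))
    dim = len(table[0])
    if not features:
        return [0.0] * dim
    total = [0.0] * dim
    for feat in features:
        row = table[zlib.crc32(feat.encode("utf-8")) % num_buckets]
        for i, value in enumerate(row):
            total[i] += value
    return [t / len(features) for t in total]


# Randomly initialized table; training would adjust these rows.
rng = random.Random(0)
num_buckets, dim = 1000, 4
table = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(num_buckets)]

vec = text_vector("free prize inside", table, num_buckets)
print(char_ngrams("cat", 3, 3))
```

The averaged vector is the only input to the linear classifier, which is why training and inference stay cheap.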
Prompted classification with an LLM. A frontier LLM such as GPT-4, Claude, or Gemini can act as a classifier when asked, in plain English, to return the label. Few-shot examples in the prompt usually help. This approach needs no training pipeline and supports new labels instantly, at the cost of higher latency, higher cost per inference, and weaker reproducibility than a dedicated model.
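Prompted classification mostly comes down to building a constrained prompt and validating the reply. The sketch below shows one possible shape for both steps; it makes no API call, and `build_prompt`, `parse_label`, and the prompt wording are illustrative assumptions rather than any vendor's recommended format.

```python
def build_prompt(text, labels, examples=()):
    """Assemble an instruction, optional few-shot examples, and the query text."""
    lines = [
        "Classify the text into exactly one of these labels: " + ", ".join(labels) + ".",
        "Answer with the label only.",
        "",
    ]
    for ex_text, ex_label in examples:
        lines += [f"Text: {ex_text}", f"Label: {ex_label}", ""]
    lines += [f"Text: {text}", "Label:"]
    return "\n".join(lines)


def parse_label(response, labels, fallback=None):
    """Map the model's free-text reply back onto the allowed label set."""
    cleaned = response.strip().strip(".\"'").lower()
    for label in labels:
        if cleaned == label.lower():
            return label
    return fallback  # reply was not one of the allowed labels


labels = ["billing", "technical", "refund"]
prompt = build_prompt(
    "I was charged twice this month.",
    labels,
    examples=[("The app crashes on login.", "technical")],
)
print(prompt)
print(parse_label(" Billing.\n", labels))
```

Returning a fallback for unparseable replies keeps downstream code from treating arbitrary model output as a valid label.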
The table below lists widely used dedicated classification models. Parameter counts refer to the most common public checkpoints.
| Model | Released | Organization | Parameters | Key innovation |
|---|---|---|---|---|
| FastText | Jul 2016 | Facebook AI Research | Linear, vocabulary-sized | Subword n-gram averaging, billion-word CPU training |
| BERT-base | Oct 2018 | Google AI Language | 110 million | Bidirectional masked language modeling |
| BERT-large | Oct 2018 | Google AI Language | 340 million | 24 layers, 16 heads, 1024 hidden |
| XLNet | Jun 2019 | CMU and Google Brain | 110 / 340 million | Permutation language modeling |
| RoBERTa | Jul 2019 | Facebook AI | 125 / 355 million | Drops NSP, trains on 160 GB of text |
| ALBERT | Sep 2019 | Google Research | 12 / 18 / 235 million | Factorized embeddings, cross-layer sharing |
| DistilBERT | Oct 2019 | Hugging Face | 66 million | Knowledge distillation, 40% smaller than BERT-base |
| ELECTRA-base | Mar 2020 | Stanford and Google Brain | 110 million | Replaced-token detection on every position |
| DeBERTa-v3-base | Nov 2021 | Microsoft | 86 million backbone, 184 million with embeddings | Disentangled attention plus ELECTRA-style RTD |
| SetFit | Sep 2022 | Hugging Face, Intel Labs, UKP Lab | 110 to 355 million | Contrastive fine-tuning of a Sentence Transformer |
DeBERTa-v3-large reached 91.37 percent average GLUE score, 1.37 points above DeBERTa and 1.91 points above ELECTRA. On the few-shot RAFT benchmark, SetFit with the all-roberta-large-v1 backbone (355 million parameters) outperforms PET and reaches accuracy comparable to GPT-3 prompting.
The field uses a stable set of public datasets, several of which became implicit standards for measuring transfer-learning progress.
| Dataset | Year | Task | Size | Notes |
|---|---|---|---|---|
| Reuters-21578 | 1990s | Multi-label news topic | ~21,500 docs, 90 categories | TF-IDF plus SVM benchmark |
| 20 Newsgroups | 1995 | Newsgroup topic | ~20,000 posts, 20 classes | Early Naive Bayes evaluation set |
| TREC question types | 2002 | Question category | 5,500 train, 500 test, 6 coarse and 50 fine classes | Li and Roth (2002), COLING |
| IMDB Large Movie Review | 2011 | Binary sentiment | 25,000 train, 25,000 test | Maas et al., ACL 2011 |
| SST-2 (Stanford Sentiment Treebank, binary) | 2013 | Binary sentence sentiment | 67,000 train, 872 dev, 1,821 test | Socher et al., included in GLUE |
| AG News | 2015 | 4-class news topic | 120,000 train, 7,600 test | Zhang, Zhao, LeCun, NIPS 2015 |
| Yelp Review Polarity | 2015 | Binary review sentiment | 560,000 train, 38,000 test | Same paper as AG News |
| Amazon Review Polarity | 2015 | Binary review sentiment | 3,600,000 train, 400,000 test | Largest of the Zhang et al. set |
| GLUE | Apr 2018 | 9 NLU tasks, mostly classification | Variable | Wang et al., EMNLP 2018 BlackboxNLP |
| SuperGLUE | May 2019 | Harder NLU tasks | Variable | Wang et al., follow-up to GLUE |
| Jigsaw Toxic Comment | Mar 2018 | Multi-label toxicity | ~159,000 Wikipedia comments | Six labels: toxic, severe_toxic, obscene, threat, insult, identity_hate |
| WiLI-2018 | 2018 | Language identification | 235,000 paragraphs, 235 languages | Paragraph-level language ID corpus drawn from Wikipedia |
GLUE's nine tasks are CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, and WNLI. All except STS-B (regression) are classification problems. Leaderboard systems overtook the GLUE human baseline of 87.1 points within about a year of the benchmark's release, which is why SuperGLUE was introduced in 2019; later encoders such as DeBERTa-v3-large also exceed that baseline.
Classification quality is measured with a small set of well-understood metrics. The right choice depends on class balance and the cost of false positives versus false negatives.
| Metric | Definition | When to use |
|---|---|---|
| Accuracy | Correct predictions / total | Balanced binary or multi-class |
| Precision | True positives / predicted positives | Costly false positives (spam, moderation appeals) |
| Recall | True positives / actual positives | Costly false negatives (toxicity, fraud, triage) |
| F1 | Harmonic mean of precision and recall | Imbalanced binary tasks |
| Macro F1 | Unweighted mean of per-class F1 | Multi-class imbalance where small classes matter |
| Micro F1 | F1 over pooled confusion matrix | Multi-label scoring |
| Weighted F1 | Per-class F1 weighted by support | Single number on imbalanced multi-class data |
| AUC-ROC | Area under the ROC curve | Threshold-free comparison of probabilistic scores |
| AUC-PR | Area under the precision-recall curve | Highly imbalanced binary tasks |
| Matthews correlation | Correlation between predicted and actual | GLUE's CoLA; robust to imbalance |
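The difference between the F1 variants in the table is easiest to see on a small imbalanced example. The following sketch computes per-class, macro, micro, and weighted F1 for single-label predictions; `f1_report` is a hypothetical helper written for this illustration, not a library function.

```python
from collections import Counter


def f1_report(y_true, y_pred):
    """Per-class, macro, micro, and support-weighted F1 for single-label data."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    support = Counter(y_true)
    return {
        "per_class": per_class,
        "macro": sum(per_class.values()) / len(labels),
        "micro": f1(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        "weighted": sum(per_class[c] * support[c] for c in labels) / len(y_true),
    }


# Eight ham and two spam messages; one spam message is missed.
y_true = ["ham"] * 8 + ["spam"] * 2
y_pred = ["ham"] * 8 + ["ham", "spam"]
report = f1_report(y_true, y_pred)
print(report)
```

On this data micro F1 equals accuracy (0.9), while macro F1 is pulled down by the weaker spam class, which is why macro F1 is preferred when small classes matter.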
The original Jigsaw Kaggle competition scored submissions by mean column-wise AUC-ROC across the six label columns, making it a ranking problem rather than a hard-threshold problem: only the ordering of scores within each column matters, not their absolute values.
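Mean column-wise AUC-ROC can be sketched directly from the pairwise definition of AUC: the probability that a randomly chosen positive scores above a randomly chosen negative. The quadratic-time loop below is for illustration only, and `auc_roc` and `mean_columnwise_auc` are hypothetical names; the two toy columns stand in for the six label columns.

```python
def auc_roc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def mean_columnwise_auc(true_columns, score_columns):
    """Average the per-label AUC over all label columns."""
    aucs = [auc_roc(t, s) for t, s in zip(true_columns, score_columns)]
    return sum(aucs) / len(aucs)


# Two toy label columns over four comments.
true_columns = [[0, 0, 1, 1], [1, 0, 0, 1]]
score_columns = [[0.1, 0.4, 0.35, 0.8], [0.9, 0.2, 0.3, 0.7]]
print(mean_columnwise_auc(true_columns, score_columns))

# Rescaling scores leaves the ranking, and therefore the AUC, unchanged.
rescaled = [s / 10 for s in score_columns[0]]
print(auc_roc(true_columns[0], rescaled))
```

Because only the ordering of scores matters, a submission can score well on this metric without its probabilities being well calibrated.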
Classification is the workhorse of applied NLP.
| Domain | Typical labels | Representative deployments |
|---|---|---|
| Email | Spam, phishing, promotional, social, primary | Gmail tabs, Outlook Focused Inbox |
| Content moderation | Toxic, hate, harassment, self-harm, sexual, violence | Perspective API, OpenAI Moderation |
| Customer support | Billing, technical, refund, escalation | Zendesk Auto-Routing, Intercom Resolution Bot |
| News and media | Politics, sports, business, tech, entertainment | Google News clustering, Apple News |
| Sentiment monitoring | Positive, neutral, negative, aspect emotion | Brand tracking, social listening |
| Compliance | PII, clause type, regulatory category | DocuSign Insight, Kira Systems |
| Finance | Bullish, bearish, neutral on a ticker | Bloomberg news sentiment |
| Healthcare | ICD-10 codes, triage acuity, symptom category | Medical record coding assistants |
| Search | Query intent, vertical routing | Web search vertical pickers |
| Language tooling | Source language of a snippet | Google Translate auto-detect, fastText 176-language model |
A single production pipeline can stack several classifiers: a language identification step routes input to a localized intent detection model, which then triggers a domain-specific topic classifier.
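A stacked pipeline of this kind is mostly routing logic around independent classifiers. The sketch below uses keyword rules as stand-ins for real models; every function name, label, and keyword in it is hypothetical and chosen only to make the control flow visible.

```python
def classify_pipeline(text, detect_language, intent_models, topic_models):
    """Route text through language ID, a per-language intent model, then a topic model."""
    language = detect_language(text)
    intent = intent_models[language](text)
    topic_model = topic_models.get(intent)  # not every intent has a topic model
    topic = topic_model(text) if topic_model else None
    return {"language": language, "intent": intent, "topic": topic}


# Keyword-based stand-ins for trained classifiers.
def toy_language_id(text):
    return "es" if "hola" in text.lower() else "en"


def toy_intent_en(text):
    return "billing" if "invoice" in text.lower() else "other"


def toy_intent_es(text):
    return "billing" if "factura" in text.lower() else "other"


def toy_billing_topic(text):
    lowered = text.lower()
    return "refund" if "refund" in lowered or "reembolso" in lowered else "payment"


result = classify_pipeline(
    "Hola, quiero un reembolso de mi factura",
    detect_language=toy_language_id,
    intent_models={"en": toy_intent_en, "es": toy_intent_es},
    topic_models={"billing": toy_billing_topic},
)
print(result)
```

Keeping each stage behind a plain callable makes it possible to swap a keyword rule for a fine-tuned encoder or a prompted LLM without touching the routing code.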
Since 2020, frontier LLMs have started to replace dedicated classifiers in low-to-medium volume settings. Four approaches dominate.
Zero-shot prompting. A model such as GPT-4 or Claude receives the text and a short instruction listing the candidate labels. Accuracy is often within a few points of a fine-tuned encoder on common topics, and can be substantially better on rare or novel categories where the encoder has little or no labeled training data.
Few-shot in-context prompting. Three to thirty labeled examples are placed in the prompt and the model generalizes from them at inference time. See in-context learning.
Natural language inference framing. A pre-trained NLI model such as RoBERTa-large-MNLI scores the pair (text, hypothesis = "this is about X") for every candidate label X and picks the highest entailment probability. Yin et al. (2019) showed this beats classical zero-shot baselines and remains a strong open-weights option.
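The NLI framing reduces to one loop: build a hypothesis per label, score entailment, take the argmax. In the sketch below the entailment scorer is passed in as a function so the logic can run without a model; `toy_entail_prob` is a word-overlap stand-in, not an NLI model, and the hypothesis template is an assumption.

```python
import re


def zero_shot_nli(text, labels, entail_prob, template="This text is about {}."):
    """Score one hypothesis per label and return the label with the highest entailment score."""
    scores = {label: entail_prob(text, template.format(label)) for label in labels}
    best = max(scores, key=scores.get)
    return best, scores


def toy_entail_prob(premise, hypothesis):
    """Stand-in scorer: fraction of hypothesis words that also occur in the premise."""
    premise_words = set(re.findall(r"\w+", premise.lower()))
    hypothesis_words = re.findall(r"\w+", hypothesis.lower())
    shared = sum(1 for w in hypothesis_words if w in premise_words)
    return shared / len(hypothesis_words)


label, scores = zero_shot_nli(
    "The striker scored twice in the sports final",
    ["sports", "politics"],
    entail_prob=toy_entail_prob,
)
print(label, scores)
```

With a real NLI model, `entail_prob` would return the model's entailment probability for the (premise, hypothesis) pair, and the cost grows linearly with the number of candidate labels because each label needs its own forward pass.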
Contrastive few-shot fine-tuning. SetFit (Tunstall et al., 2022) fine-tunes a Sentence Transformer on positive and negative pairs constructed from as few as eight labeled examples per class, then trains a logistic regression head on the resulting embeddings. On Customer Reviews with eight examples per class, SetFit reaches accuracy comparable to a RoBERTa-large fine-tune on the full 3,000-example training set.
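The pair-construction step is what lets a handful of labels produce a much larger contrastive training set. The sketch below enumerates every pair exhaustively, which is a simplification (SetFit samples pairs rather than listing them all); `build_pairs` is a hypothetical helper and the four reviews are made up.

```python
from itertools import combinations


def build_pairs(examples):
    """Turn (text, label) examples into (text_a, text_b, target) contrastive pairs.

    target is 1.0 when both texts share a label and 0.0 otherwise.
    """
    pairs = []
    for (text_a, label_a), (text_b, label_b) in combinations(examples, 2):
        pairs.append((text_a, text_b, 1.0 if label_a == label_b else 0.0))
    return pairs


few_shot = [
    ("Great battery life", "positive"),
    ("Works exactly as described", "positive"),
    ("Stopped working after a week", "negative"),
    ("Arrived broken", "negative"),
]
pairs = build_pairs(few_shot)
print(len(pairs), sum(target for _, _, target in pairs))
```

Four labeled examples already yield six training pairs, and the count grows quadratically with the number of examples, which is the source of the method's data efficiency.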
A common production pattern is to prototype with a prompted LLM, harvest predictions as silver labels, and distill into a small encoder once volume justifies it.
Text classification is mature but not solved.
| Topic area | Wiki pages |
|---|---|
| Sub-tasks | Sentiment analysis, Topic classification, Intent detection, Spam detection, Toxicity classification, Language identification, Named entity recognition |
| Architectures | Transformer, BERT, Convolutional neural network, Long short-term memory, Self-attention |
| Encoder models | RoBERTa, ALBERT, DistilBERT, DeBERTa, XLNet, ELECTRA, FastText, SetFit |
| LLMs used zero-shot | GPT-4, Claude, Gemini, Llama, Mistral |
| Foundations | Bag of words, TF-IDF, Word embedding, Word2Vec, GloVe, Tokenization, In-context learning |
| Benchmarks | GLUE, SuperGLUE, IMDB dataset |
| Training | Supervised fine-tuning, LoRA, Domain adaptation, Naive Bayes, Support vector machine |