Text Classification Models
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v3 · 4,649 words
See also: Natural Language Processing Models and Tasks
Text classification models are machine learning systems that assign one or more categorical labels to a span of natural language text. Classification is one of the oldest and most widely deployed NLP tasks: nearly every email service, content platform, and customer-support system runs a classifier somewhere in its pipeline. Common label spaces include positive vs. negative (sentiment analysis), topic (topic classification), user intent, spam vs. ham (spam detection), toxic vs. acceptable (toxicity classification), and language ID (language identification). This page surveys the architectures, datasets, metrics, and applications that define modern text classification.
Text classification has cycled through five technical eras over three decades. Each transition lowered the labeled-data requirement and raised accuracy on standard benchmarks.
| Era | Years | Dominant approach | Representative work |
|---|---|---|---|
| Counting and Bayes | 1990s to early 2000s | Bag of words plus Naive Bayes or logistic regression | McCallum and Nigam (1998) on Reuters and 20 Newsgroups |
| Margin and kernels | early 2000s to early 2010s | TF-IDF plus support vector machines | Joachims (1998) showed linear SVMs dominate Reuters-21578 |
| Word vectors and shallow networks | 2014 to 2017 | Word2Vec and GloVe embeddings feeding CNNs or BiLSTMs | Kim (2014) CNN for sentence classification; fastText (2016) |
| Transfer learning from Transformers | 2018 to 2021 | Pre-trained Transformer encoders fine-tuned with a classification head | BERT, RoBERTa, DeBERTa |
| Prompted and few-shot LLMs | 2020 to present | Zero-shot prompting or few-shot prompting of large language models; SetFit and contrastive few-shot | GPT-3, GPT-4, Claude, SetFit |
The Transformer paper (Vaswani et al., 2017) replaced recurrence with self-attention, and the encoder-only BERT released by Google in October 2018 made transfer learning the default recipe for classification. Most production classifiers built between 2019 and 2024 are BERT-family encoders with a softmax head. Since 2022 a growing share of low-volume or rapidly changing label spaces is served by prompting an LLM instead of training a dedicated model.
In late 2024 a new generation of encoder models began to appear. Warner et al. released ModernBERT (December 2024), which retained BERT's bidirectional masked language modeling but replaced absolute positional encodings with rotary positional embeddings (RoPE), added alternating local/global attention, and trained on two trillion tokens with a native 8,192-token context window. ModernBERT-base and ModernBERT-large are two to four times faster than their BERT predecessors at inference and reduce memory use by roughly 70 percent versus standard full attention. Shortly afterward, Le Breton et al. released NeoBERT (February 2025), a 250 million parameter encoder with SwiGLU activations and RMSNorm that reports state-of-the-art results on the Massive Text Embedding Benchmark (MTEB) under identical fine-tuning conditions. These releases mark a renewed investment in encoder-only architectures optimized for classification and retrieval, rather than ceding the space entirely to prompted generative models.
A text classifier is fundamentally a function that maps a string to a probability distribution over labels. Five architectural families dominate.
Encoder Transformer with a classification head. The canonical recipe since 2018. The tokenizer prepends a special [CLS] token to the input, and a pre-trained encoder such as BERT returns a contextual representation for every position. The final hidden state of [CLS] is fed into a small linear-plus-softmax head. Fine-tuning typically updates all encoder weights, but LoRA and adapter modules can match accuracy while training only a fraction of the parameters. For multi-label tasks the softmax head is replaced with independent sigmoid outputs and binary cross-entropy loss, so that several labels can be active simultaneously.
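As a rough illustration of the two head types, the sketch below applies a linear layer to a [CLS] vector and then either a softmax (single-label) or independent sigmoids (multi-label). The vector, the weights, and the helper names are toy assumptions and do not come from any real checkpoint.

```python
import math

def softmax(logits):
    """Turn raw scores into a distribution that sums to 1 (single-label case)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Squash one logit into an independent probability (multi-label case)."""
    return 1.0 / (1.0 + math.exp(-z))

def linear_head(cls_vector, weights, bias):
    """One logit per label; weights is [num_labels][hidden], bias is [num_labels]."""
    return [sum(w * x for w, x in zip(row, cls_vector)) + b
            for row, b in zip(weights, bias)]

# Hypothetical 4-dimensional [CLS] vector and a 3-label head (toy values).
cls_vector = [0.5, -1.0, 0.25, 2.0]
weights = [[1.0, 0.0, 0.0, 0.5],
           [0.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.5]]
bias = [0.0, 0.0, 0.0]

logits = linear_head(cls_vector, weights, bias)

# Single-label: probabilities compete, and exactly one label is chosen.
single_label_probs = softmax(logits)
predicted = max(range(len(logits)), key=logits.__getitem__)

# Multi-label: each label is scored on its own, so several can be active.
multi_label_probs = [sigmoid(z) for z in logits]
active = [i for i, p in enumerate(multi_label_probs) if p >= 0.5]
```

With these toy numbers the softmax head picks a single label, while the sigmoid head marks two labels as active at the 0.5 cut-off.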
CNN over word embeddings. Yoon Kim's 2014 EMNLP paper showed that a single convolutional layer with several filter widths over pre-trained word embeddings, followed by max-pooling and a softmax, improves on the state of the art on four of the seven sentence-level benchmarks it evaluates. The architecture is cheap and parallelizable, and it remains useful when latency budgets are tight.
BiLSTM with attention. A bidirectional LSTM reads the embedding sequence in both directions and an attention layer pools the hidden states into a single document vector. Hierarchical Attention Networks (Yang et al., 2016) were the strongest non-Transformer document classifiers before BERT. The hierarchical structure, encoding words into sentences then sentences into documents, remains relevant for very long inputs where truncation to 512 tokens is undesirable.
FastText and linear models on subword n-grams. FastText (Joulin et al., 2016) averages embedding vectors for word and character n-grams and feeds the average to a linear classifier. It trains on a billion words of news in under ten minutes on a single CPU and remains the default tool for language identification and very large label sets. Throughput advantage over Transformer encoders is typically two to three orders of magnitude, making it appealing for real-time pipelines classifying millions of short texts per minute.
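The averaging step can be sketched in a few lines. This is a simplified illustration, not the fastText implementation: the bucket count, the CRC32 hash, the embedding table, and the class weights are all made-up, untrained assumptions.

```python
import zlib

def char_ngrams(word, n_min=3, n_max=4):
    """Character n-grams of a word wrapped in boundary markers, e.g. '<cat>'."""
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

def features(text):
    """Whole words plus their character n-grams."""
    feats = []
    for word in text.lower().split():
        feats.append(word)
        feats.extend(char_ngrams(word))
    return feats

def bucket(feature, num_buckets):
    """Hash a feature string into a fixed-size table (stable across runs)."""
    return zlib.crc32(feature.encode("utf-8")) % num_buckets

def average_embedding(text, table):
    """Look up one row per feature and average them into a document vector."""
    rows = [table[bucket(f, len(table))] for f in features(text)]
    if not rows:
        return [0.0] * len(table[0])
    return [sum(column) / len(rows) for column in zip(*rows)]

def linear_scores(vector, class_weights):
    """One score per class from a plain linear layer."""
    return [sum(w * x for w, x in zip(row, vector)) for row in class_weights]

# Toy, untrained parameters (assumptions for illustration only).
NUM_BUCKETS, DIM = 50, 4
table = [[((i * 7 + j * 3) % 11) / 10.0 for j in range(DIM)]
         for i in range(NUM_BUCKETS)]
class_weights = [[1.0, 0.0, -1.0, 0.0],
                 [0.0, 1.0, 0.0, -1.0]]

doc_vector = average_embedding("Hello world", table)
scores = linear_scores(doc_vector, class_weights)
```

Because inference is a handful of table lookups, one average, and one small matrix product, the cost per document stays tiny, which is consistent with the throughput advantage described above.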
Prompted classification with an LLM. A frontier LLM such as GPT-4, Claude, or Gemini can act as a classifier by being asked, in plain English, to return the label. Few-shot examples in the prompt usually help. This approach needs no training pipeline and supports new labels instantly, at the cost of higher latency, higher cost per inference, and weaker reproducibility than a dedicated model. A 2024 study by Laurer et al. found that fine-tuned BERT-scale encoders outperformed zero-shot GPT-3.5 and GPT-4 in all tested political-text classification scenarios. Separately, Ornstein et al. (2024) reported that BERT processes 277 samples per second versus 12 for Gemma-2-2B, making the encoder roughly 23 times faster.
Full fine-tuning updates all encoder weights, which requires storing a separate checkpoint per task and loading the full model for each request. Parameter-efficient fine-tuning (PEFT) methods address this by learning a small task-specific module while freezing most of the base model.
LoRA (Hu et al., 2021) decomposes the weight update as a product of two low-rank matrices. Because the learned matrices can be merged back into the base weights after training, LoRA adds no inference latency. It has become the de facto standard for PEFT in production. IA3 (Liu et al., 2022) learns rescaling vectors for key, value, and feedforward activations, training fewer parameters than LoRA while matching its accuracy on most tasks. ReFT (Wu et al., 2024) intervenes on hidden representations rather than weight matrices; it trains roughly 3 percent of LoRA's parameters while reaching near-identical classification F1. A 2025 study comparing LoRA, IA3, and ReFT on low-resource AG News and Amazon Reviews found LoRA achieved the highest absolute F1 scores, but ReFT was within one point while training far fewer parameters.
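A minimal numeric sketch of the low-rank update and the merge step follows. The matrices are toy values chosen for easy arithmetic, and the `alpha / rank` scaling follows the common LoRA convention; none of the names below come from a specific library.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def merge_lora(w, a, b, alpha, rank):
    """Fold the low-rank update into the frozen weight: W + (alpha / rank) * B A."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)

# Frozen 3x3 base weight and a rank-1 adapter (toy values).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
B = [[0.5], [1.0], [-1.0]]   # d_out x rank
A = [[1.0, 2.0, 0.0]]        # rank x d_in
alpha, rank = 2.0, 1
x = [1.0, 1.0, 1.0]

# During training: base path plus a separate adapter path.
base = matvec(W, x)
update = matvec(B, matvec(A, x))
y_adapter = [h + (alpha / rank) * u for h, u in zip(base, update)]

# After training: one merged matrix, hence no extra inference cost.
y_merged = matvec(merge_lora(W, A, B, alpha, rank), x)
```

Both paths give the same output, which is why merging removes the adapter's latency. For a 768 by 768 projection at rank 8, `lora_param_count(768, 768, 8)` is 12,288 trainable values against 589,824 in the full matrix.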
BERT's 512-token limit creates a practical problem for classifying documents longer than a few paragraphs. Three strategies are commonly applied.
Truncation and chunking. The simplest approach truncates the input at 512 tokens or processes overlapping windows and aggregates logits. Effective for documents where the first 512 tokens carry enough signal (e.g., news articles).
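The windowing and aggregation logic can be sketched as follows. The `window_logits_fn` callable stands in for a real encoder forward pass and is purely hypothetical; the default window of 510 assumes two positions are reserved for [CLS] and [SEP].

```python
def sliding_windows(token_ids, window=510, overlap=64):
    """Split a token sequence into windows that share `overlap` tokens."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    windows, start = [], 0
    while True:
        windows.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break
        start += step
    return windows

def aggregate_logits(per_window_logits, how="mean"):
    """Combine per-window logits into one logit vector for the document."""
    columns = list(zip(*per_window_logits))
    if how == "mean":
        return [sum(col) / len(col) for col in columns]
    if how == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown aggregation: {how}")

def classify_long(token_ids, window_logits_fn, window=510, overlap=64, how="mean"):
    """Score every window, aggregate, and return (label index, logits)."""
    per_window = [window_logits_fn(w)
                  for w in sliding_windows(token_ids, window, overlap)]
    logits = aggregate_logits(per_window, how)
    return max(range(len(logits)), key=logits.__getitem__), logits

# Hypothetical stand-in for a model: counts even and odd token ids.
def toy_logits(window_tokens):
    evens = sum(1 for t in window_tokens if t % 2 == 0)
    return [float(evens), float(len(window_tokens) - evens)]

tokens = [2, 4, 6, 1, 8, 10, 3, 12, 14, 16]
windows = sliding_windows(tokens, window=4, overlap=1)
label, doc_logits = classify_long(tokens, toy_logits, window=4, overlap=1)
```

Mean pooling suits labels that describe the whole document. Max pooling is often a better fit when a single passage is enough to trigger a label, as in toxicity detection.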
Sparse attention models. Longformer (Beltagy et al., 2020) and BigBird (Zaheer et al., 2020) combine local windowed attention with global attention on a small set of tokens, scaling to 4,096 or 8,192 tokens while keeping computation manageable. ModernBERT's 8,192-token native window now covers these use cases without custom attention kernels.
Hierarchical encoders. A word-level encoder produces sentence representations, and a sentence-level encoder aggregates them into a document vector. Hierarchical Attention Transformers (HAT, 2022) outperform equally sized Longformer models while using 10 to 20 percent less memory and running 40 to 45 percent faster.
The table below lists widely used dedicated classification models. Parameter counts refer to the most common public checkpoints.
| Model | Released | Organization | Parameters | Key innovation |
|---|---|---|---|---|
| FastText | Jul 2016 | Facebook AI Research | Linear, vocabulary-sized | Subword n-gram averaging, billion-word CPU training |
| BERT-base | Oct 2018 | Google AI Language | 110 million | Bidirectional masked language modeling |
| BERT-large | Oct 2018 | Google AI Language | 340 million | 24 layers, 16 heads, 1024 hidden |
| XLNet | Jun 2019 | CMU and Google Brain | 110 / 340 million | Permutation language modeling |
| RoBERTa | Jul 2019 | Facebook AI | 125 / 355 million | Drops NSP, trains on 160 GB of text |
| ALBERT | Sep 2019 | Google Research | 12 / 18 / 235 million | Factorized embeddings, cross-layer sharing |
| DistilBERT | Oct 2019 | Hugging Face | 66 million | Knowledge distillation, 40% smaller than BERT-base |
| XLM-RoBERTa | Nov 2019 | Facebook AI | 125 / 355 million | Multilingual; pretrained on 100 languages, 2.5 TB of CommonCrawl; 14.6 points above mBERT on XNLI |
| ELECTRA-base | Mar 2020 | Stanford and Google Brain | 110 million | Replaced-token detection on every position |
| DeBERTa-v3-base | Nov 2021 | Microsoft | 86 million backbone, 184 million with embeddings | Disentangled attention plus ELECTRA-style RTD |
| SetFit | Sep 2022 | Hugging Face, Intel Labs, UKP Lab | 110 to 355 million | Contrastive fine-tuning of a Sentence Transformer |
| ModernBERT-base / large | Dec 2024 | Answer.AI and LightOn | 149 / 395 million | RoPE, alternating attention, 8,192-token context, 2T token training |
| NeoBERT | Feb 2025 | Chandar Research Lab (Mila) | 250 million | SwiGLU, RMSNorm, optimal depth-to-width ratio, 4,096-token context, SOTA on MTEB |
DeBERTa-v3-large reached 91.37 percent average GLUE score, 1.37 points above DeBERTa and 1.91 points above ELECTRA. On the few-shot RAFT benchmark, SetFit with the all-roberta-large-v1 backbone (355 million parameters) outperforms PET and reaches accuracy comparable to GPT-3 prompting with a model 1,600 times smaller. ModernBERT reports state-of-the-art results on classification, retrieval, and code understanding at two to four times the throughput of DeBERTa and BERT.
The field uses a stable set of public datasets, several of which became implicit standards for measuring transfer-learning progress.
| Dataset | Year | Task | Size | Notes |
|---|---|---|---|---|
| Reuters-21578 | 1990s | Multi-label news topic | ~21,500 docs, 90 categories | TF-IDF plus SVM benchmark |
| 20 Newsgroups | 1995 | Newsgroup topic | ~20,000 posts, 20 classes | Early Naive Bayes evaluation set |
| TREC question types | 2002 | Question category | ~5,500 train, 500 test, 6 coarse and 50 fine classes | Li and Roth (2002), COLING |
| IMDB Large Movie Review | 2011 | Binary sentiment | 25,000 train, 25,000 test | Maas et al., ACL 2011; BERT-family achieves ~97% accuracy |
| SST-2 (Stanford Sentiment Treebank, binary) | 2013 | Binary sentence sentiment | 67,000 train, 872 dev, 1,821 test | Socher et al., included in GLUE |
| SemEval-2014 Task 4 | 2014 | Aspect-based sentiment (restaurant and laptop reviews) | ~3,000 to 4,000 per domain | Canonical ABSA benchmark; DeBERTa-based models lead |
| AG News | 2015 | 4-class news topic | 120,000 train, 7,600 test | Zhang, Zhao, LeCun, NIPS 2015 |
| Yelp Review Polarity | 2015 | Binary review sentiment | 560,000 train, 38,000 test | Same paper as AG News |
| Amazon Review Polarity | 2015 | Binary review sentiment | 3,600,000 train, 400,000 test | Largest of the Zhang et al. set |
| GLUE | Apr 2018 | 9 NLU tasks, mostly classification | Variable | Wang et al., EMNLP 2018 BlackboxNLP |
| Jigsaw Toxic Comment | Mar 2018 | Multi-label toxicity | ~159,000 Wikipedia comments | Six labels: toxic, severe_toxic, obscene, threat, insult, identity_hate |
| WiLI-2018 | 2018 | Language identification | 235 languages | Sentence-level language ID corpus |
| SuperGLUE | May 2019 | Harder NLU tasks | Variable | Wang et al., follow-up to GLUE |
| RAFT | 2021 | 11 real-world few-shot classification tasks | 50 train examples per task | Alex et al., NeurIPS 2021; designed to resist crowdsourced solution |
GLUE's nine tasks are CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, and WNLI. All except STS-B (regression) are classification problems. Models passed the GLUE human baseline of 87.1 points within roughly a year of the benchmark's release, which is why SuperGLUE was introduced in 2019; DeBERTa-v3-large now exceeds that baseline by more than four points.
Text classification is not a single task but a family of related problems. Understanding the distinctions helps when selecting architectures and training data.
Sentiment analysis. Assigns a polarity (positive, negative, neutral) or a fine-grained emotion to a piece of text. Binary polarity is the most common formulation (IMDB, Yelp, Amazon). Aspect-based sentiment analysis (ABSA) refines this by identifying which aspect of a product or service a particular opinion targets (e.g., "the battery life is excellent but the screen is dim" carries two aspect-sentiment pairs). SemEval-2014 Task 4, SemEval-2015 Task 12, and SemEval-2016 Task 5 define the standard ABSA evaluation protocol; DeBERTa fine-tuned with graph convolutional network augmentation currently leads on the restaurant and laptop subsets.
Topic classification. Routes a document to one of several predefined topical categories. AG News (4 classes), Reuters-21578 (90 categories, multi-label), and the 20 Newsgroups corpus (20 classes) are the standard evaluation sets. Encoder fine-tuning has long dominated, but LLM-based classifiers now reach strong performance zero-shot on common topic sets.
Intent detection. Identifies the user's purpose in a short utterance, commonly in dialogue systems and voice assistants. Datasets include SNIPS, ATIS, and HWU64. Intent labels are highly specific and numerous (hundreds of intents in enterprise systems), which tends to favor fine-tuning because zero-shot prompting often generalizes poorly across very large intent taxonomies.
Natural language inference (NLI). A three-way classification that decides whether a hypothesis is entailed by, contradicted by, or neutral with respect to a premise. The MNLI, SNLI, and RTE datasets define this subtask. NLI-trained models also double as zero-shot classifiers (see below).
Spam and phishing detection. Binary classification distinguishing legitimate messages from unsolicited or malicious ones. Applied to email, SMS, and web comments. Training data is typically proprietary; public datasets include the Enron spam corpus and SpamAssassin corpus. Adversarial robustness (homoglyph substitution, obfuscation) is a central concern.
Toxicity and content moderation. Multi-label classification where a comment may simultaneously belong to "toxic," "obscene," "threat," and other categories. The Jigsaw Toxic Comment dataset is the standard public benchmark. Perspective API and OpenAI's Moderation endpoint are widely deployed production systems. Annotation disagreement (some annotators find content toxic that others do not) is a fundamental challenge that no single model fully resolves.
Language identification. Assigns a language code to a snippet of text, often as a preprocessing step before routing to a language-specific model. FastText's 176-language model is the most widely deployed; Google's Compact Language Detector (CLD3) runs in browsers. WiLI-2018 (235 languages) and the LTI LangID corpus are standard evaluation sets.
Multi-label and hierarchical classification. Many real-world label spaces are overlapping (a news article can be both "Technology" and "Business") or structured in a hierarchy (ICD-10 medical codes have a four-level ontology). Multi-label tasks use sigmoid outputs with binary cross-entropy loss and require threshold selection per label rather than argmax. Hierarchical classifiers enforce parent-before-child label constraints or use label-embedding models that encode the taxonomy structure.
Classification quality is measured with a small set of well-understood metrics. The right choice depends on class balance and the cost of false positives versus false negatives.
| Metric | Definition | When to use |
|---|---|---|
| Accuracy | Correct predictions / total | Balanced binary or multi-class |
| Precision | True positives / predicted positives | Costly false positives (spam, moderation appeals) |
| Recall | True positives / actual positives | Costly false negatives (toxicity, fraud, triage) |
| F1 | Harmonic mean of precision and recall | Imbalanced binary tasks |
| Macro F1 | Unweighted mean of per-class F1 | Multi-class imbalance where small classes matter |
| Micro F1 | F1 over pooled confusion matrix | Multi-label scoring |
| Weighted F1 | Per-class F1 weighted by support | Single number on imbalanced multi-class data |
| AUC-ROC | Area under the ROC curve | Threshold-free comparison of probabilistic scores |
| AUC-PR | Area under the precision-recall curve | Highly imbalanced binary tasks |
| Matthews correlation | Correlation between predicted and actual | GLUE's CoLA; robust to imbalance |
The original Jigsaw Kaggle competition scored submissions by mean column-wise AUC-ROC across the six label columns. Because AUC-ROC depends only on how scores rank positives against negatives, this made it a ranking problem rather than a hard-threshold problem.
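The F1 variants in the table above differ only in how per-class counts are pooled. The sketch below computes per-class, macro, and micro F1 from scratch on a small made-up, imbalanced three-class example; the labels and predictions are invented for illustration.

```python
def per_class_counts(y_true, y_pred, labels):
    """True positives, false positives, and false negatives for each class."""
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in labels}
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            counts[truth]["tp"] += 1
        else:
            counts[pred]["fp"] += 1   # predicted class gains a false positive
            counts[truth]["fn"] += 1  # true class gains a false negative
    return counts

def f1(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(counts):
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(f1(**c) for c in counts.values()) / len(counts)

def micro_f1(counts):
    """F1 over pooled counts: every example counts equally."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    return f1(tp, fp, fn)

# Invented imbalanced example: six "a", two "b", two "c".
y_true = ["a", "a", "a", "a", "a", "a", "b", "b", "c", "c"]
y_pred = ["a", "a", "a", "a", "a", "b", "b", "a", "c", "a"]

counts = per_class_counts(y_true, y_pred, labels=["a", "b", "c"])
macro = macro_f1(counts)
micro = micro_f1(counts)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Here micro F1 equals accuracy (0.7), as it always does for single-label multi-class predictions, while macro F1 is lower because the two small classes are predicted less reliably than the large one.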
Threshold tuning. For multi-label and imbalanced binary tasks, the default 0.5 decision threshold often performs poorly. Per-label threshold optimization on a held-out validation set, as well as adaptive thresholding algorithms that fuse label frequency statistics with instance-level signals, routinely improves micro and macro F1 over the default threshold.
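A simple form of per-label threshold optimization is a grid search on validation scores, sketched below for one label. The scores, gold labels, and candidate grid are invented for illustration; the adaptive algorithms mentioned above are more elaborate than this.

```python
def f1_at_threshold(scores, gold, threshold):
    """F1 for one label when predicting positive if score >= threshold."""
    tp = sum(1 for s, g in zip(scores, gold) if s >= threshold and g == 1)
    fp = sum(1 for s, g in zip(scores, gold) if s >= threshold and g == 0)
    fn = sum(1 for s, g in zip(scores, gold) if s < threshold and g == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(scores, gold, candidates=None):
    """Pick the candidate threshold with the highest validation F1."""
    if candidates is None:
        candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    return max(candidates, key=lambda t: f1_at_threshold(scores, gold, t))

# Invented validation scores for a rare label: the model never exceeds 0.5.
val_scores = [0.45, 0.41, 0.36, 0.31, 0.10, 0.08, 0.05, 0.02]
val_gold = [1, 1, 1, 0, 0, 0, 0, 0]

default_f1 = f1_at_threshold(val_scores, val_gold, 0.5)
tuned = best_threshold(val_scores, val_gold)
tuned_f1 = f1_at_threshold(val_scores, val_gold, tuned)
```

In a multi-label setting the same search is repeated independently for each label column. Thresholds should be chosen on validation data and then frozen, since tuning them on the test set inflates reported scores.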
Calibration. A well-calibrated model's confidence scores are accurate probabilities (a 70-percent confident prediction should be correct 70 percent of the time). Models that feed downstream decision systems require calibration beyond raw accuracy. Temperature scaling (learning a single scalar to divide logits) is the simplest and most widely used method; isotonic regression and Platt scaling handle more complex miscalibration.
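Temperature scaling can be illustrated with a small grid search over the temperature that minimizes negative log-likelihood on validation logits. Production code typically fits the scalar with a gradient-based optimizer instead; the grid, the logits, and the labels here are assumptions chosen so the arithmetic is easy to follow.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, gold, temperature):
    """Mean negative log-likelihood of the gold labels."""
    losses = [-math.log(softmax(row, temperature)[g])
              for row, g in zip(logit_rows, gold)]
    return sum(losses) / len(losses)

def fit_temperature(logit_rows, gold, grid=None):
    """Return the temperature from the grid with the lowest validation NLL."""
    if grid is None:
        grid = [0.5 + 0.1 * i for i in range(46)]  # 0.5 through 5.0
    return min(grid, key=lambda t: nll(logit_rows, gold, t))

# Invented overconfident model: about 98% confident, but right 3 times out of 4.
val_logits = [[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
val_gold = [0, 0, 1, 1]

raw_confidence = softmax(val_logits[0])[0]
temperature = fit_temperature(val_logits, val_gold)
calibrated_confidence = softmax(val_logits[0], temperature)[0]
```

Dividing by a temperature above 1 softens every distribution without changing the argmax, so accuracy is unaffected. In this toy case the fitted temperature pulls confidence down from about 0.98 to roughly 0.75, matching the observed hit rate.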
Classification is the workhorse of applied NLP.
| Domain | Typical labels | Representative deployments |
|---|---|---|
| Email | Spam, phishing, promotional, social, primary | Gmail tabs, Outlook Focused Inbox |
| Content moderation | Toxic, hate, harassment, self-harm, sexual, violence | Perspective API, OpenAI Moderation |
| Customer support | Billing, technical, refund, escalation | Zendesk Auto-Routing, Intercom Resolution Bot |
| News and media | Politics, sports, business, tech, entertainment | Google News clustering, Apple News |
| Sentiment monitoring | Positive, neutral, negative, aspect emotion | Brand tracking, social listening |
| Compliance | PII, clause type, regulatory category | DocuSign Insight, Kira Systems |
| Finance | Bullish, bearish, neutral on a ticker | Bloomberg news sentiment |
| Healthcare | ICD-10 codes, triage acuity, symptom category | Medical record coding assistants |
| Search | Query intent, vertical routing | Web search vertical pickers |
| Language tooling | Source language of a snippet | Google Translate auto-detect, fastText 176-language model |
| LLM routing and guardrails | Safe vs. unsafe, query complexity tier | OpenAI Moderation, ModernBERT-based guardrail systems |
A single production pipeline can stack several classifiers: a language identification step routes input to a localized intent detection model, which then triggers a domain-specific topic classifier. ModernBERT has also found a new application category as a lightweight guardrail model in LLM inference pipelines, classifying input prompts for malicious intent at latency budgets that generative models cannot meet.
Since 2020, frontier LLMs have started to replace dedicated classifiers in low-to-medium volume settings. Five approaches dominate.
Zero-shot prompting. A model such as GPT-4 or Claude receives the text and a short instruction listing the candidate labels. Accuracy is often within a few points of a fine-tuned encoder on common topics, and substantially better on rare or novel categories. The tradeoff is cost: zero-shot LLM classification is roughly one to two orders of magnitude more expensive per query than a dedicated encoder.
Few-shot in-context prompting. Three to thirty labeled examples are placed in the prompt and the model generalizes from them at inference time. See in-context learning. Performance improves with example quality and quantity, but degrades when candidate labels are numerous or semantically similar.
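A few-shot classification prompt is mostly string assembly plus defensive parsing of the reply. The sketch below shows one possible layout; the wording of the instruction, the label names, and the helper functions are illustrative assumptions, and the actual model call is deliberately left out.

```python
def build_prompt(text, labels, examples):
    """Assemble an instruction, labeled examples, and the query text."""
    lines = [
        "Classify the text into exactly one of these labels: "
        + ", ".join(labels) + ".",
        "Answer with the label only.",
        "",
    ]
    for example_text, example_label in examples:
        lines += [f"Text: {example_text}", f"Label: {example_label}", ""]
    lines += [f"Text: {text}", "Label:"]
    return "\n".join(lines)

def parse_label(completion, labels, fallback=None):
    """Map a free-text completion back onto the closed label set."""
    cleaned = completion.strip().lower()
    # Longest labels first, so "technical" is not shadowed by a shorter "tech".
    for label in sorted(labels, key=len, reverse=True):
        if cleaned.startswith(label.lower()):
            return label
    return fallback

labels = ["billing", "technical", "refund"]
examples = [
    ("I was charged twice this month.", "billing"),
    ("The app crashes on launch.", "technical"),
]
prompt = build_prompt("Please send my money back.", labels, examples)

# The completion would come from an LLM API; here it is a hard-coded stand-in.
predicted = parse_label(" Refund.\n", labels)
unparseable = parse_label("I am not sure.", labels, fallback="unknown")
```

Returning a fallback rather than raising keeps a batch job running when the model replies with something outside the label set, and the fallback rate is itself a useful quality signal.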
Natural language inference framing. A pre-trained NLI model such as RoBERTa-large-MNLI or mDeBERTa-v3-base (fine-tuned on 2.7 million multilingual NLI pairs) scores the pair (text, hypothesis = "this is about X") for every candidate label X and picks the highest entailment probability. This approach, popularized by Yin et al. (2019), requires no task-specific labels and supports multilingual zero-shot classification out of the box. The default Hugging Face zero-shot pipeline uses facebook/bart-large-mnli; alternatives include cross-encoder/nli-deberta-v3-large for higher accuracy.
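The control flow of NLI-based zero-shot classification is small enough to sketch without a model. In the code below, `entailment_prob` is a placeholder for a real NLI model's entailment probability; the keyword-overlap scorer that stands in for it is a toy assumption used only to make the example runnable.

```python
def zero_shot_nli(text, labels, entailment_prob, template="This is about {}."):
    """Score one hypothesis per label and return (best label, all scores)."""
    scores = {label: entailment_prob(text, template.format(label))
              for label in labels}
    best = max(scores, key=scores.get)
    return best, scores

# Toy stand-in for an NLI model (NOT a real entailment scorer).
TOY_KEYWORDS = {
    "sports": {"match", "goal", "team"},
    "finance": {"stock", "market", "shares"},
}

def toy_entailment_prob(premise, hypothesis):
    """Pretend entailment score based on keyword overlap with the topic."""
    topic = hypothesis.rstrip(".").split()[-1]
    words = set(premise.lower().split())
    hits = len(words & TOY_KEYWORDS.get(topic, set()))
    return hits / (hits + 1)

text = "The team scored a late goal to win the match"
label, scores = zero_shot_nli(text, ["sports", "finance"], toy_entailment_prob)
```

Swapping the toy scorer for a real NLI model leaves `zero_shot_nli` unchanged. Note that the cost grows linearly with the number of candidate labels, because each label needs its own premise and hypothesis pass.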
Contrastive few-shot fine-tuning. SetFit (Tunstall et al., 2022) fine-tunes a Sentence Transformer on positive and negative pairs constructed from as few as eight labeled examples per class, then trains a logistic head on the resulting embeddings. On Customer Reviews with eight examples per class, SetFit reaches accuracy comparable to a RoBERTa-large fine-tune on the full 3,000-example training set. On the RAFT benchmark, SetFit with all-roberta-large-v1 (355M parameters) outperforms GPT-3 (175B parameters) overall, surpassing the human baseline on 7 of 11 tasks, while being 1,600 times smaller. A 2025 update using ModernBERT as the SetFit backbone further improves results on several RAFT subtasks.
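The pair-construction step explains why so few labels go a long way. The sketch below enumerates every pair from a small labeled set; this is a simplification, since the SetFit library samples pairs rather than necessarily enumerating all of them, and the example texts are invented.

```python
from itertools import combinations

def contrastive_pairs(examples):
    """Build (text_a, text_b, target) triples from (text, label) examples.

    target is 1.0 when both texts share a label and 0.0 otherwise.
    """
    return [
        (text_a, text_b, 1.0 if label_a == label_b else 0.0)
        for (text_a, label_a), (text_b, label_b) in combinations(examples, 2)
    ]

# Eight invented examples per class, two classes: sixteen labeled texts.
examples = (
    [(f"positive review {i}", "positive") for i in range(8)]
    + [(f"negative review {i}", "negative") for i in range(8)]
)

pairs = contrastive_pairs(examples)
num_positive = sum(1 for _, _, target in pairs if target == 1.0)
num_negative = len(pairs) - num_positive
```

Sixteen labeled texts yield 120 training pairs (56 same-class, 64 cross-class), so the Sentence Transformer sees far more supervision signal than the raw example count suggests. The fine-tuned embeddings are then passed to the logistic head described above.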
LLM-generated silver labels with distillation. A common production pattern is to use a capable LLM to annotate a large unlabeled set, then fine-tune a small encoder on the resulting silver labels. Pangakis and Wolken (2024) demonstrated that classifiers fine-tuned on GPT-4-generated labels perform comparably to models trained with human annotations across sentiment, approval, and party-position tasks, while costing far less than human annotation at scale.
Teams often chain these approaches: prototype with a prompted LLM, harvest its predictions as silver labels, and distill into a small encoder once volume justifies it. The table below maps common scenarios to a recommended approach.
| Scenario | Recommended approach | Rationale |
|---|---|---|
| High throughput (millions of documents per day) | Fine-tuned encoder (BERT, DeBERTa, ModernBERT) | 23 to 200 times faster; orders of magnitude cheaper per query |
| Fixed label set, adequate labeled data (500+ examples) | Fine-tuned encoder | Consistently outperforms zero-shot LLMs in controlled comparisons |
| New labels or rapidly changing taxonomy with minimal data | Zero-shot or few-shot LLM | No training pipeline; instant new labels |
| 8 to 50 labeled examples per class | SetFit (few-shot contrastive) | Approaches the accuracy of encoders fine-tuned on the full dataset; much cheaper than an LLM |
| Zero labeled examples, common topics | NLI-based zero-shot (BART-large-MNLI, mDeBERTa NLI) | Open-weights, multilingual, no API cost |
| Annotation generation at scale | LLM labeling followed by encoder distillation | Cuts annotation cost; resulting encoder has low inference cost |
| Low-latency guardrails in an LLM pipeline | ModernBERT or FastText | Sub-millisecond to single-millisecond latency |
Most of the benchmark history of text classification is English-centric. Three approaches extend coverage to other languages.
Multilingual encoders. mBERT (Devlin et al., 2018) was pre-trained on 104 languages from Wikipedia and showed surprising cross-lingual transfer even without language-specific data. XLM-RoBERTa (Conneau et al., 2019) improved substantially by training on 2.5 terabytes of CommonCrawl data in 100 languages, reaching 80.9 percent on XNLI versus 66.3 percent for mBERT, a 14.6-point gap. mDeBERTa-v3-base adds disentangled attention and ELECTRA-style training to the multilingual setting, reaching 87.1 percent on English MNLI and maintaining above 80 percent accuracy across multiple languages when fine-tuned on the combined XNLI and multilingual NLI dataset.
Cross-lingual zero-shot classification. NLI models trained on multilingual data (e.g., joeddav/xlm-roberta-large-xnli) can classify text in languages not seen during task fine-tuning. This zero-shot cross-lingual approach works well for typologically close languages and degrades on more distant or low-resource ones.
Language-specific fine-tuning. For high-value languages with adequate data, language-specific BERT variants (CamemBERT for French, BERTje for Dutch, RoBERTa-base-chinese, etc.) typically outperform multilingual models by two to five points when fine-tuned on in-language data.
Text classification is mature but not solved.
| Topic area | Wiki pages |
|---|---|
| Sub-tasks | Sentiment analysis, Topic classification, Intent detection, Spam detection, Toxicity classification, Language identification, Named entity recognition |
| Architectures | Transformer, BERT, Convolutional neural network, Long short-term memory, Self-attention |
| Encoder models | RoBERTa, ALBERT, DistilBERT, DeBERTa, XLNet, ELECTRA, FastText, SetFit |
| LLMs used zero-shot | GPT-4, Claude, Gemini, Llama, Mistral |
| Foundations | Bag of words, TF-IDF, Word embedding, Word2Vec, GloVe, Tokenization, In-context learning |
| Benchmarks | GLUE, SuperGLUE, IMDB dataset |
| Training | Supervised fine-tuning, LoRA, Domain adaptation, Naive Bayes, Support vector machine |