Text Classification Models

Updated Jun 23, 2026

See also: Natural Language Processing Models and Tasks

Text classification models are machine learning systems that assign one or more predefined categorical labels to a span of natural language text, such as positive vs. negative, spam vs. ham, or one of dozens of topic categories. Text classification is one of the oldest and most widely deployed NLP tasks: nearly every email service, content platform, and customer-support system runs a classifier somewhere in its pipeline. Common label spaces include polarity (sentiment analysis), topic (topic classification), user intent (intent detection), spam vs. ham (spam detection), toxic vs. acceptable (toxicity classification), and language ID (language identification). The field has moved through five technical eras in three decades, from bag-of-words plus Naive Bayes to fine-tuned BERT encoders and, since 2020, zero-shot prompting of large language models. As of 2026 a fine-tuned BERT-scale encoder remains the strongest and cheapest baseline when labeled data is available: one 2026 benchmark measured a fine-tuned BERT classifier at 277 samples per second versus 12 for a Gemma-2-2B generative model, roughly 23 times faster. ^[9]^[26]^[35] This page surveys the architectures, datasets, metrics, and applications that define modern text classification.

What are the major eras of text classification?

Text classification has cycled through five technical eras over three decades. Each transition lowered the labeled-data requirement and raised accuracy on standard benchmarks.

Era	Years	Dominant approach	Representative work
Counting and Bayes	1990s to early 2000s	Bag of words plus Naive Bayes or logistic regression	McCallum and Nigam (1998) on Reuters and 20 Newsgroups
Margin and kernels	early 2000s to early 2010s	TF-IDF plus support vector machines	Joachims (1998) showed linear SVMs dominate Reuters-21578
Word vectors and shallow networks	2014 to 2017	Word2Vec and GloVe embeddings feeding CNNs or BiLSTMs	Kim (2014) CNN for sentence classification; fastText (2016)
Transfer learning from Transformers	2018 to 2021	Pre-trained Transformer encoders fine-tuned with a classification head	BERT, RoBERTa, DeBERTa
Prompted and few-shot LLMs	2020 to present	Zero-shot prompting or few-shot prompting of large language models; SetFit and contrastive few-shot	GPT-3, GPT-4, Claude, SetFit

The early margin-and-kernel era set a durable baseline: Joachims (1998) reported that a simple TF-IDF representation with Euclidean normalization and a linear SVM reached an 84.2 percent micro-averaged precision-recall breakeven point on the Reuters-21578 ModApte split, establishing linear SVMs as the model to beat for over a decade. ^[9]^[16] The Transformer paper (Vaswani et al., 2017) then replaced recurrence with self-attention, and the encoder-only BERT released by Google in October 2018 made transfer learning the default recipe for classification. ^[1]^[2] Most production classifiers built between 2019 and 2024 are BERT-family encoders with a softmax head. Since 2022 a growing share of low-volume or rapidly changing label spaces is served by prompting an LLM instead of training a dedicated model.
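
The TF-IDF representation with Euclidean normalization mentioned above can be sketched in a few lines. This is a minimal illustration, not a reconstruction of the original experimental setup: it assumes whitespace tokenization and the common log(N/df) inverse document frequency, and the function name tfidf_l2 is hypothetical.

```python
import math
from collections import Counter

def tfidf_l2(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed weighting: raw term frequency times log(N / df).
        vec = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # Euclidean normalization; leave all-zero vectors untouched.
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors

docs = ["wheat prices rise", "oil prices fall", "wheat harvest grows"]
vectors = tfidf_l2(docs)
```

Each resulting vector has unit length, so a linear SVM trained on these features compares documents by direction rather than by raw length.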

In late 2024 a new generation of encoder models began to appear. Warner et al. released ModernBERT (December 2024), which retained BERT's bidirectional masked language modeling but replaced absolute positional encodings with rotary positional embeddings (RoPE), added alternating local/global attention, and trained on two trillion tokens with a native 8,192-token context window. ^[24] ModernBERT-base and ModernBERT-large run roughly twice as fast as DeBERTaV3 on short context and three times faster than NomicBERT and GTE on long context, and the authors describe the family as "the most speed and memory efficient encoder" tested. ^[24] Shortly afterward, Le Breton et al. released NeoBERT (February 2025), a 250 million parameter encoder with SwiGLU activations and RMSNorm that reports state-of-the-art results on the Massive Text Embedding Benchmark (MTEB) under identical fine-tuning conditions. ^[25] These releases mark a renewed investment in encoder-only architectures optimized for classification and retrieval, rather than ceding the space entirely to prompted generative models.

What architectures are used for text classification?

A text classifier is fundamentally a function that maps a string to a probability distribution over labels. Five architectural families dominate.

Encoder Transformer with a classification head. This has been the canonical recipe since 2018. The tokenizer prepends a special [CLS] token to the input, and a pre-trained encoder such as BERT returns a contextual representation for every position. The final hidden state of [CLS] is fed into a small linear-plus-softmax head. ^[2] Fine-tuning typically updates all encoder weights, but LoRA and adapter modules can match accuracy while training only a fraction of the parameters. For multi-label tasks the softmax head is replaced with independent sigmoid outputs and binary cross-entropy loss, so several labels can be active at the same time.
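
The difference between the softmax head and the multi-label sigmoid head can be shown on a toy vector. The sketch below is illustrative only: the three-dimensional cls_vec stands in for a real [CLS] hidden state, and the weights are made-up numbers rather than trained parameters.

```python
import math

def linear(cls_vec, weights, bias):
    """Logits = W @ cls_vec + b, with W given as one row per label."""
    return [sum(w * x for w, x in zip(row, cls_vec)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def bce_loss(probs, targets):
    """Mean binary cross-entropy over labels (multi-label training loss)."""
    eps = 1e-12
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(probs, targets)) / len(probs)

cls_vec = [0.5, -1.0, 2.0]   # toy stand-in for the [CLS] hidden state
W = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [-1.0, 0.0, 0.0]]  # made-up weights
b = [0.0, 0.0, 0.0]

logits = linear(cls_vec, W, b)
single_label = softmax(logits)   # probabilities compete and sum to 1
multi_label = sigmoid(logits)    # each label is scored independently
```

With softmax the probabilities always sum to one, which suits mutually exclusive classes; the sigmoid outputs do not, which is what allows several labels to pass a threshold together.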

CNN over word embeddings. Yoon Kim's 2014 EMNLP paper showed that a single convolutional layer with several filter widths over pre-trained word embeddings, followed by max-pooling and a softmax, "improves upon the state of the art on 4 out of 7 tasks" including sentiment analysis and question classification. ^[10] The architecture is cheap, parallelizable, and still useful when latency budgets are tight.

BiLSTM with attention. A bidirectional LSTM reads the embedding sequence in both directions and an attention layer pools the hidden states into a single document vector. Hierarchical Attention Networks (Yang et al., 2016) were the strongest non-Transformer document classifiers before BERT. ^[29] The hierarchical structure, encoding words into sentences then sentences into documents, remains relevant for very long inputs where truncation to 512 tokens is undesirable.

FastText and linear models on subword n-grams. FastText (Joulin et al., 2016) averages embedding vectors for word and character n-grams and feeds the average to a linear classifier. The authors report that it "can train on more than one billion words in less than ten minutes using a standard multicore CPU, and classify half a million sentences among 312K classes in less than a minute." ^[9] It remains the default tool for language identification and very large label sets, with a throughput advantage over Transformer encoders typically of two to three orders of magnitude, making it appealing for real-time pipelines classifying millions of short texts per minute.
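
The averaging step can be sketched as follows. This is a simplified illustration rather than the fastText library's implementation: the n-gram length, table sizes, CRC32 hashing, and random embeddings are all assumptions chosen to keep the example small, and the linear classifier that would consume the document vector is omitted.

```python
import random
import zlib

DIM, BUCKETS = 8, 1000  # toy sizes, chosen for illustration only

def features(text, n=3):
    """Words plus character n-grams of each word, with boundary markers."""
    feats = []
    for word in text.lower().split():
        feats.append(word)
        padded = f"<{word}>"
        feats.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return feats

def bucket(feature):
    # Hashing trick: map any feature string into a fixed-size table.
    return zlib.crc32(feature.encode("utf-8")) % BUCKETS

rng = random.Random(0)
# Untrained random embeddings; a real model would learn these.
embeddings = [[rng.uniform(-0.1, 0.1) for _ in range(DIM)] for _ in range(BUCKETS)]

def document_vector(text):
    """Average the embeddings of all features found in the text."""
    ids = [bucket(f) for f in features(text)]
    if not ids:
        return [0.0] * DIM
    return [sum(embeddings[i][d] for i in ids) / len(ids) for d in range(DIM)]

vec = document_vector("cheap pills online")
```

Because the document vector is a plain average followed by a linear layer, inference cost grows only with the number of features in the text, which is the source of the throughput advantage described above.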

Prompted classification with an LLM. A frontier LLM such as GPT-4, Claude, or Gemini can act as a classifier by being asked, in plain English, to return the label. Few-shot examples in the prompt usually help. The approach needs no training pipeline and supports new labels instantly, at the cost of higher latency, higher cost per inference, and weaker reproducibility than a dedicated model. A 2024 study by Bucher and Martini found that fine-tuned BERT-scale encoders "significantly outperform" zero-shot GPT-3.5, GPT-4, and Claude across sentiment, approval, emotion, and party-position tasks, concluding that fine-tuning with application-specific data "achieves superior performance in all cases." ^[26] Separately, a 2026 benchmark measured a fine-tuned BERT classifier at 277 samples per second versus 12 for Gemma-2-2B on an RTX A4500 GPU, giving the encoder roughly 23 times the throughput on identical hardware. ^[35]

How do you fine-tune efficiently without retraining the whole model?

Full fine-tuning updates all encoder weights, which requires storing a separate checkpoint per task and loading the full model for each request. Parameter-efficient fine-tuning (PEFT) methods address this by learning a small task-specific module while freezing most of the base model.

LoRA (Hu et al., 2021) decomposes the weight update as a product of two low-rank matrices. Because the learned matrices can be merged back into the base weights after training, LoRA adds no inference latency. It has become the de facto standard for PEFT in production. IA3 (Liu et al., 2022) learns rescaling vectors for key, value, and feedforward activations, training fewer parameters than LoRA while matching its accuracy on most tasks. ReFT (Wu et al., 2024) intervenes on hidden representations rather than weight matrices; it trains roughly 3 percent of LoRA's parameters while reaching near-identical classification F1. A 2025 study comparing LoRA, IA3, and ReFT on low-resource AG News and Amazon Reviews found LoRA achieved the highest absolute F1 scores, but ReFT was within one point while training far fewer parameters.
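
The low-rank update and the merge step can be illustrated with small matrices. The sketch below uses the alpha / rank scaling from the LoRA paper; the matrix values and dimensions are toy numbers, and merge_lora is a hypothetical helper rather than an API of any particular library.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(w, b_mat, a_mat, alpha, rank):
    """Return W + (alpha / rank) * B @ A, the merged weight matrix."""
    delta = matmul(b_mat, a_mat)
    scale = alpha / rank
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

d_out, d_in, rank, alpha = 4, 3, 1, 2.0
W = [[0.0] * d_in for _ in range(d_out)]   # frozen base weight (toy zeros)
B = [[1.0], [0.0], [2.0], [0.0]]           # d_out x rank, trainable
A = [[0.5, 0.0, -0.5]]                     # rank x d_in, trainable

W_merged = merge_lora(W, B, A, alpha, rank)

full_params = d_out * d_in                 # updated by full fine-tuning
lora_params = rank * (d_out + d_in)        # trained by LoRA
```

After merging, inference uses a single matrix of the original shape, which is why no latency is added. At toy scale the saving is small (7 versus 12 parameters), but for a 768 by 768 projection at rank 8 the same formulas give 12,288 trainable parameters against 589,824.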

How are documents longer than 512 tokens classified?

BERT's 512-token limit creates a practical problem for classifying documents longer than a few paragraphs. Three strategies are commonly applied.

Truncation and chunking. The simplest approach truncates the input at 512 tokens or processes overlapping windows and aggregates logits. Effective for documents where the first 512 tokens carry enough signal (e.g., news articles).
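
The overlapping-window strategy can be sketched as below. The window length of 512 follows the limit discussed above; the stride of 256 and mean aggregation are assumptions (max-pooling the logits is an equally common choice), and the sketch ignores the slots that [CLS] and [SEP] would occupy in a real tokenizer.

```python
def sliding_windows(token_ids, max_len=512, stride=256):
    """Split a token sequence into overlapping windows of at most max_len."""
    if len(token_ids) <= max_len:
        return [token_ids]
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the last window already reaches the end of the document
        start += stride
    return windows

def mean_logits(per_window_logits):
    """Average logits across windows, one value per label."""
    n = len(per_window_logits)
    return [sum(col) / n for col in zip(*per_window_logits)]

tokens = list(range(1200))   # stand-in for tokenizer output
windows = sliding_windows(tokens)
```

Each window would be passed through the classifier separately, and mean_logits combines the per-window outputs into a single document-level prediction.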

Sparse attention models. Longformer (Beltagy et al., 2020) and BigBird (Zaheer et al., 2020) combine local windowed attention with global attention on a small set of tokens, scaling to 4,096 or 8,192 tokens while keeping computation manageable. ^[30] ModernBERT's 8,192-token native window now covers these use cases without custom attention kernels. ^[24]

Hierarchical encoders. A word-level encoder produces sentence representations, and a sentence-level encoder aggregates them into a document vector. Hierarchical Attention Transformers (HAT, 2022) outperform equally sized Longformer models while using 10 to 20 percent less memory and running 40 to 45 percent faster.

Which text classification models are most widely used?

The table below lists widely used dedicated classification models. Parameter counts refer to the most common public checkpoints.

Model	Released	Organization	Parameters	Key innovation
FastText	Jul 2016	Facebook AI Research	Linear, vocabulary-sized	Subword n-gram averaging, billion-word CPU training
BERT-base	Oct 2018	Google AI Language	110 million	Bidirectional masked language modeling
BERT-large	Oct 2018	Google AI Language	340 million	24 layers, 16 heads, 1024 hidden
XLNet	Jun 2019	CMU and Google Brain	110 / 340 million	Permutation language modeling
RoBERTa	Jul 2019	Facebook AI	125 / 355 million	Drops NSP, trains on 160 GB of text
ALBERT	Sep 2019	Google Research	12 / 18 / 235 million	Factorized embeddings, cross-layer sharing
DistilBERT	Oct 2019	Hugging Face	66 million	Knowledge distillation, 40% smaller than BERT-base
XLM-RoBERTa	Nov 2019	Facebook AI	270 / 550 million	Multilingual; pretrained on 100 languages, 2.5 TB of CommonCrawl; 14.6 points above mBERT on XNLI
ELECTRA-base	Mar 2020	Stanford and Google Brain	110 million	Replaced-token detection on every position
DeBERTa-v3-base	Nov 2021	Microsoft	86 million backbone (184 million with embeddings)	Disentangled attention plus ELECTRA-style RTD
SetFit	Sep 2022	Hugging Face, Intel Labs, UKP Lab	110 to 355 million	Contrastive fine-tuning of a Sentence Transformer
ModernBERT-base / large	Dec 2024	Answer.AI and LightOn	149 / 395 million	RoPE, alternating attention, 8,192-token context, 2T token training
NeoBERT	Feb 2025	Le Breton et al.	250 million	SwiGLU, RMSNorm, optimal depth-to-width ratio, 4,096-token context, SOTA on MTEB

DeBERTa-v3-large reached 91.37 percent average GLUE score, 1.37 points above DeBERTa and 1.91 points above ELECTRA. ^[7] On the few-shot RAFT benchmark, SetFit with the all-roberta-large-v1 backbone (355 million parameters) outperforms PET and reaches accuracy comparable to GPT-3 prompting with a model roughly 500 times smaller than the 175-billion-parameter GPT-3. ^[18]^[22] ModernBERT reports state-of-the-art results on classification, retrieval, and code understanding at two to three times the throughput of DeBERTa and BERT. ^[24]

What datasets and subtasks define text classification?

The field uses a stable set of public datasets, several of which became implicit standards for measuring transfer-learning progress.

Dataset	Year	Task	Size	Notes
Reuters-21578	1990s	Multi-label news topic	~21,500 docs, 90 categories	TF-IDF plus SVM benchmark
20 Newsgroups	1995	Newsgroup topic	~20,000 posts, 20 classes	Early Naive Bayes evaluation set
TREC question types	2002	Question category	~5,500 train, 500 test, 6 coarse and 50 fine classes	Li and Roth (2002), COLING
IMDB Large Movie Review	2011	Binary sentiment	25,000 train, 25,000 test	Maas et al., ACL 2011; BERT-family achieves ~97% accuracy
SST-2 (Stanford Sentiment Treebank, binary)	2013	Binary sentence sentiment	67,000 train, 872 dev, 1,821 test	Socher et al., included in GLUE
SemEval-2014 Task 4	2014	Aspect-based sentiment (restaurant and laptop reviews)	~3,000 to 4,000 per domain	Canonical ABSA benchmark; DeBERTa-based models lead
AG News	2015	4-class news topic	120,000 train, 7,600 test	Zhang, Zhao, LeCun, NIPS 2015
Yelp Review Polarity	2015	Binary review sentiment	560,000 train, 38,000 test	Same paper as AG News
Amazon Review Polarity	2015	Binary review sentiment	3,600,000 train, 400,000 test	Largest of the Zhang et al. set
GLUE	Apr 2018	9 NLU tasks, mostly classification	Variable	Wang et al., EMNLP 2018 BlackboxNLP
Jigsaw Toxic Comment	Mar 2018	Multi-label toxicity	~159,000 Wikipedia comments	Six labels: toxic, severe_toxic, obscene, threat, insult, identity_hate
WiLI-2018	2018	Language identification	235 languages	Sentence-level language ID corpus
SuperGLUE	May 2019	Harder NLU tasks	Variable	Wang et al., follow-up to GLUE
RAFT	2021	11 real-world few-shot classification tasks	50 train examples per task	Alex et al., NeurIPS 2021; designed to resist crowdsourced solution

GLUE's nine tasks are CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, and WNLI. ^[11] All except STS-B (regression) are classification problems. Models surpassed the GLUE human baseline of 87.1 points within about a year of the benchmark's release, which is why SuperGLUE was introduced in 2019; DeBERTa-v3-large now exceeds that baseline by more than four points. ^[7]^[12]

What are the main subtasks of text classification?

Text classification is not a single task but a family of related problems. Understanding the distinctions helps when selecting architectures and training data.

Sentiment analysis. Assigns a polarity (positive, negative, neutral) or a fine-grained emotion to a piece of text. Binary polarity is the most common formulation (IMDB, Yelp, Amazon). Aspect-based sentiment analysis (ABSA) refines this by identifying which aspect of a product or service a particular opinion targets (e.g., "the battery life is excellent but the screen is dim" carries two aspect-sentiment pairs). SemEval-2014 Task 4, SemEval-2015 Task 12, and SemEval-2016 Task 5 define the standard ABSA evaluation protocol; DeBERTa fine-tuned with graph convolutional network augmentation currently leads on the restaurant and laptop subsets.

Topic classification. Routes a document to one of several predefined topical categories. AG News (4 classes), Reuters-21578 (90 categories, multi-label), and the 20 Newsgroups corpus (20 classes) are the standard evaluation sets. Encoder fine-tuning has long dominated, but LLM-based classifiers now reach strong performance zero-shot on common topic sets.

Intent detection. Identifies the user's purpose in a short utterance, commonly in dialogue systems and voice assistants. Datasets include SNIPS, ATIS, and HWU64. Intent labels are highly specific and numerous (hundreds of intents in enterprise systems), which makes fine-tuning necessary because zero-shot prompting generalizes poorly across very large intent taxonomies.

Natural language inference (NLI). A three-way classification that decides whether a hypothesis is entailed by, contradicted by, or neutral with respect to a premise. The MNLI, SNLI, and RTE datasets define this subtask. NLI-trained models also double as zero-shot classifiers (see below).

Spam and phishing detection. Binary classification distinguishing legitimate messages from unsolicited or malicious ones. Applied to email, SMS, and web comments. Training data is typically proprietary; public datasets include the Enron spam corpus and SpamAssassin corpus. Adversarial robustness (homoglyph substitution, obfuscation) is a central concern.

Toxicity and content moderation. Multi-label classification where a comment may simultaneously belong to "toxic," "obscene," "threat," and other categories. The Jigsaw Toxic Comment dataset is the standard public benchmark. ^[17] Perspective API and OpenAI's Moderation endpoint are widely deployed production systems. Annotation disagreement (some annotators find content toxic that others do not) is a fundamental challenge that no single model fully resolves.

Language identification. Assigns a language code to a snippet of text, often as a preprocessing step before routing to a language-specific model. FastText's 176-language model is the most widely deployed; Google's Compact Language Detector (CLD3) runs in browsers. WiLI-2018 (235 languages) and the LTI LangID corpus are standard evaluation sets.

Multi-label and hierarchical classification. Many real-world label spaces are overlapping (a news article can be both "Technology" and "Business") or structured in a hierarchy (ICD-10 medical codes have a four-level ontology). Multi-label tasks use sigmoid outputs with binary cross-entropy loss and require threshold selection per label rather than argmax. Hierarchical classifiers enforce parent-before-child label constraints or use label-embedding models that encode the taxonomy structure.
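
A parent-before-child constraint can be applied as a post-processing step on the sigmoid outputs. The sketch below is one simple way to do it, assuming a taxonomy stored as a child-to-parent mapping; the label names, probabilities, and the enforce_hierarchy helper are hypothetical.

```python
def enforce_hierarchy(probs, parent, threshold=0.5):
    """Keep a label only if it and all of its ancestors pass the threshold."""
    def passes(label):
        while label is not None:
            if probs.get(label, 0.0) < threshold:
                return False
            label = parent.get(label)  # walk up; roots map to None
        return True
    return sorted(label for label in probs if passes(label))

# Hypothetical two-level taxonomy: child -> parent (roots map to None).
parent = {"Technology": None, "Business": None,
          "AI": "Technology", "Startups": "Business"}
# Made-up sigmoid outputs for one article.
probs = {"Technology": 0.9, "AI": 0.7, "Business": 0.3, "Startups": 0.8}

labels = enforce_hierarchy(probs, parent)
```

Here "Startups" scores above the threshold on its own but is dropped because its parent "Business" does not, so the prediction stays consistent with the taxonomy.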

How is text classification accuracy measured?

Classification quality is measured with a small set of well-understood metrics. The right choice depends on class balance and the cost of false positives versus false negatives.

Metric	Definition	When to use
Accuracy	Correct predictions / total	Balanced binary or multi-class
Precision	True positives / predicted positives	Costly false positives (spam, moderation appeals)
Recall	True positives / actual positives	Costly false negatives (toxicity, fraud, triage)
F1	Harmonic mean of precision and recall	Imbalanced binary tasks
Macro F1	Unweighted mean of per-class F1	Multi-class imbalance where small classes matter
Micro F1	F1 over pooled confusion matrix	Multi-label scoring
Weighted F1	Per-class F1 weighted by support	Single number on imbalanced multi-class data
AUC-ROC	Area under the ROC curve	Threshold-free comparison of probabilistic scores
AUC-PR	Area under the precision-recall curve	Highly imbalanced binary tasks
Matthews correlation	Correlation between predicted and actual	GLUE's CoLA; robust to imbalance
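
The definitions in the table above translate directly into code. The sketch below computes several of them from scratch on a toy prediction set; the data is invented, and production code would normally rely on a tested metrics library instead.

```python
import math

def binary_counts(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def matthews(tp, fp, fn, tn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def macro_micro_f1(y_true, y_pred, labels):
    """One-vs-rest F1 per label, then macro (mean) and micro (pooled) F1."""
    per_class, pooled = [], [0, 0, 0]
    for label in labels:
        tp, fp, fn, _ = binary_counts([int(t == label) for t in y_true],
                                      [int(p == label) for p in y_pred])
        per_class.append(precision_recall_f1(tp, fp, fn)[2])
        pooled = [pooled[0] + tp, pooled[1] + fp, pooled[2] + fn]
    macro = sum(per_class) / len(per_class)
    micro = precision_recall_f1(*pooled)[2]
    return macro, micro

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, fn, tn = binary_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
mcc = matthews(tp, fp, fn, tn)
macro, micro = macro_micro_f1(y_true, y_pred, labels=[0, 1])
```

On single-label data, micro F1 pooled over all classes equals plain accuracy, while macro F1 gives the minority class the same weight as the majority class.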

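The original Jigsaw Kaggle competition scored submissions by mean column-wise AUC-ROC across the six label columns, making it a ranking problem rather than a hard-threshold problem: AUC-ROC depends only on how the scores order positive and negative examples, not on where a decision threshold is placed. ^[17]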

Threshold tuning. For multi-label and imbalanced binary tasks, the default 0.5 decision threshold often performs poorly. Per-label threshold optimization on a held-out validation set, as well as adaptive thresholding algorithms that fuse label frequency statistics with instance-level signals, routinely improves micro and macro F1 over the default threshold.
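
Per-label threshold optimization can be as simple as a grid search on validation scores. The sketch below shows the idea for a single rare label; the scores, labels, and grid spacing are made up, and the adaptive algorithms mentioned above are not covered.

```python
def f1_at(threshold, scores, y_true):
    tp = sum(s >= threshold and t == 1 for s, t in zip(scores, y_true))
    fp = sum(s >= threshold and t == 0 for s, t in zip(scores, y_true))
    fn = sum(s < threshold and t == 1 for s, t in zip(scores, y_true))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(scores, y_true, grid=None):
    """Pick the threshold with the highest F1 on held-out validation data."""
    grid = grid or [i / 100 for i in range(5, 100, 5)]
    return max(grid, key=lambda th: f1_at(th, scores, y_true))

# Toy validation scores for one rare label; positives score low overall.
val_scores = [0.10, 0.20, 0.35, 0.40, 0.45, 0.05, 0.15, 0.30]
val_labels = [0,    0,    1,    1,    1,    0,    0,    0]

threshold = best_threshold(val_scores, val_labels)
```

In this toy set no score reaches 0.5, so the default threshold yields an F1 of zero, whereas the tuned threshold separates the classes cleanly. In a multi-label model the search is repeated independently for each label.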

Calibration. A well-calibrated model's confidence scores are accurate probabilities (a 70-percent confident prediction should be correct 70 percent of the time). Models that feed downstream decision systems require calibration beyond raw accuracy. Temperature scaling (learning a single scalar to divide logits) is the simplest and most widely used method; isotonic regression and Platt scaling handle more complex miscalibration.
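
Temperature scaling can be sketched in pure Python. The example fits the scalar by grid search over validation negative log-likelihood, which is an assumption made for brevity (gradient-based optimizers are the usual choice), and the logits are invented to represent an overconfident model.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    return -sum(math.log(softmax(row, temperature)[y])
                for row, y in zip(logit_rows, labels)) / len(labels)

def fit_temperature(logit_rows, labels):
    """Grid-search the single scalar T that minimizes validation NLL."""
    grid = [0.5 + 0.1 * i for i in range(46)]   # T from 0.5 to 5.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))

# Toy validation set: confident logits, but one of four predictions is wrong.
val_logits = [[3.0, 0.0], [3.0, 0.0], [0.0, 3.0], [3.0, 0.0]]
val_labels = [0, 0, 1, 1]

T = fit_temperature(val_logits, val_labels)
uncalibrated = softmax(val_logits[0])
calibrated = softmax(val_logits[0], T)
```

The model is right three times out of four, so the fitted temperature pulls its top-class confidence from about 0.95 down to roughly 0.75. Dividing all logits by the same positive scalar never changes the argmax, so accuracy is unaffected.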

What is text classification used for?

Classification is the workhorse of applied NLP.

Domain	Typical labels	Representative deployments
Email	Spam, phishing, promotional, social, primary	Gmail tabs, Outlook Focused Inbox
Content moderation	Toxic, hate, harassment, self-harm, sexual, violence	Perspective API, OpenAI Moderation
Customer support	Billing, technical, refund, escalation	Zendesk Auto-Routing, Intercom Resolution Bot
News and media	Politics, sports, business, tech, entertainment	Google News clustering, Apple News
Sentiment monitoring	Positive, neutral, negative, aspect emotion	Brand tracking, social listening
Compliance	PII, clause type, regulatory category	DocuSign Insight, Kira Systems
Finance	Bullish, bearish, neutral on a ticker	Bloomberg news sentiment
Healthcare	ICD-10 codes, triage acuity, symptom category	Medical record coding assistants
Search	Query intent, vertical routing	Web search vertical pickers
Language tooling	Source language of a snippet	Google Translate auto-detect, fastText 176-language model
LLM routing and guardrails	Safe vs. unsafe, query complexity tier	OpenAI Moderation, ModernBERT-based guardrail systems

A single production pipeline can stack several classifiers: a language identification step routes input to a localized intent detection model, which then triggers a domain-specific topic classifier. ModernBERT has also found a new application category as a lightweight guardrail model in LLM inference pipelines, classifying input prompts for malicious intent at latency budgets that generative models cannot meet. ^[24]

How do LLMs do text classification by prompting?

Since 2020, frontier LLMs have started to replace dedicated classifiers in low-to-medium volume settings. Five approaches dominate.

Zero-shot prompting. A model such as GPT-4 or Claude receives the text and a short instruction listing the candidate labels. Accuracy is often within a few points of a fine-tuned encoder on common topics, and substantially better on rare or novel categories. The tradeoff is cost: zero-shot LLM classification is roughly one to two orders of magnitude more expensive per query than a dedicated encoder.

Few-shot in-context prompting. Three to thirty labeled examples are placed in the prompt and the model generalizes from them at inference time. See in-context learning. Performance improves with example quality and quantity, but degrades when candidate labels are numerous or semantically similar.
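
A few-shot prompt is ordinary string assembly, plus a parsing step that maps the model's free-text answer back onto the closed label set. The sketch below shows both; the prompt wording, labels, and examples are illustrative assumptions, and the call to an actual LLM is deliberately left out.

```python
def build_few_shot_prompt(labels, examples, text):
    """Assemble a classification prompt from labeled examples."""
    lines = [f"Classify the text into one of: {', '.join(labels)}.",
             "Answer with the label only.", ""]
    for example_text, label in examples:
        lines += [f"Text: {example_text}", f"Label: {label}", ""]
    lines += [f"Text: {text}", "Label:"]
    return "\n".join(lines)

def parse_label(completion, labels, default=None):
    """Map a free-text completion back onto the closed label set."""
    answer = completion.strip().lower()
    for label in labels:
        if answer.startswith(label.lower()):
            return label
    return default  # the caller decides how to handle off-list answers

labels = ["billing", "technical", "refund"]
examples = [("I was charged twice this month.", "billing"),
            ("The app crashes on startup.", "technical")]
prompt = build_few_shot_prompt(labels, examples, "Please return my money.")
```

Returning a default for unparseable answers matters in practice, because a generative model is not guaranteed to stay within the candidate labels.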

Natural language inference framing. A pre-trained NLI model such as RoBERTa-large-MNLI or mDeBERTa-v3-base (fine-tuned by Moritz Laurer on roughly 2.7 million multilingual NLI pairs across 27 languages) scores the pair (text, hypothesis = "this is about X") for every candidate label X and picks the highest entailment probability. This approach, popularized by Yin et al. (2019), requires no task-specific labels and supports multilingual zero-shot classification out of the box. ^[19] The default Hugging Face zero-shot pipeline uses facebook/bart-large-mnli; alternatives include cross-encoder/nli-deberta-v3-large for higher accuracy. ^[20]
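
The NLI framing reduces to a loop over candidate labels. In the sketch below the entailment scorer is passed in as a function; toy_entailment_prob is a placeholder based on word overlap, not a real NLI model, and the hypothesis template is an assumption.

```python
def zero_shot_nli(text, candidate_labels, entailment_prob,
                  template="This text is about {}."):
    """Score each label as an NLI hypothesis and return labels ranked by score."""
    scored = [(label, entailment_prob(text, template.format(label)))
              for label in candidate_labels]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def toy_entailment_prob(premise, hypothesis):
    """Placeholder scorer (NOT a real NLI model): crude word overlap."""
    premise_words = set(premise.lower().replace(".", "").split())
    hypothesis_words = set(hypothesis.lower().replace(".", "").split())
    topic_words = hypothesis_words - {"this", "text", "is", "about"}
    return len(premise_words & topic_words) / max(len(topic_words), 1)

ranking = zero_shot_nli("The football match ended in a draw.",
                        ["football", "politics"], toy_entailment_prob)
```

In a real system the scorer would be the entailment probability from an NLI checkpoint. The Hugging Face zero-shot-classification pipeline wraps the same loop around a model such as facebook/bart-large-mnli.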

Contrastive few-shot fine-tuning. SetFit (Tunstall et al., 2022) fine-tunes a Sentence Transformer on positive and negative pairs constructed from as few as eight labeled examples per class, then trains a logistic head on the resulting embeddings. On Customer Reviews with eight examples per class, SetFit reaches accuracy comparable to a RoBERTa-large fine-tune on the full 3,000-example training set. ^[18] On the RAFT benchmark, SetFit with all-roberta-large-v1 (355M parameters) outperforms GPT-3 (175B parameters) by 8.6 points overall, surpassing the human baseline on 7 of 11 tasks, while being roughly 500 times smaller. ^[18]^[22] A 2025 update using ModernBERT as the SetFit backbone further improves results on several RAFT subtasks.
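
The pair-construction step is what lets a handful of labels go a long way. The sketch below enumerates all pairs exhaustively, which is a simplification: the SetFit library samples pairs rather than listing every combination, and the example texts are invented.

```python
import itertools

def contrastive_pairs(examples):
    """Build (text_a, text_b, target) pairs: 1.0 for same class, 0.0 otherwise."""
    pairs = []
    for (text_a, label_a), (text_b, label_b) in itertools.combinations(examples, 2):
        pairs.append((text_a, text_b, 1.0 if label_a == label_b else 0.0))
    return pairs

examples = [("great product", "pos"), ("love it", "pos"),
            ("terrible", "neg"), ("broke in a day", "neg")]
pairs = contrastive_pairs(examples)
positives = [p for p in pairs if p[2] == 1.0]
negatives = [p for p in pairs if p[2] == 0.0]
```

Four labeled texts already yield six training pairs, and the count grows quadratically with the number of examples. The pairs train the sentence encoder to place same-class texts close together before the logistic head is fitted.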

LLM-generated silver labels with distillation. A common production pattern is to use a capable LLM to annotate a large unlabeled set, then fine-tune a small encoder on the resulting silver labels. Pangakis and Wolken (2024) replicated 14 classification tasks and demonstrated that classifiers fine-tuned on GPT-4-generated labels perform comparably to models trained with human annotations across sentiment, approval, and party-position tasks, while costing far less than human annotation at scale; they note that labeling 6.2 million tweets on four dimensions with GPT-4 alone would cost nearly 9,000 dollars, which distillation avoids. ^[27]

In practice, teams often prototype with a prompted LLM, harvest its predictions as silver labels, and distill them into a small encoder once volume justifies it.

When should you choose an encoder vs. an LLM?

Scenario	Recommended approach	Rationale
High throughput (millions of documents per day)	Fine-tuned encoder (BERT, DeBERTa, ModernBERT)	23 to 200 times faster; orders of magnitude cheaper per query
Fixed label set, adequate labeled data (500+ examples)	Fine-tuned encoder	Consistently outperforms zero-shot LLMs in controlled comparisons
New labels or rapidly changing taxonomy with minimal data	Zero-shot or few-shot LLM	No training pipeline; instant new labels
8 to 50 labeled examples per class	SetFit (few-shot contrastive)	Approaches encoders fine-tuned on the full training set; much cheaper than an LLM
Zero labeled examples, common topics	NLI-based zero-shot (BART-large-MNLI, mDeBERTa NLI)	Open-weights, multilingual, no API cost
Annotation generation at scale	LLM labeling followed by encoder distillation	Cuts annotation cost; resulting encoder has low inference cost
Low-latency guardrails in an LLM pipeline	ModernBERT or FastText	Sub-millisecond to single-millisecond latency

How is text classified across languages?

Most of the benchmark history of text classification is English-centric. Three approaches extend coverage to other languages.

Multilingual encoders. mBERT (Devlin et al., 2018) was pre-trained on 104 languages from Wikipedia and showed surprising cross-lingual transfer even without language-specific data. ^[2] XLM-RoBERTa (Conneau et al., 2019) improved substantially by training on 2.5 terabytes of CommonCrawl data in 100 languages; the authors report that XLM-R "significantly outperforms multilingual BERT (mBERT)", including a 14.6-point average accuracy gain on XNLI (80.9 versus 66.3). ^[23] mDeBERTa-v3-base adds disentangled attention and ELECTRA-style training to the multilingual setting, reaching 87.1 percent on English MNLI and maintaining above 80 percent accuracy across multiple languages when fine-tuned on the combined XNLI and multilingual NLI dataset.

Cross-lingual zero-shot classification. NLI models trained on multilingual data (e.g., joeddav/xlm-roberta-large-xnli) can classify text in languages not seen during task fine-tuning. This zero-shot cross-lingual approach works well for typologically close languages and degrades on more distant or low-resource ones.

Language-specific fine-tuning. For high-value languages with adequate data, language-specific BERT variants (CamemBERT for French, BERTje for Dutch, RoBERTa-base-chinese, etc.) typically outperform multilingual models by two to five points when fine-tuned on in-language data.

What are the open problems in text classification?

Text classification is mature but not solved.

Domain shift. A classifier trained on Amazon reviews drops sharply on Twitter or clinical notes. Domain adaptation and in-domain pre-training help but never fully close the gap.
Class imbalance. Real-world label distributions are skewed, especially in toxicity, fraud, and medical coding. Focal loss, class-balanced sampling, threshold tuning, and SMOTE oversampling are standard responses.
Label noise. Annotators disagree on subjective labels like toxicity, sarcasm, or emotion. Soft labels, multi-annotator modeling, and confidence calibration are active research areas.
Adversarial robustness. Character swaps, homoglyphs, or paraphrases can flip predictions. TextFooler and BERT-Attack document the gap; adversarial training and randomized smoothing are common defenses.
Multilingual coverage. Most benchmarks are English-centric. mBERT, XLM-R, and mDeBERTa-v3 extend coverage to 100-plus languages, but low-resource languages still lag. ^[23]
Calibration. Scores are often miscalibrated after large-batch training. Temperature scaling, isotonic regression, and Platt scaling fix this when probabilities feed downstream decisions.
Explainability and PII. Attention weights, integrated gradients, LIME, and SHAP help explain flags. Differential privacy, federated learning, and PII scrubbing limit memorization of user text.
Long-document handling. Standard BERT encoders truncate at 512 tokens. Hierarchical encoders, sparse-attention models (Longformer, BigBird), and the new generation of long-context encoders (ModernBERT at 8,192 tokens) reduce but do not eliminate the problem for book-length or multi-document classification. ^[24]^[30]
LLM vs. encoder tradeoffs. Despite the promise of zero-shot LLMs, 2024 research consistently shows fine-tuned encoders outperform zero-shot GPT-4 on standard classification tasks when labeled training data is available, while being dramatically faster and cheaper. ^[26]^[35] The question of when to reach for an LLM versus a dedicated encoder remains an active design choice in production systems.

Topic area	Wiki pages
Sub-tasks	Sentiment analysis, Topic classification, Intent detection, Spam detection, Toxicity classification, Language identification, Named entity recognition
Architectures	Transformer, BERT, Convolutional neural network, Long short-term memory, Self-attention
Encoder models	RoBERTa, ALBERT, DistilBERT, DeBERTa, XLNet, ELECTRA, FastText, SetFit
LLMs used zero-shot	GPT-4, Claude, Gemini, Llama, Mistral
Foundations	Bag of words, TF-IDF, Word embedding, Word2Vec, GloVe, Tokenization, In-context learning
Benchmarks	GLUE, SuperGLUE, IMDB dataset
Training	Supervised fine-tuning, LoRA, Domain adaptation, Naive Bayes, Support vector machine

References

Vaswani, A., et al. "Attention Is All You Need". NeurIPS, 2017. https://arxiv.org/abs/1706.03762 ↩
Devlin, J., et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805, October 2018. https://arxiv.org/abs/1810.04805 ↩
Liu, Y., et al. "RoBERTa: A Robustly Optimized BERT Pretraining Approach". arXiv:1907.11692, July 2019. https://arxiv.org/abs/1907.11692
Sanh, V., et al. "DistilBERT, a distilled version of BERT". arXiv:1910.01108, October 2019. https://arxiv.org/abs/1910.01108
Lan, Z., et al. "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations". ICLR 2020, arXiv:1909.11942. https://arxiv.org/abs/1909.11942
Clark, K., et al. "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators". ICLR 2020, arXiv:2003.10555. https://arxiv.org/abs/2003.10555
He, P., Gao, J., and Chen, W. "DeBERTaV3". ICLR 2023, arXiv:2111.09543. https://arxiv.org/abs/2111.09543 ↩
Yang, Z., et al. "XLNet: Generalized Autoregressive Pretraining for Language Understanding". NeurIPS 2019, arXiv:1906.08237. https://arxiv.org/abs/1906.08237
Joulin, A., et al. "Bag of Tricks for Efficient Text Classification". EACL 2017, arXiv:1607.01759. https://arxiv.org/abs/1607.01759 ↩
Kim, Y. "Convolutional Neural Networks for Sentence Classification". EMNLP 2014, arXiv:1408.5882. https://arxiv.org/abs/1408.5882 ↩
Wang, A., et al. "GLUE". EMNLP 2018 BlackboxNLP, arXiv:1804.07461. https://arxiv.org/abs/1804.07461 ↩
Wang, A., et al. "SuperGLUE". NeurIPS 2019, arXiv:1905.00537. https://arxiv.org/abs/1905.00537 ↩
Maas, A. L., et al. "Learning Word Vectors for Sentiment Analysis" (IMDB). ACL 2011. https://ai.stanford.edu/~amaas/data/sentiment/
Socher, R., et al. "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank" (SST-2). EMNLP 2013. https://nlp.stanford.edu/sentiment/
Zhang, X., Zhao, J., and LeCun, Y. "Character-level Convolutional Networks for Text Classification" (AG News, Yelp, Amazon Polarity). NIPS 2015, arXiv:1509.01626. https://arxiv.org/abs/1509.01626
Joachims, T. "Text Categorization with Support Vector Machines: Learning with Many Relevant Features". ECML 1998, Springer LNCS 1398. https://link.springer.com/chapter/10.1007/BFb0026683 (For the TREC question dataset, see Li and Roth 2002.)
Jigsaw / Conversation AI. "Toxic Comment Classification Challenge". Kaggle, 2018. https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge
Tunstall, L., et al. "Efficient Few-Shot Learning Without Prompts" (SetFit). arXiv:2209.11055, September 2022. https://arxiv.org/abs/2209.11055
Yin, W., Hay, J., and Roth, D. "Benchmarking Zero-shot Text Classification". EMNLP 2019, arXiv:1909.00161. https://arxiv.org/abs/1909.00161
Hugging Face Transformers, "Text classification". https://huggingface.co/docs/transformers/tasks/sequence_classification
GLUE leaderboard. https://gluebenchmark.com/leaderboard
Hugging Face. "SetFit: Efficient Few-Shot Learning Without Prompts" blog, September 2022. https://huggingface.co/blog/setfit
Conneau, A., et al. "Unsupervised Cross-lingual Representation Learning at Scale" (XLM-R). ACL 2020, arXiv:1911.02116. https://arxiv.org/abs/1911.02116
Warner, B., et al. "Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference" (ModernBERT). arXiv:2412.13663, December 2024. https://arxiv.org/abs/2412.13663
Le Breton, L., et al. "NeoBERT: A Next-Generation BERT". arXiv:2502.19587, February 2025. https://arxiv.org/abs/2502.19587
Bucher, M. J. J., and Martini, M. "Fine-Tuned 'Small' LLMs (Still) Significantly Outperform Zero-Shot Generative AI Models in Text Classification". arXiv:2406.08660, June 2024. https://arxiv.org/abs/2406.08660
Pangakis, N., and Wolken, S. "Knowledge Distillation in Automated Annotation: Supervised Text Classification with LLM-Generated Training Labels". ACL NLP+CSS 2024, arXiv:2406.17633. https://arxiv.org/abs/2406.17633
Alex, N., et al. "RAFT: A Real-World Few-Shot Text Classification Benchmark". NeurIPS 2021, arXiv:2109.14076. https://arxiv.org/abs/2109.14076
Yang, Z., et al. "Hierarchical Attention Networks for Document Classification". NAACL 2016. https://aclanthology.org/N16-1174/
Beltagy, I., Peters, M. E., and Cohan, A. "Longformer: The Long-Document Transformer". arXiv:2004.05150, April 2020. https://arxiv.org/abs/2004.05150
Hu, E., et al. "LoRA: Low-Rank Adaptation of Large Language Models". arXiv:2106.09685, June 2021. https://arxiv.org/abs/2106.09685
Zaheer, M., et al. "Big Bird: Transformers for Longer Sequences". NeurIPS 2020, arXiv:2007.14062. https://arxiv.org/abs/2007.14062
Li, X., and Roth, D. "Learning Question Classifiers" (TREC). COLING 2002. https://aclanthology.org/C02-1150/
Conneau, A., et al. "XNLI: Evaluating Cross-lingual Sentence Representations". EMNLP 2018, arXiv:1809.05053. https://arxiv.org/abs/1809.05053
Jacobs, A. "Beating BERT? Small LLMs vs Fine-Tuned Encoders for Classification". January 2026. https://alex-jacobs.com/posts/beatingbert/
