Classification (machine learning)
Last reviewed
May 26, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v3 · 4,117 words
Classification is a supervised learning task in machine learning in which an algorithm learns to assign discrete category labels to input data, in contrast to regression, which predicts continuous numeric outputs.[1] A classifier is trained on a labeled dataset of examples and is then expected to generalize to previously unseen inputs, producing either a single predicted class or a probability distribution over candidate classes.[2] Classification underlies a large fraction of deployed machine learning systems, including spam filters, medical triage tools, credit scoring, fraud detection, image recognition, and natural language understanding.[1] Algorithms used for classification range from classical statistical methods such as logistic regression, naive Bayes, and linear discriminant analysis to modern deep architectures such as convolutional networks, transformer encoders, and large multimodal foundation models that classify in a zero-shot setting from a text or image prompt.[3][4][5]
In a classification problem, a learner observes a training set of input-output pairs (x, y), where each x is a feature vector in some input space and each y is an element of a finite label set. The goal is to learn a function f that maps new inputs to labels with low expected error under the same distribution that generated the training data.[1] When the algorithm outputs probabilities p(y | x), the predicted class is typically argmax p(y | x), but the probabilities themselves are useful for thresholding, ranking, and downstream decision making.[6]
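The decision rule described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the label names, the probability values, and the function names `predict_label` and `predict_positive` are all hypothetical.

```python
def predict_label(probs):
    """Return the label with the highest predicted probability (argmax)."""
    return max(probs, key=probs.get)


def predict_positive(p_positive, threshold=0.5):
    """Binary decision from p(y = 1 | x) using an adjustable threshold."""
    return p_positive >= threshold


# Hypothetical model output p(y | x) for a single input x.
probs = {"cat": 0.2, "dog": 0.7, "bird": 0.1}
label = predict_label(probs)

# Lowering the threshold trades precision for recall on the positive class.
flag_default = predict_positive(0.3)
flag_sensitive = predict_positive(0.3, threshold=0.25)
```

Keeping the probabilities rather than only the argmax is what makes the second function possible: the same model can serve different operating points by changing the threshold alone.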
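Three structural variants are standard: binary classification (two labels), multiclass classification (exactly one of K > 2 labels), and multilabel classification (any subset of the labels may apply to the same input).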
Related variants include hierarchical classification (labels form a tree or DAG), ordinal classification (labels have a natural order, such as star ratings), and open-set classification, where the model must also detect inputs that belong to none of the training classes.[1]
Classification has a long statistical pedigree predating the term "machine learning". Ronald A. Fisher introduced linear discriminant analysis in 1936 to separate iris species by petal and sepal measurements, producing one of the most cited classification datasets in history.[7] Logistic regression, naive Bayes, decision trees, and support vector machines together remain the workhorse algorithms for tabular data in 2026.
Logistic regression models the log-odds of class membership as a linear combination of features. For a binary problem it predicts p(y = 1 | x) = sigma(w x + b), where sigma is the logistic sigmoid function. The parameters are usually estimated by maximizing the conditional log-likelihood, which is equivalent to minimizing the cross-entropy loss. The multiclass extension uses a softmax over K linear scores and is sometimes called multinomial logistic regression or softmax regression.[1] Despite its simplicity, logistic regression remains a strong baseline because it produces naturally calibrated probabilities, scales to very large feature spaces with sparse solvers, and yields easily interpretable weight coefficients.[6]
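As a rough sketch of the binary model above, the following fits p(y = 1 | x) = sigma(w x + b) by plain batch gradient descent on the average cross-entropy. The learning rate, epoch count, and toy dataset are assumptions chosen for illustration; production solvers typically use second-order or stochastic methods plus regularization.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, x):
    """p(y = 1 | x) for weight vector w, bias b, feature vector x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def cross_entropy(w, b, X, y):
    """Average negative conditional log-likelihood of labels in {0, 1}."""
    eps = 1e-12
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(w, b, xi)
        total -= yi * math.log(p + eps) + (1 - yi) * math.log(1 - p + eps)
    return total / len(X)


def fit(X, y, lr=0.5, epochs=500):
    """Batch gradient descent; the gradient of the loss is mean((p - y) * x)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j, xij in enumerate(xi):
                grad_w[j] += err * xij / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Toy one-feature dataset (hypothetical): small x is class 0, large x is class 1.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
w, b = fit(X, y)
```

The learned weight `w[0]` is the change in log-odds per unit of the feature, which is the interpretability property mentioned above.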
Corinna Cortes and Vladimir Vapnik formalized the support vector machine (SVM) in 1995, framing classification as the search for the hyperplane that maximizes the margin between the two classes, with the closest training examples acting as "support vectors".[8] The soft-margin SVM minimizes a sum of hinge loss and an L2 regularizer, and the kernel trick allows the same algorithm to learn nonlinear boundaries by implicitly mapping features into a high-dimensional reproducing kernel Hilbert space. Common kernels include radial basis function (RBF), polynomial, and sigmoid kernels.[8] SVMs dominated the late 1990s and 2000s for tasks like text categorization and handwritten digit recognition before deep learning overtook them on raw perceptual inputs.
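The soft-margin objective and the RBF kernel mentioned above can be written out directly. This sketch only evaluates the objective for a given linear model; it does not solve the optimization problem. The regularization strength `lam`, the kernel width `gamma`, and the averaging of the hinge term are assumed conventions (other formulations scale the terms with a constant C instead).

```python
import math


def hinge_objective(w, b, X, y, lam=0.01):
    """Average hinge loss plus an L2 penalty; labels y must be in {-1, +1}."""
    hinge = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        # Points with margin >= 1 contribute nothing; others are penalized linearly.
        hinge += max(0.0, 1.0 - margin)
    l2 = sum(wj * wj for wj in w)
    return hinge / len(X) + lam * l2


def rbf_kernel(a, c, gamma=1.0):
    """Radial basis function kernel exp(-gamma * ||a - c||^2)."""
    sq_dist = sum((ai - ci) ** 2 for ai, ci in zip(a, c))
    return math.exp(-gamma * sq_dist)


# Two hypothetical points, one per class, symmetric about the origin.
X = [[2.0], [-2.0]]
y = [1, -1]
wide_margin = hinge_objective([1.0], 0.0, X, y)    # both margins are 2
narrow_margin = hinge_objective([0.25], 0.0, X, y)  # both margins are 0.5
```

With `w = [1.0]` both points sit outside the margin, so only the regularizer contributes; shrinking `w` to `[0.25]` pulls them inside the margin and the hinge term dominates.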
The k-nearest neighbors (k-NN) classifier stores the training set and labels each new point by majority vote among its k closest training examples under a chosen distance metric. k-NN has no training phase, can approximate any decision boundary in the infinite-sample limit, and is widely used as a sanity-check baseline. The main practical drawback is inference cost, which scales linearly in the size of the training set; modern deployments mitigate this with approximate nearest neighbor indexes.[1]
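A brute-force version of the majority-vote rule above fits in a few lines. Euclidean distance and the toy points are assumptions for illustration; the linear scan over the training set is exactly the inference cost the paragraph describes.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Sort all training points by Euclidean distance to the query.
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    # On a tied vote, Counter keeps first-seen order, so the nearer label wins.
    return votes.most_common(1)[0][0]


# Hypothetical 2-D training set with two well-separated clusters.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]
near_origin = knn_predict(train_X, train_y, (0.2, 0.1))
near_far_cluster = knn_predict(train_X, train_y, (5.5, 5.5))
```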
Naive Bayes classifiers apply Bayes' rule under the assumption that features are conditionally independent given the class. The class-conditional likelihoods are typically Gaussian (for continuous features), multinomial (for word counts), or Bernoulli (for binary features). Although the independence assumption rarely holds in practice, naive Bayes remains effective for text classification because the high dimensionality of word features tends to dampen the impact of correlated features, and the model trains in a single pass over the data.[1]
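The multinomial variant for word counts can be sketched as follows. Two details here are assumptions not stated above: additive (Laplace) smoothing with `alpha = 1.0`, and silently skipping query words that never appeared in training. The tiny spam/ham corpus is made up.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Single pass over the data: count documents per class and words per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab, alpha


def predict_nb(model, words):
    """Return argmax over classes of log p(y) + sum of log p(word | y)."""
    class_counts, word_counts, vocab, alpha = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(count / n_docs)  # log prior
        for w in words:
            if w not in vocab:
                continue  # assumption: ignore words unseen in training
            # Smoothed multinomial likelihood; independence lets the logs add.
            score += math.log(
                (word_counts[label][w] + alpha) / (total + alpha * len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


docs = [["win", "money", "now"], ["cheap", "money"],
        ["meeting", "at", "noon"], ["lunch", "meeting"]]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels)
```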
A decision tree recursively partitions the feature space along axis-aligned splits chosen to maximize information gain (or to minimize Gini impurity) on the training set. Trees handle mixed numeric and categorical features without scaling and produce human-readable rules, but they are prone to overfitting when grown deep. Two ensemble families dominate modern tabular classification: random forests and gradient boosting.
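The split criteria named in the decision-tree description can be computed directly. The sketch below scores candidate thresholds on a single numeric feature; a real tree learner would repeat this over all features and recurse on each side. The midpoint-threshold scan and the helper names are illustrative assumptions.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the label distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - child


def best_threshold(values, labels):
    """Scan midpoints between sorted feature values; return (threshold, gain)."""
    pairs = sorted(zip(values, labels))
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no split possible between equal values
        thr = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= thr]
        right = [lab for v, lab in pairs if v > thr]
        gain = information_gain(labels, left, right)
        if gain > best[1]:
            best = (thr, gain)
    return best


split = best_threshold([1, 2, 8, 9], ["a", "a", "b", "b"])
```

On this toy feature the best split falls between 2 and 8 and separates the classes completely, so the gain equals the full parent entropy of one bit.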
Linear discriminant analysis (LDA), in its modern statistical form, models each class as a multivariate Gaussian with a shared covariance matrix and uses Bayes' rule to derive a linear decision boundary. Quadratic discriminant analysis relaxes the shared-covariance assumption. LDA is closely related to logistic regression but is parameterized generatively rather than discriminatively, which gives it an edge when the Gaussian assumption holds and the training set is small.[7]
A long-running distinction in classification contrasts generative models, which model the joint distribution p(x, y) and derive p(y | x) by Bayes' rule, with discriminative models, which directly model the conditional p(y | x).[11] Naive Bayes and LDA are canonical generative classifiers; logistic regression, SVMs, conditional random fields, and most neural network classifiers are discriminative.
Andrew Ng and Michael Jordan analyzed the trade-off in a 2001 paper that compared the naive Bayes and logistic regression classifier pair (one generative, one discriminative, with the same parametric family for p(y | x)) and showed that discriminative learning attains lower asymptotic test error, but generative learning often approaches its own (higher) asymptote much faster as the training set grows.[11] In practice this means that on very small datasets, naive Bayes can outperform logistic regression even though logistic regression is the better choice when data is abundant.[11] Generative models also offer additional capabilities, including imputation of missing features, generation of synthetic samples, and natural handling of unlabeled data through expectation-maximization.
From the early 2010s onward, deep neural networks displaced hand-engineered features for classification on images, audio, video, and text. A neural network classifier ends in a softmax (for multiclass) or sigmoid (for binary or multilabel) output layer and is trained end-to-end by stochastic gradient descent on a cross-entropy objective. The breakthrough was less the loss function (already standard) than the combination of large labeled datasets, GPU compute, and architectural advances.
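The multiclass output layer and its loss can be sketched without any framework. The max-subtraction inside `softmax` is a standard numerical-stability step added here, not something stated above. The gradient with respect to the scores, softmax(scores) minus the one-hot label, is the quantity that backpropagation pushes into the rest of the network.

```python
import math


def softmax(scores):
    """Map K real-valued scores to a probability distribution."""
    m = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, true_index):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(scores)[true_index])


def grad_wrt_scores(scores, true_index):
    """Gradient of the loss w.r.t. the scores: softmax(scores) - one_hot(y)."""
    probs = softmax(scores)
    return [p - (1.0 if k == true_index else 0.0) for k, p in enumerate(probs)]


probs = softmax([1.0, 2.0, 3.0])
loss_uniform = cross_entropy([0.0, 0.0], 0)
grad = grad_wrt_scores([1.0, 2.0, 3.0], 2)
```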
Yann LeCun and collaborators applied convolutional neural networks (CNNs) to handwritten digit recognition in the LeNet system at AT&T Bell Labs in the 1990s, but CNNs reached broad attention only after Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton's AlexNet won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012 with a top-5 error of 15.3 percent, more than ten percentage points ahead of the runner-up.[12] AlexNet had 60 million parameters in eight learned layers, used ReLU activations and dropout, and was trained on two GPUs; it is widely cited as the moment deep learning displaced hand-crafted feature pipelines in computer vision.[12]
The ImageNet benchmark, formalized by Olga Russakovsky and colleagues, scored object-classification systems on roughly 1.2 million training images across 1,000 categories, with held-out validation and test sets.[13] Annual improvements followed in rapid succession: VGG and GoogLeNet in 2014, ResNet in 2015, and a long tail of architectural variants. ResNet, developed by Kaiming He and colleagues, introduced residual (skip) connections that let the network learn residual functions relative to the identity, which enabled training of networks up to 152 layers deep and achieved a 3.57 percent top-5 error on the ImageNet test set, winning the 2015 ILSVRC.[14] ResNet's residual block has since been adopted by nearly every subsequent vision and language architecture.
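The idea of learning a residual relative to the identity can be shown with a toy block. This is only a schematic on plain lists: the function `f` stands in for the block's learned layers, and nothing here reproduces ResNet's actual convolutional block.

```python
def residual_block(x, f):
    """Return x + f(x): the block only has to learn the residual f."""
    fx = f(x)
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_layers(v):
    """Hypothetical stand-in for learned layers whose output is all zeros."""
    return [0.0] * len(v)


def small_update(v):
    """Hypothetical stand-in for learned layers that make a small correction."""
    return [0.1 * vi for vi in v]


identity_out = residual_block([1.0, 2.0], zero_layers)
adjusted_out = residual_block([1.0, 2.0], small_update)
```

When the learned layers output zero, the block passes its input through unchanged, which is why stacking many such blocks does not make the identity mapping hard to represent.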
Vision transformer (ViT), introduced by Alexey Dosovitskiy and collaborators at Google Research in 2020, applied the transformer encoder architecture directly to images by splitting each image into a sequence of fixed-size patches (commonly 16x16 pixels), embedding each patch linearly, and feeding the resulting tokens through standard transformer blocks with positional encoding.[15] When pre-trained on large datasets such as JFT-300M, ViT matched or exceeded the accuracy of comparable CNNs on ImageNet while requiring substantially less training compute, demonstrating that the inductive biases of convolutions are not strictly necessary at scale.[15] Variants including DeiT, Swin Transformer, and BEiT have explored data-efficient training, hierarchical attention, and self-supervised pre-training.
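The patch-splitting step can be sketched on a plain nested list. The 224 by 224 single-channel input is an assumption for illustration, and the linear embedding, positional encoding, and transformer blocks that follow are omitted.

```python
def patchify(image, patch=16):
    """Split an H x W image (list of rows) into flattened patch x patch tiles.

    Tiles are returned in row-major order, one flat list per tile, which is
    the token sequence a ViT-style model would then embed linearly.
    """
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be a multiple of the patch size")
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return tokens


# Hypothetical single-channel image whose pixel value encodes its position.
image = [[r * 224 + c for c in range(224)] for r in range(224)]
tokens = patchify(image)
```

With these assumed sizes the image becomes a sequence of 14 x 14 = 196 tokens, each holding 256 pixel values; a three-channel image would give 768 values per token.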
For text classification, the dominant architecture evolved from bag-of-words logistic regression and SVMs in the 1990s and 2000s, to RNN and LSTM models in the mid-2010s, to transformer encoders from 2018 onward. BERT, introduced by Jacob Devlin and collaborators in 2018, is a bidirectional transformer pre-trained with masked language modeling on the BooksCorpus and English Wikipedia and fine-tuned on labeled downstream tasks including sentiment classification, natural language inference, and question answering. BERT-Large pushed the GLUE benchmark to 80.5 and SQuAD v1.1 test F1 to 93.2 at release.[3] Variants such as RoBERTa, ALBERT, DeBERTa, and DistilBERT followed, refining training recipes, parameter sharing, attention patterns, and distillation. By 2026, encoder-only transformers remain the standard tool for high-accuracy text classification when labeled data is available.
Almost all probabilistic classifiers are trained by minimizing a loss function derived from the negative log-likelihood of the training labels under the model.
Auxiliary losses (label smoothing, supervised contrastive loss, knowledge-distillation losses) are often added to the primary cross-entropy term to improve generalization or calibration.
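Label smoothing, the first auxiliary technique listed above, replaces the one-hot target with a slightly softened distribution. The sketch uses the common formulation (1 - eps) * one_hot + eps / K with an assumed `eps = 0.1`; other variants spread the mass only over the wrong classes.

```python
import math


def smooth_labels(true_index, num_classes, eps=0.1):
    """Soft target: (1 - eps) on the true class plus eps / K on every class."""
    off = eps / num_classes
    return [
        (1.0 - eps) + off if k == true_index else off
        for k in range(num_classes)
    ]


def soft_cross_entropy(probs, target):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


target = smooth_labels(0, 4)
# Hypothetical predictions: one very confident, one moderately confident.
loss_confident = soft_cross_entropy([0.997, 0.001, 0.001, 0.001], target)
loss_moderate = soft_cross_entropy([0.91, 0.03, 0.03, 0.03], target)
```

Under the smoothed target, the extremely confident prediction is penalized more than the moderate one, which is the mechanism by which label smoothing discourages overconfidence.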
A single number rarely captures a classifier's quality, so practitioners report several complementary metrics. Every metric below ultimately derives from the confusion matrix, which tabulates the joint distribution of true and predicted labels.
For a binary problem with classes labeled positive (P) and negative (N), the confusion matrix has four cells: true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). From these:
| Metric | Definition | When useful |
|---|---|---|
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Roughly balanced classes, symmetric error costs |
| Precision | TP / (TP + FP) | False positives are expensive (e.g., automated charges) |
| Recall (sensitivity) | TP / (TP + FN) | False negatives are expensive (e.g., cancer screening) |
| Specificity | TN / (TN + FP) | Both classes matter, complements recall |
| F1 score | 2 × precision × recall / (precision + recall) | Single number combining precision and recall |
| ROC AUC | Area under the ROC curve | Threshold-free ranking quality |
| PR AUC | Area under the precision-recall curve | Heavily imbalanced positives |
| Log loss | Negative average log-likelihood | Penalizes overconfident wrong predictions |
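The count-based rows of the table can be computed from the four confusion-matrix cells as below. Returning 0.0 when a denominator is zero is an assumed convention; some libraries raise a warning or return NaN instead. The example labels are made up.

```python
def binary_metrics(y_true, y_pred):
    """Compute confusion-matrix cells and derived metrics for labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    def ratio(num, den):
        return num / den if den else 0.0  # assumed convention for empty cells

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
m = binary_metrics(y_true, y_pred)
```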
The receiver operating characteristic (ROC) curve plots the true positive rate against the false positive rate as the classification threshold sweeps from 0 to 1. The area under the ROC curve (ROC AUC) summarizes this curve as a single scalar in [0, 1] and equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. An AUC of 0.5 corresponds to random guessing; an AUC of 1.0 indicates perfect ranking.[1] On heavily imbalanced datasets, ROC AUC can be misleadingly optimistic, and the precision-recall curve (and its AUC) is preferred because it focuses on positive-class performance.
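The probabilistic reading of ROC AUC gives a direct, if quadratic-time, way to compute it: compare every positive score with every negative score. Counting tied scores as half a win is an assumed (but common) convention, and the scores below are made up.

```python
def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumed convention: a tie counts as half
    return wins / (len(pos) * len(neg))


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Here three of the four positive-negative pairs are ordered correctly. For large datasets a rank-based computation after sorting gives the same value in O(n log n).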
For multiclass problems, the confusion matrix is K × K. Per-class precision, recall, and F1 are typically aggregated via macro-averaging (unweighted mean across classes), micro-averaging (compute global TP/FP/FN first, then derive the metric), or weighted averaging (mean weighted by class support). Multilabel metrics include subset accuracy, Hamming loss, and macro-averaged AUC.[1]
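The difference between macro- and micro-averaging is easiest to see on a skewed example. The sketch below handles single-label multiclass predictions only, and the label data is hypothetical.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from counts; algebraically equal to the harmonic mean of P and R."""
    den = 2 * tp + fp + fn
    return 2 * tp / den if den else 0.0


def macro_micro_f1(y_true, y_pred):
    """Return (macro F1, micro F1) for single-label multiclass predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    # Macro: average the per-class scores, every class weighted equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    # Micro: pool the counts first, then compute one score.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]
macro, micro = macro_micro_f1(y_true, y_pred)
```

The single error falls on the rare class "b", so the macro score drops well below the micro score, which is dominated by the frequent class.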
Many real-world classification problems have a strong skew in label frequencies: fraudulent transactions may be 0.1 percent of all transactions, rare diseases may appear in fewer than 1 percent of patients, and clickthroughs are often a small fraction of impressions. Naively minimizing average loss on such data tends to produce models that predict the majority class for almost every input and still score high on accuracy.[17] See class-imbalanced dataset.
Standard mitigations operate at the data level, the algorithm level, or both.
At the algorithm level, many libraries expose a class_weight parameter that reweights the loss so that minority examples contribute more per sample. This is equivalent to a cost matrix in cost-sensitive learning.
Resampling should be applied only to the training data; applying it to the test or validation set inflates measured performance and produces an unreliable estimate of deployment behavior.[17]
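The effect of class weighting on the loss can be sketched as follows. The inverse-frequency formula n / (K * count) is one common heuristic and is an assumption here, as are the function names and the toy data; the text above does not prescribe a specific weighting scheme.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights n / (K * count_c): one common heuristic."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def weighted_log_loss(y_true, p_pos, weights):
    """Weighted average log loss for labels in {0, 1}."""
    eps = 1e-12
    total, norm = 0.0, 0.0
    for y, p in zip(y_true, p_pos):
        p_true = p if y == 1 else 1.0 - p
        total -= weights[y] * math.log(max(p_true, eps))
        norm += weights[y]
    return total / norm


# Hypothetical 1 percent positive rate and a model that always predicts 0.01.
y_true = [0] * 99 + [1]
p_pos = [0.01] * 100
weights = balanced_class_weights(y_true)
plain = weighted_log_loss(y_true, p_pos, {0: 1.0, 1: 1.0})
reweighted = weighted_log_loss(y_true, p_pos, weights)
```

The unweighted loss makes the always-negative model look good, while the reweighted loss exposes its failure on the minority class.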
A classifier is calibrated when its predicted probabilities match empirical frequencies: among all examples for which the model predicts probability 0.8, roughly 80 percent should belong to the predicted class. Calibration matters whenever downstream systems use the probabilities for thresholding, expected-utility decisions, or risk estimation. Logistic regression tends to be well calibrated out of the box; modern deep networks are usually overconfident, a phenomenon Chuan Guo and colleagues documented systematically in 2017 in "On Calibration of Modern Neural Networks".[4] They showed that depth, width, weight decay, and batch normalization all contribute to miscalibration, and that the problem is severe enough to be visible in standard reliability diagrams. See calibration.
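The definition of calibration above suggests a simple check: group predictions by confidence and compare each group's average confidence with its accuracy. The sketch below computes expected calibration error (ECE), a summary statistic that is not named in the text above and is assumed here for illustration; the equal-width binning and the bin count are also assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Size-weighted mean gap between confidence and accuracy across bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece


# Hypothetical predictions: 0.8 confidence with 8 of 10 correct is calibrated.
calibrated = expected_calibration_error([0.8] * 10, [True] * 8 + [False] * 2)
# 0.9 confidence with only 5 of 10 correct is overconfident.
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
```

The per-bin (average confidence, accuracy) pairs computed inside the loop are the points a reliability diagram would plot.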
Three calibration methods are standard:
By the mid-2020s, large pre-trained foundation models reshaped classification practice. Rather than collect labeled examples for every new task, practitioners increasingly rely on models pre-trained on web-scale data and prompt or condition them to classify in new label spaces with little or no task-specific training data.
CLIP (Contrastive Language-Image Pre-training), introduced by Alec Radford and colleagues at OpenAI in 2021, trains an image encoder and a text encoder jointly on 400 million image-caption pairs scraped from the web using a contrastive objective that aligns matched pairs and separates mismatched ones.[5] After pre-training, CLIP can classify an image into an arbitrary set of categories described in natural language by encoding each candidate caption ("a photo of a cat", "a photo of a dog", etc.) and assigning the image to the caption whose embedding has the highest cosine similarity with the image embedding. CLIP achieved zero-shot ImageNet accuracy comparable to a fully supervised ResNet-50, and its embedding space underpins many downstream open-vocabulary classification and retrieval systems.[5]
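The zero-shot decision rule described above reduces to a nearest-caption search under cosine similarity. The sketch uses made-up three-dimensional vectors in place of real encoder outputs; no CLIP model or library is called, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb, caption_embs):
    """Return the caption whose embedding is most similar to the image embedding."""
    return max(caption_embs, key=lambda cap: cosine(image_emb, caption_embs[cap]))


# Made-up embeddings standing in for the text and image encoder outputs.
caption_embs = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
}
image_emb = [0.9, 0.1, 0.0]
prediction = zero_shot_classify(image_emb, caption_embs)
```

Changing the label space only requires encoding a new set of captions; the image side is untouched, which is what makes the classifier open-vocabulary.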
Beyond CLIP, several modern approaches use pre-trained representations for classification:
These methods do not replace classical classifiers wholesale. For tabular data, gradient boosting still typically outperforms deep approaches; for high-volume, high-stakes deployments with abundant labels, a fine-tuned task-specific model usually beats a zero-shot prompt. But the practical default for "I need a classifier and have very little labeled data" has shifted from logistic regression with hand-engineered features to a foundation-model embedding plus a small head.
Classification is used across nearly every industry that has structured outcomes:
| Domain | Task | Typical approach as of 2026 |
|---|---|---|
| Email and messaging | Spam, phishing, and abuse detection | Gradient boosting and fine-tuned BERT-family encoders |
| Finance | Credit scoring, fraud detection, transaction risk | XGBoost, LightGBM, logistic regression, deep tabular models |
| Healthcare | Disease screening, pathology slide triage, ECG arrhythmia detection | CNNs and vision transformers for imaging, gradient boosting for EHR features |
| Computer vision | Object recognition, object detection, scene classification | CNNs and ViT-family backbones, often pre-trained |
| Natural language processing | Sentiment analysis, topic classification, intent detection, content moderation | BERT-family fine-tuning, prompted LLMs for low-resource tasks |
| Cybersecurity | Malware family classification, intrusion detection | Random forests, gradient boosting, neural network classifiers |
| Industrial | Defect detection in manufacturing, sensor anomaly classification | CNNs for visual inspection, gradient boosting for sensor data |
| Marketing | Churn prediction, lead scoring, response models | Gradient boosting, logistic regression, deep tabular models |
| Autonomous systems | Traffic sign recognition, pedestrian classification | CNNs and ViTs as components of a larger perception stack |
In each case, the practical pipeline includes feature engineering (for tabular data) or pre-processing (for images and text), model selection and hyperparameter tuning, calibration, threshold selection, fairness and bias audits, and monitoring of distribution shift in production.