# Classification (machine learning)

> Source: https://aiwiki.ai/wiki/classification
> Updated: 2026-06-20
> Categories: Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

Classification in machine learning is the [supervised learning](/wiki/supervised_learning) task of training an algorithm to assign discrete category labels to input data, in contrast to [regression](/wiki/regression), which predicts continuous numeric outputs.[^1] A classifier is fit on a labeled dataset of examples and is then expected to generalize to previously unseen inputs, producing either a single predicted class or a probability distribution over candidate classes.[^2] Classification underlies a large fraction of deployed machine learning systems, including spam filters, medical triage tools, credit scoring, fraud detection, image recognition, and natural language understanding.[^1] Algorithms used for classification range from classical statistical methods such as [logistic regression](/wiki/logistic_regression), [naive Bayes](/wiki/naive_bayes), and [linear discriminant analysis](/wiki/linear_discriminant_analysis) to modern deep architectures such as convolutional networks, transformer encoders, and large multimodal foundation models that classify in a zero-shot setting from a text or image prompt.[^3][^4][^5]

Classification is among the oldest and most studied problems in the field. Ronald A. Fisher published a linear classification rule in 1936,[^7] and image classification drove the modern deep learning era: in 2012 the [AlexNet](/wiki/alexnet) network cut the [ImageNet](/wiki/imagenet) top-5 error to 15.3 percent versus 26.2 percent for the second-place entry, a margin that is widely credited with launching the deep learning boom.[^12][^19]

## What is classification in machine learning?

In a classification problem, a learner observes a training set of input-output pairs (x, y), where each x is a feature vector in some input space and each y is an element of a finite label set. The goal is to learn a function f that maps new inputs to labels with low expected error under the same distribution that generated the training data.[^1] When the algorithm outputs probabilities p(y | x), the predicted class is typically argmax p(y | x), but the probabilities themselves are useful for thresholding, ranking, and downstream decision making.[^6]

## How does classification differ from regression?

Classification and [regression](/wiki/regression) are the two core supervised learning tasks. The distinction is the type of the target variable: classification predicts a discrete label from a finite set (spam or not spam, one of 1,000 object classes), while regression predicts a continuous quantity (a price, a temperature, a probability). The two tasks share most of the same algorithms and training machinery, but differ in their output layers and loss functions: a neural classifier typically ends in a [softmax](/wiki/softmax) or [sigmoid function](/wiki/sigmoid_function) and is trained with a [cross-entropy](/wiki/cross-entropy) objective, whereas a regressor ends in a linear output and is trained with squared error or a similar loss.[^1] A probabilistic classifier can be viewed as a regression onto class probabilities, which is why [logistic regression](/wiki/logistic_regression), despite its name, is a classification method.

## What are the main types of classification?

Three structural variants are standard:

- **Binary classification** has exactly two mutually exclusive labels, often encoded as 0 and 1 or as "positive" and "negative". Examples include spam vs. non-spam, fraudulent vs. legitimate, and disease present vs. absent. Most theory and many metrics were first developed in this setting.[^1] See [binary classification](/wiki/binary_classification).
- **Multiclass classification** has K > 2 mutually exclusive labels, such as the ten digits in MNIST or the 1,000 object classes in [ImageNet](/wiki/imagenet). Models either output K-way [softmax](/wiki/softmax) probabilities directly or reduce to binary problems via one-vs-rest or one-vs-one schemes.[^1] See [multi-class classification](/wiki/multi-class_classification).
- **Multilabel classification** allows each input to receive any subset of L candidate labels simultaneously. Common examples include tagging a news article with several topics or detecting multiple acoustic events in a recording. Multilabel problems are typically modeled as L independent [sigmoid function](/wiki/sigmoid_function) outputs and trained with per-label binary cross-entropy.[^1]
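
The difference between multiclass and multilabel decoding can be illustrated with a minimal sketch. The label names, raw scores, and the 0.5 threshold below are hypothetical values chosen for illustration, not taken from any particular system.

```python
import math

def softmax(scores):
    """Convert K real-valued scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(score):
    """Map one real-valued score to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical raw scores (logits) for three candidate labels.
labels = ["sports", "politics", "technology"]
logits = [2.0, -1.0, 1.5]

# Multiclass: labels are mutually exclusive, so pick the single argmax.
probs = softmax(logits)
multiclass_prediction = labels[probs.index(max(probs))]

# Multilabel: each label is decided independently against a threshold.
threshold = 0.5  # illustrative choice
multilabel_prediction = [
    label for label, z in zip(labels, logits) if sigmoid(z) >= threshold
]
```

With the same scores, the multiclass decoder returns one label while the multilabel decoder may return several, because the sigmoid outputs are not forced to compete for a shared probability mass.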

Related variants include hierarchical classification (labels form a tree or DAG), ordinal classification (labels have a natural order, such as star ratings), and open-set classification, where the model must also detect inputs that belong to none of the training classes.[^1]

## Classical classifiers

Classification has a long statistical pedigree predating the term "machine learning". Ronald A. Fisher introduced linear discriminant analysis in 1936 to separate iris species by petal and sepal measurements, producing one of the most cited classification datasets in history.[^7] Logistic regression, naive Bayes, decision trees, and support vector machines remain workhorse algorithms for tabular data as of 2026, and all four ship as standard estimators in the widely used [scikit-learn](/wiki/scikit_learn) library.

### Logistic regression

[Logistic regression](/wiki/logistic_regression) models the log-odds of class membership as a linear combination of features. For a binary problem it predicts p(y = 1 | x) = sigma(w · x + b), where w is a weight vector, b is a bias term, and sigma is the logistic [sigmoid function](/wiki/sigmoid_function). The parameters are usually estimated by maximizing the conditional log-likelihood, which is equivalent to minimizing the [cross-entropy](/wiki/cross-entropy) loss. The multiclass extension uses a [softmax](/wiki/softmax) over K linear scores and is sometimes called multinomial logistic regression or softmax regression.[^1] Despite its simplicity, logistic regression remains a strong baseline because it tends to produce well-calibrated probabilities, scales to very large feature spaces with sparse solvers, and yields easily interpretable weight coefficients.[^6]
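
As a minimal sketch of the fitting procedure, the following code estimates w and b by plain batch gradient descent on the mean cross-entropy. The learning rate, epoch count, and toy dataset are assumptions made for illustration; production solvers typically use more sophisticated optimizers and add regularization.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, epochs=2000):
    """Fit weights w and bias b by batch gradient descent on mean cross-entropy."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the cross-entropy w.r.t. the linear score
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(x, w, b):
    """Return the modeled probability p(y = 1 | x)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Toy one-feature dataset: small values are class 0, large values class 1.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
w, b = fit_logistic_regression(X, y)
```

The term `p - yi` is the whole per-example gradient with respect to the linear score, which is one reason the sigmoid and cross-entropy pair is convenient to optimize.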

### Support vector machines

Corinna Cortes and Vladimir Vapnik formalized the [support vector machine](/wiki/support_vector_machine_svm) (SVM) in 1995, framing classification as the search for the hyperplane that maximizes the margin between the two classes, with the closest training examples acting as "support vectors".[^8] The soft-margin SVM minimizes a sum of [hinge loss](/wiki/hinge_loss) and an L2 regularizer, and the kernel trick allows the same algorithm to learn nonlinear boundaries by implicitly mapping features into a high-dimensional reproducing kernel Hilbert space. Common kernels include radial basis function (RBF), polynomial, and sigmoid kernels.[^8] SVMs dominated the late 1990s and 2000s for tasks like text categorization and handwritten digit recognition before deep learning overtook them on raw perceptual inputs.

### K-nearest neighbors

The [k-nearest neighbors](/wiki/k_nearest_neighbors) (k-NN) classifier stores the training set and labels each new point by majority vote among its k closest training examples under a chosen distance metric. k-NN has no training phase, can approximate any decision boundary in the infinite-sample limit, and is widely used as a sanity-check baseline. The main practical drawback is inference cost: an exact search scales linearly with the size of the training set, which modern deployments mitigate with approximate nearest neighbor indexes.[^1]
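
A brute-force version of the k-NN rule fits in a few lines. This sketch assumes Euclidean distance and a small hypothetical two-class dataset; ties in the vote are broken by whichever label `Counter` encounters first.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Euclidean distance is an illustrative choice of metric.
    neighbors = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical training set with two well-separated clusters.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["a", "a", "a", "b", "b", "b"]
```

Every call computes the distance to each stored point, which is the per-query cost over the training set that approximate indexes are designed to avoid.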

### Naive Bayes

[Naive Bayes](/wiki/naive_bayes) classifiers apply Bayes' rule under the assumption that features are conditionally independent given the class. The class-conditional likelihoods are typically Gaussian (for continuous features), multinomial (for word counts), or Bernoulli (for binary features). Despite the independence assumption rarely holding in practice, naive Bayes remains effective for text classification because the high dimensionality of word features tends to dampen the impact of correlated features, and the model trains in a single pass over the data.[^1]
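
The multinomial case for word counts can be sketched as follows. The add-one (Laplace) smoothing controlled by `alpha`, the toy documents, and the choice to ignore unseen words at prediction time are assumptions of this example rather than part of the definition above.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class word log-likelihoods."""
    vocab = {word for doc in docs for word in doc}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def predict_nb(doc, log_prior, log_likelihood):
    """Return the class with the highest posterior log-score."""
    scores = {}
    for c in log_prior:
        # Independence assumption: per-word log-likelihoods simply add up.
        # Words never seen in training are ignored.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Hypothetical tokenized documents.
docs = [["win", "cash", "now"], ["cash", "prize"],
        ["meeting", "at", "noon"], ["lunch", "meeting"]]
labels = ["spam", "spam", "ham", "ham"]
log_prior, log_likelihood = train_multinomial_nb(docs, labels)
```

Training only counts words per class, which is why the model can be fit in a single pass over the data.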

### Decision trees and ensembles

A [decision tree](/wiki/decision_tree) recursively partitions the feature space along axis-aligned splits chosen to maximize information gain (or to minimize Gini impurity) on the training set. Trees handle mixed numeric and categorical features without scaling and produce human-readable rules, but they are prone to [overfitting](/wiki/overfitting) when grown deep. Two ensemble families dominate modern tabular classification:

- **[Random forest](/wiki/random_forest)**, introduced by Leo Breiman in 2001, trains many decision trees on bootstrap resamples of the data, restricting each split to a random subset of features, and averages (or majority-votes) their predictions. The randomization decorrelates the trees and substantially reduces variance compared to a single deep tree.[^9]
- **[Gradient boosting](/wiki/gradient_boosting)** builds trees sequentially, where each tree is trained to predict the negative gradient of the loss for the current ensemble. Modern implementations including [XGBoost](/wiki/xgboost) (Chen and Guestrin, 2016),[^10] [LightGBM](/wiki/lightgbm), and [CatBoost](/wiki/catboost) add second-order gradient information, histogram-based split finding, shrinkage, column subsampling, and efficient handling of categorical features. Boosted trees remain a standard winning approach in tabular Kaggle competitions and many industrial classification pipelines.[^10] In the XGBoost paper, Chen and Guestrin report that of 29 challenge-winning solutions published on Kaggle's blog during 2015, 17 used XGBoost, and that the system "scales beyond billions of examples using far fewer resources than existing systems".[^10]
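
The split criterion that each tree in these ensembles relies on can be sketched for a single numeric feature. This is a simplified illustration using Gini impurity and an exhaustive search over observed values; the function names and toy data are hypothetical, and real implementations search all features and use faster algorithms.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Find the threshold on one numeric feature minimizing weighted Gini."""
    best_threshold, best_impurity = None, float("inf")
    # The largest value is skipped because it would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity

# Toy feature whose low values are class 0 and high values are class 1.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = [0, 0, 0, 1, 1, 1]
```

A full tree applies this search recursively to the left and right partitions until a stopping rule, such as a maximum depth, is reached.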

### Linear discriminant analysis

[Linear discriminant analysis](/wiki/linear_discriminant_analysis) (LDA), in its modern statistical form, models each class as a multivariate Gaussian with a shared covariance matrix and uses Bayes' rule to derive a linear decision boundary. Quadratic discriminant analysis relaxes the shared-covariance assumption. LDA is closely related to logistic regression but is parameterized generatively rather than discriminatively, which gives it an edge when the Gaussian assumption holds and the training set is small.[^7]

## Generative vs. discriminative classifiers

A long-running distinction in classification contrasts [generative model](/wiki/generative_model)s, which model the joint distribution p(x, y) and derive p(y | x) by Bayes' rule, with [discriminative model](/wiki/discriminative_model)s, which directly model the conditional p(y | x).[^11] Naive Bayes and LDA are canonical generative classifiers; logistic regression, SVMs, conditional random fields, and most neural network classifiers are discriminative.

Andrew Ng and Michael Jordan analyzed the trade-off in a 2001 paper that compared the naive Bayes and logistic regression classifier pair (one generative, one discriminative, with the same parametric family for p(y | x)). They concluded that "while discriminative learning has lower asymptotic error, a generative classifier may also approach its (higher) asymptotic error much faster", possibly needing a number of training examples that grows only logarithmically, rather than linearly, in the number of parameters.[^11] In practice this means that on very small datasets, naive Bayes can outperform logistic regression even though logistic regression is the better choice when data is abundant.[^11] Generative models also offer additional capabilities, including imputation of missing features, generation of synthetic samples, and natural handling of unlabeled data through expectation-maximization.

## Deep learning classifiers

From the early 2010s onward, deep neural networks displaced hand-engineered features for classification on images, audio, video, and text. A neural network classifier ends in a softmax (for multiclass) or sigmoid (for binary or multilabel) output layer and is trained end-to-end by stochastic gradient descent on a cross-entropy objective. The breakthrough was less the loss function (already standard) than the combination of large labeled datasets, GPU compute, and architectural advances.

### Convolutional networks for image classification

Yann LeCun and collaborators applied [convolutional neural network](/wiki/convolutional_neural_network)s (CNNs) to handwritten digit recognition in the LeNet system at AT&T Bell Labs in the 1990s, but CNNs reached broad attention only after Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton's [AlexNet](/wiki/alexnet), which won the [ImageNet](/wiki/imagenet) Large Scale Visual Recognition Challenge in 2012 with a top-5 error of 15.3 percent, against 26.2 percent for the runner-up, a margin of more than ten percentage points.[^12][^19] AlexNet had 60 million parameters and 650,000 neurons in eight learned layers, used ReLU activations and dropout, was trained on two GPUs, and is widely cited as the moment deep learning displaced hand-crafted feature pipelines in computer vision.[^12]

The [ImageNet](/wiki/imagenet) database, organized by the WordNet noun hierarchy and first presented by Jia Deng, Fei-Fei Li, and colleagues in 2009, contains more than 14 million hand-annotated images across over 21,000 categories.[^20] The associated benchmark, formalized by Olga Russakovsky and colleagues, scored object-classification systems on a subset of roughly 1.2 million training images across 1,000 categories, with held-out validation and test sets.[^13] Annual improvements followed in rapid succession: VGG and GoogLeNet in 2014, [ResNet](/wiki/resnet) in 2015, and a long tail of architectural variants. ResNet, introduced by Kaiming He and colleagues, added residual (skip) connections that let the network learn residual functions relative to the identity. The authors report evaluating "residual nets with a depth of up to 152 layers, 8x deeper than VGG nets but still having lower complexity", and that "an ensemble of these residual nets achieves 3.57% error on the ImageNet test set", which "won the 1st place on the ILSVRC 2015 classification task".[^14] Residual connections have since been adopted by most subsequent deep vision and language architectures, including transformers.

### Vision transformers

[Vision transformer](/wiki/vision_transformer) (ViT), introduced by Alexey Dosovitskiy and collaborators at Google Research in 2020, applied the transformer encoder architecture directly to images by splitting each image into a sequence of fixed-size patches (commonly 16x16 pixels), embedding each patch linearly, and feeding the resulting tokens through standard transformer blocks with positional encoding.[^15] When pre-trained on large datasets such as JFT-300M, ViT matched or exceeded the accuracy of comparable CNNs on ImageNet while requiring substantially less training compute, demonstrating that the inductive biases of convolutions are not strictly necessary at scale.[^15] Variants including DeiT, Swin Transformer, and BEiT have explored data-efficient training, hierarchical attention, and self-supervised pre-training.
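
The patch-splitting step can be sketched on a toy single-channel image. The helper name `patchify` and the 224x224 input size in the final line are assumptions for illustration; the linear embedding and positional encoding that follow in ViT are omitted.

```python
def patchify(image, patch_size):
    """Split an H x W image (list of rows) into flattened square patches.

    Patches are returned in row-major order, one flat list per patch.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches

# Toy 4x4 single-channel image with pixel values 0..15.
image = [[4 * r + c for c in range(4)] for r in range(4)]
patches = patchify(image, patch_size=2)

# An assumed 224x224 input with 16x16 patches yields (224 // 16) ** 2 tokens.
num_tokens_224 = (224 // 16) ** 2
```

Each flattened patch plays the role that a word token plays in a text transformer, so the sequence length grows with the square of the image side divided by the patch side.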

### Sequence models and transformers for text classification

For text classification, the dominant architecture evolved from bag-of-words logistic regression and SVMs in the 1990s and 2000s, to [RNN](/wiki/rnn) and [LSTM](/wiki/lstm) models in the mid-2010s, to [transformer](/wiki/transformer) encoders from 2018 onward. [BERT](/wiki/bert), introduced by [Jacob Devlin](/wiki/jacob_devlin) and collaborators in 2018, is a bidirectional transformer pre-trained with masked language modeling on the BooksCorpus and English Wikipedia and fine-tuned on labeled downstream tasks including sentiment classification, natural language inference, and question answering. The paper reports that BERT "obtains new state-of-the-art results on eleven natural language processing tasks", pushing the GLUE score to 80.5 (a 7.7-point absolute improvement), MultiNLI accuracy to 86.7 percent, and SQuAD v1.1 test F1 to 93.2 at release.[^3] Variants such as RoBERTa, ALBERT, DeBERTa, and DistilBERT followed, refining training recipes, parameter sharing, attention patterns, and distillation. As of 2026, encoder-only transformers remain the standard tool for high-accuracy text classification when labeled data is available.

## Loss functions

Almost all probabilistic classifiers are trained by minimizing a [loss function](/wiki/loss_function) derived from the negative log-likelihood of the training labels under the model.

- **[Cross-entropy](/wiki/cross-entropy) loss**, also called [log loss](/wiki/log_loss) in the binary case, is the dominant loss for both shallow and deep classifiers. For K-class problems it equals the negative log-probability the model assigns to the true class, summed (or averaged) over the dataset. Minimizing cross-entropy is equivalent to maximum likelihood estimation under the assumed softmax or sigmoid output distribution.[^1]
- **[Hinge loss](/wiki/hinge_loss)** is the canonical loss for support vector machines. For a binary problem with labels in {-1, +1} and margin score f(x), the hinge loss is max(0, 1 - y f(x)), which is zero when the example is correctly classified with margin at least one and grows linearly otherwise.[^8] Squared hinge and L1 variants exist.
- **Logistic loss** is the binary special case of cross-entropy and the loss minimized by logistic regression. The exponential loss minimized by AdaBoost belongs to the same broad family.[^1]
- **[Focal loss](/wiki/focal_loss)**, introduced by Tsung-Yi Lin and colleagues at Facebook AI Research in 2017, "down-weights the loss assigned to well-classified examples" via a (1 - p_t)^gamma factor so that training "focuses on a sparse set of hard examples".[^16] It was originally designed to address extreme foreground-background imbalance in dense object detection, where the RetinaNet detector it trained reached 39.1 AP on COCO, and has since been adopted in any classification setting with severe class imbalance.[^16][^21]

Auxiliary losses (label smoothing, supervised contrastive loss, knowledge-distillation losses) are often added to the primary cross-entropy term to improve generalization or calibration.
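
The per-example forms of the cross-entropy, hinge, and focal losses listed above can be written directly from their definitions. This is a minimal sketch: `gamma=2.0` is an illustrative default, and the focal loss is shown without the optional class-balancing weight.

```python
import math

def cross_entropy(probs, true_index):
    """Negative log-probability the model assigns to the true class."""
    return -math.log(probs[true_index])

def hinge_loss(y, score):
    """Hinge loss for a label y in {-1, +1} and a real-valued margin score."""
    return max(0.0, 1.0 - y * score)

def focal_loss(p_t, gamma=2.0):
    """Focal loss for p_t, the probability assigned to the true class."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

For a confidently correct prediction such as p_t = 0.9, the focal factor (1 - 0.9)^2 = 0.01 shrinks the loss to one hundredth of the cross-entropy value, while setting gamma to 0 recovers cross-entropy exactly.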

## How is classification accuracy measured?

A single number rarely captures a classifier's quality, so practitioners report several complementary metrics. Every metric below ultimately derives from the [confusion matrix](/wiki/confusion_matrix), which tabulates the joint distribution of true and predicted labels.

For a binary problem with classes labeled positive (P) and negative (N), the confusion matrix has four cells: true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). From these:

| Metric | Definition | When useful |
| --- | --- | --- |
| [Accuracy](/wiki/accuracy) | (TP + TN) / (TP + TN + FP + FN) | Roughly balanced classes, symmetric error costs |
| [Precision](/wiki/precision) | TP / (TP + FP) | False positives are expensive (e.g., automated charges) |
| [Recall](/wiki/recall) (sensitivity) | TP / (TP + FN) | False negatives are expensive (e.g., cancer screening) |
| Specificity | TN / (TN + FP) | Both classes matter, complements recall |
| [F1 score](/wiki/f1_score) | 2 (precision x recall) / (precision + recall) | Single number combining precision and recall |
| ROC AUC | Area under the ROC curve | Threshold-free ranking quality |
| PR AUC | Area under the precision-recall curve | Heavily imbalanced positives |
| Log loss | Negative average log-likelihood | Penalizes overconfident wrong predictions |
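
The count-based rows of the table translate directly into code. The sketch below assumes all denominators are non-zero, and the confusion-matrix counts are hypothetical numbers chosen to show how accuracy can look strong while precision and recall do not.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute the table's count-based metrics from confusion-matrix cells."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical screening result: 1,000 cases, 10 of them truly positive.
metrics = binary_metrics(tp=5, fp=5, fn=5, tn=985)
```

Here accuracy is 0.99 even though only half of the true positives are found and only half of the positive predictions are correct.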

The receiver operating characteristic (ROC) curve plots the true positive rate against the false positive rate as the classification threshold sweeps from 0 to 1. The [area under the ROC curve](/wiki/auc_area_under_the_roc_curve) (ROC AUC) summarizes this curve as a single scalar in [0, 1] and equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. An AUC of 0.5 corresponds to random guessing; an AUC of 1.0 indicates perfect ranking.[^1] On heavily imbalanced datasets, ROC AUC can be misleadingly optimistic, and the precision-recall curve (and its AUC) is preferred because it focuses on positive-class performance.
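
The probabilistic interpretation of ROC AUC suggests a direct, if inefficient, way to compute it by comparing every positive-negative pair. This sketch counts ties as half a win, a common convention that is assumed here; the scores and labels are hypothetical.

```python
def roc_auc(scores, labels):
    """ROC AUC as the fraction of positive-negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical scores: one positive (0.35) is ranked below one negative (0.6).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
```

Of the nine positive-negative pairs in this example, eight are ordered correctly, giving an AUC of 8/9. The quadratic pair loop is for clarity only; practical implementations sort the scores once.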

For multiclass problems, the confusion matrix is KxK. Per-class precision, recall, and F1 are typically aggregated via macro-averaging (unweighted mean across classes), micro-averaging (compute global TP/FP/FN first, then derive the metric), or weighted averaging (mean weighted by class support). Multilabel metrics include subset accuracy, Hamming loss, and macro-averaged AUC.[^1]
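
The gap between macro- and micro-averaging is easiest to see with one frequent and one rare class. The sketch below uses precision and hypothetical per-class (TP, FP) counts; the same pattern applies to recall and F1.

```python
def precision_averages(per_class_counts):
    """Macro- and micro-averaged precision from per-class (TP, FP) counts."""
    per_class = [tp / (tp + fp) for tp, fp in per_class_counts]
    macro = sum(per_class) / len(per_class)
    total_tp = sum(tp for tp, _ in per_class_counts)
    total_fp = sum(fp for _, fp in per_class_counts)
    micro = total_tp / (total_tp + total_fp)
    return macro, micro

# Hypothetical counts: a frequent class (90 TP, 10 FP) and a rare one (1 TP, 9 FP).
macro, micro = precision_averages([(90, 10), (1, 9)])
```

Macro-averaging gives 0.5 because both classes count equally, while micro-averaging gives roughly 0.83 because the frequent class dominates the pooled counts.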

## Class imbalance

Many real-world classification problems have a strong skew in label frequencies: fraudulent transactions may be 0.1 percent of all transactions, rare diseases may appear in fewer than 1 percent of patients, and clickthroughs are often a small fraction of impressions. Naively minimizing average loss on such data tends to produce models that predict the majority class for almost every input and still score high on accuracy.[^17] See [class-imbalanced dataset](/wiki/class-imbalanced_dataset).

Standard mitigations operate at the data level, the algorithm level, or both.

- **Resampling**. Random oversampling duplicates minority examples; random undersampling discards majority examples. Both are simple but have drawbacks (overfitting and information loss respectively). Synthetic Minority Over-sampling Technique ([SMOTE](/wiki/smote)), introduced by Nitesh Chawla and colleagues in 2002, generates synthetic minority examples by interpolating between a minority point and one of its k nearest minority neighbors, increasing diversity compared to plain duplication.[^17] Variants (Borderline-SMOTE, ADASYN) target the decision boundary.
- **Class weights and cost-sensitive learning**. Many libraries support a `class_weight` parameter that reweights the loss so that minority examples contribute more per sample. This amounts to a simple form of cost-sensitive learning in which each class carries its own misclassification cost.
- **[Focal loss](/wiki/focal_loss)** down-weights well-classified examples and was originally designed for the extreme imbalance of dense object detection.[^16]
- **Threshold tuning**. The default 0.5 decision threshold is rarely optimal under imbalance; instead, the threshold is selected to maximize F1, expected utility, or another metric on a validation set.
- **Balanced ensembles**. Methods such as BalancedRandomForestClassifier and EasyEnsemble combine undersampling with bagging or boosting.
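
The interpolation step at the core of SMOTE can be sketched as follows. This is a simplified illustration that generates one synthetic point per call using Euclidean distance; the function name, the value of `k`, and the toy minority points are assumptions, and library implementations handle sampling ratios and neighbor search more carefully.

```python
import math
import random

def smote_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority point and a near neighbor."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # k nearest minority neighbors of `base`, excluding `base` itself.
    others = [p for p in minority if p is not base]
    neighbors = sorted(others, key=lambda p: math.dist(p, base))[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # interpolation weight in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbor))

# Hypothetical minority-class points in two dimensions.
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 1.0), (1.2, 1.8)]
synthetic = smote_sample(minority, k=2, rng=random.Random(42))
```

Because the new point lies on the segment between two existing minority points, it adds variation without leaving the region the minority class already occupies.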

Resampling should be applied only to the training data; applying it to the test or validation set inflates measured performance and produces an unreliable estimate of deployment behavior.[^17]
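
Threshold tuning, one of the mitigations listed above, can be sketched as a search over candidate thresholds on validation data. The example maximizes F1 and restricts candidates to the observed scores; both choices, along with the validation scores themselves, are assumptions for illustration.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 score when predicting positive for scores >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def tune_threshold(scores, labels):
    """Pick the candidate threshold with the best F1 on validation data."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))

# Hypothetical validation scores from a model that under-scores positives.
val_scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.60]
val_labels = [0, 0, 0, 0, 1, 0, 1, 1]
best_threshold = tune_threshold(val_scores, val_labels)
```

On this data the default 0.5 threshold recovers only one of three positives, while the tuned threshold of 0.30 recovers all three at the cost of one false positive.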

## Calibration

A classifier is **calibrated** when its predicted probabilities match empirical frequencies: among all examples for which the model predicts probability 0.8, roughly 80 percent should belong to the predicted class. Calibration matters whenever downstream systems use the probabilities for thresholding, expected-utility decisions, or risk estimation. Logistic regression tends to be well calibrated out of the box; modern deep networks are usually overconfident, a phenomenon Chuan Guo and colleagues documented systematically in 2017 in "On Calibration of Modern Neural Networks". They observe that "modern neural networks, unlike those from a decade ago, are poorly calibrated", and that depth, width, weight decay, and batch normalization all contribute to the problem.[^4] See [calibration](/wiki/calibration).

Three calibration methods are standard:

- **Platt scaling**, introduced by John Platt in 1999 for support vector machines, fits a logistic regression (a sigmoid with two parameters) to the classifier's raw scores on a held-out set, mapping scores into calibrated probabilities.[^18]
- **Isotonic regression** fits a piecewise-constant, monotonically non-decreasing function from scores to probabilities. It is more flexible than Platt scaling but requires more data to avoid overfitting.[^6]
- **Temperature scaling**, formalized by Guo and colleagues, is a single-parameter variant of Platt scaling for multiclass softmax outputs: it divides the logits by a learned scalar T before applying softmax, choosing T to minimize negative log-likelihood on a validation set. Because dividing by a positive scalar preserves the ordering of the logits, the predicted class and therefore accuracy are unchanged, while calibration on deep networks improves substantially.[^4]
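
Temperature scaling can be sketched with a simple grid search over T. The grid is a simplification assumed here in place of a gradient-based optimizer, and the validation logits are hypothetical: the model is highly confident on all four examples but wrong on one.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels at a temperature."""
    return sum(
        -math.log(softmax(row, temperature)[y])
        for row, y in zip(logit_rows, labels)
    ) / len(labels)

def fit_temperature(logit_rows, labels, candidates):
    """Grid search for the T that minimizes validation NLL."""
    return min(candidates, key=lambda t: mean_nll(logit_rows, labels, t))

# Hypothetical overconfident validation logits: one of four predictions is wrong.
val_logits = [[8.0, 0.0], [0.0, 8.0], [8.0, 0.0], [8.0, 0.0]]
val_labels = [0, 1, 0, 1]
grid = [0.5 + 0.25 * i for i in range(39)]  # 0.5, 0.75, ..., 10.0
T = fit_temperature(val_logits, val_labels, grid)
```

The fitted temperature is well above 1, which softens the predicted probabilities toward the 75 percent accuracy actually observed, while the argmax of every row stays the same.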

## What is zero-shot classification?

By the mid-2020s, large pre-trained [foundation model](/wiki/foundation_model)s reshaped classification practice. Rather than collect labeled examples for every new task, practitioners increasingly rely on models pre-trained on web-scale data and prompt or condition them to classify in new label spaces with little or no task-specific training data. Zero-shot classification refers to assigning inputs to categories the model was never explicitly trained to predict, by describing those categories in natural language at inference time.

[CLIP](/wiki/clip) (Contrastive Language-Image Pre-training), introduced by Alec Radford and colleagues at OpenAI in 2021, trains an image encoder and a text encoder jointly on 400 million image-caption pairs scraped from the web using a contrastive objective that aligns matched pairs and separates mismatched ones.[^5] The authors describe "predicting which caption goes with which image" as "an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet".[^5] After pre-training, CLIP can classify an image into an arbitrary set of categories described in natural language by encoding each candidate caption ("a photo of a cat", "a photo of a dog", etc.) and assigning the image to the caption whose embedding has the highest cosine similarity with the image embedding. CLIP reached 76.2 percent zero-shot top-1 accuracy on ImageNet, matching "the accuracy of the original ResNet-50 on ImageNet zero-shot without the need for any dataset specific training", and its embedding space underpins many downstream open-vocabulary classification and retrieval systems.[^5][^22]
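
The zero-shot decision rule reduces to a nearest-caption search by cosine similarity. In this sketch the three-dimensional vectors are hypothetical stand-ins for the outputs of CLIP's image and text encoders, which are not reproduced here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_embedding, caption_embeddings):
    """Return the caption whose embedding is most similar to the image's."""
    return max(
        caption_embeddings,
        key=lambda caption: cosine_similarity(
            image_embedding, caption_embeddings[caption]
        ),
    )

# Hypothetical 3-dimensional embeddings standing in for encoder outputs.
caption_embeddings = {
    "a photo of a cat": [0.9, 0.1, 0.2],
    "a photo of a dog": [0.1, 0.9, 0.3],
}
image_embedding = [0.8, 0.2, 0.1]
prediction = zero_shot_classify(image_embedding, caption_embeddings)
```

Adding a new category only requires encoding one more caption, which is what makes the label set open-ended at inference time.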

Beyond CLIP, several modern approaches use pre-trained representations for classification:

- **Embedding-based classification** uses [embeddings](/wiki/embeddings) from a pre-trained vision or language model as fixed features, training only a small linear or k-NN classifier on top. This dramatically reduces the labeled data requirements compared to training a model from scratch.
- **Encoder fine-tuning** continues to be the highest-accuracy approach when sufficient labels are available, particularly with [BERT](/wiki/bert)-family encoders for text or ViT-family backbones for images.
- **Prompted LLM classification** treats a generative large language model as a classifier by providing the candidate labels in the prompt and asking the model to output the chosen label. Performance depends heavily on prompt design, label naming, and ordering. [In-context learning](/wiki/in_context_learning) generalizes this further by including a handful of labeled examples in the prompt itself, allowing the model to adapt to a new task without parameter updates.
- **[Few-shot learning](/wiki/few-shot_learning)** methods, including matching networks and prototypical networks, learn classifiers explicitly for the regime of one to a few labeled examples per class.

These methods do not replace classical classifiers wholesale. For tabular data, [gradient boosting](/wiki/gradient_boosting) still typically outperforms deep approaches; for high-volume, high-stakes deployments with abundant labels, a fine-tuned task-specific model usually beats a zero-shot prompt. But the practical default for "I need a classifier and have very little labeled data" has shifted from logistic regression with hand-engineered features to a foundation-model embedding plus a small head.

## Which classification algorithm should I use?

There is no single best classifier; the appropriate choice depends on the data type, the number of labeled examples, and the deployment constraints. As of 2026, common practitioner defaults are:

| Situation | Typical first choice |
| --- | --- |
| Tabular data with hundreds to millions of rows | [Gradient boosting](/wiki/gradient_boosting) ([XGBoost](/wiki/xgboost), [LightGBM](/wiki/lightgbm), [CatBoost](/wiki/catboost)) |
| Small tabular dataset, need interpretability | [Logistic regression](/wiki/logistic_regression) or a single [decision tree](/wiki/decision_tree) |
| High-dimensional sparse text, classic pipeline | [Naive Bayes](/wiki/naive_bayes) or linear [support vector machine](/wiki/support_vector_machine_svm) |
| Images with abundant labels | Fine-tuned CNN ([ResNet](/wiki/resnet)) or [vision transformer](/wiki/vision_transformer) |
| Text with abundant labels | Fine-tuned [BERT](/wiki/bert)-family encoder |
| Very few labeled examples | Foundation-model [embeddings](/wiki/embeddings) plus a small head, or zero-shot [CLIP](/wiki/clip) / prompted LLM |

The "no free lunch" theorem implies no algorithm dominates across all problems, so empirical comparison on a held-out set remains the reliable selection method.[^1]

## Applications

Classification is used across nearly every industry that has structured outcomes:

| Domain | Task | Typical approach as of 2026 |
| --- | --- | --- |
| Email and messaging | Spam, phishing, and abuse detection | Gradient boosting and fine-tuned BERT-family encoders |
| Finance | Credit scoring, fraud detection, transaction risk | [XGBoost](/wiki/xgboost), [LightGBM](/wiki/lightgbm), logistic regression, deep tabular models |
| Healthcare | Disease screening, pathology slide triage, ECG arrhythmia detection | CNNs and vision transformers for imaging, gradient boosting for EHR features |
| [Computer vision](/wiki/computer_vision) | Object recognition, [object detection](/wiki/object_detection), scene classification | CNNs and ViT-family backbones, often pre-trained |
| [Natural language processing](/wiki/natural_language_processing) | [Sentiment analysis](/wiki/sentiment_analysis), topic classification, intent detection, content moderation | BERT-family fine-tuning, prompted LLMs for low-resource tasks |
| Cybersecurity | Malware family classification, intrusion detection | Random forests, gradient boosting, neural network classifiers |
| Industrial | Defect detection in manufacturing, sensor anomaly classification | CNNs for visual inspection, gradient boosting for sensor data |
| Marketing | Churn prediction, lead scoring, response models | Gradient boosting, logistic regression, deep tabular models |
| Autonomous systems | Traffic sign recognition, pedestrian classification | CNNs and ViTs as components of a larger perception stack |

In each case, the practical pipeline includes [feature engineering](/wiki/feature_engineering) (for tabular data) or pre-processing (for images and text), model selection and hyperparameter tuning, calibration, threshold selection, fairness and bias audits, and monitoring of distribution shift in production.

## See also

- [supervised learning](/wiki/supervised_learning)
- [regression](/wiki/regression)
- [binary classification](/wiki/binary_classification)
- [multi-class classification](/wiki/multi-class_classification)
- [confusion matrix](/wiki/confusion_matrix)
- [accuracy](/wiki/accuracy)
- [precision](/wiki/precision)
- [recall](/wiki/recall)
- [f1 score](/wiki/f1_score)
- [auc area under the roc curve](/wiki/auc_area_under_the_roc_curve)
- [loss function](/wiki/loss_function)
- [cross-entropy](/wiki/cross-entropy)
- [hinge loss](/wiki/hinge_loss)
- [focal loss](/wiki/focal_loss)
- [calibration](/wiki/calibration)
- [class-imbalanced dataset](/wiki/class-imbalanced_dataset)
- [smote](/wiki/smote)
- [logistic regression](/wiki/logistic_regression)
- [support vector machine (SVM)](/wiki/support_vector_machine_svm)
- [k nearest neighbors](/wiki/k_nearest_neighbors)
- [naive bayes](/wiki/naive_bayes)
- [decision tree](/wiki/decision_tree)
- [random forest](/wiki/random_forest)
- [gradient boosting](/wiki/gradient_boosting)
- [xgboost](/wiki/xgboost)
- [lightgbm](/wiki/lightgbm)
- [catboost](/wiki/catboost)
- [linear discriminant analysis](/wiki/linear_discriminant_analysis)
- [neural network](/wiki/neural_network)
- [convolutional neural network](/wiki/convolutional_neural_network)
- [alexnet](/wiki/alexnet)
- [resnet](/wiki/resnet)
- [vision transformer](/wiki/vision_transformer)
- [bert](/wiki/bert)
- [transformer](/wiki/transformer)
- [clip](/wiki/clip)
- [foundation model](/wiki/foundation_model)
- [in context learning](/wiki/in_context_learning)
- [few-shot learning](/wiki/few-shot_learning)
- [discriminative model](/wiki/discriminative_model)
- [generative model](/wiki/generative_model)
- [regression model](/wiki/regression_model)

## References

[^1]: Hastie, T., Tibshirani, R., and Friedman, J. "The Elements of Statistical Learning: Data Mining, Inference, and Prediction (second edition)", Springer, 2009. https://hastie.su.domains/ElemStatLearn/. Accessed 2026-05-26.

[^2]: Bishop, C. M. "Pattern Recognition and Machine Learning", Springer, 2006. https://www.microsoft.com/en-us/research/people/cmbishop/prml-book/. Accessed 2026-05-26.

[^3]: Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv, 2018-10-11. https://arxiv.org/abs/1810.04805. Accessed 2026-05-26.

[^4]: Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. "On Calibration of Modern Neural Networks", arXiv (ICML 2017), 2017-06-14. https://arxiv.org/abs/1706.04599. Accessed 2026-05-26.

[^5]: Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. "Learning Transferable Visual Models From Natural Language Supervision", arXiv, 2021-02-26. https://arxiv.org/abs/2103.00020. Accessed 2026-05-26.

[^6]: Niculescu-Mizil, A., and Caruana, R. "Predicting Good Probabilities With Supervised Learning", Proceedings of the 22nd International Conference on Machine Learning (ICML), 2005. https://www.cs.cornell.edu/~alexn/papers/calibration.icml05.crc.rev3.pdf. Accessed 2026-05-26.

[^7]: Fisher, R. A. "The Use of Multiple Measurements in Taxonomic Problems", Annals of Eugenics, vol. 7, pp. 179-188, 1936. https://onlinelibrary.wiley.com/doi/10.1111/j.1469-1809.1936.tb02137.x. Accessed 2026-05-26.

[^8]: Cortes, C., and Vapnik, V. "Support-Vector Networks", Machine Learning, vol. 20, pp. 273-297, 1995. https://link.springer.com/article/10.1007/BF00994018. Accessed 2026-05-26.

[^9]: Breiman, L. "Random Forests", Machine Learning, vol. 45, no. 1, pp. 5-32, 2001. https://link.springer.com/article/10.1023/A:1010933404324. Accessed 2026-05-26.

[^10]: Chen, T., and Guestrin, C. "XGBoost: A Scalable Tree Boosting System", arXiv (KDD 2016), 2016-03-09. https://arxiv.org/abs/1603.02754. Accessed 2026-06-20.

[^11]: Ng, A. Y., and Jordan, M. I. "On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes", Advances in Neural Information Processing Systems 14 (NIPS 2001), 2001. https://proceedings.neurips.cc/paper/2001/hash/7b7a53e239400a13bd6be6c91c4f6c4e-Abstract.html. Accessed 2026-06-20.

[^12]: Krizhevsky, A., Sutskever, I., and Hinton, G. E. "ImageNet Classification with Deep Convolutional Neural Networks", Advances in Neural Information Processing Systems 25 (NIPS 2012), 2012. https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks. Accessed 2026-05-26.

[^13]: Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. "ImageNet Large Scale Visual Recognition Challenge", International Journal of Computer Vision, vol. 115, pp. 211-252, 2015. https://arxiv.org/abs/1409.0575. Accessed 2026-05-26.

[^14]: He, K., Zhang, X., Ren, S., and Sun, J. "Deep Residual Learning for Image Recognition", arXiv, 2015-12-10. https://arxiv.org/abs/1512.03385. Accessed 2026-06-20.

[^15]: Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", arXiv (ICLR 2021), 2020-10-22. https://arxiv.org/abs/2010.11929. Accessed 2026-05-26.

[^16]: Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. "Focal Loss for Dense Object Detection", arXiv (ICCV 2017), 2017-08-07. https://arxiv.org/abs/1708.02002. Accessed 2026-06-20.

[^17]: Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. "SMOTE: Synthetic Minority Over-sampling Technique", Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002. https://www.jair.org/index.php/jair/article/view/10302. Accessed 2026-05-26.

[^18]: Platt, J. "Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods", in Advances in Large Margin Classifiers, MIT Press, 1999. https://www.cs.cornell.edu/courses/cs678/2007sp/platt.pdf. Accessed 2026-05-26.

[^19]: Stanford Vision Lab. "ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) Results", 2012. https://www.image-net.org/challenges/LSVRC/2012/results.html. Accessed 2026-06-20.

[^20]: Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. "ImageNet: A Large-Scale Hierarchical Image Database", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009. https://www.image-net.org/static_files/papers/imagenet_cvpr09.pdf. Accessed 2026-06-20.

[^21]: Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. "Focal Loss for Dense Object Detection", Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2980-2988, 2017. https://openaccess.thecvf.com/content_ICCV_2017/papers/Lin_Focal_Loss_for_ICCV_2017_paper.pdf. Accessed 2026-06-20.

[^22]: Radford, A., et al. "Learning Transferable Visual Models From Natural Language Supervision", Proceedings of the 38th International Conference on Machine Learning (ICML), pp. 8748-8763, 2021. https://proceedings.mlr.press/v139/radford21a.html. Accessed 2026-06-20.

