# Binary Classification

> Source: https://aiwiki.ai/wiki/binary_classification
> Updated: 2026-07-13
> Categories: Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Classification model](/wiki/classification_model), [Multi-class classification](/wiki/multi-class_classification)*

## Introduction

Binary classification is a [supervised learning](/wiki/supervised_learning) task in which a model assigns each input to exactly one of two mutually exclusive classes, conventionally labeled the [positive class](/wiki/positive_class) (1) and the [negative class](/wiki/negative_class) (0). It is one of the most fundamental problems in [machine learning](/wiki/machine_learning) and underpins applications such as spam filtering, fraud detection, and medical diagnosis.[1][2] In formal terms, a binary classifier is a function that maps an input feature vector $$\mathbf{x}$$ to a single binary output $$y$$, where $$y$$ belongs to $$\{0, 1\}$$.

The statistical foundation for the most widely used binary classifier, [logistic regression](/wiki/logistic_regression), was laid by British statistician David Cox in his 1958 paper "The Regression Analysis of Binary Sequences," published in the Journal of the Royal Statistical Society.[11] The margin-maximizing [support vector machine](/wiki/support_vector_machine_svm) followed in 1995, introduced by Corinna Cortes and Vladimir Vapnik at AT&T Bell Labs.[12]

Unlike [multi-class classification](/wiki/multi-class_classification), which involves three or more possible output categories, binary classification restricts the prediction space to exactly two outcomes. This constraint makes the problem mathematically simpler in some respects, but the practical challenges of building accurate binary classifiers remain significant, particularly when dealing with noisy data, overlapping class distributions, or severe class imbalance.[7]

## What is Binary Classification?

In binary classification, a [machine learning](/wiki/machine_learning) algorithm learns from [labeled](/wiki/label) [training data](/wiki/training_data) to assign each input to one of two classes, such as spam or not spam, fraud or not fraud, and disease or not disease. The goal is to classify new, unseen inputs correctly based on the patterns learned during training.

Binary classification is a [supervised learning](/wiki/supervised_learning) task, meaning the algorithm is trained using labeled data where each data point has been assigned a label indicating its class membership.[2] During the training phase, the model learns a decision boundary that separates the two classes in the [feature space](/wiki/feature_space). Once trained, the model generalizes this boundary to unseen data during inference.[1]

### Positive and Negative Class Conventions

In binary classification, the two classes are referred to as the [positive class](/wiki/positive_class) and the [negative class](/wiki/negative_class). By convention, the positive class (labeled 1) typically represents the outcome of primary interest, while the negative class (labeled 0) represents the default or baseline outcome. For example, in fraud detection the positive class is "fraudulent" and the negative class is "legitimate." In medical diagnosis, the positive class is "disease present" and the negative class is "disease absent."

The choice of which class to designate as positive affects the interpretation of evaluation metrics. [Precision](/wiki/precision), [recall](/wiki/recall), and the [confusion matrix](/wiki/confusion_matrix) are all defined relative to the positive class.[9] Swapping the positive and negative labels does not change the model's underlying behavior, but it changes the numeric values of these metrics. For this reason, practitioners should clearly define the positive class before evaluating results, particularly in domains where the minority class carries higher importance (such as detecting rare diseases or fraudulent transactions).

### Example Use Cases

| Application | Input Features | Positive Class | Negative Class |
|---|---|---|---|
| Email spam filtering | Email content, headers, sender | Spam | Not spam |
| Credit risk assessment | Income, credit score, employment history | High-risk | Low-risk |
| Medical diagnosis | Age, blood pressure, lab results | Disease present | Disease absent |
| Fraud detection | Transaction amount, location, type | Fraudulent | Non-fraudulent |
| Sentiment analysis | Text of a review or post | Positive sentiment | Negative sentiment |
| Churn prediction | Usage patterns, account tenure | Will churn | Will stay |

## Algorithms for Binary Classification

Binary classification can be performed using many different [machine learning](/wiki/machine_learning) algorithms, including [logistic regression](/wiki/logistic_regression), [decision trees](/wiki/decision_tree), [random forests](/wiki/random_forest), [support vector machines](/wiki/support_vector_machine_svm), [gradient boosting](/wiki/gradient_boosting), and [neural networks](/wiki/neural_network).[2] The specific choice depends on the problem being solved and the characteristics of the data.

### Logistic Regression

[Logistic regression](/wiki/logistic_regression) is one of the most widely used algorithms for binary classification. Despite its name, it is a classification method, not a regression method. Logistic regression models the probability that a given input belongs to the [positive class](/wiki/positive_class) by applying the [sigmoid function](/wiki/sigmoid_function) to a linear combination of input features.[2] The model learns a weight vector $$w$$ and a scalar bias term $$b$$ such that the predicted probability is:

$$
P(y = 1 \mid x) = \sigma(w^\top x + b)
$$

The method traces to David Cox's 1958 treatment of binary regression, which studied situations where "a sequence of 0's and 1's is observed and the chance that a particular trial is a 1 depends on the value of one or more independent variables."[11] Logistic regression is valued for its simplicity, interpretability, and computational efficiency. The learned coefficients indicate the direction and magnitude of each feature's influence on the prediction. It works best when the relationship between the features and the log-odds of the outcome is approximately linear.
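As a minimal sketch of the formula above, the following Python computes $$P(y = 1 \mid x)$$ for a single input. The weights `w`, the bias `b`, and the function names are hypothetical values chosen for illustration, not parameters fitted to any dataset.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, split by sign so math.exp never overflows."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(w: list[float], b: float, x: list[float]) -> float:
    """Return P(y = 1 | x) = sigmoid(w . x + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)


# Hypothetical parameters for a two-feature model (illustration only).
w = [0.8, -0.4]
b = -0.1

p = predict_proba(w, b, [1.0, 2.0])  # logit z = 0.8 - 0.8 - 0.1 = -0.1
label = 1 if p > 0.5 else 0          # default 0.5 decision threshold

# exp(w_j) is the factor by which the odds of the positive class change
# when feature j increases by one unit and the others are held fixed.
odds_ratios = [math.exp(wj) for wj in w]
```

Reading the coefficients as odds ratios is one common way to interpret the "direction and magnitude" of each feature: a ratio above 1 raises the odds of the positive class, and a ratio below 1 lowers them.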

### Support Vector Machines

[Support vector machines](/wiki/support_vector_machine_svm) (SVMs) approach binary classification by finding the optimal [hyperplane](/wiki/hyperplane) that separates the two classes with the maximum margin. The margin is the distance between the hyperplane and the nearest data points from each class, known as support vectors. By maximizing this margin, SVMs produce classifiers that tend to generalize well to unseen data.[2] The support-vector network was introduced by Cortes and Vapnik in 1995 in the journal Machine Learning, where they described it as a learning machine "for two-group classification problems."[12]

For data that is not linearly separable, SVMs use the kernel trick to project the data into a higher-dimensional space where a linear separator can be found. Common kernel functions include the radial basis function (RBF), polynomial, and sigmoid kernels. SVMs are effective in high-dimensional spaces and perform well even when the number of features exceeds the number of training samples.

### Decision Trees

[Decision trees](/wiki/decision_tree) classify data by recursively partitioning the feature space based on feature values. At each internal node, the algorithm selects the feature and threshold that best separates the classes according to a splitting criterion such as Gini impurity or information gain.[2] The process continues until a stopping condition is met, such as reaching a maximum depth or a minimum number of samples per leaf.
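The split search for a single numeric feature can be sketched as follows. This is a simplified illustration that assumes binary labels and an exhaustive scan over midpoints between sorted feature values; the function names are hypothetical and real tree libraries use more efficient procedures.

```python
def gini(labels: list[int]) -> float:
    """Gini impurity of binary labels: 1 - p0^2 - p1^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def split_impurity(values, labels, threshold):
    """Size-weighted Gini impurity of the two children of a split."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values; keep the purest split."""
    distinct = sorted(set(values))
    midpoints = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return min(midpoints, key=lambda t: split_impurity(values, labels, t))


# Toy feature column and labels (made up for illustration).
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = [0, 0, 0, 1, 1, 1]
threshold = best_threshold(values, labels)
```

On this toy data the chosen threshold separates the two classes perfectly, so both children have zero impurity.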

Decision trees handle both numerical and categorical features naturally and produce models that are easy to visualize and interpret. However, individual decision trees are prone to [overfitting](/wiki/overfitting), especially when grown deep without pruning.

### Random Forest

[Random forest](/wiki/random_forest) is an [ensemble learning](/wiki/ensemble_learning) method that trains multiple decision trees on random subsets of the training data and features, then aggregates their predictions through majority voting.[2] This approach reduces the variance associated with individual trees and typically produces more robust classifiers.

Each tree in the forest is trained on a bootstrap sample (a random sample drawn with replacement) of the original data, and at each split, only a random subset of features is considered. The combination of bagging and feature randomization makes random forests resistant to overfitting and effective across many types of binary classification problems.

### Gradient Boosting

[Gradient boosting](/wiki/gradient_boosting) builds an ensemble of weak learners (usually shallow decision trees) sequentially, where each new tree corrects the errors of the previous ensemble. The method optimizes a [loss function](/wiki/loss_function) by adding trees that follow the negative gradient of the loss.[2] Popular implementations include [XGBoost](/wiki/xgboost), [LightGBM](/wiki/lightgbm), and [CatBoost](/wiki/catboost).

Gradient boosting methods frequently achieve state-of-the-art results on tabular data and are widely used in competitions and production systems. Key hyperparameters include the learning rate (which controls the contribution of each tree), the number of trees, and the maximum tree depth. [Regularization](/wiki/regularization) techniques such as L1/L2 penalties and subsampling help prevent overfitting.

### Neural Networks

[Neural networks](/wiki/neural_network) can be applied to binary classification by using a single output neuron with a [sigmoid activation function](/wiki/sigmoid_function). The network learns complex, nonlinear relationships between input features and the target class through multiple layers of interconnected neurons. [Deep learning](/wiki/deep_learning) models, such as [convolutional neural networks](/wiki/convolutional_neural_network) for image data and [recurrent neural networks](/wiki/recurrent_neural_network) for sequential data, have achieved strong performance on many binary classification benchmarks.[6]

For binary classification, the output layer consists of one neuron with sigmoid activation, producing a value between 0 and 1 that represents the predicted probability of the [positive class](/wiki/positive_class). The network is trained using [binary cross-entropy](/wiki/cross-entropy) as the loss function and optimized with algorithms such as [stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd) or [Adam](/wiki/adam_optimizer).[6]

### Naive Bayes

[Naive Bayes](/wiki/naive_bayes) classifiers apply Bayes' theorem with a strong independence assumption between features.[1] Despite this simplifying assumption, Naive Bayes performs surprisingly well on many binary classification tasks, particularly text classification problems like [spam detection](/wiki/spam_detection) and [sentiment analysis](/wiki/sentiment_analysis). The algorithm is computationally efficient, requires minimal training data, and scales well to high-dimensional feature spaces. Variants include Gaussian Naive Bayes (for continuous features), Multinomial Naive Bayes (for count data), and Bernoulli Naive Bayes (for binary features).

### Algorithm Comparison

| Algorithm | Interpretability | Handles Non-linearity | Training Speed | Prone to Overfitting | Best For |
|---|---|---|---|---|---|
| [Logistic Regression](/wiki/logistic_regression) | High | No (linear) | Fast | Low | Linearly separable data, baseline models |
| [SVM](/wiki/support_vector_machine_svm) | Low (with kernels) | Yes (with kernels) | Moderate | Moderate | High-dimensional data, small to medium datasets |
| [Decision Tree](/wiki/decision_tree) | High | Yes | Fast | High | Exploratory analysis, interpretable models |
| [Random Forest](/wiki/random_forest) | Moderate | Yes | Moderate | Low | General-purpose, tabular data |
| [Gradient Boosting](/wiki/gradient_boosting) | Low | Yes | Slow | Moderate (with tuning) | Tabular data, competitions, high accuracy needs |
| [Neural Network](/wiki/neural_network) | Low | Yes | Slow | High (without regularization) | Large datasets, unstructured data (images, text) |
| [Naive Bayes](/wiki/naive_bayes) | High | No (linear boundaries) | Very fast | Low | Text classification, small datasets, baselines |

## Loss Functions for Binary Classification

A [loss function](/wiki/loss_function) quantifies the difference between a model's predictions and the true labels. For binary classification, the two most commonly used loss functions are binary [cross-entropy](/wiki/cross-entropy) (log loss) and hinge loss.

### Binary Cross-Entropy (Log Loss)

The standard loss function for training binary classifiers is binary cross-entropy, also called log loss.[6] For a single training example with true label $$y$$ (0 or 1) and predicted probability $$p$$, the binary cross-entropy loss is defined as:

$$
L(y, p) = -\left[y \log(p) + (1 - y) \log(1 - p)\right]
$$

When $$y = 1$$ ([positive class](/wiki/positive_class)), the loss simplifies to $$-\log(p)$$, which penalizes the model heavily when it assigns a low probability to a true positive. When $$y = 0$$ ([negative class](/wiki/negative_class)), the loss becomes $$-\log(1 - p)$$, penalizing high predicted probabilities for actual negatives. The logarithmic nature of this function creates a large gradient for confident but incorrect predictions, which accelerates learning during [backpropagation](/wiki/backpropagation).

For a dataset of $$N$$ samples, the overall loss is the average across all samples:

$$
L = -\frac{1}{N} \sum_{i=1}^{N} \left[y_i \log(p_i) + (1 - y_i) \log(1 - p_i)\right]
$$

For [logistic regression](/wiki/logistic_regression), binary cross-entropy is convex in the model parameters, so any local minimum is also a global minimum.[2] For neural networks, the optimization landscape is non-convex, but binary cross-entropy still provides well-behaved gradients that facilitate training.[6]
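A minimal sketch of the averaged loss in plain Python is shown below. The clipping constant `eps` is an assumption added here to keep the logarithm finite when a predicted probability is exactly 0 or 1; it is not part of the definition.

```python
import math


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy over paired labels and probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# A confident correct prediction versus a confident wrong one (y = 1 in both).
loss_good = binary_cross_entropy([1], [0.99])
loss_bad = binary_cross_entropy([1], [0.01])
```

The second loss is several hundred times larger than the first, which illustrates how heavily the logarithm penalizes confident mistakes.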

### Hinge Loss

Hinge loss is the loss function used by [support vector machines](/wiki/support_vector_machine_svm). Unlike binary cross-entropy, hinge loss operates on raw scores (logits) rather than probabilities and does not produce probabilistic output. For a single training example with true label $$y$$ (encoded as $$-1$$ or $$+1$$) and predicted score $$f(x)$$, hinge loss is:

$$
L(y, f(x)) = \max(0, 1 - y \cdot f(x))
$$

Hinge loss equals zero when the prediction is correct and the margin (the product $$y \cdot f(x)$$) is at least 1. When the margin is less than 1, the loss grows linearly as the margin decreases. This means hinge loss penalizes not only incorrect predictions but also correct predictions that fall too close to the decision boundary. The focus on margin maximization is what gives SVMs their strong generalization properties.[2]

Because hinge loss is not differentiable at the point where $$y \cdot f(x) = 1$$, subgradient methods are used during optimization.
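The loss and one valid subgradient can be sketched as below. Choosing 0 as the subgradient at the kink is an assumption made here for simplicity; any value between $$-y$$ and 0 is a valid choice at that point.

```python
def hinge_loss(y: int, score: float) -> float:
    """Hinge loss for a label y in {-1, +1} and a raw score f(x)."""
    return max(0.0, 1.0 - y * score)


def hinge_subgradient(y: int, score: float) -> float:
    """A subgradient of the hinge loss with respect to the score."""
    return -float(y) if y * score < 1.0 else 0.0


examples = [
    (+1, 2.0),   # correct, outside the margin: no loss
    (+1, 0.5),   # correct, but inside the margin: small loss
    (-1, 0.5),   # wrong side of the boundary: larger loss
]
losses = [hinge_loss(y, s) for y, s in examples]
```

The middle example shows the margin effect described above: the prediction has the right sign but still incurs a loss because its margin is below 1.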

### Comparison of Loss Functions

| Property | Binary Cross-Entropy | Hinge Loss |
|---|---|---|
| Output type | Probability (0 to 1) | Raw score (logit) |
| Probabilistic interpretation | Yes | No |
| Primary algorithm | [Logistic regression](/wiki/logistic_regression), [neural networks](/wiki/neural_network) | [SVM](/wiki/support_vector_machine_svm) |
| Penalizes confident mistakes | Heavily (logarithmic) | Linearly |
| Margin-based | No | Yes |
| Smoothness | Smooth and differentiable | Not differentiable at hinge point |
| Best when | Calibrated probabilities are needed | Maximum-margin separation is desired |

## Output: Probability via Sigmoid

Most binary classifiers produce a raw score (logit) that must be transformed into a probability. The [sigmoid function](/wiki/sigmoid_function) (also called the logistic function) performs this transformation:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

The sigmoid function maps any real-valued number $$z$$ to the range $$(0, 1)$$. When $$z$$ is a large positive number, $$\sigma(z)$$ approaches 1; when $$z$$ is a large negative number, $$\sigma(z)$$ approaches 0; and when $$z = 0$$, $$\sigma(z) = 0.5$$. This S-shaped curve provides a smooth, differentiable mapping from logits to probabilities.

In [logistic regression](/wiki/logistic_regression), the logit $$z$$ is the linear combination of input features: $$z = w^\top x + b$$. In [neural networks](/wiki/neural_network), $$z$$ is the output of the final layer before the activation function. The resulting probability $$p = \sigma(z)$$ represents the model's confidence that the input belongs to the [positive class](/wiki/positive_class).[1]

The sigmoid function has a useful derivative property: $$\sigma'(z) = \sigma(z)(1 - \sigma(z))$$, which simplifies gradient computation during training.
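The derivative identity follows directly from the definition of the sigmoid, as this short derivation shows:

$$
\sigma'(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z)\,(1 - \sigma(z))
$$

Combined with the binary cross-entropy loss $$L(y, p) = -\left[y \log(p) + (1 - y) \log(1 - p)\right]$$ and $$p = \sigma(z)$$, the chain rule gives a particularly simple gradient with respect to the logit:

$$
\frac{\partial L}{\partial z} = \left(-\frac{y}{p} + \frac{1 - y}{1 - p}\right) p\,(1 - p) = -y\,(1 - p) + (1 - y)\,p = p - y
$$

The gradient is therefore just the difference between the predicted probability and the true label.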

## Probability Calibration

The raw probability output of a binary classifier does not always accurately reflect the true likelihood of class membership. For example, a model might predict a probability of 0.8 for a set of instances, but only 60% of those instances actually belong to the positive class. When predicted probabilities match observed frequencies, the model is said to be well-calibrated.[10] Calibration is important in applications where the predicted probability itself drives decisions, such as medical risk scoring or insurance pricing.

Some algorithms, like [logistic regression](/wiki/logistic_regression), tend to produce well-calibrated probabilities by default. Others, like [SVMs](/wiki/support_vector_machine_svm), [random forests](/wiki/random_forest), and [gradient boosting](/wiki/gradient_boosting) models, often require post-hoc calibration.[10]

### Platt Scaling

Platt scaling fits a logistic regression model to the classifier's output scores, learning parameters A and B such that the calibrated probability is $$p = \frac{1}{1 + \exp(A \cdot f(x) + B)}$$.[8] This method works well when the distortion in predicted probabilities follows a sigmoid shape, which is commonly the case for SVMs and neural networks.[8] Platt scaling is simple to implement and works well even with small calibration datasets.
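The fit can be sketched with plain gradient descent on the log loss, as below. This is a simplified illustration under stated assumptions: the scores and labels are made up, the learning rate and iteration count are arbitrary, and the regularized target values used in Platt's original procedure are omitted.

```python
import math


def platt_probability(score: float, A: float, B: float) -> float:
    """Calibrated probability p = 1 / (1 + exp(A * score + B))."""
    return 1.0 / (1.0 + math.exp(A * score + B))


def fit_platt(scores, labels, lr=0.1, n_iter=2000):
    """Fit A and B by gradient descent on the mean log loss."""
    A, B = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_A = grad_B = 0.0
        for f, y in zip(scores, labels):
            p = platt_probability(f, A, B)
            # Since p = sigmoid(-(A*f + B)): dL/dA = (y - p) * f, dL/dB = (y - p).
            grad_A += (y - p) * f
            grad_B += (y - p)
        A -= lr * grad_A / n
        B -= lr * grad_B / n
    return A, B


# Hypothetical held-out classifier scores and their true labels.
scores = [-2.0, -1.0, -0.5, -0.2, 0.3, 0.5, 1.0, 2.0]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
A, B = fit_platt(scores, labels)
calibrated = [platt_probability(f, A, B) for f in scores]
```

With this parameterization, `A` comes out negative whenever higher scores indicate the positive class, so the calibrated probability increases with the score.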

### Isotonic Regression

Isotonic regression is a non-parametric calibration method that learns a piecewise constant, monotonically increasing function mapping classifier scores to calibrated probabilities. It is more flexible than Platt scaling and can correct any monotonic distortion. However, isotonic regression requires more calibration data (typically 1,000+ samples) to avoid overfitting. When sufficient data is available, it generally outperforms Platt scaling.[10]

Calibration should always be performed on a held-out calibration set (separate from both the training and test sets) to avoid biasing the calibrated probabilities.
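One standard way to compute an isotonic fit is the pool-adjacent-violators algorithm, sketched below. The function names and the tiny calibration set are hypothetical, and the sketch returns fitted values only at the calibration points (a full calibrator would also interpolate for new scores).

```python
def pool_adjacent_violators(values):
    """Least-squares non-decreasing fit to a sequence of values."""
    blocks = []  # each block holds [sum of values, number of values]
    for v in values:
        blocks.append([float(v), 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted


def isotonic_calibrate(scores, labels):
    """Return (score, calibrated probability) pairs sorted by score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    fitted = pool_adjacent_violators([labels[i] for i in order])
    return [(scores[i], p) for i, p in zip(order, fitted)]


# Hypothetical held-out calibration set.
scores = [0.1, 0.4, 0.35, 0.8, 0.9]
labels = [0, 0, 1, 1, 1]
calibrated = isotonic_calibrate(scores, labels)
```

The scores 0.35 and 0.4 are out of order with respect to their labels, so the algorithm pools them into a single step with probability 0.5, producing the piecewise constant shape described above.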

## Threshold Selection

After a binary classifier produces a probability score, a decision threshold must be applied to convert this probability into a class label. The default threshold is 0.5: if the predicted probability exceeds 0.5, the input is classified as positive; otherwise, it is classified as negative. However, the optimal threshold is often not 0.5, especially when dealing with imbalanced classes or when the costs of different types of errors are unequal.

### Methods for Selecting the Optimal Threshold

Several approaches exist for determining the best threshold:

| Method | Strategy | Best When |
|---|---|---|
| ROC curve analysis | Maximize Youden index (sensitivity + specificity - 1) | Overall discrimination matters |
| Precision-recall tradeoff | Tune threshold to favor [precision](/wiki/precision) or [recall](/wiki/recall) | One type of error is more costly |
| Cost-sensitive optimization | Minimize total expected cost given asymmetric error costs | FP and FN have different financial costs |
| F1 score maximization | Choose threshold producing highest F1 on validation set | Balanced tradeoff between precision and recall |
| Business requirements | Domain experts set threshold based on operational constraints | Regulatory or policy constraints exist |

For example, in medical screening, a lower threshold increases recall (catching more true positives) at the cost of reduced precision (more false positives).[9] In fraud detection, missing a fraudulent transaction (false negative) may be far more costly than flagging a legitimate one (false positive), so the threshold is lowered accordingly.[7]

Threshold selection should always be performed on a validation set, not the training set, to avoid overfitting the threshold to training data.
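The F1-maximization row of the table can be sketched as a simple sweep over candidate thresholds. The validation labels and probabilities below are made up, and the grid of 0.01 to 0.99 is an arbitrary assumption; other metrics such as the Youden index could be swapped in through the `metric` argument.

```python
def f1_at_threshold(y_true, p_pred, threshold):
    """F1 score when probabilities above `threshold` are labeled positive."""
    tp = sum(1 for y, p in zip(y_true, p_pred) if p > threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, p_pred) if p > threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, p_pred) if p <= threshold and y == 1)
    denom = 2 * tp + fp + fn  # F1 = 2TP / (2TP + FP + FN)
    return 2 * tp / denom if denom else 0.0


def select_threshold(y_true, p_pred, metric=f1_at_threshold):
    """Return the first grid threshold that maximizes the metric."""
    grid = [i / 100 for i in range(1, 100)]
    return max(grid, key=lambda t: metric(y_true, p_pred, t))


# Hypothetical validation-set labels and predicted probabilities.
y_val = [0, 0, 0, 0, 1, 0, 1, 1]
p_val = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.60, 0.90]

best = select_threshold(y_val, p_val)
f1_default = f1_at_threshold(y_val, p_val, 0.5)
f1_best = f1_at_threshold(y_val, p_val, best)
```

On this toy data the selected threshold is below 0.5, and it yields a higher F1 score than the default because it recovers a positive example that the default threshold misses.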

## Confusion Matrix

The [confusion matrix](/wiki/confusion_matrix) is a fundamental tool for evaluating binary classifiers. It organizes predictions into a 2x2 table based on the actual and predicted class labels:

| | Predicted Positive | Predicted Negative |
|---|---|---|
| **Actual Positive** | True Positive (TP) | False Negative (FN) |
| **Actual Negative** | False Positive (FP) | True Negative (TN) |

Each cell in the confusion matrix captures a specific type of prediction outcome:

- **True Positive (TP):** The model correctly predicts the [positive class](/wiki/positive_class). The actual label is positive, and the predicted label is also positive.
- **True Negative (TN):** The model correctly predicts the [negative class](/wiki/negative_class). The actual label is negative, and the predicted label is also negative.
- **False Positive (FP):** The model incorrectly predicts the positive class. The actual label is negative, but the predicted label is positive. This is also called a Type I error.
- **False Negative (FN):** The model incorrectly predicts the negative class. The actual label is positive, but the predicted label is negative. This is also called a Type II error.

All standard binary classification metrics can be derived from these four values.[9] The confusion matrix provides a complete picture of classifier performance and reveals patterns of errors that aggregate metrics may obscure.
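Counting the four outcomes takes only a few lines. The sketch below assumes labels encoded as 0 and 1, with 1 as the positive class; the function name and example labels are hypothetical.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count TP, FN, FP, and TN for labels encoded as 0 (negative) and 1 (positive)."""
    pairs = Counter(zip(y_true, y_pred))  # keys are (actual, predicted)
    return {
        "TP": pairs[(1, 1)],
        "FN": pairs[(1, 0)],
        "FP": pairs[(0, 1)],
        "TN": pairs[(0, 0)],
    }


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
```

Here the model makes one Type I error (a false positive) and one Type II error (a false negative).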

## Evaluation Metrics

The performance of a binary classification model is evaluated using metrics such as [accuracy](/wiki/accuracy), [precision](/wiki/precision), [recall](/wiki/recall), and [F1 score](/wiki/f1_score).[9] The right metric depends on the problem at hand and on the relative cost of misclassifying each class.

### Metrics Summary Table

| Metric | Formula | Range | Description |
|---|---|---|---|
| [Accuracy](/wiki/accuracy) | $$\frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$ | 0 to 1 | Proportion of all predictions that are correct |
| [Precision](/wiki/precision) | $$\frac{\text{TP}}{\text{TP} + \text{FP}}$$ | 0 to 1 | Proportion of positive predictions that are correct |
| [Recall](/wiki/recall) (Sensitivity) | $$\frac{\text{TP}}{\text{TP} + \text{FN}}$$ | 0 to 1 | Proportion of actual positives correctly identified |
| [F1 Score](/wiki/f1_score) | $$\frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$ | 0 to 1 | Harmonic mean of precision and recall |
| Specificity (TNR) | $$\frac{\text{TN}}{\text{TN} + \text{FP}}$$ | 0 to 1 | Proportion of actual negatives correctly identified |
| [AUC-ROC](/wiki/auc_area_under_the_roc_curve) | Area under the ROC curve | 0 to 1 | Model's ability to discriminate between classes across all thresholds |
| AUC-PR | Area under the Precision-Recall curve | 0 to 1 | Model performance on the positive class across thresholds |
| [MCC](/wiki/matthews_correlation_coefficient) | $$\frac{\text{TP} \cdot \text{TN} - \text{FP} \cdot \text{FN}}{\sqrt{(\text{TP}+\text{FP})(\text{TP}+\text{FN})(\text{TN}+\text{FP})(\text{TN}+\text{FN})}}$$ | -1 to 1 | Correlation between predicted and actual classifications |

### Accuracy

[Accuracy](/wiki/accuracy) measures the percentage of correct predictions made by the model on a set of test data. It is calculated as $$\frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$. While intuitive and easy to understand, accuracy can be misleading when the dataset is imbalanced.[7] For example, if 95% of transactions are legitimate, a model that always predicts "not fraud" achieves 95% accuracy while failing to detect any actual fraud.

### Precision

[Precision](/wiki/precision) is the proportion of true positive predictions among all positive predictions made by the model.[9] It answers the question: "Of all the instances the model labeled positive, how many were actually positive?" Precision is especially important in applications where false positives are costly, such as spam filtering (where a legitimate email incorrectly flagged as spam may cause the user to miss important messages).

### Recall (Sensitivity)

[Recall](/wiki/recall) measures the proportion of actual positives that the model correctly identified.[9] It answers the question: "Of all the actual positive instances, how many did the model catch?" Recall is critical in applications where false negatives are costly, such as medical diagnosis (where missing a disease could delay treatment) or [fraud detection](/wiki/fraud_detection).

### F1 Score

The [F1 score](/wiki/f1_score) is the harmonic mean of [precision](/wiki/precision) and [recall](/wiki/recall), providing a single metric that balances both concerns. It is calculated as:

$$
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$

The harmonic mean gives more weight to lower values, so the F1 score is high only when both precision and recall are high.[9] This makes it a useful metric when the costs of false positives and false negatives are roughly equal and the dataset is imbalanced.

### Specificity (True Negative Rate)

Specificity measures the proportion of actual negatives that the model correctly identified. It is calculated as $$\frac{\text{TN}}{\text{TN} + \text{FP}}$$.[9] Specificity is the counterpart of [recall](/wiki/recall) for the [negative class](/wiki/negative_class). In medical testing, specificity indicates how well a test correctly identifies people who do not have a disease.
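The threshold-dependent metrics above can all be computed from the four confusion matrix counts, as in this sketch. Returning 0.0 when a denominator is zero is a convention assumed here, not part of the definitions. The example reproduces the always-negative fraud model from the accuracy discussion, scaled to a hypothetical 1,000 transactions.

```python
def safe_div(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero (assumed convention)."""
    return num / den if den else 0.0


def basic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, precision, recall, specificity, and F1 from confusion counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# 1,000 transactions, 5% fraudulent, and a model that always predicts "not fraud".
always_negative = basic_metrics(tp=0, fp=0, tn=950, fn=50)
```

The model scores 0.95 on accuracy and 1.0 on specificity, yet its recall and F1 score are both 0, which is why accuracy alone is a poor guide on imbalanced data.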

### AUC-ROC

The [ROC curve](/wiki/auc_area_under_the_roc_curve) (Receiver Operating Characteristic) plots the true positive rate (recall) against the false positive rate (1 - specificity) at various classification thresholds. The area under this curve (AUC-ROC) provides a threshold-independent measure of a model's discriminative ability.[5] An AUC-ROC of 1.0 represents a perfect classifier, while 0.5 represents random guessing.

In their foundational 1982 paper in Radiology, James Hanley and Barbara McNeil established the probabilistic interpretation of this quantity, writing that "the area represents the probability that a randomly chosen diseased subject is (correctly) rated or ranked with greater suspicion than a randomly chosen non-diseased subject."[14] Equivalently, AUC-ROC is the probability that the model ranks a randomly chosen positive instance above a randomly chosen negative instance, a value identical to the nonparametric Wilcoxon-Mann-Whitney statistic.[14]

AUC-ROC is widely used because it summarizes performance across all possible thresholds in a single number. However, it can be overly optimistic when the dataset is heavily imbalanced, because the false positive rate denominator (FP + TN) is dominated by the large number of true negatives.[5]
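The ranking interpretation leads to a direct, if quadratic-time, way to compute AUC-ROC: compare every positive score with every negative score. The sketch below counts ties as half a win, which is the usual convention for the Wilcoxon-Mann-Whitney statistic; the data are made up for illustration.

```python
def auc_roc(y_true, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_roc(y_true, scores)
```

Three of the four positive-negative pairs are ordered correctly here, so the AUC is 0.75. For large datasets a sort-based computation is preferable to this pairwise loop.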

### AUC-PR (Precision-Recall AUC)

The Precision-Recall curve plots [precision](/wiki/precision) against [recall](/wiki/recall) at various thresholds. The area under this curve (AUC-PR) is particularly informative for imbalanced datasets where the [positive class](/wiki/positive_class) is rare. Unlike AUC-ROC, the Precision-Recall curve does not include true negatives in its computation, so it provides a clearer picture of how well the model identifies the minority class.[5]

The baseline AUC-PR for a random classifier equals the prevalence of the positive class.[5] In a dataset where only 1% of samples are positive, a random classifier achieves an AUC-PR of approximately 0.01, making it easier to distinguish meaningful improvements from chance performance.
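One common estimator of the area under the Precision-Recall curve is average precision, which averages the precision at the rank of each positive instance. The sketch below assumes that estimator and ignores tied scores; other estimators (such as trapezoidal integration) give slightly different values.

```python
def average_precision(y_true, scores):
    """Mean of precision@rank taken at the rank of each positive instance."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` predictions
    return total / n_pos


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
ap = average_precision(y_true, scores)
prevalence = sum(y_true) / len(y_true)  # expected AUC-PR of a random classifier
```

Comparing `ap` against `prevalence` rather than against 0.5 is what makes the metric interpretable on imbalanced data.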

### Matthews Correlation Coefficient (MCC)

The [Matthews correlation coefficient](/wiki/matthews_correlation_coefficient) is a balanced metric that uses all four values from the [confusion matrix](/wiki/confusion_matrix). It produces a value between -1 and +1, where +1 indicates perfect prediction, 0 indicates performance no better than random, and -1 indicates total disagreement between predictions and actual labels. The measure is named after biochemist Brian W. Matthews, who introduced it in 1975 to compare predicted and observed protein secondary structures; it is mathematically identical to Karl Pearson's phi coefficient.[13]

MCC is considered one of the most reliable single-number measures for binary classification quality because it accounts for the balance ratios of all four confusion matrix categories. Research has shown that MCC is more informative than F1 score and accuracy, especially on imbalanced datasets, because it only produces a high score when the classifier performs well on both the positive and negative classes.[4] Chicco and Jurman, in their 2020 analysis in BMC Genomics, concluded that MCC "produces a high score only if the prediction obtained good results in all of the four confusion matrix categories," proportionally to the number of positive and negative elements in the dataset.[4]
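The contrast with accuracy can be seen in a small sketch. The confusion counts are hypothetical, and returning 0.0 when the denominator vanishes is a convention assumed here for the undefined case.

```python
import math


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


# Hypothetical imbalanced test set: 50 positives, 950 negatives.
weak = mcc(tp=5, fp=5, tn=945, fn=45)      # accuracy is 0.95, recall only 0.10
perfect = mcc(tp=50, fp=0, tn=950, fn=0)
trivial = mcc(tp=0, fp=0, tn=950, fn=50)   # always predicts the negative class
```

The weak classifier reaches 95% accuracy but an MCC of only about 0.21, because it misses most of the positive class.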

## Handling Class Imbalance

Class imbalance occurs when one class significantly outnumbers the other in the training data. This is common in real-world binary classification problems: fraudulent transactions may represent less than 0.1% of all transactions, and disease-positive cases may be a small fraction of all patients. Models trained on imbalanced data tend to be biased toward the majority class, resulting in poor [recall](/wiki/recall) for the minority class.[7]

Several techniques address class imbalance:

### SMOTE (Synthetic Minority Oversampling Technique)

[SMOTE](/wiki/smote) generates synthetic samples for the minority class by interpolating between existing minority-class instances. For each minority-class sample, SMOTE identifies its k-nearest neighbors (typically $$k = 5$$), then creates new synthetic examples at random points along the line segments connecting the sample to its neighbors.[3] The technique was introduced by Nitesh Chawla and colleagues in 2002 in the Journal of Artificial Intelligence Research and has become one of the most widely cited methods for imbalanced learning.[3]

SMOTE produces more diverse synthetic samples than simple random oversampling (which merely duplicates existing examples) and helps the classifier learn a more generalizable decision boundary. Several variants have been developed, including Borderline-SMOTE (which focuses on generating samples near the class boundary), SMOTE-ENN (which combines oversampling with Edited Nearest Neighbors for cleaning), and ADASYN (Adaptive Synthetic Sampling, which generates more samples for minority instances that are harder to learn, typically those with many majority-class neighbors).[7]
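The interpolation step at the heart of SMOTE can be sketched in a few lines. This is a simplified illustration, not the reference implementation: it assumes numeric features, Euclidean distance, and a brute-force neighbor search, and the function name, seed, and toy points are hypothetical.

```python
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Brute-force search for the k nearest minority neighbors of x.
        others = [m for m in minority if m is not x]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from x to the neighbor
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbor)])
    return synthetic


# Toy minority-class points in two dimensions.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
synthetic = smote_sketch(minority, n_new=6, k=2)
```

Every synthetic point lies on a line segment between two real minority points, so the new samples stay inside the region the minority class already occupies. Resampling should be applied to the training split only, never to validation or test data.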

### Class Weights

Instead of modifying the dataset, class weights adjust the [loss function](/wiki/loss_function) so that misclassifications of the minority class receive a higher penalty.[7] For example, if the positive class represents 10% of the data, its weight might be set to 9 (the ratio of negative to positive samples), making each minority-class error nine times more influential during training.

Most machine learning frameworks support class weighting natively. In scikit-learn, the `class_weight='balanced'` parameter automatically adjusts weights inversely proportional to class frequencies.
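As a rough illustration, the "balanced" heuristic that scikit-learn documents is `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula in plain Python; the helper name `balanced_class_weights` and the toy labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


# 10% positives and 90% negatives, as in the example above.
labels = [1] * 10 + [0] * 90
weights = balanced_class_weights(labels)

# Relative influence of a positive-class error versus a negative-class error.
ratio = weights[1] / weights[0]
```

With this split the positive class receives a weight of 5 and the negative class a weight of about 0.56. The absolute values differ from the 9-and-1 scheme, but the ratio between them is still 9, which is what determines the relative penalty.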

### Threshold Adjustment

Rather than using the default 0.5 threshold, the classification threshold can be lowered to increase the model's sensitivity to the minority class.[7] For example, setting the threshold to 0.3 means that any predicted probability above 0.3 is classified as positive, increasing [recall](/wiki/recall) at the potential cost of [precision](/wiki/precision). Tuning the threshold on a held-out validation set, rather than on the training or test data, allows practitioners to find the value that optimizes the desired metric.
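The threshold search can be sketched as a simple sweep over candidate values. The validation labels and probabilities below are invented for illustration, and F1 is used as the target metric only as an example; any other metric could be substituted.

```python
def precision_recall_f1(y_true, y_prob, threshold):
    """Precision, recall, and F1 when predicting positive if prob > threshold."""
    y_pred = [1 if p > threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def best_threshold(y_true, y_prob, candidates):
    """Return the candidate threshold with the highest F1 on the given data."""
    return max(candidates, key=lambda t: precision_recall_f1(y_true, y_prob, t)[2])


# Hypothetical validation labels and predicted probabilities.
y_val = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
p_val = [0.05, 0.10, 0.15, 0.20, 0.25, 0.35, 0.40, 0.45, 0.60, 0.80]

candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
chosen = best_threshold(y_val, p_val, candidates)

at_default = precision_recall_f1(y_val, p_val, 0.5)
at_lowered = precision_recall_f1(y_val, p_val, 0.3)
```

On this toy data, lowering the threshold from 0.5 to 0.3 raises recall from 2/3 to 1 while precision falls from 1 to 0.6, and the sweep selects 0.35 as the F1-maximizing value.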

### Summary of Imbalance Techniques

| Technique | Category | Description |
|---|---|---|
| Random undersampling | Data-level | Removes random majority-class samples to balance the dataset |
| Random oversampling | Data-level | Duplicates random minority-class samples |
| [SMOTE](/wiki/smote) | Data-level | Generates synthetic minority samples via interpolation |
| ADASYN | Data-level | Adaptively generates more synthetic samples in difficult regions |
| Class weights | Algorithm-level | Adjusts the loss function to penalize minority-class errors more heavily |
| Threshold adjustment | Post-processing | Changes the decision threshold applied to predicted probabilities after model training |
| Ensemble methods | Algorithm-level | Combines multiple models trained on balanced subsets (e.g., imbalanced-learn's `BalancedRandomForestClassifier`) |
| Cost-sensitive learning | Algorithm-level | Incorporates misclassification costs directly into the training objective |

## Binary vs. Multi-class and Multi-label Classification

Binary classification is distinct from both [multi-class classification](/wiki/multi-class_classification) and multi-label classification. Understanding the differences is important for selecting the correct modeling approach.

| Type | Number of Classes | Labels per Sample | Output Function | Loss Function | Example |
|---|---|---|---|---|---|
| Binary classification | 2 | 1 | [Sigmoid](/wiki/sigmoid_function) | Binary [cross-entropy](/wiki/cross-entropy) | Spam or not spam |
| [Multi-class classification](/wiki/multi-class_classification) | 3 or more | 1 | [Softmax](/wiki/softmax) | Categorical [cross-entropy](/wiki/cross-entropy) | Classifying an image as cat, dog, or bird |
| Multi-label classification | 2 or more | 0 or more | Sigmoid (per label) | Binary cross-entropy (per label) | Tagging a movie as action, comedy, and drama simultaneously |

In **multi-class classification**, the classes are mutually exclusive: each sample belongs to exactly one class. The [softmax function](/wiki/softmax) is typically used in the output layer instead of sigmoid, and categorical [cross-entropy](/wiki/cross-entropy) replaces binary cross-entropy as the loss function.[6]

In **multi-label classification**, the classes are not mutually exclusive: a single sample can belong to multiple classes at the same time. This is often handled by treating each label as an independent binary classification problem. The output layer uses a [sigmoid](/wiki/sigmoid_function) activation for each label, and the model is trained with binary cross-entropy computed independently for each label.
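The difference between the two output functions can be seen by applying both to the same raw scores. The logits below are hypothetical, and the 0.5 cutoff for the per-label decisions is an assumption chosen for illustration.

```python
import math


def sigmoid(z):
    """Map a single raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Map raw scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, -1.0]  # hypothetical raw scores for three classes or labels

# Multi-class: probabilities compete and sum to 1; predict the single arg max.
class_probs = softmax(logits)
predicted_class = class_probs.index(max(class_probs))

# Multi-label: each label gets an independent probability; any number can be on.
label_probs = [sigmoid(z) for z in logits]
predicted_labels = [int(p > 0.5) for p in label_probs]
```

With these scores, softmax selects only the first class, whereas the per-label sigmoids switch on both the first and second labels.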

Binary classification can also serve as a building block for multi-class problems through strategies such as one-vs-rest (OvR), where a separate binary classifier is trained for each class, and one-vs-one (OvO), where a binary classifier is trained for every pair of classes.[1]
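For $$K$$ classes, one-vs-rest requires $$K$$ binary classifiers and one-vs-one requires $$K(K-1)/2$$. The sketch below shows only how the binary training labels for each sub-problem could be constructed; the helper names and the toy labels are hypothetical, and no classifier is actually trained.

```python
from itertools import combinations


def one_vs_rest_tasks(labels):
    """One binary label vector per class: 1 for that class, 0 for all others."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}


def one_vs_one_tasks(labels):
    """One binary task per class pair, keeping only samples from those two classes."""
    classes = sorted(set(labels))
    tasks = {}
    for a, b in combinations(classes, 2):
        idx = [i for i, y in enumerate(labels) if y in (a, b)]
        # Store the kept sample indices and their binary labels (1 = class a).
        tasks[(a, b)] = (idx, [1 if labels[i] == a else 0 for i in idx])
    return tasks


labels = ["cat", "dog", "bird", "dog", "cat", "bird", "bird"]
ovr = one_vs_rest_tasks(labels)  # 3 binary problems for 3 classes
ovo = one_vs_one_tasks(labels)   # 3 * (3 - 1) / 2 = 3 binary problems
```

One-vs-rest trains each classifier on the full dataset, while one-vs-one trains more classifiers but each on a smaller subset containing only two classes.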

## Bayes Optimal Classifier

In the theoretical framework of statistical learning, the Bayes optimal classifier represents the best possible binary classifier for a given problem. It assigns each input to the class with the highest posterior probability:

$$
y^* = \arg\max_c P(y = c \mid x)
$$

For binary classification, this simplifies to predicting the positive class when $$P(y = 1 \mid x) > 0.5$$ and the negative class otherwise. The Bayes optimal classifier achieves the lowest possible error rate (known as the Bayes error rate) for a given data distribution. No other classifier can outperform it in expectation.[1][2]

In practice, the true class priors and class-conditional distributions are unknown, so the Bayes optimal classifier cannot be computed directly. Practical machine learning algorithms can be viewed as attempts to approximate it by learning a decision boundary from training data. The Bayes optimal classifier therefore serves as a theoretical benchmark for how close a learned model comes to the best achievable performance.
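When the distributions are known, as in a synthetic problem, the rule and its error rate can be computed exactly. The sketch below assumes a one-dimensional toy setup that is not taken from any real dataset: equal class priors, with the feature drawn from a unit-variance normal distribution centered at 0 for the negative class and at 2 for the positive class.

```python
import math


def normal_pdf(x, mean, std=1.0):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def normal_cdf(x, mean=0.0, std=1.0):
    """Cumulative distribution function of a normal distribution at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2))))


# Assumed toy problem: equal priors, x | y=0 ~ N(0, 1), x | y=1 ~ N(2, 1).
PRIOR_POS = 0.5
MEAN_NEG, MEAN_POS = 0.0, 2.0


def posterior_positive(x):
    """P(y = 1 | x) via Bayes' rule."""
    pos = PRIOR_POS * normal_pdf(x, MEAN_POS)
    neg = (1.0 - PRIOR_POS) * normal_pdf(x, MEAN_NEG)
    return pos / (pos + neg)


def bayes_classifier(x):
    """Predict the positive class when its posterior exceeds 0.5."""
    return 1 if posterior_positive(x) > 0.5 else 0


# The posterior crosses 0.5 at x = 1, the midpoint of the two means.
# Bayes error = P(y=0) * P(x > 1 | y=0) + P(y=1) * P(x <= 1 | y=1).
bayes_error = (
    (1.0 - PRIOR_POS) * (1.0 - normal_cdf(1.0, MEAN_NEG))
    + PRIOR_POS * normal_cdf(1.0, MEAN_POS)
)
```

In this setup the Bayes error is roughly 15.9%: because the two class distributions overlap, even the optimal rule misclassifies about one in six samples, and no learned model can do better in expectation.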

## Applications

Binary classification is used extensively across many domains. The following table summarizes prominent applications:

| Application | Positive Class | Negative Class | Common Algorithms | Key Metric |
|---|---|---|---|---|
| [Spam detection](/wiki/spam_detection) | Spam email | Legitimate email | [Naive Bayes](/wiki/naive_bayes), [Logistic Regression](/wiki/logistic_regression), [SVM](/wiki/support_vector_machine_svm) | Precision |
| [Fraud detection](/wiki/fraud_detection) | Fraudulent transaction | Legitimate transaction | [Gradient Boosting](/wiki/gradient_boosting), [Random Forest](/wiki/random_forest), [Neural Networks](/wiki/neural_network) | Recall, AUC-PR |
| Medical diagnosis | Disease present | Disease absent | [Logistic Regression](/wiki/logistic_regression), [Random Forest](/wiki/random_forest), [Deep Learning](/wiki/deep_learning) | Recall, Specificity |
| [Sentiment analysis](/wiki/sentiment_analysis) | Positive sentiment | Negative sentiment | [BERT](/wiki/bert), [Logistic Regression](/wiki/logistic_regression), [SVM](/wiki/support_vector_machine_svm) | F1 Score, Accuracy |
| Churn prediction | Customer will churn | Customer will stay | [Gradient Boosting](/wiki/gradient_boosting), [Logistic Regression](/wiki/logistic_regression) | AUC-ROC |
| Manufacturing defect detection | Defective product | Non-defective product | [CNN](/wiki/convolutional_neural_network), [SVM](/wiki/support_vector_machine_svm) | Recall, Precision |
| Credit scoring | Default | No default | [Logistic Regression](/wiki/logistic_regression), [Gradient Boosting](/wiki/gradient_boosting) | AUC-ROC, MCC |
| Disease screening | Positive test | Negative test | [Logistic Regression](/wiki/logistic_regression), [Random Forest](/wiki/random_forest) | Sensitivity, Specificity |

### Spam Detection

Spam detection was one of the earliest large-scale applications of binary classification. Email providers classify incoming messages as spam or not spam using features such as the presence of certain keywords, the sender's reputation, header information, and embedded links. Early systems relied on [Naive Bayes](/wiki/naive_bayes) classifiers, while modern systems use combinations of [deep learning](/wiki/deep_learning) models and rule-based filters.[6]

### Fraud Detection

Financial institutions deploy binary classifiers to flag potentially fraudulent transactions in real time. These systems analyze features such as transaction amount, location, time, merchant category, and user behavior patterns. The extreme class imbalance (fraud typically accounts for less than 0.1% of transactions) makes this a challenging application where [recall](/wiki/recall) and AUC-PR are often prioritized over [accuracy](/wiki/accuracy).[5][7]

### Medical Diagnosis

Binary classification is widely applied in medical settings to detect diseases from clinical data, lab results, or medical images. Examples include classifying mammograms as showing malignant or benign tissue, predicting whether a patient has diabetes based on blood test results, and detecting pneumonia from chest X-rays. In medical applications, the cost of false negatives (missing a disease) is typically much higher than the cost of false positives (a healthy person flagged for further testing), so models are often tuned for high [recall](/wiki/recall).

### Sentiment Analysis

[Sentiment analysis](/wiki/sentiment_analysis) classifies text (such as product reviews, social media posts, or customer feedback) as expressing positive or negative sentiment. While sentiment can also be modeled as a multi-class problem (positive, neutral, negative) or on a continuous scale, the binary formulation remains common. Modern approaches use [transformer](/wiki/transformer)-based models such as [BERT](/wiki/bert) fine-tuned on labeled sentiment data.[6]

## Explain Like I'm 5 (ELI5)

Imagine you have a big box of apples. Some apples are good and some are bad. Your job is to look at each apple and sort it into one of two buckets: the "good" bucket or the "bad" bucket. That is binary classification. You look at clues like the color, shape, and whether it has spots to make your decision.

A computer does the same thing, but instead of apples it might be sorting emails into "spam" or "not spam," or looking at a medical test to decide "sick" or "healthy." The computer learns from lots of examples that people have already sorted. After seeing enough examples, it gets good at guessing which bucket new items belong in, even if it has never seen those exact items before.

The word "binary" means "two," so binary classification simply means sorting things into exactly two groups.

## References

1. Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer.
2. Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning: Data Mining, Inference, and Prediction* (2nd ed.). Springer.
3. Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). "SMOTE: Synthetic Minority Over-sampling Technique." *Journal of Artificial Intelligence Research*, 16, 321-357.
4. Chicco, D., & Jurman, G. (2020). "The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation." *BMC Genomics*, 21(1), 6.
5. Davis, J., & Goadrich, M. (2006). "The Relationship Between Precision-Recall and ROC Curves." *Proceedings of the 23rd International Conference on Machine Learning*, 233-240.
6. Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press.
7. He, H., & Garcia, E. A. (2009). "Learning from Imbalanced Data." *IEEE Transactions on Knowledge and Data Engineering*, 21(9), 1263-1284.
8. Platt, J. C. (1999). "Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods." *Advances in Large Margin Classifiers*, MIT Press, 61-74.
9. Powers, D. M. W. (2011). "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness and Correlation." *Journal of Machine Learning Technologies*, 2(1), 37-63.
10. Niculescu-Mizil, A., & Caruana, R. (2005). "Predicting Good Probabilities With Supervised Learning." *Proceedings of the 22nd International Conference on Machine Learning*, 625-632.
11. Cox, D. R. (1958). "The Regression Analysis of Binary Sequences." *Journal of the Royal Statistical Society: Series B (Methodological)*, 20(2), 215-242.
12. Cortes, C., & Vapnik, V. (1995). "Support-Vector Networks." *Machine Learning*, 20(3), 273-297.
13. Matthews, B. W. (1975). "Comparison of the predicted and observed secondary structure of T4 phage lysozyme." *Biochimica et Biophysica Acta (BBA) - Protein Structure*, 405(2), 442-451.
14. Hanley, J. A., & McNeil, B. J. (1982). "The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve." *Radiology*, 143(1), 29-36.