See also: Classification model, Multi-class classification
Binary classification is a type of machine learning problem where the goal is to classify input data into one of two classes or categories. These classes are conventionally labeled as the positive class (1) and the negative class (0). The classes are mutually exclusive. Binary classification is one of the most fundamental tasks in supervised learning and serves as the foundation for many real-world prediction systems. In formal terms, a binary classifier is a function that maps an input feature vector x to a single binary output y, where y belongs to {0, 1}.
Unlike multi-class classification, which involves three or more possible output categories, binary classification restricts the prediction space to exactly two outcomes. This constraint makes the problem mathematically simpler in some respects, but the practical challenges of building accurate binary classifiers remain significant, particularly when dealing with noisy data, overlapping class distributions, or severe class imbalance.
In binary classification, a machine learning algorithm learns from labeled training data to assign input data to one of two classes, such as spam or not spam, fraud or not fraud, or disease or not disease. When given new input data, the trained model predicts which of the two classes the input belongs to, based on the patterns it has learned.
Binary classification is a supervised learning task, meaning the algorithm is trained using labeled data where each data point has been assigned a label indicating its class membership. During the training phase, the model learns a decision boundary that separates the two classes in the feature space. Once trained, the model generalizes this boundary to unseen data during inference.
In binary classification, the two classes are referred to as the positive class and the negative class. By convention, the positive class (labeled 1) typically represents the outcome of primary interest, while the negative class (labeled 0) represents the default or baseline outcome. For example, in fraud detection the positive class is "fraudulent" and the negative class is "legitimate." In medical diagnosis, the positive class is "disease present" and the negative class is "disease absent."
The choice of which class to designate as positive affects the interpretation of evaluation metrics. Precision, recall, and the confusion matrix are all defined relative to the positive class. Swapping the positive and negative labels does not change the model's underlying behavior, but it changes the numeric values of these metrics. For this reason, practitioners should clearly define the positive class before evaluating results, particularly in domains where the minority class carries higher importance (such as detecting rare diseases or fraudulent transactions).
| Application | Input Features | Positive Class | Negative Class |
|---|---|---|---|
| Email spam filtering | Email content, headers, sender | Spam | Not spam |
| Credit risk assessment | Income, credit score, employment history | High-risk | Low-risk |
| Medical diagnosis | Age, blood pressure, lab results | Disease present | Disease absent |
| Fraud detection | Transaction amount, location, type | Fraudulent | Non-fraudulent |
| Sentiment analysis | Text of a review or post | Positive sentiment | Negative sentiment |
| Churn prediction | Usage patterns, account tenure | Will churn | Will stay |
Binary classification can be performed using many different machine learning algorithms, including logistic regression, decision trees, random forests, support vector machines, gradient boosting, and neural networks. The specific choice depends on the problem being solved and the characteristics of the data.
Logistic regression is one of the most widely used algorithms for binary classification. Despite its name, it is a classification method, not a regression method. Logistic regression models the probability that a given input belongs to the positive class by applying the sigmoid function to a linear combination of input features. The model learns a weight vector w and a bias term b such that the predicted probability is:
P(y = 1 | x) = sigmoid(w^T * x + b)
Logistic regression is valued for its simplicity, interpretability, and computational efficiency. The learned coefficients indicate the direction and magnitude of each feature's influence on the prediction. It works best when the relationship between the features and the log-odds of the outcome is approximately linear.
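The following minimal sketch shows this setup in practice, using scikit-learn and a synthetic dataset (both illustrative assumptions, not prescribed by this article):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-class data, for illustration only
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)

# predict_proba returns [P(y=0|x), P(y=1|x)]; the learned w and b live in coef_ and intercept_
p_positive = model.predict_proba(X_test)[:, 1]  # sigmoid(w^T * x + b)
labels = model.predict(X_test)                  # thresholded at 0.5 by default
```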
Support vector machines (SVMs) approach binary classification by finding the optimal hyperplane that separates the two classes with the maximum margin. The margin is the distance between the hyperplane and the nearest data points from each class, known as support vectors. By maximizing this margin, SVMs produce classifiers that tend to generalize well to unseen data.
For data that is not linearly separable, SVMs use the kernel trick to project the data into a higher-dimensional space where a linear separator can be found. Common kernel functions include the radial basis function (RBF), polynomial, and sigmoid kernels. SVMs are effective in high-dimensional spaces and perform well even when the number of features exceeds the number of training samples.
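A comparable sketch with an RBF-kernel SVM (again assuming scikit-learn and synthetic data; feature scaling is included because SVMs are sensitive to feature magnitudes):

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# kernel="rbf" applies the kernel trick; C trades margin width against training errors
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X, y)
print(svm.score(X, y))  # training accuracy, for illustration only
```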
Decision trees classify data by recursively partitioning the feature space based on feature values. At each internal node, the algorithm selects the feature and threshold that best separates the classes according to a splitting criterion such as Gini impurity or information gain. The process continues until a stopping condition is met, such as reaching a maximum depth or a minimum number of samples per leaf.
Decision trees handle both numerical and categorical features naturally and produce models that are easy to visualize and interpret. However, individual decision trees are prone to overfitting, especially when grown deep without pruning.
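A minimal scikit-learn sketch (hyperparameter values are illustrative) showing how depth and leaf-size limits act as the stopping conditions described above:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# criterion="gini" is the splitting criterion; max_depth and min_samples_leaf
# bound tree growth and help curb overfitting
tree = DecisionTreeClassifier(criterion="gini", max_depth=4,
                              min_samples_leaf=10, random_state=0)
tree.fit(X, y)
```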
Random forest is an ensemble learning method that trains multiple decision trees on random subsets of the training data and features, then aggregates their predictions through majority voting. This approach reduces the variance associated with individual trees and typically produces more robust classifiers.
Each tree in the forest is trained on a bootstrap sample (a random sample drawn with replacement) of the original data, and at each split, only a random subset of features is considered. The combination of bagging and feature randomization makes random forests resistant to overfitting and effective across many types of binary classification problems.
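A minimal scikit-learn sketch (parameter choices are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Each of the 200 trees sees a bootstrap sample; max_features="sqrt" is the
# per-split feature randomization described above
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
forest.fit(X, y)
p_positive = forest.predict_proba(X)[:, 1]  # fraction of trees voting positive
```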
Gradient boosting builds an ensemble of weak learners (usually shallow decision trees) sequentially, where each new tree corrects the errors of the previous ensemble. The method optimizes a loss function by adding trees that follow the negative gradient of the loss. Popular implementations include XGBoost, LightGBM, and CatBoost.
Gradient boosting methods frequently achieve state-of-the-art results on tabular data and are widely used in competitions and production systems. Key hyperparameters include the learning rate (which controls the contribution of each tree), the number of trees, and the maximum tree depth. Regularization techniques such as L1/L2 penalties and subsampling help prevent overfitting.
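A minimal sketch of these hyperparameters, using scikit-learn's GradientBoostingClassifier as a stand-in for the named implementations (an assumption, made for a dependency-free example):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# learning_rate scales each tree's contribution; subsample < 1.0 adds the
# stochastic regularization mentioned above
gbm = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05,
                                 max_depth=3, subsample=0.8, random_state=0)
gbm.fit(X_train, y_train)
print(gbm.score(X_test, y_test))
```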
Neural networks can be applied to binary classification by using a single output neuron with a sigmoid activation function. The network learns complex, nonlinear relationships between input features and the target class through multiple layers of interconnected neurons. Deep learning models, such as convolutional neural networks for image data and recurrent neural networks for sequential data, have achieved strong performance on many binary classification benchmarks.
For binary classification, the output layer consists of one neuron with sigmoid activation, producing a value between 0 and 1 that represents the predicted probability of the positive class. The network is trained using binary cross-entropy as the loss function and optimized with algorithms such as stochastic gradient descent or Adam.
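A minimal PyTorch sketch of this output-layer setup (the framework choice is an assumption); BCEWithLogitsLoss fuses the sigmoid with binary cross-entropy for numerical stability:

```python
import torch
import torch.nn as nn

# One hidden layer, one output logit for the positive class
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.BCEWithLogitsLoss()  # sigmoid + binary cross-entropy, fused
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

X = torch.randn(256, 10)                   # synthetic features
y = torch.randint(0, 2, (256, 1)).float()  # synthetic 0/1 labels

for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

p = torch.sigmoid(model(X))  # predicted probability of the positive class
```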
Naive Bayes classifiers apply Bayes' theorem with a strong independence assumption between features. Despite this simplifying assumption, Naive Bayes performs surprisingly well on many binary classification tasks, particularly text classification problems like spam detection and sentiment analysis. The algorithm is computationally efficient, requires minimal training data, and scales well to high-dimensional feature spaces. Variants include Gaussian Naive Bayes (for continuous features), Multinomial Naive Bayes (for count data), and Bernoulli Naive Bayes (for binary features).
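A minimal sketch of the classic spam-filter setup, Multinomial Naive Bayes over word counts (the tiny corpus is invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy corpus, for illustration only
texts = ["win a free prize now", "meeting at noon tomorrow",
         "free cash click here", "lunch with the team"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(texts, labels)
print(spam_filter.predict(["claim your free prize"]))  # [1], i.e. spam
```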
| Algorithm | Interpretability | Handles Non-linearity | Training Speed | Prone to Overfitting | Best For |
|---|---|---|---|---|---|
| Logistic Regression | High | No (linear) | Fast | Low | Linearly separable data, baseline models |
| SVM | Low (with kernels) | Yes (with kernels) | Moderate | Moderate | High-dimensional data, small to medium datasets |
| Decision Tree | High | Yes | Fast | High | Exploratory analysis, interpretable models |
| Random Forest | Moderate | Yes | Moderate | Low | General-purpose, tabular data |
| Gradient Boosting | Low | Yes | Slow | Moderate (with tuning) | Tabular data, competitions, high accuracy needs |
| Neural Network | Low | Yes | Slow | High (without regularization) | Large datasets, unstructured data (images, text) |
| Naive Bayes | High | No (linear boundaries) | Very fast | Low | Text classification, small datasets, baselines |
A loss function quantifies the difference between a model's predictions and the true labels. For binary classification, the two most commonly used loss functions are binary cross-entropy (log loss) and hinge loss.
The standard loss function for training binary classifiers is binary cross-entropy, also called log loss. For a single training example with true label y (0 or 1) and predicted probability p, the binary cross-entropy loss is defined as:
L(y, p) = -[y * log(p) + (1 - y) * log(1 - p)]
When y = 1 (positive class), the loss simplifies to -log(p), which penalizes the model heavily when it assigns a low probability to a true positive. When y = 0 (negative class), the loss becomes -log(1 - p), penalizing high predicted probabilities for actual negatives. The logarithmic nature of this function creates a large gradient for confident but incorrect predictions, which accelerates learning during backpropagation.
For a dataset of N samples, the overall loss is the average across all samples:
L = -(1/N) * sum[y_i * log(p_i) + (1 - y_i) * log(1 - p_i)]
Binary cross-entropy is a convex function for logistic regression, guaranteeing a single global minimum. For neural networks, the optimization landscape is non-convex, but binary cross-entropy still provides well-behaved gradients that facilitate training.
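A direct NumPy translation of this formula (the clipping constant is an implementation detail added to avoid log(0)):

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average log loss over N samples; clipping guards against log(0)."""
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1, 1])
p_pred = np.array([0.9, 0.2, 0.6, 0.95])
print(binary_cross_entropy(y_true, p_pred))  # ~0.22
```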
Hinge loss is the loss function used by support vector machines. Unlike binary cross-entropy, hinge loss operates on raw scores (logits) rather than probabilities and does not produce probabilistic output. For a single training example with true label y (encoded as -1 or +1) and predicted score f(x), hinge loss is:
L(y, f(x)) = max(0, 1 - y * f(x))
Hinge loss equals zero when the prediction is correct and the margin (the product y * f(x)) exceeds 1. When the margin is less than 1, the loss increases linearly. This means hinge loss penalizes not only incorrect predictions but also correct predictions that fall too close to the decision boundary. The focus on margin maximization is what gives SVMs their strong generalization properties.
Because hinge loss is not differentiable at the point where y * f(x) = 1, subgradient methods are used during optimization.
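A direct NumPy translation (note the -1/+1 label encoding, which differs from the 0/1 encoding used by cross-entropy):

```python
import numpy as np

def hinge_loss(y_true, scores):
    """Mean hinge loss; y_true must be encoded as -1/+1, scores are raw f(x)."""
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

y_true = np.array([+1, -1, +1])
scores = np.array([2.3, -0.4, 0.7])  # all correct, but the last two sit inside the margin
print(hinge_loss(y_true, scores))    # (0 + 0.6 + 0.3) / 3 = 0.3
```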
| Property | Binary Cross-Entropy | Hinge Loss |
|---|---|---|
| Output type | Probability (0 to 1) | Raw score (logit) |
| Probabilistic interpretation | Yes | No |
| Primary algorithm | Logistic regression, neural networks | SVM |
| Penalizes confident mistakes | Heavily (logarithmic) | Linearly |
| Margin-based | No | Yes |
| Smoothness | Smooth and differentiable | Not differentiable at hinge point |
| Best when | Calibrated probabilities are needed | Maximum-margin separation is desired |
Most binary classifiers produce a raw score (logit) that must be transformed into a probability. The sigmoid function (also called the logistic function) performs this transformation:
sigmoid(z) = 1 / (1 + exp(-z))
The sigmoid function maps any real-valued number z to the range (0, 1). When z is a large positive number, sigmoid(z) approaches 1; when z is a large negative number, sigmoid(z) approaches 0; and when z = 0, sigmoid(z) = 0.5. This S-shaped curve provides a smooth, differentiable mapping from logits to probabilities.
In logistic regression, the logit z is the linear combination of input features: z = w^T * x + b. In neural networks, z is the output of the final layer before the activation function. The resulting probability p = sigmoid(z) represents the model's confidence that the input belongs to the positive class.
The sigmoid function has a useful derivative property: sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)), which simplifies gradient computation during training.
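Both the function and its derivative are one-liners in NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # the derivative identity stated above

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # [0.0067 0.5 0.9933]
```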
The raw probability output of a binary classifier does not always accurately reflect the true likelihood of class membership. For example, a model might predict a probability of 0.8 for a set of instances, but only 60% of those instances actually belong to the positive class. When predicted probabilities match observed frequencies, the model is said to be well-calibrated. Calibration is important in applications where the predicted probability itself drives decisions, such as medical risk scoring or insurance pricing.
Some algorithms, like logistic regression, tend to produce well-calibrated probabilities by default. Others, like SVMs, random forests, and gradient boosting models, often require post-hoc calibration.
Platt scaling fits a logistic regression model to the classifier's output scores, learning parameters A and B such that the calibrated probability is p = 1 / (1 + exp(A * f(x) + B)). This method works well when the distortion in predicted probabilities follows a sigmoid shape, which is commonly the case for SVMs and neural networks. Platt scaling is simple to implement and works well even with small calibration datasets.
Isotonic regression is a non-parametric calibration method that learns a piecewise constant, monotonically increasing function mapping classifier scores to calibrated probabilities. It is more flexible than Platt scaling and can correct any monotonic distortion. However, isotonic regression requires more calibration data (typically 1,000+ samples) to avoid overfitting. When sufficient data is available, it generally outperforms Platt scaling.
Calibration should always be performed on a held-out calibration set (separate from both the training and test sets) to avoid biasing the calibrated probabilities.
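Both methods are available through scikit-learn's CalibratedClassifierCV; a minimal sketch (synthetic data, parameter values illustrative), where cv=5 keeps the calibration folds separate from the data each underlying model is fitted on:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# method="sigmoid" is Platt scaling; method="isotonic" fits isotonic regression
calibrated = CalibratedClassifierCV(SVC(), method="sigmoid", cv=5)
calibrated.fit(X, y)
probs = calibrated.predict_proba(X)[:, 1]  # calibrated P(y=1|x)
```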
After a binary classifier produces a probability score, a decision threshold must be applied to convert this probability into a class label. The default threshold is 0.5: if the predicted probability exceeds 0.5, the input is classified as positive; otherwise, it is classified as negative. However, the optimal threshold is often not 0.5, especially when dealing with imbalanced classes or when the costs of different types of errors are unequal.
Several approaches exist for determining the best threshold:
| Method | Strategy | Best When |
|---|---|---|
| ROC curve analysis | Maximize Youden index (sensitivity + specificity - 1) | Overall discrimination matters |
| Precision-recall tradeoff | Tune threshold to favor precision or recall | One type of error is more costly |
| Cost-sensitive optimization | Minimize total expected cost given asymmetric error costs | FP and FN have different financial costs |
| F1 score maximization | Choose threshold producing highest F1 on validation set | Balanced tradeoff between precision and recall |
| Business requirements | Domain experts set threshold based on operational constraints | Regulatory or policy constraints exist |
For example, in medical screening, a lower threshold increases recall (catching more true positives) at the cost of reduced precision (more false positives). In fraud detection, missing a fraudulent transaction (false negative) may be far more costly than flagging a legitimate one (false positive), so the threshold is lowered accordingly.
Threshold selection should always be performed on a validation set, not the training set, to avoid overfitting the threshold to training data.
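A minimal sketch of F1-maximizing threshold selection on a validation set (the labels and probabilities below are invented placeholders):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Placeholder validation labels and predicted probabilities
y_val = np.array([0, 0, 1, 1, 1, 0, 1, 0])
p_val = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.9, 0.55])

precision, recall, thresholds = precision_recall_curve(y_val, p_val)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best_threshold = thresholds[np.argmax(f1[:-1])]  # final P/R point has no threshold
print(best_threshold)
```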
The confusion matrix is a fundamental tool for evaluating binary classifiers. It organizes predictions into a 2x2 table based on the actual and predicted class labels:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
Each cell in the confusion matrix captures a specific type of prediction outcome: a true positive is an actual positive correctly predicted as positive; a false negative is an actual positive incorrectly predicted as negative; a false positive is an actual negative incorrectly predicted as positive; and a true negative is an actual negative correctly predicted as negative.
All standard binary classification metrics can be derived from these four values. The confusion matrix provides a complete picture of classifier performance and reveals patterns of errors that aggregate metrics may obscure.
The performance of a binary classification model is evaluated using metrics such as accuracy, precision, recall, and the F1 score. When selecting an evaluation metric, consider the specific problem at hand and the relative importance of correctly recognizing each class.
| Metric | Formula | Range | Description |
|---|---|---|---|
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | 0 to 1 | Proportion of all predictions that are correct |
| Precision | TP / (TP + FP) | 0 to 1 | Proportion of positive predictions that are correct |
| Recall (Sensitivity) | TP / (TP + FN) | 0 to 1 | Proportion of actual positives correctly identified |
| F1 Score | 2 * Precision * Recall / (Precision + Recall) | 0 to 1 | Harmonic mean of precision and recall |
| Specificity (TNR) | TN / (TN + FP) | 0 to 1 | Proportion of actual negatives correctly identified |
| AUC-ROC | Area under the ROC curve | 0 to 1 | Model's ability to discriminate between classes across all thresholds |
| AUC-PR | Area under the Precision-Recall curve | 0 to 1 | Model performance on the positive class across thresholds |
| MCC | (TP*TN - FP*FN) / sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN)) | -1 to 1 | Correlation between predicted and actual classifications |
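Every row of this table can be computed directly from the four confusion-matrix counts; a minimal sketch with invented counts:

```python
import math

def metrics_from_confusion(tp, fp, fn, tn):
    """Derives the standard metrics from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return accuracy, precision, recall, specificity, f1, mcc

print(metrics_from_confusion(tp=80, fp=10, fn=20, tn=890))
```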
Accuracy measures the percentage of correct predictions made by the model on a set of test data. It is calculated as (TP + TN) / (TP + TN + FP + FN). While intuitive and easy to understand, accuracy can be misleading when the dataset is imbalanced. For example, if 95% of transactions are legitimate, a model that always predicts "not fraud" achieves 95% accuracy while failing to detect any actual fraud.
Precision is the proportion of true positive predictions among all positive predictions made by the model. It answers the question: "Of all the instances the model labeled positive, how many were actually positive?" Precision is especially important in applications where false positives are costly, such as spam filtering (where a legitimate email incorrectly flagged as spam may cause the user to miss important messages).
Recall measures the proportion of actual positives that the model correctly identified. It answers the question: "Of all the actual positive instances, how many did the model catch?" Recall is critical in applications where false negatives are costly, such as medical diagnosis (where missing a disease could delay treatment) or fraud detection.
The F1 score is the harmonic mean of precision and recall, providing a single metric that balances both concerns. It is calculated as:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
The harmonic mean gives more weight to lower values, so the F1 score is high only when both precision and recall are high. This makes it a useful metric when the costs of false positives and false negatives are roughly equal and the dataset is imbalanced.
Specificity measures the proportion of actual negatives that the model correctly identified. It is calculated as TN / (TN + FP). Specificity is the counterpart of recall for the negative class. In medical testing, specificity indicates how well a test correctly identifies people who do not have a disease.
The ROC curve (Receiver Operating Characteristic) plots the true positive rate (recall) against the false positive rate (1 - specificity) at various classification thresholds. The area under this curve (AUC-ROC) provides a threshold-independent measure of a model's discriminative ability. An AUC-ROC of 1.0 represents a perfect classifier, while 0.5 represents random guessing.
AUC-ROC is widely used because it summarizes performance across all possible thresholds in a single number. However, it can be overly optimistic when the dataset is heavily imbalanced, because the false positive rate denominator (FP + TN) is dominated by the large number of true negatives.
The Precision-Recall curve plots precision against recall at various thresholds. The area under this curve (AUC-PR) is particularly informative for imbalanced datasets where the positive class is rare. Unlike AUC-ROC, the Precision-Recall curve does not include true negatives in its computation, so it provides a clearer picture of how well the model identifies the minority class.
The baseline AUC-PR for a random classifier equals the prevalence of the positive class. In a dataset where only 1% of samples are positive, a random classifier achieves an AUC-PR of approximately 0.01, making it easier to distinguish meaningful improvements from chance performance.
The Matthews correlation coefficient is a balanced metric that uses all four values from the confusion matrix. It produces a value between -1 and +1, where +1 indicates perfect prediction, 0 indicates performance no better than random, and -1 indicates total disagreement between predictions and actual labels.
MCC is considered one of the most reliable single-number measures for binary classification quality because it accounts for the relative sizes of all four confusion matrix categories. Research has shown that MCC is more informative than the F1 score and accuracy, especially on imbalanced datasets, because it produces a high score only when the classifier performs well on both the positive and negative classes.
Class imbalance occurs when one class significantly outnumbers the other in the training data. This is common in real-world binary classification problems: fraudulent transactions may represent less than 0.1% of all transactions, and disease-positive cases may be a small fraction of all patients. Models trained on imbalanced data tend to be biased toward the majority class, resulting in poor recall for the minority class.
Several techniques address class imbalance:
SMOTE generates synthetic samples for the minority class by interpolating between existing minority-class instances. For each minority-class sample, SMOTE identifies its k-nearest neighbors (typically k = 5), then creates new synthetic examples at random points along the line segments connecting the sample to its neighbors.
SMOTE produces more diverse synthetic samples than simple random oversampling (which merely duplicates existing examples) and helps the classifier learn a more generalizable decision boundary. Several variants have been developed, including Borderline-SMOTE (which focuses on generating samples near the class boundary), SMOTE-ENN (which combines oversampling with Edited Nearest Neighbors for cleaning), and ADASYN (Adaptive Synthetic Sampling, which generates more samples in regions where the classifier performs poorly).
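A minimal sketch using the third-party imbalanced-learn library (an assumption; the article does not name an implementation):

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic dataset with roughly 5% positives, for illustration
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
print(Counter(y))  # heavily imbalanced

# k_neighbors=5 matches the typical k mentioned above
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y_res))  # classes balanced after interpolation
```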
Instead of modifying the dataset, class weights adjust the loss function so that misclassifications of the minority class receive a higher penalty. For example, if the positive class represents 10% of the data, its weight might be set to 9 (the ratio of negative to positive samples), making each minority-class error nine times more influential during training.
Most machine learning frameworks support class weighting natively. In scikit-learn, the class_weight='balanced' parameter automatically adjusts weights inversely proportional to class frequencies.
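For example (synthetic data; the explicit dict mirrors the 10%-positive scenario above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

# "balanced" sets each class weight to n_samples / (n_classes * n_c)
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Equivalent idea with explicit weights: minority-class errors count 9x
clf_manual = LogisticRegression(class_weight={0: 1, 1: 9}).fit(X, y)
```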
Rather than using the default 0.5 threshold, the classification threshold can be lowered to increase the model's sensitivity to the minority class. For example, setting the threshold to 0.3 means that any predicted probability above 0.3 is classified as positive, increasing recall at the potential cost of precision. Threshold calibration on a validation set allows practitioners to find the threshold that optimizes the desired metric.
| Technique | Category | Description |
|---|---|---|
| Random undersampling | Data-level | Removes random majority-class samples to balance the dataset |
| Random oversampling | Data-level | Duplicates random minority-class samples |
| SMOTE | Data-level | Generates synthetic minority samples via interpolation |
| ADASYN | Data-level | Adaptively generates more synthetic samples in difficult regions |
| Class weights | Algorithm-level | Adjusts the loss function to penalize minority-class errors more heavily |
| Threshold adjustment | Post-processing | Modifies the decision boundary after model training |
| Ensemble methods | Algorithm-level | Combines multiple models trained on balanced subsets (e.g., BalancedRandomForest) |
| Cost-sensitive learning | Algorithm-level | Incorporates misclassification costs directly into the training objective |
Binary classification is distinct from both multi-class classification and multi-label classification. Understanding the differences is important for selecting the correct modeling approach.
| Type | Number of Classes | Labels per Sample | Output Function | Loss Function | Example |
|---|---|---|---|---|---|
| Binary classification | 2 | 1 | Sigmoid | Binary cross-entropy | Spam or not spam |
| Multi-class classification | 3 or more | 1 | Softmax | Categorical cross-entropy | Classifying an image as cat, dog, or bird |
| Multi-label classification | 2 or more | 0 or more | Sigmoid (per label) | Binary cross-entropy (per label) | Tagging a movie as action, comedy, and drama simultaneously |
In multi-class classification, the classes are mutually exclusive: each sample belongs to exactly one class. The softmax function is typically used in the output layer instead of sigmoid, and categorical cross-entropy replaces binary cross-entropy as the loss function.
In multi-label classification, the classes are not mutually exclusive: a single sample can belong to multiple classes at the same time. This is often handled by treating each label as an independent binary classification problem. The output layer uses a sigmoid activation for each label, and the model is trained with binary cross-entropy computed independently for each label.
Binary classification can also serve as a building block for multi-class problems through strategies such as one-vs-rest (OvR), where a separate binary classifier is trained for each class, and one-vs-one (OvO), where a binary classifier is trained for every pair of classes.
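A minimal sketch of the one-vs-rest strategy using scikit-learn's built-in wrapper on a three-class dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # three classes

# Trains one binary classifier per class: "class c" vs. "all other classes"
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(ovr.predict(X[:5]))
```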
In the theoretical framework of statistical learning, the Bayes optimal classifier represents the best possible binary classifier for a given problem. It assigns each input to the class with the highest posterior probability:
y* = argmax_c P(y = c | x)
For binary classification, this simplifies to predicting the positive class when P(y = 1 | x) > 0.5 and the negative class otherwise. The Bayes optimal classifier achieves the lowest possible error rate (known as the Bayes error rate) for a given data distribution. No other classifier can outperform it in expectation.
In practice, the true class-conditional distributions are unknown, so the Bayes optimal classifier cannot be computed directly. All practical machine learning algorithms attempt to approximate it by learning a decision boundary from training data. Understanding the Bayes optimal classifier provides a theoretical benchmark for evaluating how close a learned model comes to the best achievable performance.
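In the rare case where the class-conditional densities are known exactly, the Bayes optimal classifier can be written down directly; the one-dimensional Gaussian setup below is invented purely for illustration:

```python
import numpy as np
from scipy.stats import norm

# Assumed known densities: P(x|y=0) = N(0,1), P(x|y=1) = N(2,1), equal priors
def bayes_optimal(x):
    p1 = norm.pdf(x, loc=2.0)
    p0 = norm.pdf(x, loc=0.0)
    posterior_1 = p1 / (p0 + p1)             # P(y=1|x) by Bayes' theorem
    return (posterior_1 > 0.5).astype(int)   # decision boundary falls at x = 1

x = np.array([-1.0, 0.5, 1.5, 3.0])
print(bayes_optimal(x))  # [0 0 1 1]
```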
Binary classification is used extensively across many domains. The following table summarizes prominent applications:
| Application | Positive Class | Negative Class | Common Algorithms | Key Metric |
|---|---|---|---|---|
| Spam detection | Spam email | Legitimate email | Naive Bayes, Logistic Regression, SVM | Precision |
| Fraud detection | Fraudulent transaction | Legitimate transaction | Gradient Boosting, Random Forest, Neural Networks | Recall, AUC-PR |
| Medical diagnosis | Disease present | Disease absent | Logistic Regression, Random Forest, Deep Learning | Recall, Specificity |
| Sentiment analysis | Positive sentiment | Negative sentiment | BERT, Logistic Regression, SVM | F1 Score, Accuracy |
| Churn prediction | Customer will churn | Customer will stay | Gradient Boosting, Logistic Regression | AUC-ROC |
| Manufacturing defect detection | Defective product | Non-defective product | CNN, SVM | Recall, Precision |
| Credit scoring | Default | No default | Logistic Regression, Gradient Boosting | AUC-ROC, MCC |
| Disease screening | Positive test | Negative test | Logistic Regression, Random Forest | Sensitivity, Specificity |
Spam detection was one of the earliest large-scale applications of binary classification. Email providers classify incoming messages as spam or not spam using features such as the presence of certain keywords, the sender's reputation, header information, and embedded links. Early systems relied on Naive Bayes classifiers, while modern systems use combinations of deep learning models and rule-based filters.
Financial institutions deploy binary classifiers to flag potentially fraudulent transactions in real time. These systems analyze features such as transaction amount, location, time, merchant category, and user behavior patterns. The extreme class imbalance (fraud typically accounts for less than 0.1% of transactions) makes this a challenging application where recall and AUC-PR are often prioritized over accuracy.
Binary classification is widely applied in medical settings to detect diseases from clinical data, lab results, or medical images. Examples include classifying mammograms as showing malignant or benign tissue, predicting whether a patient has diabetes based on blood test results, and detecting pneumonia from chest X-rays. In medical applications, the cost of false negatives (missing a disease) is typically much higher than the cost of false positives (a healthy person flagged for further testing), so models are often tuned for high recall.
Sentiment analysis classifies text (such as product reviews, social media posts, or customer feedback) as expressing positive or negative sentiment. While sentiment can also be modeled as a multi-class problem (positive, neutral, negative) or on a continuous scale, the binary formulation remains common. Modern approaches use transformer-based models such as BERT fine-tuned on labeled sentiment data.
Imagine you have a big box of apples. Some apples are good and some are bad. Your job is to look at each apple and sort it into one of two buckets: the "good" bucket or the "bad" bucket. That is binary classification. You look at clues like the color, shape, and whether it has spots to make your decision.
A computer does the same thing, but instead of apples it might be sorting emails into "spam" or "not spam," or looking at a medical test to decide "sick" or "healthy." The computer learns from lots of examples that people have already sorted. After seeing enough examples, it gets good at guessing which bucket new items belong in, even if it has never seen those exact items before.
The word "binary" means "two," so binary classification simply means sorting things into exactly two groups.