See also: Classification model, Multi-class classification
Binary classification is a type of machine learning problem where the goal is to classify input data into one of two classes or categories. These classes are conventionally labeled as the positive class (1) and the negative class (0). The classes are mutually exclusive. Binary classification is one of the most fundamental tasks in supervised learning and serves as the foundation for many real-world prediction systems. In formal terms, a binary classifier is a function that maps an input feature vector x to a single binary output y, where y belongs to {0, 1}.
Unlike multi-class classification, which involves three or more possible output categories, binary classification restricts the prediction space to exactly two outcomes. This constraint makes the problem mathematically simpler in some respects, but the practical challenges of building accurate binary classifiers remain significant, particularly when dealing with noisy data, overlapping class distributions, or severe class imbalance.
In binary classification, a machine learning algorithm learns from labeled training data to assign input data to one of two classes, such as spam or not spam, fraud or not fraud, or disease or not disease. When given new input data, the trained model predicts which of the two classes the input belongs to, based on the patterns it has learned.
Binary classification is a supervised learning task, meaning the algorithm is trained using labeled data where each data point has been assigned a label indicating its class membership. During the training phase, the model learns a decision boundary that separates the two classes in the feature space. Once trained, the model generalizes this boundary to unseen data during inference.
In binary classification, the two classes are referred to as the positive class and the negative class. By convention, the positive class (labeled 1) typically represents the outcome of primary interest, while the negative class (labeled 0) represents the default or baseline outcome. For example, in fraud detection the positive class is "fraudulent" and the negative class is "legitimate." In medical diagnosis, the positive class is "disease present" and the negative class is "disease absent."
The choice of which class to designate as positive affects the interpretation of evaluation metrics. Precision, recall, and the confusion matrix are all defined relative to the positive class. Swapping the positive and negative labels does not change the model's underlying behavior, but it changes the numeric values of these metrics. For this reason, practitioners should clearly define the positive class before evaluating results, particularly in domains where the minority class carries higher importance (such as detecting rare diseases or fraudulent transactions).
| Application | Input Features | Positive Class | Negative Class |
|---|---|---|---|
| Email spam filtering | Email content, headers, sender | Spam | Not spam |
| Credit risk assessment | Income, credit score, employment history | High-risk | Low-risk |
| Medical diagnosis | Age, blood pressure, lab results | Disease present | Disease absent |
| Fraud detection | Transaction amount, location, type | Fraudulent | Non-fraudulent |
| Sentiment analysis | Text of a review or post | Positive sentiment | Negative sentiment |
| Churn prediction | Usage patterns, account tenure | Will churn | Will stay |
Binary classification can be performed using many different machine learning algorithms, including logistic regression, decision trees, random forests, support vector machines, gradient boosting, and neural networks. The specific choice depends on the problem being solved and the characteristics of the data.
Logistic regression is one of the most widely used algorithms for binary classification. Despite its name, it is a classification method, not a regression method. Logistic regression models the probability that a given input belongs to the positive class by applying the sigmoid function to a linear combination of input features. The model learns a weight vector w and a bias term b such that the predicted probability is:
P(y = 1 | x) = sigmoid(w^T * x + b)
Logistic regression is valued for its simplicity, interpretability, and computational efficiency. The learned coefficients indicate the direction and magnitude of each feature's influence on the prediction. It works best when the relationship between the features and the log-odds of the outcome is approximately linear.
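The following minimal sketch shows this setup in practice, using scikit-learn and a synthetic dataset (both illustrative assumptions, not prescribed by this article):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-class data, for illustration only
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)

# predict_proba returns [P(y=0|x), P(y=1|x)]; the learned w and b live in coef_ and intercept_
p_positive = model.predict_proba(X_test)[:, 1]  # sigmoid(w^T * x + b)
labels = model.predict(X_test)                  # thresholded at 0.5 by default
```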
Support vector machines (SVMs) approach binary classification by finding the optimal hyperplane that separates the two classes with the maximum margin. The margin is the distance between the hyperplane and the nearest data points from each class, known as support vectors. By maximizing this margin, SVMs produce classifiers that tend to generalize well to unseen data.
For data that is not linearly separable, SVMs use the kernel trick to project the data into a higher-dimensional space where a linear separator can be found. Common kernel functions include the radial basis function (RBF), polynomial, and sigmoid kernels. SVMs are effective in high-dimensional spaces and perform well even when the number of features exceeds the number of training samples.
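A comparable sketch with an RBF-kernel SVM (again assuming scikit-learn and synthetic data; feature scaling is included because SVMs are sensitive to feature magnitudes):

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# kernel="rbf" applies the kernel trick; C trades margin width against training errors
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X, y)
print(svm.score(X, y))  # training accuracy, for illustration only
```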
Decision trees classify data by recursively partitioning the feature space based on feature values. At each internal node, the algorithm selects the feature and threshold that best separates the classes according to a splitting criterion such as Gini impurity or information gain. The process continues until a stopping condition is met, such as reaching a maximum depth or a minimum number of samples per leaf.
Decision trees handle both numerical and categorical features naturally and produce models that are easy to visualize and interpret. However, individual decision trees are prone to overfitting, especially when grown deep without pruning.
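A minimal scikit-learn sketch (hyperparameter values are illustrative) showing how depth and leaf-size limits act as the stopping conditions described above:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# criterion="gini" is the splitting criterion; max_depth and min_samples_leaf
# bound tree growth and help curb overfitting
tree = DecisionTreeClassifier(criterion="gini", max_depth=4,
                              min_samples_leaf=10, random_state=0)
tree.fit(X, y)
```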
Random forest is an ensemble learning method that trains multiple decision trees on random subsets of the training data and features, then aggregates their predictions through majority voting. This approach reduces the variance associated with individual trees and typically produces more robust classifiers.
Each tree in the forest is trained on a bootstrap sample (a random sample drawn with replacement) of the original data, and at each split, only a random subset of features is considered. The combination of bagging and feature randomization makes random forests resistant to overfitting and effective across many types of binary classification problems.
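A minimal scikit-learn sketch (parameter choices are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Each of the 200 trees sees a bootstrap sample; max_features="sqrt" is the
# per-split feature randomization described above
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
forest.fit(X, y)
p_positive = forest.predict_proba(X)[:, 1]  # fraction of trees voting positive
```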
Gradient boosting builds an ensemble of weak learners (usually shallow decision trees) sequentially, where each new tree corrects the errors of the previous ensemble. The method optimizes a loss function by adding trees that follow the negative gradient of the loss. Popular implementations include XGBoost, LightGBM, and CatBoost.
Gradient boosting methods frequently achieve state-of-the-art results on tabular data and are widely used in competitions and production systems. Key hyperparameters include the learning rate (which controls the contribution of each tree), the number of trees, and the maximum tree depth. Regularization techniques such as L1/L2 penalties and subsampling help prevent overfitting.
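A minimal sketch of these hyperparameters, using scikit-learn's GradientBoostingClassifier as a stand-in for the named implementations (an assumption, made for a dependency-free example):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# learning_rate scales each tree's contribution; subsample < 1.0 adds the
# stochastic regularization mentioned above
gbm = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05,
                                 max_depth=3, subsample=0.8, random_state=0)
gbm.fit(X_train, y_train)
print(gbm.score(X_test, y_test))
```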
Neural networks can be applied to binary classification by using a single output neuron with a sigmoid activation function. The network learns complex, nonlinear relationships between input features and the target class through multiple layers of interconnected neurons. Deep learning models, such as convolutional neural networks for image data and recurrent neural networks for sequential data, have achieved strong performance on many binary classification benchmarks.
For binary classification, the output layer consists of one neuron with sigmoid activation, producing a value between 0 and 1 that represents the predicted probability of the positive class. The network is trained using binary cross-entropy as the loss function and optimized with algorithms such as stochastic gradient descent or Adam.
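A minimal PyTorch sketch of this output-layer setup (the framework choice is an assumption); BCEWithLogitsLoss fuses the sigmoid with binary cross-entropy for numerical stability:

```python
import torch
import torch.nn as nn

# One hidden layer, one output logit for the positive class
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.BCEWithLogitsLoss()  # sigmoid + binary cross-entropy, fused
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

X = torch.randn(256, 10)                   # synthetic features
y = torch.randint(0, 2, (256, 1)).float()  # synthetic 0/1 labels

for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

p = torch.sigmoid(model(X))  # predicted probability of the positive class
```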
Naive Bayes classifiers apply Bayes' theorem with a strong independence assumption between features. Despite this simplifying assumption, Naive Bayes performs surprisingly well on many binary classification tasks, particularly text classification problems like spam detection and sentiment analysis. The algorithm is computationally efficient, requires minimal training data, and scales well to high-dimensional feature spaces. Variants include Gaussian Naive Bayes (for continuous features), Multinomial Naive Bayes (for count data), and Bernoulli Naive Bayes (for binary features).
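A minimal sketch of the classic spam-filter setup, Multinomial Naive Bayes over word counts (the tiny corpus is invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy corpus, for illustration only
texts = ["win a free prize now", "meeting at noon tomorrow",
         "free cash click here", "lunch with the team"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(texts, labels)
print(spam_filter.predict(["claim your free prize"]))  # [1], i.e. spam
```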
| Algorithm | Interpretability | Handles Non-linearity | Training Speed | Prone to Overfitting | Best For |
|---|---|---|---|---|---|
| Logistic Regression | High | No (linear) | Fast | Low | Linearly separable data, baseline models |
| SVM | Low (with kernels) | Yes (with kernels) | Moderate | Moderate | High-dimensional data, small to medium datasets |
| Decision Tree | High | Yes | Fast | High | Exploratory analysis, interpretable models |
| Random Forest | Moderate | Yes | Moderate | Low | General-purpose, tabular data |
| Gradient Boosting | Low | Yes | Slow | Moderate (with tuning) | Tabular data, competitions, high accuracy needs |
| Neural Network | Low | Yes | Slow | High (without regularization) | Large datasets, unstructured data (images, text) |
| Naive Bayes | High | No (linear boundaries) | Very fast | Low | Text classification, small datasets, baselines |
A loss function quantifies the difference between a model's predictions and the true labels. For binary classification, the two most commonly used loss functions are binary cross-entropy (log loss) and hinge loss.
The standard loss function for training binary classifiers is binary cross-entropy, also called log loss. For a single training example with true label y (0 or 1) and predicted probability p, the binary cross-entropy loss is defined as:
L(y, p) = -[y * log(p) + (1 - y) * log(1 - p)]
When y = 1 (positive class), the loss simplifies to -log(p), which penalizes the model heavily when it assigns a low probability to a true positive. When y = 0 (negative class), the loss becomes -log(1 - p), penalizing high predicted probabilities for actual negatives. The logarithmic nature of this function creates a large gradient for confident but incorrect predictions, which accelerates learning during backpropagation.
For a dataset of N samples, the overall loss is the average across all samples:
L = -(1/N) * sum[y_i * log(p_i) + (1 - y_i) * log(1 - p_i)]
Binary cross-entropy is a convex function for logistic regression, guaranteeing a single global minimum. For neural networks, the optimization landscape is non-convex, but binary cross-entropy still provides well-behaved gradients that facilitate training.
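A direct NumPy translation of this formula (the clipping constant is an implementation detail added to avoid log(0)):

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Average log loss over N samples; clipping guards against log(0)."""
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1, 1])
p_pred = np.array([0.9, 0.2, 0.6, 0.95])
print(binary_cross_entropy(y_true, p_pred))  # ~0.22
```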
Hinge loss is the loss function used by support vector machines. Unlike binary cross-entropy, hinge loss operates on raw scores (logits) rather than probabilities and does not produce probabilistic output. For a single training example with true label y (encoded as -1 or +1) and predicted score f(x), hinge loss is:
L(y, f(x)) = max(0, 1 - y * f(x))
Hinge loss equals zero when the prediction is correct and the margin (the product y * f(x)) exceeds 1. When the margin is less than 1, the loss increases linearly. This means hinge loss penalizes not only incorrect predictions but also correct predictions that fall too close to the decision boundary. The focus on margin maximization is what gives SVMs their strong generalization properties.
Because hinge loss is not differentiable at the point where y * f(x) = 1, subgradient methods are used during optimization.
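A direct NumPy translation (note the -1/+1 label encoding, which differs from the 0/1 encoding used by cross-entropy):

```python
import numpy as np

def hinge_loss(y_true, scores):
    """Mean hinge loss; y_true must be encoded as -1/+1, scores are raw f(x)."""
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

y_true = np.array([+1, -1, +1])
scores = np.array([2.3, -0.4, 0.7])  # all correct, but the last two sit inside the margin
print(hinge_loss(y_true, scores))    # (0 + 0.6 + 0.3) / 3 = 0.3
```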
| Property | Binary Cross-Entropy | Hinge Loss |
|---|---|---|
| Output type | Probability (0 to 1) | Raw score (logit) |
| Probabilistic interpretation | Yes | No |
| Primary algorithm | Logistic regression, neural networks | SVM |
| Penalizes confident mistakes | Heavily (logarithmic) | Linearly |
| Margin-based | No | Yes |
| Smoothness | Smooth and differentiable | Not differentiable at hinge point |
| Best when | Calibrated probabilities are needed | Maximum-margin separation is desired |
Most binary classifiers produce a raw score (logit) that must be transformed into a probability. The sigmoid function (also called the logistic function) performs this transformation:
sigmoid(z) = 1 / (1 + exp(-z))
The sigmoid function maps any real-valued number z to the range (0, 1). When z is a large positive number, sigmoid(z) approaches 1; when z is a large negative number, sigmoid(z) approaches 0; and when z = 0, sigmoid(z) = 0.5. This S-shaped curve provides a smooth, differentiable mapping from logits to probabilities.
In logistic regression, the logit z is the linear combination of input features: z = w^T * x + b. In neural networks, z is the output of the final layer before the activation function. The resulting probability p = sigmoid(z) represents the model's confidence that the input belongs to the positive class.
The sigmoid function has a useful derivative property: sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)), which simplifies gradient computation during training.
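Both the function and its derivative are one-liners in NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # the derivative identity stated above

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # [0.0067 0.5 0.9933]
```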
The raw probability output of a binary classifier does not always accurately reflect the true likelihood of class membership. For example, a model might predict a probability of 0.8 for a set of instances, but only 60% of those instances actually belong to the positive class. When predicted probabilities match observed frequencies, the model is said to be well-calibrated. Calibration is important in applications where the predicted probability itself drives decisions, such as medical risk scoring or insurance pricing.
Some algorithms, like logistic regression, tend to produce well-calibrated probabilities by default. Others, like SVMs, random forests, and gradient boosting models, often require post-hoc calibration.
Platt scaling fits a logistic regression model to the classifier's output scores, learning parameters A and B such that the calibrated probability is p = 1 / (1 + exp(A * f(x) + B)). This method works well when the distortion in predicted probabilities follows a sigmoid shape, which is commonly the case for SVMs and neural networks. Platt scaling is simple to implement and works well even with small calibration datasets.
Isotonic regression is a non-parametric calibration method that learns a piecewise constant, monotonically increasing function mapping classifier scores to calibrated probabilities. It is more flexible than Platt scaling and can correct any monotonic distortion. However, isotonic regression requires more calibration data (typically 1,000+ samples) to avoid overfitting. When sufficient data is available, it generally outperforms Platt scaling.
Calibration should always be performed on a held-out calibration set (separate from both the training and test sets) to avoid biasing the calibrated probabilities.
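Both methods are available through scikit-learn's CalibratedClassifierCV; a minimal sketch (synthetic data, parameter values illustrative), where cv=5 keeps the calibration folds separate from the data each underlying model is fitted on:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# method="sigmoid" is Platt scaling; method="isotonic" fits isotonic regression
calibrated = CalibratedClassifierCV(SVC(), method="sigmoid", cv=5)
calibrated.fit(X, y)
probs = calibrated.predict_proba(X)[:, 1]  # calibrated P(y=1|x)
```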
After a binary classifier produces a probability score, a decision threshold must be applied to convert this probability into a class label. The default threshold is 0.5: if the predicted probability exceeds 0.5, the input is classified as positive; otherwise, it is classified as negative. However, the optimal threshold is often not 0.5, especially when dealing with imbalanced classes or when the costs of different types of errors are unequal.
Several approaches exist for determining the best threshold:
| Method | Strategy | Best When |
|---|---|---|
| ROC curve analysis | Maximize Youden index (sensitivity + specificity - 1) | Overall discrimination matters |
| Precision-recall tradeoff | Tune threshold to favor precision or recall | One type of error is more costly |
| Cost-sensitive optimization | Minimize total expected cost given asymmetric error costs | FP and FN have different financial costs |
| F1 score maximization | Choose threshold producing highest F1 on validation set | Balanced tradeoff between precision and recall |
| Business requirements | Domain experts set threshold based on operational constraints | Regulatory or policy constraints exist |
For example, in medical screening, a lower threshold increases recall (catching more true positives) at the cost of reduced precision (more false positives). In fraud detection, missing a fraudulent transaction (false negative) may be far more costly than flagging a legitimate one (false positive), so the threshold is lowered accordingly.
Threshold selection should always be performed on a validation set, not the training set, to avoid overfitting the threshold to training data.
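A minimal sketch of F1-maximizing threshold selection on a validation set (the labels and probabilities below are invented placeholders):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Placeholder validation labels and predicted probabilities
y_val = np.array([0, 0, 1, 1, 1, 0, 1, 0])
p_val = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.9, 0.55])

precision, recall, thresholds = precision_recall_curve(y_val, p_val)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best_threshold = thresholds[np.argmax(f1[:-1])]  # final P/R point has no threshold
print(best_threshold)
```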
The confusion matrix is a fundamental tool for evaluating binary classifiers. It organizes predictions into a 2x2 table based on the actual and predicted class labels:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
Each cell in the confusion matrix captures a specific type of prediction outcome: a true positive is an actual positive correctly predicted as positive; a false negative is an actual positive incorrectly predicted as negative; a false positive is an actual negative incorrectly predicted as positive; and a true negative is an actual negative correctly predicted as negative.
All standard binary classification metrics can be derived from these four values. The confusion matrix provides a complete picture of classifier performance and reveals patterns of errors that aggregate metrics may obscure.
The performance of a binary classification model is evaluated using metrics such as accuracy, precision, recall, and the F1 score. When selecting an evaluation metric, consider the specific problem at hand and the relative importance of correctly recognizing each class.
| Metric | Formula | Range | Description |
|---|---|---|---|
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | 0 to 1 | Proportion of all predictions that are correct |
| Precision | TP / (TP + FP) | 0 to 1 | Proportion of positive predictions that are correct |
| Recall (Sensitivity) | TP / (TP + FN) | 0 to 1 | Proportion of actual positives correctly identified |
| F1 Score | 2 * Precision * Recall / (Precision + Recall) | 0 to 1 | Harmonic mean of precision and recall |
| Specificity (TNR) | TN / (TN + FP) | 0 to 1 | Proportion of actual negatives correctly identified |
| AUC-ROC | Area under the ROC curve | 0 to 1 | Model's ability to discriminate between classes across all thresholds |
| AUC-PR | Area under the Precision-Recall curve | 0 to 1 | Model performance on the positive class across thresholds |
| MCC | (TP*TN - FP*FN) / sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN)) | -1 to 1 | Correlation between predicted and actual classifications |
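Every row of this table can be computed directly from the four confusion-matrix counts; a minimal sketch with invented counts:

```python
import math

def metrics_from_confusion(tp, fp, fn, tn):
    """Derives the standard metrics from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return accuracy, precision, recall, specificity, f1, mcc

print(metrics_from_confusion(tp=80, fp=10, fn=20, tn=890))
```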
Accuracy measures the percentage of correct predictions made by the model on a set of test data. It is calculated as (TP + TN) / (TP + TN + FP + FN). While intuitive and easy to understand, accuracy can be misleading when the dataset is imbalanced. For example, if 95% of transactions are legitimate, a model that always predicts "not fraud" achieves 95% accuracy while failing to detect any actual fraud.
Precision is the proportion of true positive predictions among all positive predictions made by the model. It answers the question: "Of all the instances the model labeled positive, how many were actually positive?" Precision is especially important in applications where false positives are costly, such as spam filtering (where a legitimate email incorrectly flagged as spam may cause the user to miss important messages).
Recall measures the proportion of actual positives that the model correctly identified. It answers the question: "Of all the actual positive instances, how many did the model catch?" Recall is critical in applications where false negatives are costly, such as medical diagnosis (where missing a disease could delay treatment) or fraud detection.
The F1 score is the harmonic mean of precision and recall, providing a single metric that balances both concerns. It is calculated as:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
The harmonic mean gives more weight to lower values, so the F1 score is high only when both precision and recall are high. This makes it a useful metric when the costs of false positives and false negatives are roughly equal and the dataset is imbalanced.
Specificity measures the proportion of actual negatives that the model correctly identified. It is calculated as TN / (TN + FP). Specificity is the counterpart of recall for the negative class. In medical testing, specificity indicates how well a test correctly identifies people who do not have a disease.
The ROC curve (Receiver Operating Characteristic) plots the true positive rate (recall) against the false positive rate (1 - specificity) at various classification thresholds. The area under this curve (AUC-ROC) provides a threshold-independent measure of a model's discriminative ability. An AUC-ROC of 1.0 represents a perfect classifier, while 0.5 represents random guessing.
AUC-ROC is widely used because it summarizes performance across all possible thresholds in a single number. However, it can be overly optimistic when the dataset is heavily imbalanced, because the false positive rate denominator (FP + TN) is dominated by the large number of true negatives.
The Precision-Recall curve plots precision against recall at various thresholds. The area under this curve (AUC-PR) is particularly informative for imbalanced datasets where the positive class is rare. Unlike AUC-ROC, the Precision-Recall curve does not include true negatives in its computation, so it provides a clearer picture of how well the model identifies the minority class.
The baseline AUC-PR for a random classifier equals the prevalence of the positive class. In a dataset where only 1% of samples are positive, a random classifier achieves an AUC-PR of approximately 0.01, making it easier to distinguish meaningful improvements from chance performance.
The Matthews correlation coefficient is a balanced metric that uses all four values from the confusion matrix. It produces a value between -1 and +1, where +1 indicates perfect prediction, 0 indicates performance no better than random, and -1 indicates total disagreement between predictions and actual labels.
MCC is considered one of the most reliable single-number measures for binary classification quality because it accounts for the relative sizes of all four confusion matrix categories. Research has shown that MCC is more informative than the F1 score and accuracy, especially on imbalanced datasets, because it produces a high score only when the classifier performs well on both the positive and negative classes.
Class imbalance occurs when one class significantly outnumbers the other in the training data. This is common in real-world binary classification problems: fraudulent transactions may represent less than 0.1% of all transactions, and disease-positive cases may be a small fraction of all patients. Models trained on imbalanced data tend to be biased toward the majority class, resulting in poor recall for the minority class.
Several techniques address class imbalance:
SMOTE generates synthetic samples for the minority class by interpolating between existing minority-class instances. For each minority-class sample, SMOTE identifies its k-nearest neighbors (typically k = 5), then creates new synthetic examples at random points along the line segments connecting the sample to its neighbors.
SMOTE produces more diverse synthetic samples than simple random oversampling (which merely duplicates existing examples) and helps the classifier learn a more generalizable decision boundary. Several variants have been developed, including Borderline-SMOTE (which focuses on generating samples near the class boundary), SMOTE-ENN (which combines oversampling with Edited Nearest Neighbors for cleaning), and ADASYN (Adaptive Synthetic Sampling, which generates more samples in regions where the classifier performs poorly).
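A minimal sketch using the third-party imbalanced-learn library (an assumption; the article does not name an implementation):

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic dataset with roughly 5% positives, for illustration
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
print(Counter(y))  # heavily imbalanced

# k_neighbors=5 matches the typical k mentioned above
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y_res))  # classes balanced after interpolation
```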
Instead of modifying the dataset, class weights adjust the loss function so that misclassifications of the minority class receive a higher penalty. For example, if the positive class represents 10% of the data, its weight might be set to 9 (the ratio of negative to positive samples), making each minority-class error nine times more influential during training.
Most machine learning frameworks support class weighting natively. In scikit-learn, the class_weight='balanced' parameter automatically adjusts weights inversely proportional to class frequencies.
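For example (synthetic data; the explicit dict mirrors the 10%-positive scenario above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

# "balanced" sets each class weight to n_samples / (n_classes * n_c)
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Equivalent idea with explicit weights: minority-class errors count 9x
clf_manual = LogisticRegression(class_weight={0: 1, 1: 9}).fit(X, y)
```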
Rather than using the default 0.5 threshold, the classification threshold can be lowered to increase the model's sensitivity to the minority class. For example, setting the threshold to 0.3 means that any predicted probability above 0.3 is classified as positive, increasing recall at the potential cost of precision. Threshold calibration on a validation set allows practitioners to find the threshold that optimizes the desired metric.
| Technique | Category | Description |
|---|---|---|
| Random undersampling | Data-level | Removes random majority-class samples to balance the dataset |
| Random oversampling | Data-level | Duplicates random minority-class samples |
| SMOTE | Data-level | Generates synthetic minority samples via interpolation |
| ADASYN | Data-level | Adaptively generates more synthetic samples in difficult regions |
| Class weights | Algorithm-level | Adjusts the loss function to penalize minority-class errors more heavily |
| Threshold adjustment | Post-processing | Modifies the decision boundary after model training |
| Ensemble methods | Algorithm-level | Combines multiple models trained on balanced subsets (e.g., BalancedRandomForest) |
| Cost-sensitive learning | Algorithm-level | Incorporates misclassification costs directly into the training objective |
Binary classification is distinct from both multi-class classification and multi-label classification. Understanding the differences is important for selecting the correct modeling approach.
| Type | Number of Classes | Labels per Sample | Output Function | Loss Function | Example |
|---|---|---|---|---|---|
| Binary classification | 2 | 1 | Sigmoid | Binary cross-entropy | Spam or not spam |
| Multi-class classification | 3 or more | 1 | Softmax | Categorical cross-entropy | Classifying an image as cat, dog, or bird |
| Multi-label classification | 2 or more | 0 or more | Sigmoid (per label) | Binary cross-entropy (per label) | Tagging a movie as action, comedy, and drama simultaneously |
In multi-class classification, the classes are mutually exclusive: each sample belongs to exactly one class. The softmax function is typically used in the output layer instead of sigmoid, and categorical cross-entropy replaces binary cross-entropy as the loss function.
In multi-label classification, the classes are not mutually exclusive: a single sample can belong to multiple classes at the same time. This is often handled by treating each label as an independent binary classification problem. The output layer uses a sigmoid activation for each label, and the model is trained with binary cross-entropy computed independently for each label.
Binary classification can also serve as a building block for multi-class problems through strategies such as one-vs-rest (OvR), where a separate binary classifier is trained for each class, and one-vs-one (OvO), where a binary classifier is trained for every pair of classes.
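A minimal sketch of the one-vs-rest strategy using scikit-learn's built-in wrapper on a three-class dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # three classes

# Trains one binary classifier per class: "class c" vs. "all other classes"
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(ovr.predict(X[:5]))
```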
In the theoretical framework of statistical learning, the Bayes optimal classifier represents the best possible binary classifier for a given problem. It assigns each input to the class with the highest posterior probability:
y* = argmax_c P(y = c | x)
For binary classification, this simplifies to predicting the positive class when P(y = 1 | x) > 0.5 and the negative class otherwise. The Bayes optimal classifier achieves the lowest possible error rate (known as the Bayes error rate) for a given data distribution. No other classifier can outperform it in expectation.
In practice, the true class-conditional distributions are unknown, so the Bayes optimal classifier cannot be computed directly. All practical machine learning algorithms attempt to approximate it by learning a decision boundary from training data. Understanding the Bayes optimal classifier provides a theoretical benchmark for evaluating how close a learned model comes to the best achievable performance.
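In the rare case where the class-conditional densities are known exactly, the Bayes optimal classifier can be written down directly; the one-dimensional Gaussian setup below is invented purely for illustration:

```python
import numpy as np
from scipy.stats import norm

# Assumed known densities: P(x|y=0) = N(0,1), P(x|y=1) = N(2,1), equal priors
def bayes_optimal(x):
    p1 = norm.pdf(x, loc=2.0)
    p0 = norm.pdf(x, loc=0.0)
    posterior_1 = p1 / (p0 + p1)             # P(y=1|x) by Bayes' theorem
    return (posterior_1 > 0.5).astype(int)   # decision boundary falls at x = 1

x = np.array([-1.0, 0.5, 1.5, 3.0])
print(bayes_optimal(x))  # [0 0 1 1]
```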
Binary classification is used extensively across many domains. The following table summarizes prominent applications:
| Application | Positive Class | Negative Class | Common Algorithms | Key Metric |
|---|---|---|---|---|
| Spam detection | Spam email | Legitimate email | Naive Bayes, Logistic Regression, SVM | Precision |
| Fraud detection | Fraudulent transaction | Legitimate transaction | Gradient Boosting, Random Forest, Neural Networks | Recall, AUC-PR |
| Medical diagnosis | Disease present | Disease absent | Logistic Regression, Random Forest, Deep Learning | Recall, Specificity |
| Sentiment analysis | Positive sentiment | Negative sentiment | BERT, Logistic Regression, SVM | F1 Score, Accuracy |
| Churn prediction | Customer will churn | Customer will stay | Gradient Boosting, Logistic Regression | AUC-ROC |
| Manufacturing defect detection | Defective product | Non-defective product | CNN, SVM | Recall, Precision |
| Credit scoring | Default | No default | Logistic Regression, Gradient Boosting | AUC-ROC, MCC |
| Disease screening | Positive test | Negative test | Logistic Regression, Random Forest | Sensitivity, Specificity |
Spam detection was one of the earliest large-scale applications of binary classification. Email providers classify incoming messages as spam or not spam using features such as the presence of certain keywords, the sender's reputation, header information, and embedded links. Early systems relied on Naive Bayes classifiers, while modern systems use combinations of deep learning models and rule-based filters.
Financial institutions deploy binary classifiers to flag potentially fraudulent transactions in real time. These systems analyze features such as transaction amount, location, time, merchant category, and user behavior patterns. The extreme class imbalance (fraud typically accounts for less than 0.1% of transactions) makes this a challenging application where recall and AUC-PR are often prioritized over accuracy.
Binary classification is widely applied in medical settings to detect diseases from clinical data, lab results, or medical images. Examples include classifying mammograms as showing malignant or benign tissue, predicting whether a patient has diabetes based on blood test results, and detecting pneumonia from chest X-rays. In medical applications, the cost of false negatives (missing a disease) is typically much higher than the cost of false positives (a healthy person flagged for further testing), so models are often tuned for high recall.
Sentiment analysis classifies text (such as product reviews, social media posts, or customer feedback) as expressing positive or negative sentiment. While sentiment can also be modeled as a multi-class problem (positive, neutral, negative) or on a continuous scale, the binary formulation remains common. Modern approaches use transformer-based models such as BERT fine-tuned on labeled sentiment data.
Imagine you have a big box of apples. Some apples are good and some are bad. Your job is to look at each apple and sort it into one of two buckets: the "good" bucket or the "bad" bucket. That is binary classification. You look at clues like the color, shape, and whether it has spots to make your decision.
A computer does the same thing, but instead of apples it might be sorting emails into "spam" or "not spam," or looking at a medical test to decide "sick" or "healthy." The computer learns from lots of examples that people have already sorted. After seeing enough examples, it gets good at guessing which bucket new items belong in, even if it has never seen those exact items before.
The word "binary" means "two," so binary classification simply means sorting things into exactly two groups.