See also: machine learning terms, confusion matrix, precision, recall, F1 score
Accuracy in machine learning is a metric that measures the performance of a classification model. It represents the fraction of correct predictions made by the model on a given dataset compared to the total number of predictions. Accuracy is one of the most frequently used evaluation metrics in machine learning and serves as a standard for comparing models across a wide range of tasks.
Because accuracy is so intuitive, it is often the first metric that practitioners check after training a classifier. A model that achieves 95% accuracy, for instance, is correct 95 times out of every 100 predictions. This simplicity makes accuracy easy to explain to stakeholders who may not have a technical background. At the same time, that simplicity can be deceptive; a high accuracy score does not always mean a model is performing well, a point explored in detail in later sections of this article.
The concept of measuring classification correctness predates modern machine learning. In statistics, the classification error rate (1 minus accuracy) has been used since at least the mid-20th century to evaluate discriminant analysis and other classification methods. Frank Rosenblatt and other early researchers used confusion matrices to compare human and machine classifications of visual and auditory stimuli during the development of the perceptron in the late 1950s and 1960s. As machine learning grew into a distinct discipline, accuracy became the default evaluation metric for supervised classification tasks across domains ranging from image recognition to medical diagnosis.
Accuracy is defined as the ratio between the number of correct predictions and the total number of predictions made by a classifier. In mathematical notation:
Accuracy = (Number of correct predictions) / (Total number of predictions)
For a binary classification problem, this formula can be expressed using the four outcomes of a confusion matrix:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Where:
| Symbol | Name | Meaning |
|--------|------|---------|
| TP | True Positive | Instances correctly predicted as positive |
| TN | True Negative | Instances correctly predicted as negative |
| FP | False Positive | Instances incorrectly predicted as positive (Type I error) |
| FN | False Negative | Instances incorrectly predicted as negative (Type II error) |
For instance, if a model is trained to classify images of cats and dogs and tested on 100 images, and it correctly identifies 80 of them, its accuracy is 80/100 = 0.8 or 80%.
The complement of accuracy is the error rate (also called the misclassification rate or 0-1 loss):
Error Rate = 1 - Accuracy = (FP + FN) / (TP + TN + FP + FN)
An accuracy of 0.92 corresponds to an error rate of 0.08, meaning 8% of predictions are incorrect. The error rate and accuracy always sum to 1.0, so they contain exactly the same information. The choice between reporting one or the other is largely a matter of convention; some fields and competitions prefer error rate because it highlights the mistakes a model makes rather than its successes.
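A minimal sketch with hypothetical confusion-matrix counts makes the relationship between accuracy and error rate concrete:

```python
# Hypothetical confusion-matrix counts for a binary classifier
tp, tn, fp, fn = 50, 42, 5, 3

total = tp + tn + fp + fn
accuracy = (tp + tn) / total      # (50 + 42) / 100 = 0.92
error_rate = (fp + fn) / total    # (5 + 3) / 100 = 0.08

print(accuracy, error_rate, accuracy + error_rate)  # 0.92 0.08 1.0
```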
For problems with more than two classes, accuracy generalizes naturally. If there are k classes, the classifier assigns each instance to one of the k labels, and accuracy is still the total number of correct predictions divided by the total number of instances. The confusion matrix for a k-class problem becomes a k x k matrix, where the diagonal entries represent correct classifications and off-diagonal entries represent misclassifications.
In multilabel classification, each instance can belong to multiple classes simultaneously. In this setting, the strictest form of accuracy is subset accuracy (also called exact match ratio), which counts a prediction as correct only if the entire set of predicted labels matches the true label set exactly. Scikit-learn's accuracy_score function uses subset accuracy by default for multilabel inputs. Because this measure is very strict, it tends to be lower than other metrics in multilabel settings.
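To see how strict subset accuracy is, the sketch below applies scikit-learn's accuracy_score to hypothetical multilabel indicator arrays; a prediction that gets two of three labels right still counts as wrong:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Rows are instances, columns are labels (hypothetical multilabel data)
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 1],    # exact match -> correct
                   [0, 1, 1],    # one label wrong -> entire row counts as incorrect
                   [1, 1, 0]])   # exact match -> correct

print(accuracy_score(y_true, y_pred))  # 0.666..., the subset accuracy
```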
Accuracy is most useful when the classes in a dataset are balanced, meaning there are approximately equal numbers of samples for each class. In such cases, accuracy serves as a reliable indicator of the model's overall performance.
Accuracy works well under the following conditions:
| Condition | Why accuracy works |
|---|---|
| Balanced class distribution | No class dominates the score by sheer frequency |
| Equal misclassification costs | A false positive is roughly as costly as a false negative |
| Single summary metric needed | Accuracy compresses all predictions into one number for quick comparison |
| Benchmark comparisons | Standardized datasets with balanced classes (e.g., CIFAR-10, MNIST) use accuracy as the primary leaderboard metric |
For standard image classification benchmarks, accuracy has long been the go-to metric. On ImageNet, top-1 and top-5 accuracy are the standard measures reported in research papers and competitions. Similarly, benchmarks like MNIST (handwritten digit recognition) and CIFAR-10 (object recognition across 10 balanced classes) use accuracy because their class distributions are uniform.
When classes are imbalanced (one class has significantly more samples than the other), accuracy may not be a reliable measure of model performance. This phenomenon is known as the accuracy paradox: a high accuracy score can coexist with poor predictive usefulness. A model that always predicts the majority class can achieve a high accuracy score while being completely useless for the task at hand.
Consider these concrete scenarios where accuracy is misleading on a class-imbalanced dataset:
| Domain | Class Distribution | Naive Strategy | Accuracy | Actual Usefulness |
|---|---|---|---|---|
| Fraud detection | 98% legitimate, 2% fraudulent | Always predict "legitimate" | 98% | Detects zero fraud |
| Spam filtering | 90% non-spam, 10% spam | Always predict "non-spam" | 90% | Filters zero spam |
| Medical screening | 99.5% healthy, 0.5% diseased | Always predict "healthy" | 99.5% | Misses every case of disease |
| Terrorism detection | 99.999% non-threat | Always predict "non-threat" | 99.999% | Completely non-functional |
In a fraud detection system where only 2% of transactions are fraudulent, a model that labels every transaction as legitimate achieves 98% accuracy. Despite this impressive-sounding number, the model fails at its primary purpose: catching fraud.
The root cause is that accuracy weights all correct predictions equally. In an imbalanced dataset, correctly classifying the majority class contributes far more to the accuracy score than correctly classifying the minority class. A model can therefore "cheat" by ignoring the minority class entirely and still report a high accuracy. This is why alternative metrics like precision, recall, F1 score, and the Matthews correlation coefficient exist; they account for the distribution of predictions across classes.
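One quick diagnostic is to compare a model against a majority-class baseline. The sketch below uses a hypothetical imbalanced dataset and scikit-learn's DummyClassifier; the baseline reports 98% accuracy even though it never detects the minority class:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score, f1_score

# Hypothetical imbalanced data: 98% class 0 (legitimate), 2% class 1 (fraud)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([0] * 980 + [1] * 20)

# Baseline that always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

print(accuracy_score(y, y_pred))             # 0.98 -- looks impressive
print(recall_score(y, y_pred))               # 0.0  -- catches no fraud
print(f1_score(y, y_pred, zero_division=0))  # 0.0  -- exposed immediately by F1
```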
To illustrate the paradox numerically, consider a city security system monitoring 1,000,000 people, of whom 10 are actual threats. A system that flags 1,000 people (including all 10 real threats) achieves 99.9% accuracy (999,010 correct out of 1,000,000). Yet its precision is only 1% (10 out of 1,000 flagged), meaning 99% of alerts are false alarms. The accuracy score completely obscures this problem.
Although accuracy is most commonly associated with classification, the term appears in other areas of machine learning and data science, sometimes with different meanings.
In classification, accuracy has the standard definition described above: the proportion of instances where the predicted label matches the true label. This applies to binary, multiclass, and multilabel settings.
Accuracy is not a standard metric for regression problems, where the output is a continuous value rather than a discrete label. Regression tasks instead use metrics such as mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and R-squared. Some practitioners loosely refer to "accuracy" in regression contexts to mean the degree of closeness between predicted and actual values, but this informal usage should be avoided in technical writing because it conflates two different measurement concepts.
In information retrieval, the concept closest to classification accuracy is retrieval effectiveness, but the standard metrics differ. Precision and recall are used to evaluate whether a system returns relevant documents and avoids irrelevant ones. Metrics such as precision at k (P@k), mean average precision (MAP), and normalized discounted cumulative gain (nDCG) are preferred because they account for ranking position, not just correctness. Classification accuracy would be unsuitable for retrieval because the vast majority of documents in a collection are irrelevant to any given query, creating an extreme class imbalance problem.
Accuracy is one member of a family of classification metrics, each capturing a different aspect of model performance. Understanding how these metrics relate to accuracy helps practitioners choose the right evaluation approach for their specific problem.
Precision (also called positive predictive value) measures the fraction of positive predictions that are actually correct:
Precision = TP / (TP + FP)
Precision answers the question: "Of all instances the model labeled positive, how many actually were positive?" High precision means the model rarely produces false alarms. Precision is especially important in contexts where false positives are costly, such as email spam filtering (marking legitimate email as spam is disruptive) or criminal justice (false accusations have severe consequences).
Recall (also called sensitivity or true positive rate) measures the fraction of actual positive instances that the model correctly identified:
Recall = TP / (TP + FN)
Recall answers the question: "Of all instances that truly were positive, how many did the model catch?" High recall means the model misses few positive cases. Recall is critical in medical diagnosis (missing a cancer diagnosis is dangerous) and fraud detection (missing fraudulent transactions leads to financial losses).
Specificity (also called the true negative rate) measures how well the model identifies negative instances:
Specificity = TN / (TN + FP)
While recall focuses on the positive class, specificity focuses on the negative class. Together, sensitivity and specificity provide a complete picture of how well a model handles both classes. These two metrics are especially prominent in medical and clinical research.
The F1 score is the harmonic mean of precision and recall:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
The F1 score ranges from 0 to 1 and balances the trade-off between precision and recall. It is particularly useful when class distributions are uneven, because it penalizes models that sacrifice one metric for the other. When precision and recall diverge significantly, the F1 score will be pulled toward the lower of the two values, exposing weaknesses that accuracy might hide.
The Matthews correlation coefficient (MCC) is a balanced measure that accounts for all four cells of the confusion matrix:
MCC = (TP * TN - FP * FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
MCC returns a value between -1 and +1, where +1 indicates perfect prediction, 0 indicates performance no better than random, and -1 indicates total disagreement. A 2020 study by Chicco and Jurman published in BMC Genomics demonstrated that MCC is more informative than both F1 score and accuracy for binary classification evaluation, especially on imbalanced datasets. For example, a classifier that always predicts positive on an imbalanced set (95 positive, 5 negative) achieves 95% accuracy and an F1 score of 0.974, yet its MCC is 0 (the formula's denominator vanishes when a class is never predicted, and the score is conventionally set to 0), correctly flagging the classifier as no better than random.
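All of these related metrics can be computed from the same set of predictions. A short sketch with hypothetical labels:

```python
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             matthews_corrcoef, confusion_matrix)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(precision_score(y_true, y_pred))    # TP / (TP + FP) = 3/5 = 0.6
print(recall_score(y_true, y_pred))       # TP / (TP + FN) = 3/4 = 0.75
print(f1_score(y_true, y_pred))           # harmonic mean of the two, about 0.667
print(tn / (tn + fp))                     # specificity = 4/6, about 0.667
print(matthews_corrcoef(y_true, y_pred))  # about 0.41
```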
The following table summarizes when to use each metric:
| Metric | Formula | Range | Best for |
|---|---|---|---|
| Accuracy | (TP+TN) / (TP+TN+FP+FN) | 0 to 1 | Balanced datasets, quick overview |
| Precision | TP / (TP+FP) | 0 to 1 | Minimizing false positives |
| Recall | TP / (TP+FN) | 0 to 1 | Minimizing false negatives |
| F1 Score | Harmonic mean of precision and recall | 0 to 1 | Imbalanced datasets, trade-off between precision and recall |
| Specificity | TN / (TN+FP) | 0 to 1 | Evaluating negative class performance |
| MCC | Correlation between predicted and actual | -1 to +1 | Imbalanced datasets, comprehensive evaluation |
Several modified forms of accuracy have been developed to address the limitations of standard accuracy in particular settings.
Balanced accuracy compensates for class imbalance by computing the average of recall (sensitivity) obtained on each class:
Balanced Accuracy = (Sensitivity + Specificity) / 2
For a multiclass problem, balanced accuracy is the macro-average of per-class recall values:
Balanced Accuracy = (1 / n_classes) * sum(recall_i for each class i)
If a binary classifier achieves 100% recall on the majority class but 0% on the minority class, standard accuracy might be 95% (with a 95/5 split), but balanced accuracy would be 50%, correctly reflecting the model's failure to learn the minority class. Scikit-learn provides balanced_accuracy_score in its sklearn.metrics module. The function also supports an adjusted parameter that rescales the score so that random performance yields 0 rather than 1/n_classes, making it easier to compare against chance performance.
Balanced accuracy is equivalent to computing accuracy_score with class-balanced sample weights. This makes it particularly useful when reporting a single number for datasets where some classes are much rarer than others.
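The sketch below, using hypothetical imbalanced labels, contrasts standard accuracy with balanced accuracy and shows the adjusted option mentioned above:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Nine majority-class instances, one minority-class instance (hypothetical)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # the model ignores the minority class

print(accuracy_score(y_true, y_pred))                          # 0.9
print(balanced_accuracy_score(y_true, y_pred))                 # 0.5
print(balanced_accuracy_score(y_true, y_pred, adjusted=True))  # 0.0, i.e. chance level
```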
Top-k accuracy relaxes the standard accuracy criterion by counting a prediction as correct if the true label appears anywhere among the model's top k predicted classes (ranked by predicted probability or score). Top-1 accuracy is equivalent to standard accuracy.
Top-5 accuracy became widely known through the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), where models had to classify images into one of 1,000 categories. Because many images contain ambiguous content or multiple objects, top-5 accuracy gave models credit for having the correct class among their five most confident predictions. AlexNet, the deep learning model that won ILSVRC 2012 and launched the modern deep learning era, achieved a top-5 error rate of 15.3%. By 2015, ResNet had reduced the top-5 error rate to 3.57%, surpassing human-level performance on this particular benchmark.
Scikit-learn provides top_k_accuracy_score for computing this metric. Top-k accuracy is also commonly used in recommendation systems and information retrieval.
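A minimal sketch of top-k accuracy in scikit-learn, using hypothetical probability scores over three classes:

```python
from sklearn.metrics import top_k_accuracy_score

y_true = [0, 1, 2, 2]
# Hypothetical predicted probabilities for classes 0, 1, and 2
y_score = [[0.5, 0.3, 0.2],
           [0.2, 0.3, 0.5],
           [0.1, 0.6, 0.3],
           [0.2, 0.2, 0.6]]

print(top_k_accuracy_score(y_true, y_score, k=1))  # 0.5, equivalent to standard accuracy
print(top_k_accuracy_score(y_true, y_score, k=2))  # 1.0, true class always in the top 2
```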
Per-class accuracy (also called class-wise accuracy) calculates accuracy separately for each class. This decomposition reveals whether a model performs uniformly across all classes or excels at some while failing at others. Per-class accuracy is especially informative for multiclass problems where some classes are inherently harder to distinguish.
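Per-class accuracy can be read off the confusion matrix: the diagonal entry for a class divided by that class's row sum gives the fraction of its instances classified correctly. A sketch with hypothetical three-class labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 0, 1]

cm = confusion_matrix(y_true, y_pred)

# Diagonal = correct predictions per class; row sums = true instances per class
per_class = cm.diagonal() / cm.sum(axis=1)
print(per_class)         # approximately [0.667, 1.0, 0.6]
print(per_class.mean())  # the macro average, i.e. balanced accuracy
```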
Accuracy is a threshold-dependent metric. For binary classifiers that output a probability score, a threshold (commonly 0.5) determines the class assignment. Changing the threshold changes the accuracy.
The Receiver Operating Characteristic (ROC) curve provides a threshold-independent view of classifier performance. It plots the true positive rate (recall) against the false positive rate (1 - specificity) at every possible threshold. The area under the ROC curve (AUC) summarizes this curve as a single number between 0 and 1, where 1.0 represents perfect classification and 0.5 represents random guessing.
Key differences between AUC and accuracy:
| Property | Accuracy | AUC-ROC |
|---|---|---|
| Threshold dependence | Yes (fixed at a single threshold) | No (evaluates across all thresholds) |
| Class imbalance sensitivity | High (misleading on imbalanced data) | Moderate (still informative on mild imbalance) |
| Interpretability | Very intuitive ("X% correct") | Less intuitive (area under a curve) |
| Model comparison | Compares at one operating point | Compares across all operating points |
| Output requirement | Predicted labels only | Predicted probabilities or scores |
When the dataset is heavily imbalanced, the precision-recall curve and the area under the precision-recall curve (AUC-PR) may be more informative than the ROC curve, because the ROC curve can appear overly optimistic when the negative class is very large.
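The sketch below illustrates the threshold dependence: the same hypothetical probability scores yield different accuracies at different thresholds, while ROC AUC is computed once from the scores themselves:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
# Hypothetical predicted probabilities of the positive class
y_prob = np.array([0.1, 0.3, 0.45, 0.6, 0.4, 0.55, 0.7, 0.9])

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)
    print(threshold, accuracy_score(y_true, y_pred))  # accuracy changes with the threshold

print(roc_auc_score(y_true, y_prob))  # AUC uses the scores directly, no threshold needed
```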
A model's accuracy on a single test set is a point estimate that may not generalize. Several techniques exist to produce more reliable accuracy estimates.
The simplest approach is to hold out a portion of the data (commonly 20-30%) as a test set. The model is trained on the remaining data and evaluated on the held-out set. While straightforward, this method has high variance: a different random split can yield a noticeably different accuracy estimate, especially with small datasets.
K-fold cross-validation provides a more robust accuracy estimate. The dataset is divided into k equally sized folds. The model is trained k times, each time using k-1 folds for training and the remaining fold for testing. The final accuracy is the average across all k folds. Common choices are k = 5 and k = 10, as research has shown these values yield estimates with a good balance between bias and variance. K-fold cross-validation ensures that every data point is used for both training and testing exactly once, making efficient use of limited data.
Stratified splits ensure that each fold or partition maintains the same class proportions as the full dataset. This is especially important for imbalanced datasets, where a random split might produce folds that contain very few (or zero) examples of the minority class, leading to unreliable accuracy estimates.
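A minimal sketch of stratified k-fold cross-validated accuracy; the synthetic dataset and logistic regression model are assumptions made only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, mildly imbalanced binary dataset (hypothetical)
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.8, 0.2], random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="accuracy")

print(scores)                        # accuracy on each of the 5 folds
print(scores.mean(), scores.std())   # cross-validated estimate and its spread
```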
A single accuracy number can be misleading without an indication of its uncertainty. Confidence intervals provide a range within which the true accuracy is likely to fall. For a classifier with accuracy p evaluated on n independent test instances, the 95% confidence interval based on the normal approximation to the binomial distribution is:
p +/- 1.96 * sqrt(p * (1 - p) / n)
For example, a model with 90% accuracy evaluated on 200 test samples has a 95% confidence interval of approximately 85.8% to 94.2%. This interval narrows as the test set grows larger. Reporting confidence intervals is good practice and helps avoid overconfident conclusions about small accuracy differences between models.
For small test sets or accuracy values near 0 or 1, the Wilson score interval or the Clopper-Pearson exact interval provides better coverage than the normal approximation.
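The normal-approximation interval above takes only a couple of lines; the Wilson interval is available through statsmodels, assuming that library is installed:

```python
import math
from statsmodels.stats.proportion import proportion_confint

p, n = 0.90, 200
margin = 1.96 * math.sqrt(p * (1 - p) / n)
print(p - margin, p + margin)  # roughly 0.858 to 0.942

# Wilson score interval for the same result (180 correct out of 200)
print(proportion_confint(count=180, nobs=200, alpha=0.05, method="wilson"))
```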
When two classifiers achieve similar accuracy scores, it is important to determine whether the difference is statistically significant or simply due to chance. Two widely used tests are:
McNemar's Test: A paired, non-parametric test based on a 2x2 contingency table of the two models' predictions. It examines whether the two classifiers disagree in a systematic way. McNemar's test is appropriate when both models are evaluated on the same test set and is recommended over tests that assume independence between observations. When the number of disagreements between the two models is small (commonly fewer than 25 discordant predictions), an exact binomial version of the test should be used instead of the chi-squared approximation.
5x2 Cross-Validation Paired t-Test: Proposed by Dietterich (1998), this method runs 5 iterations of 2-fold cross-validation and applies a modified paired t-test to the resulting accuracy differences. It offers better control of Type I error (false positive rate) than the standard paired t-test when comparing classifiers.
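As a sketch, McNemar's test can be run on the 2x2 table of agreement and disagreement counts between two models using statsmodels; the counts below are hypothetical:

```python
from statsmodels.stats.contingency_tables import mcnemar

# Rows: model A correct / wrong; columns: model B correct / wrong (hypothetical counts)
table = [[520, 35],
         [15, 30]]

# Exact binomial version, recommended when the discordant counts are small
result = mcnemar(table, exact=True)
print(result.statistic, result.pvalue)
```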
Many factors can influence the accuracy of a classification model.
The choice of algorithm significantly influences accuracy. Some algorithms are better suited for specific data types or problem structures. For example, decision trees handle categorical features naturally, while support vector machines excel on high-dimensional data. Deep neural networks tend to outperform traditional methods on large, complex datasets like images and text, but they may underperform on small, tabular datasets where gradient boosting methods often dominate.
Both the quality and quantity of training data influence accuracy. Noisy labels, missing values, and outliers can degrade accuracy. At the same time, increasing the amount of training data generally improves accuracy up to a point, after which returns diminish. Data cleaning, deduplication, and careful label verification are essential steps for maximizing accuracy.
The feature engineering process plays a critical role in accuracy. Selecting relevant features can improve accuracy substantially, while including irrelevant or redundant features can introduce noise and reduce performance. Techniques such as principal component analysis (PCA), mutual information, and recursive feature elimination help identify the most informative features.
Hyperparameters control the learning process itself (learning rate, regularization strength, tree depth, etc.) and can have a large impact on accuracy. Systematic tuning approaches such as grid search, random search, and Bayesian optimization help find configurations that maximize accuracy on validation data.
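A sketch of grid search with accuracy as the selection criterion; the dataset, model, and parameter grid are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="accuracy", cv=5)
search.fit(X, y)

print(search.best_params_)  # configuration with the highest cross-validated accuracy
print(search.best_score_)   # mean cross-validated accuracy of that configuration
```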
A model that memorizes the training data (overfitting) will show high training accuracy but low test accuracy. Conversely, a model that is too simple (underfitting) will have low accuracy on both training and test data. Regularization, early stopping, and cross-validation are standard techniques for managing this trade-off.
Most machine learning frameworks provide built-in functions for computing accuracy.
Scikit-learn offers several accuracy-related functions in sklearn.metrics:
- accuracy_score(y_true, y_pred) computes standard accuracy. Setting normalize=False returns the count of correct predictions instead of the fraction.
- balanced_accuracy_score(y_true, y_pred) computes the macro-averaged recall, correcting for class imbalance.
- top_k_accuracy_score(y_true, y_score, k) computes top-k accuracy from probability predictions.

Example usage:
```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]

# Standard accuracy
print(accuracy_score(y_true, y_pred))  # 0.75

# Number of correct predictions
print(accuracy_score(y_true, y_pred, normalize=False))  # 6

# Balanced accuracy
print(balanced_accuracy_score(y_true, y_pred))  # 0.75
```
PyTorch users can compute accuracy through the TorchMetrics library, which provides torchmetrics.Accuracy supporting binary, multiclass, and multilabel settings. The task parameter specifies the classification type, and top_k enables top-k accuracy. TorchMetrics integrates with PyTorch Lightning for seamless logging during training.
Example usage:
```python
import torch
import torchmetrics

# Binary accuracy
metric = torchmetrics.Accuracy(task="binary")
preds = torch.tensor([0, 1, 1, 0, 1])
target = torch.tensor([0, 1, 0, 0, 1])
print(metric(preds, target))  # tensor(0.8000)

# Multiclass top-2 accuracy
metric = torchmetrics.Accuracy(task="multiclass", num_classes=3, top_k=2)
preds = torch.tensor([[0.1, 0.6, 0.3], [0.7, 0.2, 0.1]])
target = torch.tensor([2, 0])
print(metric(preds, target))  # tensor(1.0)
```
In TensorFlow and Keras, accuracy is available through several built-in metric classes:
- tf.keras.metrics.Accuracy() for exact label matching.
- tf.keras.metrics.SparseCategoricalAccuracy() for integer-labeled multiclass problems.
- tf.keras.metrics.TopKCategoricalAccuracy(k) for top-k accuracy with one-hot labels (use SparseTopKCategoricalAccuracy(k) with integer labels).

These can be passed directly to model.compile(metrics=[...]) for automatic tracking during training and evaluation:
```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax")
])

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=[
        tf.keras.metrics.SparseCategoricalAccuracy(),
        # Integer labels, so the sparse variant of top-k accuracy is used
        tf.keras.metrics.SparseTopKCategoricalAccuracy(k=5)
    ]
)
```
With the rise of large language models (LLMs), accuracy has become a central reporting metric for a wide range of natural language understanding and reasoning benchmarks.
The Massive Multitask Language Understanding (MMLU) benchmark consists of approximately 16,000 multiple-choice questions across 57 academic subjects, including mathematics, philosophy, law, and medicine. Accuracy on MMLU is reported as the percentage of questions answered correctly, with a random baseline of 25% (four answer choices). When MMLU was introduced in 2021 by Hendrycks et al., GPT-3 achieved roughly 43% accuracy. By 2024, frontier models such as GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro all exceeded 88% accuracy. However, MMLU has effectively saturated for frontier model comparisons; when multiple models score between 88% and 93%, the practical difference is difficult to interpret, and measurement noise may exceed real performance gaps. Additionally, a 2024 analysis found that approximately 6.5% of MMLU questions contain errors in their ground truth labels, further complicating interpretation of small accuracy differences.
The Grade School Math 8K (GSM8K) benchmark, introduced by Cobbe et al. in 2021, contains 8,500 grade-school-level math word problems requiring multi-step arithmetic reasoning. The accuracy metric is exact match: the model's final numerical answer must match the reference answer precisely. Early models like GPT-3 scored around 35%, but by 2024, chain-of-thought prompting with frontier models pushed accuracy above 95%. A 2023 contamination study revealed that removing suspected training-set-overlapping examples from GSM8K's test set reduced some models' accuracy by up to 13 percentage points, highlighting the challenge of data contamination in benchmark evaluation.
| Benchmark | Task Type | Accuracy Metric | Notes |
|---|---|---|---|
| MMLU | Multiple-choice knowledge | 4-way accuracy | Saturated for frontier models |
| GSM8K | Math reasoning | Exact match | Contamination concerns |
| HellaSwag | Sentence completion | 4-way accuracy | Tests commonsense reasoning |
| ARC | Science questions | Multiple-choice accuracy | Easy and Challenge question sets |
| WinoGrande | Coreference resolution | Binary accuracy | Tests commonsense understanding |
| TruthfulQA | Truthfulness | Multiple-choice accuracy | Measures factual correctness |
| GPQA | Graduate-level questions | 4-way accuracy | Expert-level difficulty |
A recurring challenge with accuracy-based benchmarks is saturation: once frontier models approach or exceed human-level accuracy, the benchmark loses its ability to discriminate between models. This has happened with MMLU, GSM8K, and HellaSwag, among others. The community has responded by creating harder benchmarks such as MMLU-Pro, GPQA Diamond, and FrontierMath, which aim to maintain a wider spread of accuracy scores across models.
Classification accuracy is just one evaluation paradigm. Depending on the task, other approaches may be more appropriate.
Perplexity measures how "surprised" a language model is by a sequence of text. Lower perplexity indicates better next-token prediction. Unlike accuracy, perplexity does not require labeled ground truth for classification; it directly measures the model's probability distribution over the vocabulary. Perplexity is the standard intrinsic evaluation metric for language modeling but does not directly assess task-specific performance like classification accuracy does.
The BLEU (Bilingual Evaluation Understudy) score evaluates machine-generated text by measuring n-gram overlap between model output and reference translations. It is the standard metric for machine translation and ranges from 0 to 1. Unlike accuracy, which requires exact label matches, BLEU measures partial overlap and applies a brevity penalty for outputs that are too short.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is used primarily for evaluating text summarization. It focuses on recall, measuring how many of the reference n-grams appear in the generated summary. Like BLEU, ROUGE operates on partial overlap rather than exact correctness, making it more nuanced than accuracy for generation tasks.
For many generative AI tasks, automated metrics (including accuracy) fall short. Human evaluation involves people rating model outputs on dimensions such as fluency, relevance, factual correctness, and helpfulness. While expensive and slow, human evaluation remains the gold standard for tasks like open-ended text generation, dialogue, and creative writing, where there is no single "correct" answer.
| Paradigm | Primary Use Case | Measures | Key Limitation |
|---|---|---|---|
| Accuracy | Classification | Exact correctness | Fails on imbalanced data |
| Perplexity | Language modeling | Next-token prediction quality | Doesn't measure task performance |
| BLEU | Machine translation | N-gram overlap with references | Doesn't capture meaning well |
| ROUGE | Text summarization | Recall of reference n-grams | Misses semantic similarity |
| Human evaluation | Open-ended generation | Quality, relevance, fluency | Expensive, subjective, slow |
Based on the strengths and limitations discussed throughout this article, the following guidelines help practitioners use accuracy effectively:
Always check class distribution first. Before reporting accuracy, verify that the dataset's classes are reasonably balanced. If they are not, supplement accuracy with metrics like F1 score, MCC, or AUC-ROC.
Report confidence intervals. A bare accuracy number without uncertainty information invites overinterpretation. Report confidence intervals or standard deviations from cross-validation.
Use cross-validation. A single train-test split can produce unreliable accuracy estimates, especially on small datasets. K-fold cross-validation provides a more stable estimate.
Pair accuracy with a confusion matrix. The confusion matrix reveals where errors occur. Two models can have identical accuracy but very different error patterns, and the confusion matrix exposes these differences.
Consider the cost of different errors. If false positives and false negatives have different real-world consequences, accuracy alone is insufficient. Use precision, recall, or a cost-sensitive metric instead.
Watch for overfitting. A large gap between training accuracy and test accuracy signals overfitting. Monitor both during model development.
Be cautious with benchmark accuracy. On standardized benchmarks, high accuracy does not always translate to real-world performance. Dataset contamination, label noise, and distributional shift can all inflate reported accuracy.
Imagine your teacher gives you a spelling test with 10 words. If you spell 8 of them right, your accuracy is 8 out of 10, or 80%. That is all accuracy means: how many did you get right out of how many you tried.
Now, here is why accuracy can sometimes be tricky. Suppose your test has 9 easy words and 1 really hard word. If you get all the easy ones right but miss the hard one, your accuracy is still 90%. But if the whole point of the test was to see if you could spell that hard word, then 90% does not really tell the teacher what they wanted to know. That is why, in machine learning, people look at other scores too, not just accuracy.