See also: machine learning terms, confusion matrix, precision, recall, F1 score
Accuracy in machine learning is a metric that measures the performance of a classification model. It represents the fraction of correct predictions made by the model on a given dataset compared to the total number of predictions. Accuracy is one of the most frequently used evaluation metrics in machine learning and serves as a standard for comparing models across a wide range of tasks.
Because accuracy is so intuitive, it is often the first metric that practitioners check after training a classifier. A model that achieves 95% accuracy, for instance, is correct 95 times out of every 100 predictions. This simplicity makes accuracy easy to explain to stakeholders who may not have a technical background. At the same time, that simplicity can be deceptive; a high accuracy score does not always mean a model is performing well, a point explored in detail in later sections of this article.
The concept of measuring classification correctness predates modern machine learning. In statistics, the classification error rate (1 minus accuracy) has been used since at least the mid-20th century to evaluate discriminant analysis and other classification methods. Frank Rosenblatt and other early researchers used confusion matrices to compare human and machine classifications of visual and auditory stimuli during the development of the perceptron in the late 1950s and 1960s. As machine learning grew into a distinct discipline, accuracy became the default evaluation metric for supervised classification tasks across domains ranging from image recognition to medical diagnosis.
Accuracy is defined as the ratio between the number of correct predictions and the total number of predictions made by a classifier. In mathematical notation:
Accuracy = (Number of correct predictions) / (Total number of predictions)
For a binary classification problem, this formula can be expressed using the four outcomes of a confusion matrix:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Where:
| Symbol | Name | Meaning |
|--------|------|---------|
| TP | True Positive | Instances correctly predicted as positive |
| TN | True Negative | Instances correctly predicted as negative |
| FP | False Positive | Instances incorrectly predicted as positive (Type I error) |
| FN | False Negative | Instances incorrectly predicted as negative (Type II error) |
For instance, if a model is trained to classify images of cats and dogs and tested on 100 images, and it correctly identifies 80 of them, its accuracy is 80/100 = 0.8 or 80%.
The complement of accuracy is the error rate (also called the misclassification rate or 0-1 loss):
Error Rate = 1 - Accuracy = (FP + FN) / (TP + TN + FP + FN)
An accuracy of 0.92 corresponds to an error rate of 0.08, meaning 8% of predictions are incorrect. The error rate and accuracy always sum to 1.0, so they contain exactly the same information. The choice between reporting one or the other is largely a matter of convention; some fields and competitions prefer error rate because it highlights the mistakes a model makes rather than its successes.
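A minimal sketch with hypothetical confusion-matrix counts makes the relationship between accuracy and error rate concrete:

```python
# Hypothetical confusion-matrix counts for a binary classifier
tp, tn, fp, fn = 50, 42, 5, 3

total = tp + tn + fp + fn
accuracy = (tp + tn) / total      # (50 + 42) / 100 = 0.92
error_rate = (fp + fn) / total    # (5 + 3) / 100 = 0.08

print(accuracy, error_rate, accuracy + error_rate)  # 0.92 0.08 1.0
```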
For problems with more than two classes, accuracy generalizes naturally. If there are k classes, the classifier assigns each instance to one of the k labels, and accuracy is still the total number of correct predictions divided by the total number of instances. The confusion matrix for a k-class problem becomes a k x k matrix, where the diagonal entries represent correct classifications and off-diagonal entries represent misclassifications.
In multilabel classification, each instance can belong to multiple classes simultaneously. In this setting, the strictest form of accuracy is subset accuracy (also called exact match ratio), which counts a prediction as correct only if the entire set of predicted labels matches the true label set exactly. Scikit-learn's accuracy_score function uses subset accuracy by default for multilabel inputs. Because this measure is very strict, it tends to be lower than other metrics in multilabel settings.
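To see how strict subset accuracy is, the sketch below applies scikit-learn's accuracy_score to hypothetical multilabel indicator arrays; a prediction that gets two of three labels right still counts as wrong:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Rows are instances, columns are labels (hypothetical multilabel data)
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 1],    # exact match -> correct
                   [0, 1, 1],    # one label wrong -> entire row counts as incorrect
                   [1, 1, 0]])   # exact match -> correct

print(accuracy_score(y_true, y_pred))  # 0.666..., the subset accuracy
```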
Accuracy is most useful when the classes in a dataset are balanced, meaning there are approximately equal numbers of samples for each class. In such cases, accuracy serves as a reliable indicator of the model's overall performance.
Accuracy works well under the following conditions:
| Condition | Why accuracy works |
|---|---|
| Balanced class distribution | No class dominates the score by sheer frequency |
| Equal misclassification costs | A false positive is roughly as costly as a false negative |
| Single summary metric needed | Accuracy compresses all predictions into one number for quick comparison |
| Benchmark comparisons | Standardized datasets with balanced classes (e.g., CIFAR-10, MNIST) use accuracy as the primary leaderboard metric |
For standard image classification benchmarks, accuracy has long been the go-to metric. On ImageNet, top-1 and top-5 accuracy are the standard measures reported in research papers and competitions. Similarly, benchmarks like MNIST (handwritten digit recognition) and CIFAR-10 (object recognition across 10 balanced classes) use accuracy because their class distributions are uniform.
When classes are imbalanced (one class has significantly more samples than the other), accuracy may not be a reliable measure of model performance. This phenomenon is known as the accuracy paradox: a high accuracy score can coexist with poor predictive usefulness. A model that always predicts the majority class can achieve a high accuracy score while being completely useless for the task at hand.
Consider these concrete scenarios where accuracy is misleading on a class-imbalanced dataset:
| Domain | Class Distribution | Naive Strategy | Accuracy | Actual Usefulness |
|---|---|---|---|---|
| Fraud detection | 98% legitimate, 2% fraudulent | Always predict "legitimate" | 98% | Detects zero fraud |
| Spam filtering | 90% non-spam, 10% spam | Always predict "non-spam" | 90% | Filters zero spam |
| Medical screening | 99.5% healthy, 0.5% diseased | Always predict "healthy" | 99.5% | Misses every case of disease |
| Terrorism detection | 99.999% non-threat | Always predict "non-threat" | 99.999% | Completely non-functional |
In a fraud detection system where only 2% of transactions are fraudulent, a model that labels every transaction as legitimate achieves 98% accuracy. Despite this impressive-sounding number, the model fails at its primary purpose: catching fraud.
The root cause is that accuracy weights all correct predictions equally. In an imbalanced dataset, correctly classifying the majority class contributes far more to the accuracy score than correctly classifying the minority class. A model can therefore "cheat" by ignoring the minority class entirely and still report a high accuracy. This is why alternative metrics like precision, recall, F1 score, and the Matthews correlation coefficient exist; they account for the distribution of predictions across classes.
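One quick diagnostic is to compare a model against a majority-class baseline. The sketch below uses a hypothetical imbalanced dataset and scikit-learn's DummyClassifier; the baseline reports 98% accuracy even though it never detects the minority class:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score, f1_score

# Hypothetical imbalanced data: 98% class 0 (legitimate), 2% class 1 (fraud)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([0] * 980 + [1] * 20)

# Baseline that always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

print(accuracy_score(y, y_pred))             # 0.98 -- looks impressive
print(recall_score(y, y_pred))               # 0.0  -- catches no fraud
print(f1_score(y, y_pred, zero_division=0))  # 0.0  -- exposed immediately by F1
```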
To illustrate the paradox numerically, consider a city security system monitoring 1,000,000 people, of whom 10 are actual threats. A system that flags 1,000 people (including all 10 real threats) achieves 99.9% accuracy (999,010 correct out of 1,000,000). Yet its precision is only 1% (10 out of 1,000 flagged), meaning 99% of alerts are false alarms. The accuracy score completely obscures this problem.
Although accuracy is most commonly associated with classification, the term appears in other areas of machine learning and data science, sometimes with different meanings.
In classification, accuracy has the standard definition described above: the proportion of instances where the predicted label matches the true label. This applies to binary, multiclass, and multilabel settings.
Accuracy is not a standard metric for regression problems, where the output is a continuous value rather than a discrete label. Regression tasks instead use metrics such as mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and R-squared. Some practitioners loosely refer to "accuracy" in regression contexts to mean the degree of closeness between predicted and actual values, but this informal usage should be avoided in technical writing because it conflates two different measurement concepts.
In information retrieval, the concept closest to classification accuracy is retrieval effectiveness, but the standard metrics differ. Precision and recall are used to evaluate whether a system returns relevant documents and avoids irrelevant ones. Metrics such as precision at k (P@k), mean average precision (MAP), and normalized discounted cumulative gain (nDCG) are preferred because they account for ranking position, not just correctness. Classification accuracy would be unsuitable for retrieval because the vast majority of documents in a collection are irrelevant to any given query, creating an extreme class imbalance problem.
Accuracy is one member of a family of classification metrics, each capturing a different aspect of model performance. Understanding how these metrics relate to accuracy helps practitioners choose the right evaluation approach for their specific problem.
Precision (also called positive predictive value) measures the fraction of positive predictions that are actually correct:
Precision = TP / (TP + FP)
Precision answers the question: "Of all instances the model labeled positive, how many actually were positive?" High precision means the model rarely produces false alarms. Precision is especially important in contexts where false positives are costly, such as email spam filtering (marking legitimate email as spam is disruptive) or criminal justice (false accusations have severe consequences).
Recall (also called sensitivity or true positive rate) measures the fraction of actual positive instances that the model correctly identified:
Recall = TP / (TP + FN)
Recall answers the question: "Of all instances that truly were positive, how many did the model catch?" High recall means the model misses few positive cases. Recall is critical in medical diagnosis (missing a cancer diagnosis is dangerous) and fraud detection (missing fraudulent transactions leads to financial losses).
Specificity (also called the true negative rate) measures how well the model identifies negative instances:
Specificity = TN / (TN + FP)
While recall focuses on the positive class, specificity focuses on the negative class. Together, sensitivity and specificity provide a complete picture of how well a model handles both classes. These two metrics are especially prominent in medical and clinical research.
The F1 score is the harmonic mean of precision and recall:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
The F1 score ranges from 0 to 1 and balances the trade-off between precision and recall. It is particularly useful when class distributions are uneven, because it penalizes models that sacrifice one metric for the other. When precision and recall diverge significantly, the F1 score will be pulled toward the lower of the two values, exposing weaknesses that accuracy might hide.
The Matthews correlation coefficient (MCC) is a balanced measure that accounts for all four cells of the confusion matrix:
MCC = (TP * TN - FP * FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
MCC returns a value between -1 and +1, where +1 indicates perfect prediction, 0 indicates performance no better than random, and -1 indicates total disagreement. A 2020 study by Chicco and Jurman published in BMC Genomics demonstrated that MCC is more informative than both F1 score and accuracy for binary classification evaluation, especially on imbalanced datasets. For example, a classifier that always predicts positive on an imbalanced set (95 positive, 5 negative) achieves 95% accuracy and an F1 score of 0.974, yet its MCC is 0 (the formula's denominator vanishes when a class is never predicted, and the score is conventionally set to 0), correctly flagging the classifier as no better than random.
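All of these related metrics can be computed from the same set of predictions. A short sketch with hypothetical labels:

```python
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             matthews_corrcoef, confusion_matrix)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(precision_score(y_true, y_pred))    # TP / (TP + FP) = 3/5 = 0.6
print(recall_score(y_true, y_pred))       # TP / (TP + FN) = 3/4 = 0.75
print(f1_score(y_true, y_pred))           # harmonic mean of the two, about 0.667
print(tn / (tn + fp))                     # specificity = 4/6, about 0.667
print(matthews_corrcoef(y_true, y_pred))  # about 0.41
```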
The following table summarizes when to use each metric:
| Metric | Formula | Range | Best for |
|---|---|---|---|
| Accuracy | (TP+TN) / (TP+TN+FP+FN) | 0 to 1 | Balanced datasets, quick overview |
| Precision | TP / (TP+FP) | 0 to 1 | Minimizing false positives |
| Recall | TP / (TP+FN) | 0 to 1 | Minimizing false negatives |
| F1 Score | Harmonic mean of precision and recall | 0 to 1 | Imbalanced datasets, trade-off between precision and recall |
| Specificity | TN / (TN+FP) | 0 to 1 | Evaluating negative class performance |
| MCC | Correlation between predicted and actual | -1 to +1 | Imbalanced datasets, comprehensive evaluation |
Several modified forms of accuracy have been developed to address the limitations of standard accuracy in particular settings.
Balanced accuracy compensates for class imbalance by computing the average of recall (sensitivity) obtained on each class:
Balanced Accuracy = (Sensitivity + Specificity) / 2
For a multiclass problem, balanced accuracy is the macro-average of per-class recall values:
Balanced Accuracy = (1 / n_classes) * sum(recall_i for each class i)
If a binary classifier achieves 100% recall on the majority class but 0% on the minority class, standard accuracy might be 95% (with a 95/5 split), but balanced accuracy would be 50%, correctly reflecting the model's failure to learn the minority class. Scikit-learn provides balanced_accuracy_score in its sklearn.metrics module. The function also supports an adjusted parameter that rescales the score so that random performance yields 0 rather than 1/n_classes, making it easier to compare against chance performance.
Balanced accuracy is equivalent to computing accuracy_score with class-balanced sample weights. This makes it particularly useful when reporting a single number for datasets where some classes are much rarer than others.
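The sketch below, using hypothetical imbalanced labels, contrasts standard accuracy with balanced accuracy and shows the adjusted option mentioned above:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Nine majority-class instances, one minority-class instance (hypothetical)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # the model ignores the minority class

print(accuracy_score(y_true, y_pred))                          # 0.9
print(balanced_accuracy_score(y_true, y_pred))                 # 0.5
print(balanced_accuracy_score(y_true, y_pred, adjusted=True))  # 0.0, i.e. chance level
```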
Top-k accuracy relaxes the standard accuracy criterion by counting a prediction as correct if the true label appears anywhere among the model's top k predicted classes (ranked by predicted probability or score). Top-1 accuracy is equivalent to standard accuracy.
Top-5 accuracy became widely known through the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), where models had to classify images into one of 1,000 categories. Because many images contain ambiguous content or multiple objects, top-5 accuracy gave models credit for having the correct class among their five most confident predictions. AlexNet, the deep learning model that won ILSVRC 2012 and launched the modern deep learning era, achieved a top-5 error rate of 15.3%. By 2015, ResNet had reduced the top-5 error rate to 3.57%, surpassing human-level performance on this particular benchmark.
Scikit-learn provides top_k_accuracy_score for computing this metric. Top-k accuracy is also commonly used in recommendation systems and information retrieval.
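A minimal sketch of top-k accuracy in scikit-learn, using hypothetical probability scores over three classes:

```python
from sklearn.metrics import top_k_accuracy_score

y_true = [0, 1, 2, 2]
# Hypothetical predicted probabilities for classes 0, 1, and 2
y_score = [[0.5, 0.3, 0.2],
           [0.2, 0.3, 0.5],
           [0.1, 0.6, 0.3],
           [0.2, 0.2, 0.6]]

print(top_k_accuracy_score(y_true, y_score, k=1))  # 0.5, equivalent to standard accuracy
print(top_k_accuracy_score(y_true, y_score, k=2))  # 1.0, true class always in the top 2
```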
Per-class accuracy (also called class-wise accuracy) calculates accuracy separately for each class. This decomposition reveals whether a model performs uniformly across all classes or excels at some while failing at others. Per-class accuracy is especially informative for multiclass problems where some classes are inherently harder to distinguish.
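Per-class accuracy can be read off the confusion matrix: the diagonal entry for a class divided by that class's row sum gives the fraction of its instances classified correctly. A sketch with hypothetical three-class labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 0, 1]

cm = confusion_matrix(y_true, y_pred)

# Diagonal = correct predictions per class; row sums = true instances per class
per_class = cm.diagonal() / cm.sum(axis=1)
print(per_class)         # approximately [0.667, 1.0, 0.6]
print(per_class.mean())  # the macro average, i.e. balanced accuracy
```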
Accuracy is a threshold-dependent metric. For binary classifiers that output a probability score, a threshold (commonly 0.5) determines the class assignment. Changing the threshold changes the accuracy.
The Receiver Operating Characteristic (ROC) curve provides a threshold-independent view of classifier performance. It plots the true positive rate (recall) against the false positive rate (1 - specificity) at every possible threshold. The area under the ROC curve (AUC) summarizes this curve as a single number between 0 and 1, where 1.0 represents perfect classification and 0.5 represents random guessing.
Key differences between AUC and accuracy:
| Property | Accuracy | AUC-ROC |
|---|---|---|
| Threshold dependence | Yes (fixed at a single threshold) | No (evaluates across all thresholds) |
| Class imbalance sensitivity | High (misleading on imbalanced data) | Moderate (still informative on mild imbalance) |
| Interpretability | Very intuitive ("X% correct") | Less intuitive (area under a curve) |
| Model comparison | Compares at one operating point | Compares across all operating points |
| Output requirement | Predicted labels only | Predicted probabilities or scores |
When the dataset is heavily imbalanced, the precision-recall curve and the area under the precision-recall curve (AUC-PR) may be more informative than the ROC curve, because the ROC curve can appear overly optimistic when the negative class is very large.
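The sketch below illustrates the threshold dependence: the same hypothetical probability scores yield different accuracies at different thresholds, while ROC AUC is computed once from the scores themselves:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
# Hypothetical predicted probabilities of the positive class
y_prob = np.array([0.1, 0.3, 0.45, 0.6, 0.4, 0.55, 0.7, 0.9])

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)
    print(threshold, accuracy_score(y_true, y_pred))  # accuracy changes with the threshold

print(roc_auc_score(y_true, y_prob))  # AUC uses the scores directly, no threshold needed
```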
A model's accuracy on a single test set is a point estimate that may not generalize. Several techniques exist to produce more reliable accuracy estimates.
The simplest approach is to hold out a portion of the data (commonly 20-30%) as a test set. The model is trained on the remaining data and evaluated on the held-out set. While straightforward, this method has high variance: a different random split can yield a noticeably different accuracy estimate, especially with small datasets.
K-fold cross-validation provides a more robust accuracy estimate. The dataset is divided into k equally sized folds. The model is trained k times, each time using k-1 folds for training and the remaining fold for testing. The final accuracy is the average across all k folds. Common choices are k = 5 and k = 10, as research has shown these values yield estimates with a good balance between bias and variance. K-fold cross-validation ensures that every data point is used for both training and testing exactly once, making efficient use of limited data.
Stratified splits ensure that each fold or partition maintains the same class proportions as the full dataset. This is especially important for imbalanced datasets, where a random split might produce folds that contain very few (or zero) examples of the minority class, leading to unreliable accuracy estimates.
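A minimal sketch of stratified k-fold cross-validated accuracy; the synthetic dataset and logistic regression model are assumptions made only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, mildly imbalanced binary dataset (hypothetical)
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.8, 0.2], random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="accuracy")

print(scores)                        # accuracy on each of the 5 folds
print(scores.mean(), scores.std())   # cross-validated estimate and its spread
```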
A single accuracy number can be misleading without an indication of its uncertainty. Confidence intervals provide a range within which the true accuracy is likely to fall. For a classifier with accuracy p evaluated on n independent test instances, the 95% confidence interval based on the normal approximation to the binomial distribution is:
p +/- 1.96 * sqrt(p * (1 - p) / n)
For example, a model with 90% accuracy evaluated on 200 test samples has a 95% confidence interval of approximately 85.8% to 94.2%. This interval narrows as the test set grows larger. Reporting confidence intervals is good practice and helps avoid overconfident conclusions about small accuracy differences between models.
For small test sets or accuracy values near 0 or 1, the Wilson score interval or the Clopper-Pearson exact interval provides better coverage than the normal approximation.
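The normal-approximation interval above takes only a couple of lines; the Wilson interval is available through statsmodels, assuming that library is installed:

```python
import math
from statsmodels.stats.proportion import proportion_confint

p, n = 0.90, 200
margin = 1.96 * math.sqrt(p * (1 - p) / n)
print(p - margin, p + margin)  # roughly 0.858 to 0.942

# Wilson score interval for the same result (180 correct out of 200)
print(proportion_confint(count=180, nobs=200, alpha=0.05, method="wilson"))
```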
When two classifiers achieve similar accuracy scores, it is important to determine whether the difference is statistically significant or simply due to chance. Two widely used tests are:
McNemar's Test: A paired, non-parametric test based on a 2x2 contingency table of the two models' predictions. It examines whether the two classifiers disagree in a systematic way. McNemar's test is appropriate when both models are evaluated on the same test set and is recommended over tests that assume independence between observations. When the number of disagreements between the two models is small (commonly fewer than 25 discordant predictions), an exact binomial version of the test should be used instead of the chi-squared approximation.
5x2 Cross-Validation Paired t-Test: Proposed by Dietterich (1998), this method runs 5 iterations of 2-fold cross-validation and applies a modified paired t-test to the resulting accuracy differences. It offers better control of Type I error (false positive rate) than the standard paired t-test when comparing classifiers.
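As a sketch, McNemar's test can be run on the 2x2 table of agreement and disagreement counts between two models using statsmodels; the counts below are hypothetical:

```python
from statsmodels.stats.contingency_tables import mcnemar

# Rows: model A correct / wrong; columns: model B correct / wrong (hypothetical counts)
table = [[520, 35],
         [15, 30]]

# Exact binomial version, recommended when the discordant counts are small
result = mcnemar(table, exact=True)
print(result.statistic, result.pvalue)
```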
Many factors can influence the accuracy of a classification model.
The choice of algorithm significantly influences accuracy. Some algorithms are better suited for specific data types or problem structures. For example, decision trees handle categorical features naturally, while support vector machines excel on high-dimensional data. Deep neural networks tend to outperform traditional methods on large, complex datasets like images and text, but they may underperform on small, tabular datasets where gradient boosting methods often dominate.
Both the quality and quantity of training data influence accuracy. Noisy labels, missing values, and outliers can degrade accuracy. At the same time, increasing the amount of training data generally improves accuracy up to a point, after which returns diminish. Data cleaning, deduplication, and careful label verification are essential steps for maximizing accuracy.
The feature engineering process plays a critical role in accuracy. Selecting relevant features can improve accuracy substantially, while including irrelevant or redundant features can introduce noise and reduce performance. Techniques such as principal component analysis (PCA), mutual information, and recursive feature elimination help identify the most informative features.
Hyperparameters control the learning process itself (learning rate, regularization strength, tree depth, etc.) and can have a large impact on accuracy. Systematic tuning approaches such as grid search, random search, and Bayesian optimization help find configurations that maximize accuracy on validation data.
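A sketch of grid search with accuracy as the selection criterion; the dataset, model, and parameter grid are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="accuracy", cv=5)
search.fit(X, y)

print(search.best_params_)  # configuration with the highest cross-validated accuracy
print(search.best_score_)   # mean cross-validated accuracy of that configuration
```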
A model that memorizes the training data (overfitting) will show high training accuracy but low test accuracy. Conversely, a model that is too simple (underfitting) will have low accuracy on both training and test data. Regularization, early stopping, and cross-validation are standard techniques for managing this trade-off.
Most machine learning frameworks provide built-in functions for computing accuracy.
Scikit-learn offers several accuracy-related functions in sklearn.metrics:
- accuracy_score(y_true, y_pred) computes standard accuracy. Setting normalize=False returns the count of correct predictions instead of the fraction.
- balanced_accuracy_score(y_true, y_pred) computes the macro-averaged recall, correcting for class imbalance.
- top_k_accuracy_score(y_true, y_score, k) computes top-k accuracy from probability predictions.

Example usage:
```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]

# Standard accuracy
print(accuracy_score(y_true, y_pred))  # 0.75

# Number of correct predictions
print(accuracy_score(y_true, y_pred, normalize=False))  # 6

# Balanced accuracy
print(balanced_accuracy_score(y_true, y_pred))  # 0.75
```
PyTorch users can compute accuracy through the TorchMetrics library, which provides torchmetrics.Accuracy supporting binary, multiclass, and multilabel settings. The task parameter specifies the classification type, and top_k enables top-k accuracy. TorchMetrics integrates with PyTorch Lightning for seamless logging during training.
Example usage:
```python
import torch
import torchmetrics

# Binary accuracy
metric = torchmetrics.Accuracy(task="binary")
preds = torch.tensor([0, 1, 1, 0, 1])
target = torch.tensor([0, 1, 0, 0, 1])
print(metric(preds, target))  # tensor(0.8000)

# Multiclass top-2 accuracy
metric = torchmetrics.Accuracy(task="multiclass", num_classes=3, top_k=2)
preds = torch.tensor([[0.1, 0.6, 0.3], [0.7, 0.2, 0.1]])
target = torch.tensor([2, 0])
print(metric(preds, target))  # tensor(1.0)
```
In TensorFlow and Keras, accuracy is available through several built-in metric classes:
- tf.keras.metrics.Accuracy() for exact label matching.
- tf.keras.metrics.SparseCategoricalAccuracy() for integer-labeled multiclass problems.
- tf.keras.metrics.TopKCategoricalAccuracy(k) for top-k accuracy with one-hot labels (use SparseTopKCategoricalAccuracy(k) with integer labels).

These can be passed directly to model.compile(metrics=[...]) for automatic tracking during training and evaluation:
```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax")
])

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=[
        tf.keras.metrics.SparseCategoricalAccuracy(),
        # Integer labels, so the sparse variant of top-k accuracy is used
        tf.keras.metrics.SparseTopKCategoricalAccuracy(k=5)
    ]
)
```
With the rise of large language models (LLMs), accuracy has become a central reporting metric for a wide range of natural language understanding and reasoning benchmarks.
The Massive Multitask Language Understanding (MMLU) benchmark consists of approximately 16,000 multiple-choice questions across 57 academic subjects, including mathematics, philosophy, law, and medicine. Accuracy on MMLU is reported as the percentage of questions answered correctly, with a random baseline of 25% (four answer choices). When MMLU was introduced in 2021 by Hendrycks et al., GPT-3 achieved roughly 43% accuracy. By 2024, frontier models such as GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro all exceeded 88% accuracy. However, MMLU has effectively saturated for frontier model comparisons; when multiple models score between 88% and 93%, the practical difference is difficult to interpret, and measurement noise may exceed real performance gaps. Additionally, a 2024 analysis found that approximately 6.5% of MMLU questions contain errors in their ground truth labels, further complicating interpretation of small accuracy differences.
The Grade School Math 8K (GSM8K) benchmark, introduced by Cobbe et al. in 2021, contains 8,500 grade-school-level math word problems requiring multi-step arithmetic reasoning. The accuracy metric is exact match: the model's final numerical answer must match the reference answer precisely. Early models like GPT-3 scored around 35%, but by 2024, chain-of-thought prompting with frontier models pushed accuracy above 95%. A 2023 contamination study revealed that removing suspected training-set-overlapping examples from GSM8K's test set reduced some models' accuracy by up to 13 percentage points, highlighting the challenge of data contamination in benchmark evaluation.
| Benchmark | Task Type | Accuracy Metric | Notes |
|---|---|---|---|
| MMLU | Multiple-choice knowledge | 4-way accuracy | Saturated for frontier models |
| GSM8K | Math reasoning | Exact match | Contamination concerns |
| HellaSwag | Sentence completion | 4-way accuracy | Tests commonsense reasoning |
| ARC | Science questions | Multiple-choice accuracy | Easy and Challenge question sets |
| WinoGrande | Coreference resolution | Binary accuracy | Tests commonsense understanding |
| TruthfulQA | Truthfulness | Multiple-choice accuracy | Measures factual correctness |
| GPQA | Graduate-level questions | 4-way accuracy | Expert-level difficulty |
A recurring challenge with accuracy-based benchmarks is saturation: once frontier models approach or exceed human-level accuracy, the benchmark loses its ability to discriminate between models. This has happened with MMLU, GSM8K, and HellaSwag, among others. The community has responded by creating harder benchmarks such as MMLU-Pro, GPQA Diamond, and FrontierMath, which aim to maintain a wider spread of accuracy scores across models.
Classification accuracy is just one evaluation paradigm. Depending on the task, other approaches may be more appropriate.
Perplexity measures how "surprised" a language model is by a sequence of text. Lower perplexity indicates better next-token prediction. Unlike accuracy, perplexity does not require labeled ground truth for classification; it directly measures the model's probability distribution over the vocabulary. Perplexity is the standard intrinsic evaluation metric for language modeling but does not directly assess task-specific performance like classification accuracy does.
The BLEU (Bilingual Evaluation Understudy) score evaluates machine-generated text by measuring n-gram overlap between model output and reference translations. It is the standard metric for machine translation and ranges from 0 to 1. Unlike accuracy, which requires exact label matches, BLEU measures partial overlap and applies a brevity penalty for outputs that are too short.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is used primarily for evaluating text summarization. It focuses on recall, measuring how many of the reference n-grams appear in the generated summary. Like BLEU, ROUGE operates on partial overlap rather than exact correctness, making it more nuanced than accuracy for generation tasks.
For many generative AI tasks, automated metrics (including accuracy) fall short. Human evaluation involves people rating model outputs on dimensions such as fluency, relevance, factual correctness, and helpfulness. While expensive and slow, human evaluation remains the gold standard for tasks like open-ended text generation, dialogue, and creative writing, where there is no single "correct" answer.
| Paradigm | Primary Use Case | Measures | Key Limitation |
|---|---|---|---|
| Accuracy | Classification | Exact correctness | Fails on imbalanced data |
| Perplexity | Language modeling | Next-token prediction quality | Doesn't measure task performance |
| BLEU | Machine translation | N-gram overlap with references | Doesn't capture meaning well |
| ROUGE | Text summarization | Recall of reference n-grams | Misses semantic similarity |
| Human evaluation | Open-ended generation | Quality, relevance, fluency | Expensive, subjective, slow |
Based on the strengths and limitations discussed throughout this article, the following guidelines help practitioners use accuracy effectively:
Always check class distribution first. Before reporting accuracy, verify that the dataset's classes are reasonably balanced. If they are not, supplement accuracy with metrics like F1 score, MCC, or AUC-ROC.
Report confidence intervals. A bare accuracy number without uncertainty information invites overinterpretation. Report confidence intervals or standard deviations from cross-validation.
Use cross-validation. A single train-test split can produce unreliable accuracy estimates, especially on small datasets. K-fold cross-validation provides a more stable estimate.
Pair accuracy with a confusion matrix. The confusion matrix reveals where errors occur. Two models can have identical accuracy but very different error patterns, and the confusion matrix exposes these differences.
Consider the cost of different errors. If false positives and false negatives have different real-world consequences, accuracy alone is insufficient. Use precision, recall, or a cost-sensitive metric instead.
Watch for overfitting. A large gap between training accuracy and test accuracy signals overfitting. Monitor both during model development.
Be cautious with benchmark accuracy. On standardized benchmarks, high accuracy does not always translate to real-world performance. Dataset contamination, label noise, and distributional shift can all inflate reported accuracy.
Imagine your teacher gives you a spelling test with 10 words. If you spell 8 of them right, your accuracy is 8 out of 10, or 80%. That is all accuracy means: how many did you get right out of how many you tried.
Now, here is why accuracy can sometimes be tricky. Suppose your test has 9 easy words and 1 really hard word. If you get all the easy ones right but miss the hard one, your accuracy is still 90%. But if the whole point of the test was to see if you could spell that hard word, then 90% does not really tell the teacher what they wanted to know. That is why, in machine learning, people look at other scores too, not just accuracy.