Metric
Last reviewed
May 11, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 2,196 words
See also: Machine learning terms
In machine learning, a metric is a quantitative measure used to evaluate the performance of an algorithm or model. Metrics let researchers understand how effective a model is at a task, and they allow direct comparison between models trained on the same data. Google's Machine Learning Glossary defines a metric as a statistic that you care about. The choice of metric is often more important than the choice of algorithm, because the metric defines what "good" looks like.
Different tasks call for different metrics. A spam filter is judged on whether it catches spam without flagging real mail. A regression model is judged on how far predictions land from the truth. A search engine is judged on whether its top documents are the ones the user wanted. No single number captures all of these, so the field has many measures.
A metric should not be confused with a loss function. The loss function is what the optimizer minimizes during training; for gradient descent to work on it, it must be differentiable, or at least differentiable almost everywhere. A metric is a number humans look at to judge the model, so it has no such constraint. Loss functions are for machines; metrics are for people. The two can disagree: a model with slightly higher cross-entropy loss may have slightly higher classification accuracy, because cross-entropy rewards confident correct probabilities while accuracy only checks which class is ranked first.
Classification tasks assign each input to one of several discrete classes. Most classification metrics start from the confusion matrix, which counts four kinds of outcomes for a binary classifier: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
Accuracy is the proportion of correctly classified instances out of the total:
accuracy = (TP + TN) / (TP + TN + FP + FN)
It is the most intuitive classification metric, and it is what the score method of scikit-learn's classifiers returns by default. However, accuracy can be deeply misleading on imbalanced datasets. If 99% of credit card transactions are legitimate, a model that always predicts "legitimate" scores 99% accuracy while catching zero fraud. Google's ML crash course warns that accuracy is usually a poor metric for class-imbalanced problems.
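The imbalance problem is easy to reproduce. The sketch below uses made-up labels (990 legitimate, 10 fraudulent) and a hypothetical accuracy helper written for this illustration; it is not a library API.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Illustrative data (assumed): 1 = fraud, 0 = legitimate, 1% fraud rate.
y_true = [0] * 990 + [1] * 10
# A "model" that always predicts legitimate.
y_pred = [0] * 1000

acc = accuracy(y_true, y_pred)
fraud_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
print(f"accuracy = {acc:.2f}, fraud cases caught = {fraud_caught}")
```

The accuracy is high even though the model never flags a single fraudulent transaction.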
Precision and recall isolate the two kinds of mistakes a classifier can make on the positive class.
Precision (positive predictive value) is the fraction of predicted positives that are actually positive:
precision = TP / (TP + FP)
Precision answers "of the items we flagged, how many should we have flagged?" It matters when a false positive is expensive, like sending innocent email to the spam folder.
Recall (sensitivity, true positive rate) is the fraction of actual positives the model catches:
recall = TP / (TP + FN)
Recall answers "of the items that mattered, how many did we find?" It matters when a false negative is expensive, like missing a tumor on a medical scan.
The two usually trade off against each other. Lowering the decision threshold flags more items, which can only keep or raise recall and typically lowers precision; raising the threshold does the opposite.
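A small threshold sweep makes the tradeoff concrete. The labels and scores below are invented for illustration, and precision_recall_at is a hypothetical helper, not a library function.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, scores) if s < threshold and t == 1)
    # Convention assumed here: report 0.0 when a denominator is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Illustrative data (assumed): 4 positives and 4 negatives with model scores.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30, 0.25, 0.10]

for threshold in (0.8, 0.5, 0.2):
    p, r = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this data, moving the threshold from 0.8 down to 0.2 raises recall from 0.5 to 1.0 while precision falls from 1.0 to 4/7.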
The F1 score is the harmonic mean of precision and recall:
F1 = 2 * (precision * recall) / (precision + recall)
The harmonic mean penalizes a model that does well on one but poorly on the other, so F1 is only high when both precision and recall are reasonable. The F-beta score generalizes F1 with a weight beta: values above 1 weight recall more heavily, values below 1 weight precision more heavily, and beta = 1 recovers F1.
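The following sketch shows how the harmonic mean behaves on a lopsided model. The f_beta helper is a hypothetical function that applies the standard formula F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); the precision and recall values are invented.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0  # assumed convention for the undefined 0/0 case
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative values (assumed): very precise, but misses most positives.
precision, recall = 0.9, 0.1

arithmetic_mean = (precision + recall) / 2
f1 = f_beta(precision, recall)
f2 = f_beta(precision, recall, beta=2.0)       # leans toward recall
f_half = f_beta(precision, recall, beta=0.5)   # leans toward precision

print(f"arithmetic mean = {arithmetic_mean:.2f}, F1 = {f1:.2f}")
print(f"F2 = {f2:.3f}, F0.5 = {f_half:.3f}")
```

The arithmetic mean of 0.9 and 0.1 is 0.5, but F1 is only 0.18, which reflects the weak recall. F2 is lower still because it weights recall more.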
The ROC curve plots the true positive rate against the false positive rate as the classification threshold sweeps from 0 to 1. AUC-ROC is the area under this curve, a single number between 0 and 1 summarizing how well the model ranks positive examples above negative ones. A value of 0.5 means random guessing; 1.0 means perfect separation. AUC-ROC has a probabilistic interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative example.
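The probabilistic interpretation suggests a direct, if quadratic-time, way to compute AUC-ROC: compare every positive score with every negative score. The sketch below assumes ties count as half a win, a common convention; the function name and data are illustrative.

```python
def auc_roc_pairwise(y_true, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumed convention: a tie is half a win
    return wins / (len(pos) * len(neg))

# Illustrative data (assumed): 4 positives and 4 negatives.
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30, 0.25, 0.10]

auc = auc_roc_pairwise(y_true, scores)
print(f"AUC-ROC = {auc:.2f}")
```

Here 12 of the 16 positive-negative pairs are ordered correctly, so the AUC is 0.75. Production libraries compute the same quantity from sorted scores rather than by enumerating pairs.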
The precision-recall curve plots precision against recall across thresholds, and AUC-PR is the area under it. AUC-PR is generally more informative than AUC-ROC on heavily imbalanced datasets, because the negative class dominates the false positive rate and makes ROC curves look optimistically good. Fraud detection and rare-disease prediction often report AUC-PR alongside or instead of AUC-ROC.
Specificity (TN / (TN + FP)) is the recall of the negative class. Balanced accuracy averages sensitivity and specificity. The Matthews correlation coefficient (MCC) uses all four cells of the confusion matrix and treats both error types symmetrically. Log loss (cross-entropy) scores the predicted probabilities themselves and penalizes confident wrong predictions heavily, so it reflects probability quality, not just whether the top class is correct.
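As a rough illustration, the always-"legitimate" fraud model described earlier can be rescored with balanced accuracy and MCC. Both helpers are hypothetical and take raw confusion-matrix counts; returning 0.0 for MCC when the denominator is zero is an assumed convention.

```python
import math

def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity (recall) and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column is empty
    return (tp * tn - fp * fn) / denom

# Always predicting "legitimate" on 990 legitimate and 10 fraud cases.
tp, tn, fp, fn = 0, 990, 0, 10

print(f"accuracy          = {(tp + tn) / (tp + tn + fp + fn):.2f}")
print(f"balanced accuracy = {balanced_accuracy(tp, tn, fp, fn):.2f}")
print(f"MCC               = {mcc(tp, tn, fp, fn):.2f}")
```

Accuracy stays at 0.99, while balanced accuracy drops to 0.5 and MCC to 0, both of which signal a model with no discriminative power.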
Regression tasks predict a continuous numerical value. The metrics below all measure some notion of how far predictions land from the truth.
Mean Squared Error (MSE) is the average of squared differences between predicted and actual values:
MSE = (1/n) * sum((y_i - y_hat_i)^2)
Squaring penalizes large errors heavily, making MSE sensitive to outliers. It is differentiable everywhere, which is why squared error is one of the most common training losses for regression. The reported MSE is in squared units of the target, awkward when the target is dollars or meters.
Root Mean Squared Error (RMSE) is the square root of MSE:
RMSE = sqrt(MSE)
RMSE is in the same units as the target, which makes it one of the most commonly reported regression metrics in practice. It still penalizes large errors strongly, so a model with a few wild predictions will have worse RMSE than one with consistently moderate errors.
Mean Absolute Error (MAE) averages the absolute differences between predicted and actual values:
MAE = (1/n) * sum(|y_i - y_hat_i|)
MAE treats every unit of error linearly, so a prediction off by 100 contributes ten times as much as one off by 10, not 100 times as it would under squared error. This makes MAE more robust to outliers than MSE. The tradeoff is that the absolute value is not differentiable at zero, which complicates its use as a training loss for some optimizers.
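The difference between squared and absolute error shows up when two models have the same total error spread differently. The sketch below uses invented predictions and hypothetical mse, rmse, and mae helpers that follow the formulas above.

```python
import math

def mse(y_true, y_pred):
    """Mean of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Square root of MSE, in the units of the target."""
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean of absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Illustrative data (assumed).
y_true = [10.0, 20.0, 30.0, 40.0]
steady = [12.0, 18.0, 32.0, 38.0]  # every prediction off by 2
spiky = [10.0, 20.0, 30.0, 48.0]   # three exact, one off by 8

for name, pred in (("steady", steady), ("spiky", spiky)):
    print(f"{name}: MAE={mae(y_true, pred):.1f}, RMSE={rmse(y_true, pred):.1f}")
```

Both models have an MAE of 2, but the single large miss doubles the RMSE of the second model from 2 to 4.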
The coefficient of determination, R^2, is the fraction of variance in the target that the model explains:
R^2 = 1 - (sum((y_i - y_hat_i)^2) / sum((y_i - y_bar)^2))
R^2 of 1 means the model perfectly explains the target; 0 means it does no better than always predicting the mean; negative values mean it is worse than that baseline. R^2 is the default scoring metric for regression in scikit-learn. Adjusted R^2 corrects for the number of features.
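The three regimes of R^2 can be checked with a few hand-picked predictions. The r_squared helper below is a hypothetical implementation of the formula above and assumes the target is not constant (otherwise the denominator is zero).

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Illustrative data (assumed).
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]

perfect = r_squared(y_true, [1.0, 2.0, 3.0, 4.0, 5.0])
mean_only = r_squared(y_true, [3.0, 3.0, 3.0, 3.0, 3.0])
reversed_trend = r_squared(y_true, [5.0, 4.0, 3.0, 2.0, 1.0])

print(perfect, mean_only, reversed_trend)
```

A perfect fit gives 1, always predicting the mean gives 0, and predicting the reverse of the true trend gives a negative value (here, -3).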
Mean absolute percentage error (MAPE) expresses error as a percentage of the true value, which is intuitive but unstable when the true value is near zero. Huber loss combines MSE and MAE behavior: it is quadratic for small errors and linear for large ones.
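A minimal sketch of both measures follows. The function names, the delta parameter default of 1.0, and the sample values are assumptions made for illustration.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent. Undefined if any true value is 0."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def huber(error, delta=1.0):
    """Huber loss for one residual: quadratic within delta, linear beyond it."""
    a = abs(error)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)

# MAPE behaves sensibly when true values are far from zero...
print(mape([100.0, 200.0], [110.0, 180.0]))
# ...but an absolute error of 1.0 on a true value of 0.01 explodes.
print(mape([0.01], [1.01]))

# Huber: small residuals are squared, large ones grow linearly.
print(huber(0.5), huber(3.0))
```

With delta = 1, a residual of 0.5 costs 0.125 (as under halved squared error), while a residual of 3 costs 2.5 instead of 4.5.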
Ranking tasks order a list of candidates by relevance. The user usually only looks at the top of the list, so these metrics weight the top of the ranking more heavily than the bottom.
Mean Reciprocal Rank (MRR) focuses on the position of the first relevant item. For each query, you compute 1 divided by the rank of the first correct answer, then average across queries. If the first relevant item is at rank 1 the contribution is 1; at rank 2, 0.5; at rank 10, 0.1. MRR fits scenarios with a single right answer, like factual question answering. It ignores anything after the first relevant hit.
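The calculation can be sketched as follows. The input format (one list of 0/1 relevance flags per query, in ranked order) and the choice to score a query with no relevant result as 0 are assumptions of this example.

```python
def mean_reciprocal_rank(results):
    """results: per-query lists of 0/1 relevance flags in ranked order."""
    total = 0.0
    for flags in results:
        for rank, rel in enumerate(flags, start=1):
            if rel:
                total += 1.0 / rank
                break  # only the first relevant item counts
        # a query with no relevant item contributes 0 (assumed convention)
    return total / len(results)

# Illustrative rankings (assumed): first hit at rank 1, rank 2, and rank 10.
results = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
]

mrr = mean_reciprocal_rank(results)
print(f"MRR = {mrr:.3f}")
```

The three queries contribute 1, 0.5, and 0.1, so the MRR is their average, about 0.533.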
Mean Average Precision (MAP) averages precision at each relevant item in the ranking, then averages across queries. MAP approximates the area under the precision-recall curve for the ranking, and rewards systems that pack relevant items near the top while still finding most of them. MAP works well when relevance is binary.
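A minimal sketch of the computation is shown below. It assumes binary relevance flags in ranked order and that every relevant item for a query appears somewhere in its list; if some relevant items were never retrieved, the divisor should be the total number of relevant items instead.

```python
def average_precision(flags):
    """Average of precision@k over the ranks k that hold a relevant item."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(flags, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0

def mean_average_precision(results):
    """Mean of average precision across queries."""
    return sum(average_precision(flags) for flags in results) / len(results)

# Illustrative rankings (assumed): same two relevant items, different positions.
early = [1, 0, 1, 0]  # relevant at ranks 1 and 3
late = [0, 1, 0, 1]   # relevant at ranks 2 and 4

print(f"AP early = {average_precision(early):.3f}")
print(f"AP late  = {average_precision(late):.3f}")
print(f"MAP      = {mean_average_precision([early, late]):.3f}")
```

Both rankings retrieve the same two relevant items, but placing them one rank earlier lifts average precision from 0.5 to about 0.833.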
Normalized Discounted Cumulative Gain (NDCG) handles graded relevance, where some documents are more relevant than others on a numeric scale. NDCG divides each item's relevance by a logarithm of its position (commonly log2(rank + 1)), so a highly relevant document at position 1 contributes more than the same document at position 10. The result is normalized by the ideal DCG, the DCG of the best possible ordering, so NDCG falls between 0 and 1. Web search engines have used NDCG as their workhorse offline metric for two decades, because it captures both relevance and order in one number.
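The sketch below uses the common log2(rank + 1) discount with the raw relevance grade as the gain. Other formulations use 2^relevance - 1 as the gain; which one a given system uses is not stated here, so treat this as one plausible variant.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

# Illustrative graded relevance (assumed): 3 = best, 0 = irrelevant.
best_order = [3, 2, 1, 0]
worst_order = [0, 1, 2, 3]

print(f"NDCG best  = {ndcg(best_order):.3f}")
print(f"NDCG worst = {ndcg(worst_order):.3f}")
```

The same four documents score 1.0 when sorted by relevance and noticeably less when the order is reversed.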
Evaluating generated text is harder than evaluating a class label or a number, because there is rarely one correct output. The community has settled on a handful of automatic metrics that compare a generated string to one or more reference strings.
BLEU (Bilingual Evaluation Understudy) was introduced by Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu at IBM in 2002 to evaluate machine translation. BLEU computes a modified n-gram precision between the candidate translation and one or more reference translations, multiplied by a brevity penalty that discourages overly short outputs. Scores range from 0 to 1, often reported as a percentage. BLEU was initially greeted with skepticism but quickly became the standard metric in machine translation because it cut the cost of evaluating new systems. It correlates only loosely with human judgment, so it is now usually reported alongside other metrics.
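The two ingredients named above, modified (clipped) n-gram precision and the brevity penalty, can be sketched in a few lines. This is a simplification under stated assumptions: one reference, one n-gram order at a time, and sentence level. Full BLEU combines orders 1 through 4 with a geometric mean, supports multiple references, and is computed over a corpus. The function names are hypothetical.

```python
import math
from collections import Counter

def clipped_precision(candidate, reference, n=1):
    """Modified n-gram precision against a single reference (token lists)."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref[ngram]) for ngram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

def brevity_penalty(candidate_len, reference_len):
    """1.0 for candidates at least as long as the reference, else exp(1 - r/c)."""
    if candidate_len == 0:
        return 0.0
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

# A degenerate candidate that repeats one common word.
candidate = "the the the the the the the".split()
reference = "the cat is on the mat".split()

p1 = clipped_precision(candidate, reference, n=1)
print(f"clipped unigram precision = {p1:.3f}")
print(f"brevity penalty (3 vs 6 tokens) = {brevity_penalty(3, 6):.3f}")
```

Without clipping, the candidate would score a unigram precision of 7/7 because every token appears in the reference. Clipping limits the credit for "the" to its two occurrences in the reference, giving 2/7.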
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) was designed for summarization and is the recall counterpart to BLEU. ROUGE-N measures n-gram recall; ROUGE-L uses the longest common subsequence between generated and reference text. Where BLEU asks how much of the generated text matches a reference, ROUGE asks how much of the reference shows up in the generated text. ROUGE is the standard automatic metric for text summarization.
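The longest-common-subsequence idea behind ROUGE-L can be sketched as follows. This example reports only the recall side (LCS length divided by reference length); ROUGE-L is usually reported as an F-measure that also includes LCS-based precision. The helper names and sentences are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_recall(candidate, reference):
    """Fraction of the reference covered by the LCS with the candidate."""
    if not reference:
        return 0.0
    return lcs_length(candidate, reference) / len(reference)

# Illustrative sentences (assumed).
reference = "the cat sat on the mat".split()
candidate = "the cat lay on a mat".split()

print(f"LCS length = {lcs_length(candidate, reference)}")
print(f"ROUGE-L recall = {rouge_l_recall(candidate, reference):.3f}")
```

The longest common subsequence here is "the cat on mat", four of the six reference tokens. Unlike n-gram matching, the matched words do not need to be adjacent, only in the same order.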
METEOR (Metric for Evaluation of Translation with Explicit Ordering) improves on BLEU by considering stemming, synonyms, and word order. It correlates better with human judgment on sentence-level evaluation, at the cost of being slower and requiring language-specific resources like WordNet.
Perplexity measures how well a probability distribution predicts a sample. For a language model, perplexity is the exponential of the average negative log-likelihood per token: a perplexity of 20 means the model is on average as uncertain as if choosing uniformly from 20 equally likely next tokens. Lower is better. Perplexity is the dominant intrinsic metric for language model pretraining, though it does not directly measure whether the model produces text humans find useful.
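The definition translates directly into code. The sketch below takes the probabilities a model assigned to each observed token (invented values here) and uses the natural logarithm; the perplexity helper is hypothetical.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-likelihood per token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that spreads probability uniformly over 20 options at each step.
uniform_20 = [1 / 20] * 5
# A more confident model (assumed probabilities for the observed tokens).
confident = [0.9, 0.8, 0.95, 0.7, 0.85]

print(f"uniform over 20: {perplexity(uniform_20):.1f}")
print(f"confident model: {perplexity(confident):.2f}")
```

Assigning probability 1/20 to every observed token gives a perplexity of 20, matching the "choosing among 20 equally likely tokens" reading, and higher probabilities on the observed tokens push perplexity toward 1.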
Classic n-gram metrics struggle with the open-ended outputs of modern large language models. The field has moved toward embedding-based metrics like BERTScore, which compares contextual embeddings rather than surface tokens, and learned metrics like BLEURT, which are trained to predict human ratings. LLM-as-judge evaluation, where a strong model scores the outputs of another model, is also increasingly common.
Clustering is unsupervised, so evaluation is trickier than for supervised tasks. The Adjusted Rand Index (ARI) compares a predicted clustering to a ground-truth clustering, correcting for chance. The Silhouette Score works without ground truth by measuring how tight each cluster is and how well separated it is from the others, with values from -1 to 1. Normalized Mutual Information (NMI) measures clustering agreement using information theory.
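As a rough illustration of the chance-corrected comparison, the sketch below computes ARI by pair counting over the contingency table of the two labelings. It is a hypothetical helper that assumes at least two items; returning 1.0 in the degenerate case where the expected and maximum index coincide is an assumed convention.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI from pair counts; cluster label names themselves do not matter."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # assumed convention for degenerate clusterings
    return (sum_cells - expected) / (max_index - expected)

truth = [0, 0, 1, 1]
relabeled = [1, 1, 0, 0]  # same grouping, different label names
crossed = [0, 1, 0, 1]    # splits every true cluster

print(adjusted_rand_index(truth, relabeled))
print(adjusted_rand_index(truth, crossed))
```

Swapping the label names leaves the score at 1.0, because ARI compares groupings rather than label values. The crossed clustering scores below zero, meaning it agrees with the truth less than chance would.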
If classes are balanced and every error costs the same, accuracy is a reasonable starting point. For imbalanced data or asymmetric costs, look at precision, recall, F1, or AUC-PR. For regression, RMSE is the default when large errors are unacceptable and MAE when robustness to outliers matters. For ranking, NDCG dominates web search, MAP fits traditional information retrieval, and MRR fits question answering. For generation, no single automatic metric is sufficient, so most papers report several and supplement them with human evaluation.
A useful warning is Goodhart's law: when a measure becomes a target, it ceases to be a good measure. A model heavily optimized for a specific metric often gets worse on the underlying quality that metric was meant to capture. OpenAI researchers have studied this in the reinforcement learning from human feedback setting, showing that reward models eventually mislead their downstream policies if the optimizer pushes too hard. The practical implication is to evaluate on multiple metrics and treat any one number with healthy suspicion.
Imagine a game where you guess what fruit is in a bag without looking, then check if you were right. A metric is a way to keep score. You could count how many fruits you got exactly right, or how often you said "apple" when it was something else, or give yourself partial credit when you were close. Each scoring rule tells you something different about how good you are. Machine learning models play their own version of the same game, and metrics are the scoreboards.