Metric
See also: Machine learning terms
In machine learning, a metric is a quantitative measure used to evaluate how well a model or algorithm performs a task. Google's Machine Learning Glossary defines a metric simply as "a statistic that you care about," and an objective as "a metric that your algorithm is trying to optimize." [1] Metrics let researchers judge a model in isolation and compare different models trained on the same data; the choice of metric matters as much as the choice of algorithm, because the metric encodes the definition of what "good" looks like. Common examples include accuracy, precision and recall, the F1 score, and area under the ROC curve for classification, and mean absolute error, mean squared error, root mean squared error, and R-squared for regression. [1][6]
A metric is a number humans look at to decide whether a model is working. Different tasks call for different metrics. A spam filter is judged on whether it catches spam without flagging real mail. A regression model is judged on how far predictions land from the truth. A search engine is judged on whether its top documents are the ones the user wanted. No single number captures all of these, so the field has many measures.
Metrics are computed on data the model did not train on, typically a held-out validation or test set, so that the score reflects generalization rather than memorization. A benchmark packages a dataset together with one or more agreed metrics so that competing systems can be compared on equal footing.
A metric should not be confused with a loss function. The loss function is what the optimizer minimizes during training, so it needs to be differentiable (at least almost everywhere) for gradient descent to work on it. A metric is the number people look at to judge the trained model. Put briefly, loss functions are for machines and metrics are for people. Google's glossary draws the same line: loss is "a measure of how far a model's prediction is from its label" used during training, while a metric is the statistic you ultimately care about. [1]
The two often disagree. A model with slightly higher cross-entropy loss may have slightly higher classification accuracy, because the two reward different things: loss scores the predicted probabilities, while accuracy only scores the top class. Many useful metrics, such as accuracy, F1, and AUC, are non-differentiable or piecewise-constant, which is precisely why they are optimized indirectly through a smooth surrogate loss rather than directly.
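The disagreement can be reproduced on a toy example. The sketch below uses two hypothetical models (`model_a` and `model_b` are made-up probability lists, not taken from any cited source) and a fixed 0.5 threshold, which is an assumption of the illustration.

```python
import math

def log_loss(y_true, p_pos):
    """Mean binary cross-entropy for labels in {0, 1} and predicted P(y = 1)."""
    total = 0.0
    for y, p in zip(y_true, p_pos):
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return total / len(y_true)

def accuracy(y_true, p_pos, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    preds = [1 if p >= threshold else 0 for p in p_pos]
    return sum(int(y == p) for y, p in zip(y_true, preds)) / len(y_true)

y_true = [1, 1, 0, 0]
# Hypothetical model A: correct on all four examples, but barely confident.
model_a = [0.55, 0.55, 0.45, 0.45]
# Hypothetical model B: very confident on three examples, wrong on the last.
model_b = [0.99, 0.99, 0.01, 0.60]

for name, probs in [("A", model_a), ("B", model_b)]:
    print(name, round(accuracy(y_true, probs), 3), round(log_loss(y_true, probs), 3))
```

Model A has the higher accuracy and also the higher (worse) loss, because the loss penalizes its lack of confidence while accuracy only checks which side of the threshold each score falls on.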
Classification tasks assign each input to one of several discrete classes. Most classification metrics start from the confusion matrix, which counts four kinds of outcomes for a binary classifier: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). [2]
Accuracy is the proportion of correctly classified instances out of the total:
accuracy = (TP + TN) / (TP + TN + FP + FN)
It is the most intuitive classification metric and the default scoring metric for scikit-learn classifiers. [6] However, accuracy can be deeply misleading on imbalanced datasets. If 99% of credit card transactions are legitimate, a model that always predicts "legitimate" scores 99% accuracy while catching zero fraud. Google's ML crash course states plainly that "accuracy is usually a poor metric for assessing a model trained on a class-imbalanced dataset," and notes that precision and recall are usually more useful in that setting. [2]
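The fraud example can be worked through directly from confusion-matrix counts. The batch size of 10,000 transactions below is an assumed figure chosen to match the 99% legitimate rate in the text.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def recall(tp, fn):
    """Share of actual positives that were caught (0.0 if there are none)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Assumed batch: 10,000 transactions, 100 of them fraudulent (positive class).
# A model that always predicts "legitimate" never outputs a positive.
tp, fp = 0, 0
fn, tn = 100, 9900

always_legit_accuracy = accuracy(tp, tn, fp, fn)
always_legit_recall = recall(tp, fn)
print(always_legit_accuracy, always_legit_recall)
```

The accuracy is 0.99 while the recall on fraud is 0.0, which is the gap the text describes.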
Precision and recall isolate the two kinds of mistakes a classifier can make on the positive class.
Precision (positive predictive value) is the fraction of predicted positives that are actually positive:
precision = TP / (TP + FP)
Precision answers "of the items we flagged, how many should we have flagged?" It matters when a false positive is expensive, like sending innocent email to the spam folder. [2]
Recall (sensitivity, true positive rate) is the fraction of actual positives the model catches:
recall = TP / (TP + FN)
Recall answers "of the items that mattered, how many did we find?" It matters when a false negative is expensive, like missing a tumor on a medical scan. [2]
The two trade off against each other. Lowering the decision threshold can only keep recall the same or raise it, and it typically lowers precision because more borderline items get flagged; raising the threshold usually does the opposite.
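A small sketch of the trade-off, using made-up scores and labels. The helper name `precision_recall_at` and the two thresholds are illustrative choices, not part of any library.

```python
def precision_recall_at(y_true, scores, threshold):
    """Return (precision, recall) when scores >= threshold count as positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels and model scores, sorted by score for readability.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.70, 0.65, 0.40, 0.35, 0.20, 0.10]

strict = precision_recall_at(y_true, scores, 0.8)   # few, confident flags
lenient = precision_recall_at(y_true, scores, 0.3)  # many flags
print(strict, lenient)
```

On this data the strict threshold gives perfect precision with half the positives missed, while the lenient one finds every positive at the cost of two false alarms.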
The F1 score is the harmonic mean of precision and recall:
F1 = 2 * (precision * recall) / (precision + recall)
The harmonic mean penalizes a model that does well on one but poorly on the other, so F1 is only high when both precision and recall are reasonable. [2] The F-beta score generalizes F1 with a weight beta: values of beta above 1 weight recall more heavily, and values below 1 weight precision more heavily.
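The effect of the harmonic mean is easiest to see with lopsided inputs. This sketch uses the standard F-beta formula, (1 + beta^2) * P * R / (beta^2 * P + R); the precision and recall values are invented for illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives F1. Returns 0.0 when both inputs are 0."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical lopsided classifier: very precise, but it misses most positives.
p, r = 0.9, 0.1
print("arithmetic mean:", (p + r) / 2)
print("F1:", f_beta(p, r))
print("F2 (recall-weighted):", f_beta(p, r, beta=2.0))
print("F0.5 (precision-weighted):", f_beta(p, r, beta=0.5))
```

The arithmetic mean of 0.9 and 0.1 is 0.5, but F1 is only about 0.18, and weighting recall more heavily (beta = 2) pulls the score down further.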
The ROC curve plots the true positive rate against the false positive rate as the classification threshold sweeps from 0 to 1. AUC-ROC is the area under the ROC curve, a single number between 0 and 1 summarizing how well the model ranks positive examples above negative ones. A value of 0.5 means random guessing; 1.0 means perfect separation. AUC-ROC has a probabilistic interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative example. [2][6]
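The probabilistic interpretation suggests a direct, if slow, way to compute AUC-ROC: compare every positive with every negative. The sketch below does exactly that, counting ties as half a win (a common convention, assumed here); production libraries use faster sorting-based methods.

```python
def auc_roc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half, by assumption
    return wins / (len(pos) * len(neg))

# Hypothetical labels and scores: one negative outranks one positive.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc_roc(y_true, scores))
```

Eight of the nine positive-negative pairs are ordered correctly here, so the result is 8/9.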
The precision-recall curve plots precision against recall across thresholds, and AUC-PR is the area under it. AUC-PR is generally more informative than AUC-ROC on heavily imbalanced datasets, because the negative class dominates the false positive rate and makes ROC curves look optimistically good. [2] Fraud detection and rare-disease prediction often report AUC-PR alongside or instead of AUC-ROC.
Specificity (TN / (TN + FP)) is the recall of the negative class. Balanced accuracy averages sensitivity and specificity. The Matthews correlation coefficient (MCC) uses all four cells of the confusion matrix and treats both error types symmetrically. Log loss (cross-entropy) scores the predicted probabilities themselves, penalizing confident mistakes heavily, rather than only checking whether the top class is correct. [6]
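As a rough illustration, balanced accuracy and MCC can both be computed from the four confusion-matrix counts. The MCC formula used is the standard one, (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)); returning 0.0 when the denominator is zero is a convention assumed for this sketch, and the counts are invented.

```python
import math

def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity and specificity (each 0.0 if its class is empty)."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when undefined (assumed convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical "always negative" model on 100 positives and 9,900 negatives.
print(balanced_accuracy(tp=0, tn=9900, fp=0, fn=100), mcc(tp=0, tn=9900, fp=0, fn=100))
# Hypothetical model that catches 80 of the positives with 100 false alarms.
print(balanced_accuracy(tp=80, tn=9800, fp=100, fn=20), mcc(tp=80, tn=9800, fp=100, fn=20))
```

The always-negative model scores 99% plain accuracy on this data, yet its balanced accuracy is 0.5 and its MCC is 0.0, which is why these measures are often preferred under class imbalance.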
Regression tasks predict a continuous numerical value. The metrics below all measure some notion of how far predictions land from the truth.
Mean Squared Error (MSE) is the average of squared differences between predicted and actual values:
MSE = (1/n) * sum((y_i - y_hat_i)^2)
Squaring penalizes large errors heavily, making MSE sensitive to outliers. It is differentiable everywhere, which is why squared error is one of the most common training losses for regression. [6] The reported MSE is in squared units of the target, which is awkward to interpret when the target is dollars or meters.
Root Mean Squared Error (RMSE) is the square root of MSE:
RMSE = sqrt(MSE)
RMSE is in the same units as the target, which makes it the most commonly reported regression metric in practice. It still penalizes large errors strongly, so a model with a few wild predictions will have worse RMSE than one with consistently moderate errors. [6]
Mean Absolute Error (MAE) averages the absolute differences between predicted and actual values:
MAE = (1/n) * sum(|y_i - y_hat_i|)
MAE treats every unit of error linearly, so a prediction off by 100 contributes ten times as much as one off by 10, not 100 times as much as it would under squared error. This makes MAE more robust to outliers than MSE or RMSE. [6] The tradeoff is that MAE is not differentiable at zero, which complicates its use as a training loss for some optimizers.
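The different outlier sensitivity of these three measures shows up on a small example. Both prediction lists below are hypothetical and were chosen so that they have the same MAE.

```python
import math

def mse(y_true, y_pred):
    """Mean of squared errors."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Square root of MSE, in the units of the target."""
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean of absolute errors."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

y_true = [10.0, 20.0, 30.0, 40.0]
steady = [12.0, 18.0, 32.0, 38.0]  # every prediction is off by 2
spiky = [10.0, 20.0, 30.0, 48.0]   # three exact hits, one miss of 8

for name, pred in [("steady", steady), ("spiky", spiky)]:
    print(name, mae(y_true, pred), mse(y_true, pred), rmse(y_true, pred))
```

Both models have an MAE of 2, but the single large miss doubles the RMSE (4 versus 2) and quadruples the MSE (16 versus 4).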
The coefficient of determination, R^2, is the fraction of variance in the target that the model explains:
R^2 = 1 - (sum((y_i - y_hat_i)^2) / sum((y_i - y_bar)^2))
R^2 of 1 means the model perfectly explains the target; 0 means it does no better than always predicting the mean; negative values mean it is worse than that baseline. R^2 is the default scoring metric for regression in scikit-learn, the value returned by an estimator's score() method. [6] Adjusted R^2 corrects for the number of features.
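The three regimes of R^2 (perfect, mean baseline, worse than baseline) can be checked with a direct implementation of the formula. The function name `r_squared` and the data are illustrative; raising an error for a constant target is a choice made for this sketch.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the target is constant")
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
print(r_squared(y_true, y_true))                      # perfect predictions
print(r_squared(y_true, [3.0] * 5))                   # always predict the mean
print(r_squared(y_true, [5.0, 4.0, 3.0, 2.0, 1.0]))   # reversed predictions
```

The reversed predictions give a negative score because their squared error is larger than that of the constant mean predictor.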
Mean absolute percentage error (MAPE) expresses error as a percentage of the true value, which is intuitive but unstable when the truth is near zero. Huber loss combines the two behaviors above: it is quadratic like MSE for small errors and linear like MAE for large ones. [6]
Ranking tasks order a list of candidates by relevance. The user usually only looks at the top of the list, so these metrics weight the top of the ranking more heavily than the bottom. [9]
Mean Reciprocal Rank (MRR) focuses on the position of the first relevant item. For each query, you compute 1 divided by the rank of the first correct answer, then average across queries. If the first relevant item is at rank 1 the contribution is 1; at rank 2, 0.5; at rank 10, 0.1. MRR fits scenarios with a single right answer, like factual question answering. It ignores anything after the first relevant hit. [9]
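The calculation described above is short enough to sketch directly. Each query is represented here as a list of 0/1 relevance flags in ranked order, which is a representation assumed for the example.

```python
def reciprocal_rank(relevance):
    """1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    """Average reciprocal rank over a list of ranked relevance lists."""
    return sum(reciprocal_rank(q) for q in queries) / len(queries)

# Hypothetical queries: first relevant hit at rank 1, rank 2, and rank 10.
queries = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
]
print(mean_reciprocal_rank(queries))
```

The three queries contribute 1, 0.5, and 0.1, so the MRR is their average, about 0.533.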
Mean Average Precision (MAP) computes, for each query, the precision at the rank of every relevant item and averages those values, then averages the result across queries. Per-query average precision approximates the area under the precision-recall curve for that ranking, so MAP rewards systems that pack relevant items near the top while still finding most of them. MAP works well when relevance is binary. [9]
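A minimal sketch of the computation for binary relevance follows. It assumes every relevant item for a query appears somewhere in the ranked list, so the sum is divided by the number of relevant items found; evaluations where some relevant items are never retrieved would divide by the total number of relevant items instead.

```python
def average_precision(relevance):
    """Mean of precision@k taken at each rank k that holds a relevant item."""
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0

def mean_average_precision(queries):
    """Average of per-query average precision."""
    return sum(average_precision(q) for q in queries) / len(queries)

# Hypothetical ranked relevance flags for two queries.
queries = [
    [1, 0, 1, 0],  # precision 1/1 at rank 1, 2/3 at rank 3
    [0, 1, 1],     # precision 1/2 at rank 2, 2/3 at rank 3
]
print(mean_average_precision(queries))
```

Unlike MRR, moving the second relevant item higher in either list would raise the score, since every relevant item contributes.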
Normalized Discounted Cumulative Gain (NDCG) handles graded relevance, where some documents are more relevant than others on a numeric scale. NDCG discounts each item's relevance by a logarithm of its rank, commonly dividing by log2(rank + 1), so a highly relevant document at position 1 contributes more than the same document at position 10. The result is normalized by the ideal DCG, the DCG of the best possible ordering, so NDCG falls between 0 and 1. [8] Web search engines have used NDCG as their workhorse offline metric for two decades, because it captures both relevance and order in one number.
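One common formulation can be sketched as below. Two assumptions are made: the gain is the raw relevance grade (some variants use 2^rel - 1 instead), and the ideal ordering is obtained by sorting the same list, which presumes the list contains every judged document for the query.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )

def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

# Hypothetical graded judgments (0 = irrelevant, 3 = highly relevant).
print(ndcg([3, 2, 1, 0]))  # already in ideal order
print(ndcg([0, 1, 2, 3]))  # best document ranked last
```

The same four documents score 1.0 in ideal order and noticeably lower when the order is reversed, which is the order sensitivity the metric is designed to capture.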
Evaluating generated text is harder than evaluating a class label or a number, because there is rarely one correct output. The community has settled on a handful of automatic metrics that compare a generated string to one or more reference strings.
BLEU (Bilingual Evaluation Understudy) was introduced by Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu at IBM and presented at ACL in July 2002 to evaluate machine translation. [3][4] BLEU computes a modified n-gram precision between the candidate translation and one or more reference translations, multiplied by a brevity penalty that discourages overly short outputs. Scores range from 0 to 1, often reported as a percentage. The authors framed it as "a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run." [3] BLEU was initially greeted with skepticism but quickly became the standard metric in machine translation because it cut the cost of evaluating new systems, and the original paper has accumulated close to 20,000 citations. [4] In practice its correlation with human judgment is often loose, particularly at the sentence level, so it is now usually reported alongside other metrics.
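The two building blocks named above, modified (clipped) n-gram precision and the brevity penalty, can be sketched for a single reference. This is not a full BLEU implementation: the real metric combines precisions for several n-gram orders over a whole corpus and supports multiple references. The whitespace tokenization and function names are assumptions of the sketch.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in the reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

def brevity_penalty(candidate_len, reference_len):
    """1.0 for candidates longer than the reference, exp(1 - r/c) otherwise."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)

reference = "the cat is on the mat".split()
degenerate = "the the the the the the the".split()
print(modified_precision(degenerate, reference, 1))
print(brevity_penalty(3, 6))
```

Without clipping, the degenerate candidate would score a unigram precision of 7/7; with clipping it scores 2/7, because "the" appears only twice in the reference.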
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) was designed for summarization and is the recall counterpart to BLEU. ROUGE-N measures n-gram recall; ROUGE-L uses the longest common subsequence between generated and reference text. Where BLEU asks how much of the generated text matches a reference, ROUGE asks how much of the reference shows up in the generated text. ROUGE is the standard automatic metric for text summarization.
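The recall orientation can be illustrated with simplified versions of ROUGE-N and ROUGE-L for one reference. Only the recall component is computed here (published ROUGE scores often also report precision and an F-measure), and whitespace tokenization is an assumption of the sketch.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    ref = Counter(ngrams(reference, n))
    cand = Counter(ngrams(candidate, n))
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_recall(candidate, reference):
    """LCS length divided by reference length."""
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0

reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()
print(rouge_n_recall(candidate, reference, 1))
print(rouge_n_recall(candidate, reference, 2))
print(rouge_l_recall(candidate, reference))
```

A single substituted word costs one of six unigrams but two of five bigrams, which is why higher-order ROUGE-N scores fall faster than ROUGE-1 or ROUGE-L.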
METEOR (Metric for Evaluation of Translation with Explicit Ordering) improves on BLEU by considering stemming, synonyms, and word order. It correlates better with human judgment on sentence-level evaluation, at the cost of being slower and requiring language-specific resources like WordNet.
Perplexity measures how well a probability distribution predicts a sample. For a language model, perplexity is the exponential of the average negative log-likelihood per token: a perplexity of 20 means the model is on average as uncertain as if choosing uniformly from 20 equally likely next tokens. Lower is better. Perplexity is the dominant intrinsic metric for language model pretraining, though it does not directly measure whether the model produces text humans find useful.
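The definition translates into a few lines of code. The input here is a list of the probabilities a model assigned to each token that actually occurred; the values are invented, and natural logarithms are used throughout.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-likelihood per token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that assigns probability 1/20 to every observed token.
print(perplexity([1 / 20] * 5))
# A model that is more confident on one token than the other.
print(perplexity([0.5, 0.25]))
```

The uniform case recovers the "choosing among 20 equally likely tokens" reading, and the second case shows that perplexity is the reciprocal of the geometric mean of the token probabilities.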
Classic n-gram metrics struggle with the open-ended outputs of modern large language models. The field has moved toward embedding-based metrics like BERTScore, which compares contextual embeddings rather than surface tokens, and learned metrics like BLEURT, which predict human ratings. LLM-as-judge evaluation, where a strong model scores the outputs of a weaker one, is also increasingly common.
Clustering is unsupervised, so evaluation is trickier than for supervised tasks. The Adjusted Rand Index (ARI) compares a predicted clustering to a ground-truth clustering, correcting for chance. The Silhouette Score works without ground truth by measuring how tight each cluster is and how well separated it is from the others, with values from -1 to 1. Normalized Mutual Information (NMI) measures clustering agreement using information theory. [6]
If classes are balanced and every error costs the same, accuracy is a reasonable starting point. For imbalanced data or asymmetric costs, look at precision, recall, F1, or AUC-PR. [2] For regression, RMSE is the default when large errors are unacceptable and MAE when robustness to outliers matters. For ranking, NDCG dominates web search, MAP fits traditional information retrieval, and MRR fits question answering. For generation, no single automatic metric is sufficient, so most papers report several and supplement them with human evaluation.
The table below summarizes the common default metrics by task type.
| Task | Common metrics | Typical default | Range |
|---|---|---|---|
| Binary classification | Accuracy, precision, recall, F1, AUC-ROC, AUC-PR | Accuracy (balanced); F1 or AUC-PR (imbalanced) | 0 to 1 |
| Regression | MAE, MSE, RMSE, R^2, MAPE | R^2 (scikit-learn); RMSE (reporting) | R^2: up to 1; errors: 0 and up |
| Ranking / IR | MRR, MAP, NDCG | NDCG (web search) | 0 to 1 |
| Text generation | BLEU, ROUGE, METEOR, BERTScore, perplexity | Multiple plus human eval | varies |
| Clustering | ARI, Silhouette, NMI | ARI (with labels); Silhouette (without) | ARI/Silhouette: -1 to 1 |
A useful warning is Goodhart's law, most often quoted in the phrasing of the anthropologist Marilyn Strathern, who wrote in 1997: "When a measure becomes a target, it ceases to be a good measure." [10] A model heavily optimized for a specific metric often gets worse on the underlying quality that metric was meant to capture. OpenAI researchers studied this in the reinforcement learning from human feedback setting, showing that "optimizing against the reward model is expected to make the policy worse with respect to the gold reward model after a certain point, due to Goodhart's law," because the reward model is an imperfect proxy for human preference. [7] The practical implication is to evaluate on multiple metrics and treat any one number with healthy suspicion.
Imagine a game where you guess what fruit is in a bag without looking, then check if you were right. A metric is a way to keep score. You could count how many fruits you got exactly right, or how often you said "apple" when it was something else, or give yourself partial credit when you were close. Each scoring rule tells you something different about how good you are. Machine learning models play their own version of the same game, and metrics are the scoreboards.