# Metric

> Source: https://aiwiki.ai/wiki/metric
> Updated: 2026-06-27
> Categories: Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

In [machine learning](/wiki/machine_learning), a **metric** is a quantitative measure used to evaluate how well a model or algorithm performs a task. Google's Machine Learning Glossary defines a metric simply as "a statistic that you care about," and an objective as "a metric that your algorithm is trying to optimize." [1] Metrics let researchers judge a model in isolation and compare different models trained on the same data; the choice of metric matters as much as the choice of algorithm, because the metric encodes the definition of what "good" looks like. Common examples include [accuracy](/wiki/accuracy), [precision](/wiki/precision) and [recall](/wiki/recall), the [F1 score](/wiki/f1_score), and area under the ROC curve for classification, and mean absolute error, mean squared error, root mean squared error, and R-squared for regression. [1][6]

## What is a metric in machine learning?

A metric is a number humans look at to decide whether a model is working. Different tasks call for different metrics. A spam filter is judged on whether it catches spam without flagging real mail. A regression model is judged on how far predictions land from the truth. A search engine is judged on whether its top documents are the ones the user wanted. No single number captures all of these, so the field has many measures.

Metrics are computed on data the model did not train on, typically a held-out validation or test set, so that the score reflects generalization rather than memorization. A [benchmark](/wiki/benchmark) packages a dataset together with one or more agreed metrics so that competing systems can be compared on equal footing.

## What is the difference between a metric and a loss function?

A metric should not be confused with a [loss function](/wiki/loss_function). The loss function is what the [optimizer](/wiki/optimizer) minimizes during training; it must be differentiable (or at least differentiable almost everywhere) so [gradient descent](/wiki/gradient_descent) can work on it. A metric is a number humans look at to judge the result, during or after training. Loss functions are for machines; metrics are for people. Google's glossary draws the same line: loss is "a measure of how far a model's prediction is from its label" used during training, while a metric is the statistic you ultimately care about. [1]

The two often disagree. A model with slightly higher cross-entropy loss may have slightly higher classification accuracy, because the two reward different things: loss scores the predicted probabilities, while accuracy only scores the top class. Many useful metrics, such as accuracy, F1, and AUC, are non-differentiable or piecewise-constant, which is precisely why they are optimized indirectly through a smooth surrogate loss rather than directly.
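The disagreement between loss and accuracy can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names, the 0.5 threshold, and the two sets of predicted probabilities are assumptions made up for the example.

```python
import math

def cross_entropy(probs, labels):
    """Mean binary cross-entropy; probs are predicted P(y = 1)."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return total / len(labels)

def accuracy(probs, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(1 for p, y in zip(preds, labels) if p == y) / len(labels)

labels = [1, 1, 0, 0]
# Hypothetical model A: confident and right three times, mildly wrong once.
model_a = [0.9, 0.9, 0.1, 0.6]
# Hypothetical model B: right every time, but only barely.
model_b = [0.55, 0.55, 0.45, 0.45]

for name, probs in [("A", model_a), ("B", model_b)]:
    print(name, round(cross_entropy(probs, labels), 3), accuracy(probs, labels))
# A 0.308 0.75
# B 0.598 1.0
```

Model B has the higher (worse) loss and the higher (better) accuracy, because the loss rewards confidence while accuracy only checks which side of the threshold each prediction lands on.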

## What metrics are used for classification?

Classification tasks assign each input to one of several discrete classes. Most classification metrics start from the **[confusion matrix](/wiki/confusion_matrix)**, which counts four kinds of outcomes for a binary classifier: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). [2]

### Accuracy

[Accuracy](/wiki/accuracy) is the proportion of correctly classified instances out of the total:

`accuracy = (TP + TN) / (TP + TN + FP + FN)`

It is the most intuitive classification metric and the default scoring metric for scikit-learn classifiers. [6] However, accuracy can be deeply misleading on imbalanced datasets. If 99% of credit card transactions are legitimate, a model that always predicts "legitimate" scores 99% accuracy while catching zero fraud. Google's ML crash course states plainly that "accuracy is usually a poor metric for assessing a model trained on a class-imbalanced dataset," and notes that precision and recall are usually more useful in that setting. [2]
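As a rough illustration of the imbalance problem, the following sketch counts the four confusion-matrix cells and applies the accuracy formula to an always-"legitimate" classifier. The helper names and the synthetic 1,000-transaction dataset are assumptions for the example, not part of any library.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy_from_counts(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# Synthetic data: 10 fraudulent transactions (label 1) among 1,000.
y_true = [1] * 10 + [0] * 990
# A "model" that always predicts legitimate (label 0).
y_pred = [0] * 1000

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)                        # 0 990 0 10
print(accuracy_from_counts(tp, tn, fp, fn))  # 0.99
```

The score is 99% even though every fraudulent transaction ends up as a false negative.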

### Precision and recall

[Precision](/wiki/precision) and [recall](/wiki/recall) isolate the two kinds of mistakes a classifier can make on the positive class.

**Precision** (positive predictive value) is the fraction of predicted positives that are actually positive:

`precision = TP / (TP + FP)`

Precision answers "of the items we flagged, how many should we have flagged?" It matters when a false positive is expensive, like sending innocent email to the spam folder. [2]

**Recall** (sensitivity, true positive rate) is the fraction of actual positives the model catches:

`recall = TP / (TP + FN)`

Recall answers "of the items that mattered, how many did we find?" It matters when a false negative is expensive, like missing a tumor on a medical scan. [2]

The two trade off against each other. Lowering the decision threshold raises recall and lowers precision; raising the threshold does the opposite.
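The threshold trade-off can be seen by sweeping a cutoff over a small set of scores. This is a minimal sketch: the scores and labels are invented, and returning 0.0 when a denominator is zero is a convention chosen here, not a universal rule.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when every score >= threshold is predicted positive."""
    tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, scores) if s < threshold and t == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # convention for "no positives predicted"
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Invented example: four positives and four negatives with model scores.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.70, 0.65, 0.40, 0.35, 0.20, 0.10]

for threshold in (0.8, 0.5, 0.3):
    p, r = precision_recall(y_true, scores, threshold)
    print(threshold, round(p, 3), round(r, 3))
# 0.8 1.0 0.5
# 0.5 0.75 0.75
# 0.3 0.667 1.0
```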

### F1 score

The **[F1 score](/wiki/f1_score)** is the harmonic mean of precision and recall:

`F1 = 2 * (precision * recall) / (precision + recall)`

The harmonic mean penalizes a model that does well on one but poorly on the other, so F1 is only high when both precision and recall are reasonable. [2] The F-beta score generalizes F1 by weighting recall over precision (or vice versa).
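A short sketch of the F-beta family, using the standard form `F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)`, where beta greater than 1 favors recall and beta less than 1 favors precision. The function name and the input values are illustrative assumptions.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta = 1 gives the ordinary F1 score."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # convention: avoid dividing by zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# High precision but poor recall: the arithmetic mean would be 0.55,
# while the harmonic mean stays close to the weaker of the two.
print(round(f_beta(1.0, 0.1), 3))            # 0.182

# Same precision/recall pair, different weighting.
print(round(f_beta(0.5, 1.0, beta=2.0), 3))  # 0.833 (recall-weighted)
print(round(f_beta(0.5, 1.0, beta=0.5), 3))  # 0.556 (precision-weighted)
```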

### AUC-ROC

The **ROC curve** plots the true positive rate against the false positive rate as the classification threshold sweeps from 0 to 1. **AUC-ROC** is the [area under the ROC curve](/wiki/area_under_the_roc_curve), a single number between 0 and 1 summarizing how well the model ranks positive examples above negative ones. A value of 0.5 means random guessing; 1.0 means perfect separation. AUC-ROC has a probabilistic interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative example. [2][6]
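The probabilistic interpretation suggests a direct, if slow, way to compute AUC-ROC: compare every positive score with every negative score. The sketch below assumes ties count as half a win, a common convention; the data and function name are invented for illustration, and production code would normally use a library routine instead of this quadratic loop.

```python
def auc_roc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumed tie-handling convention
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.70, 0.65, 0.40, 0.35, 0.20, 0.10]
print(auc_roc_pairwise(y_true, scores))  # 0.9375
```

Fifteen of the sixteen positive-negative pairs are ordered correctly; the only miss is the positive scored 0.40 against the negative scored 0.65.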

### AUC-PR

The **[precision-recall curve](/wiki/precision-recall_curve)** plots precision against recall across thresholds, and **AUC-PR** is the area under it. AUC-PR is generally more informative than AUC-ROC on heavily imbalanced datasets, because the negative class dominates the false positive rate and makes ROC curves look optimistically good. [2] Fraud detection and rare-disease prediction often report AUC-PR alongside or instead of AUC-ROC.

### Other classification metrics

Specificity (TN / (TN + FP)) is the recall of the negative class. Balanced accuracy averages sensitivity and specificity. The Matthews correlation coefficient (MCC) uses all four cells of the confusion matrix and treats both error types symmetrically. Log loss (cross-entropy) scores the predicted probabilities themselves, penalizing confident wrong predictions, rather than only checking whether the top class is correct. [6]
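These count-based metrics can be sketched directly from the confusion-matrix cells. The MCC formula used here is the standard one, `(TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))`; returning 0.0 when a denominator is zero is an assumed convention, and the counts are invented.

```python
import math

def specificity(tn, fp):
    return tn / (tn + fp) if (tn + fp) else 0.0

def balanced_accuracy(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return (sensitivity + specificity(tn, fp)) / 2

def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# A classifier that never predicts the positive class on 10 positives
# and 990 negatives: accuracy would be 0.99, but these metrics are not fooled.
print(balanced_accuracy(tp=0, tn=990, fp=0, fn=10))  # 0.5
print(mcc(tp=0, tn=990, fp=0, fn=10))                # 0.0

# A classifier that finds 8 of the 10 positives at the cost of 20 false alarms.
print(round(balanced_accuracy(tp=8, tn=970, fp=20, fn=2), 3))  # 0.89
```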

## What metrics are used for regression?

Regression tasks predict a continuous numerical value. The metrics below all measure some notion of how far predictions land from the truth.

### Mean squared error

**Mean Squared Error (MSE)** is the average of squared differences between predicted and actual values:

`MSE = (1/n) * sum((y_i - y_hat_i)^2)`

Squaring penalizes large errors heavily, making MSE sensitive to outliers. It is differentiable everywhere, which is why squared error is one of the most common training losses for regression. [6] The reported MSE is in squared units of the target, which is awkward to interpret when the target is dollars or meters.

### Root mean squared error

**Root Mean Squared Error (RMSE)** is the square root of MSE:

`RMSE = sqrt(MSE)`

RMSE is in the same units as the target, which makes it one of the most commonly reported regression metrics in practice. It still penalizes large errors strongly, so a model with a few wild predictions will have worse RMSE than one with consistently moderate errors. [6]

### Mean absolute error

**Mean Absolute Error (MAE)** averages the absolute differences between predicted and actual values:

`MAE = (1/n) * sum(|y_i - y_hat_i|)`

MAE treats every unit of error linearly, so a prediction off by 100 contributes ten times as much as one off by 10, not 100 times. This makes MAE more robust to outliers than MSE or RMSE. [6] The tradeoff is that MAE is not differentiable at zero error, which complicates its use as a training loss for some optimizers.
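A small comparison makes the difference in outlier sensitivity concrete. In this sketch, two invented sets of predictions have the same total absolute error, spread evenly in one case and concentrated in a single prediction in the other; the function names are illustrative.

```python
import math

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [10.0, 20.0, 30.0, 40.0]
steady = [12.0, 18.0, 32.0, 38.0]  # every prediction is off by 2
spiky = [10.0, 20.0, 30.0, 48.0]   # three exact predictions, one off by 8

print(mae(y_true, steady), rmse(y_true, steady))  # 2.0 2.0
print(mae(y_true, spiky), rmse(y_true, spiky))    # 2.0 4.0
```

MAE cannot tell the two apart, while RMSE doubles for the predictions containing the single large miss.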

### R-squared

The **coefficient of determination**, R^2, is the fraction of variance in the target that the model explains:

`R^2 = 1 - (sum((y_i - y_hat_i)^2) / sum((y_i - y_bar)^2))`

R^2 of 1 means the model perfectly explains the target; 0 means it does no better than always predicting the mean; negative values mean it is worse than that baseline. R^2 is the default scoring metric for regression in scikit-learn, the value returned by an estimator's score() method. [6] Adjusted R^2 corrects for the number of features.
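The three regimes of R^2 (perfect, mean baseline, worse than baseline) can be checked with a direct implementation of the formula. This is a sketch with invented data; it assumes the target is not constant, since the denominator would otherwise be zero.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)  # assumed non-zero
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]

print(r_squared(y_true, [1.0, 2.0, 3.0, 4.0]))  # 1.0  (perfect predictions)
print(r_squared(y_true, [2.5, 2.5, 2.5, 2.5]))  # 0.0  (always predicts the mean)
print(r_squared(y_true, [4.0, 3.0, 2.0, 1.0]))  # -3.0 (worse than the mean baseline)
```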

### Other regression metrics

Mean absolute percentage error (MAPE) expresses error as a percentage of the true value, which is intuitive but unstable when the truth is near zero. Huber loss is quadratic for small errors and linear for large ones, combining MSE and MAE behavior. [6]
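Both measures fit in a few lines. The sketch below uses the usual Huber definition with a threshold parameter `delta` (0.5 * r^2 for |r| <= delta, otherwise delta * (|r| - 0.5 * delta)); the default `delta=1.0`, the function names, and the data are assumptions for illustration. MAPE as written is undefined when a true value is exactly zero.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (true values must be non-zero)."""
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

def huber(residual, delta=1.0):
    """Huber loss for one residual: quadratic near zero, linear in the tails."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)

# Being off by 10% on every prediction gives a MAPE of about 10.
print(round(mape([100.0, 200.0], [110.0, 180.0]), 6))  # 10.0

# The same absolute error of 1.0 explodes when the true value is tiny.
print(mape([0.01], [1.01]) > 1000.0)                   # True

print(huber(0.5), huber(3.0))                          # 0.125 2.5
```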

## What metrics are used for ranking and information retrieval?

Ranking tasks order a list of candidates by relevance. The user usually only looks at the top of the list, so these metrics weight the top of the ranking more heavily than the bottom. [9]

### Mean reciprocal rank

**Mean Reciprocal Rank (MRR)** focuses on the position of the first relevant item. For each query, you compute 1 divided by the rank of the first correct answer, then average across queries. If the first relevant item is at rank 1 the contribution is 1; at rank 2, 0.5; at rank 10, 0.1. MRR fits scenarios with a single right answer, like factual question answering. It ignores anything after the first relevant hit. [9]
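The computation described above can be sketched as follows. Each query is represented as a list of 0/1 relevance flags in ranked order; that representation, the function name, and scoring a query with no relevant result as 0 are assumptions of this example.

```python
def mean_reciprocal_rank(ranked_flags_per_query):
    """Average of 1 / rank of the first relevant item over all queries."""
    total = 0.0
    for flags in ranked_flags_per_query:
        for rank, relevant in enumerate(flags, start=1):
            if relevant:
                total += 1.0 / rank
                break  # anything after the first relevant hit is ignored
    return total / len(ranked_flags_per_query)

queries = [
    [1, 0, 0],                       # first relevant item at rank 1  -> 1.0
    [0, 1, 0],                       # first relevant item at rank 2  -> 0.5
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],  # first relevant item at rank 10 -> 0.1
]
print(round(mean_reciprocal_rank(queries), 4))  # 0.5333
```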

### Mean average precision

**Mean Average Precision (MAP)** averages precision at each relevant item in the ranking, then averages across queries. MAP approximates the area under the precision-recall curve for the ranking, and rewards systems that pack relevant items near the top while still finding most of them. MAP works well when relevance is binary. [9]
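A minimal sketch of MAP over binary relevance flags. It assumes every relevant item for a query appears somewhere in the ranked list, so dividing by the number of relevant items found is equivalent to dividing by the total number of relevant items; the names and data are invented.

```python
def average_precision(flags):
    """Mean of precision@k taken at each rank k that holds a relevant item."""
    hits = 0
    precisions = []
    for rank, relevant in enumerate(flags, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(flags_per_query):
    return sum(average_precision(f) for f in flags_per_query) / len(flags_per_query)

queries = [
    [1, 0, 1, 0],  # precision 1/1 at rank 1 and 2/3 at rank 3
    [0, 1, 1],     # precision 1/2 at rank 2 and 2/3 at rank 3
]
print(round(average_precision(queries[0]), 4))   # 0.8333
print(round(average_precision(queries[1]), 4))   # 0.5833
print(round(mean_average_precision(queries), 4)) # 0.7083
```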

### Normalized discounted cumulative gain

**Normalized Discounted Cumulative Gain (NDCG)** handles graded relevance, where some documents are more relevant than others on a numeric scale. NDCG discounts each item's relevance by a logarithm of its position (commonly log2(rank + 1)), so a highly relevant document at position 1 contributes more than the same document at position 10. The result is normalized by the ideal DCG, so NDCG falls between 0 and 1. [8] Web search engines have used NDCG as their workhorse offline metric for two decades, because it captures both relevance and order in one number.
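One common formulation divides each graded relevance value by log2(rank + 1); other variants (for example, using 2^relevance - 1 as the gain) exist, so treat the sketch below as one reasonable choice rather than the only definition. The relevance grades are invented.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with the log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains):
    """DCG divided by the DCG of the same items in ideal (descending) order."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Graded relevance of the returned documents, in ranked order (3 = best).
ranking = [3, 2, 0, 1]
print(round(dcg(ranking), 4))   # 4.6925
print(round(ndcg(ranking), 4))  # 0.9854

# Placing the documents in ideal order gives the maximum score.
print(ndcg([3, 2, 1, 0]))       # 1.0
```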

## What metrics are used for text generation and language models?

Evaluating generated text is harder than evaluating a class label or a number, because there is rarely one correct output. The community has settled on a handful of automatic metrics that compare a generated string to one or more reference strings.

### BLEU

**[BLEU](/wiki/bleu) (Bilingual Evaluation Understudy)** was introduced by Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu at IBM and presented at ACL in July 2002 to evaluate machine translation. [3][4] BLEU computes a modified n-gram precision between the candidate translation and one or more reference translations, multiplied by a brevity penalty that discourages overly short outputs. Scores range from 0 to 1, often reported as a percentage. The authors framed it as "a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run." [3] BLEU was initially greeted with skepticism but quickly became the standard metric in machine translation because it cut the cost of evaluating new systems, and the original paper has accumulated close to 20,000 citations. [4] Despite the authors' claim, BLEU often correlates only loosely with human judgment, so it is now usually reported alongside other metrics.
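The two ingredients named above, clipped ("modified") n-gram precision and the brevity penalty, can be sketched in simplified form. This is not the full metric: it assumes a single reference, sentence-level scoring, n-grams up to length 2 instead of the usual 4, and no smoothing. All function names are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as many times as it occurs in the reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

def simple_bleu(candidate, reference, max_n=2):
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing in this sketch
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * geo_mean

reference = "the cat is on the mat".split()

# Clipping stops a degenerate candidate from scoring 7/7 on unigrams.
degenerate = "the the the the the the the".split()
print(modified_precision(degenerate, reference, 1) == 2 / 7)  # True

shorter = "the cat is on mat".split()
print(round(simple_bleu(shorter, reference), 3))              # 0.709
```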

### ROUGE

**ROUGE (Recall-Oriented Understudy for Gisting Evaluation)** was designed for summarization and is the recall counterpart to BLEU. ROUGE-N measures n-gram recall; ROUGE-L uses the longest common subsequence between generated and reference text. Where BLEU asks how much of the generated text matches a reference, ROUGE asks how much of the reference shows up in the generated text. ROUGE is the standard automatic metric for [text summarization](/wiki/text_summarization).
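The recall orientation can be sketched for ROUGE-N and ROUGE-L. Only the recall component is shown, with whitespace tokenization and a single reference; full ROUGE implementations also report precision and an F-measure and may apply stemming. Function names and sentences are invented.

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate (clipped)."""
    def grams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = grams(candidate), grams(reference)
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_recall(candidate, reference):
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0

reference = "the cat sat on the mat".split()
scrambled = "the mat the cat on".split()

# Five of six reference words are present, but their order is mostly wrong.
print(round(rouge_n_recall(scrambled, reference, n=1), 3))  # 0.833
print(rouge_l_recall(scrambled, reference))                 # 0.5
```

ROUGE-1 ignores word order, while ROUGE-L only credits the words that appear in the same relative order as in the reference.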

### METEOR

**METEOR (Metric for Evaluation of Translation with Explicit Ordering)** improves on BLEU by considering stemming, synonyms, and word order. It correlates better with human judgment on sentence-level evaluation, at the cost of being slower and requiring language-specific resources like WordNet.

### Perplexity

**[Perplexity](/wiki/perplexity)** measures how well a probability distribution predicts a sample. For a [language model](/wiki/language_model), perplexity is the exponential of the average negative log-likelihood per token: a perplexity of 20 means the model is on average as uncertain as if choosing uniformly from 20 equally likely next tokens. Lower is better. Perplexity is the dominant intrinsic metric for language model pretraining, though it does not directly measure whether the model produces text humans find useful.
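The definition translates directly into code. The sketch assumes you already have the probability the model assigned to each actual next token (all strictly positive); the function name and the numbers are invented for illustration.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-likelihood per token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that always assigns probability 1/20 to the correct token
# has a perplexity of 20 (up to floating-point rounding).
print(round(perplexity([0.05] * 8), 6))    # 20.0

# Higher probabilities on the observed tokens mean lower perplexity.
print(round(perplexity([0.5, 0.25]), 3))   # 2.828
```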

### Newer LLM metrics

Classic n-gram metrics struggle with the open-ended outputs of modern large language models. The field has moved toward embedding-based metrics like [BERTScore](/wiki/bertscore), which compares contextual embeddings rather than surface tokens, and learned metrics like BLEURT, which predict human ratings. LLM-as-judge evaluation, where a strong model scores the outputs of a weaker one, is also increasingly common.

## What metrics are used for clustering?

Clustering is unsupervised, so evaluation is trickier than for supervised tasks. The Adjusted Rand Index (ARI) compares a predicted clustering to a ground-truth clustering, correcting for chance. The Silhouette Score works without ground truth by measuring how tight each cluster is and how well separated it is from the others, with values from -1 to 1. Normalized Mutual Information (NMI) measures clustering agreement using information theory. [6]
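As an illustration of the chance correction in ARI, the sketch below uses the standard pair-counting formula: count the pairs of points grouped together in both clusterings, subtract the number expected by chance, and normalize. It assumes at least two data points and returns 1.0 in the degenerate case where the normalizer is zero; the function name and labels are invented.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    n = len(labels_true)
    # Pairs placed in the same cluster by both clusterings.
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = pairs_true * pairs_pred / comb(n, 2)  # agreement expected by chance
    maximum = 0.5 * (pairs_true + pairs_pred)
    if maximum == expected:
        return 1.0  # assumed convention for degenerate clusterings
    return (pairs_joint - expected) / (maximum - expected)

truth = [0, 0, 1, 1]

# Same grouping with the cluster names swapped: label names do not matter.
print(adjusted_rand_index(truth, [1, 1, 0, 0]))  # 1.0

# Every true cluster is split across the predicted clusters.
print(adjusted_rand_index(truth, [0, 1, 0, 1]))  # -0.5
```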

## How do you choose the right metric?

If classes are balanced and every error costs the same, accuracy is a reasonable starting point. For imbalanced data or asymmetric costs, look at precision, recall, F1, or AUC-PR. [2] For regression, RMSE is the default when large errors are unacceptable and MAE when robustness to outliers matters. For ranking, NDCG dominates web search, MAP fits traditional information retrieval, and MRR fits question answering. For generation, no single automatic metric is sufficient, so most papers report several and supplement them with human evaluation.

The table below summarizes the common default metrics by task type.

| Task | Common metrics | Typical default | Range |
|------|----------------|-----------------|-------|
| Binary classification | Accuracy, precision, recall, F1, AUC-ROC, AUC-PR | Accuracy (balanced); F1 or AUC-PR (imbalanced) | 0 to 1 |
| Regression | MAE, MSE, RMSE, R^2, MAPE | R^2 (scikit-learn); RMSE (reporting) | R^2: up to 1; errors: 0 and up |
| Ranking / IR | MRR, MAP, NDCG | NDCG (web search) | 0 to 1 |
| Text generation | BLEU, ROUGE, METEOR, BERTScore, perplexity | Multiple plus human eval | varies |
| Clustering | ARI, Silhouette, NMI | ARI (with labels); Silhouette (without) | ARI/Silhouette: -1 to 1 |

A useful warning is [Goodhart's law](/wiki/goodharts_law), best known in the phrasing of the anthropologist Marilyn Strathern, who wrote in 1997: "When a measure becomes a target, it ceases to be a good measure." [10] A model heavily optimized for a specific metric often gets worse on the underlying quality that metric was meant to capture. OpenAI researchers studied this in the [reinforcement learning from human feedback](/wiki/reinforcement_learning_from_human_feedback) setting, showing that "optimizing against the reward model is expected to make the policy worse with respect to the gold reward model after a certain point, due to Goodhart's law," because the reward model is an imperfect proxy for human preference. [7] The practical implication is to evaluate on multiple metrics and treat any one number with healthy suspicion.

## Explain like I'm 5 (ELI5)

Imagine a game where you guess what fruit is in a bag without looking, then check if you were right. A **metric** is a way to keep score. You could count how many fruits you got exactly right, or how often you said "apple" when it was something else, or give yourself partial credit when you were close. Each scoring rule tells you something different about how good you are. Machine learning models play their own version of the same game, and metrics are the scoreboards.

## References

1. Google Developers, Machine Learning Glossary (definitions of metric, objective, and loss): https://developers.google.com/machine-learning/glossary
2. Google Developers, Classification: Accuracy, recall, precision, and related metrics: https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall
3. Papineni, K., Roukos, S., Ward, T., and Zhu, W. (2002). BLEU: A Method for Automatic Evaluation of Machine Translation. ACL Anthology: https://aclanthology.org/P02-1040/
4. IBM Research, The AI paper at the foundations of multilingual NLP (BLEU 20th anniversary): https://research.ibm.com/blog/bleu-nlp-benchmark-anniversary
5. Wikipedia, Evaluation of binary classifiers: https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers
6. scikit-learn documentation, Metrics and scoring: quantifying the quality of predictions: https://scikit-learn.org/stable/modules/model_evaluation.html
7. Gao, L., Schulman, J., and Hilton, J. (2022). Scaling Laws for Reward Model Overoptimization. OpenAI: https://openai.com/index/scaling-laws-for-reward-model-overoptimization/
8. Evidently AI, Normalized Discounted Cumulative Gain (NDCG) explained: https://www.evidentlyai.com/ranking-metrics/ndcg-metric
9. Pinecone, Evaluation Measures in Information Retrieval: https://www.pinecone.io/learn/offline-evaluation/
10. Strathern, M. (1997). Improving ratings: audit in the British University system. European Review, 5(3): https://www.cambridge.org/core/journals/european-review/article/abs/improving-ratings-audit-in-the-british-university-system/FC2EE640C0C44E3DB87C29FB666E9AAB