See also: Machine learning terms
In the context of machine learning, precision is a fundamental metric used to evaluate the performance of classification models. Also known as positive predictive value (PPV) in medical and statistical literature, precision measures the proportion of true positive instances among all instances classified as positive. This metric is particularly important in cases where the cost of false positives is high, such as in medical diagnosis, spam detection, or fraud alerting systems.
Precision answers the question: "Of all the items the model labeled as positive, how many actually are positive?" A model with high precision produces few false alarms, making it a critical metric when acting on a false positive carries significant consequences.
The concept of precision originates from the field of information retrieval, where it has long been used to evaluate document retrieval systems and was given a formal treatment by C. J. van Rijsbergen in his 1979 textbook Information Retrieval. It has since become one of the most widely reported evaluation metrics across all branches of machine learning, alongside recall, accuracy, and the F1 score.
The precision of a classification model is defined as the ratio of true positive predictions (TP) to the sum of true positive and false positive predictions (FP). Mathematically, it is expressed as:
Precision = TP / (TP + FP)
where TP is the number of positive instances correctly classified as positive, and FP is the number of negative instances incorrectly classified as positive.
Precision values range from 0 to 1 (or equivalently, 0% to 100%). A precision of 1.0 means that every instance the model predicted as positive was indeed positive (no false positives at all). A precision of 0.0 would mean that none of the instances predicted as positive were actually positive.
Precision is undefined when the denominator (TP + FP) equals zero, which occurs when the model makes no positive predictions at all. In practice, libraries like scikit-learn handle this edge case by returning 0.0 and issuing a warning.
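As a minimal sketch of these definitions (the helper function and the example label vectors below are illustrative, not part of any library), precision can be computed directly from the prediction counts, and scikit-learn's zero_division argument controls the undefined case:

```python
from sklearn.metrics import precision_score

def precision_from_counts(tp, fp):
    """Return TP / (TP + FP), or None when the model made no positive predictions."""
    if tp + fp == 0:
        return None  # precision is undefined in this case
    return tp / (tp + fp)

print(precision_from_counts(80, 10))  # 0.888...
print(precision_from_counts(0, 0))    # None: no positive predictions were made

# scikit-learn's handling of the same edge case
y_true = [1, 0, 1]
y_pred = [0, 0, 0]  # no positive predictions at all
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0, without the usual warning
```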
Consider a spam filter that classifies emails as either "spam" (positive) or "not spam" (negative). After processing 1,000 emails, the results are:
| | Predicted Spam | Predicted Not Spam |
|---|---|---|
| Actually Spam | 80 (TP) | 20 (FN) |
| Actually Not Spam | 10 (FP) | 890 (TN) |
The precision of this spam filter is:
Precision = TP / (TP + FP) = 80 / (80 + 10) = 80 / 90 = 0.889
This means that 88.9% of the emails flagged as spam were actually spam. The remaining 11.1% were legitimate emails incorrectly flagged (false positives).
To fully understand precision, it helps to place TP and FP within the context of the full confusion matrix:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | True Positive (TP) | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN) |
Precision focuses on the left column of this matrix: of everything the model called positive (TP + FP), what fraction was correct (TP)? Note that precision does not consider false negatives at all. A model can miss many positive instances (low recall) and still have high precision, as long as the predictions it does make are correct.
The four cells of the confusion matrix give rise to several related metrics. Precision uses the "predicted positive" column, while recall uses the "actually positive" row. Specificity and the false positive rate use the "actually negative" row. Understanding which cells each metric draws from helps clarify what each one measures and where its blind spots lie.
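As an illustrative sketch (the label vectors are made up for this purpose), scikit-learn's confusion_matrix can be unpacked to show which cells each metric uses:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

# For binary labels {0, 1}, scikit-learn orders the matrix as:
# [[TN, FP],
#  [FN, TP]]  (rows = actual class, columns = predicted class)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision   = tp / (tp + fp)  # "predicted positive" column
recall      = tp / (tp + fn)  # "actually positive" row
specificity = tn / (tn + fp)  # "actually negative" row

print(f"precision={precision:.3f} recall={recall:.3f} specificity={specificity:.3f}")
```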
False positives can be costly in many real-world applications:
| Application | What a False Positive Means | Consequence |
|---|---|---|
| Email spam filtering | Legitimate email marked as spam | Important messages lost; missed business opportunities |
| Fraud detection | Legitimate transaction flagged as fraud | Customer inconvenience; blocked purchases |
| Legal document review | Irrelevant document flagged as relevant | Wasted attorney time; increased costs |
| Manufacturing quality control | Good product rejected | Wasted materials; reduced throughput |
| Criminal justice risk scoring | Low-risk individual flagged as high-risk | Unjust detention or denial of bail |
| Automated content moderation | Legitimate post removed as policy violation | User frustration; censorship concerns |
In all of these scenarios, precision is the metric that directly captures the rate of false alarms.
Precision and recall are inversely related in most practical settings. Improving one typically comes at the expense of the other. This relationship is known as the precision-recall tradeoff.
Most classifiers produce a continuous score or probability rather than a hard binary decision. A classification threshold determines the cutoff above which an instance is classified as positive. Adjusting this threshold shifts the balance between precision and recall: raising the threshold makes the model more conservative, which typically increases precision at the cost of recall, while lowering the threshold has the opposite effect.
To illustrate concretely, consider a binary classification model that outputs a probability score between 0 and 1. Suppose we have 200 test samples and evaluate precision and recall at different threshold values:
| Threshold | TP | FP | FN | Precision | Recall |
|---|---|---|---|---|---|
| 0.3 | 90 | 40 | 5 | 0.692 | 0.947 |
| 0.5 | 82 | 18 | 13 | 0.820 | 0.863 |
| 0.7 | 68 | 7 | 27 | 0.907 | 0.716 |
| 0.9 | 45 | 2 | 50 | 0.957 | 0.474 |
As the threshold rises from 0.3 to 0.9, precision climbs from 0.692 to 0.957 while recall drops from 0.947 to 0.474. The optimal threshold depends on the application's tolerance for false positives versus false negatives.
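A sweep like this can be reproduced with scikit-learn's precision_recall_curve, which evaluates precision and recall at every distinct score threshold (the scores and labels below are illustrative, not the data behind the table above):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Illustrative ground-truth labels and predicted probabilities for the positive class
y_true   = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1])
y_scores = np.array([0.1, 0.35, 0.4, 0.45, 0.55, 0.6, 0.65, 0.7, 0.85, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# The final (precision=1, recall=0) point has no associated threshold, so drop it here
for p, r, t in zip(precision[:-1], recall[:-1], thresholds):
    print(f"threshold={t:.2f}  precision={p:.3f}  recall={r:.3f}")
```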
The precision-recall curve plots precision on the y-axis against recall on the x-axis at various threshold settings. A model that maintains high precision across all recall levels is superior. The area under the precision-recall curve (AUPRC) summarizes this tradeoff in a single number, with 1.0 representing a perfect classifier.
The AUPRC is especially useful for evaluating models on imbalanced datasets. Unlike the AUC-ROC, which can appear deceptively high when the negative class is much larger than the positive class, the AUPRC focuses exclusively on performance with respect to the positive class. Davis and Goadrich (2006) showed that a curve dominates in ROC space if and only if it dominates in precision-recall space, but also that optimizing the area under the ROC curve does not guarantee optimizing the area under the precision-recall curve.
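A brief sketch of the AUPRC computation on illustrative data: average_precision_score summarizes the curve as a weighted mean of precisions at each recall step, while auc gives the trapezoidal area under the same curve for comparison.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve, auc

y_true   = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1])
y_scores = np.array([0.1, 0.35, 0.4, 0.45, 0.55, 0.6, 0.65, 0.7, 0.85, 0.9])

ap = average_precision_score(y_true, y_scores)           # step-wise summary of the PR curve
precision, recall, _ = precision_recall_curve(y_true, y_scores)
auprc = auc(recall, precision)                           # trapezoidal area under the same curve

print(f"average precision = {ap:.3f}, trapezoidal AUPRC = {auprc:.3f}")
```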
Precision should be prioritized when the cost of a false positive is much higher than the cost of a false negative. Typical examples, summarized in the table below, include email spam filtering, search engine ranking, and fraud alerting.
Conversely, recall should be prioritized when the cost of a false negative outweighs the cost of a false positive. Medical screening and safety-critical systems are classic examples: missing a disease or a structural defect can be life-threatening, so catching every positive case matters more than avoiding false alarms.
| Scenario | Priority Metric | Rationale |
|---|---|---|
| Email spam filtering | Precision | Losing a real email is worse than seeing spam |
| Cancer screening | Recall | Missing a malignant tumor is worse than a false alarm |
| Search engine results | Precision | Users expect top results to be relevant |
| Airport security | Recall | Missing a threat is unacceptable |
| Fraud detection | Precision | Blocking legitimate transactions frustrates customers |
| Manufacturing defect detection | Recall | Shipping a defective product has high liability |
Precision@K (also written P@K) is a variant of precision used in information retrieval and ranking systems. It measures the proportion of relevant items among the top K results returned by a system:
Precision@K = (Number of relevant items in the top K results) / K
For example, if a search engine returns 10 results (K=10) and 7 of them are relevant, then Precision@10 = 7/10 = 0.70.
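A minimal sketch of Precision@K over a ranked result list (the relevance flags and helper function are illustrative):

```python
def precision_at_k(relevance, k):
    """relevance: 0/1 flags for the ranked results, best result first."""
    return sum(relevance[:k]) / k

# 10 ranked search results; 1 = relevant, 0 = not relevant
ranked_relevance = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
print(precision_at_k(ranked_relevance, 10))  # 0.7, matching the example above
print(precision_at_k(ranked_relevance, 5))   # 0.8
```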
Precision@K is widely used in evaluating search engines, recommender systems, and question-answering systems, where users typically see only the top-ranked results.
A limitation of Precision@K is that it does not account for the position of relevant items within the top K results. A system that places all relevant items at the top of the list and one that scatters them throughout receive the same Precision@K score. Rank-aware metrics such as Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) address this limitation.
| K Value | Typical Use Case |
|---|---|
| P@1 | Voice assistants, question answering (top result must be correct) |
| P@3 | Mobile search results (small screen, few visible results) |
| P@5 | Short recommendation lists |
| P@10 | First page of web search results |
| P@20 | Extended search result pages |
| P@100 | Batch retrieval, recall-oriented systems |
Average Precision (AP) addresses the positional limitation of Precision@K by computing the average of precision values at each rank position where a relevant item is found:
AP = (1 / number of relevant items) * sum of Precision@k for each k where item k is relevant
AP rewards systems that place relevant items higher in the ranked list. A system that returns all relevant items at the very top achieves AP = 1.0, while a system that scatters relevant items throughout the list achieves a lower AP even if the total number of relevant items retrieved is the same.
Suppose a query has 4 relevant documents in the collection, and a system returns 10 results. Relevant items appear at positions 1, 3, 5, and 10:
| Position | Relevant? | Precision@k |
|---|---|---|
| 1 | Yes | 1/1 = 1.000 |
| 2 | No | 1/2 = 0.500 |
| 3 | Yes | 2/3 = 0.667 |
| 4 | No | 2/4 = 0.500 |
| 5 | Yes | 3/5 = 0.600 |
| 6-9 | No | ... |
| 10 | Yes | 4/10 = 0.400 |
AP = (1.000 + 0.667 + 0.600 + 0.400) / 4 = 2.667 / 4 = 0.667
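The same calculation can be written as a short helper function (illustrative, not a library API); it reproduces the 0.667 from the worked example:

```python
def average_precision(relevance, num_relevant):
    """relevance: 0/1 flags for ranked results, best first.
    num_relevant: total number of relevant items in the collection."""
    hits, precision_sum = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / k  # Precision@k at each relevant position
    return precision_sum / num_relevant

# Relevant items at positions 1, 3, 5, and 10 of a 10-item ranking
relevance = [1, 0, 1, 0, 1, 0, 0, 0, 0, 1]
print(round(average_precision(relevance, num_relevant=4), 3))  # 0.667
```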
Mean Average Precision (MAP) is the mean of AP values across multiple queries or users. It is one of the most commonly used metrics for evaluating ranked retrieval systems and recommendation systems.
In object detection, precision takes on a slightly different meaning because predictions are bounding boxes rather than binary labels. A predicted bounding box is considered a true positive if its Intersection over Union (IoU) with a ground-truth box exceeds a specified threshold.
The IoU is computed as the area of overlap between the predicted and ground-truth boxes divided by the area of their union. Different benchmarks set different IoU thresholds:
| Benchmark | IoU Threshold | Metric Name |
|---|---|---|
| PASCAL VOC | 0.5 | AP@0.5 (also called AP50) |
| COCO (primary) | 0.5 to 0.95, step 0.05 | AP@[.5:.95] |
| COCO (loose) | 0.5 | AP@0.5 |
| COCO (strict) | 0.75 | AP@0.75 |
The PASCAL VOC challenge uses AP at a single IoU threshold of 0.5 as its primary metric, a relatively lenient criterion since a predicted box only needs an IoU of 0.5 with the ground-truth box to count as correct. The COCO benchmark reports AP averaged over ten IoU thresholds (0.5 to 0.95 in steps of 0.05), denoted AP@[.5:.95]. This stricter evaluation penalizes detections that are only roughly localized, rewarding models that produce tighter bounding boxes.
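As a concrete illustration of the IoU calculation described above, here is a minimal sketch for axis-aligned boxes given as (x1, y1, x2, y2) corners (the function and box coordinates are illustrative):

```python
def iou(box_a, box_b):
    """Intersection over Union for axis-aligned boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)

predicted    = (10, 10, 60, 60)  # 50x50 predicted box
ground_truth = (20, 20, 70, 70)  # 50x50 ground-truth box, offset by 10 pixels
print(f"IoU = {iou(predicted, ground_truth):.3f}")
# 0.471: below the 0.5 threshold, so this detection would count as a false positive
```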
In both benchmarks, mAP (mean Average Precision) is computed by averaging AP across all object classes. A model with high mAP achieves both high precision and good ranking of its detections across all categories.
When dealing with multi-class classification problems (more than two classes), precision must be aggregated across classes. There are three main approaches:
| Averaging Method | How It Works | When to Use |
|---|---|---|
| Macro precision | Compute precision independently for each class, then take the unweighted mean | When all classes are equally important, regardless of size |
| Micro precision | Sum all true positives and false positives across all classes, then compute a single precision value | When you care about overall correctness across all predictions |
| Weighted precision | Compute precision for each class, then take the weighted mean (weighted by the number of true instances per class) | When class imbalance exists and you want to reflect each class proportionally |
Macro precision treats all classes equally. For a problem with C classes:
Macro Precision = (1/C) * sum of Precision_c for c = 1 to C
where Precision_c = TP_c / (TP_c + FP_c) for each class c.
Macro precision can be strongly affected by classes with very few instances. If a minority class has just 2 true positives and 1 false positive, its precision of 0.67 carries the same weight in the macro average as a majority class with thousands of predictions.
Micro precision aggregates all predictions before computing the metric:
Micro Precision = (sum of TP_c over all classes) / (sum of (TP_c + FP_c) over all classes)
For a standard multi-class problem (where each instance belongs to exactly one class), micro precision equals the overall accuracy of the model. This equivalence arises because every false positive for one class is simultaneously a false negative for another class, so the total false positives across all classes equal the total false negatives. Micro precision is dominated by the performance on large classes, so it may mask poor performance on minority classes.
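The equivalence between micro precision and accuracy for single-label multi-class data is easy to verify numerically (the label vectors are illustrative and match the classification_report example further below):

```python
from sklearn.metrics import accuracy_score, precision_score

y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 1, 0, 1, 1, 0, 2, 2, 1]

micro = precision_score(y_true, y_pred, average='micro')
acc = accuracy_score(y_true, y_pred)
print(f"micro precision = {micro:.3f}, accuracy = {acc:.3f}")  # both 0.667
```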
Consider a sentiment classifier with three classes:
| Class | TP | FP | Precision |
|---|---|---|---|
| Positive | 80 | 10 | 80/90 = 0.889 |
| Neutral | 50 | 20 | 50/70 = 0.714 |
| Negative | 30 | 5 | 30/35 = 0.857 |
In this example, the macro average (≈ 0.820) and the micro average (≈ 0.821) come out very close because the per-class precisions are similar; see the sketch below. When class imbalance is severe and per-class performance varies widely, the differences between the averaging methods become much larger.
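A short sketch of these averages, computed directly from the TP and FP counts in the table (the weighted average would additionally require the number of true instances per class, which the table does not show):

```python
# TP and FP counts from the sentiment-classifier table above
tp = {'Positive': 80, 'Neutral': 50, 'Negative': 30}
fp = {'Positive': 10, 'Neutral': 20, 'Negative': 5}

per_class = {c: tp[c] / (tp[c] + fp[c]) for c in tp}
macro = sum(per_class.values()) / len(per_class)
micro = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))

for name, p in per_class.items():
    print(f"{name}: {p:.3f}")   # 0.889, 0.714, 0.857
print(f"macro = {macro:.3f}")   # 0.820
print(f"micro = {micro:.3f}")   # 0.821
```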
Precision behavior changes significantly on imbalanced datasets, where one class greatly outnumbers the other. Consider a rare disease affecting 0.1% of the population. Even a good model can have low precision because the large number of healthy individuals generates many false positives relative to the small number of true positives.
For example, a test with 99% sensitivity (recall) and 99% specificity applied to 100,000 people where 100 are actually sick:
| | Predicted Sick | Predicted Healthy |
|---|---|---|
| Actually Sick (100) | 99 (TP) | 1 (FN) |
| Actually Healthy (99,900) | 999 (FP) | 98,901 (TN) |
Precision = 99 / (99 + 999) = 99 / 1,098 = 0.090
Despite excellent sensitivity and specificity, the precision is only 9%. This is known as the base rate fallacy or the false positive paradox: when the condition is rare, even a highly specific test will produce more false positives than true positives in absolute terms.
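The dependence of precision on prevalence can be made explicit. With sensitivity (recall), specificity, and prevalence expressed as fractions:

Precision = (sensitivity * prevalence) / (sensitivity * prevalence + (1 - specificity) * (1 - prevalence))

A minimal sketch of this calculation (the function name is illustrative):

```python
def precision_from_rates(sensitivity, specificity, prevalence):
    tp_rate = sensitivity * prevalence               # expected true positives per person screened
    fp_rate = (1 - specificity) * (1 - prevalence)   # expected false positives per person screened
    return tp_rate / (tp_rate + fp_rate)

# 99% sensitivity and specificity at 0.1% prevalence reproduces the 9% figure above
print(f"{precision_from_rates(0.99, 0.99, 0.001):.3f}")  # 0.090
# The same test applied where the condition affects 10% of the population
print(f"{precision_from_rates(0.99, 0.99, 0.10):.3f}")   # 0.917
```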
This phenomenon is important to understand in fields like medical testing, where a positive result on a screening test often requires a second, more specific confirmatory test. It also explains why the precision-recall curve is preferred over the ROC curve for evaluating classifiers on heavily imbalanced data, as Saito and Rehmsmeier (2015) demonstrated.
The popular Python library scikit-learn provides several functions for computing precision:
from sklearn.metrics import precision_score
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
# Binary precision
precision = precision_score(y_true, y_pred)
print(f"Precision: {precision:.3f}") # Output: 0.800
# Averaging strategies (typically used for multi-class problems; shown here on the binary labels above)
precision_macro = precision_score(y_true, y_pred, average='macro')
precision_micro = precision_score(y_true, y_pred, average='micro')
precision_weighted = precision_score(y_true, y_pred, average='weighted')
The average parameter controls the type of aggregation:
| Value | Behavior |
|---|---|
| 'binary' | Report precision for the positive class only (default for binary problems) |
| 'macro' | Unweighted mean of per-class precision |
| 'micro' | Global precision from total TP and FP |
| 'weighted' | Weighted mean by class support |
| None | Return an array with precision for each class |
The classification_report function provides precision, recall, and F1 for all classes in a single call:
from sklearn.metrics import classification_report
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 1, 0, 1, 1, 0, 2, 2, 1]
print(classification_report(y_true, y_pred,
target_names=['Negative', 'Positive', 'Neutral']))
This produces a table like:
precision recall f1-score support
Negative 0.67 0.67 0.67 3
Positive 0.50 0.67 0.57 3
Neutral 1.00 0.67 0.80 3
accuracy 0.67 9
macro avg 0.72 0.67 0.68 9
weighted avg 0.72 0.67 0.68 9
For visualization, scikit-learn provides PrecisionRecallDisplay to plot precision-recall curves directly from a classifier:
from sklearn.metrics import PrecisionRecallDisplay
import matplotlib.pyplot as plt
# Assuming you have a fitted classifier and test data
PrecisionRecallDisplay.from_estimator(classifier, X_test, y_test)
plt.title("Precision-Recall Curve")
plt.show()
Recall, also known as sensitivity or true positive rate, measures the proportion of actual positive instances that were correctly identified by the model:
Recall = TP / (TP + FN)
While precision asks "of the items called positive, how many truly are?", recall asks "of the items that truly are positive, how many did the model find?" A model can achieve perfect precision by making a single, highly confident positive prediction, but its recall would be near zero if it missed all other positives.
The F1 score is the harmonic mean of precision and recall and serves as a single metric to assess the trade-off between these two performance measures. The mathematical formula is:
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
The harmonic mean is used rather than the arithmetic mean because it penalizes extreme imbalances. A model with precision = 1.0 and recall = 0.01 gets an F1 of only 0.02, whereas the arithmetic mean would be 0.505. The F1 score is particularly useful when dealing with imbalanced datasets, as it takes into account both false positives and false negatives.
The F-beta score generalizes the F1 score by introducing a parameter beta that controls the relative importance of precision vs. recall:
F_beta = (1 + beta^2) * (Precision * Recall) / (beta^2 * Precision + Recall)
Van Rijsbergen (1979) originally designed the F-measure so that F-beta "measures the effectiveness of retrieval with respect to a user who attaches beta times as much importance to recall as precision."
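A minimal sketch using scikit-learn's fbeta_score (the label vectors are illustrative): beta values below 1 weight precision more heavily, while values above 1 weight recall more heavily.

```python
from sklearn.metrics import f1_score, fbeta_score

# Illustrative labels giving precision = 0.667 and recall = 0.400
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]

print(f"{f1_score(y_true, y_pred):.3f}")               # 0.500 (beta = 1, balanced)
print(f"{fbeta_score(y_true, y_pred, beta=0.5):.3f}")   # 0.588 (favors precision)
print(f"{fbeta_score(y_true, y_pred, beta=2.0):.3f}")   # 0.435 (favors recall)
```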
| Metric | Formula | Focus |
|---|---|---|
| Precision | TP / (TP + FP) | Quality of positive predictions |
| Recall | TP / (TP + FN) | Completeness of positive identification |
| F1 Score | 2 * P * R / (P + R) | Balance between precision and recall |
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Overall correctness |
| Specificity | TN / (TN + FP) | Correctness of negative predictions |
| Positive Predictive Value (PPV) | Same as Precision | Term used in medicine and epidemiology |
| False Discovery Rate (FDR) | FP / (FP + TP) = 1 - Precision | Proportion of false alarms among positive predictions |
| Precision@K | Relevant in top K / K | Quality of top-K ranked results |
| Average Precision | Mean of P@k at relevant positions | Quality-weighted ranking |
Imagine you have a big basket of fruit, and your job is to pick out only the apples. You reach in and pull out some pieces of fruit. Precision tells you how many of the fruits you grabbed are actually apples. If you grabbed 10 fruits and 8 of them are apples (but 2 are oranges you picked by mistake), your precision is 8 out of 10, or 80%. The oranges you accidentally grabbed are "false positives." Higher precision means you are better at only grabbing apples and leaving the oranges alone.
Precision does not care about the apples still left in the basket (that would be recall). It only cares about whether the fruits you pulled out are the right ones.