# Precision

> Source: https://aiwiki.ai/wiki/precision
> Updated: 2026-06-20
> Categories: Machine Learning, Model Evaluation
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

**Precision** is a [classification](/wiki/classification) metric defined as the fraction of positive predictions that are correct: Precision = TP / (TP + FP), where TP is the number of [true positives](/wiki/true_positive_tp) and FP is the number of [false positives](/wiki/false_positive_fp).[2][4] In plain terms, it answers "of all the items the model labeled positive, how many actually are positive?" Precision ranges from 0 to 1, with 1.0 meaning every positive prediction was correct; its complement, 1 - Precision, is the share of positive predictions that are false alarms. The scikit-learn documentation describes it as "intuitively the ability of the classifier not to label as positive a sample that is negative," with a best value of 1 and a worst value of 0.[12] Precision is also called **positive predictive value (PPV)** in medicine and epidemiology.[2]

## Introduction

In the context of [machine learning](/wiki/machine_learning), **precision** is a fundamental metric used to evaluate the performance of [classification](/wiki/classification) models. Also known as **positive predictive value (PPV)** in medical and statistical literature, precision measures the proportion of [true positive](/wiki/true_positive_tp) instances among all instances classified as positive.[2] This metric is particularly important in cases where the cost of [false positives](/wiki/false_positive_fp) is high, such as in medical diagnosis, spam detection, or fraud alerting systems.

Precision answers the question: "Of all the items the model labeled as positive, how many actually are positive?" A model with high precision produces few false alarms, making it a critical metric when acting on a false positive carries significant consequences.

The concept of precision originates from the field of [information retrieval](/wiki/information_retrieval), where it was formalized by C. J. van Rijsbergen in his 1979 textbook *Information Retrieval*.[1][4] It has since become one of the most widely reported evaluation metrics across all branches of machine learning, alongside [recall](/wiki/recall), [accuracy](/wiki/accuracy), and the [F1 score](/wiki/f1_score).[2]

## How is precision calculated?

The precision of a [classification model](/wiki/classification_model) is the ratio of true positive predictions (TP) to all positive predictions, that is, the sum of true positives and false positives (FP):[2][4]

**Precision = TP / (TP + FP)**

Where:

- **TP (True Positives)** represents the number of instances that are actually positive and were correctly predicted as positive by the model.
- **FP (False Positives)** represents the number of instances that are actually negative but were incorrectly predicted as positive by the model. False positives are also known as Type I errors.[2]

Precision values range from 0 to 1 (or equivalently, 0% to 100%). A precision of 1.0 means that every instance the model predicted as positive was indeed positive (no false positives at all). A precision of 0.0 would mean that none of the instances predicted as positive were actually positive.

Precision is undefined when the denominator (TP + FP) equals zero, which occurs when the model makes no positive predictions at all. In practice, libraries like [scikit-learn](/wiki/scikit-learn) handle this edge case by returning 0.0 and issuing a warning: per the official documentation, "When true positive + false positive == 0, precision returns 0 and raises UndefinedMetricWarning," a behavior that can be overridden with the `zero_division` parameter.[6][12]
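The definition and its edge case can be sketched in plain Python. The helper name `precision` and its `zero_division` argument are hypothetical, chosen only to echo the scikit-learn parameter; this is an illustrative sketch, not the library's implementation.

```python
import warnings


def precision(tp: int, fp: int, zero_division: float = 0.0) -> float:
    """Return TP / (TP + FP), falling back to `zero_division` when undefined."""
    predicted_positive = tp + fp
    if predicted_positive == 0:
        # No positive predictions were made, so the ratio is undefined.
        warnings.warn("Precision is undefined: no positive predictions.")
        return zero_division
    return tp / predicted_positive


print(precision(tp=8, fp=2))   # 8 correct out of 10 positive predictions
print(precision(tp=0, fp=0))   # undefined case, falls back to zero_division
```

Returning a fallback value keeps downstream averaging code from crashing, but the warning matters: a fallback of 0.0 is a convention, not a measurement.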

### Worked example

Consider a spam filter that classifies emails as either "spam" (positive) or "not spam" (negative). After processing 1,000 emails, the results are:

| | Predicted Spam | Predicted Not Spam |
|--|--|--|
| **Actually Spam** | 80 (TP) | 20 (FN) |
| **Actually Not Spam** | 10 (FP) | 890 (TN) |

The precision of this spam filter is:

*Precision = TP / (TP + FP) = 80 / (80 + 10) = 80 / 90 = 0.889*

This means that 88.9% of the emails flagged as spam were actually spam. The remaining 11.1% were legitimate emails incorrectly flagged (false positives).

## Relationship to the confusion matrix

To fully understand precision, it helps to place TP and FP within the context of the full [confusion matrix](/wiki/confusion_matrix):

| | Predicted Positive | Predicted Negative |
|--|--|--|
| **Actually Positive** | True Positive (TP) | False Negative (FN) |
| **Actually Negative** | False Positive (FP) | True Negative (TN) |

Precision focuses on the left column of this matrix: of everything the model called positive (TP + FP), what fraction was correct (TP)? Note that precision does not consider false negatives at all. A model can miss many positive instances (low recall) and still have high precision, as long as the predictions it does make are correct.[2][5]

The four cells of the confusion matrix give rise to several related metrics. Precision uses the "predicted positive" column, while recall uses the "actually positive" row. Specificity and the false positive rate use the "actually negative" row.[2][10] Understanding which cells each metric draws from helps clarify what each one measures and where its blind spots lie.
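As a rough illustration of which cells each metric reads, the sketch below counts the four cells from label lists and rebuilds the spam-filter example. The function name `confusion_counts` and the 0/1 label encoding are assumptions made for this example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Rebuild the spam-filter example: 80 TP, 20 FN, 10 FP, 890 TN.
y_true = [1] * 80 + [1] * 20 + [0] * 10 + [0] * 890
y_pred = [1] * 80 + [0] * 20 + [1] * 10 + [0] * 890

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)      # "predicted positive" column
recall = tp / (tp + fn)         # "actually positive" row
specificity = tn / (tn + fp)    # "actually negative" row
print(tp, fp, fn, tn)
print(round(precision, 3), round(recall, 3), round(specificity, 3))
```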

### Why false positives matter

False positives can be costly in many real-world applications:

| Application | What a False Positive Means | Consequence |
|-------------|----------------------------|-------------|
| Email [spam filtering](/wiki/spam_filtering) | Legitimate email marked as spam | Important messages lost; missed business opportunities |
| [Fraud detection](/wiki/fraud_detection) | Legitimate transaction flagged as fraud | Customer inconvenience; blocked purchases |
| Legal document review | Irrelevant document flagged as relevant | Wasted attorney time; increased costs |
| Manufacturing quality control | Good product rejected | Wasted materials; reduced throughput |
| Criminal justice risk scoring | Low-risk individual flagged as high-risk | Unjust detention or denial of bail |
| Automated content moderation | Legitimate post removed as policy violation | User frustration; censorship concerns |

In all of these scenarios, precision is the metric that directly captures how often a positive prediction turns out to be a false alarm.

## What is the precision-recall tradeoff?

Precision and recall are inversely related in most practical settings. Improving one typically comes at the expense of the other. This relationship is known as the **precision-recall tradeoff**.[4]

Most classifiers produce a continuous score or probability rather than a hard binary decision. A [classification threshold](/wiki/classification_threshold) determines the cutoff above which an instance is classified as positive. Adjusting this threshold shifts the balance between precision and recall:

- **Raising the threshold** makes the model more selective: fewer instances are classified as positive, so false positives decrease (precision typically goes up), but more true positives are missed (recall goes down).
- **Lowering the threshold** makes the model more permissive: more instances are classified as positive, catching more true positives (recall goes up), but also admitting more false positives (precision goes down).

### Effect of classification threshold

To illustrate concretely, consider a [binary classification](/wiki/binary_classification) model that outputs a probability score between 0 and 1. Suppose we have 200 test samples and evaluate precision and recall at different threshold values:

| Threshold | TP | FP | FN | Precision | Recall |
|-----------|----|----|-----|-----------|--------|
| 0.3 | 90 | 40 | 5 | 0.692 | 0.947 |
| 0.5 | 82 | 18 | 13 | 0.820 | 0.863 |
| 0.7 | 68 | 7 | 27 | 0.907 | 0.716 |
| 0.9 | 45 | 2 | 50 | 0.957 | 0.474 |

As the threshold rises from 0.3 to 0.9, precision climbs from 0.692 to 0.957 while recall drops from 0.947 to 0.474. The optimal threshold depends on the application's tolerance for false positives versus false negatives.
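A threshold sweep of this kind can be sketched in a few lines. The scores and labels below are made-up illustrative values (not the 200-sample data in the table), and `precision_recall_at` is a hypothetical helper name.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when scores >= threshold are labeled positive."""
    predictions = [score >= threshold for score in scores]
    tp = sum(1 for p, y in zip(predictions, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if not p and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical model scores with their true labels (1 = positive).
scores = [0.95, 0.90, 0.85, 0.70, 0.65, 0.55, 0.40, 0.35, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]

for threshold in (0.3, 0.5, 0.7, 0.9):
    p, r = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.3f}  recall={r:.3f}")
```

On this toy data precision rises and recall falls as the threshold increases, mirroring the pattern in the table.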

### Precision-recall curve

The [precision-recall curve](/wiki/precision-recall_curve) plots precision on the y-axis against recall on the x-axis at various threshold settings. A model whose curve stays closer to the top-right corner, keeping precision high as recall increases, is preferable to one whose curve lies below it.[3] The **area under the precision-recall curve (AUPRC)** summarizes this tradeoff in a single number, with 1.0 representing a perfect classifier.[11]

The AUPRC is especially useful for evaluating models on **imbalanced datasets**. Unlike the [AUC-ROC](/wiki/auc_area_under_the_curve), which can appear deceptively high when the negative class is much larger than the positive class, the AUPRC focuses exclusively on performance with respect to the positive class.[7][10] Davis and Goadrich (2006) proved a precise correspondence between the two: "a curve dominates in ROC space if and only if it dominates in PR space," while noting that PR curves give a more accurate picture of performance on highly skewed datasets.[3][13]
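To make the curve and its area concrete, here is a minimal sketch that ranks predictions by score, records one (recall, precision) point per rank, and sums the area step by step. It assumes distinct scores and at least one positive label; the function names and data are illustrative, and the step-wise sum is only one of several ways to approximate the area.

```python
def pr_curve(scores, labels):
    """Return (recall, precision) points, one per ranked prediction."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positive = sum(labels)
    points, tp = [], 0
    for rank, (_, label) in enumerate(ranked, start=1):
        tp += label  # labels are 0 or 1
        points.append((tp / total_positive, tp / rank))
    return points


def area_under_pr(points):
    """Step-wise area: sum of (gain in recall) x (precision at that point)."""
    area, previous_recall = 0.0, 0.0
    for recall, precision in points:
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
imperfect = [1, 1, 0, 1, 0, 0]   # one negative ranked above a positive
perfect = [1, 1, 1, 0, 0, 0]     # all positives ranked first

print(round(area_under_pr(pr_curve(scores, imperfect)), 3))
print(round(area_under_pr(pr_curve(scores, perfect)), 3))
```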

### When to favor precision over recall

Precision should be prioritized when the cost of a false positive is much higher than the cost of a false negative.[5] Examples include:

- **Spam filtering:** Marking a legitimate email as spam (false positive) may cause the user to miss critical information. Missing a spam email (false negative) is merely a minor inconvenience.
- **Search result ranking:** Returning irrelevant results at the top of a list erodes user trust. Users are more tolerant of missing a relevant result buried deeper in the rankings.
- **Content recommendation:** Recommending irrelevant content (false positive) degrades user trust and engagement. Not recommending a relevant item (false negative) simply means the user does not see it.
- **Judicial and law enforcement systems:** Falsely flagging an innocent person has severe ethical and legal consequences.
- **Automated trading signals:** A false positive buy signal on a financial instrument can result in direct monetary loss.

Conversely, recall should be prioritized when the cost of a false negative outweighs the cost of a false positive. Medical screening and safety-critical systems are classic examples: missing a disease or a structural defect can be life-threatening, so catching every positive case matters more than avoiding false alarms.

| Scenario | Priority Metric | Rationale |
|----------|----------------|-----------|
| Email spam filtering | Precision | Losing a real email is worse than seeing spam |
| Cancer screening | Recall | Missing a malignant tumor is worse than a false alarm |
| Search engine results | Precision | Users expect top results to be relevant |
| Airport security | Recall | Missing a threat is unacceptable |
| Fraud detection | Precision | Blocking legitimate transactions frustrates customers |
| Manufacturing defect detection | Recall | Shipping a defective product has high liability |

## Precision at K (P@K)

**Precision@K** (also written P@K) is a variant of precision used in information retrieval and ranking systems. It measures the proportion of relevant items among the top K results returned by a system:[4]

**Precision@K = (Number of relevant items in the top K results) / K**

For example, if a search engine returns 10 results (K=10) and 7 of them are relevant, then Precision@10 = 7/10 = 0.70.

Precision@K is widely used in evaluating:

- Web search engines (e.g., Precision@10 for the first page of results)
- [Recommendation systems](/wiki/recommender_system) (e.g., Precision@5 for a "top 5 picks for you" list)
- Document retrieval in legal and medical literature search
- Retrieval-augmented generation (RAG) pipelines, where the quality of retrieved context directly affects output quality

A limitation of Precision@K is that it does not account for the position of relevant items within the top K results. A system that places all relevant items at the top of the list and one that scatters them throughout receive the same Precision@K score. Rank-aware metrics such as Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG) address this limitation.[4]
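The position-blindness described above is easy to see in code. In this sketch, `precision_at_k` is a hypothetical helper and each ranking is a list of 0/1 relevance flags in rank order.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevance[:k]) / k


front_loaded = [1, 1, 1, 0, 0]   # relevant items ranked first
scattered = [0, 1, 0, 1, 1]      # same number of relevant items, ranked lower

print(precision_at_k(front_loaded, 5), precision_at_k(scattered, 5))  # both 0.6
print(precision_at_k(front_loaded, 3), precision_at_k(scattered, 3))
```

Both rankings score the same at K=5 even though the first is clearly better for a user who reads from the top; only a smaller K or a rank-aware metric separates them.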

### Common values of K

| K Value | Typical Use Case |
|---------|------------------|
| P@1 | Voice assistants, question answering (top result must be correct) |
| P@3 | Mobile search results (small screen, few visible results) |
| P@5 | Short recommendation lists |
| P@10 | First page of web search results |
| P@20 | Extended search result pages |
| P@100 | Batch retrieval, recall-oriented systems |

## Average precision and mean average precision

**Average Precision (AP)** addresses the positional limitation of Precision@K by computing the average of precision values at each rank position where a relevant item is found:[4]

**AP = (1 / number of relevant items) * sum of Precision@k for each k where item k is relevant**

AP rewards systems that place relevant items higher in the ranked list. A system that returns all relevant items at the very top achieves AP = 1.0, while a system that scatters relevant items throughout the list achieves a lower AP even if the total number of relevant items retrieved is the same.

### AP worked example

Suppose a query has 4 relevant documents in the collection, and a system returns 10 results. Relevant items appear at positions 1, 3, 5, and 10:

| Position | Relevant? | Precision@k |
|----------|-----------|-------------|
| 1 | Yes | 1/1 = 1.000 |
| 2 | No | 1/2 = 0.500 |
| 3 | Yes | 2/3 = 0.667 |
| 4 | No | 2/4 = 0.500 |
| 5 | Yes | 3/5 = 0.600 |
| 6-9 | No | ... |
| 10 | Yes | 4/10 = 0.400 |

AP = (1.000 + 0.667 + 0.600 + 0.400) / 4 = 2.667 / 4 = 0.667

**Mean Average Precision (MAP)** is the mean of AP values across multiple queries or users. It is one of the most commonly used metrics for evaluating ranked retrieval systems and recommendation systems.[4]
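A minimal sketch of AP and MAP follows, assuming each ranking is given as a list of 0/1 relevance flags and that every relevant document appears somewhere in the list. The function names are illustrative.

```python
def average_precision(relevance):
    """AP for one ranked list; relevance[i] is 1 if the item at rank i+1 is relevant."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank   # Precision@k at this relevant position
    return precision_sum / total_relevant


def mean_average_precision(relevance_lists):
    """Mean of AP over several queries."""
    return sum(average_precision(r) for r in relevance_lists) / len(relevance_lists)


worked_example = [1, 0, 1, 0, 1, 0, 0, 0, 0, 1]   # relevant at ranks 1, 3, 5, 10
ideal_query = [1, 1, 0, 0]                         # hypothetical second query

print(round(average_precision(worked_example), 3))
print(round(mean_average_precision([worked_example, ideal_query]), 3))
```

The first value reproduces the worked example; the second averages it with a query whose relevant items are all ranked first (AP = 1.0).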

## Precision in object detection

In [object detection](/wiki/object_detection), precision takes on a slightly different meaning because predictions are bounding boxes rather than binary labels. A predicted bounding box is considered a true positive if its **Intersection over Union (IoU)** with a ground-truth box exceeds a specified threshold.[8]

The IoU is computed as the area of overlap between the predicted and ground-truth boxes divided by the area of their union. Different benchmarks set different IoU thresholds:

| Benchmark | IoU Threshold | Metric Name |
|-----------|--------------|-------------|
| PASCAL VOC | 0.5 | AP@0.5 (also called AP50) |
| COCO (primary) | 0.5 to 0.95, step 0.05 | AP@[.5:.95] |
| COCO (loose) | 0.5 | AP@0.5 |
| COCO (strict) | 0.75 | AP@0.75 |

The PASCAL VOC challenge uses AP at IoU = 0.5 as its primary metric, which is relatively lenient: a detection counts as correct once the intersection of the predicted and ground-truth boxes exceeds half the area of their union.[8] The COCO benchmark reports AP averaged over ten IoU thresholds (0.5 to 0.95 in steps of 0.05), denoted AP@[.5:.95]; this primary metric is computed across all 80 object categories. This stricter evaluation penalizes detections that are only roughly localized, rewarding models that produce tighter bounding boxes.[9]

In both benchmarks, mAP (mean Average Precision) is computed by averaging AP across all object classes. A model with high mAP achieves both high precision and good ranking of its detections across all categories.
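The IoU test that decides whether a detection counts as a true positive can be sketched as follows. The `(x_min, y_min, x_max, y_max)` box format and the function name `iou` are assumptions for illustration, and the sketch leaves out the detection-to-ground-truth matching and confidence ranking that full benchmark evaluations perform.

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


ground_truth = (0, 0, 2, 2)
prediction = (1, 0, 3, 2)        # same size, shifted one unit to the right

score = iou(ground_truth, prediction)
print(round(score, 3))           # intersection 2, union 6
print(score > 0.5)               # would this count as a true positive at IoU 0.5?
```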

## Macro, micro, and weighted precision

When dealing with [multi-class classification](/wiki/multi-class_classification) problems (more than two classes), precision must be aggregated across classes. There are three main approaches:[5]

| Averaging Method | How It Works | When to Use |
|-----------------|--------------|-------------|
| **Macro precision** | Compute precision independently for each class, then take the unweighted mean | When all classes are equally important, regardless of size |
| **Micro precision** | Sum all true positives and false positives across all classes, then compute a single precision value | When you care about overall correctness across all predictions |
| **Weighted precision** | Compute precision for each class, then take the weighted mean (weighted by the number of true instances per class) | When class imbalance exists and you want to reflect each class proportionally |

### Macro precision

Macro precision treats all classes equally. For a problem with C classes:

*Macro Precision = (1/C) * sum of Precision_c for c = 1 to C*

where Precision_c = TP_c / (TP_c + FP_c) for each class c.

Macro precision can be strongly affected by classes with very few instances. If a minority class has just 2 true positives and 1 false positive, its precision of 0.67 carries the same weight in the macro average as a majority class with thousands of predictions.

### Micro precision

Micro precision aggregates all predictions before computing the metric:

*Micro Precision = (sum of TP_c for all c) / (sum of TP_c + FP_c for all c)*

For a standard multi-class problem (where each instance belongs to exactly one class), micro precision equals the overall accuracy of the model. This equivalence arises because every false positive for one class is simultaneously a false negative for another class, so the total false positives across all classes equal the total false negatives.[5] Micro precision is dominated by the performance on large classes, so it may mask poor performance on minority classes.

### Practical example

Consider a sentiment classifier with three classes:

| Class | TP | FP | Precision |
|-------|----|----|----------|
| Positive | 80 | 10 | 80/90 = 0.889 |
| Neutral | 50 | 20 | 50/70 = 0.714 |
| Negative | 30 | 5 | 30/35 = 0.857 |

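- **Macro precision** = (0.889 + 0.714 + 0.857) / 3 = 0.820
- **Micro precision** = (80 + 50 + 30) / (90 + 70 + 35) = 160 / 195 = 0.821
- **Weighted precision** cannot be computed from this table alone, because the weights are each class's support (its number of true instances, TP + FN) and the table lists only TP and FP. Weighting by the predicted counts (90, 70, 35) instead simply reproduces the micro average: (0.889 x 90 + 0.714 x 70 + 0.857 x 35) / 195 = 160 / 195 = 0.821.

In this example, the macro and micro averages are close because the class sizes are not drastically different. When class imbalance is severe, the differences become much larger.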
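The averaging schemes can be sketched directly from per-class counts. The TP and FP values come from the table above; the `support` values (true instances per class, TP + FN) are hypothetical, since the table does not list false negatives, and are chosen only so that they sum to the same 195 instances.

```python
# Per-class (TP, FP) counts from the sentiment example.
counts = {"Positive": (80, 10), "Neutral": (50, 20), "Negative": (30, 5)}

per_class = {name: tp / (tp + fp) for name, (tp, fp) in counts.items()}

# Macro: unweighted mean of the per-class values.
macro = sum(per_class.values()) / len(per_class)

# Micro: pool all counts first, then compute a single ratio.
total_tp = sum(tp for tp, _ in counts.values())
total_predicted = sum(tp + fp for tp, fp in counts.values())
micro = total_tp / total_predicted


def weighted_precision(per_class, support):
    """Mean of per-class precision weighted by true-instance counts."""
    total = sum(support.values())
    return sum(per_class[name] * support[name] for name in per_class) / total


# Hypothetical supports (TP + FN per class), not taken from the table.
support = {"Positive": 100, "Neutral": 60, "Negative": 35}
weighted = weighted_precision(per_class, support)

print(round(macro, 3), round(micro, 3), round(weighted, 3))
```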

## Precision on imbalanced datasets

Precision behavior changes significantly on **imbalanced datasets**, where one class greatly outnumbers the other.[7][11] Consider a rare disease affecting 0.1% of the population. Even a good model can have low precision because the large number of healthy individuals generates many false positives relative to the small number of true positives.

For example, a test with 99% sensitivity (recall) and 99% specificity applied to 100,000 people where 100 are actually sick:

| | Predicted Sick | Predicted Healthy |
|--|--|--|
| **Actually Sick (100)** | 99 (TP) | 1 (FN) |
| **Actually Healthy (99,900)** | 999 (FP) | 98,901 (TN) |

Precision = 99 / (99 + 999) = 99 / 1,098 = 0.090

Despite excellent sensitivity and specificity, the precision is only 9%. This effect is known as the **false positive paradox**, and overlooking it is an instance of the **base rate fallacy**: when the condition is rare enough, even a highly specific test produces more false positives than true positives in absolute terms.

This phenomenon is important to understand in fields like medical testing, where a positive result on a screening test often requires a second, more specific confirmatory test. It also explains why the precision-recall curve is preferred over the ROC curve for evaluating classifiers on heavily imbalanced data, as Saito and Rehmsmeier (2015) demonstrated.[7]
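The dependence of precision on the base rate can be written out explicitly. The sketch below uses a hypothetical helper, `precision_from_rates`, and the prevalence values in the loop are illustrative; it derives precision from sensitivity, specificity, and prevalence, and reproduces the 9% figure from the example.

```python
def precision_from_rates(sensitivity, specificity, prevalence):
    """Precision (PPV) implied by a test's sensitivity, specificity, and base rate."""
    true_positive_share = sensitivity * prevalence
    false_positive_share = (1.0 - specificity) * (1.0 - prevalence)
    predicted_positive_share = true_positive_share + false_positive_share
    if predicted_positive_share == 0:
        return 0.0  # no positive predictions; precision is undefined
    return true_positive_share / predicted_positive_share


# Same test (99% sensitivity, 99% specificity) at different base rates.
for prevalence in (0.001, 0.01, 0.1, 0.5):
    ppv = precision_from_rates(0.99, 0.99, prevalence)
    print(f"prevalence={prevalence:<6} precision={ppv:.3f}")
```

The test itself never changes in this loop; only the population does, which is why precision figures should not be compared across datasets with different class balance.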

## Precision in scikit-learn

The popular Python library scikit-learn provides several functions for computing precision:[6]

### precision_score

```python
from sklearn.metrics import precision_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Binary precision (positive class is label 1)
precision = precision_score(y_true, y_pred)
print(f"Precision: {precision:.3f}")  # prints "Precision: 0.800"

# The same call with different averaging over classes
# (here the two labels 0 and 1; the options work the same way for multi-class data)
precision_macro = precision_score(y_true, y_pred, average='macro')
precision_micro = precision_score(y_true, y_pred, average='micro')
precision_weighted = precision_score(y_true, y_pred, average='weighted')
```

The `average` parameter controls the type of aggregation:

| Value | Behavior |
|-------|----------|
| `'binary'` | Report precision for the positive class only (default for binary problems) |
| `'macro'` | Unweighted mean of per-class precision |
| `'micro'` | Global precision from total TP and FP |
| `'weighted'` | Weighted mean by class support |
| `None` | Return an array with precision for each class |

### classification_report

The `classification_report` function provides precision, recall, and F1 for all classes in a single call:[6]

```python
from sklearn.metrics import classification_report

y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 1, 0, 1, 1, 0, 2, 2, 1]

print(classification_report(y_true, y_pred,
      target_names=['Negative', 'Positive', 'Neutral']))
```

This produces a table like:

```
              precision    recall  f1-score   support

    Negative       0.67      0.67      0.67         3
    Positive       0.50      0.67      0.57         3
     Neutral       1.00      0.67      0.80         3

    accuracy                           0.67         9
   macro avg       0.72      0.67      0.68         9
weighted avg       0.72      0.67      0.68         9
```

### PrecisionRecallDisplay

For visualization, scikit-learn provides `PrecisionRecallDisplay` to plot precision-recall curves directly from a classifier:[6]

```python
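import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import PrecisionRecallDisplay
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data and a simple model, used only so the example runs
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
classifier = LogisticRegression().fit(X_train, y_train)

PrecisionRecallDisplay.from_estimator(classifier, X_test, y_test)
plt.title("Precision-Recall Curve")
plt.show()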
```

## How does precision differ from recall and F1?

### Recall

Recall, also known as *sensitivity* or *true positive rate*, measures the proportion of actual positive instances that were correctly identified by the model:[2][10]

*Recall = TP / (TP + FN)*

While precision asks "of the items called positive, how many truly are?", recall asks "of the items that truly are positive, how many did the model find?" A model can achieve perfect precision by making a single positive prediction that happens to be correct, but its recall would be near zero if it missed all other positives.

### F1 score

The F1 score is the harmonic mean of precision and recall and serves as a single metric to assess the trade-off between these two performance measures.[1][2] The mathematical formula is:

*F1 Score = 2 * (Precision * Recall) / (Precision + Recall)*

The harmonic mean is used rather than the arithmetic mean because it penalizes extreme imbalances. A model with precision = 1.0 and recall = 0.01 gets an F1 of only 0.02, whereas the arithmetic mean would be 0.505. The F1 score is particularly useful when dealing with imbalanced datasets, as it takes into account both false positives and false negatives.

### F-beta score

The F-beta score generalizes the F1 score by introducing a parameter beta that controls the relative importance of precision vs. recall:

*F_beta = (1 + beta^2) * (Precision * Recall) / (beta^2 * Precision + Recall)*

- **beta < 1** weights precision more heavily (e.g., F0.5 is common when false positives are especially costly).
- **beta = 1** gives equal weight to precision and recall (the standard F1 score).
- **beta > 1** weights recall more heavily (e.g., F2 is common when false negatives are especially costly).

Van Rijsbergen (1979) originally designed the F-measure so that F-beta "measures the effectiveness of retrieval with respect to a user who attaches beta times as much importance to recall as precision."[1]
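The effect of beta can be checked numerically with a small helper. `f_beta` below is an illustrative function written for this example, and the precision and recall inputs are arbitrary.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# F1 stays close to the weaker of the two scores.
print(round(f_beta(1.0, 0.01), 3))

# A model with high precision (0.9) and modest recall (0.5):
print(round(f_beta(0.9, 0.5, beta=0.5), 3))   # leans toward precision
print(round(f_beta(0.9, 0.5, beta=1.0), 3))   # balanced
print(round(f_beta(0.9, 0.5, beta=2.0), 3))   # leans toward recall
```

For a model whose precision exceeds its recall, the score falls as beta grows, because more weight shifts to the weaker recall value.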

### Summary of related metrics

| Metric | Formula | Focus |
|--------|---------|-------|
| [Precision](/wiki/precision) | TP / (TP + FP) | Quality of positive predictions |
| [Recall](/wiki/recall) | TP / (TP + FN) | Completeness of positive identification |
| [F1 Score](/wiki/f1_score) | 2 * P * R / (P + R) | Balance between precision and recall |
| [Accuracy](/wiki/accuracy) | (TP + TN) / (TP + TN + FP + FN) | Overall correctness |
| Specificity | TN / (TN + FP) | Proportion of actual negatives correctly identified |
| Positive Predictive Value (PPV) | Same as Precision | Term used in medicine and epidemiology |
| False Discovery Rate (FDR) | FP / (FP + TP) = 1 - Precision | Proportion of false alarms among positive predictions |
| Precision@K | Relevant in top K / K | Quality of top-K ranked results |
| Average Precision | Mean of P@k at relevant positions | Rank-aware quality of a result list |

## Explain like I'm 5 (ELI5)

Imagine you have a big basket of fruit, and your job is to pick out only the apples. You reach in and pull out some pieces of fruit. Precision tells you how many of the fruits you grabbed are actually apples. If you grabbed 10 fruits and 8 of them are apples (but 2 are oranges you picked by mistake), your precision is 8 out of 10, or 80%. The oranges you accidentally grabbed are "false positives." Higher precision means you are better at only grabbing apples and leaving the oranges alone.

Precision does not care about the apples still left in the basket (that would be recall). It only cares about whether the fruits you pulled out are the right ones.

## References

1. Van Rijsbergen, C. J. (1979). *Information Retrieval*. 2nd edition. Butterworths.
2. Powers, D. M. W. (2011). "Evaluation: From Precision, Recall and F-measure to ROC, Informedness, Markedness and Correlation." *Journal of Machine Learning Technologies*, 2(1), 37-63.
3. Davis, J. and Goadrich, M. (2006). "The Relationship Between Precision-Recall and ROC Curves." *Proceedings of the 23rd International Conference on Machine Learning (ICML)*.
4. Manning, C. D., Raghavan, P., and Schütze, H. (2008). *Introduction to Information Retrieval*. Cambridge University Press. Chapter 8: Evaluation in Information Retrieval.
5. Sokolova, M. and Lapalme, G. (2009). "A Systematic Analysis of Performance Measures for Classification Tasks." *Information Processing and Management*, 45(4), 427-437.
6. Pedregosa, F. et al. (2011). "Scikit-learn: Machine Learning in Python." *Journal of Machine Learning Research*, 12, 2825-2830.
7. Saito, T. and Rehmsmeier, M. (2015). "The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets." *PLOS ONE*, 10(3), e0118432.
8. Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. (2010). "The Pascal Visual Object Classes (VOC) Challenge." *International Journal of Computer Vision*, 88(2), 303-338.
9. Lin, T.-Y. et al. (2014). "Microsoft COCO: Common Objects in Context." *European Conference on Computer Vision (ECCV)*.
10. Fawcett, T. (2006). "An Introduction to ROC Analysis." *Pattern Recognition Letters*, 27(8), 861-874.
11. Flach, P. and Kull, M. (2015). "Precision-Recall-Gain Curves: PR Analysis Done Right." *Advances in Neural Information Processing Systems (NeurIPS)*, 28.
12. scikit-learn developers. "sklearn.metrics.precision_score." scikit-learn 1.x documentation. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html
13. Davis, J. and Goadrich, M. (2006). "The Relationship Between Precision-Recall and ROC Curves." Camera-ready manuscript. https://mark.goadrich.com/articles/davisgoadrichcamera2.pdf