# Confusion Matrix

> Source: https://aiwiki.ai/wiki/confusion_matrix
> Updated: 2026-07-12
> Categories: Machine Learning, Model Evaluation
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

A **confusion matrix** is a table that summarizes the performance of a [classification model](/wiki/classification_model) by tabulating its predicted class labels against the actual class labels, with correct predictions on the diagonal and errors off the diagonal. Also called an error matrix or classification matrix, it is one of the most widely used tools in machine learning evaluation because it gives a full breakdown of correct and incorrect predictions, making it far more informative than a single aggregate metric like [accuracy](/wiki/accuracy). For a [binary classification](/wiki/binary_classification) problem it is a 2 x 2 table whose four cells, [true positive](/wiki/true_positive_tp), [false positive](/wiki/false_positive_fp), [false negative](/wiki/false_negative_fn), and [true negative](/wiki/true_negative_tn), are the building blocks from which [precision](/wiki/precision), [recall](/wiki/recall), specificity, F1 score, and most other classification metrics are derived.

The confusion matrix is not limited to binary classification. It generalizes to multi-class problems, where the matrix becomes an N x N table for N classes. By examining the entries of a confusion matrix, practitioners can calculate a wide range of performance metrics, diagnose specific failure modes of a classifier, and make informed decisions about model improvements.

## What is the origin of the confusion matrix?

The mathematical foundations of the confusion matrix trace back to 1904, when Karl Pearson published "On the theory of contingency and its relation to association and normal correlation" as part of the Drapers' Company Research Memoirs, Biometric Series I. Pearson introduced the concept of contingency tables to study associations between categorical variables, and his work laid the groundwork for what would later become the confusion matrix in classification analysis.[1]

The term "confusion matrix" itself emerged from the field of human perception studies. In the 1950s and 1960s, researchers studying auditory and visual stimuli used the term to describe tables that recorded how often subjects confused one stimulus with another. James Townsend's 1971 paper "Theoretical analysis of an alphabetic confusion matrix," published in *Perception and Psychophysics* (volume 9, pages 40-50), applied the concept to the classification of uppercase English letters, analyzing how frequently participants misidentified one letter as another. Townsend's study collected a confusion matrix for the full uppercase alphabet from six participants, each of whom completed 650 trials, and compared three mathematical models of recognition in their ability to predict the observed confusions.[2] The tool was later adopted by early machine learning researchers, including Frank Rosenblatt, who used it to compare human and machine classification performance.

The term gained formal recognition in the machine learning community through the glossary published by Ron Kohavi and Foster Provost in 1998, which appeared in a special issue of the journal *Machine Learning* (volume 30, pages 271-274).[3]

## Structure of a binary confusion matrix

For a binary classifier that assigns instances to either a positive class or a negative class, the confusion matrix is a 2 x 2 table with four cells:

| | **Predicted positive** | **Predicted negative** |
|---|---|---|
| **Actual positive** | [True positive](/wiki/true_positive_tp) (TP) | [False negative](/wiki/false_negative_fn) (FN) |
| **Actual negative** | [False positive](/wiki/false_positive_fp) (FP) | [True negative](/wiki/true_negative_tn) (TN) |

Each cell counts the number of instances that fall into that category:

- **True positive (TP):** The model correctly predicts the positive class. The instance is actually positive, and the classifier labels it as positive.
- **True negative (TN):** The model correctly predicts the negative class. The instance is actually negative, and the classifier labels it as negative.
- **False positive (FP):** The model incorrectly predicts the positive class. The instance is actually negative, but the classifier labels it as positive. This is also known as a **Type I error** or a "false alarm."
- **False negative (FN):** The model incorrectly predicts the negative class. The instance is actually positive, but the classifier labels it as negative. This is also known as a **Type II error** or a "miss."[4]

The diagonal of the matrix (TP and TN) represents correct predictions, while the off-diagonal entries (FP and FN) represent misclassifications.
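The four cells can be tallied directly from paired actual and predicted labels. The following is a minimal sketch in plain Python; the helper name `binary_cells`, its `positive` argument, and the sample labels are illustrative assumptions rather than part of any library.

```python
from collections import Counter

def binary_cells(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN for paired actual/predicted labels (hypothetical helper)."""
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            # Predicted positive: correct if the actual label is positive too.
            counts["TP" if actual == positive else "FP"] += 1
        else:
            # Predicted negative: a miss if the actual label is positive.
            counts["FN" if actual == positive else "TN"] += 1
    return {cell: counts[cell] for cell in ("TP", "FP", "FN", "TN")}

# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

cells = binary_cells(y_true, y_pred)
print(cells)
```

Returning a dictionary keyed by cell name, rather than a bare tuple, avoids relying on any particular cell ordering.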

There is no single universal cell ordering. The 2 x 2 layout above, with the positive class first, is the convention common in textbooks and medical statistics. Software libraries do not always follow it, which is a frequent source of bugs (see "Why does scikit-learn order the cells differently?" below). When reading any confusion matrix, always confirm which axis is the actual class, which is the predicted class, and how the classes are ordered before computing metrics.

### Worked example

Suppose a medical test is used to screen 200 patients for a disease. Of these, 50 actually have the disease and 150 do not. The test produces the following results:

| | **Predicted: Disease** | **Predicted: No disease** | **Total** |
|---|---|---|---|
| **Actual: Disease** | 45 (TP) | 5 (FN) | 50 |
| **Actual: No disease** | 10 (FP) | 140 (TN) | 150 |
| **Total** | 55 | 145 | 200 |

From this table, the test correctly identified 45 of the 50 patients who have the disease and correctly cleared 140 of the 150 patients who do not. However, it produced 10 false alarms and missed 5 actual cases.

## Type I and Type II errors

The confusion matrix provides a direct way to count and analyze the two fundamental types of classification errors from statistical hypothesis testing:

| Error type | Confusion matrix cell | Description | Also known as |
|---|---|---|---|
| Type I error | False positive (FP) | Rejecting a true null hypothesis; predicting positive when the actual class is negative | False alarm, alpha error |
| Type II error | False negative (FN) | Failing to reject a false null hypothesis; predicting negative when the actual class is positive | Miss, beta error |

In many real-world applications, the two error types carry different costs. A false negative in cancer screening (missing a cancer case) is generally far more harmful than a false positive (an unnecessary follow-up test). Understanding the relative costs of Type I and Type II errors is critical for selecting an appropriate [classification threshold](/wiki/classification_threshold) and evaluating model fitness for a given task.

## Derived performance metrics

One of the most valuable aspects of the confusion matrix is that it serves as the basis for computing a wide range of performance metrics. Each metric highlights a different aspect of classifier performance.[5]

### Primary metrics

| Metric | Formula | Description |
|---|---|---|
| [Accuracy](/wiki/accuracy) | $$\frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$ | Proportion of all predictions that are correct |
| [Precision](/wiki/precision) (positive predictive value) | $$\frac{\text{TP}}{\text{TP} + \text{FP}}$$ | Proportion of positive predictions that are actually positive |
| [Recall](/wiki/recall) (sensitivity, true positive rate) | $$\frac{\text{TP}}{\text{TP} + \text{FN}}$$ | Proportion of actual positives that are correctly identified |
| Specificity (true negative rate) | $$\frac{\text{TN}}{\text{TN} + \text{FP}}$$ | Proportion of actual negatives that are correctly identified |
| F1 score | $$\frac{2 \times (\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}}$$ | Harmonic mean of precision and recall |
| Negative predictive value (NPV) | $$\frac{\text{TN}}{\text{TN} + \text{FN}}$$ | Proportion of negative predictions that are actually negative |

### Error rate metrics

| Metric | Formula | Description |
|---|---|---|
| False positive rate (FPR) | $$\frac{\text{FP}}{\text{FP} + \text{TN}}$$ | Proportion of actual negatives incorrectly classified as positive; equals 1 - specificity |
| False negative rate (FNR) | $$\frac{\text{FN}}{\text{FN} + \text{TP}}$$ | Proportion of actual positives incorrectly classified as negative; equals 1 - recall |
| False discovery rate (FDR) | $$\frac{\text{FP}}{\text{FP} + \text{TP}}$$ | Proportion of positive predictions that are actually negative; equals 1 - precision |
| False omission rate (FOR) | $$\frac{\text{FN}}{\text{FN} + \text{TN}}$$ | Proportion of negative predictions that are actually positive; equals 1 - NPV[5] |

Using the medical screening example above: $$\text{accuracy} = (45 + 140) / 200 = 0.925$$, $$\text{precision} = 45 / 55 = 0.818$$, $$\text{recall} = 45 / 50 = 0.90$$, $$\text{specificity} = 140 / 150 = 0.933$$, and $$\text{F1 score} = 2 \times (0.818 \times 0.90) / (0.818 + 0.90) = 0.857$$.
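The formulas in the two tables above translate directly into code. The sketch below is one possible implementation, assuming the four counts are already known; `safe_div` and `binary_metrics` are hypothetical helper names, and returning NaN for a zero denominator is a design choice of this sketch, not a standard.

```python
def safe_div(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")

def binary_metrics(tp, fp, fn, tn):
    """Compute the primary and error-rate metrics from the four cells."""
    return {
        "accuracy": safe_div(tp + tn, tp + fp + fn + tn),
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "npv": safe_div(tn, tn + fn),
        # 2TP / (2TP + FP + FN) is algebraically the harmonic mean
        # of precision and recall.
        "f1": safe_div(2 * tp, 2 * tp + fp + fn),
        "fpr": safe_div(fp, fp + tn),  # 1 - specificity
        "fnr": safe_div(fn, fn + tp),  # 1 - recall
        "fdr": safe_div(fp, fp + tp),  # 1 - precision
        "for": safe_div(fn, fn + tn),  # 1 - NPV
    }

# Counts from the medical screening example.
metrics = binary_metrics(tp=45, fp=10, fn=5, tn=140)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Run on the screening counts, this reproduces the accuracy, precision, recall, specificity, and F1 values computed by hand above.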

## Matthews correlation coefficient

The Matthews correlation coefficient (MCC) is a metric that uses all four cells of the confusion matrix to produce a single score between -1 and +1. It is calculated as:

$$
\text{MCC} = \frac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{(\text{TP} + \text{FP})(\text{TP} + \text{FN})(\text{TN} + \text{FP})(\text{TN} + \text{FN})}}
$$

A value of +1 indicates perfect classification, 0 indicates performance no better than random guessing, and -1 indicates complete inverse classification. The MCC is widely regarded as one of the most reliable single-number metrics for binary classification evaluation. Unlike accuracy, it remains informative even when classes are severely imbalanced. As Chicco and Jurman put it in their 2020 study, the MCC "is a more reliable statistical rate which produces a high score only if the prediction obtained good results in all of the four confusion matrix categories," proportionally to the size of the positive and negative elements.[6] A 2021 follow-up by Chicco, Toetsch, and Jurman found that the MCC is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation.[12] A further study (Chicco and Jurman, 2023) argued that "the Matthews correlation coefficient should replace the ROC AUC as standard statistic in all the scientific studies involving a binary classification."[7]

The MCC is mathematically equivalent to the phi coefficient used in statistics, which measures the association between two binary variables.
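As a rough illustration, the MCC formula can be evaluated on the medical screening counts from the worked example. The function name `mcc` is a hypothetical helper. Returning 0 when the denominator is zero (which happens whenever a whole row or column of the matrix is empty) is a common convention adopted here as an assumption, since the formula itself is undefined in that case.

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from the four confusion matrix cells."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumption: report 0 when a row or column total is zero.
        return 0.0
    return (tp * tn - fp * fn) / denominator

# Counts from the medical screening example.
screening_mcc = mcc(tp=45, fp=10, fn=5, tn=140)
print(f"MCC = {screening_mcc:.3f}")
```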

## Cohen's kappa

Cohen's kappa is another metric derived from the confusion matrix that accounts for the possibility of correct predictions occurring by chance. It is computed as:

$$
\text{Kappa} = \frac{\text{observed accuracy} - \text{expected accuracy}}{1 - \text{expected accuracy}}
$$

Here, "observed accuracy" is the overall accuracy of the model, and "expected accuracy" is the probability that the model and the actual labels agree by random chance alone. The coefficient ranges from -1 to +1, where +1 represents perfect agreement and 0 represents agreement no better than chance.
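A short sketch may make the "expected accuracy" term concrete. It is computed from the row and column totals of the matrix: for each class, the product of the actual total and the predicted total, summed over classes and divided by the squared number of instances. The helper name `cohens_kappa` is hypothetical, and the matrix is the medical screening example with rows as actual classes and columns as predicted classes.

```python
def cohens_kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows: actual, columns: predicted)."""
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Chance agreement: sum over classes of P(actual = k) * P(predicted = k).
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    if expected == 1:
        # Degenerate case: every instance falls in one class on both axes.
        return float("nan")
    return (observed - expected) / (1 - expected)

# Medical screening example: [[TP, FN], [FP, TN]].
screening = [[45, 5], [10, 140]]
kappa = cohens_kappa(screening)
print(f"kappa = {kappa:.3f}")
```

For these counts the observed accuracy is 0.925 and the expected accuracy is 0.6125, so kappa is 0.3125 / 0.3875.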

Landis and Koch (1977) proposed the following interpretation guidelines:[8]

| Kappa value | Interpretation |
|---|---|
| < 0.00 | Poor |
| 0.00 - 0.20 | Slight |
| 0.21 - 0.40 | Fair |
| 0.41 - 0.60 | Moderate |
| 0.61 - 0.80 | Substantial |
| 0.81 - 1.00 | Almost perfect |

Cohen's kappa is particularly useful when dealing with imbalanced datasets because it penalizes classifiers that achieve high accuracy simply by predicting the majority class. However, some researchers have cautioned that kappa can be difficult to interpret and depends on the prevalence of each class. Delgado and Tibau (2019) argued that kappa should not be relied on as a standalone performance metric and is better reported alongside other measures.[10]

## When accuracy is misleading: imbalanced datasets

Accuracy, despite being the most intuitive metric, can be deeply misleading when class distributions are imbalanced.[11] Consider a dataset where 95% of instances belong to the negative class and only 5% belong to the positive class. A naive classifier that always predicts "negative" would achieve 95% accuracy while failing to identify a single positive instance.

The confusion matrix for such a classifier would look like this:

| | **Predicted positive** | **Predicted negative** |
|---|---|---|
| **Actual positive** | 0 | 50 |
| **Actual negative** | 0 | 950 |

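$$\text{Accuracy} = 950 / 1000 = 0.95$$, but $$\text{recall} = 0 / 50 = 0$$, and precision is undefined because the classifier makes no positive predictions. The MCC formula likewise evaluates to 0/0 here, since $$\text{TP} + \text{FP} = 0$$, and is conventionally reported as 0. The confusion matrix immediately exposes the problem that a single accuracy number hides.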

For imbalanced datasets, metrics such as precision, recall, F1 score, and MCC are generally more informative.[6][11] The confusion matrix itself remains the best starting point for understanding model behavior because it shows exactly how predictions are distributed across all classes.
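The always-negative baseline described above can be reproduced in a few lines. This is a minimal sketch using synthetic labels that match the 50/950 split in the table; representing an undefined precision as `None` is a choice made for this illustration.

```python
# Synthetic labels: 50 positives and 950 negatives, as in the table above.
y_true = [1] * 50 + [0] * 950
y_pred = [0] * 1000  # a classifier that always predicts "negative"

pairs = list(zip(y_true, y_pred))
tp = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 1)
fn = sum(1 for actual, predicted in pairs if actual == 1 and predicted == 0)
fp = sum(1 for actual, predicted in pairs if actual == 0 and predicted == 1)
tn = sum(1 for actual, predicted in pairs if actual == 0 and predicted == 0)

accuracy = (tp + tn) / len(pairs)
recall = tp / (tp + fn)
# Precision has a zero denominator when nothing is predicted positive.
precision = tp / (tp + fp) if (tp + fp) > 0 else None

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision}")
```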

## Cost-sensitive analysis

In many practical applications, different types of misclassifications carry different costs. A confusion matrix can be paired with a cost matrix to compute the total expected cost of a classifier's predictions.

A cost matrix assigns a numerical penalty to each cell of the confusion matrix:

| | **Predicted positive** | **Predicted negative** |
|---|---|---|
| **Actual positive** | C(TP) = 0 | C(FN) = cost of missing a positive |
| **Actual negative** | C(FP) = cost of a false alarm | C(TN) = 0 |

The total cost is calculated as: $$\text{Total Cost} = C(\text{FP}) \times \text{FP} + C(\text{FN}) \times \text{FN}$$.

For example, in fraud detection, a false negative (missing a fraudulent transaction) might cost $10,000 on average, while a false positive (flagging a legitimate transaction) might cost $50 in customer inconvenience. A model with 20 false negatives and 200 false positives would have a total cost of 20 x $10,000 + 200 x $50 = $210,000. Cost-sensitive learning algorithms use this framework to optimize classifiers for minimum total cost rather than maximum accuracy.
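The total cost calculation can be sketched as a weighted sum over the error cells. The helper name `total_cost` is hypothetical, the first set of counts and the per-error costs come from the fraud example above, and the second model is an invented comparison point included only to show how the framework ranks alternatives.

```python
def total_cost(error_counts, unit_costs):
    """Sum count * unit cost over the cells; cells without a cost count as free."""
    return sum(count * unit_costs.get(cell, 0) for cell, count in error_counts.items())

# Per-error costs from the fraud detection example (in dollars).
unit_costs = {"FN": 10_000, "FP": 50}

# Error counts from the fraud detection example.
model_a = {"FN": 20, "FP": 200}
# Hypothetical alternative: fewer false alarms, but more missed fraud.
model_b = {"FN": 40, "FP": 50}

cost_a = total_cost(model_a, unit_costs)
cost_b = total_cost(model_b, unit_costs)
print(f"model A: ${cost_a:,}")
print(f"model B: ${cost_b:,}")
```

Although the hypothetical model B makes far fewer errors overall (90 versus 220), its total cost is higher because each false negative is so much more expensive than a false positive.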

## Multi-class confusion matrix

When a classifier assigns instances to one of N classes (where N > 2), the confusion matrix expands to an N x N table. Each row corresponds to the actual class, and each column corresponds to the predicted class. The diagonal elements represent correct classifications, while off-diagonal elements represent misclassifications.

For example, a three-class classifier (classes A, B, C) might produce:

| | **Predicted A** | **Predicted B** | **Predicted C** |
|---|---|---|---|
| **Actual A** | 40 | 5 | 5 |
| **Actual B** | 3 | 42 | 5 |
| **Actual C** | 2 | 8 | 40 |

From this matrix, several observations can be made. Class B has the highest recall (42 out of 50 correct), while classes A and C each have 40 out of 50 correct. Class C is most often confused with class B (8 instances), suggesting these two classes share similar features. Class B also shows some confusion with class C (5 instances), further supporting the similarity between those classes.

For multi-class problems, per-class metrics can be derived by treating each class as a one-vs-rest binary problem. For class A, TP = 40, FP = 5 (instances predicted as A but actually B or C), FN = 10 (instances actually A but predicted as B or C), and TN = 95 (all remaining entries: 42 + 5 + 8 + 40). These per-class metrics can then be averaged across classes using macro-averaging (unweighted mean), micro-averaging (aggregate TP, FP, FN across all classes), or weighted averaging (weighted by class support).[9]
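The one-vs-rest decomposition and the three averaging schemes can be sketched as follows, using the three-class matrix above. The helper name `one_vs_rest` is hypothetical, and the code assumes every class receives at least one prediction, so no precision denominator is zero.

```python
# Three-class example: rows are actual classes, columns are predicted classes.
matrix = [
    [40, 5, 5],   # actual A
    [3, 42, 5],   # actual B
    [2, 8, 40],   # actual C
]

def one_vs_rest(matrix, k):
    """Return (TP, FP, FN, TN) for class index k treated as the positive class."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                 # rest of row k
    fp = sum(row[k] for row in matrix) - tp  # rest of column k
    tn = total - tp - fn - fp
    return tp, fp, fn, tn

per_class = [one_vs_rest(matrix, k) for k in range(len(matrix))]
precisions = [tp / (tp + fp) for tp, fp, fn, tn in per_class]
supports = [tp + fn for tp, fp, fn, tn in per_class]

# Macro: unweighted mean of the per-class values.
macro_precision = sum(precisions) / len(precisions)
# Micro: pool TP and FP across classes before dividing.
total_tp = sum(tp for tp, fp, fn, tn in per_class)
total_fp = sum(fp for tp, fp, fn, tn in per_class)
micro_precision = total_tp / (total_tp + total_fp)
# Weighted: per-class values weighted by the number of actual instances.
weighted_precision = sum(p * s for p, s in zip(precisions, supports)) / sum(supports)

print(per_class)
print(f"macro={macro_precision:.3f} micro={micro_precision:.3f} weighted={weighted_precision:.3f}")
```

Because all three classes have the same support (50 instances each), the weighted average coincides with the macro average in this example; the two diverge when class sizes differ.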

## Normalized confusion matrix

Raw counts in a confusion matrix can be hard to interpret when classes have different numbers of instances. Normalization converts the raw counts to proportions, making it easier to compare performance across classes.

### Row normalization (recall-based)

Dividing each row by its row sum converts each entry to the proportion of instances in that actual class that were classified into each predicted class. The diagonal entries become the per-class recall values. This normalization is useful for answering: "Given an instance from class X, what is the probability the model predicts each class?"

### Column normalization (precision-based)

Dividing each column by its column sum converts each entry to the proportion of instances predicted as that class that actually belong to each actual class. The diagonal entries become the per-class precision values. This normalization answers: "Given a prediction of class X, what is the probability the instance actually belongs to each class?"

Using the three-class example above with row normalization:

| | **Predicted A** | **Predicted B** | **Predicted C** |
|---|---|---|---|
| **Actual A** | 0.80 | 0.10 | 0.10 |
| **Actual B** | 0.06 | 0.84 | 0.10 |
| **Actual C** | 0.04 | 0.16 | 0.80 |

Normalized confusion matrices are especially helpful when classes are imbalanced, as they prevent large classes from dominating the visual impression of the matrix.
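Both normalizations are simple element-wise divisions. The sketch below applies them to the three-class example; the helper names `normalize_rows` and `normalize_columns` are hypothetical, and mapping an empty row or column to zeros is an assumption of this sketch.

```python
matrix = [
    [40, 5, 5],   # actual A
    [3, 42, 5],   # actual B
    [2, 8, 40],   # actual C
]

def normalize_rows(matrix):
    """Divide each entry by its row sum; the diagonal becomes per-class recall."""
    result = []
    for row in matrix:
        row_sum = sum(row)
        result.append([value / row_sum if row_sum else 0.0 for value in row])
    return result

def normalize_columns(matrix):
    """Divide each entry by its column sum; the diagonal becomes per-class precision."""
    col_sums = [sum(col) for col in zip(*matrix)]
    return [
        [value / col_sum if col_sum else 0.0 for value, col_sum in zip(row, col_sums)]
        for row in matrix
    ]

by_row = normalize_rows(matrix)
by_column = normalize_columns(matrix)

for row in by_row:
    print([round(value, 2) for value in row])
for row in by_column:
    print([round(value, 2) for value in row])
```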

## Confusion matrix in medical diagnosis

The confusion matrix has a long history of use in clinical medicine, where sensitivity and specificity are the primary measures of diagnostic test performance.

| Medical term | Statistical equivalent | Formula |
|---|---|---|
| Sensitivity | Recall / true positive rate | $$\frac{\text{TP}}{\text{TP} + \text{FN}}$$ |
| Specificity | True negative rate | $$\frac{\text{TN}}{\text{TN} + \text{FP}}$$ |
| Positive predictive value (PPV) | Precision | $$\frac{\text{TP}}{\text{TP} + \text{FP}}$$ |
| Negative predictive value (NPV) | -- | $$\frac{\text{TN}}{\text{TN} + \text{FN}}$$ |

Screening tests (such as mammography for breast cancer or PCR tests for infectious diseases) are typically designed with high sensitivity to minimize the chance of missing a true case, even at the expense of lower specificity. Confirmatory diagnostic tests, by contrast, are designed with high specificity to minimize false positives.

The trade-off between sensitivity and specificity is inherent in any diagnostic or classification system. Lowering the classification threshold increases sensitivity (catching more true positives) but decreases specificity (producing more false positives). Clinicians and machine learning practitioners must select a threshold that balances these concerns based on the clinical or business context.

## Relationship to ROC curves

A receiver operating characteristic (ROC) curve is closely related to the confusion matrix. Each point on an ROC curve corresponds to a specific classification threshold, and at each threshold, a distinct confusion matrix is generated. The ROC curve plots the true positive rate (recall) on the y-axis against the false positive rate (1 - specificity) on the x-axis across all possible thresholds.[4]

While a confusion matrix captures model performance at a single threshold, the ROC curve provides a comprehensive view of performance across all thresholds. The area under the ROC curve (AUC) summarizes this into a single number between 0 and 1, where 1 represents perfect classification and 0.5 represents random guessing.

The two tools complement each other. The confusion matrix is essential when a specific operating point (threshold) has been chosen and the exact counts of correct and incorrect predictions matter. The ROC curve is more useful during model selection and threshold tuning, when the goal is to understand how performance changes as the threshold varies.
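To illustrate how each threshold yields its own confusion matrix and therefore its own ROC point, the sketch below sweeps three thresholds over a small set of made-up scores. The helper name `confusion_at`, the scores, and the rule "predict positive when the score is at or above the threshold" are all assumptions of this example.

```python
def confusion_at(y_true, scores, threshold):
    """Return (TP, FP, FN, TN) when predicting positive for score >= threshold."""
    tp = fp = fn = tn = 0
    for actual, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Made-up labels and classifier scores for illustration only.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

roc_points = []
for threshold in (0.2, 0.5, 0.75):
    tp, fp, fn, tn = confusion_at(y_true, scores, threshold)
    tpr = tp / (tp + fn)  # recall, plotted on the y-axis
    fpr = fp / (fp + tn)  # 1 - specificity, plotted on the x-axis
    roc_points.append((threshold, fpr, tpr))
    print(f"threshold={threshold}: TP={tp} FP={fp} FN={fn} TN={tn} FPR={fpr:.2f} TPR={tpr:.2f}")
```

Lowering the threshold from 0.5 to 0.2 raises the true positive rate and the false positive rate together, which is the sensitivity and specificity trade-off expressed in ROC terms.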

## Visualization

Confusion matrices are commonly visualized as heatmaps, where the intensity of each cell's color corresponds to its value. This makes it easy to spot patterns at a glance: strongly colored diagonal cells indicate strong classification, while strongly colored off-diagonal cells indicate systematic confusion between specific classes.

Most machine learning libraries provide built-in support for confusion matrix visualization:

**scikit-learn (Python):**

```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

y_true = [0, 1, 0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [0, 1, 0, 0, 0, 1, 1, 0, 1, 1]

cm = confusion_matrix(y_true, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["Negative", "Positive"])
disp.plot(cmap=plt.cm.Blues)
plt.title("Confusion Matrix")
plt.show()
```

The `sklearn.metrics.confusion_matrix` function accepts `y_true` and `y_pred` arrays and returns a NumPy array. The optional `normalize` parameter can be set to `'true'` (row normalization), `'pred'` (column normalization), or `'all'` (normalization over the entire matrix).

**PyTorch (via torchmetrics):**

```python
from torchmetrics.classification import BinaryConfusionMatrix
import torch

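metric = BinaryConfusionMatrix()
preds = torch.tensor([0, 1, 0, 0, 0, 1, 1, 0, 1, 1])
target = torch.tensor([0, 1, 0, 1, 0, 1, 0, 0, 1, 1])
# Note the argument order: predictions first, then targets,
# the reverse of scikit-learn's (y_true, y_pred).
cm = metric(preds, target)
print(cm)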
```

**TensorFlow:**

```python
import tensorflow as tf

y_true = [0, 1, 0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [0, 1, 0, 0, 0, 1, 1, 0, 1, 1]

# tf.math.confusion_matrix takes the actual labels first, then the predictions.
cm = tf.math.confusion_matrix(y_true, y_pred)
print(cm)
```

## Why does scikit-learn order the cells differently?

A common source of confusion is that the scikit-learn layout does not match the positive-first textbook table shown above. The scikit-learn documentation defines the matrix so that "$$C_{i, j}$$ is equal to the number of observations known to be in group $$i$$ and predicted to be in group $$j$$."[13] Because scikit-learn sorts the class labels in ascending order and treats the negative class (label 0) as group 0, the binary matrix is laid out with the negative class first:

| | **Predicted negative (0)** | **Predicted positive (1)** |
|---|---|---|
| **Actual negative (0)** | $$\text{TN} = C[0,0]$$ | $$\text{FP} = C[0,1]$$ |
| **Actual positive (1)** | $$\text{FN} = C[1,0]$$ | $$\text{TP} = C[1,1]$$ |

This is why the documented idiom for unpacking the four counts is `tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()`, with true negatives first and true positives last, the reverse of the order in which TP, FP, FN, and TN are usually introduced.[13] Reading the cells in the textbook order without checking the library convention silently mislabels the counts: unpacking them as TP, FN, FP, TN exchanges TP with TN and FP with FN, so the value computed as precision is actually the negative predictive value and the value computed as recall is actually the specificity. The class ordering can be controlled explicitly with the `labels` parameter.
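The effect of label ordering can be seen without any library. The sketch below uses a hypothetical helper, `confusion_counts`, that follows the same row-is-actual, column-is-predicted convention and sorts labels in ascending order by default; it only mimics the documented behavior and is not scikit-learn's implementation. The labels are synthetic and reproduce the medical screening counts.

```python
def confusion_counts(y_true, y_pred, labels=None):
    """Build a matrix where entry [i][j] counts actual labels[i] predicted as labels[j]."""
    if labels is None:
        # Default: ascending label order, so the negative class (0) comes first.
        labels = sorted(set(y_true) | set(y_pred))
    index = {label: position for position, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix

# Synthetic labels matching the screening example: 45 TP, 5 FN, 10 FP, 140 TN.
y_true = [1] * 50 + [0] * 150
y_pred = [1] * 45 + [0] * 5 + [1] * 10 + [0] * 140

negative_first = confusion_counts(y_true, y_pred)
positive_first = confusion_counts(y_true, y_pred, labels=[1, 0])

# Negative-first layout unpacks as TN, FP, FN, TP.
(tn, fp), (fn, tp) = negative_first

print(negative_first)
print(positive_first)
```

Passing the labels in the order `[1, 0]` recovers the positive-first textbook layout, which is why stating the label order explicitly is a safer habit than relying on the default.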

## Explain like I'm 5 (ELI5)

Imagine you have a big pile of toys, and your job is to sort them into two boxes: one box for cars and one box for dolls. After you finish sorting, your mom checks your work and makes a chart:

- **Cars in the car box:** You got these right. (True positive)
- **Dolls in the doll box:** You got these right too. (True negative)
- **Dolls that ended up in the car box:** Oops, you made a mistake. (False positive)
- **Cars that ended up in the doll box:** Another mistake. (False negative)

That chart is a confusion matrix. It shows you exactly what you got right and what you got wrong, so you can figure out where you need to be more careful next time. If you were sorting into three boxes (cars, dolls, and blocks), the chart would be bigger, with more rows and columns to track all the possible mix-ups.

## Limitations

While the confusion matrix is an essential evaluation tool, it has several limitations:

- **Single threshold:** A confusion matrix captures performance at one specific decision threshold. It does not reveal how the classifier would perform at other thresholds.
- **No probability information:** The matrix does not indicate the confidence of predictions. Two classifiers with identical confusion matrices may have very different probability calibration.
- **Does not explain reasoning:** A confusion matrix shows what the model got right and wrong, but not why. Correct predictions may result from sound pattern recognition or from spurious correlations in the data.
- **Scales poorly for many classes:** For problems with dozens or hundreds of classes, the N x N matrix becomes difficult to read and interpret without normalization or aggregation.
- **Requires labeled data:** Computing a confusion matrix requires ground truth labels, which may be expensive or difficult to obtain in some domains.

## References

1. Pearson, K. (1904). "On the theory of contingency and its relation to association and normal correlation." *Drapers' Company Research Memoirs*, Biometric Series I.
2. Townsend, J. T. (1971). "Theoretical analysis of an alphabetic confusion matrix." *Perception and Psychophysics*, 9(1), 40-50.
3. Kohavi, R. and Provost, F. (1998). "Glossary of terms." *Machine Learning*, 30, 271-274.
4. Fawcett, T. (2006). "An introduction to ROC analysis." *Pattern Recognition Letters*, 27(8), 861-874.
5. Powers, D. M. W. (2011). "Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation." *Journal of Machine Learning Technologies*, 2(1), 37-63.
6. Chicco, D. and Jurman, G. (2020). "The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation." *BMC Genomics*, 21, 6.
7. Chicco, D. and Jurman, G. (2023). "The Matthews correlation coefficient (MCC) should replace the ROC AUC as the standard metric for assessing binary classification." *BioData Mining*, 16, 4.
8. Landis, J. R. and Koch, G. G. (1977). "The measurement of observer agreement for categorical data." *Biometrics*, 33(1), 159-174.
9. Stehman, S. V. (1997). "Selecting and interpreting measures of thematic classification accuracy." *Remote Sensing of Environment*, 62(1), 77-89.
10. Delgado, R. and Tibau, X. A. (2019). "Why Cohen's kappa should be avoided as performance measure in classification." *PLoS ONE*, 14(9), e0222916.
11. Luque, A. et al. (2019). "The impact of class imbalance in classification performance metrics based on the binary confusion matrix." *Pattern Recognition*, 91, 216-231.
12. Chicco, D., Toetsch, N. and Jurman, G. (2021). "The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation." *BioData Mining*, 14, 13.
13. scikit-learn developers. "sklearn.metrics.confusion_matrix." scikit-learn documentation. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html