The F1 score (also written as F1-score or F-measure) is a widely used evaluation metric in machine learning and information retrieval that combines precision and recall into a single number using the harmonic mean. It provides a balanced measure of a classification model's performance, particularly when the costs of false positives and false negatives are roughly equal. The F1 score ranges from 0 (worst) to 1 (best).
The F1 score is defined as the harmonic mean of precision and recall:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
Equivalently, using the components of the confusion matrix:
F1 = 2TP / (2TP + FP + FN)
Where:
| Term | Definition |
|---|---|
| TP (True Positives) | Correctly predicted positive instances |
| FP (False Positives) | Negative instances incorrectly predicted as positive (Type I error) |
| FN (False Negatives) | Positive instances incorrectly predicted as negative (Type II error) |
| TN (True Negatives) | Correctly predicted negative instances (not used in F1) |
Precision and recall measure different aspects of a classifier's performance:
Precision (also called positive predictive value) measures the fraction of predicted positives that are actually positive:
Precision = TP / (TP + FP)
Precision answers the question: "Of all instances the model labeled positive, how many were actually positive?"
Recall (also called sensitivity or true positive rate) measures the fraction of actual positives that were correctly identified:
Recall = TP / (TP + FN)
Recall answers the question: "Of all actually positive instances, how many did the model correctly identify?"
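As a quick sanity check, the harmonic-mean form and the confusion-matrix form of F1 can be verified against each other. The counts below are illustrative:

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic-mean form: F1 = 2PR / (P + R)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

tp, fp, fn = 80, 20, 20  # illustrative counts
p, r, f1 = precision_recall_f1(tp, fp, fn)

# The confusion-matrix form 2TP / (2TP + FP + FN) gives the same value.
assert abs(f1 - 2 * tp / (2 * tp + fp + fn)) < 1e-12
print(p, r, f1)
```

The zero-denominator guards return 0.0, matching the convention that F1 is 0 when precision or recall is undefined or zero.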
The F1 score reaches its maximum value of 1 when both precision and recall are perfect (no false positives and no false negatives). It reaches its minimum value of 0 when either precision or recall is 0.
A natural question is why the F1 score uses the harmonic mean rather than the arithmetic mean or geometric mean. The choice of harmonic mean has specific mathematical properties that make it particularly suitable for combining precision and recall.
Consider precision P and recall R. The three types of means produce different results:
| Mean Type | Formula | Value when P=0.9, R=0.1 | Value when P=0.5, R=0.5 |
|---|---|---|---|
| Arithmetic mean | (P + R) / 2 | 0.500 | 0.500 |
| Geometric mean | sqrt(P * R) | 0.300 | 0.500 |
| Harmonic mean (F1) | 2PR / (P + R) | 0.180 | 0.500 |
This example reveals the key property: when precision and recall are equal, all three means give the same value. But when they are highly imbalanced (P=0.9, R=0.1), the harmonic mean gives a much lower score (0.18) than the arithmetic mean (0.50).
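The table can be reproduced with a short sketch (same values as above):

```python
from math import sqrt

def means(p: float, r: float):
    """Arithmetic, geometric, and harmonic means of precision p and recall r."""
    arithmetic = (p + r) / 2
    geometric = sqrt(p * r)
    harmonic = 2 * p * r / (p + r) if p + r else 0.0  # this is F1
    return arithmetic, geometric, harmonic

print(means(0.9, 0.1))  # harmonic (~0.18) is far below arithmetic (0.5)
print(means(0.5, 0.5))  # all three agree when P = R
```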
The harmonic mean penalizes extreme imbalances between precision and recall. This is desirable for several reasons:
- A classifier that sacrifices one metric is not useful. A model with 99% precision but 1% recall catches almost no positive cases. The arithmetic mean would give it a misleading score of 0.50, but the F1 score correctly assigns it only about 0.02.
- Trivial classifiers are properly penalized. A classifier that labels everything as positive achieves 100% recall but very low precision. The harmonic mean ensures such a classifier receives a low F1 score.
- Both metrics must be good. To achieve a high F1 score, both precision and recall must be reasonably high. The harmonic mean is always less than or equal to the arithmetic mean, with equality only when both values are identical.
The harmonic mean H of two positive numbers a and b satisfies:
H(a, b) <= G(a, b) <= A(a, b)
where H is the harmonic mean, G is the geometric mean, and A is the arithmetic mean. Equality holds if and only if a = b. This inequality (known as the AM-GM-HM inequality) means the harmonic mean is the most conservative of the three, providing the strongest penalty for imbalanced values.
Additionally, the harmonic mean of two values equals zero if either value is zero. This ensures that a model with zero precision or zero recall receives an F1 score of exactly 0.
The F1 score treats precision and recall as equally important. In many applications, however, one metric matters more than the other. The F-beta score generalizes the F1 score by introducing a parameter beta that controls the relative weight given to precision versus recall.
F_beta = (1 + beta^2) * (Precision * Recall) / (beta^2 * Precision + Recall)
When beta = 1, this simplifies to the standard F1 score.
The parameter beta determines how much more recall is valued relative to precision:
| Beta Value | Name | Interpretation | Use Case |
|---|---|---|---|
| beta = 0.5 | F0.5 score | Precision is weighted twice as much as recall | Situations where false positives are costly (e.g., legal document review) |
| beta = 1 | F1 score | Precision and recall are weighted equally | General-purpose evaluation |
| beta = 2 | F2 score | Recall is weighted twice as much as precision | Situations where false negatives are costly (e.g., medical screening, cancer detection) |
| beta approaching 0 | - | Only precision matters | When false positives are extremely costly |
| beta approaching infinity | - | Only recall matters | When false negatives are extremely costly |
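A minimal sketch of the F-beta formula, using illustrative precision and recall values, shows how beta shifts the score toward whichever metric it emphasizes:

```python
def fbeta(precision: float, recall: float, beta: float) -> float:
    """F-beta: recall is weighted beta times as heavily as precision."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0

p, r = 0.6, 0.9  # illustrative: high recall, modest precision
print(fbeta(p, r, 0.5))  # emphasizes precision -> below F1 here
print(fbeta(p, r, 1.0))  # standard F1 (= 0.72 for these values)
print(fbeta(p, r, 2.0))  # emphasizes recall -> above F1 here
```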
In cancer screening, missing a case of cancer (false negative) is far more serious than a false alarm (false positive). Using the F2 score emphasizes recall, ensuring the model is evaluated primarily on its ability to catch all positive cases, even at the expense of some false positives.
Conversely, in a search engine ranking scenario, presenting irrelevant results (false positives) degrades user experience more than missing some relevant results (false negatives). The F0.5 score would be more appropriate here.
The F-measure was introduced by Cornelis Joost van Rijsbergen in his 1979 book "Information Retrieval." Van Rijsbergen, widely regarded as one of the founding figures of the information retrieval field, originally defined a related measure he called the "effectiveness" function E, which measured how well a retrieval system performed with respect to a user who attached beta times as much importance to recall as to precision. The name "F-measure" was adopted later, reportedly at the Fourth Message Understanding Conference (MUC-4) in 1992, and has since become the standard terminology in both information retrieval and machine learning.
For multi-class classification problems (where there are more than two classes), the F1 score must be aggregated across classes. There are three common aggregation strategies: macro, micro, and weighted averaging.
Macro F1 computes the F1 score independently for each class and then takes the unweighted arithmetic mean:
Macro F1 = (1/C) * sum from c=1 to C of F1_c
Where C is the number of classes and F1_c is the F1 score for class c.
Properties:
| Property | Description |
|---|---|
| Treatment of classes | Each class contributes equally, regardless of its size |
| Effect of rare classes | Rare classes have the same influence as common classes |
| Sensitivity | Sensitive to performance on minority classes |
| When to use | When all classes are equally important, even if their sizes differ |
Example: In a three-class problem with F1 scores of 0.95 (class A, 1000 samples), 0.80 (class B, 100 samples), and 0.40 (class C, 10 samples), the macro F1 is (0.95 + 0.80 + 0.40) / 3 = 0.717.
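That arithmetic can be checked directly; note that class sizes play no role in macro averaging (per-class F1 values taken from the example):

```python
# Per-class F1 scores from the example; supports (1000, 100, 10) are ignored.
per_class_f1 = {"A": 0.95, "B": 0.80, "C": 0.40}

macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(round(macro_f1, 3))  # 0.717
```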
Micro F1 aggregates the contributions of all classes by summing up the individual true positives, false positives, and false negatives across all classes, then computing a single F1 score:
Micro Precision = sum(TP_c) / sum(TP_c + FP_c)
Micro Recall = sum(TP_c) / sum(TP_c + FN_c)
Micro F1 = 2 * Micro Precision * Micro Recall / (Micro Precision + Micro Recall)
Properties:
| Property | Description |
|---|---|
| Treatment of classes | Larger classes dominate the score |
| Effect of rare classes | Rare classes have minimal influence |
| Equivalence | In multi-class single-label classification, micro F1 equals overall accuracy |
| When to use | When each individual prediction matters equally, regardless of class |
An important property of micro F1 in the multi-class single-label setting: because each sample belongs to exactly one class, every false positive for one class is a false negative for another class. This means micro precision equals micro recall, and both equal the overall accuracy. Consequently, micro F1 also equals accuracy in this setting.
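This counting argument can be verified without any library; the labels below are illustrative:

```python
y_true = ["a", "a", "a", "b", "b", "c", "c", "c", "c", "b"]
y_pred = ["a", "b", "a", "b", "c", "c", "c", "a", "c", "b"]

tp = sum(t == p for t, p in zip(y_true, y_pred))
# In single-label classification, every wrong prediction is simultaneously
# one FP (for the predicted class) and one FN (for the true class), so the
# summed FP and FN counts are both N - TP.
fp = fn = len(y_true) - tp

micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)
accuracy = tp / len(y_true)

print(micro_f1, accuracy)  # identical
```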
Weighted F1 computes the F1 score for each class and takes a weighted average, where each class's weight is proportional to the number of true instances (support) of that class:
Weighted F1 = sum from c=1 to C of (n_c / N) * F1_c
Where n_c is the number of true instances of class c and N is the total number of instances.
Properties:
| Property | Description |
|---|---|
| Treatment of classes | Each class is weighted by its frequency |
| Effect of rare classes | Rare classes have proportionally less influence |
| Sensitivity | Reflects performance on the most common classes |
| When to use | When you want a single number that accounts for class distribution |
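Using the same three illustrative classes as the macro example, the weighted average comes out much higher because the large, well-classified class dominates:

```python
# Per-class F1 scores and supports (number of true instances), illustrative values
f1_scores = [0.95, 0.80, 0.40]
supports = [1000, 100, 10]

total = sum(supports)
weighted_f1 = sum(f * n / total for f, n in zip(f1_scores, supports))
print(round(weighted_f1, 3))  # ~0.932, versus a macro F1 of 0.717
```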
| Method | Equal class weight? | Dominates by large classes? | Equals accuracy? | Best for |
|---|---|---|---|---|
| Macro | Yes | No | No | Imbalanced datasets where all classes matter equally |
| Micro | No | Yes | Yes (single-label) | When each prediction is equally important |
| Weighted | No | Yes | No | When a single summary metric accounting for class size is needed |
Accuracy is the simplest classification metric: the fraction of all predictions that are correct. While intuitive, accuracy can be misleading, particularly on imbalanced datasets.
Consider a binary classification problem where 95% of samples belong to the negative class and 5% belong to the positive class. A trivial classifier that always predicts "negative" achieves 95% accuracy, which sounds impressive but is completely useless for identifying positive cases.
The F1 score provides a more meaningful evaluation in this scenario:
| Metric | Trivial "always negative" classifier | Reasonable classifier |
|---|---|---|
| Accuracy | 0.950 | 0.930 |
| Precision | Undefined (0/0) or 0 | 0.400 |
| Recall | 0.000 | 0.800 |
| F1 Score | 0.000 | 0.533 |
The trivial classifier gets an F1 of 0 because it has zero recall, correctly reflecting its failure to detect any positive cases. The reasonable classifier has lower accuracy (93% vs. 95%) but a far higher F1 score (0.533 vs. 0), correctly indicating that it is the better model for the task. (The numbers are internally consistent: with recall 0.8 and precision 0.4 on the 5% positive class, the confusion matrix as fractions of the dataset is TP = 0.04, FN = 0.01, FP = 0.06, TN = 0.89, giving accuracy 0.93.)
| Scenario | Recommended Metric | Reason |
|---|---|---|
| Balanced classes | Accuracy | Simple and interpretable when classes are roughly equal |
| Imbalanced classes | F1 score | Accounts for both false positives and false negatives |
| Cost of FP approximately equals cost of FN | F1 score | Treats precision and recall equally |
| Cost of FP differs from cost of FN | F-beta score | Allows weighting precision vs. recall |
| Multi-class, all classes matter equally | Macro F1 | Gives equal weight to minority classes |
| Multi-class, each sample matters equally | Micro F1 (= accuracy) | Naturally weights by class frequency |
| Ranking or retrieval tasks | F1 at threshold, or area metrics | F1 at a specific threshold; consider AUC-ROC or average precision for threshold-independent evaluation |
The F1 score has a deep relationship with class imbalance, which is one of the most common challenges in applied machine learning.
The F1 score ignores true negatives (TN). This is a feature, not a bug. In imbalanced datasets, true negatives typically dominate the confusion matrix, and including them in the metric (as accuracy does) masks poor performance on the minority class.
Consider a fraud detection system evaluating 10,000 transactions, of which 100 are fraudulent:
| Prediction | TP | FP | FN | TN | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|---|
| Predict all "not fraud" | 0 | 0 | 100 | 9900 | 0.990 | 0/0 | 0.000 | 0.000 |
| Predict all "fraud" | 100 | 9900 | 0 | 0 | 0.010 | 0.010 | 1.000 | 0.020 |
| Good model | 80 | 20 | 20 | 9880 | 0.996 | 0.800 | 0.800 | 0.800 |
The F1 score correctly distinguishes between these three scenarios, whereas accuracy fails to separate the trivial "predict all not fraud" model (99.0% accuracy) from the good model (99.6% accuracy) in a meaningful way.
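The table can be reproduced with a small helper (counts taken directly from the table):

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return accuracy, precision, recall, f1

print(metrics(0, 0, 100, 9900))   # all "not fraud": accuracy 0.99, F1 = 0
print(metrics(100, 9900, 0, 0))   # all "fraud": recall 1.0, F1 ~ 0.02
print(metrics(80, 20, 20, 9880))  # good model: F1 = 0.8
```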
In most classifiers, precision and recall are inversely related: adjusting the classification threshold to increase one typically decreases the other. The F1 score represents one particular balance point on this tradeoff. The threshold that maximizes F1 is not necessarily the same as the threshold that maximizes accuracy or that matches a specific business requirement.
The precision-recall curve plots precision against recall at various thresholds, providing a complete picture of the tradeoff. The area under the precision-recall curve (average precision, or AP) provides a threshold-independent summary metric that is often more informative than the F1 score at any single threshold.
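A sketch of this threshold sweep, assuming scikit-learn is available; the labels and scores are illustrative:

```python
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
ap = average_precision_score(y_true, y_scores)

# F1 at each point on the curve; the threshold that maximizes F1 need not
# be the one that maximizes accuracy or meets a business constraint.
f1_at = [2 * p * r / (p + r) if p + r else 0.0
         for p, r in zip(precision, recall)]
print(max(f1_at), ap)
```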
Despite its widespread use, the F1 score has several known limitations:
| Limitation | Explanation |
|---|---|
| Ignores true negatives | The F1 score does not reward a model for correctly identifying negative cases, which may be important in some applications |
| Threshold-dependent | F1 is computed at a specific classification threshold; changing the threshold changes the F1 score |
| Not intuitive | The harmonic mean is less intuitive than the arithmetic mean; the F1 value does not have a simple probabilistic interpretation |
| Assumes equal cost of errors | The F1 score treats false positives and false negatives as equally bad, which is rarely true in practice |
| Sensitive to class definition | Which class is "positive" affects the score; swapping positive and negative classes produces a different F1 |
| Can be gamed | A model can be optimized to maximize F1 at a specific threshold without being well-calibrated or useful |
Powers (2011) and Chicco and Jurman (2020) have argued that the Matthews Correlation Coefficient (MCC) is a more reliable metric for binary classification because it uses all four cells of the confusion matrix and is invariant to which class is designated as positive. The MCC ranges from -1 to +1, where +1 indicates perfect prediction, 0 indicates no better than random, and -1 indicates complete disagreement.
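MCC is straightforward to compute from the four counts; this sketch reuses the fraud-detection counts from the earlier table:

```python
from math import sqrt

def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient; returns 0.0 when any marginal is empty."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(mcc(0, 0, 100, 9900))   # trivial "always negative": 0.0, no better than random
print(mcc(80, 20, 20, 9880))  # good model: ~0.8, strong on both classes
```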
The F1 score exists within a broader ecosystem of classification evaluation metrics:
| Metric | Formula | Relationship to F1 |
|---|---|---|
| Precision | TP / (TP + FP) | Component of F1 |
| Recall | TP / (TP + FN) | Component of F1 |
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Includes TN; can be misleading on imbalanced data |
| F-beta | (1 + beta^2) * P * R / (beta^2 * P + R) | Generalization of F1 |
| Matthews Correlation Coefficient | (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)) | Uses all four confusion matrix cells |
| AUC-ROC | Area under ROC curve | Threshold-independent; uses TN |
| Average Precision | Area under precision-recall curve | Threshold-independent; ignores TN like F1 |
| Cohen's Kappa | Agreement adjusted for chance | Accounts for class imbalance differently |
| Jaccard Index | TP / (TP + FP + FN) | Monotonically related to F1; also called Intersection over Union |
The F1 score and the Jaccard index (also called Intersection over Union, IoU) are monotonically related:
F1 = 2 * J / (1 + J)
J = F1 / (2 - F1)
Where J is the Jaccard index. This means that ranking models by F1 produces the same ordering as ranking by the Jaccard index.
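The conversion in both directions can be checked with illustrative counts:

```python
def jaccard_to_f1(j: float) -> float:
    return 2 * j / (1 + j)

def f1_to_jaccard(f1: float) -> float:
    return f1 / (2 - f1)

# Both identities hold for scores computed from the same confusion matrix.
tp, fp, fn = 80, 20, 20  # illustrative counts
j = tp / (tp + fp + fn)           # Jaccard index: 80/120
f1 = 2 * tp / (2 * tp + fp + fn)  # F1: 160/200
assert abs(jaccard_to_f1(j) - f1) < 1e-12
assert abs(f1_to_jaccard(f1) - j) < 1e-12
```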
The scikit-learn library provides functions for computing F1 scores in various configurations:
| Function | Purpose |
|---|---|
| sklearn.metrics.f1_score(y_true, y_pred, average='binary') | Binary classification F1 |
| sklearn.metrics.f1_score(y_true, y_pred, average='macro') | Macro-averaged F1 |
| sklearn.metrics.f1_score(y_true, y_pred, average='micro') | Micro-averaged F1 |
| sklearn.metrics.f1_score(y_true, y_pred, average='weighted') | Weighted F1 |
| sklearn.metrics.classification_report(y_true, y_pred) | Full report with per-class and averaged metrics |
| sklearn.metrics.fbeta_score(y_true, y_pred, beta=2) | F-beta score with custom beta |
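A minimal usage sketch with illustrative three-class labels (scikit-learn assumed installed):

```python
from sklearn.metrics import classification_report, f1_score

# Illustrative labels for a three-class problem
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 1]
y_pred = [0, 1, 0, 1, 2, 2, 2, 0, 2, 1]

print(f1_score(y_true, y_pred, average="macro"))
print(f1_score(y_true, y_pred, average="micro"))  # equals accuracy here
print(f1_score(y_true, y_pred, average="weighted"))
print(classification_report(y_true, y_pred))  # per-class and averaged metrics
```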
A common question in practice is which averaging method to use for multi-class problems: