The F1 score (also written as F1-score or F-measure) is a widely used evaluation metric in machine learning and information retrieval that combines precision and recall into a single number using the harmonic mean. It provides a balanced measure of a classification model's performance, particularly when the costs of false positives and false negatives are roughly equal. The F1 score ranges from 0 (worst) to 1 (best).
The F1 score is defined as the harmonic mean of precision and recall:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
Equivalently, using the components of the confusion matrix:
F1 = 2TP / (2TP + FP + FN)
Where:
| Term | Definition |
|---|---|
| TP (True Positives) | Correctly predicted positive instances |
| FP (False Positives) | Negative instances incorrectly predicted as positive (Type I error) |
| FN (False Negatives) | Positive instances incorrectly predicted as negative (Type II error) |
| TN (True Negatives) | Correctly predicted negative instances (not used in F1) |
Precision and recall measure different aspects of a classifier's performance:
Precision (also called positive predictive value) measures the fraction of predicted positives that are actually positive:
Precision = TP / (TP + FP)
Precision answers the question: "Of all instances the model labeled positive, how many were actually positive?"
Recall (also called sensitivity or true positive rate) measures the fraction of actual positives that were correctly identified:
Recall = TP / (TP + FN)
Recall answers the question: "Of all actually positive instances, how many did the model correctly identify?"
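As a quick sanity check, the harmonic-mean form and the confusion-matrix form of F1 can be verified against each other. The counts below are illustrative:

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic-mean form: F1 = 2PR / (P + R)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

tp, fp, fn = 80, 20, 20  # illustrative counts
p, r, f1 = precision_recall_f1(tp, fp, fn)

# The confusion-matrix form 2TP / (2TP + FP + FN) gives the same value.
assert abs(f1 - 2 * tp / (2 * tp + fp + fn)) < 1e-12
print(p, r, f1)
```

The zero-denominator guards return 0.0, matching the convention that F1 is 0 when precision or recall is undefined or zero.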
The F1 score reaches its maximum value of 1 when both precision and recall are perfect (no false positives and no false negatives). It reaches its minimum value of 0 when either precision or recall is 0.
A natural question is why the F1 score uses the harmonic mean rather than the arithmetic mean or geometric mean. The choice of harmonic mean has specific mathematical properties that make it particularly suitable for combining precision and recall.
Consider precision P and recall R. The three types of means produce different results:
| Mean Type | Formula | Value when P=0.9, R=0.1 | Value when P=0.5, R=0.5 |
|---|---|---|---|
| Arithmetic mean | (P + R) / 2 | 0.500 | 0.500 |
| Geometric mean | sqrt(P * R) | 0.300 | 0.500 |
| Harmonic mean (F1) | 2PR / (P + R) | 0.180 | 0.500 |
This example reveals the key property: when precision and recall are equal, all three means give the same value. But when they are highly imbalanced (P=0.9, R=0.1), the harmonic mean gives a much lower score (0.18) than the arithmetic mean (0.50).
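The table can be reproduced with a short sketch (same values as above):

```python
from math import sqrt

def means(p: float, r: float):
    """Arithmetic, geometric, and harmonic means of precision p and recall r."""
    arithmetic = (p + r) / 2
    geometric = sqrt(p * r)
    harmonic = 2 * p * r / (p + r) if p + r else 0.0  # this is F1
    return arithmetic, geometric, harmonic

print(means(0.9, 0.1))  # harmonic (~0.18) is far below arithmetic (0.5)
print(means(0.5, 0.5))  # all three agree when P = R
```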
The harmonic mean penalizes extreme imbalances between precision and recall. This is desirable for several reasons:
- A classifier that sacrifices one metric is not useful. A model with 99% precision but 1% recall catches almost no positive cases. The arithmetic mean would give it a misleading score of 0.50, but the F1 score correctly assigns it only about 0.02.
- Trivial classifiers are properly penalized. A classifier that labels everything as positive achieves 100% recall but very low precision. The harmonic mean ensures such a classifier receives a low F1 score.
- Both metrics must be good. To achieve a high F1 score, both precision and recall must be reasonably high. The harmonic mean is always less than or equal to the arithmetic mean, with equality only when both values are identical.
The harmonic mean H of two positive numbers a and b satisfies:
H(a, b) <= G(a, b) <= A(a, b)
where H is the harmonic mean, G is the geometric mean, and A is the arithmetic mean. Equality holds if and only if a = b. This inequality (known as the AM-GM-HM inequality) means the harmonic mean is the most conservative of the three, providing the strongest penalty for imbalanced values.
Additionally, the harmonic mean of two values equals zero if either value is zero. This ensures that a model with zero precision or zero recall receives an F1 score of exactly 0.
The F1 score treats precision and recall as equally important. In many applications, however, one metric matters more than the other. The F-beta score generalizes the F1 score by introducing a parameter beta that controls the relative weight given to precision versus recall.
F_beta = (1 + beta^2) * (Precision * Recall) / (beta^2 * Precision + Recall)
When beta = 1, this simplifies to the standard F1 score.
The parameter beta determines how much more recall is valued relative to precision:
| Beta Value | Name | Interpretation | Use Case |
|---|---|---|---|
| beta = 0.5 | F0.5 score | Precision is weighted twice as much as recall | Situations where false positives are costly (e.g., legal document review) |
| beta = 1 | F1 score | Precision and recall are weighted equally | General-purpose evaluation |
| beta = 2 | F2 score | Recall is weighted twice as much as precision | Situations where false negatives are costly (e.g., medical screening, cancer detection) |
| beta approaching 0 | - | Only precision matters | When false positives are extremely costly |
| beta approaching infinity | - | Only recall matters | When false negatives are extremely costly |
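A minimal sketch of the F-beta formula, using illustrative precision and recall values, shows how beta shifts the score toward whichever metric it emphasizes:

```python
def fbeta(precision: float, recall: float, beta: float) -> float:
    """F-beta: recall is weighted beta times as heavily as precision."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0

p, r = 0.6, 0.9  # illustrative: high recall, modest precision
print(fbeta(p, r, 0.5))  # emphasizes precision -> below F1 here
print(fbeta(p, r, 1.0))  # standard F1 (= 0.72 for these values)
print(fbeta(p, r, 2.0))  # emphasizes recall -> above F1 here
```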
In cancer screening, missing a case of cancer (false negative) is far more serious than a false alarm (false positive). Using the F2 score emphasizes recall, ensuring the model is evaluated primarily on its ability to catch all positive cases, even at the expense of some false positives.
Conversely, in a search engine ranking scenario, presenting irrelevant results (false positives) degrades user experience more than missing some relevant results (false negatives). The F0.5 score would be more appropriate here.
The F-measure was introduced by Cornelis Joost van Rijsbergen in his 1979 book "Information Retrieval." Van Rijsbergen, widely regarded as one of the founding figures of the information retrieval field, originally defined a related measure he called the "effectiveness" function E, which measured how well a retrieval system performed with respect to a user who attached beta times as much importance to recall as to precision. The name "F-measure" was adopted later, reportedly at the Fourth Message Understanding Conference (MUC-4) in 1992, and has since become the standard terminology in both information retrieval and machine learning.
For multi-class classification problems (where there are more than two classes), the F1 score must be aggregated across classes. There are three common aggregation strategies: macro, micro, and weighted averaging.
Macro F1 computes the F1 score independently for each class and then takes the unweighted arithmetic mean:
Macro F1 = (1/C) * sum from c=1 to C of F1_c
Where C is the number of classes and F1_c is the F1 score for class c.
Properties:
| Property | Description |
|---|---|
| Treatment of classes | Each class contributes equally, regardless of its size |
| Effect of rare classes | Rare classes have the same influence as common classes |
| Sensitivity | Sensitive to performance on minority classes |
| When to use | When all classes are equally important, even if their sizes differ |
Example: In a three-class problem with F1 scores of 0.95 (class A, 1000 samples), 0.80 (class B, 100 samples), and 0.40 (class C, 10 samples), the macro F1 is (0.95 + 0.80 + 0.40) / 3 = 0.717.
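That arithmetic can be checked directly; note that class sizes play no role in macro averaging (per-class F1 values taken from the example):

```python
# Per-class F1 scores from the example; supports (1000, 100, 10) are ignored.
per_class_f1 = {"A": 0.95, "B": 0.80, "C": 0.40}

macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(round(macro_f1, 3))  # 0.717
```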
Micro F1 aggregates the contributions of all classes by summing up the individual true positives, false positives, and false negatives across all classes, then computing a single F1 score:
Micro Precision = sum(TP_c) / sum(TP_c + FP_c)
Micro Recall = sum(TP_c) / sum(TP_c + FN_c)
Micro F1 = 2 * Micro Precision * Micro Recall / (Micro Precision + Micro Recall)
Properties:
| Property | Description |
|---|---|
| Treatment of classes | Larger classes dominate the score |
| Effect of rare classes | Rare classes have minimal influence |
| Equivalence | In multi-class single-label classification, micro F1 equals overall accuracy |
| When to use | When each individual prediction matters equally, regardless of class |
An important property of micro F1 in the multi-class single-label setting: because each sample belongs to exactly one class, every false positive for one class is a false negative for another class. This means micro precision equals micro recall, and both equal the overall accuracy. Consequently, micro F1 also equals accuracy in this setting.
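This counting argument can be verified without any library; the labels below are illustrative:

```python
y_true = ["a", "a", "a", "b", "b", "c", "c", "c", "c", "b"]
y_pred = ["a", "b", "a", "b", "c", "c", "c", "a", "c", "b"]

tp = sum(t == p for t, p in zip(y_true, y_pred))
# In single-label classification, every wrong prediction is simultaneously
# one FP (for the predicted class) and one FN (for the true class), so the
# summed FP and FN counts are both N - TP.
fp = fn = len(y_true) - tp

micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)
accuracy = tp / len(y_true)

print(micro_f1, accuracy)  # identical
```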
Weighted F1 computes the F1 score for each class and takes a weighted average, where each class's weight is proportional to the number of true instances (support) of that class:
Weighted F1 = sum from c=1 to C of (n_c / N) * F1_c
Where n_c is the number of true instances of class c and N is the total number of instances.
Properties:
| Property | Description |
|---|---|
| Treatment of classes | Each class is weighted by its frequency |
| Effect of rare classes | Rare classes have proportionally less influence |
| Sensitivity | Reflects performance on the most common classes |
| When to use | When you want a single number that accounts for class distribution |
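Using the same three illustrative classes as the macro example, the weighted average comes out much higher because the large, well-classified class dominates:

```python
# Per-class F1 scores and supports (number of true instances), illustrative values
f1_scores = [0.95, 0.80, 0.40]
supports = [1000, 100, 10]

total = sum(supports)
weighted_f1 = sum(f * n / total for f, n in zip(f1_scores, supports))
print(round(weighted_f1, 3))  # ~0.932, versus a macro F1 of 0.717
```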
| Method | Equal class weight? | Dominates by large classes? | Equals accuracy? | Best for |
|---|---|---|---|---|
| Macro | Yes | No | No | Imbalanced datasets where all classes matter equally |
| Micro | No | Yes | Yes (single-label) | When each prediction is equally important |
| Weighted | No | Yes | No | When a single summary metric accounting for class size is needed |
Accuracy is the simplest classification metric: the fraction of all predictions that are correct. While intuitive, accuracy can be misleading, particularly on imbalanced datasets.
Consider a binary classification problem where 95% of samples belong to the negative class and 5% belong to the positive class. A trivial classifier that always predicts "negative" achieves 95% accuracy, which sounds impressive but is completely useless for identifying positive cases.
The F1 score provides a more meaningful evaluation in this scenario:
| Metric | Trivial "always negative" classifier | Reasonable classifier |
|---|---|---|
| Accuracy | 0.950 | 0.930 |
| Precision | Undefined (0/0) or 0 | 0.400 |
| Recall | 0.000 | 0.800 |
| F1 Score | 0.000 | 0.533 |
The trivial classifier gets an F1 of 0 because it has zero recall, correctly reflecting its failure to detect any positive cases. The reasonable classifier has lower accuracy (93% vs. 95%) but a far higher F1 score (0.533 vs. 0), correctly indicating that it is the better model for the task. (The numbers are internally consistent: with recall 0.8 and precision 0.4 on the 5% positive class, the confusion matrix as fractions of the dataset is TP = 0.04, FN = 0.01, FP = 0.06, TN = 0.89, giving accuracy 0.93.)
| Scenario | Recommended Metric | Reason |
|---|---|---|
| Balanced classes | Accuracy | Simple and interpretable when classes are roughly equal |
| Imbalanced classes | F1 score | Accounts for both false positives and false negatives |
| Cost of FP approximately equals cost of FN | F1 score | Treats precision and recall equally |
| Cost of FP differs from cost of FN | F-beta score | Allows weighting precision vs. recall |
| Multi-class, all classes matter equally | Macro F1 | Gives equal weight to minority classes |
| Multi-class, each sample matters equally | Micro F1 (= accuracy) | Naturally weights by class frequency |
| Ranking or retrieval tasks | F1 at threshold, or area metrics | F1 at a specific threshold; consider AUC-ROC or average precision for threshold-independent evaluation |
The F1 score has a deep relationship with class imbalance, which is one of the most common challenges in applied machine learning.
The F1 score ignores true negatives (TN). This is a feature, not a bug. In imbalanced datasets, true negatives typically dominate the confusion matrix, and including them in the metric (as accuracy does) masks poor performance on the minority class.
Consider a fraud detection system evaluating 10,000 transactions, of which 100 are fraudulent:
| Prediction | TP | FP | FN | TN | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|---|
| Predict all "not fraud" | 0 | 0 | 100 | 9900 | 0.990 | 0/0 | 0.000 | 0.000 |
| Predict all "fraud" | 100 | 9900 | 0 | 0 | 0.010 | 0.010 | 1.000 | 0.020 |
| Good model | 80 | 20 | 20 | 9880 | 0.996 | 0.800 | 0.800 | 0.800 |
The F1 score correctly distinguishes between these three scenarios, whereas accuracy fails to separate the trivial "predict all not fraud" model (99.0% accuracy) from the good model (99.6% accuracy) in a meaningful way.
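The table can be reproduced with a small helper (counts taken directly from the table):

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return accuracy, precision, recall, f1

print(metrics(0, 0, 100, 9900))   # all "not fraud": accuracy 0.99, F1 = 0
print(metrics(100, 9900, 0, 0))   # all "fraud": recall 1.0, F1 ~ 0.02
print(metrics(80, 20, 20, 9880))  # good model: F1 = 0.8
```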
In most classifiers, precision and recall are inversely related: adjusting the classification threshold to increase one typically decreases the other. The F1 score represents one particular balance point on this tradeoff. The threshold that maximizes F1 is not necessarily the same as the threshold that maximizes accuracy or that matches a specific business requirement.
The precision-recall curve plots precision against recall at various thresholds, providing a complete picture of the tradeoff. The area under the precision-recall curve (average precision, or AP) provides a threshold-independent summary metric that is often more informative than the F1 score at any single threshold.
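A sketch of this threshold sweep, assuming scikit-learn is available; the labels and scores are illustrative:

```python
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
ap = average_precision_score(y_true, y_scores)

# F1 at each point on the curve; the threshold that maximizes F1 need not
# be the one that maximizes accuracy or meets a business constraint.
f1_at = [2 * p * r / (p + r) if p + r else 0.0
         for p, r in zip(precision, recall)]
print(max(f1_at), ap)
```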
Despite its widespread use, the F1 score has several known limitations:
| Limitation | Explanation |
|---|---|
| Ignores true negatives | The F1 score does not reward a model for correctly identifying negative cases, which may be important in some applications |
| Threshold-dependent | F1 is computed at a specific classification threshold; changing the threshold changes the F1 score |
| Not intuitive | The harmonic mean is less intuitive than the arithmetic mean; the F1 value does not have a simple probabilistic interpretation |
| Assumes equal cost of errors | The F1 score treats false positives and false negatives as equally bad, which is rarely true in practice |
| Sensitive to class definition | Which class is "positive" affects the score; swapping positive and negative classes produces a different F1 |
| Can be gamed | A model can be optimized to maximize F1 at a specific threshold without being well-calibrated or useful |
Powers (2011) and Chicco and Jurman (2020) have argued that the Matthews Correlation Coefficient (MCC) is a more reliable metric for binary classification because it uses all four cells of the confusion matrix and is invariant to which class is designated as positive. The MCC ranges from -1 to +1, where +1 indicates perfect prediction, 0 indicates no better than random, and -1 indicates complete disagreement.
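MCC is straightforward to compute from the four counts; this sketch reuses the fraud-detection counts from the earlier table:

```python
from math import sqrt

def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient; returns 0.0 when any marginal is empty."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(mcc(0, 0, 100, 9900))   # trivial "always negative": 0.0, no better than random
print(mcc(80, 20, 20, 9880))  # good model: ~0.8, strong on both classes
```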
The F1 score exists within a broader ecosystem of classification evaluation metrics:
| Metric | Formula | Relationship to F1 |
|---|---|---|
| Precision | TP / (TP + FP) | Component of F1 |
| Recall | TP / (TP + FN) | Component of F1 |
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Includes TN; can be misleading on imbalanced data |
| F-beta | (1 + beta^2) * P * R / (beta^2 * P + R) | Generalization of F1 |
| Matthews Correlation Coefficient | (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)) | Uses all four confusion matrix cells |
| AUC-ROC | Area under ROC curve | Threshold-independent; uses TN |
| Average Precision | Area under precision-recall curve | Threshold-independent; ignores TN like F1 |
| Cohen's Kappa | Agreement adjusted for chance | Accounts for class imbalance differently |
| Jaccard Index | TP / (TP + FP + FN) | Monotonically related to F1; also called Intersection over Union |
The F1 score and the Jaccard index (also called Intersection over Union, IoU) are monotonically related:
F1 = 2 * J / (1 + J)
J = F1 / (2 - F1)
Where J is the Jaccard index. This means that ranking models by F1 produces the same ordering as ranking by the Jaccard index.
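The conversion in both directions can be checked with illustrative counts:

```python
def jaccard_to_f1(j: float) -> float:
    return 2 * j / (1 + j)

def f1_to_jaccard(f1: float) -> float:
    return f1 / (2 - f1)

# Both identities hold for scores computed from the same confusion matrix.
tp, fp, fn = 80, 20, 20  # illustrative counts
j = tp / (tp + fp + fn)           # Jaccard index: 80/120
f1 = 2 * tp / (2 * tp + fp + fn)  # F1: 160/200
assert abs(jaccard_to_f1(j) - f1) < 1e-12
assert abs(f1_to_jaccard(f1) - j) < 1e-12
```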
The scikit-learn library provides functions for computing F1 scores in various configurations:
| Function | Purpose |
|---|---|
| sklearn.metrics.f1_score(y_true, y_pred, average='binary') | Binary classification F1 |
| sklearn.metrics.f1_score(y_true, y_pred, average='macro') | Macro-averaged F1 |
| sklearn.metrics.f1_score(y_true, y_pred, average='micro') | Micro-averaged F1 |
| sklearn.metrics.f1_score(y_true, y_pred, average='weighted') | Weighted F1 |
| sklearn.metrics.classification_report(y_true, y_pred) | Full report with per-class and averaged metrics |
| sklearn.metrics.fbeta_score(y_true, y_pred, beta=2) | F-beta score with custom beta |
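A minimal usage sketch with illustrative three-class labels (scikit-learn assumed installed):

```python
from sklearn.metrics import classification_report, f1_score

# Illustrative labels for a three-class problem
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 1]
y_pred = [0, 1, 0, 1, 2, 2, 2, 0, 2, 1]

print(f1_score(y_true, y_pred, average="macro"))
print(f1_score(y_true, y_pred, average="micro"))  # equals accuracy here
print(f1_score(y_true, y_pred, average="weighted"))
print(classification_report(y_true, y_pred))  # per-class and averaged metrics
```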
A common question in practice is which averaging method to use for multi-class problems: