# Cost-sensitive learning

> Source: https://aiwiki.ai/wiki/cost-sensitive_learning
> Updated: 2026-06-25
> Categories: Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Cost-sensitive learning** is a family of [machine learning](/wiki/machine_learning) methods that minimise the expected misclassification *cost* rather than the misclassification *rate*, by assigning different penalties to different kinds of error through a cost matrix instead of treating every mistake as equally bad [1]. Because a false negative in cancer screening, fraud detection, or intrusion detection is usually far more costly than a false positive, a cost-sensitive classifier predicts the class with the lowest expected cost under the model's posterior probabilities, which shifts the decision threshold away from the default 0.5 toward the rarer, costlier class [1]. The classical formulation is Charles Elkan's "The Foundations of Cost-Sensitive Learning" (IJCAI 2001), in which "an example should be predicted to have the class that leads to the lowest expected cost" [1]; in practice it is implemented with a cost matrix, cost-sensitive thresholding, MetaCost-style relabelling [2], or class weighting (for example scikit-learn's class_weight) [22].

The framework recognises that, in most real applications, different errors have very different real-world consequences. Missing a fraudulent transaction is more expensive than briefly inconveniencing a legitimate cardholder; missing a cancer diagnosis is worse than calling a healthy patient back for a second look; routing a phishing email to the inbox is worse than dropping a marketing newsletter into the spam folder. Standard learners that minimise the 0-1 loss treat all of these mistakes as if they were equivalent. Cost-sensitive learning replaces that assumption with an explicit cost matrix and picks the prediction that minimises expected cost under the model's posterior probability [1].

The field has developed since the late 1990s along three lines: reweighting and resampling at the data level, modified loss functions and split criteria at the algorithm level, and post-hoc thresholding at the output level. The classical formal treatment is Charles Elkan's IJCAI paper "The Foundations of Cost-Sensitive Learning" (2001), which set out the expected cost criterion and a rescaling theorem connecting cost-sensitive prediction to class rebalancing [1]. Pedro Domingos's MetaCost (KDD 1999) introduced a black-box wrapper that relabels training examples [2], and Bianca Zadrozny, John Langford and Naoki Abe (ICDM 2003) showed that cost-proportionate weighting and rejection sampling are also general-purpose conversions [3]. Surveys by Ling and Sheng (Encyclopedia of Machine Learning, 2008) [4] and Krawczyk (Progress in Artificial Intelligence, 2016) [5] cover later developments inside the broader imbalanced learning literature.

## What is cost-sensitive learning and why does it matter?

A standard [supervised learning](/wiki/supervised_learning) algorithm chooses the [classification](/wiki/classification) hypothesis that minimises the empirical 0-1 loss on the training set, treating every misclassification as one unit of penalty. That criterion is appropriate only when errors really are interchangeable. In medical screening, fraud control, credit decisioning, churn modelling and intrusion detection, the costs are wildly asymmetric. The opening paragraph of the SMOTE paper makes the point bluntly: when the [minority class](/wiki/minority_class) represents oil spills, malignant tumours, or fraudulent claims, even a 99 percent accurate classifier can be useless if it gets the rare class wrong (Chawla et al., 2002, JAIR 16, 321 to 357) [6].

[Class imbalance](/wiki/class_imbalance) and cost-sensitive learning are related but not identical. Imbalance is a property of the data; asymmetric cost is a property of the decision. Elkan (2001) shows that the two are formally connected: rebalancing classes is equivalent to applying a particular cost ratio that depends on the inverse of the class priors [1]. Cost-sensitive learning generalises class rebalancing by allowing arbitrary cost matrices, including matrices in which costs vary from one example to the next.

## What is a cost matrix?

The central object in cost-sensitive learning is the **cost matrix** C. For a problem with K classes, C is a K by K table whose entry C(i, j) gives the cost incurred when an instance whose true class is j is predicted as class i. Elkan (2001) defines it as "the cost of predicting class [i] when the true class is [j]" and notes the modern convention that "cost matrix rows correspond to alternative predicted classes, while columns correspond to actual classes, i.e. row/column = predicted/actual" [1]. Many textbooks transpose the table; the convention does not matter as long as it is stated. Diagonal entries are usually zero and off-diagonal entries hold the misclassification costs. The cost matrix is the cost-weighted analogue of the [confusion matrix](/wiki/confusion_matrix), which simply counts the four outcome types.

In binary problems the matrix collapses to a 2 by 2 table. A canonical example for [fraud detection](/wiki/fraud_detection) might look like this.

| Predicted \\ Actual | Legitimate (0) | Fraud (1) |
| --- | --- | --- |
| Legitimate (0) | 0 | 100 |
| Fraud (1) | 1 | 0 |

Predicting legitimate when the transaction is actually fraud (a false negative) costs 100 units; flagging a legitimate transaction (a false positive) costs only 1 unit. The 100 to 1 ratio of these off-diagonal costs is what drives the cost-sensitive decision rule. Elkan (2001) cautions that cost matrices must be "reasonable": as he puts it, "the cost of labeling an example incorrectly should always be greater than the cost of labeling it correctly", and a matrix is incoherent if one row dominates another, since then one prediction is never preferred no matter what the posterior probability is [1]. Cost differences, not absolute values, determine the decision: scaling every entry by a positive constant or adding the same constant to a column of the matrix does not change the optimal prediction [1].
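The fraud matrix and Elkan's two sanity conditions can be written down directly. The following is a minimal sketch, assuming the predicted/actual row/column convention used above; the helper names `row_dominates` and `is_reasonable` are hypothetical and chosen only for illustration.

```python
# Cost matrix for the fraud example, indexed as COST[predicted][actual].
# Class 0 = legitimate, class 1 = fraud.
COST = [
    [0, 100],  # predict legitimate: correct costs 0, missed fraud costs 100
    [1, 0],    # predict fraud: false alarm costs 1, correct costs 0
]


def row_dominates(cost, m, n):
    """True if predicting m is never cheaper than predicting n, whatever the actual class."""
    return all(cm >= cn for cm, cn in zip(cost[m], cost[n]))


def is_reasonable(cost):
    """Hypothetical helper: check the two sanity conditions described above."""
    k = len(cost)
    # Every wrong label must cost more than the right label for the same actual class.
    errors_cost_more = all(
        cost[i][j] > cost[j][j] for i in range(k) for j in range(k) if i != j
    )
    # No row may dominate another, otherwise one prediction is never worth making.
    no_domination = not any(
        row_dominates(cost, m, n) for m in range(k) for n in range(k) if m != n
    )
    return errors_cost_more and no_domination


print(is_reasonable(COST))              # True
print(is_reasonable([[0, 1], [5, 2]]))  # False: row 1 is never cheaper than row 0
```

In the second matrix, predicting class 1 costs at least as much as predicting class 0 for both actual classes, so class 1 would never be chosen regardless of the posterior.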

## How does cost-sensitive learning make the optimal decision?

Given a cost matrix C and a probabilistic classifier that estimates posterior class probabilities P(y | x), the **Bayes optimal cost-sensitive prediction** for an input x is the class y* that minimises the expected cost,

y*(x) = arg min over i of sum over j of P(y = j | x) C(i, j).

This is the criterion derived by Elkan (2001) [1]. With the 0-1 loss matrix, where C(i, j) is 1 for i not equal to j and 0 otherwise, the rule reduces to the familiar maximum a posteriori prediction. With an asymmetric matrix the decision boundary shifts so that the class that is costlier to miss is predicted even at a lower posterior probability.
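The rule above can be sketched in a few lines of plain Python. The function names are illustrative, and the posterior values are made up for the example rather than taken from any model.

```python
def expected_costs(posterior, cost):
    """Expected cost of each possible prediction i: sum over j of P(y=j|x) * cost[i][j]."""
    return [sum(p * c for p, c in zip(posterior, row)) for row in cost]


def min_cost_prediction(posterior, cost):
    """Return the class index with the lowest expected cost."""
    costs = expected_costs(posterior, cost)
    return min(range(len(costs)), key=costs.__getitem__)


FRAUD_COST = [[0, 100], [1, 0]]  # cost[predicted][actual], as in the fraud table
ZERO_ONE = [[0, 1], [1, 0]]      # 0-1 loss: every error costs one unit

posterior = [0.95, 0.05]         # assumed P(legitimate | x), P(fraud | x)

print(min_cost_prediction(posterior, ZERO_ONE))    # 0, the most probable class
print(min_cost_prediction(posterior, FRAUD_COST))  # 1, because a missed fraud is expensive
```

With the fraud matrix, predicting legitimate has expected cost of about 5 while predicting fraud has expected cost 0.95, so the cheaper action is to flag the transaction even though fraud is only 5 percent likely.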

In the binary case the rule simplifies further. Letting p = P(y = 1 | x), the optimal threshold turns out to be

p* = (C(1, 0) - C(0, 0)) / (C(1, 0) - C(0, 0) + C(0, 1) - C(1, 1)),

which, with the diagonal of C set to zero, collapses to p* = C(1, 0) / (C(1, 0) + C(0, 1)) [1]. Elkan notes that this formula shows "any 2x2 cost matrix has essentially only one degree of freedom from a decision-making perspective" [1]. For the fraud example above the cost-optimal threshold is 1 / (1 + 100) which is approximately 0.0099: a transaction should be flagged as fraud whenever the model is more than about one percent confident, far below the default 0.5 cutoff.
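As a quick check of the threshold formula, the sketch below computes p* for the fraud matrix and confirms that scaling the matrix or shifting a column leaves it unchanged. The function name `cost_threshold` is an illustrative choice.

```python
def cost_threshold(cost):
    """Optimal probability threshold for predicting class 1, with cost[predicted][actual]."""
    fp_excess = cost[1][0] - cost[0][0]  # extra cost of a false positive over a true negative
    fn_excess = cost[0][1] - cost[1][1]  # extra cost of a false negative over a true positive
    return fp_excess / (fp_excess + fn_excess)


FRAUD_COST = [[0, 100], [1, 0]]

print(round(cost_threshold(FRAUD_COST), 4))        # 0.0099
print(cost_threshold([[0, 1], [1, 0]]))            # 0.5 for the 0-1 loss

# Multiplying every entry by 7, or adding 5 to the "actual legitimate" column,
# changes the numbers in the matrix but not the decision threshold.
scaled = [[0, 700], [7, 0]]
shifted = [[5, 100], [6, 0]]
print(round(cost_threshold(scaled), 4), round(cost_threshold(shifted), 4))
```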

Elkan (2001) also derives a rescaling theorem that connects expected cost to class priors [1]. If the original training set has class prior b for the positive class, and an unbiased classifier trained on it predicts probability p_0 at the boundary, then the same decision rule can be reproduced by training on a resampled set with positive prior b' and using a new threshold p' satisfying p' / (1 - p') = (p_0 / (1 - p_0)) (b' (1 - b)) / ((1 - b') b). This is the formal link between resampling for imbalance and thresholding for cost.
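The odds form of the rescaling relation is easy to apply numerically. The sketch below is one way to code it, assuming probabilities strictly between 0 and 1; `shift_prior` is a hypothetical helper name.

```python
def shift_prior(p, b, b_new):
    """Map a positive-class probability p, valid under prior b, to its value under prior b_new.

    Implements p'/(1-p') = (p/(1-p)) * (b_new*(1-b)) / ((1-b_new)*b).
    """
    odds = p / (1.0 - p)
    new_odds = odds * (b_new * (1.0 - b)) / ((1.0 - b_new) * b)
    return new_odds / (1.0 + new_odds)


# Fraud example: threshold 1/101 on data with 1 percent positives.
# On a balanced resample (prior 0.5) the equivalent threshold is close to 0.5.
print(round(shift_prior(1 / 101, b=0.01, b_new=0.5), 4))  # 0.4975

# The same function run in reverse corrects a probability estimated on the
# balanced set back to the original 1 percent prior.
print(round(shift_prior(0.5, b=0.5, b_new=0.01), 4))      # 0.01
```

In this particular example the 100 to 1 cost ratio and the 99 to 1 class ratio nearly cancel, which is why balanced training with the default cutoff lands close to the cost-optimal decision.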

## What are the three approaches to cost-sensitive learning?

The cost-sensitive literature is usually organised by where the cost information enters the pipeline. Ling and Sheng (2008) [4] and Krawczyk (2016) [5] both use roughly this taxonomy: data level methods reshape the training distribution, algorithm level methods modify the learner's loss or split criterion, and output level methods threshold a calibrated posterior. The table below summarises the trade-offs; the subsections that follow give the details.

| Approach | Cost enters at | Representative methods | Main trade-off |
| --- | --- | --- | --- |
| Data level | Training data | Cost-proportionate weighting (Zadrozny, Langford and Abe, 2003) [3], MetaCost (Domingos, 1999) [2], random over- and undersampling, SMOTE (Chawla et al., 2002) [6] | Works with any learner, but resampling can inflate variance |
| Algorithm level | Loss or split criterion | Cost-sensitive [decision tree](/wiki/decision_tree) splitting, AdaCost (Fan et al., 1999) [7], CSB1 and CSB2 (Ting, 2000) [8], per-class C in SVMs, weighted [cross-entropy loss](/wiki/cross_entropy_loss), [focal loss](/wiki/focal_loss) (Lin et al., 2017) [9], scikit-learn class_weight [22] | Best probability estimates, but needs per-algorithm implementation |
| Output level | Posterior threshold | Bayes minimum risk thresholding (Elkan, 2001) [1] over calibrated probabilities | One model serves many cost regimes, but only works if probabilities are calibrated |

### How do data level methods work?

The simplest data-level approach is to sample examples in proportion to their misclassification cost. Zadrozny, Langford and Abe (2003) prove a folk theorem that, for any cost-insensitive learner that converges to the Bayes optimal classifier under its training distribution, sampling examples with probability proportional to their cost converts that learner into a cost-sensitive one [3]. Because oversampling rare costly examples can blow up training time, they introduced **costing**, a method based on rejection sampling and ensemble aggregation that achieves the same effect with far less computation (ICDM 2003, pages 435 to 442) [3].
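The rejection sampling step behind costing can be sketched as follows. This is an illustration of the idea rather than the authors' code: each example is kept with probability equal to its cost divided by a constant at least as large as the maximum cost, and several such samples are drawn so that one model can be trained on each and their predictions averaged. The function names and the default of ten rounds are assumptions.

```python
import random


def rejection_sample(examples, costs, rng, z=None):
    """Keep each example independently with probability cost / z, where z >= max cost."""
    z = max(costs) if z is None else z
    return [ex for ex, c in zip(examples, costs) if rng.random() < c / z]


def costing_samples(examples, costs, n_rounds=10, seed=0):
    """Draw several rejection samples; a base learner would be trained on each one."""
    rng = random.Random(seed)
    return [rejection_sample(examples, costs, rng) for _ in range(n_rounds)]


# Toy data: fraud examples carry cost 100, legitimate ones cost 1.
examples = ["fraud_a", "legit_b", "legit_c", "fraud_d"]
costs = [100, 1, 1, 100]
for sample in costing_samples(examples, costs, n_rounds=3):
    print(sample)
```

Every maximum-cost example is kept in every sample, while a cost-1 example survives only about one time in a hundred, so each sample is much smaller than an oversampled training set.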

**MetaCost** (Domingos, 1999) wraps any classifier in a relabelling procedure [2]. It uses bagging to estimate posterior probabilities, applies the Bayesian minimum risk rule to assign new labels to the training examples, and then trains a single classifier on the relabelled data. The result is a cost-sensitive model that, in Domingos's words, treats "the underlying classifier as a black box, requiring no knowledge of its functioning or change to it" [2]. Domingos's experiments on a large suite of UCI data sets showed that MetaCost almost always produced large cost reductions compared to a cost-blind C4.5RULES baseline and to two forms of stratification [2].

For the special case of imbalance, **SMOTE** (Synthetic Minority Over-sampling Technique) creates synthetic examples in the feature space by interpolating between minority class instances and their nearest neighbours rather than duplicating existing examples (Chawla, Bowyer, Hall and Kegelmeyer, JAIR 2002) [6]. It is one of the most cited algorithms in the imbalanced learning literature. Variants such as Borderline-SMOTE, SVM-SMOTE and ADASYN ship with the imbalanced-learn package, a scikit-learn-contrib project [21].

### How do algorithm level methods work?

Classic algorithms can be made cost-sensitive by changing how they weight training examples or update parameters. In a cost-sensitive [decision tree](/wiki/decision_tree), splits are chosen to minimise expected cost rather than information gain, and leaves are labelled with the minimum-cost class. In a [random forest](/wiki/random_forest), the same idea applies inside each tree, and per-class weights are passed via the class_weight argument in scikit-learn's RandomForestClassifier [22].

Cost-sensitive boosting modifies AdaBoost so that example weights at each round depend on misclassification costs. AdaCost (Fan, Stolfo, Zhang and Chan, ICML 1999) increases the weights of costly misclassified examples more aggressively and decreases the weights of costly correctly classified examples more conservatively, through a cost adjustment function added to the AdaBoost weight update [7]. The original paper shows that AdaCost reduces an upper bound on the cumulative misclassification cost of the training set [7]. Kai Ming Ting later proposed CSB1 and CSB2 (ICML 2000), which use cost factors directly in the weight update rather than through the boosting coefficient alpha [8]. A comparative study by Nikolaou, Edakunni, Kull, Flach and Brown (Machine Learning, 2016) found that AdaC2 and AdaMEC are the most theoretically justified variants and tend to perform best [18].

For support vector machines, the standard cost-sensitive trick is to use a different regularisation constant C+ and C- for the two classes; the scikit-learn LinearSVC and SVC classes expose this through class_weight, with "balanced" scaling the constants by inverse class frequencies [22]. The same parameter works on [logistic regression](/wiki/logistic_regression) and most other linear classifiers.

In deep learning, the most common cost-sensitive devices are weighted cross-entropy and focal loss. Focal loss (Lin, Goyal, Girshick, He and Dollar, ICCV 2017, arXiv:1708.02002) reshapes the cross-entropy by a factor of (1 - p_t)^gamma, where p_t is the probability the model assigns to the true class, so that easy examples contribute very little and hard examples dominate training [9]. With gamma equal to zero it reduces to cross-entropy; with gamma equal to two (the value used in the original paper) an example with p_t = 0.9 has its loss scaled by (1 - 0.9)^2 = 0.01, a 100-fold reduction relative to plain cross-entropy [9]. Focal loss became standard in dense object detection because the foreground to background ratio in single-stage detectors can exceed 1 to 1000 [9].
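The effect of the modulating factor is easy to verify numerically. The sketch below implements only the basic per-example form of focal loss in plain Python; the class-balancing weight often used alongside it is left out for brevity, and p_t denotes the probability assigned to the true class.

```python
import math


def cross_entropy(p_t):
    """Cross-entropy for one example, where p_t is the probability given to the true class."""
    return -math.log(p_t)


def focal_loss(p_t, gamma=2.0):
    """Focal loss: cross-entropy scaled by the modulating factor (1 - p_t) ** gamma."""
    return (1.0 - p_t) ** gamma * cross_entropy(p_t)


for p_t in (0.5, 0.9, 0.99):
    ratio = focal_loss(p_t, gamma=2.0) / cross_entropy(p_t)
    print(p_t, round(ratio, 4))  # 0.25, 0.01 and 0.0001 respectively

# With gamma = 0 the modulating factor is 1 and plain cross-entropy is recovered.
print(focal_loss(0.9, gamma=0.0) == cross_entropy(0.9))  # True
```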

### How do output level methods work?

If the underlying classifier outputs accurate posterior probabilities, the cost-optimal decision rule is simply thresholding. Elkan (2001) called this the correct way to make optimal decisions because it cleanly separates probability estimation from cost-driven action: train a model on the original distribution, calibrate its probabilities if necessary, then set the operating point so that expected cost is minimised [1]. The practical advantage is that the same trained model can serve many cost regimes; a bank can serve different transaction types by changing only the decision threshold, with no retraining. The drawback is that the rule depends critically on probability calibration. Random forests, naive Bayes and boosted trees often produce systematically distorted probabilities, and thresholding the raw scores will give suboptimal decisions even when the cost matrix is exactly right [15].

## Why does calibration matter for thresholding?

Thresholding works only if posterior probabilities are well calibrated, meaning that among examples assigned probability around p, roughly a fraction p really belong to the positive class. Two classical post-hoc methods sit on top of any base classifier.

**Platt scaling**, introduced by John Platt in 1999, fits a one-dimensional logistic regression to the classifier's scores using a held-out set [10]. It was originally developed for support vector machines but works well for boosted trees and naive Bayes too, and it assumes a sigmoid calibration map. **Isotonic regression**, applied to classifier calibration by Bianca Zadrozny and Charles Elkan (KDD 2002), fits a non-parametric monotone step function via the pair-adjacent violators algorithm [11]. It is more flexible than Platt scaling and corrects non-sigmoid distortions, but it overfits on small calibration sets. Both methods are available in scikit-learn through CalibratedClassifierCV [22]. The standard empirical comparison is Niculescu-Mizil and Caruana (ICML 2005, "Predicting Good Probabilities with Supervised Learning"): Platt scaling helps SVMs and boosted trees the most, isotonic regression usually wins when there is enough calibration data, and naturally well-calibrated learners such as plain logistic regression and bagged trees benefit little from either [15].
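To show what the pair-adjacent violators step does, here is a minimal pure-Python sketch of isotonic calibration on a held-out set. It is a simplified illustration (tied scores are not pooled and no interpolation between fitted points is provided), not the scikit-learn implementation; the function names are hypothetical.

```python
def pav(values):
    """Pool adjacent violators: the non-decreasing sequence closest to `values` in squared error."""
    blocks = []  # each block is [sum of values, count]
    for v in values:
        blocks.append([float(v), 1])
        # Merge backwards while the previous block's mean exceeds the current block's mean.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


def isotonic_calibration(scores, labels):
    """Sort held-out examples by score, then fit a monotone map from score to P(y=1)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    fitted = pav([labels[i] for i in order])
    return [(scores[i], f) for i, f in zip(order, fitted)]


# Four held-out examples: raw classifier scores and their true labels.
print(isotonic_calibration([0.9, 0.1, 0.4, 0.6], [1, 0, 1, 0]))
# [(0.1, 0.0), (0.4, 0.5), (0.6, 0.5), (0.9, 1.0)]
```

The two middle examples violate monotonicity (the lower score has the positive label), so they are pooled into one step at 0.5. With so few points the fitted steps are coarse, which illustrates why isotonic regression needs a reasonably large calibration set.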

## How does cost-sensitive learning handle class imbalance?

[Imbalanced data](/wiki/imbalanced_data) is the situation where one class is much rarer than another in the training set. It is closely related to but distinct from cost-sensitive learning. The connection, formalised by Elkan (2001), is that **rebalancing the training set is equivalent to choosing a particular cost ratio** [1]. Specifically, if you train an unbiased classifier on a resampled set with positive class prior b' instead of b and keep the threshold at 0.5, the resulting decision rule matches training on the original set with cost ratio C(0, 1) / C(1, 0) equal to ((1 - b) b') / (b (1 - b')) [1].
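The equivalence can be turned into two small helpers: one that reports the cost ratio implied by a given rebalancing, and one that inverts it to find the resampled prior matching a known cost ratio. Both names are illustrative, and the formulas assume an unbiased classifier and a 0.5 threshold as stated above.

```python
def implied_cost_ratio(b, b_new):
    """Cost ratio C(0,1)/C(1,0) implied by resampling prior b to b_new with a 0.5 threshold."""
    return ((1.0 - b) * b_new) / (b * (1.0 - b_new))


def prior_for_cost_ratio(b, ratio):
    """Resampled positive prior that reproduces a target cost ratio C(0,1)/C(1,0)."""
    return ratio * b / (ratio * b + (1.0 - b))


# Fully balancing a data set with 1 percent positives implies a 99 to 1 cost ratio.
print(round(implied_cost_ratio(0.01, 0.5), 6))      # 99.0

# To encode a 10 to 1 cost ratio instead, resample to roughly 9 percent positives.
print(round(prior_for_cost_ratio(0.01, 10.0), 4))   # 0.0917
```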

Three consequences follow. First, class rebalancing methods such as random [oversampling](/wiki/oversampling) and [undersampling](/wiki/undersampling) are a special case of cost-sensitive learning where the implicit cost ratio is the inverse of the class prior ratio. Second, if the true cost ratio is unknown but is suspected to scale with rarity, balancing classes is a sensible default; if the true ratio is known and differs from the prior ratio, use it directly rather than relying on rebalancing. Third, pure imbalance with symmetric costs is itself a real problem because most learners are biased toward the majority class even when no asymmetric cost is intended. Krawczyk (2016) argues that the cleanest way to think about the relationship is to treat both imbalance and asymmetric cost as instances of a single decision-theoretic question about how the loss surface should be shaped at deployment time [5].

## How is cost-sensitive performance evaluated?

Accuracy is a misleading metric in cost-sensitive settings. A trivial classifier that always predicts the majority class achieves 99 percent accuracy on a data set with 1 percent positives while missing every positive example and paying the full false negative cost on each one. Standard cost-aware metrics include:

| Metric | Definition | When to use |
| --- | --- | --- |
| Total cost | Sum of C(prediction, truth) over the test set | Cost matrix is known and stable |
| Cost curves | Normalised expected cost as a function of the operating condition, which combines class priors and misclassification costs (Drummond and Holte, 2006) [14] | Deployment cost ratio is uncertain |
| ROC curve and AUC | Threshold-independent ranking metric [13] | Ranking matters more than calibration |
| Iso-cost lines on the ROC plane | Lines of constant expected cost (Provost and Fawcett, 2001) [12] | Choosing among classifiers under a range of cost regimes |
| ROC convex hull (ROCCH) | Upper envelope of all classifiers [12] | Same as above |
| Partial AUC | Area under the ROC restricted to a low false positive region | Only one part of the curve is operationally relevant |
| Precision, [recall](/wiki/recall), [F1 score](/wiki/f1_score) | Standard binary metrics on the minority class | Quick summary when one class is the focus |
| F-beta | Weighted harmonic mean of precision and recall | Beta encodes the cost trade-off in a single number |
| Lift and gain charts | Cumulative response of the top-scoring fraction | Direct marketing and credit scoring |

Provost and Fawcett's iso-cost analysis is particularly elegant. Each combination of class priors and cost ratio corresponds to a slope on the ROC plane, and the optimal classifier is the one whose ROC curve touches the highest iso-performance line of that slope [12]. The ROC convex hull collects every classifier that is optimal for some cost ratio, so the operating point can be chosen after the costs are known [12].
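Two of the quantities above can be computed in a few lines. The sketch below contrasts accuracy with total cost on a synthetic test set of 1000 transactions with 1 percent fraud, and computes the slope of the iso-performance lines as (negative prior times false positive cost) divided by (positive prior times false negative cost). The data and function names are made up for illustration.

```python
def total_cost(y_true, y_pred, cost):
    """Sum cost[predicted][actual] over a test set."""
    return sum(cost[p][t] for t, p in zip(y_true, y_pred))


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def iso_performance_slope(pos_prior, fp_cost, fn_cost):
    """Slope of equal-expected-cost lines in ROC space (false positive rate on the x axis)."""
    return ((1.0 - pos_prior) * fp_cost) / (pos_prior * fn_cost)


FRAUD_COST = [[0, 100], [1, 0]]            # cost[predicted][actual]
y_true = [1] * 10 + [0] * 990              # 10 frauds, 990 legitimate

always_legit = [0] * 1000                  # the do-nothing baseline
jumpy = [1] * 10 + [1] * 50 + [0] * 940    # catches every fraud, raises 50 false alarms

print(accuracy(y_true, always_legit), total_cost(y_true, always_legit, FRAUD_COST))  # 0.99 1000
print(accuracy(y_true, jumpy), total_cost(y_true, jumpy, FRAUD_COST))                # 0.95 50
print(round(iso_performance_slope(0.01, fp_cost=1, fn_cost=100), 2))                 # 0.99
```

The less accurate classifier is twenty times cheaper, which is the sense in which accuracy misleads when costs are asymmetric.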

## What about multi-class and example-dependent costs?

For a problem with K classes the cost matrix is K by K with K(K - 1) free off-diagonal entries. The Bayes minimum risk rule generalises directly: predict the class i that minimises the sum over j of P(y = j | x) C(i, j) [1]. MetaCost (Domingos, 1999) handles arbitrary K and arbitrary cost matrices by design [2], while AdaCost and CSB are usually defined for the binary case and need extension for multi-class problems.

**Example-dependent cost-sensitive learning** generalises further so that each instance has its own cost matrix. This is natural in fraud detection and credit scoring, where the cost of a false negative is the value of the transaction or the size of the loan and varies from one case to the next. Bahnsen, Aouada and Ottersten (Expert Systems with Applications 2015) developed example-dependent cost-sensitive decision trees and a Bayes minimum risk wrapper that take per-example cost matrices as input; their experiments on real credit card transactions and credit scoring portfolios showed substantial savings over class-dependent and cost-blind baselines [16]. The CostSensitiveClassification library on GitHub implements these methods on top of scikit-learn [16].
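A per-transaction version of the minimum risk rule might look like the sketch below. The cost structure is a deliberately simplified assumption (a missed fraud costs the transaction amount, a false alarm costs a fixed review fee, correct decisions cost nothing) and is not the exact matrix used in the cited papers; `should_flag` and its parameters are hypothetical names.

```python
def should_flag(p_fraud, amount, review_cost):
    """Flag a transaction when letting it through has higher expected cost than reviewing it.

    Assumed per-example costs: a missed fraud costs `amount`, a false alarm
    costs `review_cost`, and correct decisions cost nothing.
    """
    cost_if_passed = p_fraud * amount
    cost_if_flagged = (1.0 - p_fraud) * review_cost
    return cost_if_passed > cost_if_flagged


def per_example_threshold(amount, review_cost):
    """Equivalent probability threshold for this transaction under the same assumptions."""
    return review_cost / (review_cost + amount)


# Same model confidence, very different decisions depending on the amount at stake.
print(should_flag(0.02, amount=10, review_cost=5))      # False
print(should_flag(0.02, amount=10000, review_cost=5))   # True
print(round(per_example_threshold(10000, 5), 4))        # 0.0005
```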

## Where is cost-sensitive learning used?

Cost-sensitive learning is the rule rather than the exception in operational systems where errors translate directly into money or risk. Common application areas include:

- [Fraud detection](/wiki/fraud_detection) in payments and insurance: the cost of a missed fraud is the disputed amount, while a false alarm costs review time and customer friction. The papers by Bahnsen and colleagues (2014, 2015) are the standard references for credit card fraud [16].
- Medical diagnosis and screening: missing a malignant tumour in mammography or a sepsis case in intensive care is far more dangerous than calling a benign case back. The cost-sensitive treatment of imbalance is surveyed by Krawczyk (2016) [5].
- Churn prediction: the cost of missing a leaving customer equals the lifetime value lost; the cost of intervening on a happy one is the marketing spend.
- Credit scoring: approving a defaulter loses the loan principal, denying a creditworthy applicant loses the foregone interest. These costs scale with loan size, making the problem naturally example-dependent [16].
- Network intrusion detection and spam filtering, where false positives and false negatives have very different operational and reputational consequences.
- Object detection in computer vision, where the foreground to background ratio is severe and focal-loss style cost shaping is now standard practice (Lin et al., 2017) [9].

## What tools support cost-sensitive learning?

Most mainstream machine learning libraries support cost-sensitive learning under the heading of class weighting.

scikit-learn exposes the class_weight argument on many classifiers, including logistic regression, decision trees, random forest and support vector machines [22]. Not every estimator has it: GradientBoostingClassifier ([gradient boosting](/wiki/gradient_boosting)) and LinearDiscriminantAnalysis do not take class_weight, although GradientBoostingClassifier accepts per-example sample_weight in fit. Where it is available, class_weight accepts None for uniform costs, "balanced" for inverse class frequencies, or a dictionary mapping labels to weights. The "balanced" mode sets each class weight to n_samples / (n_classes * np.bincount(y)), so in a binary problem a class that makes up one tenth of the data receives nine times the weight of the class that makes up the remainder (5.0 versus about 0.56) [22]. The compute_class_weight utility produces the balanced weights explicitly, and CalibratedClassifierCV applies Platt scaling or isotonic regression on top of any base estimator [22]. The imbalanced-learn project (scikit-learn-contrib) adds SMOTE and its Borderline, SVM and ADASYN variants, plus RandomUnderSampler, NearMiss, Tomek links, edited nearest neighbour cleaning, and BalancedRandomForestClassifier and BalancedBaggingClassifier for ensemble rebalancing [21].
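The "balanced" formula can be reproduced without any library to see what weights it yields. This is a standalone sketch of the arithmetic, not a call into scikit-learn; `balanced_class_weights` is a hypothetical helper.

```python
from collections import Counter


def balanced_class_weights(y):
    """Reproduce the 'balanced' heuristic: n_samples / (n_classes * count of each class)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


y = [0] * 900 + [1] * 100          # 10 percent positives
weights = balanced_class_weights(y)

print(weights[1])                          # 5.0
print(round(weights[0], 4))                # 0.5556
print(round(weights[1] / weights[0], 2))   # 9.0
```

A dictionary of this shape is what would be passed as class_weight when the balanced default is not the right cost ratio and explicit weights are preferred.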

XGBoost and LightGBM offer scale_pos_weight for binary classification and sample_weight for per-example weights. These are the standard hooks for both class-dependent and example-dependent cost-sensitive boosting on tabular data. The CostSensitiveClassification library by Bahnsen on GitHub implements the example-dependent decision trees and Bayes minimum risk wrappers from his 2015 papers [16]. In deep learning, focal loss appears as tensorflow.keras.losses.BinaryFocalCrossentropy and CategoricalFocalCrossentropy in Keras (from version 2.10) and as the standalone focal-loss package on PyPI; PyTorch users typically combine binary_cross_entropy_with_logits with a focusing factor by hand [9]. mlr3 in R provides a cost-sensitive classification module with measure objects that take a cost matrix as input.

## What are the common pitfalls?

1. Tuning hyperparameters under 0-1 loss and then deploying under cost. If grid search optimises accuracy or unweighted ROC AUC, the chosen hyperparameters will not minimise cost. Tune with the metric you will deploy with.
2. Thresholding uncalibrated probabilities. Random forests and boosted trees often produce confident but miscalibrated probabilities [15]. Calibrate first or use a naturally calibrated learner such as plain logistic regression.
3. Resampling without correcting for prior shift. If you train on a rebalanced set and deploy on the original distribution, the predicted probabilities are wrong even when the rankings are right. Elkan (2001) gives the rescaling formula [1]; Pozzolo, Caelen, Johnson and Bontempi (2015) give a practical recipe for credit card fraud [17].
4. Inflated variance from heavy reweighting. Cost-proportionate weighting that loads most of the weight on a few examples drives up variance. Costing's rejection sampling approach and ensembling are the standard mitigations [3].
5. Confusing cost ratio with class ratio. They are equal only by coincidence. Use the operational ratio when it is known; otherwise the inverse of the prior is a reasonable default [1].
6. Treating SMOTE as a panacea. SMOTE creates synthetic examples by linear interpolation in feature space, which can produce points outside the true support, especially in high dimensions or with categorical features [6].
7. Ignoring example-dependent costs when they exist. Treating a 10 dollar transaction and a 10000 dollar transaction as having the same fraud cost throws away most of the available signal [16].
8. Reporting accuracy on imbalanced cost-sensitive problems. A 99 percent accurate classifier on a 1 percent positive class is often the do-nothing baseline [6].
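
The first pitfall can be avoided by selecting the operating point with the deployment cost itself. The sketch below sweeps candidate thresholds on a small, made-up validation set and keeps the one with the lowest total cost; it assumes the probabilities are already calibrated, and the function names are illustrative.

```python
def cost_at_threshold(probs, y_true, cost, threshold):
    """Total cost when class 1 is predicted for every probability >= threshold."""
    return sum(cost[int(p >= threshold)][t] for p, t in zip(probs, y_true))


def best_threshold(probs, y_true, cost):
    """Pick the candidate threshold with the lowest total cost on validation data."""
    candidates = sorted(set(probs)) + [float("inf")]  # inf means "never predict class 1"
    return min(candidates, key=lambda t: cost_at_threshold(probs, y_true, cost, t))


FRAUD_COST = [[0, 100], [1, 0]]            # cost[predicted][actual]
probs = [0.02, 0.03, 0.10, 0.40, 0.80]     # assumed validation-set fraud probabilities
y_true = [0, 0, 1, 0, 1]

t = best_threshold(probs, y_true, FRAUD_COST)
print(t, cost_at_threshold(probs, y_true, FRAUD_COST, t))   # 0.1 1
print(cost_at_threshold(probs, y_true, FRAUD_COST, 0.5))    # 100, the default cutoff misses a fraud
```

On real data the sweep would be run on a held-out set large enough that the chosen threshold is not driven by a handful of positives.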

## Textbook treatments

Textbook coverage appears in Witten, Frank and Hall's Data Mining: Practical Machine Learning Tools and Techniques (Morgan Kaufmann, 3rd ed., 2011) [19] and in Charu Aggarwal's Data Classification: Algorithms and Applications (Chapman and Hall, 2014), which includes a dedicated chapter on cost-sensitive learning by Ling and Sheng [20].

## ELI5

Imagine a smoke alarm. If it stays quiet during a real fire, the result is a disaster; if it beeps when you only burned some toast, the result is a minor annoyance. The two mistakes are not equally bad, so a good alarm is set to be jumpy: it would rather give you a few false alarms than miss one real fire. Cost-sensitive learning teaches a computer to act like that smoke alarm. You write down how bad each kind of mistake is in a little grid (the cost matrix), and the computer then leans toward the answer that keeps the total badness as low as possible, instead of just trying to be right most of the time.

## See also

- [Class imbalance](/wiki/class_imbalance)
- [Imbalanced data](/wiki/imbalanced_data)
- [Confusion matrix](/wiki/confusion_matrix)
- [Classification](/wiki/classification)
- [Focal loss](/wiki/focal_loss)
- [Oversampling](/wiki/oversampling)
- [Undersampling](/wiki/undersampling)

## References

1. Charles Elkan. "The Foundations of Cost-Sensitive Learning". IJCAI, 2001, pages 973 to 978. https://cseweb.ucsd.edu/~elkan/rescale.pdf
2. Pedro Domingos. "MetaCost: A General Method for Making Classifiers Cost-Sensitive". KDD, 1999, pages 155 to 164. https://homes.cs.washington.edu/~pedrod/papers/kdd99.pdf
3. Bianca Zadrozny, John Langford and Naoki Abe. "Cost-Sensitive Learning by Cost-Proportionate Example Weighting". ICDM, 2003, pages 435 to 442.
4. Charles X. Ling and Victor S. Sheng. "Cost-Sensitive Learning and the Class Imbalance Problem". In Sammut and Webb (eds.), Encyclopedia of Machine Learning, Springer, 2008, pages 231 to 235. https://www.csd.uwo.ca/~xling/papers/cost_sensitive.pdf
5. Bartosz Krawczyk. "Learning from Imbalanced Data: Open Challenges and Future Directions". Progress in Artificial Intelligence, vol 5, no 4, 2016, pages 221 to 232. https://link.springer.com/article/10.1007/s13748-016-0094-0
6. Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall and W. Philip Kegelmeyer. "SMOTE: Synthetic Minority Over-sampling Technique". JAIR, vol 16, 2002, pages 321 to 357. https://www.jair.org/index.php/jair/article/view/10302
7. Wei Fan, Salvatore J. Stolfo, Junxin Zhang and Philip K. Chan. "AdaCost: Misclassification Cost-Sensitive Boosting". ICML, 1999, pages 97 to 105.
8. Kai Ming Ting. "A Comparative Study of Cost-Sensitive Boosting Algorithms". ICML, 2000, pages 983 to 990.
9. Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He and Piotr Dollar. "Focal Loss for Dense Object Detection". ICCV, 2017. arXiv:1708.02002. https://arxiv.org/abs/1708.02002
10. John C. Platt. "Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods". In Advances in Large Margin Classifiers, MIT Press, 1999.
11. Bianca Zadrozny and Charles Elkan. "Transforming Classifier Scores into Accurate Multiclass Probability Estimates". KDD, 2002, pages 694 to 699.
12. Foster Provost and Tom Fawcett. "Robust Classification for Imprecise Environments". Machine Learning, vol 42, no 3, 2001, pages 203 to 231.
13. Tom Fawcett. "An Introduction to ROC Analysis". Pattern Recognition Letters, vol 27, 2006, pages 861 to 874.
14. Chris Drummond and Robert C. Holte. "Cost Curves: An Improved Method for Visualizing Classifier Performance". Machine Learning, vol 65, 2006, pages 95 to 130.
15. Alexandru Niculescu-Mizil and Rich Caruana. "Predicting Good Probabilities with Supervised Learning". ICML, 2005, pages 625 to 632.
16. Alejandro Correa Bahnsen, Djamila Aouada and Bjorn Ottersten. "Example-Dependent Cost-Sensitive Decision Trees". Expert Systems with Applications, vol 42, no 19, 2015, pages 6609 to 6619.
17. Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. "Calibrating Probability with Undersampling for Unbalanced Classification". IEEE SSCI, 2015, pages 159 to 166.
18. Nikolaos Nikolaou, Narayanan Edakunni, Meelis Kull, Peter Flach and Gavin Brown. "Cost-Sensitive Boosting Algorithms: Do We Really Need Them?". Machine Learning, vol 104, 2016, pages 359 to 384.
19. Ian H. Witten, Eibe Frank and Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques, 3rd edition, Morgan Kaufmann, 2011.
20. Charu C. Aggarwal (ed.). Data Classification: Algorithms and Applications. Chapman and Hall / CRC, 2014.
21. Guillaume Lemaitre, Fernando Nogueira and Christos K. Aridas. "Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning". JMLR, vol 18, no 17, 2017, pages 1 to 5. https://imbalanced-learn.org/
22. scikit-learn developers. "1.16. Probability calibration", "sklearn.utils.class_weight.compute_class_weight" and "class_weight". scikit-learn documentation. https://scikit-learn.org/stable/modules/calibration.html
23. Wikipedia contributors. "Cost-sensitive machine learning". https://en.wikipedia.org/wiki/Cost-sensitive_machine_learning