Cost-sensitive learning
Last reviewed
Apr 30, 2026
Sources
23 citations
Review status
Source-backed
Revision
v1 ยท 4,100 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Apr 30, 2026
Sources
23 citations
Review status
Source-backed
Revision
v1 ยท 4,100 words
Add missing citations, update stale details, or suggest a clearer explanation.
Cost-sensitive learning is the family of machine learning methods that minimise the expected misclassification cost rather than the expected misclassification rate. The framework recognises that, in most real applications, different errors have very different real world consequences. Missing a fraudulent transaction is more expensive than briefly inconveniencing a legitimate cardholder; missing a cancer diagnosis is worse than calling a healthy patient back for a second look; routing a phishing email to the inbox is worse than dropping a marketing newsletter into the spam folder. Standard learners that minimise the 0-1 loss treat all of these mistakes as if they were equivalent. Cost-sensitive learning replaces that assumption with an explicit cost matrix and picks the prediction that minimises expected cost under the model's posterior probability.
The field developed since the late 1990s along three lines: reweighting and resampling at the data level, modified loss functions and split criteria at the algorithm level, and post-hoc thresholding at the output level. The classical formal treatment is Charles Elkan's IJCAI paper "The Foundations of Cost-Sensitive Learning" (2001), which set out the expected cost criterion and a rescaling theorem connecting cost-sensitive prediction to class rebalancing. Pedro Domingos's MetaCost (KDD 1999) introduced a black box wrapper that relabels training examples, and Bianca Zadrozny, John Langford and Naoki Abe (ICDM 2003) showed that cost-proportionate weighting and rejection sampling are also general purpose conversions. Surveys by Ling and Sheng (Encyclopedia of Machine Learning, 2008) and Krawczyk (Progress in Artificial Intelligence, 2016) cover later developments inside the broader imbalanced learning literature.
A standard supervised learning algorithm chooses the classification hypothesis that minimises the empirical 0-1 loss on the training set, treating every misclassification as one unit of penalty. That criterion is appropriate only when errors really are interchangeable. In medical screening, fraud control, credit decisioning, churn modelling and intrusion detection, the costs are wildly asymmetric. The opening paragraph of the SMOTE paper makes the point bluntly: when the minority class represents oil spills, malignant tumours, or fraudulent claims, even a 99 percent accurate classifier can be useless if it gets the rare class wrong (Chawla et al., 2002, JAIR 16, 321 to 357).
Class imbalance and cost-sensitive learning are related but not identical. Imbalance is a property of the data; asymmetric cost is a property of the decision. Elkan (2001) shows that the two are formally connected: rebalancing classes is equivalent to applying a particular cost ratio that depends on the inverse of the class priors. Cost-sensitive learning generalises class rebalancing by allowing arbitrary cost matrices, including matrices in which costs vary from one example to the next.
The central object in cost-sensitive learning is the cost matrix C. For a problem with K classes, C is a K by K table whose entry C(i, j) gives the cost incurred when an instance whose true class is j is predicted as class i. Elkan (2001) writes C(i, j) with i predicted and j actual, while many textbooks transpose the table; the convention does not matter as long as it is stated. Diagonal entries are usually zero and off-diagonal entries hold the misclassification costs.
In binary problems the matrix collapses to a 2 by 2 table. A canonical example for fraud detection might look like this.
| Predicted \ Actual | Legitimate (0) | Fraud (1) |
|---|---|---|
| Legitimate (0) | 0 | 100 |
| Fraud (1) | 1 | 0 |
Predicting legitimate when the transaction is actually fraud (a false negative) costs 100 units; flagging a legitimate transaction (a false positive) costs only 1 unit. The 100 to 1 ratio of these off-diagonal costs is what drives the cost-sensitive decision rule. Elkan (2001) also cautions that cost matrices must be "reasonable": a matrix is incoherent if one row dominates another, since then one prediction is never preferred no matter what the posterior probability is. Cost differences, not absolute values, determine the decision; adding the same constant to a column of the matrix does not change the optimal prediction.
Given a cost matrix C and a probabilistic classifier that estimates posterior class probabilities P(y | x), the Bayes optimal cost-sensitive prediction for an input x is the class y* that minimises the expected cost,
y*(x) = arg min over i of sum over j of P(y = j | x) C(i, j).
This is the criterion derived by Elkan (2001). With the 0-1 loss matrix, where C(i, j) is 1 for i not equal to j and 0 otherwise, the rule reduces to the familiar maximum a posteriori prediction. With an asymmetric matrix the decision boundary shifts toward the cheaper class.
In the binary case the rule simplifies further. Letting p = P(y = 1 | x), the optimal threshold turns out to be
p* = (C(1, 0) - C(0, 0)) / (C(1, 0) - C(0, 0) + C(0, 1) - C(1, 1)),
which, with the diagonal of C set to zero, collapses to p* = C(1, 0) / (C(1, 0) + C(0, 1)). For the fraud example above the cost-optimal threshold is 1 / (1 + 100) which is approximately 0.0099: a transaction should be flagged as fraud whenever the model is more than about one percent confident, far below the default 0.5 cutoff.
Elkan (2001) also derives a rescaling theorem that connects expected cost to class priors. If the original training set has class prior b for the positive class, and an unbiased classifier trained on it predicts probability p_0 at the boundary, then the same decision rule can be reproduced by training on a resampled set with positive prior b' and using a new threshold p' satisfying p' / (1 - p') = (p_0 / (1 - p_0)) (b' (1 - b)) / ((1 - b') b). This is the formal link between resampling for imbalance and thresholding for cost.
The cost-sensitive literature is usually organised by where the cost information enters the pipeline. Ling and Sheng (2008) and Krawczyk (2016) both use roughly this taxonomy: data level methods reshape the training distribution, algorithm level methods modify the learner's loss or split criterion, and output level methods threshold a calibrated posterior. The table below summarises the trade-offs; the subsections that follow give the details.
| Approach | Cost enters at | Representative methods | Main trade-off |
|---|---|---|---|
| Data level | Training data | Cost-proportionate weighting (Zadrozny, Langford and Abe, 2003), MetaCost (Domingos, 1999), random over- and undersampling, SMOTE (Chawla et al., 2002) | Works with any learner, but resampling can inflate variance |
| Algorithm level | Loss or split criterion | Cost-sensitive decision tree splitting, AdaCost (Fan et al., 1999), CSB1 and CSB2 (Ting, 2000), per-class C in SVMs, weighted cross-entropy loss, focal loss (Lin et al., 2017), scikit-learn class_weight | Best probability estimates, but needs per-algorithm implementation |
| Output level | Posterior threshold | Bayes minimum risk thresholding (Elkan, 2001) over calibrated probabilities | One model serves many cost regimes, but only works if probabilities are calibrated |
The simplest data-level approach is to sample examples in proportion to their misclassification cost. Zadrozny, Langford and Abe (2003) prove a folk theorem that, for any cost-insensitive learner that converges to the Bayes optimal classifier under its training distribution, sampling examples with probability proportional to their cost converts that learner into a cost-sensitive one. Because oversampling rare costly examples can blow up training time, they introduced costing, a method based on rejection sampling and ensemble aggregation that achieves the same effect with far less computation (ICDM 2003, pages 435 to 442).
MetaCost (Domingos, 1999) wraps any classifier in a relabelling procedure. It uses bagging to estimate posterior probabilities, applies the Bayesian minimum risk rule to assign new labels to the training examples, and then trains a single classifier on the relabelled data. The result is a cost-sensitive model that requires no modification of the base learner. Domingos's experiments on a large suite of UCI data sets showed that MetaCost almost always produced large cost reductions compared to a cost-blind C4.5RULES baseline and to two forms of stratification.
For the special case of imbalance, SMOTE (Synthetic Minority Over-sampling Technique) creates synthetic examples in the feature space by interpolating between minority class instances and their nearest neighbours rather than duplicating existing examples (Chawla, Bowyer, Hall and Kegelmeyer, JAIR 2002). It is one of the most cited algorithms in the imbalanced learning literature. Variants such as Borderline-SMOTE, SVM-SMOTE and ADASYN ship with the imbalanced-learn package, a scikit-learn-contrib project.
Classic algorithms can be made cost-sensitive by changing how they weight training examples or update parameters. In a cost-sensitive decision tree, splits are chosen to minimise expected cost rather than information gain, and leaves are labelled with the minimum-cost class. In a random forest, the same idea applies inside each tree, and per-class weights are passed via the class_weight argument in scikit-learn's RandomForestClassifier.
Cost-sensitive boosting modifies AdaBoost so that example weights at each round depend on misclassification costs. AdaCost (Fan, Stolfo, Zhang and Chan, ICML 1999) inflates the weights of costly misclassified examples and deflates the weights of costly correctly classified examples, through a cost adjustment function added to the AdaBoost weight update. The original paper proves a strict reduction of an upper bound on cumulative misclassification cost relative to AdaBoost. Kai Ming Ting later proposed CSB1 and CSB2 (ICML 2000), which use cost factors directly in the weight update rather than through the boosting coefficient alpha. A comparative study by Nikolaou, Edakunni, Kull, Flach and Brown (Machine Learning, 2016) found that AdaC2 and AdaMEC are the most theoretically justified variants and tend to perform best.
For support vector machines, the standard cost-sensitive trick is to use a different regularisation constant C+ and C- for the two classes; the scikit-learn LinearSVC and SVC classes expose this through class_weight, with "balanced" scaling the constants by inverse class frequencies. The same parameter works on logistic regression and most other linear classifiers.
In deep learning, the most common cost-sensitive devices are weighted cross-entropy and focal loss. Focal loss (Lin, Goyal, Girshick, He and Dollar, ICCV 2017, arXiv:1708.02002) reshapes the cross-entropy by a factor of (1 - p_t)^gamma so that easy examples contribute very little and hard examples dominate training. With gamma equal to zero it reduces to cross-entropy; with gamma equal to two (the value used in the original paper) an example with predicted probability 0.9 contributes about 100 times less to the gradient than one with predicted probability 0.5. Focal loss became standard in dense object detection because the foreground to background ratio in single-stage detectors can exceed 1 to 1000.
If the underlying classifier outputs accurate posterior probabilities, the cost-optimal decision rule is simply thresholding. Elkan (2001) called this the correct way to make optimal decisions because it cleanly separates probability estimation from cost-driven action: train a model on the original distribution, calibrate its probabilities if necessary, then set the operating point so that expected cost is minimised. The practical advantage is that the same trained model can serve many cost regimes; a bank can serve different transaction types by changing only the decision threshold, with no retraining. The drawback is that the rule depends critically on probability calibration. Random forests, naive Bayes and boosted trees often produce systematically distorted probabilities, and thresholding the raw scores will give suboptimal decisions even when the cost matrix is exactly right.
Thresholding works only if posterior probabilities are well calibrated, meaning that among examples assigned probability around p, roughly a fraction p really belong to the positive class. Two classical post-hoc methods sit on top of any base classifier.
Platt scaling, introduced by John Platt in 1999, fits a one-dimensional logistic regression to the classifier's scores using a held-out set. It was originally developed for support vector machines but works well for boosted trees and naive Bayes too, and it assumes a sigmoid calibration map. Isotonic regression, applied to classifier calibration by Bianca Zadrozny and Charles Elkan (KDD 2002), fits a non-parametric monotone step function via the pair-adjacent violators algorithm. It is more flexible than Platt scaling and corrects non-sigmoid distortions, but it overfits on small calibration sets. Both methods are available in scikit-learn through CalibratedClassifierCV. The standard empirical comparison is Niculescu-Mizil and Caruana (ICML 2005, "Predicting Good Probabilities with Supervised Learning"): Platt scaling helps SVMs and boosted trees the most, isotonic regression usually wins when there is enough calibration data, and naturally well-calibrated learners such as plain logistic regression and bagged trees benefit little from either.
Imbalanced data is the situation where one class is much rarer than another in the training set. It is closely related to but distinct from cost-sensitive learning. The connection, formalised by Elkan (2001), is that rebalancing the training set is equivalent to choosing a particular cost ratio. Specifically, if you train an unbiased classifier on a resampled set with positive class prior b' instead of b and keep the threshold at 0.5, the resulting decision rule matches training on the original set with cost ratio C(0, 1) / C(1, 0) equal to ((1 - b) b') / (b (1 - b')).
Three consequences follow. First, class rebalancing methods such as random oversampling and undersampling are a special case of cost-sensitive learning where the implicit cost ratio is the inverse of the class prior ratio. Second, if the true cost ratio is unknown but is suspected to scale with rarity, balancing classes is a sensible default; if the true ratio is known and differs from the prior ratio, use it directly rather than relying on rebalancing. Third, pure imbalance with symmetric costs is itself a real problem because most learners regularise toward the majority class even when no asymmetric cost is intended. Krawczyk (2016) argues that the cleanest way to think about the relationship is to treat both imbalance and asymmetric cost as instances of a single decision-theoretic question about how the loss surface should be shaped at deployment time.
Accuracy is a misleading metric in cost-sensitive settings. A trivial classifier that always predicts the majority class can achieve 99 percent accuracy on a 1 percent imbalanced data set while incurring the maximum possible cost. Standard cost-aware metrics include:
| Metric | Definition | When to use |
|---|---|---|
| Total cost | Sum of C(prediction, truth) over the test set | Cost matrix is known and stable |
| Cost curves | Expected cost as a function of operating cost ratio (Drummond and Holte, 2006) | Deployment cost ratio is uncertain |
| ROC curve and AUC | Threshold-independent ranking metric | Ranking matters more than calibration |
| Iso-cost lines on the ROC plane | Lines of constant expected cost (Provost and Fawcett, 2001) | Choosing among classifiers under a range of cost regimes |
| ROC convex hull (ROCCH) | Upper envelope of all classifiers | Same as above |
| Partial AUC | Area under the ROC restricted to a low false positive region | Only one part of the curve is operationally relevant |
| Precision, recall, F1 score | Standard binary metrics on the minority class | Quick summary when one class is the focus |
| F-beta | Weighted harmonic mean of precision and recall | Beta encodes the cost trade-off in a single number |
| Lift and gain charts | Cumulative response of the top-scoring fraction | Direct marketing and credit scoring |
Provost and Fawcett's iso-cost analysis is particularly elegant. Each combination of class priors and cost ratio corresponds to a slope on the ROC plane, and the optimal classifier is the one whose ROC curve touches the highest iso-performance line of that slope. The ROC convex hull collects every classifier that is optimal for some cost ratio, so the operating point can be chosen after the costs are known.
For a problem with K classes the cost matrix is K by K with K(K - 1) free off-diagonal entries. The Bayes minimum risk rule generalises directly: predict the class i that minimises the sum over j of P(y = j | x) C(i, j). MetaCost (Domingos, 1999) handles arbitrary K and arbitrary cost matrices by design, while AdaCost and CSB are usually defined for the binary case and need extension for multi-class problems.
Example-dependent cost-sensitive learning generalises further so that each instance has its own cost matrix. This is natural in fraud detection and credit scoring, where the cost of a false negative is the value of the transaction or the size of the loan and varies from one case to the next. Bahnsen, Aouada and Ottersten (Expert Systems with Applications 2015) developed example-dependent cost-sensitive decision trees and a Bayes minimum risk wrapper that take per-example cost matrices as input; their experiments on real credit card transactions and credit scoring portfolios showed substantial savings over class-dependent and cost-blind baselines. The CostSensitiveClassification library on GitHub implements these methods on top of scikit-learn.
Cost-sensitive learning is the rule rather than the exception in operational systems where errors translate directly into money or risk. Common application areas include:
Most mainstream machine learning libraries support cost-sensitive learning under the heading of class weighting.
scikit-learn exposes the class_weight argument on essentially every classifier, including logistic regression, random forest, gradient boosting, support vector machines and linear discriminant analysis. It accepts None for uniform costs, "balanced" for inverse class frequencies, or a dictionary mapping labels to weights. The compute_class_weight utility produces the balanced weights explicitly, and CalibratedClassifierCV applies Platt scaling or isotonic regression on top of any base estimator. The imbalanced-learn project (scikit-learn-contrib) adds SMOTE and its Borderline, SVM and ADASYN variants, plus RandomUnderSampler, NearMiss, Tomek links, edited nearest neighbour cleaning, and BalancedRandomForestClassifier and BalancedBaggingClassifier for ensemble rebalancing.
XGBoost and LightGBM offer scale_pos_weight for binary classification and sample_weight for per-example weights. These are the standard hooks for both class-dependent and example-dependent cost-sensitive boosting on tabular data. The CostSensitiveClassification library by Bahnsen on GitHub implements the example-dependent decision trees and Bayes minimum risk wrappers from his 2015 papers. In deep learning, focal loss appears as tensorflow.keras.losses.BinaryFocalCrossentropy and CategoricalFocalCrossentropy in Keras (from version 2.10) and as the standalone focal-loss package on PyPI; PyTorch users typically combine binary_cross_entropy_with_logits with a focusing factor by hand. mlr3 in R provides a cost-sensitive classification module with measure objects that take a cost matrix as input.
Textbook coverage appears in Witten, Frank and Hall's Data Mining: Practical Machine Learning Tools and Techniques (Morgan Kaufmann, 3rd ed., 2011) and in Charu Aggarwal's Data Classification: Algorithms and Applications (Chapman and Hall, 2014), which includes a dedicated chapter on cost-sensitive learning by Ling and Sheng.