AdaBoost
Last reviewed
Apr 30, 2026
Sources
18 citations
Review status
Source-backed
Revision
v1 · 3,831 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Apr 30, 2026
Sources
18 citations
Review status
Source-backed
Revision
v1 · 3,831 words
Add missing citations, update stale details, or suggest a clearer explanation.
AdaBoost, short for Adaptive Boosting, is a machine learning ensemble algorithm that combines many weak classifiers into a single strong classifier through a weighted majority vote. It was introduced by Yoav Freund and Robert E. Schapire in a 1995 conference paper and a 1997 journal article, and it became the first practical boosting algorithm to gain widespread use. The algorithm is called "adaptive" because, after each round of training, it adjusts the weights of the training examples so that subsequent weak learners focus more strongly on the examples the previous learners got wrong. Because of this, AdaBoost requires no prior knowledge about how accurate the weak learners will be; it adapts to whatever accuracies the base algorithm happens to deliver, provided each weak learner does at least slightly better than random guessing.
AdaBoost is one of the most influential algorithms in statistical machine learning. Freund and Schapire received the 2003 Gödel Prize, awarded jointly by the European Association for Theoretical Computer Science and the ACM Special Interest Group on Algorithms and Computational Theory, for the work that introduced the algorithm. They received the ACM Paris Kanellakis Theory and Practice Award in 2004 for the same line of research. AdaBoost was also the engine behind the Viola-Jones face detector (2001), the first real-time face detection system to ship in commodity software such as OpenCV.
AdaBoost builds a classification function as a weighted sum of simple base hypotheses, called weak learners or weak classifiers. A weak learner is any predictor that does slightly better than random guessing on the training distribution. The most common choice is a depth-one decision tree, also called a decision stump, which thresholds a single feature.
The algorithm trains these weak learners one at a time. After each round, it computes which training examples the latest weak learner classified incorrectly, and increases the weight on those examples. The next weak learner is then trained on the reweighted distribution, so it pays more attention to the hard cases. After T rounds, the final classifier is a weighted vote of all T weak learners, with more accurate weak learners receiving higher voting weights. Schapire has noted that AdaBoost can be implemented in roughly ten lines of code, which is part of why it became such a popular pedagogical example for ensemble learning.
The theoretical question that led to AdaBoost was first posed by Michael Kearns and Leslie Valiant in 1988 and 1989, in the framework of probably approximately correct (PAC) learning. Kearns and Valiant asked whether a "weakly" learnable concept class, one for which any algorithm can do slightly better than random, must also be "strongly" learnable, meaning learnable to arbitrarily small error. Robert Schapire answered yes in his 1990 Machine Learning paper "The strength of weak learnability," and Yoav Freund gave a more efficient construction in his 1995 paper "Boosting a weak learning algorithm by majority." Both early constructions required prior knowledge of the weak learner's accuracy, which made them awkward to use in practice.
Freund and Schapire's 1995 paper, "A decision-theoretic generalization of on-line learning and an application to boosting," presented at the second European Conference on Computational Learning Theory (EuroCOLT), removed that limitation. The full version was published in 1997 in the Journal of Computer and System Sciences, volume 55, issue 1, pages 119 to 139. The algorithm they described in that paper is what is now called AdaBoost. The name reflects that the algorithm adapts to whatever errors the weak learner happens to produce on each round.
In 2003 the European Association for Theoretical Computer Science and ACM SIGACT awarded Freund and Schapire the Gödel Prize for this paper. It was the first machine learning paper to win the Gödel Prize, which is given for outstanding work in theoretical computer science.
Discrete AdaBoost, the original version, solves a binary classification problem. Inputs are training examples (x_1, y_1), ..., (x_N, y_N) with labels y_i in {-1, +1}, a weak learning algorithm, and a number of rounds T.
A few features of the update rule are worth pulling out. The coefficient alpha_t is positive when eps_t is below 1/2 and grows without bound as eps_t approaches zero, so very accurate weak learners get very large votes. The factor exp(-alpha_t * y_i * h_t(x_i)) increases the weight of an example by a factor of sqrt((1 - eps_t) / eps_t) when h_t makes a mistake on it, and decreases it by the inverse factor when h_t classifies it correctly. After renormalization, the weighted error of h_t on the new distribution is exactly 1/2, which means the next weak learner cannot simply repeat h_t.
Freund and Schapire proved a sharp bound on the training error of the final classifier. If the weighted error on round t is eps_t and gamma_t = 1/2 - eps_t is the edge over random guessing, then the training error of H is at most the product over t of 2 * sqrt(eps_t * (1 - eps_t)), which is in turn at most exp(-2 * sum_t gamma_t^2). In words, as long as every weak learner has a positive edge over random guessing, the training error decays exponentially in the number of rounds. This was the first PAC-style boosting result that did not require the algorithm to know the edge gamma in advance.
Researchers have introduced many variants of AdaBoost to handle different prediction tasks, base learners, or robustness concerns. The table below summarizes the main ones.
| Variant | Reference | Output type | Key difference from discrete AdaBoost |
|---|---|---|---|
| Discrete AdaBoost | Freund and Schapire 1995, 1997 | Binary, {-1, +1} weak hypotheses | Original algorithm. |
| AdaBoost.M1 | Freund and Schapire 1996, 1997 | Multi-class | Direct multi-class extension. Requires each weak learner to beat 1/2 on the weighted distribution, which is hard with many classes. |
| AdaBoost.M2 | Freund and Schapire 1996, 1997 | Multi-class | Reduces multi-class to a set of binary problems over (example, wrong label) pairs and uses a pseudo-loss. |
| AdaBoost.MH | Schapire and Singer 1999 | Multi-label / multi-class | Reduces multi-label classification to a set of binary classification problems, one per class. |
| AdaBoost.MR | Schapire and Singer 1999 | Multi-class ranking | Optimizes a ranking loss over (correct, incorrect) label pairs. |
| Real AdaBoost | Schapire and Singer 1999 | Binary, real-valued confidences | Weak learners output real numbers rather than {-1, +1}. The algorithm uses confidence-rated predictions and chooses each h_t to minimize a normalization factor Z_t. |
| AdaBoost.R, R2, R.MH | Drucker 1997; Schapire and Singer 1999 | Regression | Variants for real-valued targets. AdaBoost.R2 by Drucker is the basis of scikit-learn's AdaBoostRegressor. |
| LogitBoost | Friedman, Hastie and Tibshirani 2000 | Binary | Fits an additive logistic regression model by Newton-style stagewise minimization of the log-loss. |
| Gentle AdaBoost | Friedman, Hastie and Tibshirani 2000 | Binary | Like Real AdaBoost but uses bounded Newton steps, which is more numerically stable when class probabilities are near 0 or 1. |
| SAMME | Zhu, Zou, Rosset and Hastie 2009 | Multi-class | Multi-class generalization that only requires weak learners to beat 1/K accuracy, where K is the number of classes. Adds the term ln(K-1) to alpha_t. |
| SAMME.R | Zhu, Zou, Rosset and Hastie 2009 | Multi-class | Real-valued version of SAMME using class probability estimates from the weak learners. Often converges faster than SAMME on the same number of rounds. |
Robert Schapire and Yoram Singer's 1999 paper "Improved boosting algorithms using confidence-rated predictions," published in Machine Learning volume 37, generalized the framework so that weak hypotheses output real numbers whose sign is the predicted class and whose magnitude is a confidence. Their analysis introduced the normalization factor Z_t = sum_i w_i * exp(-alpha_t * y_i * h_t(x_i)) and showed that the training error of the combined classifier is upper bounded by the product of Z_t over all rounds. Choosing each weak hypothesis to minimize Z_t gives a clean, unified design criterion. The paper also gave concrete recipes for confidence-rated decision trees, domain-partitioning weak learners, and the AdaBoost.MH and AdaBoost.MR multi-label and ranking algorithms.
Jerome Friedman, Trevor Hastie and Robert Tibshirani's 2000 paper "Additive logistic regression: a statistical view of boosting," published in The Annals of Statistics volume 28 number 2, reframed AdaBoost as a stagewise greedy fitting procedure for an additive model, with the loss function being the exponential loss exp(-y * F(x)). They observed that the exponential loss is, to second order, equivalent to the binomial log-likelihood, and that AdaBoost can be viewed as approximating Newton steps on this loss. From that observation they derived two new algorithms. LogitBoost works directly on the binomial log-likelihood, and Gentle AdaBoost replaces the closed-form alpha_t with bounded least-squares Newton steps. In the empirical comparisons in that paper, LogitBoost, Real AdaBoost and Gentle AdaBoost all outperformed Discrete AdaBoost on stumps.
Ji Zhu, Hui Zou, Saharon Rosset and Trevor Hastie's 2009 paper "Multi-class AdaBoost," published in Statistics and Its Interface volume 2, gave a multi-class generalization that does not need to be reduced to a sequence of binary problems. The discrete version, SAMME (Stagewise Additive Modeling using a Multi-class Exponential loss function), uses the same weight update as AdaBoost but adds the term ln(K - 1) to alpha_t, which lets the algorithm work whenever each weak learner has accuracy better than 1/K rather than the much harder 1/2 threshold required by AdaBoost.M1. SAMME.R is the real-valued companion that uses class probability estimates from the weak learners. Until version 1.6, scikit-learn's AdaBoostClassifier used SAMME.R as the default; SAMME is now the default.
In experiments with AdaBoost, Leo Breiman, Schapire and others noticed something puzzling: the test error of the combined classifier often kept decreasing even after the training error had hit zero, and after thousands of rounds. Naive Occam's razor arguments suggested this should not happen, since each new weak learner adds capacity to the model.
Robert Schapire, Yoav Freund, Peter Bartlett and Wee Sun Lee gave an explanation in their 1998 Annals of Statistics paper "Boosting the margin: A new explanation for the effectiveness of voting methods." They defined the margin of a training example with respect to a voting classifier as the difference between the vote weight given to the correct label and the vote weight given to the most popular incorrect label, normalized by the total vote weight. They proved a generalization bound that depends on the margin distribution, the number of training examples and the VC dimension of the base hypothesis class, but not on the number of rounds T. They then showed empirically that AdaBoost continues to improve the margin distribution long after the training error has dropped to zero, which is consistent with continued improvement in test error.
Leo Breiman challenged parts of this explanation in 1999 by constructing arc-gv, an algorithm that achieves a larger minimum margin than AdaBoost but generalizes worse. Lev Reyzin and Schapire revisited the question in 2006 and argued that the full margin distribution, not just the minimum margin, is what matters, and that arc-gv achieves a larger minimum margin only by using more complex weak learners. The margin theory and its critiques are surveyed in Schapire and Freund's 2012 textbook Boosting: Foundations and Algorithms (MIT Press).
The Friedman, Hastie and Tibshirani 2000 paper paved the way for a much broader view of boosting. In Jerome Friedman's 2001 paper "Greedy function approximation: A gradient boosting machine," published in The Annals of Statistics volume 29 number 5, he generalized stagewise additive modeling by treating each new weak learner as a steepest-descent step in function space with respect to an arbitrary differentiable loss function. AdaBoost is then exactly the special case of gradient boosting with the exponential loss and a binary classification target. Other choices of loss give other algorithms: squared error gives least-squares boosting for regression, the Huber loss gives a robust regressor, the multinomial deviance gives a multi-class classifier, and so on. This gradient-in-function-space view is the conceptual foundation of modern gradient-boosted decision tree libraries such as XGBoost, LightGBM and CatBoost.
The most famous early application of AdaBoost was Paul Viola and Michael Jones's face detector, presented at the IEEE Conference on Computer Vision and Pattern Recognition in 2001 in the paper "Rapid object detection using a boosted cascade of simple features." The detector had three main ingredients. First, it represented images using a so-called integral image, which makes the sum of pixel values over any axis-aligned rectangle computable in constant time. Second, it computed Haar-like rectangle features over candidate windows; in a 24 by 24 pixel window, there are about 162,000 such features. Third, and most relevant here, it used a modified AdaBoost both to select a small subset of these features and to weight them. Each weak learner was a decision stump on a single Haar feature, and at each round AdaBoost both picked the best feature and assigned it a vote weight.
The Viola-Jones detector then arranged a sequence of AdaBoost-trained classifiers in a cascade, where each stage rejected a large fraction of background windows and only forwarded the survivors to the next, more expensive stage. The result was the first face detector that ran in real time on commodity hardware, achieving roughly 15 frames per second on a 2001-era desktop CPU. It shipped with OpenCV in the form of the haarcascade_frontalface_*.xml files and ran in many digital cameras and consumer products through the 2000s.
AdaBoost saw heavy use in the late 1990s and early 2000s as a strong baseline for text classification, tabular supervised learning, and certain bioinformatics tasks. Several pre-deep-learning systems for handwritten character recognition, speech and music classification, and information retrieval ranking used AdaBoost or close variants. The 2012 textbook by Schapire and Freund collects many of these case studies.
AdaBoost has been popular for nearly thirty years for several practical reasons.
It also has well-known weaknesses.
| Method | Base learner | Loss / criterion | Aggregation | Parallelism | Typical strengths |
|---|---|---|---|---|---|
| AdaBoost (discrete) | Decision stump or shallow tree | Exponential loss | Weighted vote, sequential | Sequential within a model; trivially parallel for ensembles of ensembles | Simple, theoretically clean, strong on small to medium tabular data |
| Random forest (Breiman 2001) | Deep decision tree | Gini or information gain at each node | Equal-weight vote or average over independently trained trees | Embarrassingly parallel | Robust to noise, low variance, good default for tabular data |
| Gradient boosting (Friedman 2001) | Regression tree | Any differentiable loss (log-loss, squared error, Huber, etc.) | Weighted sum, sequential | Sequential rounds; parallel over features per round | Flexible loss, strong accuracy, good for ranking and regression |
| XGBoost (Chen and Guestrin 2016) | Regression tree with second-order split finding | Differentiable loss with explicit L1/L2 regularization | Weighted sum, sequential | Parallel split finding, distributed training | Fast, regularized, dominant on tabular Kaggle competitions for years |
| LightGBM (Ke et al. 2017) | Histogram-based regression tree, leaf-wise growth | Differentiable loss with L1/L2 regularization | Weighted sum, sequential | Histogram bucketing, distributed training | Very fast on large datasets, strong on high-cardinality categorical features |
AdaBoost is rarely the best choice for a new tabular ML problem in 2026. The XGBoost, LightGBM and CatBoost gradient-boosted decision tree libraries are typically more accurate, more robust to noise, faster to train and easier to regularize. AdaBoost remains useful as a small, fast, theoretically transparent baseline, and it is still widely taught because the algorithm itself is short and the analysis is illuminating.
AdaBoost is shipped in nearly every general-purpose machine learning library. The most commonly used reference implementation is in scikit-learn, which provides sklearn.ensemble.AdaBoostClassifier and sklearn.ensemble.AdaBoostRegressor. By default the classifier uses decision stumps as base estimators, 50 rounds (n_estimators=50) and the SAMME algorithm; the regressor uses decision trees of depth 3 and Drucker's AdaBoost.R2. Other libraries with AdaBoost implementations include Weka, R's ada and gbm packages, MATLAB's fitcensemble, and (historically) OpenCV's CvBoost class for the cascaded face detection use case.
A short list of the most cited theoretical results about AdaBoost: