AdaBoost

AdaBoost, short for Adaptive Boosting, is a machine learning ensemble algorithm that combines many weak classifiers into a single strong classifier through a weighted majority vote. It was introduced by Yoav Freund and Robert E. Schapire in a 1995 conference paper and a 1997 journal article, and it became the first practical boosting algorithm to gain widespread use. The algorithm is called "adaptive" because, after each round of training, it adjusts the weights of the training examples so that subsequent weak learners focus more strongly on the examples the previous learners got wrong. Because of this, AdaBoost requires no prior knowledge about how accurate the weak learners will be; it adapts to whatever accuracies the base algorithm happens to deliver, provided each weak learner does at least slightly better than random guessing.

AdaBoost is one of the most influential algorithms in statistical machine learning. Freund and Schapire received the 2003 Gödel Prize, awarded jointly by the European Association for Theoretical Computer Science and the ACM Special Interest Group on Algorithms and Computational Theory, for the work that introduced the algorithm. They received the ACM Paris Kanellakis Theory and Practice Award in 2004 for the same line of research. AdaBoost was also the engine behind the Viola-Jones face detector (2001), the first real-time face detection system to ship in commodity software such as OpenCV.

Definition and intuition

AdaBoost builds a classification function as a weighted sum of simple base hypotheses, called weak learners or weak classifiers. A weak learner is any predictor that does slightly better than random guessing on the training distribution. The most common choice is a depth-one decision tree, also called a decision stump, which thresholds a single feature.

The algorithm trains these weak learners one at a time. After each round, it computes which training examples the latest weak learner classified incorrectly, and increases the weight on those examples. The next weak learner is then trained on the reweighted distribution, so it pays more attention to the hard cases. After T rounds, the final classifier is a weighted vote of all T weak learners, with more accurate weak learners receiving higher voting weights. Schapire has noted that AdaBoost can be implemented in roughly ten lines of code, which is part of why it became such a popular pedagogical example for ensemble learning.

History

The theoretical question that led to AdaBoost was first posed by Michael Kearns and Leslie Valiant in 1988 and 1989, in the framework of probably approximately correct (PAC) learning. Kearns and Valiant asked whether a "weakly" learnable concept class, one for which any algorithm can do slightly better than random, must also be "strongly" learnable, meaning learnable to arbitrarily small error. Robert Schapire answered yes in his 1990 Machine Learning paper "The strength of weak learnability," and Yoav Freund gave a more efficient construction in his 1995 paper "Boosting a weak learning algorithm by majority." Both early constructions required prior knowledge of the weak learner's accuracy, which made them awkward to use in practice.

Freund and Schapire's 1995 paper, "A decision-theoretic generalization of on-line learning and an application to boosting," presented at the second European Conference on Computational Learning Theory (EuroCOLT), removed that limitation. The full version was published in 1997 in the Journal of Computer and System Sciences, volume 55, issue 1, pages 119 to 139. The algorithm they described in that paper is what is now called AdaBoost. The name reflects that the algorithm adapts to whatever errors the weak learner happens to produce on each round.

In 2003 the European Association for Theoretical Computer Science and ACM SIGACT awarded Freund and Schapire the Gödel Prize for this paper. It was the first machine learning paper to win the Gödel Prize, which is given for outstanding work in theoretical computer science.

The discrete AdaBoost algorithm

Discrete AdaBoost, the original version, solves a binary classification problem. Inputs are training examples (x_1, y_1), ..., (x_N, y_N) with labels y_i in {-1, +1}, a weak learning algorithm, and a number of rounds T.

Initialize weights w_i = 1/N for i = 1, ..., N.
For t = 1 to T:
- Train the weak learner on the data weighted by w to obtain a hypothesis h_t : X -> {-1, +1}.
- Compute the weighted training error eps_t = sum_i w_i * 1[h_t(x_i) != y_i].
- If eps_t >= 1/2, stop the loop (or invert h_t and continue, since a weak learner that does worse than chance can be flipped).
- Set the weak learner's coefficient: alpha_t = (1/2) * ln((1 - eps_t) / eps_t).
- Update the example weights: w_i <- w_i * exp(-alpha_t * y_i * h_t(x_i)) and renormalize so that the new weights sum to 1.
Output the final classifier H(x) = sign( sum_t alpha_t * h_t(x) ).

A few features of the update rule are worth pulling out. The coefficient alpha_t is positive when eps_t is below 1/2 and grows without bound as eps_t approaches zero, so very accurate weak learners get very large votes. The factor exp(-alpha_t * y_i * h_t(x_i)) increases the weight of an example by a factor of sqrt((1 - eps_t) / eps_t) when h_t makes a mistake on it, and decreases it by the inverse factor when h_t classifies it correctly. After renormalization, the weighted error of h_t on the new distribution is exactly 1/2, which means the next weak learner cannot simply repeat h_t.

Freund and Schapire proved a sharp bound on the training error of the final classifier. If the weighted error on round t is eps_t and gamma_t = 1/2 - eps_t is the edge over random guessing, then the training error of H is at most the product over t of 2 * sqrt(eps_t * (1 - eps_t)), which is in turn at most exp(-2 * sum_t gamma_t^2). In words, as long as every weak learner has a positive edge over random guessing, the training error decays exponentially in the number of rounds. This was the first PAC-style boosting result that did not require the algorithm to know the edge gamma in advance.

Variants

Researchers have introduced many variants of AdaBoost to handle different prediction tasks, base learners, or robustness concerns. The table below summarizes the main ones.

Variant	Reference	Output type	Key difference from discrete AdaBoost
Discrete AdaBoost	Freund and Schapire 1995, 1997	Binary, {-1, +1} weak hypotheses	Original algorithm.
AdaBoost.M1	Freund and Schapire 1996, 1997	Multi-class	Direct multi-class extension. Requires each weak learner to beat 1/2 on the weighted distribution, which is hard with many classes.
AdaBoost.M2	Freund and Schapire 1996, 1997	Multi-class	Reduces multi-class to a set of binary problems over (example, wrong label) pairs and uses a pseudo-loss.
AdaBoost.MH	Schapire and Singer 1999	Multi-label / multi-class	Reduces multi-label classification to a set of binary classification problems, one per class.
AdaBoost.MR	Schapire and Singer 1999	Multi-class ranking	Optimizes a ranking loss over (correct, incorrect) label pairs.
Real AdaBoost	Schapire and Singer 1999	Binary, real-valued confidences	Weak learners output real numbers rather than {-1, +1}. The algorithm uses confidence-rated predictions and chooses each h_t to minimize a normalization factor Z_t.
AdaBoost.R, R2, R.MH	Drucker 1997; Schapire and Singer 1999	Regression	Variants for real-valued targets. AdaBoost.R2 by Drucker is the basis of scikit-learn's `AdaBoostRegressor`.
LogitBoost	Friedman, Hastie and Tibshirani 2000	Binary	Fits an additive logistic regression model by Newton-style stagewise minimization of the log-loss.
Gentle AdaBoost	Friedman, Hastie and Tibshirani 2000	Binary	Like Real AdaBoost but uses bounded Newton steps, which is more numerically stable when class probabilities are near 0 or 1.
SAMME	Zhu, Zou, Rosset and Hastie 2009	Multi-class	Multi-class generalization that only requires weak learners to beat 1/K accuracy, where K is the number of classes. Adds the term ln(K-1) to alpha_t.
SAMME.R	Zhu, Zou, Rosset and Hastie 2009	Multi-class	Real-valued version of SAMME using class probability estimates from the weak learners. Often converges faster than SAMME on the same number of rounds.

Real AdaBoost and confidence-rated predictions

Robert Schapire and Yoram Singer's 1999 paper "Improved boosting algorithms using confidence-rated predictions," published in Machine Learning volume 37, generalized the framework so that weak hypotheses output real numbers whose sign is the predicted class and whose magnitude is a confidence. Their analysis introduced the normalization factor Z_t = sum_i w_i * exp(-alpha_t * y_i * h_t(x_i)) and showed that the training error of the combined classifier is upper bounded by the product of Z_t over all rounds. Choosing each weak hypothesis to minimize Z_t gives a clean, unified design criterion. The paper also gave concrete recipes for confidence-rated decision trees, domain-partitioning weak learners, and the AdaBoost.MH and AdaBoost.MR multi-label and ranking algorithms.

LogitBoost and Gentle AdaBoost

Jerome Friedman, Trevor Hastie and Robert Tibshirani's 2000 paper "Additive logistic regression: a statistical view of boosting," published in The Annals of Statistics volume 28 number 2, reframed AdaBoost as a stagewise greedy fitting procedure for an additive model, with the loss function being the exponential loss exp(-y * F(x)). They observed that the exponential loss is, to second order, equivalent to the binomial log-likelihood, and that AdaBoost can be viewed as approximating Newton steps on this loss. From that observation they derived two new algorithms. LogitBoost works directly on the binomial log-likelihood, and Gentle AdaBoost replaces the closed-form alpha_t with bounded least-squares Newton steps. In the empirical comparisons in that paper, LogitBoost, Real AdaBoost and Gentle AdaBoost all outperformed Discrete AdaBoost on stumps.

SAMME and SAMME.R

Ji Zhu, Hui Zou, Saharon Rosset and Trevor Hastie's 2009 paper "Multi-class AdaBoost," published in Statistics and Its Interface volume 2, gave a multi-class generalization that does not need to be reduced to a sequence of binary problems. The discrete version, SAMME (Stagewise Additive Modeling using a Multi-class Exponential loss function), uses the same weight update as AdaBoost but adds the term ln(K - 1) to alpha_t, which lets the algorithm work whenever each weak learner has accuracy better than 1/K rather than the much harder 1/2 threshold required by AdaBoost.M1. SAMME.R is the real-valued companion that uses class probability estimates from the weak learners. Until version 1.6, scikit-learn's AdaBoostClassifier used SAMME.R as the default; SAMME is now the default.

Margin theory

In experiments with AdaBoost, Leo Breiman, Schapire and others noticed something puzzling: the test error of the combined classifier often kept decreasing even after the training error had hit zero, and after thousands of rounds. Naive Occam's razor arguments suggested this should not happen, since each new weak learner adds capacity to the model.

Robert Schapire, Yoav Freund, Peter Bartlett and Wee Sun Lee gave an explanation in their 1998 Annals of Statistics paper "Boosting the margin: A new explanation for the effectiveness of voting methods." They defined the margin of a training example with respect to a voting classifier as the difference between the vote weight given to the correct label and the vote weight given to the most popular incorrect label, normalized by the total vote weight. They proved a generalization bound that depends on the margin distribution, the number of training examples and the VC dimension of the base hypothesis class, but not on the number of rounds T. They then showed empirically that AdaBoost continues to improve the margin distribution long after the training error has dropped to zero, which is consistent with continued improvement in test error.

Leo Breiman challenged parts of this explanation in 1999 by constructing arc-gv, an algorithm that achieves a larger minimum margin than AdaBoost but generalizes worse. Lev Reyzin and Schapire revisited the question in 2006 and argued that the full margin distribution, not just the minimum margin, is what matters, and that arc-gv achieves a larger minimum margin only by using more complex weak learners. The margin theory and its critiques are surveyed in Schapire and Freund's 2012 textbook Boosting: Foundations and Algorithms (MIT Press).

Connection to gradient boosting

The Friedman, Hastie and Tibshirani 2000 paper paved the way for a much broader view of boosting. In Jerome Friedman's 2001 paper "Greedy function approximation: A gradient boosting machine," published in The Annals of Statistics volume 29 number 5, he generalized stagewise additive modeling by treating each new weak learner as a steepest-descent step in function space with respect to an arbitrary differentiable loss function. AdaBoost is then exactly the special case of gradient boosting with the exponential loss and a binary classification target. Other choices of loss give other algorithms: squared error gives least-squares boosting for regression, the Huber loss gives a robust regressor, the multinomial deviance gives a multi-class classifier, and so on. This gradient-in-function-space view is the conceptual foundation of modern gradient-boosted decision tree libraries such as XGBoost, LightGBM and CatBoost.

Applications

Viola-Jones face detection

The most famous early application of AdaBoost was Paul Viola and Michael Jones's face detector, presented at the IEEE Conference on Computer Vision and Pattern Recognition in 2001 in the paper "Rapid object detection using a boosted cascade of simple features." The detector had three main ingredients. First, it represented images using a so-called integral image, which makes the sum of pixel values over any axis-aligned rectangle computable in constant time. Second, it computed Haar-like rectangle features over candidate windows; in a 24 by 24 pixel window, there are about 162,000 such features. Third, and most relevant here, it used a modified AdaBoost both to select a small subset of these features and to weight them. Each weak learner was a decision stump on a single Haar feature, and at each round AdaBoost both picked the best feature and assigned it a vote weight.

The Viola-Jones detector then arranged a sequence of AdaBoost-trained classifiers in a cascade, where each stage rejected a large fraction of background windows and only forwarded the survivors to the next, more expensive stage. The result was the first face detector that ran in real time on commodity hardware, achieving roughly 15 frames per second on a 2001-era desktop CPU. It shipped with OpenCV in the form of the haarcascade_frontalface_*.xml files and ran in many digital cameras and consumer products through the 2000s.

Other early applications

AdaBoost saw heavy use in the late 1990s and early 2000s as a strong baseline for text classification, tabular supervised learning, and certain bioinformatics tasks. Several pre-deep-learning systems for handwritten character recognition, speech and music classification, and information retrieval ranking used AdaBoost or close variants. The 2012 textbook by Schapire and Freund collects many of these case studies.

Strengths and weaknesses

AdaBoost has been popular for nearly thirty years for several practical reasons.

It is simple to implement, with few hyperparameters: the choice of weak learner, the number of rounds T, and optionally a learning rate that shrinks each alpha_t.
It comes with strong theoretical guarantees: an exponential decay of training error, a generalization bound that depends on the margin distribution rather than on T, and an interpretation as gradient descent on a convex loss.
With decision-stump or shallow-tree weak learners it gives strong out-of-the-box performance on many tabular classification problems, especially with moderate amounts of clean training data.
It does not require a separate validation set for early stopping, since the margin theory predicts that overfitting is mild as long as the weak learners stay simple.

It also has well-known weaknesses.

It is sensitive to label noise and outliers. Mislabeled or hard-to-fit examples accumulate exponentially growing weights, which lets a few bad examples dominate the later rounds and hurt test accuracy. The exponential loss is not robust in the sense that a single very negative margin contributes a huge gradient.
It assumes the weak learner can consistently beat random guessing on the reweighted distribution. If the data is too noisy or the base learner is too weak, the algorithm will stall.
It is harder to parallelize than random forest because the rounds are sequential: each weak learner depends on the weights produced by the previous ones.
The exponential loss is less calibrated for probability estimation than the log-loss used by LogitBoost or modern gradient-boosted trees, although a logit transform of the AdaBoost score gives a usable estimate.

Comparison with other ensemble methods

Method	Base learner	Loss / criterion	Aggregation	Parallelism	Typical strengths
AdaBoost (discrete)	Decision stump or shallow tree	Exponential loss	Weighted vote, sequential	Sequential within a model; trivially parallel for ensembles of ensembles	Simple, theoretically clean, strong on small to medium tabular data
Random forest (Breiman 2001)	Deep decision tree	Gini or information gain at each node	Equal-weight vote or average over independently trained trees	Embarrassingly parallel	Robust to noise, low variance, good default for tabular data
Gradient boosting (Friedman 2001)	Regression tree	Any differentiable loss (log-loss, squared error, Huber, etc.)	Weighted sum, sequential	Sequential rounds; parallel over features per round	Flexible loss, strong accuracy, good for ranking and regression
XGBoost (Chen and Guestrin 2016)	Regression tree with second-order split finding	Differentiable loss with explicit L1/L2 regularization	Weighted sum, sequential	Parallel split finding, distributed training	Fast, regularized, dominant on tabular Kaggle competitions for years
LightGBM (Ke et al. 2017)	Histogram-based regression tree, leaf-wise growth	Differentiable loss with L1/L2 regularization	Weighted sum, sequential	Histogram bucketing, distributed training	Very fast on large datasets, strong on high-cardinality categorical features

AdaBoost is rarely the best choice for a new tabular ML problem in 2026. The XGBoost, LightGBM and CatBoost gradient-boosted decision tree libraries are typically more accurate, more robust to noise, faster to train and easier to regularize. AdaBoost remains useful as a small, fast, theoretically transparent baseline, and it is still widely taught because the algorithm itself is short and the analysis is illuminating.

Implementations

AdaBoost is shipped in nearly every general-purpose machine learning library. The most commonly used reference implementation is in scikit-learn, which provides sklearn.ensemble.AdaBoostClassifier and sklearn.ensemble.AdaBoostRegressor. By default the classifier uses decision stumps as base estimators, 50 rounds (n_estimators=50) and the SAMME algorithm; the regressor uses decision trees of depth 3 and Drucker's AdaBoost.R2. Other libraries with AdaBoost implementations include Weka, R's ada and gbm packages, MATLAB's fitcensemble, and (historically) OpenCV's CvBoost class for the cascaded face detection use case.

Theoretical results to know

A short list of the most cited theoretical results about AdaBoost:

Training error bound (Freund and Schapire 1997): the training error of the combined classifier is at most the product of 2 * sqrt(eps_t * (1 - eps_t)) over rounds, which decays exponentially in the sum of squared edges if every weak learner beats random guessing.
Generalization bound via VC dimension (Freund and Schapire 1997): with N training examples and base hypothesis class of VC dimension d, the test error of the T-round classifier is at most O(sqrt(T*d/N)), which is loose because it grows with T.
Margin generalization bound (Schapire, Freund, Bartlett and Lee 1998): a tighter bound that depends on the margin distribution, the number of training examples and the VC dimension of the base class, but not on T.
Boosting as additive logistic regression (Friedman, Hastie and Tibshirani 2000): AdaBoost is equivalent to forward stagewise minimization of the exponential loss exp(-y * F(x)) over additive functions F.
Boosting as gradient descent in function space (Friedman 2001; also Mason, Baxter, Bartlett and Frean 2000): more general boosting algorithms can be derived as steepest descent on any differentiable loss, with AdaBoost as the exponential-loss special case.
Multi-class consistency (Zhu, Zou, Rosset and Hastie 2009): SAMME is a Fisher-consistent algorithm for the multi-class problem, which AdaBoost.M1 is not in general.

References

Freund, Y. and Schapire, R. E. (1995). "A decision-theoretic generalization of on-line learning and an application to boosting." In *Computational Learning Theory: Second European Conference, EuroCOLT'95*, pp. 23 to 37. Springer.
Freund, Y. and Schapire, R. E. (1997). "A decision-theoretic generalization of on-line learning and an application to boosting." *Journal of Computer and System Sciences*, 55(1):119 to 139.
Schapire, R. E. (1990). "The strength of weak learnability." *Machine Learning*, 5(2):197 to 227.
Freund, Y. (1995). "Boosting a weak learning algorithm by majority." *Information and Computation*, 121(2):256 to 285.
Schapire, R. E., Freund, Y., Bartlett, P. and Lee, W. S. (1998). "Boosting the margin: A new explanation for the effectiveness of voting methods." *The Annals of Statistics*, 26(5):1651 to 1686.
Schapire, R. E. and Singer, Y. (1999). "Improved boosting algorithms using confidence-rated predictions." *Machine Learning*, 37(3):297 to 336.
Friedman, J., Hastie, T. and Tibshirani, R. (2000). "Additive logistic regression: a statistical view of boosting (with discussion)." *The Annals of Statistics*, 28(2):337 to 407.
Friedman, J. H. (2001). "Greedy function approximation: A gradient boosting machine." *The Annals of Statistics*, 29(5):1189 to 1232.
Mason, L., Baxter, J., Bartlett, P. and Frean, M. (2000). "Boosting algorithms as gradient descent." In *Advances in Neural Information Processing Systems 12*, pp. 512 to 518. MIT Press.
Viola, P. and Jones, M. (2001). "Rapid object detection using a boosted cascade of simple features." In *Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition*, vol. 1, pp. I-511 to I-518.
Drucker, H. (1997). "Improving regressors using boosting techniques." In *Proceedings of the 14th International Conference on Machine Learning*, pp. 107 to 115.
Zhu, J., Zou, H., Rosset, S. and Hastie, T. (2009). "Multi-class AdaBoost." *Statistics and Its Interface*, 2(3):349 to 360.
Reyzin, L. and Schapire, R. E. (2006). "How boosting the margin can also boost classifier complexity." In *Proceedings of the 23rd International Conference on Machine Learning*, pp. 753 to 760.
Hastie, T., Tibshirani, R. and Friedman, J. (2009). *The Elements of Statistical Learning: Data Mining, Inference, and Prediction*, 2nd edition, chapter 10. Springer.
Schapire, R. E. and Freund, Y. (2012). *Boosting: Foundations and Algorithms*. MIT Press.
European Association for Theoretical Computer Science. "Gödel Prize 2003: Yoav Freund and Robert E. Schapire." EATCS announcement, 2003.
Pedregosa, F. et al. (2011). "Scikit-learn: Machine learning in Python." *Journal of Machine Learning Research*, 12:2825 to 2830. (For the `AdaBoostClassifier` and `AdaBoostRegressor` reference implementations.)
Kearns, M. and Valiant, L. G. (1989). "Cryptographic limitations on learning Boolean formulae and finite automata." In *Proceedings of the 21st Annual ACM Symposium on Theory of Computing*, pp. 433 to 444. (Original "weak vs strong learnability" question.)

AdaBoost

AdaBoost

Definition and intuition

History

The discrete AdaBoost algorithm

Variants

Real AdaBoost and confidence-rated predictions

LogitBoost and Gentle AdaBoost

SAMME and SAMME.R

Margin theory

Connection to gradient boosting

Applications

Viola-Jones face detection

Other early applications

Strengths and weaknesses

Comparison with other ensemble methods

Implementations

Theoretical results to know

See also

References

Improve this article

AdaBoost

Definition and intuition

History

The discrete AdaBoost algorithm

Variants

Real AdaBoost and confidence-rated predictions

LogitBoost and Gentle AdaBoost

SAMME and SAMME.R

Margin theory

Connection to gradient boosting

Applications

Viola-Jones face detection

Other early applications

Strengths and weaknesses

Comparison with other ensemble methods

Implementations

Theoretical results to know

See also

References

AdaBoost

Definition and intuition

History

The discrete AdaBoost algorithm

Variants

Real AdaBoost and confidence-rated predictions

LogitBoost and Gentle AdaBoost

SAMME and SAMME.R

Margin theory

Connection to gradient boosting

Applications

Viola-Jones face detection

Other early applications

Strengths and weaknesses

Comparison with other ensemble methods

Implementations

Theoretical results to know

See also

References

Improve this article

Related Articles

ARC-AGI 2

Gradient boosted (decision) trees (GBT)

Bagging

Boosting

Decision Forest

Ensemble

AdaBoost

Definition and intuition

History

The discrete AdaBoost algorithm

Variants

Real AdaBoost and confidence-rated predictions

LogitBoost and Gentle AdaBoost

SAMME and SAMME.R

Margin theory

Connection to gradient boosting

Applications

Viola-Jones face detection

Other early applications

Strengths and weaknesses

Comparison with other ensemble methods

Implementations

Theoretical results to know

See also

References

Related Articles

ARC-AGI 2

Gradient boosted (decision) trees (GBT)

Bagging

Boosting

Decision Forest

Ensemble