# AdaBoost

> Source: https://aiwiki.ai/wiki/adaboost
> Updated: 2026-07-11
> Categories: Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**AdaBoost** (short for **Adaptive Boosting**) is a [machine learning](/wiki/machine_learning) [ensemble](/wiki/ensemble) algorithm that combines many weak classifiers into a single strong classifier through a weighted majority vote. It trains the weak learners one at a time and reweights the training examples after each round, so that later learners focus on the cases earlier ones got wrong. It was introduced by Yoav Freund and Robert E. Schapire in a 1995 conference paper and a 1997 journal article, and it was the first practical [boosting](/wiki/boosting) algorithm to gain widespread use.[1][2] Freund and Schapire won the 2003 Gödel Prize for the work, the first time a [machine learning](/wiki/machine_learning) paper received that award.[16] The algorithm is called "adaptive" because it requires no prior knowledge of how accurate its weak learners will be: it adapts to whatever accuracies the base algorithm delivers, provided each weak learner does at least slightly better than random guessing.[2]

AdaBoost is one of the most influential algorithms in statistical machine learning. The Gödel Prize, which Freund and Schapire received in 2003 for the paper that introduced the algorithm, is awarded jointly by the European Association for Theoretical Computer Science (EATCS) and the ACM Special Interest Group on Algorithms and Computation Theory (SIGACT); the two authors also received the ACM Paris Kanellakis Theory and Practice Award in 2004 for the same line of research.[16] AdaBoost was also the engine behind the Viola-Jones face detector (2001), the first real-time face detection system, which used AdaBoost to select a small subset of features and shipped in commodity software such as OpenCV.[10]

## What is AdaBoost in simple terms?

AdaBoost builds a [classification](/wiki/classification) function as a weighted sum of simple base hypotheses, called weak learners or weak classifiers. A weak learner is any predictor that does slightly better than random guessing on the training distribution. As the canonical reference puts it, "as long as the performance of each one is slightly better than random guessing, the final model can be proven to converge to a strong learner." The most common choice is a depth-one decision tree, also called a decision [stump](/wiki/decision_tree), which thresholds a single feature.

The algorithm trains these weak learners one at a time. After each round, it computes which training examples the latest weak learner classified incorrectly, and increases the weight on those examples. The next weak learner is then trained on the reweighted distribution, so it pays more attention to the hard cases. After T rounds, the final classifier is a weighted vote of all T weak learners, with more accurate weak learners receiving higher voting weights. Schapire has noted that AdaBoost can be implemented in roughly ten lines of code, which is part of why it became such a popular pedagogical example for [ensemble learning](/wiki/ensemble_learning).[15]

## When was AdaBoost invented?

The theoretical question that led to AdaBoost was first posed by Michael Kearns and Leslie Valiant in 1988 and 1989, in the framework of probably approximately correct (PAC) learning.[18] Kearns and Valiant asked whether a "weakly" learnable concept class, one for which some algorithm can do slightly better than random guessing, must also be "strongly" learnable, meaning learnable to arbitrarily small error. Robert Schapire answered yes in his 1990 Machine Learning paper "The strength of weak learnability," and Yoav Freund gave a more efficient construction in his 1995 paper "Boosting a weak learning algorithm by majority."[3][4] Both early constructions required prior knowledge of the weak learner's accuracy, which made them awkward to use in practice.

Freund and Schapire's 1995 paper, "A decision-theoretic generalization of on-line learning and an application to boosting," presented at the second European Conference on Computational Learning Theory (EuroCOLT), removed that limitation.[1] The full version was published in 1997 in the Journal of Computer and System Sciences, volume 55, issue 1, pages 119 to 139.[2] The algorithm they described in that paper is what is now called AdaBoost. The name reflects that the algorithm adapts to whatever errors the weak learner happens to produce on each round.

In 2003 the European Association for Theoretical Computer Science and ACM SIGACT awarded Freund and Schapire the Gödel Prize for this paper.[16] It was the first machine learning paper to win the Gödel Prize, which is given for outstanding work in theoretical computer science.

## How does the AdaBoost algorithm work?

Discrete AdaBoost, the original version, solves a binary classification problem. Inputs are training examples $$(x_1, y_1), \ldots, (x_N, y_N)$$ with labels $$y_i \in \{-1, +1\}$$, a weak learning algorithm, and a number of rounds T.[2]

1. Initialize weights $$w_i = 1/N$$ for $$i = 1, \ldots, N$$.
2. For t = 1 to T:
   - Train the weak learner on the data weighted by w to obtain a hypothesis $$h_t : X \to \{-1, +1\}$$.
   - Compute the weighted training error $$\epsilon_t = \sum_i w_i \mathbf{1}[h_t(x_i) \ne y_i]$$.
   - If $$\epsilon_t \ge 1/2$$, stop the loop (or, when $$\epsilon_t > 1/2$$, invert $$h_t$$ and continue, since a weak learner that does worse than chance can be flipped into one that does better).
   - Set the weak learner's coefficient: $$\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$$.
   - Update the example weights: $$w_i \leftarrow w_i \exp(-\alpha_t y_i h_t(x_i))$$ and renormalize so that the new weights sum to 1.
3. Output the final classifier $$H(x) = \mathrm{sign}\left(\sum_t \alpha_t h_t(x)\right)$$.
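The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names (`stump_predict`, `train_stump`, `adaboost_fit`, `adaboost_predict`) are hypothetical, the weak learner is assumed to be an exhaustive-search decision stump, and the clamp applied when a stump has zero error is an added safeguard that the algorithm statement does not specify.

```python
import math

def stump_predict(row, feature, threshold, polarity):
    # A decision stump: predict `polarity` above the threshold, `-polarity` otherwise.
    return polarity if row[feature] > threshold else -polarity

def train_stump(X, y, w):
    """Exhaustively search for the stump with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        # Candidates: one threshold below all values, then midpoints between neighbours.
        thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
        for thr in thresholds:
            for polarity in (+1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, j, thr, polarity) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def adaboost_fit(X, y, n_rounds):
    n = len(X)
    w = [1.0 / n] * n                      # step 1: uniform weights
    ensemble = []
    for _ in range(n_rounds):              # step 2: boosting rounds
        err, j, thr, pol = train_stump(X, y, w)
        if err >= 0.5:                     # no edge over random guessing: stop
            break
        err = max(err, 1e-12)              # assumption: avoid log(1/0) for a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, thr, pol))
        # Multiply each weight by exp(-alpha * y_i * h_t(x_i)), then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(row, j, thr, pol))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, row):
    # Step 3: sign of the alpha-weighted vote (ties go to +1 by convention here).
    score = sum(a * stump_predict(row, j, thr, pol) for a, j, thr, pol in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D problem that no single stump can fit: labels + + - - + +.
X = [[0], [1], [2], [3], [4], [5]]
y = [1, 1, -1, -1, 1, 1]
ensemble = adaboost_fit(X, y, n_rounds=3)
predictions = [adaboost_predict(ensemble, row) for row in X]
```

On this toy set a single stump misclassifies at least two of the six points, while the three-round ensemble reproduces every label. That shows how a vote over individually weak stumps can represent an interval that no one stump can.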

A few features of the update rule are worth pulling out. The coefficient $$\alpha_t$$ is positive when $$\epsilon_t$$ is below 1/2 and grows without bound as $$\epsilon_t$$ approaches zero, so very accurate weak learners get very large votes. The factor $$\exp(-\alpha_t y_i h_t(x_i))$$ increases the weight of an example by a factor of $$\sqrt{(1 - \epsilon_t) / \epsilon_t}$$ when $$h_t$$ makes a mistake on it, and decreases it by the inverse factor when $$h_t$$ classifies it correctly. After renormalization, the weighted error of $$h_t$$ on the new distribution is exactly 1/2, which means the next weak learner cannot simply repeat $$h_t$$.
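A small numeric check makes the reweighting concrete. The sketch below assumes ten equally weighted examples, of which the weak learner gets two wrong (so $$\epsilon_t = 0.2$$). The function name `reweight` and the data are invented for illustration.

```python
import math

def reweight(weights, correct, eps):
    """Apply one AdaBoost weight update and renormalize."""
    alpha = 0.5 * math.log((1 - eps) / eps)
    # Correct examples are scaled by exp(-alpha), mistakes by exp(+alpha).
    new = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new]

weights = [0.1] * 10
correct = [True] * 8 + [False] * 2
eps = sum(w for w, ok in zip(weights, correct) if not ok)   # weighted error, 0.2

new_weights = reweight(weights, correct, eps)
# Weighted error of the same hypothesis under the updated distribution.
new_error = sum(w for w, ok in zip(new_weights, correct) if not ok)
```

Here the scaling factor is $$\sqrt{0.8 / 0.2} = 2$$, so each mistake doubles in weight and each correct example halves. After renormalization the two mistakes carry half of the total weight, which is the "exactly 1/2" property described above.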

Freund and Schapire proved an upper bound on the training error of the final classifier.[2] If the weighted error on round t is $$\epsilon_t$$ and $$\gamma_t = 1/2 - \epsilon_t$$ is the edge over random guessing, then the training error of H is at most the product over t of $$2\sqrt{\epsilon_t (1 - \epsilon_t)}$$, which is in turn at most $$\exp(-2 \sum_t \gamma_t^2)$$. In words, as long as every weak learner's edge is at least some fixed $$\gamma > 0$$, the training error is at most $$\exp(-2 \gamma^2 T)$$ and so decays exponentially in the number of rounds. This was the first PAC-style boosting result that did not require the algorithm to know the edge $$\gamma$$ in advance.
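The second inequality follows because $$2\sqrt{\epsilon_t (1 - \epsilon_t)} = \sqrt{1 - 4\gamma_t^2}$$ and $$1 - x \le e^{-x}$$. The sketch below evaluates both bounds for a given sequence of weighted errors. The function name `training_error_bounds` is hypothetical, and the constant error of 0.4 per round is an assumed example rather than a measured one.

```python
import math

def training_error_bounds(errors):
    """Return (product bound, exponential bound) for a list of weighted errors."""
    product_bound = math.prod(2 * math.sqrt(e * (1 - e)) for e in errors)
    exp_bound = math.exp(-2 * sum((0.5 - e) ** 2 for e in errors))
    return product_bound, exp_bound

# 100 rounds, each weak learner only slightly better than chance (edge 0.1).
product_bound, exp_bound = training_error_bounds([0.4] * 100)
```

Even with an edge of only 0.1 per round, both bounds fall well below 1 after 100 rounds, and the product bound is never larger than the exponential one.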

## What are the main variants of AdaBoost?

Researchers have introduced many variants of AdaBoost to handle different prediction tasks, base learners, or robustness concerns. The table below summarizes the main ones.

| Variant | Reference | Output type | Key difference from discrete AdaBoost |
|---|---|---|---|
| Discrete AdaBoost | Freund and Schapire 1995, 1997 | Binary, {-1, +1} weak hypotheses | Original algorithm. |
| AdaBoost.M1 | Freund and Schapire 1996, 1997 | [Multi-class](/wiki/multi-class_classification) | Direct multi-class extension. Requires each weak learner to beat 1/2 on the weighted distribution, which is hard with many classes. |
| AdaBoost.M2 | Freund and Schapire 1996, 1997 | Multi-class | Reduces multi-class to a set of binary problems over (example, wrong label) pairs and uses a pseudo-loss. |
| AdaBoost.MH | Schapire and Singer 1999 | Multi-label / multi-class | Reduces multi-label classification to a set of binary classification problems, one per class. |
| AdaBoost.MR | Schapire and Singer 1999 | Multi-class ranking | Optimizes a ranking loss over (correct, incorrect) label pairs. |
| Real AdaBoost | Schapire and Singer 1999 | Binary, real-valued confidences | Weak learners output real numbers rather than {-1, +1}. The algorithm uses confidence-rated predictions and chooses each $$h_t$$ to minimize a normalization factor $$Z_t$$. |
| AdaBoost.R, R2, R.MH | Drucker 1997; Schapire and Singer 1999 | [Regression](/wiki/regression) | Variants for real-valued targets. AdaBoost.R2 by Drucker is the basis of scikit-learn's `AdaBoostRegressor`. |
| LogitBoost | Friedman, Hastie and Tibshirani 2000 | Binary | Fits an additive [logistic regression](/wiki/logistic_regression) model by Newton-style stagewise minimization of the log-loss. |
| Gentle AdaBoost | Friedman, Hastie and Tibshirani 2000 | Binary | Like Real AdaBoost but uses bounded Newton steps, which is more numerically stable when class probabilities are near 0 or 1. |
| SAMME | Zhu, Zou, Rosset and Hastie 2009 | Multi-class | Multi-class generalization that only requires weak learners to beat $$1/K$$ accuracy, where K is the number of classes. Adds the term $$\ln(K-1)$$ to $$\alpha_t$$. |
| SAMME.R | Zhu, Zou, Rosset and Hastie 2009 | Multi-class | Real-valued version of SAMME using class probability estimates from the weak learners. Often converges faster than SAMME on the same number of rounds. |

### Real AdaBoost and confidence-rated predictions

Robert Schapire and Yoram Singer's 1999 paper "Improved boosting algorithms using confidence-rated predictions," published in Machine Learning volume 37, generalized the framework so that weak hypotheses output real numbers whose sign is the predicted class and whose magnitude is a confidence.[6] Their analysis introduced the normalization factor $$Z_t = \sum_i w_i \exp(-\alpha_t y_i h_t(x_i))$$ and showed that the training error of the combined classifier is upper bounded by the product of $$Z_t$$ over all rounds. Choosing each weak hypothesis to minimize $$Z_t$$ gives a clean, unified design criterion. The paper also gave concrete recipes for confidence-rated [decision trees](/wiki/decision_tree), domain-partitioning weak learners, and the AdaBoost.MH and AdaBoost.MR multi-label and ranking algorithms.
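For the discrete case, where $$h_t(x_i) \in \{-1, +1\}$$, minimizing $$Z_t$$ over $$\alpha$$ recovers the coefficient used by Discrete AdaBoost. A short derivation, assuming the weights $$w_i$$ sum to 1 and writing $$\epsilon_t$$ for the weighted error as in the algorithm above:

$$
\begin{aligned}
Z_t(\alpha) &= \sum_i w_i \exp(-\alpha y_i h_t(x_i)) = (1 - \epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha}, \\
\frac{dZ_t}{d\alpha} &= -(1 - \epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha} = 0
\;\Longrightarrow\; \alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right), \\
Z_t(\alpha_t) &= 2\sqrt{\epsilon_t (1 - \epsilon_t)}.
\end{aligned}
$$

The minimized value is the per-round factor in the Freund and Schapire training error bound, so the product-of-$$Z_t$$ bound reduces to that earlier result when the weak hypotheses are binary.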

### LogitBoost and Gentle AdaBoost

Jerome Friedman, Trevor Hastie and Robert Tibshirani's 2000 paper "Additive logistic regression: a statistical view of boosting," published in The Annals of Statistics volume 28 number 2, reframed AdaBoost as a stagewise greedy fitting procedure for an additive model, with the loss function being the exponential loss $$\exp(-y F(x))$$.[7] They observed that the exponential loss is, to second order, equivalent to the binomial log-likelihood, and that AdaBoost can be viewed as approximating Newton steps on this loss. From that observation they derived two new algorithms. LogitBoost works directly on the binomial log-likelihood, and Gentle AdaBoost replaces the closed-form $$\alpha_t$$ with bounded least-squares Newton steps. In the empirical comparisons in that paper, LogitBoost, Real AdaBoost and Gentle AdaBoost all outperformed Discrete AdaBoost on stumps.[7]
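The link between the exponential loss and class probabilities can be made explicit with a short calculation. Writing $$p(x) = P(y = +1 \mid x)$$ (notation introduced here for illustration), the conditional expectation of the exponential loss and its minimizer are:

$$
\begin{aligned}
E\left[e^{-y F(x)} \mid x\right] &= p(x)\, e^{-F(x)} + (1 - p(x))\, e^{F(x)}, \\
F^*(x) &= \frac{1}{2} \ln\left(\frac{p(x)}{1 - p(x)}\right),
\qquad p(x) = \frac{1}{1 + e^{-2 F^*(x)}}.
\end{aligned}
$$

The minimizer is half the log-odds of the positive class. This is why an additive model fitted under exponential loss can be read as an additive logistic regression model.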

### SAMME and SAMME.R

Ji Zhu, Hui Zou, Saharon Rosset and Trevor Hastie's 2009 paper "Multi-class AdaBoost," published in Statistics and Its Interface volume 2, gave a multi-class generalization that does not need to be reduced to a sequence of binary problems.[12] The discrete version, SAMME (Stagewise Additive Modeling using a Multi-class Exponential loss function), uses the same weight update as AdaBoost but adds the term $$\ln(K - 1)$$ to $$\alpha_t$$, which lets the algorithm work whenever each weak learner has accuracy better than $$1/K$$ rather than the much harder $$1/2$$ threshold required by AdaBoost.M1.[12] SAMME.R is the real-valued companion that uses class probability estimates from the weak learners. For many years scikit-learn's `AdaBoostClassifier` defaulted to SAMME.R; SAMME.R was deprecated in scikit-learn 1.4 and removed in 1.6, which left SAMME as the only algorithm the estimator implements.[17]
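The effect of the extra $$\ln(K - 1)$$ term can be seen by evaluating the SAMME coefficient directly. The sketch below uses the form of $$\alpha_t$$ given in the Zhu et al. paper, which has no factor of 1/2. The function name `samme_alpha` is hypothetical.

```python
import math

def samme_alpha(err, n_classes):
    """SAMME coefficient: log-odds of being correct plus the multi-class correction."""
    return math.log((1 - err) / err) + math.log(n_classes - 1)

# With K = 3 classes, random guessing has error 2/3.
alpha_useful = samme_alpha(0.60, 3)    # better than chance: positive vote
alpha_useless = samme_alpha(0.70, 3)   # worse than chance: negative vote
alpha_binary = samme_alpha(0.20, 2)    # K = 2: the correction term vanishes
```

The coefficient is positive exactly when the weighted error is below $$1 - 1/K$$, so a three-class weak learner with 60% error still receives a positive vote, whereas a rule requiring error below 1/2 would reject it.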

## Why does AdaBoost keep improving after zero training error?

In experiments with AdaBoost, Leo Breiman, Schapire and others noticed something puzzling: the test error of the combined classifier often kept decreasing after the training error had reached zero, even as boosting ran on for thousands of rounds.[5] Naive Occam's razor arguments suggested this should not happen, since each new weak learner adds capacity to the model.

Robert Schapire, Yoav Freund, Peter Bartlett and Wee Sun Lee gave an explanation in their 1998 Annals of Statistics paper "Boosting the margin: A new explanation for the effectiveness of voting methods."[5] They defined the margin of a training example with respect to a voting classifier as the difference between the vote weight given to the correct label and the vote weight given to the most popular incorrect label, normalized by the total vote weight. They proved a generalization bound that depends on the margin distribution, the number of training examples and the VC dimension of the base hypothesis class, but not on the number of rounds T. They then showed empirically that AdaBoost continues to improve the margin distribution long after the training error has dropped to zero, which is consistent with continued improvement in test error.[5]
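For binary labels in {-1, +1} and non-negative vote weights, this definition reduces to $$y_i \sum_t \alpha_t h_t(x_i) / \sum_t \alpha_t$$, a number in [-1, 1] that is positive exactly when the example is classified correctly. A minimal sketch, with the function name `normalized_margins` and the small example values assumed for illustration:

```python
def normalized_margins(alphas, predictions, y):
    """predictions[t][i] is h_t(x_i) in {-1, +1}; returns one margin per example."""
    total = sum(abs(a) for a in alphas)
    return [
        yi * sum(a * preds[i] for a, preds in zip(alphas, predictions)) / total
        for i, yi in enumerate(y)
    ]

# Three weak hypotheses voting on four examples (made-up values).
alphas = [0.35, 0.55, 0.80]
predictions = [
    [+1, +1, +1, +1],
    [+1, -1, -1, -1],
    [-1, -1, +1, +1],
]
y = [+1, -1, +1, -1]
margins = normalized_margins(alphas, predictions, y)
```

Sorting these values gives the empirical margin distribution that the bound refers to. A classifier can have zero training error (all margins positive) and still improve by pushing small margins further from zero.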

Leo Breiman challenged parts of this explanation in 1999 by constructing arc-gv, an algorithm that achieves a larger minimum margin than AdaBoost but generalizes worse. Lev Reyzin and Schapire revisited the question in 2006 and argued that the full margin distribution, not just the minimum margin, is what matters, and that arc-gv achieves a larger minimum margin only by using more complex weak learners.[13] The margin theory and its critiques are surveyed in Schapire and Freund's 2012 textbook *Boosting: Foundations and Algorithms* (MIT Press).[15]

## How does AdaBoost relate to gradient boosting?

The Friedman, Hastie and Tibshirani 2000 paper paved the way for a much broader view of boosting.[7] In Jerome Friedman's 2001 paper "Greedy function approximation: A gradient boosting machine," published in The Annals of Statistics volume 29 number 5, he generalized stagewise additive modeling by treating each new weak learner as a steepest-descent step in function space with respect to an arbitrary differentiable loss function.[8] AdaBoost is then exactly the special case of [gradient boosting](/wiki/gradient_boosting) with the exponential loss and a binary classification target.[8] Other choices of loss give other algorithms: squared error gives least-squares boosting for regression, the Huber loss gives a robust regressor, the multinomial deviance gives a multi-class classifier, and so on. This gradient-in-function-space view, also derived independently by Mason, Baxter, Bartlett and Frean in 2000, is the conceptual foundation of modern gradient-boosted decision tree libraries such as [XGBoost](/wiki/xgboost), [LightGBM](/wiki/lightgbm) and CatBoost.[9]
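The correspondence can be checked directly. With the exponential loss and the current additive model $$F_{t-1}(x) = \sum_{s < t} \alpha_s h_s(x)$$ (notation assumed here for illustration), the negative gradient at training example i is:

$$
-\left.\frac{\partial}{\partial F}\, e^{-y_i F}\right|_{F = F_{t-1}(x_i)}
= y_i\, e^{-y_i F_{t-1}(x_i)} \propto y_i\, w_i .
$$

Unrolling the AdaBoost weight update shows that $$w_i$$ is proportional to $$\exp(-y_i F_{t-1}(x_i))$$, so the pseudo-residuals of gradient boosting are the labels scaled by the AdaBoost example weights. Since $$\sum_i w_i y_i h(x_i) = 1 - 2\epsilon$$ for a binary hypothesis h with weighted error $$\epsilon$$, the hypothesis best aligned with this gradient is the one with the lowest weighted error.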

## What is AdaBoost used for?

### Viola-Jones face detection

The most famous early application of AdaBoost was Paul Viola and Michael Jones's face detector, presented at the IEEE Conference on Computer Vision and Pattern Recognition in 2001 in the paper "Rapid object detection using a boosted cascade of simple features."[10] The detector had three main ingredients. First, it represented images using a so-called integral image, which makes the sum of pixel values over any axis-aligned rectangle computable in constant time. Second, it computed Haar-like rectangle features over candidate windows; the authors note that "the exhaustive set of rectangle features" in a 24 by 24 pixel detector sub-window "is quite large, over 180,000."[10] Third, and most relevant here, it used a learning algorithm "based on AdaBoost" both to select a small subset of these features and to weight them. Each weak learner was a decision stump on a single Haar feature, and at each round AdaBoost both picked the best feature and assigned it a vote weight.[10]
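The constant-time rectangle sum is the piece that makes evaluating so many features affordable, and it is easy to sketch. The code below is an illustration in plain Python rather than the detector's actual implementation. The function names, the zero-padded first row and column, and the "left half minus right half" sign convention for the two-rectangle feature are assumptions made for this example.

```python
def integral_image(img):
    """ii[r][c] holds the sum of img over rows < r and columns < c (zero-padded)."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        for c in range(w):
            ii[r + 1][c + 1] = img[r][c] + ii[r][c + 1] + ii[r + 1][c] - ii[r][c]
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of pixels in an axis-aligned rectangle using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]

def two_rect_feature(ii, top, left, height, width):
    """Haar-like two-rectangle feature: left half minus right half (width must be even)."""
    half = width // 2
    return (rect_sum(ii, top, left, height, half)
            - rect_sum(ii, top, left + half, height, half))

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
whole = rect_sum(ii, 0, 0, 3, 3)              # all nine pixels
lower_right = rect_sum(ii, 1, 1, 2, 2)        # 5 + 6 + 8 + 9
feature = two_rect_feature(ii, 0, 0, 3, 2)    # (1 + 4 + 7) - (2 + 5 + 8)
```

Building the table costs one pass over the image. After that, every rectangle sum takes four lookups regardless of the rectangle's size, so a decision stump on one feature needs only a handful of additions and a comparison.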

The Viola-Jones detector then arranged a sequence of AdaBoost-trained classifiers in a cascade of 38 stages, where each stage rejected a large fraction of background windows and only forwarded the survivors to the next, more expensive stage.[10] The result was the first face detector that ran in real time on commodity hardware, processing a 384 by 288 pixel image at "15 frames per second on a 700 MHz Pentium III processor," roughly 15 times faster than comparable detectors of the era.[10] It shipped with OpenCV in the form of the `haarcascade_frontalface_*.xml` files and ran in many digital cameras and consumer products through the 2000s.[19]
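The control flow of such a cascade is simple enough to sketch. In the snippet below the stage scoring functions and thresholds are hypothetical stand-ins. In the real detector each stage would be an AdaBoost-trained weighted vote over Haar features.

```python
def cascade_accepts(stages, window):
    """Run stages in order; reject as soon as any stage's score falls below its threshold."""
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False      # rejected early: later, costlier stages never run
    return True               # survived every stage

# Hypothetical two-stage cascade over a list of pixel intensities in [0, 1].
stages = [
    (lambda w: sum(w) / len(w), 0.3),    # cheap first stage: mean brightness
    (lambda w: max(w) - min(w), 0.5),    # second stage: contrast
]

dark_window = [0.1, 0.1, 0.2]      # fails stage 1
flat_window = [0.4, 0.5, 0.6]      # passes stage 1, fails stage 2
candidate = [0.2, 0.9, 0.4]        # passes both stages
results = [cascade_accepts(stages, w) for w in (dark_window, flat_window, candidate)]
```

Because most windows in a typical image are background, most of them exit after the first one or two stages. The average cost per window therefore stays far below the cost of evaluating every stage.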

### Other early applications

AdaBoost saw heavy use in the late 1990s and early 2000s as a strong baseline for [text classification](/wiki/text_classification_models), [tabular](/wiki/tabular_classification_models) supervised learning, and certain bioinformatics tasks. Several pre-deep-learning systems for handwritten character recognition, speech and music classification, and information retrieval ranking used AdaBoost or close variants. The 2012 textbook by Schapire and Freund collects many of these case studies.[15]

## What are the strengths and weaknesses of AdaBoost?

AdaBoost has been popular for nearly thirty years for several practical reasons.

- It is simple to implement, with few hyperparameters: the choice of weak learner, the number of rounds T, and optionally a learning rate that shrinks each $$\alpha_t$$.
- It comes with strong theoretical guarantees: an exponential decay of training error, a generalization bound that depends on the margin distribution rather than on T, and an interpretation as gradient descent on a convex loss.[2][5][8]
- With decision-stump or shallow-tree weak learners it gives strong out-of-the-box performance on many tabular classification problems, especially with moderate amounts of clean training data.
- It often tolerates long training runs without severe overfitting on clean data: the margin theory suggests that overfitting stays mild as long as the weak learners stay simple, although a held-out set is still advisable when labels are noisy.[5]

It also has well-known weaknesses.

- It is sensitive to label noise and outliers. Mislabeled or hard-to-fit examples accumulate exponentially growing weights, which lets a few bad examples dominate the later rounds and hurt test accuracy. Because each misclassified example's weight grows like $$\exp(\alpha_t)$$, the exponential loss is not robust: a single very negative margin contributes a huge gradient, so excessive weight can be assigned to outliers.[7]
- It assumes the weak learner can consistently beat random guessing on the reweighted distribution. If the data is too noisy or the base learner is too weak, the algorithm will stall.[2]
- It is harder to parallelize than [random forest](/wiki/random_forest) because the rounds are sequential: each weak learner depends on the weights produced by the previous ones.
- The exponential loss is less calibrated for probability estimation than the log-loss used by LogitBoost or modern gradient-boosted trees, although the logistic transform $$1 / (1 + e^{-2F(x)})$$ of the AdaBoost score $$F(x) = \sum_t \alpha_t h_t(x)$$ gives a usable estimate.[7]

## How does AdaBoost compare to random forests and gradient boosting?

| Method | Base learner | Loss / criterion | Aggregation | Parallelism | Typical strengths |
|---|---|---|---|---|---|
| AdaBoost (discrete) | Decision stump or shallow tree | Exponential loss | Weighted vote, sequential | Sequential within a model; trivially parallel for ensembles of ensembles | Simple, theoretically clean, strong on small to medium tabular data |
| [Random forest](/wiki/random_forest) (Breiman 2001) | Deep decision tree | Gini or information gain at each node | Equal-weight vote or average over independently trained trees | Embarrassingly parallel | Robust to noise, low variance, good default for tabular data |
| [Gradient boosting](/wiki/gradient_boosting) (Friedman 2001) | Regression tree | Any differentiable loss (log-loss, squared error, Huber, etc.) | Weighted sum, sequential | Sequential rounds; parallel over features per round | Flexible loss, strong accuracy, good for ranking and regression |
| [XGBoost](/wiki/xgboost) (Chen and Guestrin 2016) | Regression tree with second-order split finding | Differentiable loss with explicit L1/L2 regularization | Weighted sum, sequential | Parallel split finding, distributed training | Fast, regularized, dominant on tabular Kaggle competitions for years |
| [LightGBM](/wiki/lightgbm) (Ke et al. 2017) | Histogram-based regression tree, leaf-wise growth | Differentiable loss with L1/L2 regularization | Weighted sum, sequential | Histogram bucketing, distributed training | Very fast on large datasets, strong on high-cardinality categorical features |

AdaBoost is rarely the best choice for a new tabular ML problem in 2026. The XGBoost, LightGBM and CatBoost gradient-boosted decision tree libraries are typically more accurate, more robust to noise, faster to train and easier to regularize. AdaBoost remains useful as a small, fast, theoretically transparent baseline, and it is still widely taught because the algorithm itself is short and the analysis is illuminating.

## Implementations

AdaBoost is shipped in nearly every general-purpose machine learning library. The most commonly used reference implementation is in scikit-learn, which provides `sklearn.ensemble.AdaBoostClassifier` and `sklearn.ensemble.AdaBoostRegressor`.[17] By default the classifier uses decision stumps as base estimators, 50 rounds (`n_estimators=50`) and the SAMME algorithm; the regressor uses decision trees of depth 3 and Drucker's AdaBoost.R2.[11][17] Other libraries with AdaBoost implementations include Weka, R's `ada` and `gbm` packages, MATLAB's `fitcensemble`, and (historically) OpenCV's CvBoost class for the cascaded face detection use case.[19]
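A minimal usage sketch with scikit-learn follows. It relies only on the constructor defaults named above. The synthetic dataset from `make_classification`, the train/test split, and the `random_state` values are assumptions chosen for illustration, and the accuracy will vary with the data and library version.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (illustrative only).
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Default base estimator is a depth-1 decision tree (a stump).
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)          # mean accuracy on held-out data
n_fitted = len(clf.estimators_)               # may be fewer than 50 if boosting stops early
first_depth = clf.estimators_[0].get_depth()  # depth of the first weak learner
```

Increasing `n_estimators` or lowering `learning_rate` trades training time against how aggressively each round's vote is applied. Comparing a few settings on held-out data is a reasonable starting point.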

## Theoretical results to know

A short list of the most cited theoretical results about AdaBoost:

- **Training error bound (Freund and Schapire 1997):** the training error of the combined classifier is at most the product of $$2\sqrt{\epsilon_t (1 - \epsilon_t)}$$ over rounds, which decays exponentially in the sum of squared edges if every weak learner beats random guessing.[2]
- **Generalization bound via VC dimension (Freund and Schapire 1997):** with N training examples and base hypothesis class of VC dimension d, the test error of the T-round classifier is at most its training error plus a term of order $$\tilde{O}(\sqrt{T d / N})$$, which is loose because it grows with T.[2]
- **Margin generalization bound (Schapire, Freund, Bartlett and Lee 1998):** a tighter bound that depends on the margin distribution, the number of training examples and the VC dimension of the base class, but not on T.[5]
- **Boosting as additive logistic regression (Friedman, Hastie and Tibshirani 2000):** AdaBoost is equivalent to forward stagewise minimization of the exponential loss $$\exp(-y F(x))$$ over additive functions F.[7]
- **Boosting as gradient descent in function space (Friedman 2001; also Mason, Baxter, Bartlett and Frean 2000):** more general boosting algorithms can be derived as steepest descent on any differentiable loss, with AdaBoost as the exponential-loss special case.[8][9]
- **Multi-class consistency (Zhu, Zou, Rosset and Hastie 2009):** SAMME is a Fisher-consistent algorithm for the multi-class problem, which AdaBoost.M1 is not in general.[12]

## See also

- [Boosting](/wiki/boosting)
- [Gradient boosting](/wiki/gradient_boosting)
- [Gradient boosted decision trees](/wiki/gradient_boosted_decision_trees_gbt)
- [XGBoost](/wiki/xgboost)
- [LightGBM](/wiki/lightgbm)
- [Random forest](/wiki/random_forest)
- [Decision tree](/wiki/decision_tree)
- [Ensemble learning](/wiki/ensemble_learning)
- [Binary classification](/wiki/binary_classification)

## References

1. Freund, Y. and Schapire, R. E. (1995). "A decision-theoretic generalization of on-line learning and an application to boosting." In *Computational Learning Theory: Second European Conference, EuroCOLT'95*, pp. 23 to 37. Springer.
2. Freund, Y. and Schapire, R. E. (1997). "A decision-theoretic generalization of on-line learning and an application to boosting." *Journal of Computer and System Sciences*, 55(1):119 to 139. doi:10.1006/jcss.1997.1504.
3. Schapire, R. E. (1990). "The strength of weak learnability." *Machine Learning*, 5(2):197 to 227.
4. Freund, Y. (1995). "Boosting a weak learning algorithm by majority." *Information and Computation*, 121(2):256 to 285.
5. Schapire, R. E., Freund, Y., Bartlett, P. and Lee, W. S. (1998). "Boosting the margin: A new explanation for the effectiveness of voting methods." *The Annals of Statistics*, 26(5):1651 to 1686.
6. Schapire, R. E. and Singer, Y. (1999). "Improved boosting algorithms using confidence-rated predictions." *Machine Learning*, 37(3):297 to 336.
7. Friedman, J., Hastie, T. and Tibshirani, R. (2000). "Additive logistic regression: a statistical view of boosting (with discussion)." *The Annals of Statistics*, 28(2):337 to 407.
8. Friedman, J. H. (2001). "Greedy function approximation: A gradient boosting machine." *The Annals of Statistics*, 29(5):1189 to 1232.
9. Mason, L., Baxter, J., Bartlett, P. and Frean, M. (2000). "Boosting algorithms as gradient descent." In *Advances in Neural Information Processing Systems 12*, pp. 512 to 518. MIT Press.
10. Viola, P. and Jones, M. (2001). "Rapid object detection using a boosted cascade of simple features." In *Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition*, vol. 1, pp. I-511 to I-518.
11. Drucker, H. (1997). "Improving regressors using boosting techniques." In *Proceedings of the 14th International Conference on Machine Learning*, pp. 107 to 115.
12. Zhu, J., Zou, H., Rosset, S. and Hastie, T. (2009). "Multi-class AdaBoost." *Statistics and Its Interface*, 2(3):349 to 360.
13. Reyzin, L. and Schapire, R. E. (2006). "How boosting the margin can also boost classifier complexity." In *Proceedings of the 23rd International Conference on Machine Learning*, pp. 753 to 760.
14. Hastie, T., Tibshirani, R. and Friedman, J. (2009). *The Elements of Statistical Learning: Data Mining, Inference, and Prediction*, 2nd edition, chapter 10. Springer.
15. Schapire, R. E. and Freund, Y. (2012). *Boosting: Foundations and Algorithms*. MIT Press.
16. European Association for Theoretical Computer Science. "Gödel Prize 2003: Yoav Freund and Robert E. Schapire." EATCS / ACM SIGACT announcement, 2003. https://www.sigact.org/prizes/godel.html
17. Pedregosa, F. et al. (2011). "Scikit-learn: Machine learning in Python." *Journal of Machine Learning Research*, 12:2825 to 2830; and scikit-learn API documentation, `sklearn.ensemble.AdaBoostClassifier` and `AdaBoostRegressor`. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
18. Kearns, M. and Valiant, L. G. (1989). "Cryptographic limitations on learning Boolean formulae and finite automata." In *Proceedings of the 21st Annual ACM Symposium on Theory of Computing*, pp. 433 to 444. (Originating "weak vs strong learnability" question, first posed in Kearns and Valiant, Technical Report TR-14-88, Harvard University, 1988.)
19. OpenCV. "Cascade Classifier" and `haarcascade_frontalface_*.xml`, OpenCV documentation. https://docs.opencv.org/4.x/db/d28/tutorial_cascade_classifier.html