Wisdom of the crowd is the observation that the aggregate judgment of a large group of individuals often produces more accurate estimates or decisions than any single member of that group, including domain experts. In machine learning, this principle forms the theoretical foundation for ensemble methods, which combine the outputs of multiple models to achieve better predictive performance than any individual model alone.
The concept connects a centuries-old statistical insight to modern computational techniques. From bagging and boosting to stacking and voting, ensemble approaches apply the logic of collective intelligence to algorithms, treating each model as an independent "voter" whose errors can be averaged away when combined with sufficiently diverse peers.
Imagine you have a big jar full of jelly beans, and you ask 100 of your friends to guess how many jelly beans are inside. Some friends will guess way too high, and some will guess way too low. But if you take the average of all 100 guesses, the answer will usually be really close to the real number. That is because the mistakes cancel each other out: the people who guessed too high balance out the people who guessed too low.
Computers do the same thing. Instead of using just one program to make a prediction, they use lots of different programs. Each program might make different mistakes. But when you combine all their answers together, the mistakes cancel out and you get a much better answer. That is the wisdom of the crowd for computers.
The earliest well-documented demonstration of crowd wisdom dates to 1906, when the British statistician Sir Francis Galton attended a livestock fair in Plymouth, England. Visitors to the fair were invited to guess the weight of an ox after it had been slaughtered and dressed. Galton collected 787 usable entries and found that the median estimate was 1,207 pounds, while the actual weight was 1,198 pounds. The crowd's median error was only 9 pounds, or roughly 0.8 percent of the true weight. Galton published his findings in a 1907 paper in Nature titled "Vox Populi" (Latin for "voice of the people"), noting that the collective estimate was more accurate than any single expert's guess [1].
The result surprised Galton himself, who had initially expected to demonstrate the unreliability of democratic judgment. Instead, the experiment showed that when individual errors are roughly random and independent, aggregation through averaging or taking the median effectively cancels out the noise.
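Galton's aggregation effect is easy to reproduce in simulation. The sketch below is illustrative only (the guess distribution and the `crowd_estimate` helper are invented here, not Galton's data): each simulated guesser is off by up to 15 percent, yet the median of 787 guesses lands very close to the true weight.

```python
import random
import statistics

def crowd_estimate(true_value, n_guessers=787, spread=0.15, seed=1):
    # Each guesser's estimate is the truth times a random multiplicative
    # error, mimicking large but roughly symmetric individual mistakes.
    rng = random.Random(seed)
    guesses = [true_value * (1 + rng.uniform(-spread, spread))
               for _ in range(n_guessers)]
    return statistics.median(guesses), guesses

# Simulated ox weighing: true dressed weight of 1,198 pounds.
median_guess, guesses = crowd_estimate(1198)
mean_individual_error = statistics.mean(abs(g - 1198) for g in guesses)
```

Even though a typical individual is off by dozens of pounds, the median of all guesses is off by only a few, because the high and low errors cancel.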
The mathematical underpinnings of crowd wisdom predate Galton by more than a century. In 1785, the Marquis de Condorcet formulated what is now known as Condorcet's jury theorem in his Essay on the Application of Analysis to the Probability of Majority Decisions. The theorem addresses a group of voters making a binary decision (for example, guilty or not guilty) and states the following [2]: if each voter independently reaches the correct decision with probability p > 0.5, then the probability that the majority vote is correct increases monotonically with the number of voters and approaches certainty as the group grows. Conversely, if p < 0.5, adding voters makes the majority more likely to be wrong.
The theorem relies on two assumptions: that each voter's competence exceeds chance (p > 0.5), and that votes are statistically independent. When the first condition holds and the second is approximately satisfied, even modest individual accuracy translates into near-certain group correctness as the group size grows. The rate of convergence depends on how far p is from 0.5; when p is close to 0.5, the majority's advantage over chance grows roughly in proportion to (p - 0.5) multiplied by the square root of the number of voters.
Condorcet's theorem has a direct analogue in machine learning. An ensemble of classifiers, each slightly better than random guessing and making independent errors, will converge toward perfect accuracy as the number of classifiers increases. This insight motivates much of ensemble learning theory.
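The classifier analogue of the jury theorem can be checked with a small Monte Carlo sketch (a toy model, assuming perfectly independent classifiers, which real ensembles only approximate):

```python
import random

def majority_accuracy(n_models, p_correct, trials=5000, seed=0):
    # Monte Carlo estimate of majority-vote accuracy for n_models
    # independent classifiers, each correct with probability p_correct.
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        correct_votes = sum(rng.random() < p_correct
                            for _ in range(n_models))
        if correct_votes > n_models / 2:  # strict majority is right
            wins += 1
    return wins / trials
```

With individually weak classifiers (p = 0.6), accuracy climbs steadily as more voters are added: one model is right about 60 percent of the time, while a committee of 51 is right roughly 90 percent of the time.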
In 2004, journalist James Surowiecki popularized the concept in his book The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations. Surowiecki identified four conditions that must hold for a crowd to be "wise" [3]:
| Condition | Description |
|---|---|
| Diversity of opinion | Each person in the group holds private information or a different interpretation of known facts |
| Independence | People's opinions are not determined by those around them |
| Decentralization | People draw on local and specialized knowledge rather than following a central authority |
| Aggregation | A mechanism exists for combining individual judgments into a collective decision |
When any of these conditions breaks down, crowds can produce poor outcomes. Herding behavior, information cascades, and groupthink are examples of failures in independence that cause crowds to amplify errors rather than cancel them. These same failure modes have analogues in ensemble learning, where correlated model errors reduce or eliminate the benefits of combining predictions.
Ensemble learning is the machine learning realization of crowd wisdom. Instead of relying on a single model's prediction, ensemble methods train multiple models and combine their outputs. The theoretical motivation mirrors the statistical argument behind Galton's ox experiment: if each model's errors are partially random and partially independent from other models' errors, then averaging or voting over many models will reduce the total error.
The effectiveness of ensembles can be understood through the bias-variance tradeoff. The expected error of any model can be decomposed into three components: squared bias (systematic error caused by the model's simplifying assumptions), variance (sensitivity of the learned model to fluctuations in the training data), and irreducible noise (randomness in the data that no model can eliminate).
Different ensemble strategies address different components. Bagging primarily reduces variance by averaging out the fluctuations of high-variance models. Boosting primarily reduces bias by iteratively correcting the systematic errors of weak learners. In both cases, the ensemble's total error is typically lower than that of any single constituent model [4].
A more refined decomposition for ensembles introduces a fourth term: diversity. For an ensemble that combines predictions by averaging, the expected error equals the average squared bias of the individual models plus the average variance of the individual models minus the diversity of the ensemble. Diversity measures how much the individual models' predictions disagree with each other. Higher diversity always subtracts from the expected risk, which is why building diverse ensembles is a primary goal in ensemble learning design [5].
The major families of ensemble methods each implement the wisdom of the crowd principle in different ways. The table below summarizes the main approaches.
| Method | Year introduced | Key contributor(s) | Core idea | Reduces |
|---|---|---|---|---|
| Stacking | 1992 | David Wolpert | Train a meta-model on the predictions of base models | Bias and variance |
| Bagging | 1996 | Leo Breiman | Train models on bootstrap samples, then average | Variance |
| AdaBoost | 1997 | Yoav Freund, Robert Schapire | Sequentially reweight misclassified examples | Bias |
| Random forest | 2001 | Leo Breiman | Bagging + random feature subsets for decision trees | Variance |
| Gradient boosting | 2001 | Jerome Friedman | Fit new models to the residuals (gradient of loss) | Bias |
| XGBoost | 2016 | Tianqi Chen, Carlos Guestrin | Regularized gradient boosting with parallelization | Bias and variance |
| LightGBM | 2017 | Microsoft Research | Histogram-based, leaf-wise tree growth | Bias and variance |
| CatBoost | 2017 | Yandex | Ordered boosting, native categorical feature handling | Bias and variance |
Bagging (short for bootstrap aggregating) was introduced by Leo Breiman in 1996 [6]. The procedure works as follows:

1. Draw B bootstrap samples from the training set of n examples, each created by sampling n examples uniformly with replacement.
2. Train one base model on each bootstrap sample.
3. Combine the B predictions by averaging (for regression) or majority vote (for classification).
Bagging is most effective when the base learner is unstable, meaning small changes to the training data produce large changes in the resulting model. Decision trees are the classic example of an unstable learner. By training many trees on slightly different datasets and averaging their predictions, bagging smooths out the high variance of individual trees without substantially increasing bias.
Each bootstrap sample leaves out roughly 36.8 percent of the original data (the out-of-bag or OOB samples). These OOB samples can be used to estimate the ensemble's generalization error without needing a separate validation set.
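A from-scratch sketch of the procedure, using 1-D decision stumps as the unstable base learner (a toy setup; the helper names `fit_stump`, `bag_stumps`, and `bagged_predict` are invented here):

```python
import random

def fit_stump(xs, ys):
    # Decision stump on 1-D data: choose the threshold minimizing
    # training error. Direction is fixed for simplicity: predict
    # True exactly when the input exceeds the threshold.
    points = sorted(set(xs))
    best_t, best_err = points[0] - 1.0, float("inf")
    for a, b in zip(points, points[1:]):
        t = (a + b) / 2
        err = sum((x > t) != y for x, y in zip(xs, ys))
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def bag_stumps(xs, ys, n_models=25, seed=0):
    # Bagging: train each stump on its own bootstrap sample,
    # drawn with replacement from the original training set.
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        thresholds.append(fit_stump([xs[i] for i in idx],
                                    [ys[i] for i in idx]))
    return thresholds

def bagged_predict(thresholds, x):
    # Majority vote over the individual stumps.
    return sum(x > t for t in thresholds) > len(thresholds) / 2

# Toy task: learn "x > 0" from labels corrupted with 20% noise.
rng = random.Random(42)
xs = [rng.uniform(-1, 1) for _ in range(200)]
ys = [(x > 0) != (rng.random() < 0.2) for x in xs]
ensemble = bag_stumps(xs, ys)
```

Each bootstrapped stump lands at a slightly different threshold, but the majority vote recovers the underlying decision boundary despite the label noise.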
Random forests, proposed by Leo Breiman in 2001, extend bagging with an additional source of randomness [7]. At each split in each tree, the algorithm considers only a random subset of the available features rather than all features. This decorrelates the trees in the ensemble, increasing diversity. Even when a few features are strongly predictive, different trees will sometimes be forced to split on weaker features, producing trees that make different types of errors.
Key properties of random forests include:

- Adding more trees does not lead to overfitting; the generalization error converges to a limit as the number of trees grows.
- Out-of-bag samples provide a built-in estimate of generalization error, removing the need for a separate validation set.
- Permuting a feature's values across the out-of-bag samples yields a measure of that feature's importance.
- Few hyperparameters require careful tuning in practice, chiefly the number of trees and the number of features considered at each split.
Boosting builds an ensemble sequentially. Each new model focuses on the examples that previous models handled poorly. The term "boosting" originates from the question of whether a set of weak learners (models only slightly better than random guessing) can be combined into a strong learner (a model with arbitrarily high accuracy). Robert Schapire proved in 1990 that the answer is yes, which laid the groundwork for practical boosting algorithms.
AdaBoost (Adaptive Boosting), introduced by Yoav Freund and Robert Schapire in 1997, was the first practical boosting algorithm [8]. The process works as follows:

1. Assign every training example an equal weight.
2. Train a weak learner on the weighted data and compute its weighted error rate.
3. Assign the learner a voting weight (alpha) that increases as its error decreases.
4. Increase the weights of misclassified examples and decrease the weights of correctly classified ones, then renormalize.
5. Repeat steps 2 through 4 for a fixed number of rounds. The final prediction is the alpha-weighted vote of all the weak learners.
Freund and Schapire received the 2003 Gödel Prize, a major award in theoretical computer science, for their work on AdaBoost.
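The reweighting loop can be sketched in a few dozen lines with 1-D decision stumps as weak learners (a minimal toy version; the helper names `stump_pred`, `adaboost`, and `ada_predict` are invented here, and the clipping constant is an implementation convenience):

```python
import math

def stump_pred(t, pol, x):
    # Weak learner: pol * sign(x - t), treating sign(0) as -1.
    return pol * (1 if x > t else -1)

def adaboost(xs, ys, n_rounds=20):
    # Minimal AdaBoost for 1-D inputs; labels ys must be in {-1, +1}.
    n = len(xs)
    w = [1.0 / n] * n          # step 1: equal initial weights
    model = []                 # list of (alpha, threshold, polarity)
    for _ in range(n_rounds):
        # step 2: pick the stump with the lowest weighted error
        err, t, pol = min(
            ((sum(wi for wi, x, y in zip(w, xs, ys)
                  if stump_pred(tc, pc, x) != y), tc, pc)
             for tc in xs for pc in (1, -1)),
            key=lambda cand: cand[0])
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard the log
        alpha = 0.5 * math.log((1 - err) / err)  # step 3
        model.append((alpha, t, pol))
        # step 4: up-weight mistakes, down-weight correct examples
        w = [wi * math.exp(-alpha * y * stump_pred(t, pol, x))
             for wi, x, y in zip(w, xs, ys)]
        z = sum(w)
        w = [wi / z for wi in w]
    return model, w

def ada_predict(model, x):
    # step 5: alpha-weighted vote of all weak learners
    score = sum(a * stump_pred(t, pol, x) for a, t, pol in model)
    return 1 if score > 0 else -1

# Interval target that no single stump can represent:
xs = [i / 10 for i in range(30)]
ys = [1 if 1.0 <= x < 2.0 else -1 for x in xs]
model, w = adaboost(xs, ys)
```

A useful sanity check on the reweighting rule: immediately after a round, the stump just added has a weighted error of exactly one half under the new weights, which forces the next round to find a genuinely different stump.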
Jerome Friedman generalized boosting in his 2001 paper "Greedy Function Approximation: A Gradient Boosting Machine" [9]. Instead of reweighting examples, gradient boosting frames the ensemble construction as gradient descent in function space. At each step, a new model is fit to the negative gradient (pseudo-residuals) of the loss function with respect to the current ensemble's predictions. This allows gradient boosting to optimize any differentiable loss function, making it far more flexible than AdaBoost.
Friedman also introduced stochastic gradient boosting, which trains each new tree on a random subsample of the training data. This modification, inspired by bagging, improved both accuracy and training speed.
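The residual-fitting idea reduces to very little code for squared loss, where the negative gradient is simply the residual. The following is a minimal sketch (toy regression stumps, not Friedman's full algorithm; `fit_residual_stump`, `gradient_boost`, and `gb_predict` are names invented here):

```python
def fit_residual_stump(xs, residuals):
    # Regression stump: a split point plus left/right means, chosen
    # to minimize squared error against the current residuals.
    best = None
    points = sorted(set(xs))
    for t in points[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]

def gradient_boost(xs, ys, n_rounds=50, lr=0.1):
    # Gradient boosting for squared loss: each new stump is fit to
    # the residuals (the negative gradient of the loss) and added
    # to the ensemble with a small learning rate.
    f0 = sum(ys) / len(ys)     # initial constant model
    preds = [f0] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        resid = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_residual_stump(xs, resid)
        stumps.append((t, lm, rm))
        preds = [p + lr * (lm if x <= t else rm)
                 for p, x in zip(preds, xs)]
    return f0, stumps

def gb_predict(f0, stumps, x, lr=0.1):
    return f0 + lr * sum(lm if x <= t else rm for t, lm, rm in stumps)

# Fit y = x^2 on a small grid; each round chips away at the residuals.
xs = [i / 10 for i in range(-10, 11)]
ys = [x * x for x in xs]
f0, stumps = gradient_boost(xs, ys)
```

Because each stump is fit directly to what the current ensemble still gets wrong, training error shrinks round by round, far below what the initial constant model achieves.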
Three open-source libraries have made gradient boosting one of the most widely used machine learning techniques, particularly for tabular data.
| Framework | Developer | Key innovation | Typical use case |
|---|---|---|---|
| XGBoost | Tianqi Chen (University of Washington) | L1/L2 regularization, column subsampling, parallelized tree construction | General-purpose tabular data |
| LightGBM | Microsoft Research | Gradient-based one-side sampling (GOSS), exclusive feature bundling, leaf-wise tree growth | Very large datasets requiring fast training |
| CatBoost | Yandex | Ordered boosting to prevent target leakage, native handling of categorical features | Datasets with many categorical columns |
These frameworks dominate machine learning competitions. On platforms like Kaggle, gradient boosting methods consistently appear in winning solutions for tabular data problems, and many top-ranking submissions blend multiple gradient boosting models together.
Stacking was introduced by David Wolpert in 1992 [10]. Unlike bagging and boosting, which combine models of the same type, stacking typically combines heterogeneous models. The procedure has two levels:

1. Level 0: train several base learners (for example, a random forest, a gradient boosting model, and a neural network) and collect their out-of-fold predictions on the training data, typically via cross-validation.
2. Level 1: train a meta-learner that takes the base learners' predictions as input features and learns the best way to combine them.
The meta-learner discovers patterns in how the base learners err. For instance, it might learn that the neural network is more reliable for certain types of inputs while the random forest is better for others. Stacking can theoretically represent any other ensemble technique and often outperforms simpler combination strategies.
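A minimal two-level sketch makes the out-of-fold machinery concrete (a toy setup with invented helpers `knn1`, `mean_model`, and `stack_weight`; the meta-learner here is just a one-parameter blend fit by grid search rather than a full model):

```python
def knn1(train_x, train_y, x):
    # Base model A: 1-nearest-neighbour regression.
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

def mean_model(train_x, train_y, x):
    # Base model B: predict the training-set mean everywhere.
    return sum(train_y) / len(train_y)

def stack_weight(xs, ys, k=5):
    # Level 0: out-of-fold predictions, so the meta-learner never
    # sees a base model's prediction on its own training data.
    n = len(xs)
    meta = []  # rows of (pred_A, pred_B, target)
    for f in range(k):
        tx = [x for i, x in enumerate(xs) if i % k != f]
        ty = [y for i, y in enumerate(ys) if i % k != f]
        for i in range(n):
            if i % k == f:
                meta.append((knn1(tx, ty, xs[i]),
                             mean_model(tx, ty, xs[i]),
                             ys[i]))
    # Level 1: choose w minimizing squared error of the blend
    # w * pred_A + (1 - w) * pred_B over the out-of-fold rows.
    return min((w / 20 for w in range(21)),
               key=lambda w: sum((w * a + (1 - w) * b - y) ** 2
                                 for a, b, y in meta))

# Identity target: the nearest-neighbour model is far more reliable
# than the global mean, so the meta-learner should favour it heavily.
xs = [i / 10 for i in range(40)]
ys = list(xs)
w = stack_weight(xs, ys)
```

On this data the learned blend weight lands near 1.0, i.e. the meta-learner discovers that base model A is almost always the one to trust.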
Voting is the simplest ensemble combination method. Multiple independently trained models each cast a "vote" for their predicted outcome. There are two variants:

- Hard voting: each model votes for a single class label, and the label with the most votes wins.
- Soft voting: the models' predicted class probabilities are averaged, and the class with the highest average probability wins.
Soft voting generally outperforms hard voting because it uses more information (each model's confidence level) and is more robust to noise. However, it works best when the base models produce well-calibrated probability estimates.
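The difference is easy to see in a binary example (a toy illustration with invented helpers `hard_vote` and `soft_vote`): one confident "yes" can outweigh two lukewarm "no"s under soft voting, while hard voting discards the confidence entirely.

```python
def hard_vote(probs, threshold=0.5):
    # Each model casts a discrete vote; the majority of votes wins.
    votes = [p >= threshold for p in probs]
    return sum(votes) > len(votes) / 2

def soft_vote(probs, threshold=0.5):
    # Average the predicted probabilities, then threshold once.
    return sum(probs) / len(probs) >= threshold

# Predicted probabilities of the positive class from three models:
# one confident "yes" (0.9) and two lukewarm "no"s (0.4, 0.45).
probs = [0.9, 0.4, 0.45]
hard = hard_vote(probs)   # False: two of three votes are "no"
soft = soft_vote(probs)   # True: the average probability is about 0.58
```

Hard voting rejects the positive class 2-to-1, while soft voting accepts it because the single confident model pulls the averaged probability above the threshold.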
The mixture of experts (MoE) approach, introduced by Robert Jacobs, Michael Jordan, Steven Nowlan, and Geoffrey Hinton in 1991, divides the input space into regions and assigns specialized "expert" models to each region. A gating network learns which expert to consult for each input. Unlike voting or averaging, which give every model a say on every input, MoE routes each input to the most appropriate expert or weighted combination of experts.
MoE has seen a resurgence in modern large language models, where models such as Mixtral and the Switch Transformer use sparse mixtures of experts to scale model capacity without proportionally increasing computational cost.
Diversity among ensemble members is the single most important factor determining whether an ensemble will outperform its individual components. This mirrors Surowiecki's "diversity of opinion" condition for wise crowds.
Ensemble methods create diversity through several mechanisms:
| Diversity mechanism | How it works | Methods that use it |
|---|---|---|
| Data sampling | Each model trains on a different subset of the training data | Bagging, random forests |
| Feature sampling | Each model or each split considers a different subset of features | Random forests, random subspace method |
| Algorithm variation | Different learning algorithms with different inductive biases | Stacking, voting |
| Hyperparameter variation | Same algorithm with different hyperparameter settings | Hyperparameter ensembles |
| Initialization randomness | Neural networks with different random weight initializations | Deep ensembles |
| Sequential reweighting | Each model focuses on the examples that previous models got wrong | AdaBoost, gradient boosting |
| Output perturbation | Adding random noise to model outputs during training | Output smearing |
The benefit of an ensemble depends directly on the degree of error correlation among its members. Consider N models, each with expected squared error of sigma-squared. If their errors are perfectly correlated (all models make the same mistakes), averaging produces no improvement; the ensemble error remains sigma-squared. If their errors are completely uncorrelated, averaging reduces the ensemble error to sigma-squared divided by N.
In practice, errors are partially correlated. The ensemble error can be expressed as:
Ensemble error = (1/N) * average_variance + (1 - 1/N) * average_covariance
This formula shows that adding more models always reduces the first term but never eliminates the second. Once the ensemble is large enough, the error is dominated by the average covariance between models. Therefore, reducing error correlation through diversity is more valuable than simply adding more models of the same type.
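A small simulation makes the covariance effect tangible (a toy model with the invented helper `ensemble_error_variance`; each model's error is built from a shared component plus an independent one, giving pairwise correlation rho):

```python
import random

def ensemble_error_variance(n_models, rho, trials=20000, seed=0):
    # Each model's error: e_i = sqrt(rho)*z + sqrt(1-rho)*u_i, where
    # z is shared and u_i is private, so Var(e_i) = 1 and the error
    # correlation between any two models is rho. Returns the Monte
    # Carlo estimate of the variance of the averaged ensemble error.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        z = rng.gauss(0, 1)  # shared (correlated) error component
        errs = [rho ** 0.5 * z + (1 - rho) ** 0.5 * rng.gauss(0, 1)
                for _ in range(n_models)]
        mean_err = sum(errs) / n_models
        total += mean_err ** 2
    return total / trials
```

With ten models of unit error variance, independent errors (rho = 0) shrink the ensemble error variance to about 0.1, while strongly correlated errors (rho = 0.8) leave it near 0.82, matching the formula rho + (1 - rho)/N.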
Negative correlation learning (NCL) is a technique that explicitly encourages ensemble members to make different errors during training. Each model's loss function includes a penalty term that discourages agreement with the other models in the ensemble. Both theoretical analysis and experiments have shown that training neural networks with negative error correlation produces ensembles with better generalization ability than training each network independently [11].
The wisdom of the crowd principle extends naturally to deep learning, though the high computational cost of training large neural networks makes traditional ensemble approaches expensive.
The simplest approach is to train multiple neural networks independently with different random initializations and average their predictions. Despite the lack of any explicit diversity mechanism beyond initialization randomness, deep ensembles consistently improve accuracy and produce well-calibrated uncertainty estimates. This works because the loss landscape of deep neural networks contains many different local minima, and networks initialized differently converge to different solutions that make partially independent errors.
Snapshot ensembles, proposed by Huang et al. in 2017, provide a way to obtain multiple models from a single training run. The technique uses a cyclical learning rate schedule that periodically drives the learning rate to a high value and then anneals it back down. At the end of each annealing cycle, the model has converged to a different local minimum, and a "snapshot" of the weights is saved. The predictions from all saved snapshots are averaged at test time, producing an ensemble from what would otherwise be a single training run.
Dropout, normally used as a regularization technique during training, can also serve as an ensemble method at inference time. By keeping dropout active during prediction and running the same input through the network multiple times, each forward pass produces a different prediction because different neurons are randomly disabled. Averaging these predictions approximates Bayesian model averaging over an exponentially large number of subnetworks.
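The mechanics can be sketched with a toy linear "network" (illustrative only; the helper `mc_dropout_predict` and the inverted-dropout scaling convention are assumptions of this sketch, not a real deep learning API):

```python
import random

def mc_dropout_predict(weights, x, keep_prob=0.5, passes=2000, seed=0):
    # Monte Carlo dropout: average many stochastic forward passes,
    # each with a fresh random dropout mask over the weights.
    rng = random.Random(seed)
    preds = []
    for _ in range(passes):
        # Drop each weight with probability 1 - keep_prob; scale
        # survivors by 1/keep_prob (inverted dropout) so the
        # expectation matches the deterministic forward pass.
        masked = [w / keep_prob if rng.random() < keep_prob else 0.0
                  for w in weights]
        preds.append(sum(w * xi for w, xi in zip(masked, x)))
    mean = sum(preds) / passes
    var = sum((p - mean) ** 2 for p in preds) / passes
    return mean, var

mean, var = mc_dropout_predict([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

The averaged prediction converges to the deterministic output of the full network, while the spread across passes serves as an uncertainty estimate: inputs on which the subnetworks disagree yield high variance.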
In the context of large language models, ensemble techniques are applied through methods such as model merging, where the weight parameters of multiple fine-tuned models are combined without retraining. Techniques like linear interpolation of weights (model soups), SLERP (spherical linear interpolation), and TIES (trim, elect sign, and merge) allow practitioners to blend the capabilities of different models into a single set of weights.
The Netflix Prize (2006 to 2009) is perhaps the most famous demonstration of ensemble methods in practice. Netflix offered a $1 million prize to any team that could improve their recommendation algorithm's accuracy by 10 percent. The winning team, BellKor's Pragmatic Chaos, achieved a 10.06 percent improvement by combining over 100 individual models through blending, a form of stacking [12]. The team itself was a merger of three separate teams (BellKor, Pragmatic Theory, and BigChaos), each bringing different modeling approaches. The final solution linearly blended predictions from matrix factorization models, neighborhood-based models, and restricted Boltzmann machines, among others.
The Netflix Prize demonstrated a key lesson: the marginal gain from adding more models to an ensemble decreases as the ensemble grows. The first few diverse models provide large improvements, but each additional model contributes less. Netflix ultimately decided not to deploy the winning ensemble in production because the marginal improvement over a simpler solution did not justify the engineering complexity.
Ensemble methods have become standard practice in machine learning competitions. On Kaggle, the vast majority of winning solutions for tabular data problems use some form of ensembling, typically combining multiple gradient boosting models (XGBoost, LightGBM, CatBoost) with stacking or blending. Some winning solutions have used multi-level stacking with dozens of base models feeding into multiple layers of meta-learners.
Condorcet's jury theorem has been directly applied to medical imaging, where multiple physicians independently assess the same diagnostic image. Aggregating their assessments through majority voting produces diagnoses more accurate than those of any single physician. Ensemble methods in automated medical imaging follow the same logic, combining multiple convolutional neural networks trained on the same diagnostic task.
The wisdom of the crowd, whether applied to human groups or machine learning ensembles, can fail under specific conditions.
When ensemble members tend to make the same mistakes, combining their predictions amplifies rather than mitigates errors. This can happen when models are trained on the same data with similar algorithms, or when the training data itself contains systematic biases. Research examining collective decision-making dynamics has found scenarios where the accuracy of group decisions actually decreases as group size increases, specifically when members share highly correlated information [13].
Ensembles require training and maintaining multiple models, which multiplies computational cost, memory usage, and inference latency. In real-time applications, running predictions through dozens of models may not be practical. This is why techniques like snapshot ensembles, knowledge distillation (training a single student model to mimic an ensemble), and model merging have become popular alternatives.
A single decision tree is easy to interpret. A random forest of 500 trees is not. As ensembles grow in size and complexity, understanding why the ensemble made a particular prediction becomes increasingly difficult. Feature importance scores and SHAP values can provide partial explanations, but the full decision process of a large ensemble is generally opaque.
The error reduction from adding ensemble members follows a law of diminishing returns. Most of the benefit comes from the first handful of diverse models. After a certain point, adding more models of similar type provides negligible improvement while increasing computational cost.
Both Condorcet's theorem and ensemble learning theory assume some degree of independence among members. When models are trained on overlapping data, use similar feature representations, or share architectural designs, their errors become correlated, and the theoretical guarantees of the wisdom of the crowd weaken. This parallels Surowiecki's independence condition: social influence and herding behavior cause crowds to fail for the same mathematical reason that correlated classifiers produce weak ensembles.
The following table compares the major ensemble strategies across several practical dimensions.
| Strategy | Parallelizable | Base learner type | Typical number of models | Handles high-dimensional data | Interpretability |
|---|---|---|---|---|---|
| Bagging | Yes | Homogeneous (usually trees) | 50 to 500 | Moderate | Low |
| Random forest | Yes | Homogeneous (decision trees) | 100 to 1,000 | Good | Low to moderate |
| AdaBoost | No (sequential) | Homogeneous (weak learners) | 50 to 200 | Moderate | Low |
| Gradient boosting | No (sequential) | Homogeneous (usually trees) | 100 to 10,000 | Good | Low |
| Stacking | Partially (base learners yes, meta-learner no) | Heterogeneous | 3 to 20 | Good | Very low |
| Voting | Yes | Heterogeneous | 3 to 10 | Good | Moderate |
| Mixture of experts | Partially | Heterogeneous | Varies | Good | Moderate |
For a group of n independent voters each with accuracy p > 0.5, the probability that the majority is correct is given by the sum of binomial probabilities for all majority sizes. As n approaches infinity, this probability approaches 1. The speed of convergence depends on the margin (p - 0.5): larger margins yield faster convergence.
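The binomial sum can be evaluated exactly in a few lines (the helper name `majority_correct` is invented here; a strict majority is assumed, so odd n avoids ties):

```python
from math import comb

def majority_correct(n, p):
    # Exact probability that a strict majority of n independent
    # voters, each correct with probability p, reaches the right
    # answer: sum of binomial probabilities over all majority sizes.
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))
```

For p = 0.6, a single voter is right 60 percent of the time, three voters reach 64.8 percent, and 101 voters exceed 97 percent, illustrating the convergence toward certainty.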
For an ensemble of N models using simple averaging, the expected squared error can be decomposed as:
Expected error = average_bias_squared + average_variance - diversity
Where diversity is the average squared difference between each model's prediction and the ensemble's prediction. This decomposition shows that diversity always reduces the ensemble's error. The decomposition has been extended beyond squared loss to cross-entropy loss and Poisson loss [5].
Another way to express the ensemble advantage is through the ambiguity decomposition, attributed to Krogh and Vedelsby (1994):
Ensemble error = average_individual_error - average_ambiguity
Ambiguity measures the average squared deviation of individual predictions from the ensemble prediction. This decomposition guarantees that the ensemble error is always less than or equal to the average error of the individual models, provided the combination rule is simple averaging.
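The identity can be verified numerically for any set of predictions (the helper `ambiguity_decomposition` is a name invented for this sketch; it assumes squared loss and simple averaging, as the decomposition requires):

```python
def ambiguity_decomposition(preds, target):
    # Check Krogh and Vedelsby's identity for squared loss with
    # simple averaging: the squared error of the averaged prediction
    # equals the average individual error minus the average ambiguity.
    n = len(preds)
    ens = sum(preds) / n
    ens_error = (ens - target) ** 2
    avg_error = sum((p - target) ** 2 for p in preds) / n
    avg_ambiguity = sum((p - ens) ** 2 for p in preds) / n
    return ens_error, avg_error - avg_ambiguity

# Three disagreeing models predicting a target of 2.5:
lhs, rhs = ambiguity_decomposition([1.0, 2.0, 4.0], 2.5)
```

Both sides of the identity agree to machine precision, and because ambiguity is non-negative, the ensemble's error can never exceed the average error of its members.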