Wisdom of the crowd is the observation that the aggregate judgment of a large group of individuals often produces more accurate estimates or decisions than any single member of that group, including domain experts. In machine learning, this principle forms the theoretical foundation for ensemble methods, which combine the outputs of multiple models to achieve better predictive performance than any individual model alone.
The concept connects a centuries-old statistical insight to modern computational techniques. From bagging and boosting to stacking and voting, ensemble approaches apply the logic of collective intelligence to algorithms, treating each model as an independent "voter" whose errors can be averaged away when combined with sufficiently diverse peers.
Imagine you have a big jar full of jelly beans, and you ask 100 of your friends to guess how many jelly beans are inside. Some friends will guess way too high, and some will guess way too low. But if you take the average of all 100 guesses, the answer will usually be really close to the real number. That is because the mistakes cancel each other out: the people who guessed too high balance out the people who guessed too low.
Computers do the same thing. Instead of using just one program to make a prediction, they use lots of different programs. Each program might make different mistakes. But when you combine all their answers together, the mistakes cancel out and you get a much better answer. That is the wisdom of the crowd for computers.
The earliest well-documented demonstration of crowd wisdom dates to 1906, when the British statistician Sir Francis Galton attended a livestock fair in Plymouth, England. Visitors to the fair were invited to guess the weight of an ox after it had been slaughtered and dressed. Galton collected 787 usable entries and found that the median estimate was 1,207 pounds, while the actual weight was 1,198 pounds. The crowd's median error was only 9 pounds, or roughly 0.8 percent of the true weight. Galton published his findings in a 1907 paper in Nature titled "Vox Populi" (Latin for "voice of the people"), noting that the collective estimate was more accurate than any single expert's guess [1].
The result surprised Galton himself, who had initially expected to demonstrate the unreliability of democratic judgment. Instead, the experiment showed that when individual errors are roughly random and independent, aggregation through averaging or taking the median effectively cancels out the noise.
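Galton's aggregation effect is easy to reproduce in simulation. The sketch below is illustrative only (the guess distribution and the `crowd_estimate` helper are invented here, not Galton's data): each simulated guesser is off by up to 15 percent, yet the median of 787 guesses lands very close to the true weight.

```python
import random
import statistics

def crowd_estimate(true_value, n_guessers=787, spread=0.15, seed=1):
    # Each guesser's estimate is the truth times a random multiplicative
    # error, mimicking large but roughly symmetric individual mistakes.
    rng = random.Random(seed)
    guesses = [true_value * (1 + rng.uniform(-spread, spread))
               for _ in range(n_guessers)]
    return statistics.median(guesses), guesses

# Simulated ox weighing: true dressed weight of 1,198 pounds.
median_guess, guesses = crowd_estimate(1198)
mean_individual_error = statistics.mean(abs(g - 1198) for g in guesses)
```

Even though a typical individual is off by dozens of pounds, the median of all guesses is off by only a few, because the high and low errors cancel.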
The mathematical underpinnings of crowd wisdom predate Galton by more than a century. In 1785, the Marquis de Condorcet formulated what is now known as Condorcet's jury theorem in his Essay on the Application of Analysis to the Probability of Majority Decisions. The theorem addresses a group of voters making a binary decision (for example, guilty or not guilty) and states the following [2]: if each voter independently reaches the correct decision with probability p > 0.5, then the probability that the majority vote is correct increases monotonically with the number of voters and approaches certainty as the group grows. Conversely, if p < 0.5, adding voters makes the majority more likely to be wrong.
The theorem relies on two assumptions: that each voter's competence exceeds chance (p > 0.5), and that votes are statistically independent. When the first condition holds and the second is approximately satisfied, even modest individual accuracy translates into near-certain group correctness as the group size grows. The rate of convergence depends on how far p is from 0.5; when p is close to 0.5, the majority's advantage over chance grows roughly in proportion to (p - 0.5) multiplied by the square root of the number of voters.
Condorcet's theorem has a direct analogue in machine learning. An ensemble of classifiers, each slightly better than random guessing and making independent errors, will converge toward perfect accuracy as the number of classifiers increases. This insight motivates much of ensemble learning theory.
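The classifier analogue of the jury theorem can be checked with a small Monte Carlo sketch (a toy model, assuming perfectly independent classifiers, which real ensembles only approximate):

```python
import random

def majority_accuracy(n_models, p_correct, trials=5000, seed=0):
    # Monte Carlo estimate of majority-vote accuracy for n_models
    # independent classifiers, each correct with probability p_correct.
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        correct_votes = sum(rng.random() < p_correct
                            for _ in range(n_models))
        if correct_votes > n_models / 2:  # strict majority is right
            wins += 1
    return wins / trials
```

With individually weak classifiers (p = 0.6), accuracy climbs steadily as more voters are added: one model is right about 60 percent of the time, while a committee of 51 is right roughly 90 percent of the time.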
In 2004, journalist James Surowiecki popularized the concept in his book The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations. Surowiecki identified four conditions that must hold for a crowd to be "wise" [3]:
| Condition | Description |
|---|---|
| Diversity of opinion | Each person in the group holds private information or a different interpretation of known facts |
| Independence | People's opinions are not determined by those around them |
| Decentralization | People draw on local and specialized knowledge rather than following a central authority |
| Aggregation | A mechanism exists for combining individual judgments into a collective decision |
When any of these conditions breaks down, crowds can produce poor outcomes. Herding behavior, information cascades, and groupthink are examples of failures in independence that cause crowds to amplify errors rather than cancel them. These same failure modes have analogues in ensemble learning, where correlated model errors reduce or eliminate the benefits of combining predictions.
Ensemble learning is the machine learning realization of crowd wisdom. Instead of relying on a single model's prediction, ensemble methods train multiple models and combine their outputs. The theoretical motivation mirrors the statistical argument behind Galton's ox experiment: if each model's errors are partially random and partially independent from other models' errors, then averaging or voting over many models will reduce the total error.
The effectiveness of ensembles can be understood through the bias-variance tradeoff. The expected error of any model can be decomposed into three components: squared bias (systematic error caused by the model's simplifying assumptions), variance (sensitivity of the learned model to fluctuations in the training data), and irreducible noise (randomness in the data that no model can eliminate).
Different ensemble strategies address different components. Bagging primarily reduces variance by averaging out the fluctuations of high-variance models. Boosting primarily reduces bias by iteratively correcting the systematic errors of weak learners. In both cases, the ensemble's total error is typically lower than that of any single constituent model [4].
A more refined decomposition for ensembles introduces a fourth term: diversity. For an ensemble that combines predictions by averaging, the expected error equals the average squared bias of the individual models plus the average variance of the individual models minus the diversity of the ensemble. Diversity measures how much the individual models' predictions disagree with each other. Higher diversity always subtracts from the expected risk, which is why building diverse ensembles is a primary goal in ensemble learning design [5].
The major families of ensemble methods each implement the wisdom of the crowd principle in different ways. The table below summarizes the main approaches.
| Method | Year introduced | Key contributor(s) | Core idea | Reduces |
|---|---|---|---|---|
| Stacking | 1992 | David Wolpert | Train a meta-model on the predictions of base models | Bias and variance |
| Bagging | 1996 | Leo Breiman | Train models on bootstrap samples, then average | Variance |
| AdaBoost | 1997 | Yoav Freund, Robert Schapire | Sequentially reweight misclassified examples | Bias |
| Random forest | 2001 | Leo Breiman | Bagging + random feature subsets for decision trees | Variance |
| Gradient boosting | 2001 | Jerome Friedman | Fit new models to the residuals (gradient of loss) | Bias |
| XGBoost | 2016 | Tianqi Chen, Carlos Guestrin | Regularized gradient boosting with parallelization | Bias and variance |
| LightGBM | 2017 | Microsoft Research | Histogram-based, leaf-wise tree growth | Bias and variance |
| CatBoost | 2017 | Yandex | Ordered boosting, native categorical feature handling | Bias and variance |
Bagging (short for bootstrap aggregating) was introduced by Leo Breiman in 1996 [6]. The procedure works as follows:

1. Draw B bootstrap samples from the training set of n examples, each created by sampling n examples uniformly with replacement.
2. Train one base model on each bootstrap sample.
3. Combine the B predictions by averaging (for regression) or majority vote (for classification).
Bagging is most effective when the base learner is unstable, meaning small changes to the training data produce large changes in the resulting model. Decision trees are the classic example of an unstable learner. By training many trees on slightly different datasets and averaging their predictions, bagging smooths out the high variance of individual trees without substantially increasing bias.
Each bootstrap sample leaves out roughly 36.8 percent of the original data (the out-of-bag or OOB samples). These OOB samples can be used to estimate the ensemble's generalization error without needing a separate validation set.
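A from-scratch sketch of the procedure, using 1-D decision stumps as the unstable base learner (a toy setup; the helper names `fit_stump`, `bag_stumps`, and `bagged_predict` are invented here):

```python
import random

def fit_stump(xs, ys):
    # Decision stump on 1-D data: choose the threshold minimizing
    # training error. Direction is fixed for simplicity: predict
    # True exactly when the input exceeds the threshold.
    points = sorted(set(xs))
    best_t, best_err = points[0] - 1.0, float("inf")
    for a, b in zip(points, points[1:]):
        t = (a + b) / 2
        err = sum((x > t) != y for x, y in zip(xs, ys))
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def bag_stumps(xs, ys, n_models=25, seed=0):
    # Bagging: train each stump on its own bootstrap sample,
    # drawn with replacement from the original training set.
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        thresholds.append(fit_stump([xs[i] for i in idx],
                                    [ys[i] for i in idx]))
    return thresholds

def bagged_predict(thresholds, x):
    # Majority vote over the individual stumps.
    return sum(x > t for t in thresholds) > len(thresholds) / 2

# Toy task: learn "x > 0" from labels corrupted with 20% noise.
rng = random.Random(42)
xs = [rng.uniform(-1, 1) for _ in range(200)]
ys = [(x > 0) != (rng.random() < 0.2) for x in xs]
ensemble = bag_stumps(xs, ys)
```

Each bootstrapped stump lands at a slightly different threshold, but the majority vote recovers the underlying decision boundary despite the label noise.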
Random forests, proposed by Leo Breiman in 2001, extend bagging with an additional source of randomness [7]. At each split in each tree, the algorithm considers only a random subset of the available features rather than all features. This decorrelates the trees in the ensemble, increasing diversity. Even when a few features are strongly predictive, different trees will sometimes be forced to split on weaker features, producing trees that make different types of errors.
Key properties of random forests include:

- Adding more trees does not lead to overfitting; the generalization error converges to a limit as the number of trees grows.
- Out-of-bag samples provide a built-in estimate of generalization error, removing the need for a separate validation set.
- Permuting a feature's values across the out-of-bag samples yields a measure of that feature's importance.
- Few hyperparameters require careful tuning in practice, chiefly the number of trees and the number of features considered at each split.
Boosting builds an ensemble sequentially. Each new model focuses on the examples that previous models handled poorly. The term "boosting" originates from the question of whether a set of weak learners (models only slightly better than random guessing) can be combined into a strong learner (a model with arbitrarily high accuracy). Robert Schapire proved in 1990 that the answer is yes, which laid the groundwork for practical boosting algorithms.
AdaBoost (Adaptive Boosting), introduced by Yoav Freund and Robert Schapire in 1997, was the first practical boosting algorithm [8]. The process works as follows:

1. Assign every training example an equal weight.
2. Train a weak learner on the weighted data and compute its weighted error rate.
3. Assign the learner a voting weight (alpha) that increases as its error decreases.
4. Increase the weights of misclassified examples and decrease the weights of correctly classified ones, then renormalize.
5. Repeat steps 2 through 4 for a fixed number of rounds. The final prediction is the alpha-weighted vote of all the weak learners.
Freund and Schapire received the 2003 Gödel Prize, a major award in theoretical computer science, for their work on AdaBoost.
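The reweighting loop can be sketched in a few dozen lines with 1-D decision stumps as weak learners (a minimal toy version; the helper names `stump_pred`, `adaboost`, and `ada_predict` are invented here, and the clipping constant is an implementation convenience):

```python
import math

def stump_pred(t, pol, x):
    # Weak learner: pol * sign(x - t), treating sign(0) as -1.
    return pol * (1 if x > t else -1)

def adaboost(xs, ys, n_rounds=20):
    # Minimal AdaBoost for 1-D inputs; labels ys must be in {-1, +1}.
    n = len(xs)
    w = [1.0 / n] * n          # step 1: equal initial weights
    model = []                 # list of (alpha, threshold, polarity)
    for _ in range(n_rounds):
        # step 2: pick the stump with the lowest weighted error
        err, t, pol = min(
            ((sum(wi for wi, x, y in zip(w, xs, ys)
                  if stump_pred(tc, pc, x) != y), tc, pc)
             for tc in xs for pc in (1, -1)),
            key=lambda cand: cand[0])
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard the log
        alpha = 0.5 * math.log((1 - err) / err)  # step 3
        model.append((alpha, t, pol))
        # step 4: up-weight mistakes, down-weight correct examples
        w = [wi * math.exp(-alpha * y * stump_pred(t, pol, x))
             for wi, x, y in zip(w, xs, ys)]
        z = sum(w)
        w = [wi / z for wi in w]
    return model, w

def ada_predict(model, x):
    # step 5: alpha-weighted vote of all weak learners
    score = sum(a * stump_pred(t, pol, x) for a, t, pol in model)
    return 1 if score > 0 else -1

# Interval target that no single stump can represent:
xs = [i / 10 for i in range(30)]
ys = [1 if 1.0 <= x < 2.0 else -1 for x in xs]
model, w = adaboost(xs, ys)
```

A useful sanity check on the reweighting rule: immediately after a round, the stump just added has a weighted error of exactly one half under the new weights, which forces the next round to find a genuinely different stump.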
Jerome Friedman generalized boosting in his 2001 paper "Greedy Function Approximation: A Gradient Boosting Machine" [9]. Instead of reweighting examples, gradient boosting frames the ensemble construction as gradient descent in function space. At each step, a new model is fit to the negative gradient (pseudo-residuals) of the loss function with respect to the current ensemble's predictions. This allows gradient boosting to optimize any differentiable loss function, making it far more flexible than AdaBoost.
Friedman also introduced stochastic gradient boosting, which trains each new tree on a random subsample of the training data. This modification, inspired by bagging, improved both accuracy and training speed.
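The residual-fitting idea reduces to very little code for squared loss, where the negative gradient is simply the residual. The following is a minimal sketch (toy regression stumps, not Friedman's full algorithm; `fit_residual_stump`, `gradient_boost`, and `gb_predict` are names invented here):

```python
def fit_residual_stump(xs, residuals):
    # Regression stump: a split point plus left/right means, chosen
    # to minimize squared error against the current residuals.
    best = None
    points = sorted(set(xs))
    for t in points[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]

def gradient_boost(xs, ys, n_rounds=50, lr=0.1):
    # Gradient boosting for squared loss: each new stump is fit to
    # the residuals (the negative gradient of the loss) and added
    # to the ensemble with a small learning rate.
    f0 = sum(ys) / len(ys)     # initial constant model
    preds = [f0] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        resid = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_residual_stump(xs, resid)
        stumps.append((t, lm, rm))
        preds = [p + lr * (lm if x <= t else rm)
                 for p, x in zip(preds, xs)]
    return f0, stumps

def gb_predict(f0, stumps, x, lr=0.1):
    return f0 + lr * sum(lm if x <= t else rm for t, lm, rm in stumps)

# Fit y = x^2 on a small grid; each round chips away at the residuals.
xs = [i / 10 for i in range(-10, 11)]
ys = [x * x for x in xs]
f0, stumps = gradient_boost(xs, ys)
```

Because each stump is fit directly to what the current ensemble still gets wrong, training error shrinks round by round, far below what the initial constant model achieves.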
Three open-source libraries have made gradient boosting one of the most widely used machine learning techniques, particularly for tabular data.
| Framework | Developer | Key innovation | Typical use case |
|---|---|---|---|
| XGBoost | Tianqi Chen (University of Washington) | L1/L2 regularization, column subsampling, parallelized tree construction | General-purpose tabular data |
| LightGBM | Microsoft Research | Gradient-based one-side sampling (GOSS), exclusive feature bundling, leaf-wise tree growth | Very large datasets requiring fast training |
| CatBoost | Yandex | Ordered boosting to prevent target leakage, native handling of categorical features | Datasets with many categorical columns |
These frameworks dominate machine learning competitions. On platforms like Kaggle, gradient boosting methods consistently appear in winning solutions for tabular data problems, and many top-ranking submissions blend multiple gradient boosting models together.
Stacking was introduced by David Wolpert in 1992 [10]. Unlike bagging and boosting, which combine models of the same type, stacking typically combines heterogeneous models. The procedure has two levels:

1. Level 0: train several base learners (for example, a random forest, a gradient boosting model, and a neural network) and collect their out-of-fold predictions on the training data, typically via cross-validation.
2. Level 1: train a meta-learner that takes the base learners' predictions as input features and learns the best way to combine them.
The meta-learner discovers patterns in how the base learners err. For instance, it might learn that the neural network is more reliable for certain types of inputs while the random forest is better for others. Stacking can theoretically represent any other ensemble technique and often outperforms simpler combination strategies.
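A minimal two-level sketch makes the out-of-fold machinery concrete (a toy setup with invented helpers `knn1`, `mean_model`, and `stack_weight`; the meta-learner here is just a one-parameter blend fit by grid search rather than a full model):

```python
def knn1(train_x, train_y, x):
    # Base model A: 1-nearest-neighbour regression.
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

def mean_model(train_x, train_y, x):
    # Base model B: predict the training-set mean everywhere.
    return sum(train_y) / len(train_y)

def stack_weight(xs, ys, k=5):
    # Level 0: out-of-fold predictions, so the meta-learner never
    # sees a base model's prediction on its own training data.
    n = len(xs)
    meta = []  # rows of (pred_A, pred_B, target)
    for f in range(k):
        tx = [x for i, x in enumerate(xs) if i % k != f]
        ty = [y for i, y in enumerate(ys) if i % k != f]
        for i in range(n):
            if i % k == f:
                meta.append((knn1(tx, ty, xs[i]),
                             mean_model(tx, ty, xs[i]),
                             ys[i]))
    # Level 1: choose w minimizing squared error of the blend
    # w * pred_A + (1 - w) * pred_B over the out-of-fold rows.
    return min((w / 20 for w in range(21)),
               key=lambda w: sum((w * a + (1 - w) * b - y) ** 2
                                 for a, b, y in meta))

# Identity target: the nearest-neighbour model is far more reliable
# than the global mean, so the meta-learner should favour it heavily.
xs = [i / 10 for i in range(40)]
ys = list(xs)
w = stack_weight(xs, ys)
```

On this data the learned blend weight lands near 1.0, i.e. the meta-learner discovers that base model A is almost always the one to trust.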
Voting is the simplest ensemble combination method. Multiple independently trained models each cast a "vote" for their predicted outcome. There are two variants:

- Hard voting: each model votes for a single class label, and the label with the most votes wins.
- Soft voting: the models' predicted class probabilities are averaged, and the class with the highest average probability wins.
Soft voting generally outperforms hard voting because it uses more information (each model's confidence level) and is more robust to noise. However, it works best when the base models produce well-calibrated probability estimates.
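The difference is easy to see in a binary example (a toy illustration with invented helpers `hard_vote` and `soft_vote`): one confident "yes" can outweigh two lukewarm "no"s under soft voting, while hard voting discards the confidence entirely.

```python
def hard_vote(probs, threshold=0.5):
    # Each model casts a discrete vote; the majority of votes wins.
    votes = [p >= threshold for p in probs]
    return sum(votes) > len(votes) / 2

def soft_vote(probs, threshold=0.5):
    # Average the predicted probabilities, then threshold once.
    return sum(probs) / len(probs) >= threshold

# Predicted probabilities of the positive class from three models:
# one confident "yes" (0.9) and two lukewarm "no"s (0.4, 0.45).
probs = [0.9, 0.4, 0.45]
hard = hard_vote(probs)   # False: two of three votes are "no"
soft = soft_vote(probs)   # True: the average probability is about 0.58
```

Hard voting rejects the positive class 2-to-1, while soft voting accepts it because the single confident model pulls the averaged probability above the threshold.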
The mixture of experts (MoE) approach, introduced by Robert Jacobs, Michael Jordan, Steven Nowlan, and Geoffrey Hinton in 1991, divides the input space into regions and assigns specialized "expert" models to each region. A gating network learns which expert to consult for each input. Unlike voting or averaging, which give every model a say on every input, MoE routes each input to the most appropriate expert or weighted combination of experts.
MoE has seen a resurgence in modern large language models, where models such as Mixtral and the Switch Transformer use sparse mixtures of experts to scale model capacity without proportionally increasing computational cost.
Diversity among ensemble members is the single most important factor determining whether an ensemble will outperform its individual components. This mirrors Surowiecki's "diversity of opinion" condition for wise crowds.
Ensemble methods create diversity through several mechanisms:
| Diversity mechanism | How it works | Methods that use it |
|---|---|---|
| Data sampling | Each model trains on a different subset of the training data | Bagging, random forests |
| Feature sampling | Each model or each split considers a different subset of features | Random forests, random subspace method |
| Algorithm variation | Different learning algorithms with different inductive biases | Stacking, voting |
| Hyperparameter variation | Same algorithm with different hyperparameter settings | Hyperparameter ensembles |
| Initialization randomness | Neural networks with different random weight initializations | Deep ensembles |
| Sequential reweighting | Each model focuses on the examples that previous models got wrong | AdaBoost, gradient boosting |
| Output perturbation | Adding random noise to model outputs during training | Output smearing |
The benefit of an ensemble depends directly on the degree of error correlation among its members. Consider N models, each with expected squared error of sigma-squared. If their errors are perfectly correlated (all models make the same mistakes), averaging produces no improvement; the ensemble error remains sigma-squared. If their errors are completely uncorrelated, averaging reduces the ensemble error to sigma-squared divided by N.
In practice, errors are partially correlated. The ensemble error can be expressed as:
Ensemble error = (1/N) * average_variance + (1 - 1/N) * average_covariance
This formula shows that adding more models always reduces the first term but never eliminates the second. Once the ensemble is large enough, the error is dominated by the average covariance between models. Therefore, reducing error correlation through diversity is more valuable than simply adding more models of the same type.
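A small simulation makes the covariance effect tangible (a toy model with the invented helper `ensemble_error_variance`; each model's error is built from a shared component plus an independent one, giving pairwise correlation rho):

```python
import random

def ensemble_error_variance(n_models, rho, trials=20000, seed=0):
    # Each model's error: e_i = sqrt(rho)*z + sqrt(1-rho)*u_i, where
    # z is shared and u_i is private, so Var(e_i) = 1 and the error
    # correlation between any two models is rho. Returns the Monte
    # Carlo estimate of the variance of the averaged ensemble error.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        z = rng.gauss(0, 1)  # shared (correlated) error component
        errs = [rho ** 0.5 * z + (1 - rho) ** 0.5 * rng.gauss(0, 1)
                for _ in range(n_models)]
        mean_err = sum(errs) / n_models
        total += mean_err ** 2
    return total / trials
```

With ten models of unit error variance, independent errors (rho = 0) shrink the ensemble error variance to about 0.1, while strongly correlated errors (rho = 0.8) leave it near 0.82, matching the formula rho + (1 - rho)/N.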
Negative correlation learning (NCL) is a technique that explicitly encourages ensemble members to make different errors during training. Each model's loss function includes a penalty term that discourages agreement with the other models in the ensemble. Both theoretical analysis and experiments have shown that training neural networks with negative error correlation produces ensembles with better generalization ability than training each network independently [11].
The wisdom of the crowd principle extends naturally to deep learning, though the high computational cost of training large neural networks makes traditional ensemble approaches expensive.
The simplest approach is to train multiple neural networks independently with different random initializations and average their predictions. Despite the lack of any explicit diversity mechanism beyond initialization randomness, deep ensembles consistently improve accuracy and produce well-calibrated uncertainty estimates. This works because the loss landscape of deep neural networks contains many different local minima, and networks initialized differently converge to different solutions that make partially independent errors.
Snapshot ensembles, proposed by Huang et al. in 2017, provide a way to obtain multiple models from a single training run. The technique uses a cyclical learning rate schedule that periodically drives the learning rate to a high value and then anneals it back down. At the end of each annealing cycle, the model has converged to a different local minimum, and a "snapshot" of the weights is saved. The predictions from all saved snapshots are averaged at test time, producing an ensemble from what would otherwise be a single training run.
Dropout, normally used as a regularization technique during training, can also serve as an ensemble method at inference time. By keeping dropout active during prediction and running the same input through the network multiple times, each forward pass produces a different prediction because different neurons are randomly disabled. Averaging these predictions approximates Bayesian model averaging over an exponentially large number of subnetworks.
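The mechanics can be sketched with a toy linear "network" (illustrative only; the helper `mc_dropout_predict` and the inverted-dropout scaling convention are assumptions of this sketch, not a real deep learning API):

```python
import random

def mc_dropout_predict(weights, x, keep_prob=0.5, passes=2000, seed=0):
    # Monte Carlo dropout: average many stochastic forward passes,
    # each with a fresh random dropout mask over the weights.
    rng = random.Random(seed)
    preds = []
    for _ in range(passes):
        # Drop each weight with probability 1 - keep_prob; scale
        # survivors by 1/keep_prob (inverted dropout) so the
        # expectation matches the deterministic forward pass.
        masked = [w / keep_prob if rng.random() < keep_prob else 0.0
                  for w in weights]
        preds.append(sum(w * xi for w, xi in zip(masked, x)))
    mean = sum(preds) / passes
    var = sum((p - mean) ** 2 for p in preds) / passes
    return mean, var

mean, var = mc_dropout_predict([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

The averaged prediction converges to the deterministic output of the full network, while the spread across passes serves as an uncertainty estimate: inputs on which the subnetworks disagree yield high variance.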
In the context of large language models, ensemble techniques are applied through methods such as model merging, where the weight parameters of multiple fine-tuned models are combined without retraining. Techniques like linear interpolation of weights (model soups), SLERP (spherical linear interpolation), and TIES (trim, elect sign, and merge) allow practitioners to blend the capabilities of different models into a single set of weights.
The Netflix Prize (2006 to 2009) is perhaps the most famous demonstration of ensemble methods in practice. Netflix offered a $1 million prize to any team that could improve their recommendation algorithm's accuracy by 10 percent. The winning team, BellKor's Pragmatic Chaos, achieved a 10.06 percent improvement by combining over 100 individual models through blending, a form of stacking [12]. The team itself was a merger of three separate teams (BellKor, Pragmatic Theory, and BigChaos), each bringing different modeling approaches. The final solution linearly blended predictions from matrix factorization models, neighborhood-based models, and restricted Boltzmann machines, among others.
The Netflix Prize demonstrated a key lesson: the marginal gain from adding more models to an ensemble decreases as the ensemble grows. The first few diverse models provide large improvements, but each additional model contributes less. Netflix ultimately decided not to deploy the winning ensemble in production because the marginal improvement over a simpler solution did not justify the engineering complexity.
Ensemble methods have become standard practice in machine learning competitions. On Kaggle, the vast majority of winning solutions for tabular data problems use some form of ensembling, typically combining multiple gradient boosting models (XGBoost, LightGBM, CatBoost) with stacking or blending. Some winning solutions have used multi-level stacking with dozens of base models feeding into multiple layers of meta-learners.
Condorcet's jury theorem has been directly applied to medical imaging, where multiple physicians independently assess the same diagnostic image. Aggregating their assessments through majority voting produces diagnoses more accurate than those of any single physician. Ensemble methods in automated medical imaging follow the same logic, combining multiple convolutional neural networks trained on the same diagnostic task.
The wisdom of the crowd, whether applied to human groups or machine learning ensembles, can fail under specific conditions.
When ensemble members tend to make the same mistakes, combining their predictions amplifies rather than mitigates errors. This can happen when models are trained on the same data with similar algorithms, or when the training data itself contains systematic biases. Research examining collective decision-making dynamics has found scenarios where the accuracy of group decisions actually decreases as group size increases, specifically when members share highly correlated information [13].
Ensembles require training and maintaining multiple models, which multiplies computational cost, memory usage, and inference latency. In real-time applications, running predictions through dozens of models may not be practical. This is why techniques like snapshot ensembles, knowledge distillation (training a single student model to mimic an ensemble), and model merging have become popular alternatives.
A single decision tree is easy to interpret. A random forest of 500 trees is not. As ensembles grow in size and complexity, understanding why the ensemble made a particular prediction becomes increasingly difficult. Feature importance scores and SHAP values can provide partial explanations, but the full decision process of a large ensemble is generally opaque.
The error reduction from adding ensemble members follows a law of diminishing returns. Most of the benefit comes from the first handful of diverse models. After a certain point, adding more models of similar type provides negligible improvement while increasing computational cost.
Both Condorcet's theorem and ensemble learning theory assume some degree of independence among members. When models are trained on overlapping data, use similar feature representations, or share architectural designs, their errors become correlated, and the theoretical guarantees of the wisdom of the crowd weaken. This parallels Surowiecki's independence condition: social influence and herding behavior cause crowds to fail for the same mathematical reason that correlated classifiers produce weak ensembles.
The following table compares the major ensemble strategies across several practical dimensions.
| Strategy | Parallelizable | Base learner type | Typical number of models | Handles high-dimensional data | Interpretability |
|---|---|---|---|---|---|
| Bagging | Yes | Homogeneous (usually trees) | 50 to 500 | Moderate | Low |
| Random forest | Yes | Homogeneous (decision trees) | 100 to 1,000 | Good | Low to moderate |
| AdaBoost | No (sequential) | Homogeneous (weak learners) | 50 to 200 | Moderate | Low |
| Gradient boosting | No (sequential) | Homogeneous (usually trees) | 100 to 10,000 | Good | Low |
| Stacking | Partially (base learners yes, meta-learner no) | Heterogeneous | 3 to 20 | Good | Very low |
| Voting | Yes | Heterogeneous | 3 to 10 | Good | Moderate |
| Mixture of experts | Partially | Heterogeneous | Varies | Good | Moderate |
For a group of n independent voters each with accuracy p > 0.5, the probability that the majority is correct is given by the sum of binomial probabilities for all majority sizes. As n approaches infinity, this probability approaches 1. The speed of convergence depends on the margin (p - 0.5): larger margins yield faster convergence.
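The binomial sum can be evaluated exactly in a few lines (the helper name `majority_correct` is invented here; a strict majority is assumed, so odd n avoids ties):

```python
from math import comb

def majority_correct(n, p):
    # Exact probability that a strict majority of n independent
    # voters, each correct with probability p, reaches the right
    # answer: sum of binomial probabilities over all majority sizes.
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))
```

For p = 0.6, a single voter is right 60 percent of the time, three voters reach 64.8 percent, and 101 voters exceed 97 percent, illustrating the convergence toward certainty.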
For an ensemble of N models using simple averaging, the expected squared error can be decomposed as:
Expected error = average_bias_squared + average_variance - diversity
Where diversity is the average squared difference between each model's prediction and the ensemble's prediction. This decomposition shows that diversity always reduces the ensemble's error. The decomposition has been extended beyond squared loss to cross-entropy loss and Poisson loss [5].
Another way to express the ensemble advantage is through the ambiguity decomposition, attributed to Krogh and Vedelsby (1994):
Ensemble error = average_individual_error - average_ambiguity
Ambiguity measures the average squared deviation of individual predictions from the ensemble prediction. This decomposition guarantees that the ensemble error is always less than or equal to the average error of the individual models, provided the combination rule is simple averaging.
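The identity can be verified numerically for any set of predictions (the helper `ambiguity_decomposition` is a name invented for this sketch; it assumes squared loss and simple averaging, as the decomposition requires):

```python
def ambiguity_decomposition(preds, target):
    # Check Krogh and Vedelsby's identity for squared loss with
    # simple averaging: the squared error of the averaged prediction
    # equals the average individual error minus the average ambiguity.
    n = len(preds)
    ens = sum(preds) / n
    ens_error = (ens - target) ** 2
    avg_error = sum((p - target) ** 2 for p in preds) / n
    avg_ambiguity = sum((p - ens) ** 2 for p in preds) / n
    return ens_error, avg_error - avg_ambiguity

# Three disagreeing models predicting a target of 2.5:
lhs, rhs = ambiguity_decomposition([1.0, 2.0, 4.0], 2.5)
```

Both sides of the identity agree to machine precision, and because ambiguity is non-negative, the ensemble's error can never exceed the average error of its members.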