Estimator
Last reviewed
Jun 2, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v3 · 4,003 words
See also: Machine learning terms
An estimator is a rule, function, or algorithm that takes observed data and produces a value intended to approximate some unknown quantity, typically a parameter of a probability distribution or a function relating inputs to outputs. The term carries two related but distinct meanings in modern usage. In classical statistics, an estimator is a procedure for inferring population parameters from a sample. In machine learning libraries such as scikit-learn, the word also names a software object that fits a model to data and exposes methods like fit and predict. The two senses are connected because library estimators perform parameter estimation under the hood, often using techniques developed by statisticians a century ago.
The formal study of estimators began with Ronald A. Fisher's 1922 paper On the Mathematical Foundations of Theoretical Statistics, which introduced consistency, efficiency, sufficiency, and the method of maximum likelihood as a unified framework for evaluating estimation procedures [1]. Earlier work by Carl Friedrich Gauss, Adrien-Marie Legendre, and Karl Pearson had produced specific estimation techniques, but Fisher's paper gave the field its theoretical vocabulary [2]. Subsequent contributions from Harald Cramér [3], C. R. Rao [4], Charles Stein [7], and Peter Huber [5] extended the theory in directions that shaped both classical and modern statistical practice.
The earliest formal estimators predate the word itself. Adrien-Marie Legendre published the method of least squares in 1805, applying it to the shape of the Earth. Gauss claimed to have used the method since 1795 and famously deployed it in 1801 to predict the orbit of the newly discovered asteroid Ceres after Italian astronomer Giuseppe Piazzi lost track of it. Gauss connected least squares to the normal distribution and probability theory in his 1809 Theoria Motus, transforming what Legendre had presented as an algebraic curve-fitting trick into a statistical method [9]. The priority dispute between Gauss and Legendre over who invented least squares became one of the most famous controversies in the history of statistics [9].
Karl Pearson introduced the method of moments in 1894, applying it to fit skewed distributions to data on crab measurements [8]. Pearson borrowed the term "moments" from physics and used the technique to estimate the parameters of his Pearson family of distributions, since maximum likelihood was not yet available [12].
Fisher's 1922 paper is generally regarded as the founding document of modern estimation theory [1][2]. He introduced maximum likelihood estimation as a general principle, defined what it means for an estimator to be consistent and efficient, and proved that the sufficient statistic for a normal scale parameter is the sum of squared deviations rather than the sum of absolute deviations championed by astronomer Arthur Eddington [1]. The paper also introduced the concept of Fisher information, which measures how much a sample tells us about an unknown parameter. Fisher had first used the phrase "method of maximum likelihood" in a 1912 paper, but it was the 1922 work that placed it on a rigorous footing [2].
In 1945, C. R. Rao published a lower bound on the variance of any unbiased estimator [4], derived independently and almost simultaneously by Harald Cramér in his 1946 textbook Mathematical Methods of Statistics [3]. The result, now called the Cramér-Rao bound, states that no unbiased estimator can have variance below the inverse of the Fisher information [10]. The story goes that Rao, then 24 years old and teaching estimation at Calcutta University, was asked by a student why a Fisher result for large samples could not be proved for finite samples. Rao went home, worked through the night, and produced the inequality the next day.
Charles Stein shocked the statistics community in 1956 by showing that the sample mean is inadmissible as an estimator of the mean vector of a multivariate normal distribution when the dimension is three or more [7]. Willard James and Stein produced an explicit dominating estimator in 1961, now called the James-Stein estimator [6]. The result is paradoxical because it implies that one can do better at estimating, say, the batting averages of three baseball players by shrinking each estimate toward a common value, even though the players have nothing to do with each other [11].
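The shrinkage effect can be checked by simulation. The following is a minimal sketch rather than a reproduction of the original papers: it assumes a single observation per coordinate with known unit variance (so the sample mean is the observation itself), shrinks toward the origin, and uses an arbitrary dimension, seed, and set of true means chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
p, n_repeats = 10, 50_000                 # dimension >= 3 is what matters
theta = rng.normal(0.0, 1.0, size=p)      # arbitrary fixed "true" means (illustrative)

# Each row is one experiment: X ~ N(theta, I_p). The usual estimate of theta is X.
x = theta + rng.normal(size=(n_repeats, p))

# James-Stein: multiply X by a data-dependent factor that pulls it toward 0.
shrink = 1.0 - (p - 2) / np.sum(x ** 2, axis=1, keepdims=True)
js = shrink * x

# Risk = expected total squared error, approximated by averaging over experiments.
risk_mle = np.mean(np.sum((x - theta) ** 2, axis=1))
risk_js = np.mean(np.sum((js - theta) ** 2, axis=1))

print(f"risk of unshrunk estimate: {risk_mle:.3f}")
print(f"risk of James-Stein:       {risk_js:.3f}")
```

The comparison is on total squared error across all coordinates. Any single coordinate may be estimated worse after shrinkage, which is part of why the result feels paradoxical.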
Peter Huber's 1964 paper Robust Estimation of a Location Parameter, published in the Annals of Mathematical Statistics, founded the field of robust statistics by introducing M-estimators [5]. The family includes the sample mean and the sample median as special cases, and Huber's own proposal interpolates between the two. Such estimators remain useful when the data contains outliers or when the assumed distribution is only approximately correct [5].
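A small sketch shows how a Huber-type location estimate behaves next to the mean and the median. The algorithm below (iteratively reweighted averaging with a fixed scale taken from the median absolute deviation, and tuning constant k = 1.345) is one common way to compute such an estimate and is an assumption of this example, not necessarily the procedure in Huber's paper. The function name `huber_location` is hypothetical.

```python
import numpy as np

def huber_location(x, k=1.345, tol=1e-8, max_iter=100):
    """Huber-type location estimate via iteratively reweighted means (illustrative)."""
    x = np.asarray(x, dtype=float)
    mu = np.median(x)
    # Robust scale: median absolute deviation, rescaled to match the normal std.
    scale = 1.4826 * np.median(np.abs(x - mu))
    if scale == 0:
        return float(mu)
    for _ in range(max_iter):
        r = np.abs(x - mu) / scale
        # Points within k scale units get weight 1; farther points are down-weighted.
        w = np.minimum(1.0, k / np.maximum(r, 1e-12))
        new_mu = np.sum(w * x) / np.sum(w)
        converged = abs(new_mu - mu) < tol
        mu = new_mu
        if converged:
            break
    return float(mu)

rng = np.random.default_rng(6)
clean = rng.normal(0.0, 1.0, size=95)
data = np.concatenate([clean, np.full(5, 50.0)])   # 5% gross outliers

print(f"mean:   {data.mean():.3f}")
print(f"median: {np.median(data):.3f}")
print(f"huber:  {huber_location(data):.3f}")
```

With the outliers present, the mean is dragged far from zero while the median and the Huber-type estimate stay close to it.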
Formally, an estimator is a measurable function from the sample space to the parameter space. If the data is a random sample $X_1, X_2, \ldots, X_n$ drawn from a distribution with unknown parameter $\theta$, an estimator $\hat\theta$ is any function $\hat\theta = T(X_1, \ldots, X_n)$ that returns a value in the same space as $\theta$. The estimator is itself a random variable because it depends on random data; the value it returns for a particular sample is called an estimate.
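The distinction between an estimator (a function of the data) and an estimate (one realized value) can be made concrete with a short simulation. This is only a sketch: the normal distribution, the sample size, and the helper name `sample_mean` are illustrative assumptions.

```python
import numpy as np

def sample_mean(x):
    """Point estimator T(X_1, ..., X_n): maps a sample to a guess for the mean."""
    return float(np.mean(x))

rng = np.random.default_rng(0)
true_mean, sigma, n = 5.0, 2.0, 50   # the true mean is unknown in practice

# One sample yields one estimate: a single realized number.
estimate = sample_mean(rng.normal(true_mean, sigma, size=n))

# Repeating the experiment shows that the estimator is a random variable
# with its own sampling distribution.
estimates = np.array([
    sample_mean(rng.normal(true_mean, sigma, size=n)) for _ in range(2000)
])

print(f"single estimate:       {estimate:.3f}")
print(f"mean of the estimates: {estimates.mean():.3f}")
print(f"std of the estimates:  {estimates.std(ddof=1):.3f}")
print(f"theoretical std:       {sigma / np.sqrt(n):.3f}")
```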
A crucial distinction separates point estimators from interval estimators. A point estimator returns a single value, such as the sample mean as an estimate of the population mean. An interval estimator returns a range, such as a 95% confidence interval, that is intended to contain the true parameter with a specified probability. Confidence intervals were developed by Jerzy Neyman in 1937 as a frequentist counterpart to Bayesian credible intervals [17]. The frequentist interpretation is subtle and often misunderstood: a 95% confidence interval does not mean the true parameter lies in the computed interval with probability 0.95. It means that if the estimation procedure were repeated on many independent samples, about 95% of the resulting intervals would contain the true value. The parameter is fixed, and the interval is the random quantity [17].
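The repeated-sampling interpretation can be illustrated by simulation. The sketch below assumes normal data with a known standard deviation, so the interval is the simple z-interval. The specific numbers are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mean, sigma, n = 10.0, 3.0, 40
z = 1.959964            # 97.5% quantile of the standard normal distribution
n_repeats = 5000

# Each row is an independent sample; each sample produces its own interval.
samples = rng.normal(true_mean, sigma, size=(n_repeats, n))
means = samples.mean(axis=1)
half_width = z * sigma / np.sqrt(n)

# The parameter is fixed; the interval endpoints vary from sample to sample.
covered = (means - half_width <= true_mean) & (true_mean <= means + half_width)
coverage = covered.mean()

print(f"fraction of intervals containing the true mean: {coverage:.3f}")
```

The printed fraction should be close to 0.95. Any single interval either contains the true mean or does not.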
Statisticians evaluate estimators against a small set of formal properties. No single estimator is best on all criteria; choice depends on the application and on what one is willing to assume about the data.
| Property | Definition | Why it matters |
|---|---|---|
| Unbiasedness | $E[\hat\theta] = \theta$ for all $\theta$ | The estimator is correct on average. Bias measures systematic error. |
| Consistency | $\hat\theta \xrightarrow{p} \theta$ as $n \to \infty$ | The estimator gets closer to the truth with more data. A weak but essential requirement. |
| Efficiency | Variance attains the Cramér-Rao bound | The estimator wastes no information. Efficient estimators have the lowest possible variance among unbiased estimators. |
| Sufficiency | $\hat\theta$ contains all sample information about $\theta$ | Reducing data to a sufficient statistic loses nothing. Fisher introduced this in 1922. |
| Robustness | Performance degrades gracefully under model misspecification or outliers | The estimator does not collapse when assumptions fail. Foundational to Huber's program. |
| Invariance | Estimator transforms predictably under reparametrization | Useful for choosing scales and units without changing inferences. |
The mean squared error decomposition combines bias and variance into a single criterion:
$$\text{MSE}(\hat\theta) = E[(\hat\theta - \theta)^2] = \text{Bias}(\hat\theta)^2 + \text{Var}(\hat\theta)$$
This decomposition motivates the bias-variance tradeoff that pervades modern machine learning. A slightly biased estimator can have lower MSE than the best unbiased one, and the James-Stein estimator is the canonical example [11]. Ridge regression and most regularization techniques exploit this tradeoff explicitly.
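The decomposition can be checked numerically on a simple case. The sketch below compares two estimators of a normal variance, dividing by n - 1 (unbiased) and by n (biased). The choice of example, sample size, and seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2, n, n_repeats = 4.0, 10, 200_000

# Each row is one sample of size n from N(0, sigma2).
samples = rng.normal(0.0, np.sqrt(sigma2), size=(n_repeats, n))

results = {}
for name, ddof in [("unbiased", 1), ("biased", 0)]:
    est = samples.var(axis=1, ddof=ddof)       # one variance estimate per sample
    bias = est.mean() - sigma2                 # Monte Carlo estimate of the bias
    variance = est.var()                       # spread of the estimator itself
    mse = np.mean((est - sigma2) ** 2)         # mean squared error around the truth
    results[name] = {"bias": bias, "variance": variance, "mse": mse}
    print(f"{name:9s} bias^2 + var = {bias**2 + variance:.4f}   MSE = {mse:.4f}")
```

In this setting the divide-by-n estimator is biased downward yet has the smaller mean squared error, a small-scale instance of trading bias for variance.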
If one restricts attention to unbiased estimators, a natural goal is to find the one with the smallest variance. An estimator that achieves the lowest variance among all unbiased estimators, for every value of the parameter, is called a minimum-variance unbiased estimator (MVUE), sometimes written UMVUE for uniformly minimum variance [18]. The MVUE is closely related to efficiency but is not the same thing. An efficient estimator, one whose variance meets the Cramér-Rao bound, is automatically the MVUE when it exists, yet many problems admit an MVUE that does not attain the bound. Efficiency is the stronger condition [18].
Two classical theorems explain how to construct the MVUE. The Rao-Blackwell theorem says that if you take any unbiased estimator and condition it on a sufficient statistic, the resulting estimator is still unbiased and has variance no larger than the original. In symbols, if $\delta$ is unbiased for $g(\theta)$ and $T$ is sufficient, then $\eta = E[\delta \mid T]$ is unbiased and $\text{Var}(\eta) \leq \text{Var}(\delta)$ [18]. Conditioning on a sufficient statistic can only help, never hurt. The Lehmann-Scheffé theorem sharpens this: if $T$ is both sufficient and complete, then any unbiased function of $T$ is the unique MVUE [18]. Together the two results give a recipe. Find a complete sufficient statistic, find any unbiased estimator that depends on the data only through it, and you are done.
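A standard textbook illustration of this recipe, chosen here as an assumption rather than taken from the cited source, is estimating $g(\lambda) = P(X = 0) = e^{-\lambda}$ from a Poisson sample. The crude estimator $\delta = \mathbf{1}\{X_1 = 0\}$ is unbiased, $T = \sum_i X_i$ is a complete sufficient statistic, and conditioning gives $E[\delta \mid T] = ((n-1)/n)^T$. The sketch below checks both claims by simulation.

```python
import numpy as np

rng = np.random.default_rng(3)
lam, n, n_repeats = 2.0, 20, 100_000
target = np.exp(-lam)                      # g(lambda) = P(X = 0)

# Each row is one Poisson sample of size n.
x = rng.poisson(lam, size=(n_repeats, n))

# Crude unbiased estimator: uses only the first observation.
crude = (x[:, 0] == 0).astype(float)

# Rao-Blackwellized estimator: E[crude | T], where T is the sample total.
t = x.sum(axis=1)
rao_blackwell = ((n - 1) / n) ** t

print(f"target:              {target:.4f}")
print(f"crude mean / var:    {crude.mean():.4f} / {crude.var():.5f}")
print(f"improved mean / var: {rao_blackwell.mean():.4f} / {rao_blackwell.var():.5f}")
```

Both averages land near the target, while the conditioned estimator has a far smaller variance.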
Many estimators cannot be evaluated exactly for finite samples, so statisticians study how they behave as the sample size $n$ grows without bound. The central large-sample result concerns maximum likelihood. Under standard regularity conditions, the maximum likelihood estimator is consistent and asymptotically normal: the rescaled error $\sqrt{n}\,(\hat\theta - \theta)$ converges in distribution to a normal distribution with mean zero and variance equal to the inverse Fisher information $I(\theta)^{-1}$ [19]. Two facts fall out of this. First, the MLE converges at the rate $1/\sqrt{n}$, which is the usual rate for well-behaved parametric estimators. Second, its asymptotic variance equals the Cramér-Rao lower bound, so the MLE is asymptotically efficient: no consistent estimator does better in the limit [19][10]. These guarantees hold only when the regularity conditions are met. They can fail when the parameter sits on the boundary of its space, when the support of the distribution depends on the parameter, or when the model is misspecified [19].
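The asymptotic statement can be checked on a model where everything is available in closed form. The sketch below assumes an exponential distribution with rate $\lambda$, for which the MLE is $1/\bar{X}$ and the Fisher information per observation is $1/\lambda^2$, so the rescaled error should have variance close to $\lambda^2$. The model and the numbers are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
rate, n, n_repeats = 2.0, 400, 10_000

# Each row is one sample of size n from Exponential(rate).
x = rng.exponential(scale=1.0 / rate, size=(n_repeats, n))

mle = 1.0 / x.mean(axis=1)                 # MLE of the rate for each sample
rescaled = np.sqrt(n) * (mle - rate)       # sqrt(n) * (theta_hat - theta)
inv_fisher = rate ** 2                     # I(rate)^{-1} for the exponential model

print(f"mean of rescaled error:     {rescaled.mean():.3f}  (limit: 0)")
print(f"variance of rescaled error: {rescaled.var():.3f}  (limit: {inv_fisher:.3f})")
```

At finite $n$ the MLE here carries a small positive bias, so the mean of the rescaled error is near zero rather than exactly zero.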
A handful of estimation principles dominate practice. Each can be derived from a different criterion, and each has characteristic strengths.
| Estimator | Principle | Year introduced | Notable property |
|---|---|---|---|
| Ordinary least squares (OLS) | Minimize sum of squared residuals | Legendre 1805, Gauss 1809 | Best linear unbiased estimator under Gauss-Markov assumptions |
| Method of moments (MoM) | Equate sample moments to theoretical moments | Pearson 1894 | Simple, consistent, often inefficient |
| Maximum likelihood (MLE) | Maximize the likelihood function $L(\theta; x)$ | Fisher 1912, formalized 1922 | Asymptotically efficient and consistent under regularity |
| Maximum a posteriori (MAP) | Maximize the posterior $p(\theta \mid x)$ | Twentieth-century Bayesian revival | Adds a prior; reduces to MLE under a flat prior |
| Bayes estimator | Posterior mean, median, or mode minimizing expected loss | Eighteenth-century origins, formalized twentieth century | Optimal for the chosen loss function |
| M-estimator | Minimize a generalized loss $\sum \rho(x_i, \theta)$ | Huber 1964 | Robust to outliers and model misspecification |
| James-Stein | Shrink sample mean toward a common point | James and Stein 1961 | Biased but dominates the sample mean for dimension $\geq 3$ |
| Generalized method of moments (GMM) | Match a vector of moment conditions | Hansen 1982 | Standard in econometrics; nests OLS, MLE, MoM as special cases |
Maximum likelihood is the default in much of modern statistics because it is asymptotically efficient under mild conditions: as the sample size grows, no other consistent estimator achieves lower variance [1][19]. The method also generalizes naturally to complex models, including the deep neural networks trained today by minimizing negative log-likelihood (cross-entropy) loss. Ordinary least squares is itself a maximum likelihood estimator when the errors are assumed to be independent and normally distributed, which is one reason the two methods so often agree in practice [13].
The distinction between parametric and non-parametric estimation runs through the entire field.
A parametric model assumes the data was generated by a distribution that belongs to a family indexed by a finite-dimensional parameter. The job of the estimator is to pin down that parameter. Linear regression, logistic regression, Gaussian mixture models, and most classical statistical methods are parametric. The advantage is statistical efficiency: when the model is correct, parametric estimators converge fast and produce tight confidence intervals. The risk is misspecification, since a wrong family produces misleading results no matter how much data is available. Buja and colleagues argued in 2019 that classical inference quietly assumes the model is exactly right, and that once a regression model is treated as an approximation rather than the truth, the standard errors and confidence intervals it produces can be badly off [16].
A non-parametric model does not commit to a fixed parameter count. The number of effective parameters often grows with the sample size. Kernel density estimation, k-nearest neighbors, decision trees, and Gaussian processes are non-parametric. These methods make weaker assumptions about the data-generating process and are more flexible, but they typically converge slower and require more data to reach a given accuracy.
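The tradeoff shows up clearly when the assumed family is wrong. The sketch below fits a single Gaussian (parametric) and a hand-rolled Gaussian kernel density estimate (non-parametric) to bimodal data. The data-generating mixture, the bandwidth of 0.3, and the helper name `normal_pdf` are assumptions made for illustration.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Density of N(mean, std^2), evaluated elementwise with broadcasting."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(7)
# Bimodal data: a single Gaussian is the wrong family here.
data = np.concatenate([rng.normal(-2.0, 0.5, 500), rng.normal(2.0, 0.5, 500)])
grid = np.linspace(-5.0, 5.0, 201)

# Parametric fit: the whole model is summarized by two numbers.
param_density = normal_pdf(grid, data.mean(), data.std(ddof=1))

# Non-parametric fit: average one small Gaussian bump per data point.
bandwidth = 0.3   # hand-picked for this example
kde_density = normal_pdf(grid[:, None], data[None, :], bandwidth).mean(axis=1)

# Compare both fits against the density that actually generated the data.
true_density = 0.5 * normal_pdf(grid, -2.0, 0.5) + 0.5 * normal_pdf(grid, 2.0, 0.5)
err_param = np.mean((param_density - true_density) ** 2)
err_kde = np.mean((kde_density - true_density) ** 2)

print(f"squared error, single Gaussian: {err_param:.5f}")
print(f"squared error, kernel estimate: {err_kde:.5f}")
```

If the data really came from a single Gaussian, the comparison would typically reverse: the two-parameter fit would be the more accurate one at the same sample size.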
In practice, the line is blurry. Many models are semi-parametric, combining a finite-dimensional parameter of interest with an infinite-dimensional nuisance component. The Cox proportional hazards model is a classic example.
The word estimator has a specific technical meaning in scikit-learn, the dominant Python machine learning library, and by extension across many libraries that adopt its API conventions. The conventions were first codified in a 2013 design paper by the scikit-learn developers, which argued that a single uniform estimator interface is what lets the library compose and reuse otherwise unrelated algorithms [20].
A scikit-learn estimator is any object that implements a fit method to learn from data [14]. The base class sklearn.base.BaseEstimator provides the common machinery for getting and setting hyperparameters, for the textual and HTML representation shown in notebooks, and for cloning and serialization [15]. By convention, an estimator's __init__ does nothing except store its hyperparameters: they are accepted but not validated, and no learned state is created until fit is called [14]. Every hyperparameter must be saved to an attribute of the same name, with no renaming and no *args or **kwargs, because tools like clone and GridSearchCV rely on reading those attributes back out [14]. After fitting, the estimator stores learned parameters in attributes whose names end with an underscore (such as coef_ for linear models or cluster_centers_ for KMeans). This trailing-underscore convention is the visible signal that an attribute exists only after fit [14].
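These conventions are easiest to see in a tiny custom estimator. The class below is a hypothetical toy (ShrunkenMeanRegressor and its shrinkage parameter are invented for this example and are not part of scikit-learn). It is a sketch of the conventions, not a template that passes the library's full conformance checks.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin

class ShrunkenMeanRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical toy regressor: predicts the training mean times `shrinkage`."""

    def __init__(self, shrinkage=1.0):
        # Store the hyperparameter under the same name; no validation here.
        self.shrinkage = shrinkage

    def fit(self, X, y):
        X = np.asarray(X)
        y = np.asarray(y, dtype=float)
        # Validation is deferred to fit, as the conventions require.
        if not 0.0 <= self.shrinkage <= 1.0:
            raise ValueError("shrinkage must be in [0, 1]")
        # Learned state gets a trailing underscore and is created only in fit.
        self.n_features_in_ = X.shape[1]
        self.mean_ = self.shrinkage * y.mean()
        return self

    def predict(self, X):
        X = np.asarray(X)
        return np.full(X.shape[0], self.mean_)

X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

model = ShrunkenMeanRegressor(shrinkage=0.5)
has_state_before_fit = hasattr(model, "mean_")   # no learned state yet
fitted = model.fit(X, y)

print(model.get_params())
print(has_state_before_fit, fitted is model, model.mean_)
```

Because the constructor argument and the attribute share a name, get_params can recover the hyperparameters without any extra code in the class.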
Two methods inherited from BaseEstimator make hyperparameters introspectable. get_params returns the constructor arguments as a dictionary, and set_params writes them back, accepting the nested <component>__<parameter> syntax that lets a search reach inside a pipeline [15]. The related function sklearn.base.clone builds a fresh, unfitted copy of an estimator from those same parameters, which is how cross-validation gets an untrained model for each fold [14]. Also by convention, fit returns self, so that calls can be chained, as in est.fit(X, y).predict(X) [14]. When an estimator is fitted on tabular data it records the input width in n_features_in_, and if it was fitted on a pandas or polars DataFrame it also records the column names in feature_names_in_, then checks both at prediction time to catch shape and naming mistakes [14].
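A short sketch shows these pieces working together. The synthetic dataset and the step names "scaler" and "clf" are arbitrary choices for this example.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression(C=1.0))])

# Nested <component>__<parameter> syntax reaches inside the pipeline.
pipe.set_params(clf__C=0.1)
c_value = pipe.get_params()["clf__C"]

# fit returns the estimator itself, so calls can be chained.
fitted = pipe.fit(X, y)

# clone copies the hyperparameters but none of the learned state.
fresh = clone(pipe)

print(c_value, fitted is pipe)
print(hasattr(pipe.named_steps["clf"], "coef_"))    # fitted: has learned state
print(hasattr(fresh.named_steps["clf"], "coef_"))   # clone: unfitted
print(pipe.n_features_in_)                          # input width recorded by fit
```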
Scikit-learn estimators specialize into a small number of categories, each defined by an additional mixin class [14].
| Category | Mixin | Required methods | Examples |
|---|---|---|---|
| Classifier | ClassifierMixin | fit, predict, often predict_proba | LogisticRegression, RandomForestClassifier, SVC |
| Regressor | RegressorMixin | fit, predict | LinearRegression, Ridge, GradientBoostingRegressor |
| Transformer | TransformerMixin | fit, transform, fit_transform | StandardScaler, PCA, OneHotEncoder |
| Cluster | ClusterMixin | fit, fit_predict, labels_ attribute | KMeans, DBSCAN, AgglomerativeClustering |
| Density estimator | (no canonical mixin) | fit, score_samples | KernelDensity, GaussianMixture |
| Outlier detector | OutlierMixin | fit, predict, decision_function | IsolationForest, LocalOutlierFactor |
Each mixin supplies more than a label. RegressorMixin and ClassifierMixin both provide a default score method, but they score differently: a regressor reports the coefficient of determination $R^2$, while a classifier reports mean accuracy [14]. Classifiers may additionally expose predict_proba for class probability estimates (which are not guaranteed to be well calibrated) and decision_function for a raw confidence score, such as the signed distance from the decision boundary in a linear model, both keyed to the class labels recorded in the classes_ attribute during fit [14]. TransformerMixin supplies fit_transform, which by default calls fit and then transform; individual transformers may override it with a more efficient joint implementation [14].
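The different default scores, and the relationship between predict_proba, decision_function, and classes_, can be verified directly. The sketch below uses synthetic data and two common estimators picked for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, r2_score

Xc, yc = make_classification(n_samples=300, n_features=6, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xc, yc)

Xr, yr = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)
reg = LinearRegression().fit(Xr, yr)

# The inherited score methods: accuracy for classifiers, R^2 for regressors.
clf_score_matches = np.isclose(clf.score(Xc, yc), accuracy_score(yc, clf.predict(Xc)))
reg_score_matches = np.isclose(reg.score(Xr, yr), r2_score(yr, reg.predict(Xr)))

# Probability columns follow the order of clf.classes_.
proba = clf.predict_proba(Xc[:5])
# For a binary problem the decision function is one signed score per sample.
margin = clf.decision_function(Xc[:5])

print(clf_score_matches, reg_score_matches)
print(clf.classes_, proba.shape, margin.shape)
```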
Because the categories are conventions rather than strict types, scikit-learn leans on duck typing instead of inheritance checks. Helper functions such as sklearn.base.is_classifier and is_regressor inspect an object for the right behavior rather than testing its class, so an estimator that follows the protocol works even if it does not inherit from the canonical mixins [14]. The uniformity of the API has practical consequences. Because every estimator follows the same conventions, scikit-learn can compose them into pipelines, run cross-validation over arbitrary models, and search hyperparameter spaces with GridSearchCV or RandomizedSearchCV. The same loop that trains a logistic regression also trains a gradient boosting model, since both expose fit and predict.
A Pipeline chains transformers and a final estimator into a single object that itself looks like an estimator. Calling fit on the pipeline calls fit_transform on each transformer in sequence and fit on the final estimator. This pattern prevents data leakage, since transformations are learned on training folds and applied to test folds without contamination.
Meta-estimators wrap other estimators to add behavior. GridSearchCV wraps an estimator to perform exhaustive hyperparameter search via cross-validation. BaggingClassifier wraps a base classifier to produce an ensemble. MultiOutputRegressor wraps a single-output regressor to handle multi-output targets. Because all of these objects respect the estimator interface, they nest arbitrarily.
Not every dataset fits in memory, and not every problem allows a single pass over the data. Estimators that support incremental learning expose a partial_fit method that updates the model from one mini-batch at a time and can be called repeatedly [21]. This is the basis for out-of-core learning, in which the data is streamed from disk in chunks so that only a small batch is ever resident in RAM [21]. SGDClassifier and SGDRegressor, the naive Bayes variants, and MiniBatchKMeans are common examples that implement partial_fit [21]. A related but distinct mechanism is the warm_start parameter. When it is set to True, calling fit again reuses the existing solution as a starting point instead of discarding it, which is useful for fitting a model along a path of growing complexity such as adding trees to a gradient boosting ensemble [21].
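The incremental pattern looks roughly like the sketch below. To keep the example self-contained, the "stream" is simulated by slicing an in-memory array. In a real out-of-core setting each batch would be read from disk or a network source instead. The batch size and dataset are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# partial_fit cannot infer the label set from one batch, so the full set of
# classes is passed explicitly.
classes = np.unique(y)

clf = SGDClassifier(random_state=0)
batch_size = 200
for start in range(0, len(X), batch_size):
    stop = start + batch_size
    # Each call updates the existing model rather than refitting from scratch.
    clf.partial_fit(X[start:stop], y[start:stop], classes=classes)

accuracy = clf.score(X, y)
print(f"training accuracy after one streamed pass: {accuracy:.3f}")
```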
Because so much downstream code assumes the estimator contract, scikit-learn ships a conformance test suite. The function check_estimator runs a battery of checks against an estimator instance, and the parametrize_with_checks decorator turns those checks into individual pytest cases [14]. Estimators advertise their capabilities through tags, exposed via the __sklearn_tags__ method, which tell the checker things like whether the estimator handles sparse input, allows missing values, or is non-deterministic, so that only the relevant tests run [14]. This machinery is what lets third-party libraries build estimators that drop into scikit-learn pipelines and cross-validation without modification.
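```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Example data (an illustrative choice; any feature matrix and label vector work).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling and classification chained into a single estimator.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The step__parameter syntax reaches the classifier's C inside the pipeline.
params = {"clf__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, params, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```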
The pipeline is itself an estimator, the grid search is an estimator wrapping that estimator, and the whole stack exposes fit, predict, and score. The composability comes directly from the discipline of the base interface.
The statistical and software senses of estimator are linked by what happens inside fit. When LinearRegression.fit runs, it computes the ordinary least squares solution that Gauss and Legendre worked out two centuries ago. When LogisticRegression.fit runs, it computes a maximum likelihood estimate of the coefficients (by default with an L2 penalty, so strictly a penalized maximum likelihood estimate), using L-BFGS by default or one of several Newton-type solvers. When GaussianMixture.fit runs, it iterates the Expectation-Maximization algorithm to find an MLE of the mixture parameters. When BayesianRidge.fit runs, it computes a Bayes estimator under a specific prior.
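The first of these claims is easy to check: the coefficients returned by LinearRegression should match a direct least-squares solve. The sketch below uses synthetic data with made-up true coefficients and compares against numpy.linalg.lstsq.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(8)
true_coef, true_intercept = np.array([1.5, -2.0, 0.5]), 4.0   # illustrative values
X = rng.normal(size=(100, 3))
y = X @ true_coef + true_intercept + rng.normal(scale=0.1, size=100)

# The library estimator.
model = LinearRegression().fit(X, y)

# The same problem solved directly: append a column of ones for the intercept
# and minimize the sum of squared residuals ||A b - y||^2.
A = np.column_stack([X, np.ones(len(X))])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

print("sklearn:", model.coef_, model.intercept_)
print("lstsq:  ", beta[:3], beta[3])
```

Under independent, normally distributed errors, this least-squares solution is also the maximum likelihood estimate of the coefficients.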
The properties statisticians study, including consistency, unbiasedness, and efficiency, apply directly to these procedures. Cross-validation, hyperparameter tuning, and regularization are all techniques that trade bias for variance in pursuit of low test error, which is the bias-variance tradeoff by another name. The library API hides the mathematics behind a uniform interface, but the underlying concepts are the same ones Fisher named in 1922 [1].
Imagine you have a jar of jellybeans and you want to guess how many are inside without dumping them out. You could grab a small handful, count them, and multiply. That counting-and-multiplying procedure is an estimator. Different procedures might be more accurate (count two handfuls and average), more cautious (only count handfuls of a certain size), or more honest about their uncertainty (give a range instead of one number). In statistics, we study which procedures give the best guesses on average, which ones get closer to the truth as you grab more jellybeans, and which ones are not fooled when somebody slipped marbles into the jar.