Estimator

See also: Machine learning terms

An estimator is a rule, function, or algorithm that takes observed data and produces a value intended to approximate some unknown quantity, typically a parameter of a probability distribution or a function relating inputs to outputs. The term carries two related but distinct meanings in modern usage. In classical statistics, an estimator is a procedure for inferring population parameters from a sample. In machine learning libraries such as scikit-learn, the word also names a software object that fits a model to data and exposes methods like fit and predict. The two senses are connected because library estimators perform parameter estimation under the hood, often using techniques developed by statisticians a century ago.

The formal study of estimators began with Ronald A. Fisher's 1922 paper On the Mathematical Foundations of Theoretical Statistics, which introduced consistency, efficiency, sufficiency, and the method of maximum likelihood as a unified framework for evaluating estimation procedures. Earlier work by Carl Friedrich Gauss, Adrien-Marie Legendre, and Karl Pearson had produced specific estimation techniques, but Fisher's paper gave the field its theoretical vocabulary. Subsequent contributions from Harald Cramér, C. R. Rao, Charles Stein, and Peter Huber extended the theory in directions that shaped both classical and modern statistical practice.

historical background

The earliest formal estimators predate the word itself. Adrien-Marie Legendre published the method of least squares in 1805, applying it to the shape of the Earth. Gauss claimed to have used the method since 1795 and famously deployed it in 1801 to predict the orbit of the newly discovered asteroid Ceres after Italian astronomer Giuseppe Piazzi lost track of it. Gauss connected least squares to the normal distribution and probability theory in his 1809 Theoria Motus, transforming what Legendre had presented as an algebraic curve-fitting trick into a statistical method.

Karl Pearson introduced the method of moments in 1894, applying it to fit skewed distributions to data on crab measurements. Pearson borrowed the term "moments" from physics and used the technique to estimate the parameters of his Pearson family of distributions, since maximum likelihood was not yet available.

Fisher's 1922 paper is generally regarded as the founding document of modern estimation theory. He introduced maximum likelihood estimation as a general principle, defined what it means for an estimator to be consistent and efficient, and proved that the sufficient statistic for a normal scale parameter is the sum of squared deviations rather than the sum of absolute deviations championed by astronomer Arthur Eddington. The paper also introduced the concept of Fisher information, which measures how much a sample tells us about an unknown parameter.

In 1945, C. R. Rao published a lower bound on the variance of any unbiased estimator. Harald Cramér derived the same bound independently and almost simultaneously, publishing it in his 1946 textbook Mathematical Methods of Statistics. The result, now called the Cramér-Rao bound, states that, under regularity conditions, no unbiased estimator can have variance below the inverse of the Fisher information. The story goes that Rao, then 24 years old and teaching estimation at Calcutta University, was asked by a student why a Fisher result for large samples could not be proved for finite samples. Rao went home, worked through the night, and produced the inequality the next day.

Charles Stein shocked the statistics community in 1956 by showing that, under squared-error loss, the sample mean is inadmissible as an estimator of the mean vector of a multivariate normal distribution when the dimension is three or more. Willard James and Stein produced an explicit dominating estimator in 1961, now called the James-Stein estimator. The result is paradoxical because it implies that one can lower the total error in estimating, say, the batting averages of three baseball players by shrinking each estimate toward a common value, even though the players have nothing to do with each other.

Peter Huber's 1964 paper Robust Estimation of a Location Parameter, published in the Annals of Mathematical Statistics, founded the field of robust statistics by introducing M-estimators. Huber's own proposal interpolates between the sample mean and the sample median, and M-estimators in general remain useful when the data contains outliers or when the assumed distribution is only approximately correct.

statistical estimator

Formally, an estimator is a measurable function from the sample space to the parameter space. If the data is a random sample $X_1, X_2, \ldots, X_n$ drawn from a distribution with unknown parameter $\theta$, an estimator $\hat\theta$ is any function $\hat\theta = T(X_1, \ldots, X_n)$ that returns a value in the same space as $\theta$. The estimator is itself a random variable because it depends on random data; the value it returns for a particular sample is called an estimate.
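The difference between an estimator (a function) and an estimate (one returned value) can be seen in a small simulation. The sketch below is only illustrative: the normal population, its parameters, and the sample size are assumptions chosen for the demo, and the true mean is known only because the data is simulated.

```python
import random
import statistics

def sample_mean(sample):
    """Estimator T(X_1, ..., X_n): a function of the data alone."""
    return sum(sample) / len(sample)

rng = random.Random(0)   # fixed seed so the run is repeatable
true_mean = 5.0          # the "unknown" theta, known here only because we simulate

# Each fresh sample yields a different estimate from the same estimator.
estimates = [
    sample_mean([rng.gauss(true_mean, 2.0) for _ in range(30)])
    for _ in range(1000)
]

print("first three estimates:", [round(e, 3) for e in estimates[:3]])
print("average of estimates: ", round(statistics.fmean(estimates), 3))
print("spread of estimates:  ", round(statistics.stdev(estimates), 3))
```

The estimates scatter around the true mean, which is what it means for the estimator to be a random variable with its own distribution.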

A crucial distinction separates point estimators from interval estimators. A point estimator returns a single value, such as the sample mean as an estimate of the population mean. An interval estimator returns a range, such as a 95% confidence interval, that is intended to contain the true parameter with a specified probability. Confidence intervals were developed by Jerzy Neyman in 1937 as a frequentist counterpart to Bayesian credible intervals.
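As a rough illustration of the two kinds of output, the sketch below computes a point estimate and a normal-approximation 95% interval for a mean. The data values and the helper name mean_confidence_interval are made up for the example, and for a sample this small a Student t critical value would usually be preferred over the normal one used here.

```python
import math
import statistics

def mean_confidence_interval(sample, level=0.95):
    """Normal-approximation interval for the population mean."""
    n = len(sample)
    center = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value, about 1.96 for a 95% level.
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2.0)
    return center - z * std_err, center + z * std_err

sample = [4.1, 5.3, 4.8, 5.9, 5.0, 4.6, 5.4, 5.1, 4.7, 5.2]
point = statistics.fmean(sample)               # point estimate: one number
low, high = mean_confidence_interval(sample)   # interval estimate: a range
print(f"point estimate: {point:.3f}")
print(f"95% interval:   ({low:.3f}, {high:.3f})")
```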

properties of estimators

Statisticians evaluate estimators against a small set of formal properties. No single estimator is best on all criteria; choice depends on the application and on what one is willing to assume about the data.

| Property | Definition | Why it matters |
| --- | --- | --- |
| Unbiasedness | $E[\hat\theta] = \theta$ for all $\theta$ | The estimator is correct on average. Bias measures systematic error. |
| Consistency | $\hat\theta \xrightarrow{p} \theta$ as $n \to \infty$ | The estimator gets closer to the truth with more data. A weak but essential requirement. |
| Efficiency | Variance attains the Cramér-Rao bound | The estimator wastes no information. Efficient estimators have the lowest possible variance among unbiased estimators. |
| Sufficiency | $\hat\theta$ contains all sample information about $\theta$ | Reducing data to a sufficient statistic loses nothing. Fisher introduced this in 1922. |
| Robustness | Performance degrades gracefully under model misspecification or outliers | The estimator does not collapse when assumptions fail. Foundational to Huber's program. |
| Invariance | Estimator transforms predictably under reparametrization | Useful for choosing scales and units without changing inferences. |
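The robustness property is easy to see with a toy data set. In the sketch below (values invented for illustration), a single gross outlier is appended to five well-behaved measurements, and the sample mean and sample median are compared before and after.

```python
import statistics

clean = [9.8, 10.1, 10.0, 9.9, 10.2]
contaminated = clean + [1000.0]   # one gross outlier, e.g. a data-entry error

for name, data in [("clean", clean), ("contaminated", contaminated)]:
    print(f"{name:>12}: mean={statistics.fmean(data):8.2f}  "
          f"median={statistics.median(data):6.2f}")
```

The mean moves from about 10 to 175, while the median moves only from 10 to 10.05, which is why the median is described as the more robust of the two.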

The mean squared error decomposition combines bias and variance into a single criterion:

$$\text{MSE}(\hat\theta) = E[(\hat\theta - \theta)^2] = \text{Bias}(\hat\theta)^2 + \text{Var}(\hat\theta)$$

This decomposition motivates the bias-variance tradeoff that pervades modern machine learning. A slightly biased estimator can have lower MSE than the best unbiased one, and the James-Stein estimator is the canonical example. Ridge regression and most regularization techniques exploit this tradeoff explicitly.
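The decomposition can be checked numerically. The following sketch, under the assumption of normally distributed data with variance 4, compares three estimators of a variance that differ only in their divisor and confirms that the simulated MSE equals squared bias plus variance. The sample size and trial count are arbitrary choices.

```python
import random
import statistics

rng = random.Random(1)
true_var = 4.0            # variance of the simulated N(0, 2^2) population
n, trials = 10, 20000

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample)

# One sum of squares per simulated sample, shared by all three estimators.
sums = [sum_sq_dev([rng.gauss(0.0, 2.0) for _ in range(n)]) for _ in range(trials)]

results = {}
for label, divisor in [("n-1", n - 1), ("n", n), ("n+1", n + 1)]:
    est = [s / divisor for s in sums]
    bias = statistics.fmean(est) - true_var
    var = statistics.pvariance(est)
    mse = statistics.fmean([(e - true_var) ** 2 for e in est])
    results[label] = (bias, var, mse)
    print(f"divisor {label:>3}: bias={bias:+.3f}  var={var:.3f}  "
          f"mse={mse:.3f}  bias^2+var={bias ** 2 + var:.3f}")
```

For normal data the unbiased divisor n-1 is not the MSE winner: the biased n+1 version typically shows the lowest MSE, a small-scale instance of trading bias for variance.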

common estimators

A handful of estimation principles dominate practice. Each can be derived from a different criterion, and each has characteristic strengths.

| Estimator | Principle | Year introduced | Notable property |
| --- | --- | --- | --- |
| Ordinary least squares (OLS) | Minimize sum of squared residuals | Legendre 1805, Gauss 1809 | Best linear unbiased estimator under Gauss-Markov assumptions |
| Method of moments (MoM) | Equate sample moments to theoretical moments | Pearson 1894 | Simple, consistent, often inefficient |
| Maximum likelihood (MLE) | Maximize the likelihood function $L(\theta; x)$ | Fisher 1912, formalized 1922 | Asymptotically efficient and consistent under regularity |
| Maximum a posteriori (MAP) | Maximize the posterior $p(\theta \mid x)$ | Twentieth-century Bayesian revival | Adds a prior; reduces to MLE under a flat prior |
| Bayes estimator | Posterior mean, median, or mode minimizing expected loss | Eighteenth-century origins, formalized twentieth century | Optimal for the chosen loss function |
| M-estimator | Minimize a generalized loss $\sum \rho(x_i, \theta)$ | Huber 1964 | Robust to outliers and model misspecification |
| James-Stein | Shrink sample mean toward a common point | James and Stein 1961 | Biased but dominates the sample mean for dimension $\geq 3$ |
| Generalized method of moments (GMM) | Match a vector of moment conditions | Hansen 1982 | Standard in econometrics; nests OLS, MLE, MoM as special cases |
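The James-Stein entry can be illustrated with a short simulation. This is a sketch under simplifying assumptions: one observation per coordinate, unit noise variance, shrinkage toward the origin, and an arbitrary made-up vector of true means. It uses the plain form of the estimator, not the positive-part variant.

```python
import random

def james_stein(x):
    """Shrink the observation vector x toward the origin (unit noise variance assumed)."""
    p = len(x)
    norm_sq = sum(v * v for v in x)
    factor = 1.0 - (p - 2) / norm_sq
    return [factor * v for v in x]

def sq_error(estimate, truth):
    return sum((e - t) ** 2 for e, t in zip(estimate, truth))

rng = random.Random(42)
theta = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 1.0, -2.0, 0.3, 0.8]  # arbitrary true means
trials = 5000
err_usual = err_js = 0.0
for _ in range(trials):
    x = [rng.gauss(t, 1.0) for t in theta]   # one noisy observation per coordinate
    err_usual += sq_error(x, theta)          # the observation itself is the usual estimate
    err_js += sq_error(james_stein(x), theta)

print("average total squared error, usual estimator:", round(err_usual / trials, 3))
print("average total squared error, James-Stein:    ", round(err_js / trials, 3))
```

The comparison is about total squared error summed across coordinates; any single coordinate may be estimated worse after shrinkage.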

Maximum likelihood is the default in much of modern statistics because it is asymptotically efficient under standard regularity conditions: as the sample size grows, its variance approaches the Cramér-Rao bound, so no other regular estimator has lower asymptotic variance. The method also generalizes naturally to complex models, including the deep neural networks trained today by minimizing negative log-likelihood (cross-entropy) loss.
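A minimal worked example of maximum likelihood, assuming independent Bernoulli coin flips: the log-likelihood is maximized numerically with a simple ternary search and compared with the closed-form answer. The data and the helper names log_likelihood and maximize are invented for the illustration.

```python
import math

def log_likelihood(p, successes, n):
    """Bernoulli log-likelihood of n trials with the given number of successes."""
    return successes * math.log(p) + (n - successes) * math.log(1.0 - p)

def maximize(f, lo, hi, tol=1e-10):
    """Ternary search for the maximizer of a unimodal function on [lo, hi]."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0

data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]   # ten coin flips, 1 = heads
k, n = sum(data), len(data)

p_numeric = maximize(lambda p: log_likelihood(p, k, n), 1e-9, 1.0 - 1e-9)
p_closed = k / n   # setting the derivative of the log-likelihood to zero gives k/n
print(p_numeric, p_closed)
```

Both routes land on the sample proportion, up to the numerical tolerance of the search. For models without a closed form, only the numerical route is available, which is what library optimizers do at scale.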

parametric and non-parametric estimators

The distinction between parametric and non-parametric estimation runs through the entire field.

A parametric model assumes the data was generated by a distribution that belongs to a family indexed by a finite-dimensional parameter. The job of the estimator is to pin down that parameter. Linear regression, logistic regression, Gaussian mixture models, and most classical statistical methods are parametric. The advantage is statistical efficiency: when the model is correct, parametric estimators converge fast and produce tight confidence intervals. The risk is misspecification, since a wrong family produces misleading results no matter how much data is available.

A non-parametric model does not commit to a fixed parameter count. The number of effective parameters often grows with the sample size. Kernel density estimation, k-nearest neighbors, decision trees, and Gaussian processes are non-parametric. These methods make weaker assumptions about the data-generating process and are more flexible, but they typically converge slower and require more data to reach a given accuracy.

Popular parametric estimators in machine learning include:

Linear regression, which assumes the relationship between dependent and independent variables can be described by a linear equation with two parameters (slope and intercept) in the simple case.
Logistic regression, which models the probability of a binary outcome by assuming a linear relationship between the predictors and the log-odds.
Gaussian mixture models, fit by expectation-maximization, which assume the data is drawn from a finite weighted sum of normal distributions.

Popular non-parametric estimators include:

K-nearest neighbors (KNN), which estimates the target value at a query point by averaging or voting over the labels of the $k$ closest training points in the feature space.
Kernel density estimation (KDE), which estimates a probability density by placing a kernel function (often Gaussian) at each observation and summing the kernels into a smooth curve.
Decision trees and random forests, which partition the input space recursively without committing to a global functional form.
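The contrast can be sketched on one small data set. Below, a parametric fit summarizes the data with two numbers (mean and standard deviation of an assumed normal family), while a Gaussian kernel density estimate keeps every observation. The data values and the bandwidth of 0.6 are arbitrary choices made for the illustration.

```python
import statistics

data = [1.1, 1.9, 2.3, 2.8, 3.0, 3.1, 3.6, 4.2, 6.5, 7.0]

# Parametric: assume a normal family and estimate its two parameters.
mu = statistics.fmean(data)
sigma = statistics.pstdev(data)   # maximum likelihood estimate of the scale
normal_fit = statistics.NormalDist(mu, sigma)

# Non-parametric: average of Gaussian kernels centered on each observation.
def kde(x, sample, bandwidth):
    kernel = statistics.NormalDist(0.0, bandwidth)
    return sum(kernel.pdf(x - xi) for xi in sample) / len(sample)

bandwidth = 0.6   # hand-picked for illustration
for x in [2.0, 3.0, 5.0, 7.0]:
    print(f"x={x}: normal={normal_fit.pdf(x):.3f}  kde={kde(x, data, bandwidth):.3f}")
```

The kernel estimate can follow the second cluster near 7 that a single normal curve smooths over, at the cost of having to store all the data and choose a bandwidth.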

In practice, the line is blurry. Many models are semi-parametric, combining a finite-dimensional parameter of interest with an infinite-dimensional nuisance component. The Cox proportional hazards model is a classic example.

estimator in scikit-learn

The word estimator has a specific technical meaning in scikit-learn, the dominant Python machine learning library, and by extension across many libraries that adopt its API conventions.

A scikit-learn estimator is any object that implements a fit method to learn from data. The base class sklearn.base.BaseEstimator provides the common machinery for getting and setting hyperparameters. By convention, __init__ does no work beyond storing each hyperparameter as an attribute of the same name: hyperparameters are accepted but not validated, and no learned state is created until fit is called. After fitting, the estimator stores learned parameters in attributes whose names end with an underscore (such as coef_ for linear models or cluster_centers_ for KMeans). This trailing-underscore convention is the visible signal that an attribute exists only after fit.
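These conventions can be sketched with a toy custom estimator. The class ShrunkenMeanRegressor and its shrinkage hyperparameter are hypothetical names invented for this example and are not part of scikit-learn; only BaseEstimator, RegressorMixin, and get_params come from the library.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin

class ShrunkenMeanRegressor(RegressorMixin, BaseEstimator):
    """Toy regressor: always predicts a shrunken version of the training mean."""

    def __init__(self, shrinkage=0.0):
        # Store the hyperparameter unchanged; no validation, no learned state.
        self.shrinkage = shrinkage

    def fit(self, X, y):
        y = np.asarray(y, dtype=float)
        # Learned state gets a trailing underscore and appears only after fit.
        self.mean_ = (1.0 - self.shrinkage) * y.mean()
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

model = ShrunkenMeanRegressor(shrinkage=0.5)
print(model.get_params())          # hyperparameters, via BaseEstimator
print(hasattr(model, "mean_"))     # False before fit
model.fit([[0], [1], [2]], [2.0, 4.0, 6.0])
print(model.mean_, model.predict([[5], [6]]))
```

Returning self from fit is what allows chained calls such as model.fit(X, y).predict(X).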

estimator categories

Scikit-learn estimators specialize into a small number of categories, each defined by an additional mixin class.

| Category | Mixin | Typical methods | Examples |
| --- | --- | --- | --- |
| Classifier | `ClassifierMixin` | `fit`, `predict`, often `predict_proba` | `LogisticRegression`, `RandomForestClassifier`, `SVC` |
| Regressor | `RegressorMixin` | `fit`, `predict` | `LinearRegression`, `Ridge`, `GradientBoostingRegressor` |
| Transformer | `TransformerMixin` | `fit`, `transform`, `fit_transform` | `StandardScaler`, `PCA`, `OneHotEncoder` |
| Cluster | `ClusterMixin` | `fit`, `fit_predict`, `labels_` attribute | `KMeans`, `DBSCAN`, `AgglomerativeClustering` |
| Density estimator | `DensityMixin` (not inherited by every density estimator) | `fit`, `score_samples` | `KernelDensity`, `GaussianMixture` |
| Outlier detector | `OutlierMixin` | `fit`, `fit_predict`, usually `predict` and `decision_function` | `IsolationForest`, `LocalOutlierFactor` |

The uniformity of the API has practical consequences. Because every estimator follows the same conventions, scikit-learn can compose them into pipelines, run cross-validation over arbitrary models, and search hyperparameter spaces with GridSearchCV or RandomizedSearchCV. The same loop that trains a logistic regression also trains a gradient boosting model, since both expose fit and predict.
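A small sketch of that uniformity: the loop below evaluates two unrelated models with identical code. The synthetic data set and the hyperparameter values are assumptions chosen only to keep the example quick to run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data so the example is self-contained.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "gradient boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    # cross_val_score only needs the estimator interface, not the model type.
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean accuracy {scores[name]:.3f}")
```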

pipelines and meta-estimators

A Pipeline chains transformers and a final estimator into a single object that itself looks like an estimator. Calling fit on the pipeline calls fit_transform on each transformer in sequence, passing the transformed data along, and then calls fit on the final estimator. When the whole pipeline is cross-validated, this pattern guards against data leakage, since transformations are learned on the training folds only and then applied to the held-out fold.

Meta-estimators wrap other estimators to add behavior. GridSearchCV wraps an estimator to perform exhaustive hyperparameter search via cross-validation. BaggingClassifier wraps a base classifier to produce an ensemble. MultiOutputRegressor wraps a single-output regressor to handle multi-output targets. Because all of these objects respect the estimator interface, they nest arbitrarily.
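Nesting can be sketched in a few lines. Here a pipeline is placed inside MultiOutputRegressor, so a meta-estimator wraps an estimator that is itself a composition. The synthetic two-target data set and the Ridge penalty are arbitrary choices for the illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem with two target columns.
X, Y = make_regression(n_samples=200, n_features=5, n_targets=2,
                       noise=1.0, random_state=0)

# Pipeline (an estimator) inside MultiOutputRegressor (a meta-estimator).
model = MultiOutputRegressor(make_pipeline(StandardScaler(), Ridge(alpha=1.0)))
model.fit(X, Y)

print(model.predict(X[:3]).shape)   # one column per target
print(len(model.estimators_))       # one fitted pipeline clone per target
```

The wrapper never inspects what is inside the pipeline; it only relies on fit and predict being present.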

example

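```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Any labeled data set works; this built-in one is used only to make the example runnable.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Chain scaling and classification into a single estimator.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# "clf__C" addresses the C hyperparameter of the step named "clf".
params = {"clf__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, params, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```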

The pipeline is itself an estimator, the grid search is an estimator wrapping that estimator, and the whole stack exposes fit, predict, and score. The composability comes directly from the discipline of the base interface.

connecting the two senses

The statistical and software senses of estimator are linked by what happens inside fit. When LinearRegression.fit runs, it computes the ordinary least squares solution that Gauss and Legendre worked out two centuries ago. When LogisticRegression.fit runs, it computes a maximum likelihood estimate of the coefficients, penalized by an L2 term under the default settings, using an iterative solver such as L-BFGS (the default) or a Newton-type method. When GaussianMixture.fit runs, it iterates the Expectation-Maximization algorithm to find an MLE of the mixture parameters. When BayesianRidge.fit runs, it computes a Bayes estimator under a specific prior.
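The link can be checked directly for the linear case. The sketch below fits LinearRegression on simulated data and compares its coefficients with a least squares solution computed by hand with NumPy. The data-generating coefficients are arbitrary values chosen for the demonstration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)

# Ordinary least squares by hand: prepend an intercept column and solve.
design = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

print("sklearn:", model.intercept_, model.coef_)
print("by hand:", beta[0], beta[1:])
```

The two sets of numbers agree to floating-point precision, since both are solutions of the same least squares problem.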

The properties statisticians study, including consistency, unbiasedness, and efficiency, apply directly to these procedures. Regularization trades a little bias for lower variance, and cross-validation and hyperparameter tuning are how practitioners choose the size of that trade in pursuit of low test error, which is the bias-variance tradeoff by another name. The library API hides the mathematics behind a uniform interface, but the underlying concepts are the same ones Fisher named in 1922.

explain like I'm 5

Imagine you have a jar of jellybeans and you want to guess how many are inside without dumping them out. You could grab a small handful, count them, and multiply. That counting-and-multiplying procedure is an estimator. Different procedures might be more accurate (count two handfuls and average), more cautious (only count handfuls of a certain size), or more honest about their uncertainty (give a range instead of one number). In statistics, we study which procedures give the best guesses on average, which ones get closer to the truth as you grab more jellybeans, and which ones are not fooled when somebody slipped marbles into the jar.

references

Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society A, 222, 309-368. https://doi.org/10.1098/rsta.1922.0009
Aldrich, J. (2015). From evidence to understanding: a commentary on Fisher (1922). Philosophical Transactions of the Royal Society A, 373(2039). https://royalsocietypublishing.org/doi/10.1098/rsta.2014.0252
Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press.
Rao, C. R. (1945). Information and the accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society, 37, 81-91.
Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1), 73-101. https://doi.org/10.1214/aoms/1177703732
James, W., and Stein, C. (1961). Estimation with quadratic loss. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 1, 361-379.
Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1, 197-206.
Pearson, K. (1894). Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society A, 185, 71-110.
Stigler, S. M. (1981). Gauss and the invention of least squares. The Annals of Statistics, 9(3), 465-474.
Cramér-Rao bound. Wikipedia. https://en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93Rao_bound
James-Stein estimator. Wikipedia. https://en.wikipedia.org/wiki/James%E2%80%93Stein_estimator
Method of moments (statistics). Wikipedia. https://en.wikipedia.org/wiki/Method_of_moments_(statistics)
Least squares. Wikipedia. https://en.wikipedia.org/wiki/Least_squares
scikit-learn developers. (2024). Developing scikit-learn estimators. https://scikit-learn.org/stable/developers/develop.html
scikit-learn developers. (2024). sklearn.base.BaseEstimator API reference. https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html
Buja, A., et al. (2019). Models as approximations: a conspiracy of random regressors and model deviations against classical inference in regression. Statistical Science, 34(4), 545-565.
