See also: Machine learning terms
An estimator is a rule, function, or algorithm that takes observed data and produces a value intended to approximate some unknown quantity, typically a parameter of a probability distribution or a function relating inputs to outputs. The term carries two related but distinct meanings in modern usage. In classical statistics, an estimator is a procedure for inferring population parameters from a sample. In machine learning libraries such as scikit-learn, the word also names a software object that fits a model to data and exposes methods like fit and predict. The two senses are connected because library estimators perform parameter estimation under the hood, often using techniques developed by statisticians a century ago.
The formal study of estimators began with Ronald A. Fisher's 1922 paper On the Mathematical Foundations of Theoretical Statistics, which introduced consistency, efficiency, sufficiency, and the method of maximum likelihood as a unified framework for evaluating estimation procedures. Earlier work by Carl Friedrich Gauss, Adrien-Marie Legendre, and Karl Pearson had produced specific estimation techniques, but Fisher's paper gave the field its theoretical vocabulary. Subsequent contributions from Harald Cramér, C. R. Rao, Charles Stein, and Peter Huber extended the theory in directions that shaped both classical and modern statistical practice.
The earliest formal estimators predate the word itself. Adrien-Marie Legendre published the method of least squares in 1805, applying it to the shape of the Earth. Gauss claimed to have used the method since 1795 and famously deployed it in 1801 to predict the orbit of the newly discovered asteroid Ceres after Italian astronomer Giuseppe Piazzi lost track of it. Gauss connected least squares to the normal distribution and probability theory in his 1809 Theoria Motus, transforming what Legendre had presented as an algebraic curve-fitting trick into a statistical method.
Karl Pearson introduced the method of moments in 1894, applying it to fit skewed distributions to data on crab measurements. Pearson borrowed the term "moments" from physics and used the technique to estimate the parameters of his Pearson family of distributions, since maximum likelihood was not yet available.
Fisher's 1922 paper is generally regarded as the founding document of modern estimation theory. He introduced maximum likelihood estimation as a general principle, defined what it means for an estimator to be consistent and efficient, and proved that the sufficient statistic for a normal scale parameter is the sum of squared deviations rather than the sum of absolute deviations championed by astronomer Arthur Eddington. The paper also introduced the concept of Fisher information, which measures how much a sample tells us about an unknown parameter.
In 1945, C. R. Rao published a lower bound on the variance of any unbiased estimator, derived independently and almost simultaneously by Harald Cramér in his 1946 textbook Mathematical Methods of Statistics. The result, now called the Cramér-Rao bound, states that no unbiased estimator can have variance below the inverse of the Fisher information. The story goes that Rao, then 24 years old and teaching estimation at Calcutta University, was asked by a student why a Fisher result for large samples could not be proved for finite samples. Rao went home, worked through the night, and produced the inequality the next day.
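In symbols, writing $I(\theta)$ for the Fisher information carried by the sample, the bound states that every unbiased estimator $\hat\theta$ satisfies
$$\text{Var}(\hat\theta) \geq \frac{1}{I(\theta)}$$
with equality characterizing the efficient estimators discussed below.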
Charles Stein shocked the statistics community in 1956 by showing that the sample mean is inadmissible as an estimator of the mean vector of a multivariate normal distribution when the dimension is three or more. Willard James and Stein produced an explicit dominating estimator in 1961, now called the James-Stein estimator. The result is paradoxical because it implies that one can do better at estimating, say, the batting averages of three baseball players by shrinking each estimate toward a common value, even though the players have nothing to do with each other.
Peter Huber's 1964 paper Robust Estimation of a Location Parameter, published in the Annals of Mathematical Statistics, founded the field of robust statistics by introducing M-estimators. For a location parameter, Huber's estimator interpolates between the sample mean and the sample median, and M-estimators in general remain useful when the data contain outliers or when the assumed distribution is only approximately correct.
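The interpolation is easiest to see in Huber's loss: writing $u$ for a residual and $k > 0$ for a tuning constant,
$$\rho_k(u) = \begin{cases} \tfrac{1}{2}u^2 & \text{if } |u| \le k \\ k|u| - \tfrac{1}{2}k^2 & \text{if } |u| > k \end{cases}$$
so the loss is quadratic near zero and linear in the tails. As $k \to \infty$ the resulting estimator approaches the sample mean; as $k \to 0$ it approaches the sample median.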
Formally, an estimator is a measurable function from the sample space to the parameter space. If the data is a random sample $X_1, X_2, \ldots, X_n$ drawn from a distribution with unknown parameter $\theta$, an estimator $\hat\theta$ is any function $\hat\theta = T(X_1, \ldots, X_n)$ that returns a value in the same space as $\theta$. The estimator is itself a random variable because it depends on random data; the value it returns for a particular sample is called an estimate.
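A minimal NumPy sketch of this definition, using the sample mean as the estimator $T$ (the population parameters and the seed below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 5.0  # unknown in practice; fixed here so the estimates can be compared

def T(sample):
    # The estimator: a fixed rule mapping any sample to a value in the
    # parameter space (here, the sample mean as an estimate of theta).
    return sample.mean()

# Two samples from the same population give two different estimates,
# illustrating that the estimator is itself a random variable.
print(T(rng.normal(loc=theta, scale=1.0, size=50)))
print(T(rng.normal(loc=theta, scale=1.0, size=50)))
```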
A crucial distinction separates point estimators from interval estimators. A point estimator returns a single value, such as the sample mean as an estimate of the population mean. An interval estimator returns a range, such as a 95% confidence interval, that is intended to contain the true parameter with a specified probability. Confidence intervals were developed by Jerzy Neyman in 1937 as a frequentist counterpart to Bayesian credible intervals.
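A short sketch of both kinds of estimate for a population mean, using the large-sample normal approximation for the interval (the simulated data and the 1.96 critical value are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=2.0, size=200)  # simulated sample

point = x.mean()                      # point estimate of the population mean
se = x.std(ddof=1) / np.sqrt(len(x))  # standard error of the mean

# Approximate 95% confidence interval: point estimate +/- 1.96 standard errors.
interval = (point - 1.96 * se, point + 1.96 * se)
print(point, interval)
```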
Statisticians evaluate estimators against a small set of formal properties. No single estimator is best on all criteria; choice depends on the application and on what one is willing to assume about the data.
| Property | Definition | Why it matters |
|---|---|---|
| Unbiasedness | $E[\hat\theta] = \theta$ for all $\theta$ | The estimator is correct on average. Bias measures systematic error. |
| Consistency | $\hat\theta \xrightarrow{p} \theta$ as $n \to \infty$ | The estimator gets closer to the truth with more data. A weak but essential requirement. |
| Efficiency | Variance attains the Cramér-Rao bound | The estimator wastes no information. Efficient estimators have the lowest possible variance among unbiased estimators. |
| Sufficiency | $\hat\theta$ contains all sample information about $\theta$ | Reducing data to a sufficient statistic loses nothing. Fisher introduced this in 1922. |
| Robustness | Performance degrades gracefully under model misspecification or outliers | The estimator does not collapse when assumptions fail. Foundational to Huber's program. |
| Invariance | Estimator transforms predictably under reparametrization | Useful for choosing scales and units without changing inferences. |
The mean squared error decomposition combines bias and variance into a single criterion:
$$\text{MSE}(\hat\theta) = E[(\hat\theta - \theta)^2] = \text{Bias}(\hat\theta)^2 + \text{Var}(\hat\theta)$$
This decomposition motivates the bias-variance tradeoff that pervades modern machine learning. A slightly biased estimator can have lower MSE than the best unbiased one, and the James-Stein estimator is the canonical example. Ridge regression and most regularization techniques exploit this tradeoff explicitly.
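A small Monte Carlo check of the decomposition and the tradeoff, comparing the unbiased sample variance (divide by $n-1$) with the slightly biased maximum likelihood version (divide by $n$) on simulated normal data; the sample size and replication count below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
true_var, n, reps = 4.0, 10, 200_000

samples = rng.normal(0.0, np.sqrt(true_var), size=(reps, n))
estimators = {
    "unbiased (n-1)": samples.var(axis=1, ddof=1),
    "biased MLE (n)": samples.var(axis=1, ddof=0),
}

for name, est in estimators.items():
    bias, var = est.mean() - true_var, est.var()
    mse = np.mean((est - true_var) ** 2)
    # mse agrees with bias**2 + var up to Monte Carlo error,
    # and the biased estimator ends up with the lower MSE.
    print(name, round(bias, 3), round(var, 3), round(mse, 3), round(bias**2 + var, 3))
```

For normal data, the divide-by-$n$ estimator accepts a small bias in exchange for a larger reduction in variance, the same trade that ridge regression makes at larger scale.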
A handful of estimation principles dominate practice. Each can be derived from a different criterion, and each has characteristic strengths.
| Estimator | Principle | Year introduced | Notable property |
|---|---|---|---|
| Ordinary least squares (OLS) | Minimize sum of squared residuals | Legendre 1805, Gauss 1809 | Best linear unbiased estimator under Gauss-Markov assumptions |
| Method of moments (MoM) | Equate sample moments to theoretical moments | Pearson 1894 | Simple, consistent, often inefficient |
| Maximum likelihood (MLE) | Maximize the likelihood function $L(\theta; x)$ | Fisher 1912, formalized 1922 | Asymptotically efficient and consistent under regularity |
| Maximum a posteriori (MAP) | Maximize the posterior $p(\theta \mid x)$ | Twentieth-century Bayesian revival | Adds a prior; reduces to MLE under a flat prior |
| Bayes estimator | Posterior mean, median, or mode minimizing expected loss | Eighteenth-century origins, formalized twentieth century | Optimal for the chosen loss function |
| M-estimator | Minimize a generalized loss $\sum \rho(x_i, \theta)$ | Huber 1964 | Robust to outliers and model misspecification |
| James-Stein | Shrink sample mean toward a common point | James and Stein 1961 | Biased but dominates the sample mean for dimension $\geq 3$ |
| Generalized method of moments (GMM) | Match a vector of moment conditions | Hansen 1982 | Standard in econometrics; nests OLS, MLE, MoM as special cases |
Maximum likelihood is the default in much of modern statistics because it is asymptotically efficient under mild conditions: as the sample size grows, no other consistent estimator achieves lower variance. The method also generalizes naturally to complex models, including the deep neural networks trained today by minimizing negative log-likelihood (cross-entropy) loss.
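As a concrete sketch, the maximum likelihood estimate of an exponential rate parameter can be obtained once in closed form and once by numerically minimizing the negative log-likelihood (the true rate and sample size below are arbitrary):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
rate = 2.5                                        # true parameter, unknown in practice
x = rng.exponential(scale=1.0 / rate, size=1000)  # observed sample

# Closed-form MLE for the exponential rate: the reciprocal of the sample mean.
closed_form = 1.0 / x.mean()

# The same estimate found by minimizing the negative log-likelihood
#   -log L(lambda; x) = -(n log lambda - lambda * sum(x)).
def neg_log_likelihood(lam):
    return -(len(x) * np.log(lam) - lam * x.sum())

numeric = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100.0), method="bounded").x
print(closed_form, numeric)  # both close to the true rate of 2.5
```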
The distinction between parametric and non-parametric estimation runs through the entire field.
A parametric model assumes the data was generated by a distribution that belongs to a family indexed by a finite-dimensional parameter. The job of the estimator is to pin down that parameter. Linear regression, logistic regression, Gaussian mixture models, and most classical statistical methods are parametric. The advantage is statistical efficiency: when the model is correct, parametric estimators converge fast and produce tight confidence intervals. The risk is misspecification, since a wrong family produces misleading results no matter how much data is available.
A non-parametric model does not commit to a fixed parameter count. The number of effective parameters often grows with the sample size. Kernel density estimation, k-nearest neighbors, decision trees, and Gaussian processes are non-parametric. These methods make weaker assumptions about the data-generating process and are more flexible, but they typically converge more slowly and require more data to reach a given accuracy.
Popular parametric estimators in machine learning include linear regression, logistic regression, naive Bayes, and Gaussian mixture models.
Popular non-parametric estimators include kernel density estimation (sketched below), k-nearest neighbors, decision trees and random forests, and Gaussian processes.
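A brief non-parametric sketch using scikit-learn's KernelDensity; the bimodal sample and the bandwidth of 0.4 are illustrative choices, and the estimator assumes no parametric family:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# A bimodal sample that no single Gaussian would fit well.
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 700)])[:, None]

# Fit a Gaussian kernel density estimate; the effective complexity grows
# with the sample rather than being fixed in advance.
kde = KernelDensity(kernel="gaussian", bandwidth=0.4).fit(x)

grid = np.linspace(-5, 7, 7)[:, None]
print(np.exp(kde.score_samples(grid)))  # estimated density at a few grid points
```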
In practice, the line is blurry. Many models are semi-parametric, combining a finite-dimensional parameter of interest with an infinite-dimensional nuisance component. The Cox proportional hazards model is a classic example.
The word estimator has a specific technical meaning in scikit-learn, the dominant Python machine learning library, and by extension across many libraries that adopt its API conventions.
A scikit-learn estimator is any object that implements a fit method to learn from data. The base class sklearn.base.BaseEstimator provides the common machinery for getting and setting hyperparameters. By convention, an estimator's __init__ does no work beyond storing the hyperparameters it is given: they are not validated there, and no learned state is created until fit is called. After fitting, the estimator stores learned parameters in attributes whose names end with an underscore (such as coef_ for linear models or cluster_centers_ for KMeans). This trailing-underscore convention is the visible signal that an attribute exists only after fit.
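A minimal sketch of these conventions in a hypothetical custom estimator (the class name MeanRegressor and its clip_negative hyperparameter are illustrative, not part of scikit-learn):

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin

class MeanRegressor(BaseEstimator, RegressorMixin):
    """Toy regressor that always predicts the training mean of y."""

    def __init__(self, clip_negative=False):
        # __init__ only records hyperparameters; no validation, no learning.
        self.clip_negative = clip_negative

    def fit(self, X, y):
        # Learned state lives in trailing-underscore attributes created here.
        self.mean_ = float(np.mean(y))
        if self.clip_negative:
            self.mean_ = max(self.mean_, 0.0)
        return self  # returning self allows chained calls like est.fit(X, y).predict(X)

    def predict(self, X):
        return np.full(len(X), self.mean_)
```

Because the class follows the conventions, it works unchanged inside pipelines, cross-validation loops, and hyperparameter searches.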
Scikit-learn estimators specialize into a small number of categories, each defined by an additional mixin class.
| Category | Mixin | Required methods | Examples |
|---|---|---|---|
| Classifier | ClassifierMixin | fit, predict, often predict_proba | LogisticRegression, RandomForestClassifier, SVC |
| Regressor | RegressorMixin | fit, predict | LinearRegression, Ridge, GradientBoostingRegressor |
| Transformer | TransformerMixin | fit, transform, fit_transform | StandardScaler, PCA, OneHotEncoder |
| Cluster | ClusterMixin | fit, fit_predict, labels_ attribute | KMeans, DBSCAN, AgglomerativeClustering |
| Density estimator | (no canonical mixin) | fit, score_samples | KernelDensity, GaussianMixture |
| Outlier detector | OutlierMixin | fit, predict, decision_function | IsolationForest, LocalOutlierFactor |
The uniformity of the API has practical consequences. Because every estimator follows the same conventions, scikit-learn can compose them into pipelines, run cross-validation over arbitrary models, and search hyperparameter spaces with GridSearchCV or RandomizedSearchCV. The same loop that trains a logistic regression also trains a gradient boosting model, since both expose fit and predict.
A Pipeline chains transformers and a final estimator into a single object that itself looks like an estimator. Calling fit on the pipeline calls fit_transform on each transformer in sequence and fit on the final estimator. This pattern prevents data leakage, since transformations are learned on training folds and applied to test folds without contamination.
Meta-estimators wrap other estimators to add behavior. GridSearchCV wraps an estimator to perform exhaustive hyperparameter search via cross-validation. BaggingClassifier wraps a base classifier to produce an ensemble. MultiOutputRegressor wraps a single-output regressor to handle multi-output targets. Because all of these objects respect the estimator interface, they nest arbitrarily.
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Example data so the snippet runs end to end.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hyperparameters of nested steps are addressed with the step__param syntax.
params = {"clf__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, params, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```
The pipeline is itself an estimator, the grid search is an estimator wrapping that estimator, and the whole stack exposes fit, predict, and score. The composability comes directly from the discipline of the base interface.
The statistical and software senses of estimator are linked by what happens inside fit. When LinearRegression.fit runs, it computes the ordinary least squares solution that Gauss and Legendre worked out two centuries ago. When LogisticRegression.fit runs, it computes a maximum likelihood estimate of the coefficients, typically via the Newton-Raphson or L-BFGS algorithm. When GaussianMixture.fit runs, it iterates the Expectation-Maximization algorithm to find an MLE of the mixture parameters. When BayesianRidge.fit runs, it computes a Bayes estimator under a specific prior.
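A short check of that link, comparing LinearRegression.fit with the least squares solution computed directly from the normal equations (the simulated design and coefficients below are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# The estimator object's fit method...
model = LinearRegression().fit(X, y)

# ...and the classical OLS solution from the normal equations, with an intercept column.
X1 = np.column_stack([np.ones(len(X)), X])
beta = np.linalg.solve(X1.T @ X1, X1.T @ y)

print(np.allclose(model.intercept_, beta[0]), np.allclose(model.coef_, beta[1:]))
```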
The properties statisticians study, including consistency, unbiasedness, and efficiency, apply directly to these procedures. Cross-validation, hyperparameter tuning, and regularization are all techniques that trade bias for variance in pursuit of low test error, which is the bias-variance tradeoff by another name. The library API hides the mathematics behind a uniform interface, but the underlying concepts are the same ones Fisher named in 1922.
Imagine you have a jar of jellybeans and you want to guess how many are inside without dumping them out. You could grab a small handful, count them, and multiply. That counting-and-multiplying procedure is an estimator. Different procedures might be more accurate (count two handfuls and average), more cautious (only count handfuls of a certain size), or more honest about their uncertainty (give a range instead of one number). In statistics, we study which procedures give the best guesses on average, which ones get closer to the truth as you grab more jellybeans, and which ones are not fooled when somebody slipped marbles into the jar.
sklearn.base.BaseEstimator API reference. https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html