See also: Machine learning terms, Model evaluation
In machine learning, a baseline is a deliberately simple model that serves as a reference point for evaluating the performance of more sophisticated approaches. The purpose of a baseline is not to produce the best possible predictions. Instead, it establishes a minimum performance threshold that any more complex model should exceed to justify its added complexity. Baselines can be constructed using simple statistical rules, random selection, heuristic methods, or well-known standard algorithms, depending on the problem domain and available data.
The concept is foundational to responsible model development. Every machine learning project should begin with a baseline because it provides the context needed to interpret results. Without one, a reported accuracy of 95% is meaningless; if a trivial majority-class predictor also achieves 95%, the complex model has added no real value.
Baselines serve several critical functions throughout the machine learning workflow.
A baseline provides a concrete number against which to measure every subsequent experiment. If a newly trained model does not outperform the baseline, the added complexity of that model is not justified. This prevents practitioners from mistakenly believing a sophisticated model is performing well when it merely matches what a trivial rule could achieve.
Baselines act as sanity checks for the entire modeling pipeline. If a complex deep learning model performs worse than a mean predictor, something is likely wrong with the data preprocessing, feature engineering, label encoding, or training procedure. The baseline exposes these problems early, before significant time and compute resources are wasted.
When trained models consistently fail to outperform a baseline, this signals that the dataset may lack predictive power for the given task. Rather than continuing to build increasingly complex models, practitioners should investigate data quality, check for label noise, or reconsider whether the chosen features carry useful information.
Baselines guide the allocation of modeling effort. A strong baseline that already performs well may indicate that diminishing returns will come from further model complexity. Conversely, a large gap between a baseline and human performance suggests there is substantial room for improvement, justifying investment in more advanced methods.
In academic research, baselines provide the context that makes experimental results interpretable. A claim that a new architecture achieves 88% accuracy is only meaningful when compared against baselines. Reviewers and readers use baseline comparisons to assess whether the proposed method offers genuine improvement or merely incremental gains.
For classification tasks, several standard baselines are widely used.
| Baseline Strategy | Description | Expected Performance |
|---|---|---|
| Majority class (most frequent) | Always predicts the most common class label in the training set | Equals the proportion of the majority class |
| Random uniform | Assigns each class with equal probability regardless of class distribution | Approximately 1/k for k classes |
| Stratified random | Samples predictions according to the observed class distribution in the training data | Expected accuracy equals the sum of squared class proportions |
| Prior probability | Uses the empirical class distribution for predict_proba() and the most frequent class for predict() | Same as majority class for hard predictions |
| Constant class | Always predicts a single user-specified class | Useful for cost-sensitive evaluation of specific classes |
The majority-class baseline is especially important for imbalanced datasets. For example, in a fraud detection problem where only 1% of transactions are fraudulent, a model that always predicts "not fraud" achieves 99% accuracy. This makes the baseline essential for understanding whether a more complex model has actually learned to detect the rare class.
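As a sketch of this effect, the snippet below builds a synthetic dataset with a roughly 99:1 class ratio (the labels and feature values are made up purely for illustration) and scores a majority-class baseline using scikit-learn's DummyClassifier, which is described in more detail later in this article:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic fraud-like labels: roughly 99% "not fraud" (0), 1% "fraud" (1)
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.01).astype(int)
X = rng.normal(size=(10_000, 5))  # features are ignored by the dummy model

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

print(accuracy_score(y, pred))  # ~0.99, which looks impressive in isolation
print(recall_score(y, pred))    # 0.0: the rare "fraud" class is never detected
```

The near-perfect accuracy and zero recall together show why a complex model must be judged against this trivial reference rather than against accuracy alone.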
For regression tasks, baselines typically predict a constant value derived from the training data.
| Baseline Strategy | Description | R-squared Score |
|---|---|---|
| Mean predictor | Always predicts the mean of the target variable from the training set | 0.0 by definition |
| Median predictor | Always predicts the median of the target variable | Near 0.0; more robust to outliers |
| Quantile predictor | Always predicts a user-specified quantile | Useful for asymmetric loss scenarios |
| Constant predictor | Always predicts a user-specified constant | Depends on the chosen constant |
The mean predictor is the most common regression baseline because the R-squared metric is defined relative to it. An R-squared of 0.0 means the model performs no better than predicting the mean; an R-squared below 0.0 means the model performs worse than the mean. This makes the mean predictor a natural anchor for regression evaluation.
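In symbols, using the standard definition implemented by scikit-learn and most other libraries:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

Substituting the mean predictor $\hat{y}_i = \bar{y}$ makes the numerator equal the denominator, so the score is exactly 0.0 whenever the mean is computed on the same data being scored.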
Different machine learning subfields have established their own standard baselines that go beyond simple statistical rules.
In natural language processing, text classification tasks commonly use TF-IDF combined with logistic regression as a baseline. This approach converts text into numerical feature vectors using term frequency-inverse document frequency weighting, then trains a logistic regression classifier. Despite its simplicity, this combination is remarkably effective: on some standard text classification benchmarks it has been reported to reach accuracy around 0.98 with training time under one second.
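A minimal sketch of this baseline using scikit-learn's TfidfVectorizer and LogisticRegression; the toy spam corpus and labels below are placeholders for a real dataset:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus standing in for a real labeled text dataset
texts = ["free prize, claim now", "meeting moved to 3pm",
         "win cash instantly", "lunch tomorrow?"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(texts, labels)
print(baseline.predict(["claim your free cash"]))  # likely [1]
```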
Another common NLP baseline is the bag of words representation paired with a simple linear classifier. Bag of words counts word occurrences without regard to order, providing a straightforward but interpretable feature set. While it discards word order and context, it remains a useful lower bound for evaluating whether more complex models like transformers or recurrent neural networks are adding genuine value.
For sequence labeling and generation tasks, simple n-gram models or frequency-based predictors serve as baselines.
In computer vision, baseline choices depend on the complexity of the task. For image classification, a simple convolutional neural network (CNN) with a few convolutional and pooling layers serves as a minimal learned baseline. This approach captures basic spatial patterns without the depth and sophistication of modern architectures.
More commonly, practitioners use pretrained models like ResNet-50 or VGG with transfer learning as stronger baselines. These models, pretrained on the ImageNet dataset, provide learned feature representations that can be fine-tuned on new tasks with relatively little data. ResNet-50 has become a de facto standard baseline for image classification because it is well-documented, widely available in all major frameworks, and delivers solid accuracy across a variety of tasks.
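A hedged sketch of this transfer-learning pattern using torchvision (assuming torchvision ≥ 0.13 for the weights enum; the 10-class output layer is an arbitrary placeholder for a real task):

```python
import torch.nn as nn
from torchvision import models

# Load ResNet-50 with ImageNet-pretrained weights as the baseline backbone
weights = models.ResNet50_Weights.IMAGENET1K_V2
model = models.resnet50(weights=weights)

# Replace the final fully connected layer for a hypothetical 10-class task
model.fc = nn.Linear(model.fc.in_features, 10)

# Optionally freeze the backbone so only the new classification head is trained
for name, param in model.named_parameters():
    if not name.startswith("fc"):
        param.requires_grad = False
```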
For structured, tabular data problems, gradient boosting frameworks such as XGBoost, LightGBM, or CatBoost are commonly used as strong baselines. Linear regression and logistic regression serve as simpler baselines. Decision trees and random forests also provide interpretable reference points.
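As a sketch of how these reference points can be lined up, the snippet below fits baselines of increasing complexity on a built-in tabular dataset; HistGradientBoostingClassifier (a stable estimator in scikit-learn ≥ 1.0) stands in here for external libraries such as XGBoost or LightGBM:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baselines of increasing complexity, evaluated on the same held-out split
for model in (DummyClassifier(strategy="most_frequent"),
              LogisticRegression(max_iter=5000),
              HistGradientBoostingClassifier(random_state=0)):
    score = model.fit(X_train, y_train).score(X_test, y_test)
    print(type(model).__name__, round(score, 3))
```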
Human-level performance is one of the most meaningful baselines in machine learning. It represents the accuracy that a trained human achieves on the same task the model is trying to solve. This baseline is especially useful because it provides an upper bound estimate of how well any system might perform on tasks involving inherently subjective or ambiguous judgments.
In image classification, human performance on the ImageNet benchmark was measured at approximately 5.1% top-5 error rate. When AI systems surpassed this level in 2015, it was a significant milestone. Similarly, on reading comprehension benchmarks like SQuAD, human performance scores serve as targets that models aim to match or exceed.
However, measuring human performance rigorously is itself challenging. The machine learning community has lacked a standardized framework for evaluating human baselines. Studies often report human performance without documenting participant count, recruitment methods, expertise levels, or measures of inter-annotator variability. This makes some human baseline numbers less reliable than they appear.
With the rise of large language models (LLMs), zero-shot performance of models like GPT-4 or Claude has become a new category of baseline for many NLP tasks. In zero-shot evaluation, the model is given a task description and must perform the task without any training examples, relying entirely on knowledge acquired during pretraining.
Research has shown that LLMs are capable zero-shot reasoners. The seminal paper "Large Language Models are Zero-Shot Reasoners" (Kojima et al., 2022) demonstrated that adding the prompt "Let's think step by step" before each answer dramatically improved accuracy on reasoning benchmarks, increasing performance on MultiArith from 17.7% to 78.7% with the InstructGPT model.
Zero-shot LLM baselines are increasingly used in research papers to contextualize the performance of specialized, fine-tuned models. If a purpose-built classifier cannot outperform a general-purpose LLM used zero-shot, this raises questions about the value of the specialized approach.
In academic machine learning research, the previous state-of-the-art (SOTA) result on a benchmark dataset is the primary baseline against which new methods are evaluated. Platforms like Papers With Code track SOTA results across thousands of benchmarks and tasks, making it straightforward to identify the current best-performing method.
When presenting a new model or technique, researchers are expected to compare against the best previously published results on the same dataset using the same evaluation protocol. This ensures that claimed improvements are measured against the strongest known competition rather than weak or outdated methods.
However, the CMU Machine Learning Blog has highlighted a persistent problem with weak baselines in research. Studies have shown that in some fields, previously "poorly performing" baseline methods, when properly tuned, outperformed recently published methods that claimed significant advances. In information retrieval, one analysis found that 60% of proposed models performed worse than an untuned RM3 baseline from earlier research, suggesting illusory progress.
Scikit-learn provides two dedicated classes for constructing baseline models: DummyClassifier and DummyRegressor. These tools make it easy to generate baseline predictions programmatically and compare them against real models using the same evaluation pipeline.
DummyClassifier makes predictions that completely ignore input features. Its behavior is controlled by the strategy parameter.
| Strategy | Behavior |
|---|---|
| most_frequent | Always predicts the most frequent class from the training set |
| prior (default) | Returns the most frequent class for predict(); returns the empirical class distribution for predict_proba() |
| stratified | Randomly samples classes according to the training class distribution |
| uniform | Predicts each class with equal probability |
| constant | Always predicts a user-specified class label |
Example usage:
import numpy as np
from sklearn.dummy import DummyClassifier
X = np.array([[0], [1], [2], [3]])
y = np.array([0, 1, 1, 1])
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X, y)
print(dummy.predict(X)) # [1, 1, 1, 1]
print(dummy.score(X, y)) # 0.75
DummyRegressor follows the same pattern for regression tasks.
| Strategy | Behavior |
|---|---|
| mean (default) | Always predicts the mean of the training targets |
| median | Always predicts the median of the training targets |
| quantile | Always predicts a specified quantile (between 0.0 and 1.0) |
| constant | Always predicts a user-specified constant value |
Example usage:
import numpy as np
from sklearn.dummy import DummyRegressor
X = np.array([[1], [2], [3], [4]])
y = np.array([2.0, 3.0, 5.0, 10.0])
dummy = DummyRegressor(strategy="mean")
dummy.fit(X, y)
print(dummy.predict(X)) # [5.0, 5.0, 5.0, 5.0]
print(dummy.score(X, y)) # 0.0
Both classes integrate seamlessly with scikit-learn's model selection tools, including cross_val_score, GridSearchCV, and Pipeline, making it straightforward to include baseline comparisons in any modeling workflow.
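For example, a baseline can be dropped into exactly the same cross-validation call as a candidate model; the snippet below is a minimal sketch on a built-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Identical evaluation protocol for the baseline and the candidate model
baseline_scores = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5)
model_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(baseline_scores.mean())  # ~0.33 for three balanced classes
print(model_scores.mean())     # substantially higher if the features are informative
```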
Proper baseline reporting is essential for reproducible and meaningful machine learning research. Several best practices have emerged in the community.
Papers should include multiple baselines spanning different levels of complexity. At minimum, this means including a naive baseline (majority class, mean predictor), a standard algorithm baseline (logistic regression, linear regression), and the previous SOTA method. This gives readers a complete picture of where the proposed method falls on the simplicity-to-complexity spectrum.
Baseline models must be tuned fairly. A common methodological flaw is to extensively tune a proposed method while using default hyperparameters for baselines. This creates an unfair comparison that exaggerates the proposed method's advantage. Best practice requires applying the same tuning effort (for example, grid search or random search) to both baselines and proposed methods.
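The following sketch illustrates the principle by giving a simple baseline and a "proposed" model the same GridSearchCV treatment; the dataset, models, and grids are illustrative, not prescriptive:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# The same tuning procedure is applied to the baseline and to the proposed model
baseline_search = GridSearchCV(LogisticRegression(max_iter=5000),
                               {"C": [0.01, 0.1, 1, 10]}, cv=5)
proposed_search = GridSearchCV(RandomForestClassifier(random_state=0),
                               {"n_estimators": [100, 300], "max_depth": [None, 5]}, cv=5)

print(baseline_search.fit(X, y).best_score_)
print(proposed_search.fit(X, y).best_score_)
```

Beyond tuning, several reporting elements should accompany any baseline comparison: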
| Reporting Element | Description |
|---|---|
| Hyperparameters | All hyperparameter values for both baselines and proposed models |
| Random seeds | Fixed seeds for reproducibility of stochastic baselines |
| Data splits | Exact train/validation/test splits or cross-validation folds |
| Preprocessing | Identical preprocessing applied to baselines and proposed methods |
| Statistical significance | Confidence intervals or significance tests, not just point estimates |
| Compute resources | Hardware and training time for fair comparison |
The machine learning community has identified a persistent problem with weak or poorly tuned baselines inflating the apparent contribution of new methods. The CMU ML Blog notes that weak baselines create "the false belief that some methods are performing well when they are not." Reviewers are increasingly encouraged to scrutinize baseline validity, and some conferences now require authors to document their baseline tuning efforts.
The choice of performance metric depends on the task type and the specific goals of the evaluation.
| Task Type | Common Metrics |
|---|---|
| Classification | Accuracy, precision, recall, F1 score, AUC-ROC |
| Regression | Mean squared error (MSE), Mean absolute error (MAE), R-squared |
| Ranking | NDCG, MAP, MRR |
| Generation | BLEU, ROUGE, perplexity |
When comparing against baselines, it is important to report the same metrics for both the baseline and the proposed model, computed on the same test set. Reporting multiple metrics provides a more complete view of model performance than any single number.
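For instance, the same set of metrics can be computed for a baseline and a candidate model on a shared test split; the snippet below is a minimal sketch in which the dataset and models are placeholders:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same test set, same metrics, for both the baseline and the candidate model
for model in (DummyClassifier(strategy="most_frequent"),
              LogisticRegression(max_iter=5000)):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    print(type(model).__name__,
          round(accuracy_score(y_test, pred), 3),
          round(f1_score(y_test, pred), 3),
          round(roc_auc_score(y_test, proba), 3))
```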
Imagine you are trying to guess how many jellybeans are in a jar. You do not know the exact number, but you can make a simple guess, like saying there are 100 jellybeans. This simple guess is your baseline.
Now your friend uses a magnifying glass, counts the layers, and estimates 147 jellybeans. If the real answer is 150, your friend's method is much better than your simple guess. But if the real answer is 102, your friend's fancy method (off by 45) actually did worse than your simple guess (off by only 2).
In machine learning, a baseline works the same way. It is the simplest possible guess. Scientists build complicated models and then check: "Is my fancy model actually better than the simple guess?" If it is not much better, then all that extra work was not worth it. The baseline helps everyone understand whether a new idea is truly an improvement or just unnecessary complexity.