See also: Machine learning terms, Model evaluation
In machine learning, a baseline is a deliberately simple model that serves as a reference point for evaluating the performance of more sophisticated approaches. The purpose of a baseline is not to produce the best possible predictions. Instead, it establishes a minimum performance threshold that any more complex model should exceed to justify its added complexity. Baselines can be constructed using simple statistical rules, random selection, heuristic methods, or well-known standard algorithms, depending on the problem domain and available data.
The concept is foundational to responsible model development. Every machine learning project should begin with a baseline because it provides the context needed to interpret results. Without one, a reported accuracy of 95% is meaningless; if a trivial majority-class predictor also achieves 95%, the complex model has added no real value.
Baselines serve several critical functions throughout the machine learning workflow.
A baseline provides a concrete number against which to measure every subsequent experiment. If a newly trained model does not outperform the baseline, the added complexity of that model is not justified. This prevents practitioners from mistakenly believing a sophisticated model is performing well when it merely matches what a trivial rule could achieve.
Baselines act as sanity checks for the entire modeling pipeline. If a complex deep learning model performs worse than a mean predictor, something is likely wrong with the data preprocessing, feature engineering, label encoding, or training procedure. The baseline exposes these problems early, before significant time and compute resources are wasted.
When trained models consistently fail to outperform a baseline, this signals that the dataset may lack predictive power for the given task. Rather than continuing to build increasingly complex models, practitioners should investigate data quality, check for label noise, or reconsider whether the chosen features carry useful information.
Baselines guide the allocation of modeling effort. A strong baseline that already performs well may indicate that diminishing returns will come from further model complexity. Conversely, a large gap between a baseline and human performance suggests there is substantial room for improvement, justifying investment in more advanced methods.
In academic research, baselines provide the context that makes experimental results interpretable. A claim that a new architecture achieves 88% accuracy is only meaningful when compared against baselines. Reviewers and readers use baseline comparisons to assess whether the proposed method offers genuine improvement or merely incremental gains.
For classification tasks, several standard baselines are widely used.
| Baseline Strategy | Description | Expected Performance |
|---|---|---|
| Majority class (most frequent) | Always predicts the most common class label in the training set | Equals the proportion of the majority class |
| Random uniform | Assigns each class with equal probability regardless of class distribution | Approximately 1/k for k classes |
| Stratified random | Samples predictions according to the observed class distribution in the training data | Expected accuracy equals the sum of squared class proportions |
| Prior probability | Uses the empirical class distribution for predict_proba() and the most frequent class for predict() | Same as majority class for hard predictions |
| Constant class | Always predicts a single user-specified class | Useful for cost-sensitive evaluation of specific classes |
The majority-class baseline is especially important for imbalanced datasets. For example, in a fraud detection problem where only 1% of transactions are fraudulent, a model that always predicts "not fraud" achieves 99% accuracy. This makes the baseline essential for understanding whether a more complex model has actually learned to detect the rare class.
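As a sketch of this effect, the snippet below builds a synthetic dataset with a roughly 99:1 class ratio (the labels and feature values are made up purely for illustration) and scores a majority-class baseline using scikit-learn's DummyClassifier, which is described in more detail later in this article:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic fraud-like labels: roughly 99% "not fraud" (0), 1% "fraud" (1)
rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.01).astype(int)
X = rng.normal(size=(10_000, 5))  # features are ignored by the dummy model

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

print(accuracy_score(y, pred))  # ~0.99, which looks impressive in isolation
print(recall_score(y, pred))    # 0.0: the rare "fraud" class is never detected
```

The near-perfect accuracy and zero recall together show why a complex model must be judged against this trivial reference rather than against accuracy alone.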
For regression tasks, baselines typically predict a constant value derived from the training data.
| Baseline Strategy | Description | R-squared Score |
|---|---|---|
| Mean predictor | Always predicts the mean of the target variable from the training set | 0.0 by definition |
| Median predictor | Always predicts the median of the target variable | Near 0.0; more robust to outliers |
| Quantile predictor | Always predicts a user-specified quantile | Useful for asymmetric loss scenarios |
| Constant predictor | Always predicts a user-specified constant | Depends on the chosen constant |
The mean predictor is the most common regression baseline because the R-squared metric is defined relative to it. An R-squared of 0.0 means the model performs no better than predicting the mean; an R-squared below 0.0 means the model performs worse than the mean. This makes the mean predictor a natural anchor for regression evaluation.
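In symbols, using the standard definition implemented by scikit-learn and most other libraries:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

Substituting the mean predictor $\hat{y}_i = \bar{y}$ makes the numerator equal the denominator, so the score is exactly 0.0 whenever the mean is computed on the same data being scored.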
Different machine learning subfields have established their own standard baselines that go beyond simple statistical rules.
In natural language processing, text classification tasks commonly use TF-IDF combined with logistic regression as a baseline. This approach converts text into numerical feature vectors using term frequency-inverse document frequency weighting, then trains a logistic regression classifier. Despite its simplicity, this combination is remarkably effective: on some standard text classification benchmarks it has been reported to reach accuracy around 0.98 with training time under one second.
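A minimal sketch of this baseline using scikit-learn's TfidfVectorizer and LogisticRegression; the toy spam corpus and labels below are placeholders for a real dataset:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus standing in for a real labeled text dataset
texts = ["free prize, claim now", "meeting moved to 3pm",
         "win cash instantly", "lunch tomorrow?"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
baseline.fit(texts, labels)
print(baseline.predict(["claim your free cash"]))  # likely [1]
```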
Another common NLP baseline is the bag of words representation paired with a simple linear classifier. Bag of words counts word occurrences without regard to order, providing a straightforward but interpretable feature set. While it discards word order and context, it remains a useful lower bound for evaluating whether more complex models like transformers or recurrent neural networks are adding genuine value.
For sequence labeling and generation tasks, simple n-gram models or frequency-based predictors serve as baselines.
In computer vision, baseline choices depend on the complexity of the task. For image classification, a simple convolutional neural network (CNN) with a few convolutional and pooling layers serves as a minimal learned baseline. This approach captures basic spatial patterns without the depth and sophistication of modern architectures.
More commonly, practitioners use pretrained models like ResNet-50 or VGG with transfer learning as stronger baselines. These models, pretrained on the ImageNet dataset, provide learned feature representations that can be fine-tuned on new tasks with relatively little data. ResNet-50 has become a de facto standard baseline for image classification because it is well-documented, widely available in all major frameworks, and delivers solid accuracy across a variety of tasks.
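A hedged sketch of this transfer-learning pattern using torchvision (assuming torchvision ≥ 0.13 for the weights enum; the 10-class output layer is an arbitrary placeholder for a real task):

```python
import torch.nn as nn
from torchvision import models

# Load ResNet-50 with ImageNet-pretrained weights as the baseline backbone
weights = models.ResNet50_Weights.IMAGENET1K_V2
model = models.resnet50(weights=weights)

# Replace the final fully connected layer for a hypothetical 10-class task
model.fc = nn.Linear(model.fc.in_features, 10)

# Optionally freeze the backbone so only the new classification head is trained
for name, param in model.named_parameters():
    if not name.startswith("fc"):
        param.requires_grad = False
```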
For structured, tabular data problems, gradient boosting frameworks such as XGBoost, LightGBM, or CatBoost are commonly used as strong baselines. Linear regression and logistic regression serve as simpler baselines. Decision trees and random forests also provide interpretable reference points.
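As a sketch of how these reference points can be lined up, the snippet below fits baselines of increasing complexity on a built-in tabular dataset; HistGradientBoostingClassifier (a stable estimator in scikit-learn ≥ 1.0) stands in here for external libraries such as XGBoost or LightGBM:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baselines of increasing complexity, evaluated on the same held-out split
for model in (DummyClassifier(strategy="most_frequent"),
              LogisticRegression(max_iter=5000),
              HistGradientBoostingClassifier(random_state=0)):
    score = model.fit(X_train, y_train).score(X_test, y_test)
    print(type(model).__name__, round(score, 3))
```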
Human-level performance is one of the most meaningful baselines in machine learning. It represents the accuracy that a trained human achieves on the same task the model is trying to solve. This baseline is especially useful because it provides an upper bound estimate of how well any system might perform on tasks involving inherently subjective or ambiguous judgments.
In image classification, human performance on the ImageNet benchmark was measured at approximately 5.1% top-5 error rate. When AI systems surpassed this level in 2015, it was a significant milestone. Similarly, on reading comprehension benchmarks like SQuAD, human performance scores serve as targets that models aim to match or exceed.
However, measuring human performance rigorously is itself challenging. The machine learning community has lacked a standardized framework for evaluating human baselines. Studies often report human performance without documenting participant count, recruitment methods, expertise levels, or measures of inter-annotator variability. This makes some human baseline numbers less reliable than they appear.
With the rise of large language models (LLMs), zero-shot performance of models like GPT-4 or Claude has become a new category of baseline for many NLP tasks. In zero-shot evaluation, the model is given a task description and must perform the task without any training examples, relying entirely on knowledge acquired during pretraining.
Research has shown that LLMs are capable zero-shot reasoners. The seminal paper "Large Language Models are Zero-Shot Reasoners" (Kojima et al., 2022) demonstrated that adding the prompt "Let's think step by step" before each answer dramatically improved accuracy on reasoning benchmarks, increasing performance on MultiArith from 17.7% to 78.7% with the InstructGPT model.
Zero-shot LLM baselines are increasingly used in research papers to contextualize the performance of specialized, fine-tuned models. If a purpose-built classifier cannot outperform a general-purpose LLM used zero-shot, this raises questions about the value of the specialized approach.
In academic machine learning research, the previous state-of-the-art (SOTA) result on a benchmark dataset is the primary baseline against which new methods are evaluated. Platforms like Papers With Code track SOTA results across thousands of benchmarks and tasks, making it straightforward to identify the current best-performing method.
When presenting a new model or technique, researchers are expected to compare against the best previously published results on the same dataset using the same evaluation protocol. This ensures that claimed improvements are measured against the strongest known competition rather than weak or outdated methods.
However, the CMU Machine Learning Blog has highlighted a persistent problem with weak baselines in research. Studies have shown that in some fields, previously "poorly performing" baseline methods, when properly tuned, outperformed recently published methods that claimed significant advances. In information retrieval, one analysis found that 60% of proposed models performed worse than an untuned RM3 baseline from earlier research, suggesting illusory progress.
Scikit-learn provides two dedicated classes for constructing baseline models: DummyClassifier and DummyRegressor. These tools make it easy to generate baseline predictions programmatically and compare them against real models using the same evaluation pipeline.
DummyClassifier makes predictions that completely ignore input features. Its behavior is controlled by the strategy parameter.
| Strategy | Behavior |
|---|---|
| most_frequent | Always predicts the most frequent class from the training set |
| prior (default) | Returns the most frequent class for predict(); returns the empirical class distribution for predict_proba() |
| stratified | Randomly samples classes according to the training class distribution |
| uniform | Predicts each class with equal probability |
| constant | Always predicts a user-specified class label |
Example usage:
import numpy as np
from sklearn.dummy import DummyClassifier
X = np.array([[0], [1], [2], [3]])
y = np.array([0, 1, 1, 1])
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X, y)
print(dummy.predict(X)) # [1, 1, 1, 1]
print(dummy.score(X, y)) # 0.75
DummyRegressor follows the same pattern for regression tasks.
| Strategy | Behavior |
|---|---|
| mean (default) | Always predicts the mean of the training targets |
| median | Always predicts the median of the training targets |
| quantile | Always predicts a specified quantile (between 0.0 and 1.0) |
| constant | Always predicts a user-specified constant value |
Example usage:
import numpy as np
from sklearn.dummy import DummyRegressor
X = np.array([[1], [2], [3], [4]])
y = np.array([2.0, 3.0, 5.0, 10.0])
dummy = DummyRegressor(strategy="mean")
dummy.fit(X, y)
print(dummy.predict(X)) # [5.0, 5.0, 5.0, 5.0]
print(dummy.score(X, y)) # 0.0
Both classes integrate seamlessly with scikit-learn's model selection tools, including cross_val_score, GridSearchCV, and Pipeline, making it straightforward to include baseline comparisons in any modeling workflow.
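For example, a baseline can be dropped into exactly the same cross-validation call as a candidate model; the snippet below is a minimal sketch on a built-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Identical evaluation protocol for the baseline and the candidate model
baseline_scores = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5)
model_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(baseline_scores.mean())  # ~0.33 for three balanced classes
print(model_scores.mean())     # substantially higher if the features are informative
```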
Proper baseline reporting is essential for reproducible and meaningful machine learning research. Several best practices have emerged in the community.
Papers should include multiple baselines spanning different levels of complexity. At minimum, this means including a naive baseline (majority class, mean predictor), a standard algorithm baseline (logistic regression, linear regression), and the previous SOTA method. This gives readers a complete picture of where the proposed method falls on the simplicity-to-complexity spectrum.
Baseline models must be tuned fairly. A common methodological flaw is to extensively tune a proposed method while using default hyperparameters for baselines. This creates an unfair comparison that exaggerates the proposed method's advantage. Best practice requires applying the same tuning effort (for example, grid search or random search) to both baselines and proposed methods.
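The following sketch illustrates the principle by giving a simple baseline and a "proposed" model the same GridSearchCV treatment; the dataset, models, and grids are illustrative, not prescriptive:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# The same tuning procedure is applied to the baseline and to the proposed model
baseline_search = GridSearchCV(LogisticRegression(max_iter=5000),
                               {"C": [0.01, 0.1, 1, 10]}, cv=5)
proposed_search = GridSearchCV(RandomForestClassifier(random_state=0),
                               {"n_estimators": [100, 300], "max_depth": [None, 5]}, cv=5)

print(baseline_search.fit(X, y).best_score_)
print(proposed_search.fit(X, y).best_score_)
```

Beyond tuning, several reporting elements should accompany any baseline comparison: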
| Reporting Element | Description |
|---|---|
| Hyperparameters | All hyperparameter values for both baselines and proposed models |
| Random seeds | Fixed seeds for reproducibility of stochastic baselines |
| Data splits | Exact train/validation/test splits or cross-validation folds |
| Preprocessing | Identical preprocessing applied to baselines and proposed methods |
| Statistical significance | Confidence intervals or significance tests, not just point estimates |
| Compute resources | Hardware and training time for fair comparison |
The machine learning community has identified a persistent problem with weak or poorly tuned baselines inflating the apparent contribution of new methods. The CMU ML Blog notes that weak baselines create "the false belief that some methods are performing well when they are not." Reviewers are increasingly encouraged to scrutinize baseline validity, and some conferences now require authors to document their baseline tuning efforts.
The choice of performance metric depends on the task type and the specific goals of the evaluation.
| Task Type | Common Metrics |
|---|---|
| Classification | Accuracy, precision, recall, F1 score, AUC-ROC |
| Regression | Mean squared error (MSE), Mean absolute error (MAE), R-squared |
| Ranking | NDCG, MAP, MRR |
| Generation | BLEU, ROUGE, perplexity |
When comparing against baselines, it is important to report the same metrics for both the baseline and the proposed model, computed on the same test set. Reporting multiple metrics provides a more complete view of model performance than any single number.
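For instance, the same set of metrics can be computed for a baseline and a candidate model on a shared test split; the snippet below is a minimal sketch in which the dataset and models are placeholders:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same test set, same metrics, for both the baseline and the candidate model
for model in (DummyClassifier(strategy="most_frequent"),
              LogisticRegression(max_iter=5000)):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    print(type(model).__name__,
          round(accuracy_score(y_test, pred), 3),
          round(f1_score(y_test, pred), 3),
          round(roc_auc_score(y_test, proba), 3))
```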
Imagine you are trying to guess how many jellybeans are in a jar. You do not know the exact number, but you can make a simple guess, like saying there are 100 jellybeans. This simple guess is your baseline.
Now your friend uses a magnifying glass, counts the layers, and estimates 147 jellybeans. If the real answer is 150, your friend's method is much better than your simple guess. But if the real answer is 102, your friend's fancy method (off by 45) actually did worse than your simple guess (off by only 2).
In machine learning, a baseline works the same way. It is the simplest possible guess. Scientists build complicated models and then check: "Is my fancy model actually better than the simple guess?" If it is not much better, then all that extra work was not worth it. The baseline helps everyone understand whether a new idea is truly an improvement or just unnecessary complexity.