See also: Machine learning terms, Feature engineering, Data augmentation
Preprocessing in machine learning refers to the collection of techniques used to transform raw data into a clean, structured format suitable for training predictive models. Real-world data is rarely ready for direct use: it often contains missing values, inconsistent formatting, outliers, and noise that can degrade model performance. Preprocessing addresses these issues systematically, making it one of the most important and time-consuming stages in any machine learning workflow. Estimates suggest that data scientists spend 60 to 80 percent of their time on data preparation tasks, underscoring how central preprocessing is to practical machine learning.
The goal of preprocessing is to improve data quality and ensure compatibility with the chosen algorithm. Different models have different requirements. Distance-based algorithms such as k-nearest neighbors and support vector machines are sensitive to feature scales, while tree-based methods like random forests handle raw features more gracefully. Understanding which preprocessing steps are needed for a given task and model type is a core skill for any practitioner working with machine learning systems.
Data cleaning is the first and most fundamental preprocessing step. It involves detecting and correcting errors, inconsistencies, and inaccuracies in a dataset before any modeling takes place.
Missing data is one of the most common problems in real-world datasets. Values can be missing for many reasons: sensor failures, survey non-responses, data entry errors, or merging datasets with different schemas. There are three main categories of missingness:
| Type | Description | Example |
|---|---|---|
| Missing Completely at Random (MCAR) | Missingness has no relationship to any variable | A lab instrument randomly fails |
| Missing at Random (MAR) | Missingness depends on observed variables but not the missing value itself | Older patients are less likely to report weight |
| Missing Not at Random (MNAR) | Missingness depends on the unobserved value | High-income individuals refuse to disclose salary |
Common strategies for handling missing values include:
| Strategy | How It Works | Best Used When |
|---|---|---|
| Deletion (listwise) | Remove rows with any missing values | Missing data is MCAR and represents less than 5% of the dataset |
| Mean/Median imputation | Replace missing values with the column mean or median | Numerical features with small amounts of missingness |
| Mode imputation | Replace missing values with the most frequent value | Categorical data with small amounts of missingness |
| K-Nearest Neighbors (KNN) imputation | Fill missing values using the average of k nearest neighbors | Moderate missingness where relationships between features exist |
| Multiple Imputation by Chained Equations (MICE) | Iteratively model each feature with missing values as a function of other features | Complex datasets with multiple missing features |
| MissForest | Uses random forest models to impute missing values iteratively | Datasets with both numerical and categorical features |
| Constant/flag imputation | Replace with a constant value and add a binary indicator column | When missingness itself is informative |
Research comparing these methods has found that MissForest and MICE tend to outperform simpler approaches, particularly when more than 10% of values are missing. For small amounts of missing data (under 5%), simpler methods like mean or median imputation are often sufficient.
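The simpler strategies from the table can be sketched in a few lines. This example assumes scikit-learn is available and uses its `SimpleImputer` and `KNNImputer` on a toy matrix; the data values are invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy feature matrix with missing values marked as np.nan
X = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [3.0, np.nan],
    [5.0, 6.0],
])

# Median imputation: each nan is replaced by its column's median
median_imputer = SimpleImputer(strategy='median')
X_median = median_imputer.fit_transform(X)

# KNN imputation: each nan is replaced by the average of the
# k nearest rows, measured on the features both rows observe
knn_imputer = KNNImputer(n_neighbors=2)
X_knn = knn_imputer.fit_transform(X)
```

As discussed below under data leakage, the imputer should be fitted on training data only and then applied to the test set.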
Duplicate records can inflate the importance of certain data points and bias model training. Duplicates may be exact copies or near-duplicates with minor variations (such as different capitalizations or whitespace). Identifying and removing duplicates ensures each observation contributes once to the learning process.
Outliers are data points that differ significantly from other observations. They may represent genuine rare events, measurement errors, or data entry mistakes. Common approaches to outlier detection include the z-score method (flagging points more than about three standard deviations from the mean), the interquartile range (IQR) rule (flagging points beyond 1.5 × IQR outside the first and third quartiles), and model-based detectors such as Isolation Forest and Local Outlier Factor.
Once detected, outliers can be removed, capped (winsorized), or transformed using logarithmic or other nonlinear functions. The correct approach depends on whether the outlier is an error or a genuine extreme value.
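Capping (winsorizing) with the IQR rule can be sketched in plain NumPy; the threshold multiplier `k=1.5` is the conventional default, and the data values are invented for illustration.

```python
import numpy as np

def iqr_cap(values, k=1.5):
    """Winsorize: clip values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return np.clip(values, lower, upper)

# 95.0 is an obvious outlier relative to the rest of the sample
data = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])
capped = iqr_cap(data)
```

Clipping rather than deleting keeps the row (and its other features) in the dataset, which matters when observations are scarce.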
Feature scaling transforms numerical features so they share a common scale. Without scaling, features with larger numerical ranges can dominate the learning process in algorithms that rely on distance calculations or gradient-based optimization.
Normalization rescales feature values to a fixed range, typically between 0 and 1. The formula is:
X_normalized = (X - X_min) / (X_max - X_min)
Normalization preserves the original distribution shape and is well-suited for algorithms that expect bounded inputs, such as neural networks and k-nearest neighbors. However, it is sensitive to outliers because extreme values compress the rest of the data into a narrow range.
Standardization transforms features to have a mean of 0 and a standard deviation of 1. The formula is:
X_standardized = (X - mean) / standard_deviation
Standardization is preferred when the data contains outliers or when the algorithm assumes normally distributed features. It is widely used with support vector machines, logistic regression, and principal component analysis (PCA).
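The two formulas above can be verified numerically. This sketch applies both transformations to a small invented sample using NumPy:

```python
import numpy as np

X = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max normalization: (X - X_min) / (X_max - X_min) -> values in [0, 1]
X_norm = (X - X.min()) / (X.max() - X.min())

# Standardization: (X - mean) / standard_deviation -> mean 0, std 1
X_std = (X - X.mean()) / X.std()
```

In practice, scikit-learn's `MinMaxScaler` and `StandardScaler` implement the same arithmetic while remembering the training-set statistics so they can be reapplied to new data.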
| Method | Range | Outlier Sensitivity | Best For |
|---|---|---|---|
| Min-Max Normalization | [0, 1] | High | Neural networks, image pixel data, KNN |
| Standardization (Z-Score) | Unbounded (centered at 0) | Moderate | SVM, logistic regression, PCA |
| Robust Scaling | Unbounded (centered at median) | Low | Datasets with many outliers |
| Max Absolute Scaling | [-1, 1] | High | Sparse data |
| Log Transformation | Varies | Reduces right skew | Highly skewed distributions |
Feature scaling is essential for distance-based algorithms (KNN, SVM, K-means clustering), gradient-based optimizers (linear regression, logistic regression, neural networks), and dimensionality reduction techniques (PCA, t-SNE). Tree-based models such as decision trees, random forests, and gradient boosting typically do not require scaling because they make split decisions based on individual feature thresholds rather than distances between data points.
Most machine learning algorithms work exclusively with numerical inputs, so categorical data must be converted to numbers. The choice of encoding method depends on whether the categories have an inherent order and how many unique categories exist.
One-hot encoding creates a new binary column for each unique category. A value of 1 indicates the presence of that category, and 0 indicates its absence. For example, a "Color" feature with values Red, Green, and Blue becomes three columns: Color_Red, Color_Green, and Color_Blue.
One-hot encoding is the standard approach for nominal variables (categories with no natural order). Its main drawback is that it can create a very large number of columns when applied to features with many unique values (high cardinality), which increases memory usage and can slow training.
Label encoding assigns each category a unique integer. For example, Red = 0, Green = 1, Blue = 2. This is memory-efficient and works well for ordinal variables where the integer ordering reflects a meaningful relationship (such as "Low" < "Medium" < "High"). However, applying label encoding to nominal variables can mislead the model into assuming a nonexistent ordinal relationship.
Target encoding replaces each category with the mean (for regression) or class probability (for classification) of the target variable for that category. For instance, if predicting house prices, each neighborhood category is replaced by the average house price in that neighborhood. Target encoding handles high-cardinality features well but carries a risk of overfitting and data leakage if not applied carefully with cross-validation or regularization.
| Method | Cardinality | Ordinal Relationship | Overfitting Risk | Use Case |
|---|---|---|---|---|
| One-Hot Encoding | Low to medium | No | Low | Nominal categories (colors, countries) |
| Label / Ordinal Encoding | Any | Yes | Low | Ordinal categories (ratings, sizes) |
| Target Encoding | High | No | Moderate to high | High-cardinality features (zip codes, product IDs) |
| Binary Encoding | High | No | Low | High-cardinality nominal features |
| Frequency Encoding | High | No | Low | When category frequency is informative |
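One-hot and ordinal encoding can be sketched with scikit-learn; the color and size values below are invented for illustration, and `.toarray()` converts the encoder's sparse output to a dense matrix.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Nominal feature: no natural order, so one-hot encode
colors = np.array([['Red'], ['Green'], ['Blue'], ['Green']])
onehot = OneHotEncoder()
colors_onehot = onehot.fit_transform(colors).toarray()  # shape (4, 3)

# Ordinal feature: pass the category order explicitly so the
# integers reflect Low < Medium < High
sizes = np.array([['Low'], ['High'], ['Medium']])
ordinal = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
sizes_ordinal = ordinal.fit_transform(sizes)  # Low=0, Medium=1, High=2
```

Without the explicit `categories` argument, `OrdinalEncoder` assigns integers alphabetically, which usually does not match the intended order.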
Text data requires its own set of preprocessing steps before it can be used in natural language processing (NLP) models. Raw text is unstructured and contains noise that must be removed or standardized.
| Step | Description | Example |
|---|---|---|
| Lowercasing | Convert all text to lowercase | "Machine Learning" becomes "machine learning" |
| Tokenization | Split text into individual words or subword units | "I love AI" becomes ["I", "love", "AI"] |
| Stopword removal | Remove common words that carry little meaning | Remove "the", "is", "and", "a" |
| Punctuation removal | Strip punctuation marks | "Hello, world!" becomes "Hello world" |
| Stemming | Reduce words to their root form using rules | "running", "runs", "ran" all become "run" |
| Lemmatization | Reduce words to their dictionary base form using grammar | "better" becomes "good"; "running" becomes "run" |
| Removing HTML tags/URLs | Strip web-specific markup | Remove <p>, <br>, and URL patterns |
| Spell correction | Fix typos and misspellings | "recieve" becomes "receive" |
Stemming is faster but less accurate because it applies heuristic rules to chop suffixes without consulting a dictionary. This can produce non-words (for example, "studies" might become "studi"). Lemmatization is slower but produces valid dictionary words because it considers the word's part of speech and applies morphological analysis. Lemmatization is generally preferred when linguistic accuracy matters, while stemming is adequate when speed is the priority.
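A minimal cleaning function covering lowercasing, punctuation removal, tokenization, and stopword removal can be written in plain Python; the tiny stopword set here is a hypothetical stand-in for the fuller lists shipped by NLP libraries.

```python
import string

# Small illustrative stopword list; real libraries ship hundreds of entries
STOPWORDS = {'the', 'is', 'and', 'a', 'an', 'i'}

def clean_text(text):
    """Lowercase, strip punctuation, split into tokens, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS]

tokens = clean_text("The cat, and the dog!")  # -> ['cat', 'dog']
```

Stemming and lemmatization would be applied to the resulting tokens as a further step, typically via a library such as NLTK or spaCy.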
Modern NLP systems use several tokenization strategies: word-level tokenization splits text on whitespace and punctuation, character-level tokenization treats every character as a token, and subword methods such as Byte-Pair Encoding (BPE), WordPiece, and SentencePiece break rare words into smaller reusable units, trading vocabulary size against the ability to represent words never seen during training.
Images require specialized preprocessing before they can be fed to computer vision models. Raw images vary in size, resolution, color space, and quality, all of which must be standardized.
| Step | Description | Why It Matters |
|---|---|---|
| Resizing | Scale images to a uniform size (e.g., 224x224 pixels) | Neural networks require fixed-size inputs |
| Pixel normalization | Scale pixel values from [0, 255] to [0, 1] or [-1, 1] | Helps gradient-based optimization converge faster |
| Channel-wise standardization | Subtract the dataset mean and divide by standard deviation per color channel | Matches the preprocessing used by pretrained models (ImageNet statistics) |
| Grayscale conversion | Convert color images to single-channel grayscale | Reduces dimensionality when color is not informative |
| Noise removal | Apply filters to smooth out image noise | Improves feature extraction quality |
| Histogram equalization | Redistribute pixel intensities for better contrast | Useful for images with poor lighting conditions |
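Pixel normalization and channel-wise standardization from the table can be sketched with NumPy; the 4×4 image is randomly generated, and the mean/std vectors are the ImageNet statistics commonly used with pretrained models.

```python
import numpy as np

# A fake 4x4 RGB image with integer pixel values in [0, 255]
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 4, 3)).astype(np.float32)

# Pixel normalization: rescale from [0, 255] to [0, 1]
image_01 = image / 255.0

# Channel-wise standardization using the ImageNet per-channel
# statistics expected by many pretrained vision models
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])
image_std = (image_01 - mean) / std
```

Frameworks such as torchvision express the same two steps as `ToTensor` followed by `Normalize`.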
Data augmentation artificially increases the size and diversity of the training set by applying random transformations to existing images during training. Common augmentation techniques include random horizontal and vertical flips, rotations, crops, translations, zooming, and adjustments to brightness, contrast, and color.
Augmentation helps models generalize better by exposing them to a wider variety of training examples. It is especially valuable when the available dataset is small.
Feature selection identifies and retains only the most relevant features from the dataset, removing redundant or irrelevant variables. This reduces overfitting, decreases training time, and can improve model accuracy.
| Category | How It Works | Examples | Speed | Risk of Overfitting |
|---|---|---|---|---|
| Filter methods | Rank features using statistical measures, independent of any model | Pearson correlation, chi-square test, mutual information, variance threshold | Fast | Low |
| Wrapper methods | Evaluate subsets of features by training a model on each subset | Forward selection, backward elimination, recursive feature elimination (RFE) | Slow | Moderate to high |
| Embedded methods | Perform feature selection during model training | LASSO (L1 regularization), tree-based feature importance, ElasticNet | Moderate | Low to moderate |
Filter methods are computationally efficient and work well as a first pass to eliminate clearly irrelevant features. Wrapper methods are more accurate but expensive because they require training multiple models. Embedded methods strike a balance by incorporating feature selection into the model fitting process itself.
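A filter method can be sketched in plain NumPy by ranking features on their absolute Pearson correlation with the target; the synthetic data (two informative features plus one pure-noise feature) and the 0.5 threshold are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Two informative features and one pure-noise feature
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
noise = rng.normal(size=n)
y = 3 * x1 - 3 * x2 + rng.normal(scale=0.1, size=n)
X = np.column_stack([x1, x2, noise])

# Filter method: score each feature by |Pearson correlation| with y,
# independently of any model
scores = np.array([
    abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])
])
keep = scores > 0.5  # threshold chosen for this toy example
```

scikit-learn packages the same idea as `SelectKBest` with a scoring function such as `f_regression` or `mutual_info_regression`.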
Before training a model, the dataset must be divided into separate subsets. The most common approach is a train-test split, where a portion of the data (typically 70 to 80 percent) is used for training and the remainder is held out for evaluation. A more robust approach adds a validation set, creating a three-way split: training (60 to 70 percent), validation (15 to 20 percent), and test (15 to 20 percent).
The training set is used to fit the model, the validation set is used for hyperparameter tuning and model selection, and the test set provides a final, unbiased estimate of model performance. For small datasets, k-fold cross-validation is often used instead of a fixed validation split to make more efficient use of the available data.
When the target variable is imbalanced (for example, 95% negative and 5% positive), a random split can produce training or test sets that do not reflect the original class distribution. Stratified splitting ensures that each subset maintains the same class proportions as the full dataset.
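Stratified splitting is a one-argument change in scikit-learn's `train_test_split`; the 90/10 toy labels below are invented to mirror the imbalance example above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy dataset: 90 negatives, 10 positives
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y preserves the 90/10 class ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```

With a 20-sample test set, stratification guarantees exactly 2 positives land in it; a purely random split could easily draw 0 or 4.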
Data leakage is one of the most common and damaging mistakes in machine learning. It occurs when information from the test set inadvertently influences the training process, leading to overly optimistic performance estimates that do not hold up on genuinely unseen data.
The most frequent source of leakage during preprocessing is fitting transformers on the entire dataset before splitting. For example, if you compute the mean and standard deviation for standardization using all data (including test samples), the test set statistics leak into the training process.
Always split the data first, then preprocess. Fit all transformers (scalers, imputers, encoders) on the training data only, then apply the learned transformations to both the training and test sets.
Scikit-learn's Pipeline class bundles preprocessing steps and the final model into a single object. When used with cross-validation, the pipeline ensures that fitting (learning statistics like means and standard deviations) happens only on the training fold, not on the validation fold. This eliminates the risk of leakage.
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
])

# Fit and predict safely: imputation and scaling statistics are
# learned from the training data only, so there is no leakage
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
```
The ColumnTransformer extends this approach by allowing different preprocessing steps for different columns (for example, scaling numerical features and one-hot encoding categorical features), all within the same leakage-proof pipeline.
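A `ColumnTransformer` sketch, assuming pandas and scikit-learn are available; the column names and values are invented for illustration. Numerical columns are imputed and scaled, while the categorical column is one-hot encoded, all inside one leakage-proof object.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'age': [25, 32, np.nan, 41],
    'income': [40000, 52000, 61000, 48000],
    'city': ['Paris', 'Lyon', 'Paris', 'Nice'],
})

preprocess = ColumnTransformer([
    # Numerical columns: impute missing values, then standardize
    ('num', Pipeline([
        ('impute', SimpleImputer(strategy='median')),
        ('scale', StandardScaler()),
    ]), ['age', 'income']),
    # Categorical column: one-hot encode, tolerating unseen categories
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city']),
])

X = preprocess.fit_transform(df)  # 2 scaled columns + 3 one-hot columns
```

This `preprocess` object can itself be the first step of a `Pipeline` ending in a model, so the entire workflow fits and predicts as one unit.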
Different families of machine learning models have different preprocessing requirements. Applying the right preprocessing for the chosen algorithm is critical for achieving good performance.
| Model Family | Scaling Required | Handles Categoricals | Handles Missing Values | Key Preprocessing Steps |
|---|---|---|---|---|
| Linear models (linear/logistic regression) | Yes | No (encode first) | No (impute first) | Standardization, one-hot encoding, imputation |
| Support Vector Machines | Yes | No (encode first) | No (impute first) | Standardization, encoding, imputation |
| K-Nearest Neighbors | Yes | No (encode first) | No (impute first) | Normalization or standardization, encoding |
| Neural networks | Yes | No (encode first) | No (impute first) | Normalization to [0,1] or [-1,1], encoding, imputation |
| Decision trees | No | Some implementations handle natively | Some implementations handle natively | Minimal; encoding may still be needed |
| Random Forest | No | Some implementations handle natively | Some implementations handle natively | Minimal preprocessing |
| Gradient Boosting (XGBoost, LightGBM) | No | LightGBM handles natively; XGBoost needs encoding | XGBoost handles natively | Minimal; label encoding often sufficient |
| Naive Bayes | No | No (encode first) | No (impute first) | Encoding, imputation |
Tree-based ensemble methods are the most forgiving in terms of preprocessing requirements. They are invariant to monotonic transformations of features and can handle missing values internally. In contrast, distance-based and gradient-based models require careful scaling and encoding.
In production machine learning systems, preprocessing steps are organized into reproducible pipelines. A preprocessing pipeline defines a sequence of transformations that are applied consistently to training data, validation data, and new data at inference time.
Key benefits of using pipelines include consistent transformations across training, evaluation, and inference; built-in protection against data leakage; easier hyperparameter tuning over preprocessing choices; and simpler deployment, since the entire workflow can be saved and loaded as a single object.
Scikit-learn provides Pipeline and ColumnTransformer as the primary tools for building preprocessing pipelines. Other frameworks like Apache Spark MLlib, TensorFlow's tf.data, and PyTorch's torchvision.transforms offer similar pipeline abstractions for large-scale or domain-specific workflows.
Imagine you have a messy room full of different toys, and you want to play a specific game. Before you can start, you need to clean up the room, sort the toys by type, and make sure nothing is broken. Preprocessing in machine learning works the same way. Before a computer can learn from data, someone needs to fix the mistakes in the data (like a toy with a missing piece), organize everything so it is easy to understand (like sorting toys into bins), and make sure all the numbers are on the same scale (like making sure all the toy cars are the same size so they fit on the same track). If you skip this cleanup step, the computer gets confused and does not learn as well. Just like you would not try to play a board game with missing pieces and cards from three different games mixed together, a machine learning model cannot do its best work with messy, unorganized data.