See also: Machine learning terms, Feature engineering, Data augmentation
Preprocessing in machine learning refers to the collection of techniques used to transform raw data into a clean, structured format suitable for training predictive models. Real-world data is rarely ready for direct use: it often contains missing values, inconsistent formatting, outliers, and noise that can degrade model performance. Preprocessing addresses these issues systematically, making it one of the most important and time-consuming stages in any machine learning workflow. Estimates suggest that data scientists spend 60 to 80 percent of their time on data preparation tasks, underscoring how central preprocessing is to practical machine learning.
The goal of preprocessing is to improve data quality and ensure compatibility with the chosen algorithm. Different models have different requirements. Distance-based algorithms such as k-nearest neighbors and support vector machines are sensitive to feature scales, while tree-based methods like random forests handle raw features more gracefully. Understanding which preprocessing steps are needed for a given task and model type is a core skill for any practitioner working with machine learning systems.
Data cleaning is the first and most fundamental preprocessing step. It involves detecting and correcting errors, inconsistencies, and inaccuracies in a dataset before any modeling takes place.
Missing data is one of the most common problems in real-world datasets. Values can be missing for many reasons: sensor failures, survey non-responses, data entry errors, or merging datasets with different schemas. There are three main categories of missingness:
| Type | Description | Example |
|---|---|---|
| Missing Completely at Random (MCAR) | Missingness has no relationship to any variable | A lab instrument randomly fails |
| Missing at Random (MAR) | Missingness depends on observed variables but not the missing value itself | Older patients are less likely to report weight |
| Missing Not at Random (MNAR) | Missingness depends on the unobserved value | High-income individuals refuse to disclose salary |
Common strategies for handling missing values include:
| Strategy | How It Works | Best Used When |
|---|---|---|
| Deletion (listwise) | Remove rows with any missing values | Missing data is MCAR and represents less than 5% of the dataset |
| Mean/Median imputation | Replace missing values with the column mean or median | Numerical features with small amounts of missingness |
| Mode imputation | Replace missing values with the most frequent value | Categorical data with small amounts of missingness |
| K-Nearest Neighbors (KNN) imputation | Fill missing values using the average of k nearest neighbors | Moderate missingness where relationships between features exist |
| Multiple Imputation by Chained Equations (MICE) | Iteratively model each feature with missing values as a function of other features | Complex datasets with multiple missing features |
| MissForest | Uses random forest models to impute missing values iteratively | Datasets with both numerical and categorical features |
| Constant/flag imputation | Replace with a constant value and add a binary indicator column | When missingness itself is informative |
Research comparing these methods has found that MissForest and MICE tend to outperform simpler approaches, particularly when more than 10% of values are missing. For small amounts of missing data (under 5%), simpler methods like mean or median imputation are often sufficient.
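The simpler strategies from the table can be sketched in a few lines. This example assumes scikit-learn is available and uses its `SimpleImputer` and `KNNImputer` on a toy matrix; the data values are invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy feature matrix with missing values marked as np.nan
X = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [3.0, np.nan],
    [5.0, 6.0],
])

# Median imputation: each nan is replaced by its column's median
median_imputer = SimpleImputer(strategy='median')
X_median = median_imputer.fit_transform(X)

# KNN imputation: each nan is replaced by the average of the
# k nearest rows, measured on the features both rows observe
knn_imputer = KNNImputer(n_neighbors=2)
X_knn = knn_imputer.fit_transform(X)
```

As discussed below under data leakage, the imputer should be fitted on training data only and then applied to the test set.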
Duplicate records can inflate the importance of certain data points and bias model training. Duplicates may be exact copies or near-duplicates with minor variations (such as different capitalizations or whitespace). Identifying and removing duplicates ensures each observation contributes once to the learning process.
Outliers are data points that differ significantly from other observations. They may represent genuine rare events, measurement errors, or data entry mistakes. Common approaches to outlier detection include the z-score method (flagging points more than about three standard deviations from the mean), the interquartile range (IQR) rule (flagging points beyond 1.5 × IQR outside the first and third quartiles), and model-based detectors such as Isolation Forest and Local Outlier Factor.
Once detected, outliers can be removed, capped (winsorized), or transformed using logarithmic or other nonlinear functions. The correct approach depends on whether the outlier is an error or a genuine extreme value.
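Capping (winsorizing) with the IQR rule can be sketched in plain NumPy; the threshold multiplier `k=1.5` is the conventional default, and the data values are invented for illustration.

```python
import numpy as np

def iqr_cap(values, k=1.5):
    """Winsorize: clip values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return np.clip(values, lower, upper)

# 95.0 is an obvious outlier relative to the rest of the sample
data = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])
capped = iqr_cap(data)
```

Clipping rather than deleting keeps the row (and its other features) in the dataset, which matters when observations are scarce.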
Feature scaling transforms numerical features so they share a common scale. Without scaling, features with larger numerical ranges can dominate the learning process in algorithms that rely on distance calculations or gradient-based optimization.
Normalization rescales feature values to a fixed range, typically between 0 and 1. The formula is:
X_normalized = (X - X_min) / (X_max - X_min)
Normalization preserves the original distribution shape and is well-suited for algorithms that expect bounded inputs, such as neural networks and k-nearest neighbors. However, it is sensitive to outliers because extreme values compress the rest of the data into a narrow range.
Standardization transforms features to have a mean of 0 and a standard deviation of 1. The formula is:
X_standardized = (X - mean) / standard_deviation
Standardization is preferred when the data contains outliers or when the algorithm assumes normally distributed features. It is widely used with support vector machines, logistic regression, and principal component analysis (PCA).
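The two formulas above can be verified numerically. This sketch applies both transformations to a small invented sample using NumPy:

```python
import numpy as np

X = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max normalization: (X - X_min) / (X_max - X_min) -> values in [0, 1]
X_norm = (X - X.min()) / (X.max() - X.min())

# Standardization: (X - mean) / standard_deviation -> mean 0, std 1
X_std = (X - X.mean()) / X.std()
```

In practice, scikit-learn's `MinMaxScaler` and `StandardScaler` implement the same arithmetic while remembering the training-set statistics so they can be reapplied to new data.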
| Method | Range | Outlier Sensitivity | Best For |
|---|---|---|---|
| Min-Max Normalization | [0, 1] | High | Neural networks, image pixel data, KNN |
| Standardization (Z-Score) | Unbounded (centered at 0) | Moderate | SVM, logistic regression, PCA |
| Robust Scaling | Unbounded (centered at median) | Low | Datasets with many outliers |
| Max Absolute Scaling | [-1, 1] | High | Sparse data |
| Log Transformation | Varies | Reduces right skew | Highly skewed distributions |
Feature scaling is essential for distance-based algorithms (KNN, SVM, K-means clustering), gradient-based optimizers (linear regression, logistic regression, neural networks), and dimensionality reduction techniques (PCA, t-SNE). Tree-based models such as decision trees, random forests, and gradient boosting typically do not require scaling because they make split decisions based on individual feature thresholds rather than distances between data points.
Most machine learning algorithms work exclusively with numerical inputs, so categorical data must be converted to numbers. The choice of encoding method depends on whether the categories have an inherent order and how many unique categories exist.
One-hot encoding creates a new binary column for each unique category. A value of 1 indicates the presence of that category, and 0 indicates its absence. For example, a "Color" feature with values Red, Green, and Blue becomes three columns: Color_Red, Color_Green, and Color_Blue.
One-hot encoding is the standard approach for nominal variables (categories with no natural order). Its main drawback is that it can create a very large number of columns when applied to features with many unique values (high cardinality), which increases memory usage and can slow training.
Label encoding assigns each category a unique integer. For example, Red = 0, Green = 1, Blue = 2. This is memory-efficient and works well for ordinal variables where the integer ordering reflects a meaningful relationship (such as "Low" < "Medium" < "High"). However, applying label encoding to nominal variables can mislead the model into assuming a nonexistent ordinal relationship.
Target encoding replaces each category with the mean (for regression) or class probability (for classification) of the target variable for that category. For instance, if predicting house prices, each neighborhood category is replaced by the average house price in that neighborhood. Target encoding handles high-cardinality features well but carries a risk of overfitting and data leakage if not applied carefully with cross-validation or regularization.
| Method | Cardinality | Ordinal Relationship | Overfitting Risk | Use Case |
|---|---|---|---|---|
| One-Hot Encoding | Low to medium | No | Low | Nominal categories (colors, countries) |
| Label / Ordinal Encoding | Any | Yes | Low | Ordinal categories (ratings, sizes) |
| Target Encoding | High | No | Moderate to high | High-cardinality features (zip codes, product IDs) |
| Binary Encoding | High | No | Low | High-cardinality nominal features |
| Frequency Encoding | High | No | Low | When category frequency is informative |
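One-hot and ordinal encoding can be sketched with scikit-learn; the color and size values below are invented for illustration, and `.toarray()` converts the encoder's sparse output to a dense matrix.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Nominal feature: no natural order, so one-hot encode
colors = np.array([['Red'], ['Green'], ['Blue'], ['Green']])
onehot = OneHotEncoder()
colors_onehot = onehot.fit_transform(colors).toarray()  # shape (4, 3)

# Ordinal feature: pass the category order explicitly so the
# integers reflect Low < Medium < High
sizes = np.array([['Low'], ['High'], ['Medium']])
ordinal = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
sizes_ordinal = ordinal.fit_transform(sizes)  # Low=0, Medium=1, High=2
```

Without the explicit `categories` argument, `OrdinalEncoder` assigns integers alphabetically, which usually does not match the intended order.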
Text data requires its own set of preprocessing steps before it can be used in natural language processing (NLP) models. Raw text is unstructured and contains noise that must be removed or standardized.
| Step | Description | Example |
|---|---|---|
| Lowercasing | Convert all text to lowercase | "Machine Learning" becomes "machine learning" |
| Tokenization | Split text into individual words or subword units | "I love AI" becomes ["I", "love", "AI"] |
| Stopword removal | Remove common words that carry little meaning | Remove "the", "is", "and", "a" |
| Punctuation removal | Strip punctuation marks | "Hello, world!" becomes "Hello world" |
| Stemming | Reduce words to their root form using rules | "running", "runs", "ran" all become "run" |
| Lemmatization | Reduce words to their dictionary base form using grammar | "better" becomes "good"; "running" becomes "run" |
| Removing HTML tags/URLs | Strip web-specific markup | Remove <p>, <br>, and URL patterns |
| Spell correction | Fix typos and misspellings | "recieve" becomes "receive" |
Stemming is faster but less accurate because it applies heuristic rules to chop suffixes without consulting a dictionary. This can produce non-words (for example, "studies" might become "studi"). Lemmatization is slower but produces valid dictionary words because it considers the word's part of speech and applies morphological analysis. Lemmatization is generally preferred when linguistic accuracy matters, while stemming is adequate when speed is the priority.
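A minimal cleaning function covering lowercasing, punctuation removal, tokenization, and stopword removal can be written in plain Python; the tiny stopword set here is a hypothetical stand-in for the fuller lists shipped by NLP libraries.

```python
import string

# Small illustrative stopword list; real libraries ship hundreds of entries
STOPWORDS = {'the', 'is', 'and', 'a', 'an', 'i'}

def clean_text(text):
    """Lowercase, strip punctuation, split into tokens, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS]

tokens = clean_text("The cat, and the dog!")  # -> ['cat', 'dog']
```

Stemming and lemmatization would be applied to the resulting tokens as a further step, typically via a library such as NLTK or spaCy.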
Modern NLP systems use several tokenization strategies: word-level tokenization splits text on whitespace and punctuation, character-level tokenization treats every character as a token, and subword methods such as Byte-Pair Encoding (BPE), WordPiece, and SentencePiece break rare words into smaller reusable units, trading vocabulary size against the ability to represent words never seen during training.
Images require specialized preprocessing before they can be fed to computer vision models. Raw images vary in size, resolution, color space, and quality, all of which must be standardized.
| Step | Description | Why It Matters |
|---|---|---|
| Resizing | Scale images to a uniform size (e.g., 224x224 pixels) | Neural networks require fixed-size inputs |
| Pixel normalization | Scale pixel values from [0, 255] to [0, 1] or [-1, 1] | Helps gradient-based optimization converge faster |
| Channel-wise standardization | Subtract the dataset mean and divide by standard deviation per color channel | Matches the preprocessing used by pretrained models (ImageNet statistics) |
| Grayscale conversion | Convert color images to single-channel grayscale | Reduces dimensionality when color is not informative |
| Noise removal | Apply filters to smooth out image noise | Improves feature extraction quality |
| Histogram equalization | Redistribute pixel intensities for better contrast | Useful for images with poor lighting conditions |
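Pixel normalization and channel-wise standardization from the table can be sketched with NumPy; the 4×4 image is randomly generated, and the mean/std vectors are the ImageNet statistics commonly used with pretrained models.

```python
import numpy as np

# A fake 4x4 RGB image with integer pixel values in [0, 255]
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 4, 3)).astype(np.float32)

# Pixel normalization: rescale from [0, 255] to [0, 1]
image_01 = image / 255.0

# Channel-wise standardization using the ImageNet per-channel
# statistics expected by many pretrained vision models
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])
image_std = (image_01 - mean) / std
```

Frameworks such as torchvision express the same two steps as `ToTensor` followed by `Normalize`.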
Data augmentation artificially increases the size and diversity of the training set by applying random transformations to existing images during training. Common augmentation techniques include random horizontal and vertical flips, rotations, crops, translations, zooming, and adjustments to brightness, contrast, and color.
Augmentation helps models generalize better by exposing them to a wider variety of training examples. It is especially valuable when the available dataset is small.
Feature selection identifies and retains only the most relevant features from the dataset, removing redundant or irrelevant variables. This reduces overfitting, decreases training time, and can improve model accuracy.
| Category | How It Works | Examples | Speed | Risk of Overfitting |
|---|---|---|---|---|
| Filter methods | Rank features using statistical measures, independent of any model | Pearson correlation, chi-square test, mutual information, variance threshold | Fast | Low |
| Wrapper methods | Evaluate subsets of features by training a model on each subset | Forward selection, backward elimination, recursive feature elimination (RFE) | Slow | Moderate to high |
| Embedded methods | Perform feature selection during model training | LASSO (L1 regularization), tree-based feature importance, ElasticNet | Moderate | Low to moderate |
Filter methods are computationally efficient and work well as a first pass to eliminate clearly irrelevant features. Wrapper methods are more accurate but expensive because they require training multiple models. Embedded methods strike a balance by incorporating feature selection into the model fitting process itself.
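A filter method can be sketched in plain NumPy by ranking features on their absolute Pearson correlation with the target; the synthetic data (two informative features plus one pure-noise feature) and the 0.5 threshold are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Two informative features and one pure-noise feature
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
noise = rng.normal(size=n)
y = 3 * x1 - 3 * x2 + rng.normal(scale=0.1, size=n)
X = np.column_stack([x1, x2, noise])

# Filter method: score each feature by |Pearson correlation| with y,
# independently of any model
scores = np.array([
    abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])
])
keep = scores > 0.5  # threshold chosen for this toy example
```

scikit-learn packages the same idea as `SelectKBest` with a scoring function such as `f_regression` or `mutual_info_regression`.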
Before training a model, the dataset must be divided into separate subsets. The most common approach is a train-test split, where a portion of the data (typically 70 to 80 percent) is used for training and the remainder is held out for evaluation. A more robust approach adds a validation set, creating a three-way split: training (60 to 70 percent), validation (15 to 20 percent), and test (15 to 20 percent).
The training set is used to fit the model, the validation set is used for hyperparameter tuning and model selection, and the test set provides a final, unbiased estimate of model performance. For small datasets, k-fold cross-validation is often used instead of a fixed validation split to make more efficient use of the available data.
When the target variable is imbalanced (for example, 95% negative and 5% positive), a random split can produce training or test sets that do not reflect the original class distribution. Stratified splitting ensures that each subset maintains the same class proportions as the full dataset.
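Stratified splitting is a one-argument change in scikit-learn's `train_test_split`; the 90/10 toy labels below are invented to mirror the imbalance example above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy dataset: 90 negatives, 10 positives
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y preserves the 90/10 class ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```

With a 20-sample test set, stratification guarantees exactly 2 positives land in it; a purely random split could easily draw 0 or 4.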
Data leakage is one of the most common and damaging mistakes in machine learning. It occurs when information from the test set inadvertently influences the training process, leading to overly optimistic performance estimates that do not hold up on genuinely unseen data.
The most frequent source of leakage during preprocessing is fitting transformers on the entire dataset before splitting. For example, if you compute the mean and standard deviation for standardization using all data (including test samples), the test set statistics leak into the training process.
Always split the data first, then preprocess. Fit all transformers (scalers, imputers, encoders) on the training data only, then apply the learned transformations to both the training and test sets.
Scikit-learn's Pipeline class bundles preprocessing steps and the final model into a single object. When used with cross-validation, the pipeline ensures that fitting (learning statistics like means and standard deviations) happens only on the training fold, not on the validation fold. This eliminates the risk of leakage.
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
])

# Fit and predict safely: imputation and scaling statistics are
# learned from the training data only, so there is no leakage
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
```
The ColumnTransformer extends this approach by allowing different preprocessing steps for different columns (for example, scaling numerical features and one-hot encoding categorical features), all within the same leakage-proof pipeline.
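A `ColumnTransformer` sketch, assuming pandas and scikit-learn are available; the column names and values are invented for illustration. Numerical columns are imputed and scaled, while the categorical column is one-hot encoded, all inside one leakage-proof object.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    'age': [25, 32, np.nan, 41],
    'income': [40000, 52000, 61000, 48000],
    'city': ['Paris', 'Lyon', 'Paris', 'Nice'],
})

preprocess = ColumnTransformer([
    # Numerical columns: impute missing values, then standardize
    ('num', Pipeline([
        ('impute', SimpleImputer(strategy='median')),
        ('scale', StandardScaler()),
    ]), ['age', 'income']),
    # Categorical column: one-hot encode, tolerating unseen categories
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city']),
])

X = preprocess.fit_transform(df)  # 2 scaled columns + 3 one-hot columns
```

This `preprocess` object can itself be the first step of a `Pipeline` ending in a model, so the entire workflow fits and predicts as one unit.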
Different families of machine learning models have different preprocessing requirements. Applying the right preprocessing for the chosen algorithm is critical for achieving good performance.
| Model Family | Scaling Required | Handles Categoricals | Handles Missing Values | Key Preprocessing Steps |
|---|---|---|---|---|
| Linear models (linear/logistic regression) | Yes | No (encode first) | No (impute first) | Standardization, one-hot encoding, imputation |
| Support Vector Machines | Yes | No (encode first) | No (impute first) | Standardization, encoding, imputation |
| K-Nearest Neighbors | Yes | No (encode first) | No (impute first) | Normalization or standardization, encoding |
| Neural networks | Yes | No (encode first) | No (impute first) | Normalization to [0,1] or [-1,1], encoding, imputation |
| Decision trees | No | Some implementations handle natively | Some implementations handle natively | Minimal; encoding may still be needed |
| Random Forest | No | Some implementations handle natively | Some implementations handle natively | Minimal preprocessing |
| Gradient Boosting (XGBoost, LightGBM) | No | LightGBM handles natively; XGBoost needs encoding | XGBoost handles natively | Minimal; label encoding often sufficient |
| Naive Bayes | No | No (encode first) | No (impute first) | Encoding, imputation |
Tree-based ensemble methods are the most forgiving in terms of preprocessing requirements. They are invariant to monotonic transformations of features and can handle missing values internally. In contrast, distance-based and gradient-based models require careful scaling and encoding.
In production machine learning systems, preprocessing steps are organized into reproducible pipelines. A preprocessing pipeline defines a sequence of transformations that are applied consistently to training data, validation data, and new data at inference time.
Key benefits of using pipelines include consistent transformations across training, evaluation, and inference; built-in protection against data leakage; easier hyperparameter tuning over preprocessing choices; and simpler deployment, since the entire workflow can be saved and loaded as a single object.
Scikit-learn provides Pipeline and ColumnTransformer as the primary tools for building preprocessing pipelines. Other frameworks like Apache Spark MLlib, TensorFlow's tf.data, and PyTorch's torchvision.transforms offer similar pipeline abstractions for large-scale or domain-specific workflows.
Imagine you have a messy room full of different toys, and you want to play a specific game. Before you can start, you need to clean up the room, sort the toys by type, and make sure nothing is broken. Preprocessing in machine learning works the same way. Before a computer can learn from data, someone needs to fix the mistakes in the data (like a toy with a missing piece), organize everything so it is easy to understand (like sorting toys into bins), and make sure all the numbers are on the same scale (like making sure all the toy cars are the same size so they fit on the same track). If you skip this cleanup step, the computer gets confused and does not learn as well. Just like you would not try to play a board game with missing pieces and cards from three different games mixed together, a machine learning model cannot do its best work with messy, unorganized data.