Data preprocessing

Updated Jun 24, 2026

Data preprocessing is the set of operations applied to raw data to clean and transform it into a form a machine learning model can use, covering deduplication, type fixing, missing-value imputation, outlier handling, feature scaling, categorical encoding, dataset splitting, augmentation, and (for web-scale corpora) language filtering, quality scoring, and tokenizing text. It is the most time-consuming part of applied machine learning: Anaconda's 2020 State of Data Science survey of 2,360 respondents across more than 100 countries found that "on average 45% of their time is spent getting data ready (loading and cleansing)," and an earlier 2016 CrowdFlower (later Figure Eight) survey put data preparation at roughly 80% of the work, with 60% of time on cleaning and organizing data and 19% on collecting data sets.^[1]^[2]^[3] The governing principle is the old computing slogan garbage in, garbage out: a model trained on flawed inputs produces flawed outputs no matter how the algorithm is tuned.

None of these steps is glamorous, and most of the time on a real project is spent on them. The estimate that data wrangling consumes most of a practitioner's time has been remarkably stable since the term "data wrangling" came into use in the 2010s.^[3]

The topic sits at the intersection of statistics, software engineering, and the messy reality of how data actually arrives. Classical preprocessing for tabular data has well established recipes that you will find in scikit-learn and the textbooks of Hastie, Tibshirani and Friedman or Bishop. The deep learning era added a separate body of practice for images, audio, and text, and the large language model era added another layer on top of that, where you are dealing with petabytes of crawled web pages and have to make defensible choices about deduplication, content filtering, and decontamination of evaluation benchmarks. This article covers all three.

Why does preprocessing matter?

There is an old slogan in computing, garbage in, garbage out, and it survives because it is essentially right. The phrase first appeared in print on November 10, 1957, in The Hammond Times, and is commonly attributed to IBM instructor George Fuechsel around 1958 to 1959; it captures the idea that a program produces erroneous output if given erroneous input.^[4] A model is a function that maps inputs to outputs, and if the inputs are encoded in a form the model cannot use, or contain systematic distortions the model cannot see past, no amount of tuning fixes that downstream. Three concrete reasons preprocessing affects results:

Algorithmic assumptions. Many learning algorithms make implicit assumptions about their inputs. Linear regression with ridge (L2) or lasso (L1) regularization penalizes all coefficients equally and so assumes features are on comparable scales; k-nearest neighbors and support vector machines compute distances and so depend heavily on scale; gradient descent converges much faster when input features have roughly zero mean and unit variance. Tree-based methods are mostly invariant to monotone transformations of features, but even they are sensitive to how categorical variables are encoded.
Numerical stability. Floating point arithmetic loses precision when values span many orders of magnitude. Standardization, log transforms, and clipping outliers are not just statistical decorations; they keep gradients finite and matrices invertible.
Statistical efficiency. Removing leakage, balancing classes, and choosing a sound train/validation/test split are what determine whether your reported numbers generalize. Most of the spectacular failures in deployed ML look in retrospect like preprocessing failures: the test set was contaminated, the imputation was fit on the test data, the encoder leaked the target, the dedup was incomplete and the same example showed up in both splits.

For LLM training the stakes get larger. The cost of a single training run is now in the millions of dollars, and the data choices baked in at the start are essentially permanent for that checkpoint. Mistakes get amplified, and by the time you notice that you trained on the test set, you have already trained on the test set.

What are the categories of preprocessing?

There is no canonical taxonomy, but the activities below cover most of what people mean by "preprocessing" in tabular and classical ML contexts. The categories overlap. Feature engineering is sometimes treated as separate because it is more creative than mechanical, and the same with feature extraction for raw signal data, but in practice it all lives in the same pipeline. Preprocessing is the substrate that feature engineering is built on: cleaned, scaled, and consistently encoded columns are the precondition for constructing the higher-level features that actually drive model accuracy.

Cleaning

Cleaning is the part you cannot skip. Real data has missing entries, mistyped values, duplicate rows, encoding problems, free text that should be a category, dates stored as strings, numbers stored as strings, and the occasional silent unit error (centimeters logged as meters). The first pass usually is:

Standardize types. Coerce columns to the right dtype. In pandas this means using astype or convert_dtypes, and in pyarrow or polars it is built in.
Strip whitespace, normalize case, fix encoding (UTF-8 normalization, removing zero-width characters).
Detect and resolve duplicates. Exact duplicates are easy. Near duplicates need fuzzy matching: Levenshtein distance, MinHash, or domain specific keys.
Detect outliers. The cheapest screen is the interquartile range (IQR) rule, which flags values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR. More principled options include z-score thresholds (only meaningful for roughly Gaussian data), Isolation Forest, Local Outlier Factor, and One-Class SVM. See outlier detection for a longer treatment.

A recurring pitfall is treating outliers as something to delete. Often they are real, just rare, and dropping them silently biases the model.
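
As a minimal sketch of the IQR screen described above, the function below flags values rather than deleting them, so the decision about rare but real observations stays with the analyst. The function name and the sample readings are illustrative assumptions, and the quartiles use the standard library's "inclusive" interpolation, which is only one of several conventions.

```python
import statistics

def flag_outliers_iqr(values, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Flag, do not drop: the caller decides what to do with these.
    return [v for v in values if v < lower or v > upper]

readings = [10, 12, 11, 13, 12, 11, 95]  # hypothetical sensor readings
flagged = flag_outliers_iqr(readings)
print(flagged)
```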

Missing values and imputation

Missingness has a structure that matters. Statisticians distinguish three regimes:

MCAR (missing completely at random): the probability of being missing does not depend on either observed or unobserved values. Rare in practice.
MAR (missing at random): missingness depends on observed values but not on the missing value itself. This is the regime where most imputation methods are valid.
MNAR (missing not at random): missingness depends on the unobserved value (income missing because high earners refuse to answer). This is the hard case, and no general purpose method handles it.

Common imputation strategies, in roughly increasing order of sophistication:

Method	How it works	When to use
Drop rows	Remove rows with any missing value	Tiny missing fraction, MCAR
Drop columns	Remove the entire column	Column is mostly missing
Mean / median / mode	Fill with column statistic	Quick baseline; loses variance
Forward / backward fill	Propagate adjacent values	Time series
KNN imputation	Average of k nearest neighbors	Mixed types, moderate size
Iterative (MICE)	Model each column from the others, iterate	Multivariate, MAR
Model based	Use a regressor or classifier per column	Larger datasets
Indicator variables	Add a binary "was missing" feature	Always cheap, often helps

A practical move that is too rarely done: combine imputation with an explicit "is missing" indicator column. The model can then learn whether missingness itself is informative.

In scikit-learn the relevant tools are SimpleImputer, KNNImputer, and the experimental IterativeImputer, which is inspired by the MICE algorithm and has to be enabled with an explicit import from sklearn.experimental.^[5]^[6]
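
The combination of imputation with a missingness indicator can be sketched with SimpleImputer and its add_indicator option. The arrays here are made-up toy data; the point is that the median is learned from the training rows only and then reused on the test row.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (an assumption for illustration): column 0 has a missing value.
X_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 30.0]])
X_test = np.array([[np.nan, 100.0]])

# add_indicator=True appends a binary "was missing" column for every
# feature that had missing values when the imputer was fit.
imputer = SimpleImputer(strategy="median", add_indicator=True)
imputer.fit(X_train)  # statistics come from training data only

X_test_imputed = imputer.transform(X_test)
print(X_test_imputed)  # imputed features followed by the indicator column
```

The test row is filled with the training median of column 0, and the extra column records that the value was originally missing.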

Encoding categorical variables

Most models cannot consume strings directly. The choice of encoding has a real effect on accuracy, especially for high cardinality categoricals.

Encoding	Output	Notes
One hot	One binary column per level	Standard for low cardinality, blows up for high cardinality
Ordinal / label	Integer per level	Only meaningful when an order exists
Target / mean	Mean of target per level	Powerful but leaks unless done in a CV aware way
Frequency / count	Count of each level in training data	Cheap proxy for prevalence
Hashing	Hash to fixed bucket count	Fast, no vocabulary, hash collisions
Embedding	Learned dense vector per level	Standard in deep learning, see embeddings

Target encoding is genuinely useful for high cardinality features (zip codes, product IDs, user IDs) but is also one of the easiest places to introduce data leakage. The clean way to do it is to compute encodings out of fold inside a cross validation loop, then refit on full training data for prediction.
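
A rough sketch of out-of-fold target encoding follows, written in plain Python so the mechanics are visible. The fold assignment (row index modulo the number of folds) and the fallback to the fold's global mean for unseen categories are simplifying assumptions, not a recommended production scheme.

```python
from collections import defaultdict

def out_of_fold_target_encode(categories, targets, n_folds=3):
    """Encode each row with the target mean of its category,
    computed only from rows in the other folds."""
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        fit_idx = [i for i in range(n) if i % n_folds != fold]
        holdout_idx = [i for i in range(n) if i % n_folds == fold]
        sums, counts = defaultdict(float), defaultdict(int)
        for i in fit_idx:
            sums[categories[i]] += targets[i]
            counts[categories[i]] += 1
        # Fallback for categories absent from the fitting folds.
        prior = sum(targets[i] for i in fit_idx) / len(fit_idx)
        for i in holdout_idx:
            c = categories[i]
            encoded[i] = sums[c] / counts[c] if counts.get(c) else prior
    return encoded

cats = ["a", "a", "a", "b", "b", "b"]
y = [1, 0, 1, 0, 0, 1]
print(out_of_fold_target_encode(cats, y))
```

No row's encoding ever uses that row's own target, which is the property that prevents the leak.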

Scaling and normalization: min-max versus z-score

Scaling puts numerical features on comparable ranges. The four common variants:

Method	Formula	Robust to outliers?
Standardization (z-score)	(x - mean) / std	No
Min-max	(x - min) / (max - min)	No
Max-abs	x / max(|x|)	No, but preserves sparsity
Robust scaling	(x - median) / IQR	Yes

Standardization (the z-score transform, subtract the mean and divide by the standard deviation) is the default for most linear methods and for neural networks; it centers each feature at zero with unit variance. Min max scaling maps to [0, 1] by subtracting the minimum and dividing by the range, and is sometimes used for image pixel intensities or for inputs to algorithms with bounded activation functions. The practical difference: min-max preserves the exact shape of the original distribution within a fixed bounded interval but is highly sensitive to a single extreme value (one outlier compresses everything else toward zero), whereas standardization is unbounded but less distorted by a stray maximum. Robust scaling, which uses the median and the interquartile range instead of the mean and standard deviation, is the choice when outliers are present and you do not want to remove them. In scikit-learn these are StandardScaler, MinMaxScaler, MaxAbsScaler, and RobustScaler, all fit on the training split only.^[5]

The related concept of normalization sometimes refers to row wise rescaling (each sample to unit norm) and sometimes to column wise scaling. Read the documentation carefully; in scikit-learn StandardScaler is column wise, Normalizer is row wise.
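
To make the outlier behaviour of the scalers concrete, here is a small standard-library sketch applied to one column with a single extreme value. The helper names are hypothetical, and for brevity each function computes its statistics from the list it is given; in a real pipeline those statistics would come from the training split.

```python
import statistics

def standardize(xs):
    # Population standard deviation, the convention StandardScaler also uses.
    mean, std = statistics.fmean(xs), statistics.pstdev(xs)
    return [(x - mean) / std for x in xs]

def min_max(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def robust(xs):
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - median) / (q3 - q1) for x in xs]

values = [1.0, 2.0, 3.0, 4.0, 100.0]  # one outlier
print(min_max(values))   # the four ordinary points are squeezed near 0
print(robust(values))    # the ordinary points keep a usable spread
print(standardize(values))
```

With min-max scaling the outlier takes the value 1 and the other four points land below 0.05, while robust scaling keeps them spread between -1 and 0.5.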

Feature transformation

Transformations change the shape of a feature's distribution. Useful when models assume something approximately Gaussian or when a feature has heavy tails:

Log transform. log(1 + x) is the workhorse for right skewed positive data: counts, prices, durations, anything power law shaped. Use log1p to avoid problems at zero.
Box Cox. A parameterized power transform that interpolates between log, square root, and identity. Requires strictly positive inputs.
Yeo Johnson. A generalization of Box Cox that handles zeros and negatives. Implemented in scikit-learn as PowerTransformer(method='yeo-johnson').
Quantile transform. Maps the empirical distribution onto a uniform or Gaussian target. Strong if you do not care about preserving distances.
Polynomial features. Add cross products and powers up to a degree. Used to be popular for linear models; mostly replaced by tree based methods and neural nets in modern practice.
Binning / discretization. Convert a continuous feature into a categorical one by cutting it at fixed or quantile based boundaries. See bucketing for the longer treatment, including the tradeoffs around quantile bucketing for ML serving.
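
The log transform from the list above can be illustrated in a few lines. The durations are invented sample values; log1p handles the zero cleanly and expm1 inverts the transform when predictions need to be mapped back to the original scale.

```python
import math

durations = [0, 3, 12, 150, 9000]  # hypothetical right-skewed values
log_durations = [math.log1p(d) for d in durations]   # log(1 + x)
recovered = [math.expm1(v) for v in log_durations]   # inverse transform

print(log_durations)
print(recovered)
```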

Feature selection

When you have more features than you need, you can remove them automatically. Three families:

Filter methods. Score each feature against the target without involving the model. Examples: chi square test for categorical variables, ANOVA F test for numerical features against a categorical target, mutual information, correlation with the target. Cheap, and a good first pass.
Wrapper methods. Treat feature selection as a search problem and evaluate the model on subsets. Recursive feature elimination (RFE), forward selection, backward elimination. More accurate, much slower.
Embedded methods. The model itself does selection during fitting. L1-regularized regression (Lasso) drives some coefficients to zero. Tree-based models give feature importances, usually computed from how much each feature's splits reduce impurity or from how often the feature is used in splits. Boosted trees expose the same importances and are a common embedded selector.
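
As an illustration of the filter family, the sketch below ranks features by absolute Pearson correlation with the target and keeps the top k. The feature names and values are made up, and the code assumes no column is constant (a constant column would cause a division by zero).

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def top_k_by_correlation(features, target, k):
    """Score each feature against the target without fitting any model."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

target = [1, 2, 3, 4, 5, 6]
features = {
    "signal": [2.1, 3.9, 6.2, 8.0, 9.9, 12.1],  # nearly linear in the target
    "noise": [5, 1, 4, 2, 5, 1],                # weakly related
}
print(top_k_by_correlation(features, target, k=1))
```

Correlation only captures linear relationships; mutual information is the usual substitute when the dependence may be nonlinear.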

Dimensionality reduction

Distinct from feature selection: dimensionality reduction projects all features into a lower dimensional space, mixing them together. Standard methods:

Principal component analysis (PCA) finds orthogonal directions that maximize variance. Linear, fast, good for whitening.
t-SNE preserves local neighborhoods. Mostly used for visualization, not as a preprocessing step for downstream models.
UMAP preserves both local and global structure better than t-SNE for many datasets, and is fast enough to use as a preprocessing transform.
Autoencoders (under deep learning) learn nonlinear projections.

For most tabular tasks PCA is the right first try. For modern image and text work, the relevant projections are baked into the model itself and explicit dimensionality reduction during preprocessing is unusual.
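
A compact NumPy sketch of PCA via the singular value decomposition is shown here. The function names are hypothetical; the mean and components are estimated on the training matrix and then reused for any later data, in keeping with the fit-on-train rule.

```python
import numpy as np

def pca_fit(X_train, n_components):
    """Return the training mean, top components, and explained variance ratios."""
    mean = X_train.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(X_train - mean, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return mean, vt[:n_components], explained[:n_components]

def pca_transform(X, mean, components):
    return (X - mean) @ components.T

# Toy data lying exactly on the line y = 2x, so one component suffices.
X_train = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
mean, components, explained = pca_fit(X_train, n_components=1)
Z = pca_transform(X_train, mean, components)
print(explained, Z.shape)
```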

Class imbalance

When one class is much rarer than the others (fraud, rare disease, churn), naive training tends to ignore it. Three responses:

Resampling. Oversample the minority class, undersample the majority, or both. The synthetic minority oversampling technique SMOTE generates new minority samples by interpolating between existing ones; ADASYN is a variant that focuses on harder to learn regions.^[7]
Class weighting. Most learners accept per class weights that scale the loss contribution. In scikit-learn, set class_weight='balanced' and the inverse frequency weights are computed automatically.
Threshold adjustment. Train on the original distribution and tune the decision threshold on validation data. Often as effective as resampling and avoids artifacts.

A caveat: resampling should be done inside the cross validation fold, never on the full dataset before splitting, or you will leak.
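
The inverse-frequency weights behind class_weight='balanced' follow the formula n_samples / (n_classes × count of the class). A small sketch on a made-up 90/10 label vector:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

labels = [0] * 90 + [1] * 10  # hypothetical imbalanced labels
weights = balanced_class_weights(labels)
print(weights)
```

Each minority example ends up counting nine times as much in the loss as a majority example, which matches the 90/10 imbalance.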

Augmentation

Data augmentation generates additional training examples by transforming existing ones in label preserving ways. Modality specific:

Images. Flips, crops, color jitter, rotations, scaling, blur, MixUp (linear interpolation of images and labels), CutMix (paste a patch from one image into another and mix labels by area), RandAugment, AutoAugment.^[18]^[19]^[20]
Text. Synonym replacement, random insertion, deletion, swap (the EDA family), back translation, paraphrase models, span masking.
Audio. Time shifting, pitch shifting, time stretching, additive noise, SpecAugment for spectrogram domain masking.^[21]
Tabular. Less standardized, but Gaussian noise injection on continuous columns and SMOTE-like interpolation are used.

The principle in all cases: every augmentation must preserve the label. A horizontal flip is fine for cat-vs-dog classification, not for OCR.
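
MixUp is simple enough to sketch directly. The version below assumes one-hot labels and draws the mixing coefficient from a Beta distribution; the alpha value of 0.2 is an illustrative choice, not a recommendation.

```python
import numpy as np

def mixup(x1, y1, x2, y2, lam):
    """Blend two examples and their one-hot labels with weight lam."""
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

rng = np.random.default_rng(0)
lam = rng.beta(0.2, 0.2)  # mixing coefficient in [0, 1]

image_a, label_a = np.zeros((2, 2)), np.array([1.0, 0.0])
image_b, label_b = np.ones((2, 2)), np.array([0.0, 1.0])
mixed_image, mixed_label = mixup(image_a, label_a, image_b, label_b, lam)
print(lam, mixed_label)
```

The label is mixed with the same coefficient as the pixels, which is how the augmentation stays consistent with the label-preservation principle.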

How should you split data into train, validation, and test sets?

The last preprocessing decision before training is how to partition data. Common patterns:

Random train / validation / test split. The default for IID data. Typical ratios 60/20/20 or 80/10/10.
Stratified split. Preserves the class distribution across splits. Important for imbalanced problems.
Group K fold. Ensures that all samples from the same group (patient, user, document) end up in the same fold. Necessary whenever group level structure could leak through random splits.
Time series split. For temporally ordered data, the validation set must come after the training set in time. Use rolling origin evaluation, never a random split.
Leave one group out. When groups are small and you can afford it.

The cardinal rule is that the test set is touched exactly once, at the end. Anything you tune on it (hyperparameters, preprocessing parameters, threshold, ensemble weights) is no longer a fair generalization estimate. The split also has to respect deduplication: if the same record exists in two rows and one lands in train and the other in test, the test number is contaminated. Lee et al. (2022) found that train-test overlap affects over 4% of the validation set in several standard language modeling datasets, which is exactly this failure at web scale.^[15]
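
One way to sketch a group-aware split without any library is to hash the group identifier and send the whole group to one side. The function name, the 20% test fraction, and the patient IDs are assumptions for illustration; scikit-learn's group-aware splitters serve the same purpose.

```python
import hashlib

def assign_split(group_id, test_fraction=0.2):
    """Deterministically map a group to 'train' or 'test' via a stable hash."""
    digest = hashlib.sha256(str(group_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64  # uniform in [0, 1)
    return "test" if bucket < test_fraction else "train"

# Hypothetical rows: (patient_id, measurement)
rows = [("p1", 0.3), ("p1", 0.4), ("p2", 1.2), ("p3", 0.9), ("p3", 1.0), ("p4", 0.1)]
train = [r for r in rows if assign_split(r[0]) == "train"]
test = [r for r in rows if assign_split(r[0]) == "test"]

train_groups = {g for g, _ in train}
test_groups = {g for g, _ in test}
print(sorted(train_groups), sorted(test_groups))
```

Because the assignment depends only on the group ID, every row of a patient lands in the same split, and the split is stable when new rows arrive.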

How is data prepared for modern NLP and LLM training?

The LLM era turned preprocessing into a discipline of its own. The pipeline that produces a training data corpus from raw web crawls now has many stages, and the design choices at each stage have measurable downstream effects on model quality. The reference points here are the documented pipelines for C4, The Pile, RefinedWeb, FineWeb and FineWeb Edu, Dolma, and DataComp LM. Each one published enough detail that you can reconstruct what they did.

The shape of a web scale text pipeline

Most pipelines that start from Common Crawl WARC files run something close to the following sequence. The order matters; getting it wrong wastes compute or, worse, produces a corpus that looks fine but trains a worse model.

Raw extraction. Pull plain text from HTML. Trafilatura and resiliparse are common choices. The C4 paper used a custom extractor; RefinedWeb used trafilatura tuned for web text; FineWeb extended this with their own optimizations.
Language identification. Filter to the languages you want. Common tools are CLD3 (Google's compact language detector) and fastText language identification. For a model intended to be English only, you typically keep documents whose detected language is English with confidence above a threshold like 0.65.
URL-based filtering. Drop URLs that match blocklists of adult content, malware, and spam. C4 did not filter on URLs; it dropped any page containing a word from the public "List of Dirty, Naughty, Obscene or Otherwise Bad Words". Later corpora such as RefinedWeb maintain their own URL blocklists.
Heuristic quality filters. Rules of thumb that throw out obviously broken or low-quality documents. The Gopher paper from DeepMind documented a clean set, since reused widely: drop documents with too few words, with mean word length outside [3, 10], with a high fraction of symbols, with too many lines that start with a bullet point or end in an ellipsis, with too few lines ending in punctuation. C4 added rules for sentence length and presence of curly braces (a heuristic for code in a text-only corpus).^[10]
Repetition filters. Drop documents with a high fraction of repeated lines, paragraphs, or n-grams. These rules catch SEO spam, navigation menus that leaked through extraction, and template generated text.^[10]
Deduplication. This is the single highest impact step. Two main approaches:
- Exact deduplication on long substrings. Remove documents whose 50-token spans match other documents above some threshold. Implemented with suffix arrays.
- Approximate deduplication. Compute MinHash signatures over n-gram shingles, then use locality sensitive hashing to find near duplicate document pairs and remove all but one. Lee et al. (2022) showed that thorough dedup substantially improves models: they found one 61-word English sentence repeated over 60,000 times in C4, and reported that deduplicated models emit memorized training text about ten times less often and reach the same or better accuracy in fewer training steps. Every later corpus has taken dedup more seriously than its predecessors.^[15]
Model based quality filtering. Optional, but increasingly standard. Train a classifier (often fastText or a small BERT) to discriminate "high quality" reference text (Wikipedia, books, OpenWebText) from random web text, and use the classifier score as a filter. KenLM perplexity on a Wikipedia trained 5-gram model is another option used by CCNet and others.
Content safety filtering. Remove personally identifiable information, child sexual abuse material indicators, and other clearly disallowed content. This is mostly a hash matching and regex problem at this stage.
Decontamination. Remove documents that overlap with the evaluation benchmarks you plan to use. The simple version checks for n-gram matches against benchmark inputs and outputs; Gopher used 13-gram Jaccard similarity against its test sets. Decontamination is famously imperfect, and most published models still leak some test sets.^[10]^[15]
Tokenization. The corpus is finally fed through a tokenizer and converted to integer IDs. Common tokenizer algorithms and implementations are byte pair encoding, WordPiece, SentencePiece (which runs BPE or unigram directly on raw Unicode text), and tiktoken (OpenAI's BPE implementation). Vocabulary sizes are typically in the range 32k to 256k tokens for modern LLMs. The tokenizer is trained on a sample of the cleaned corpus, then applied to all of it.
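
The approximate deduplication stage in the sequence above can be sketched with a small MinHash implementation. This is a toy under stated assumptions: word 3-gram shingles, 128 hash functions built from salted BLAKE2b, documents with at least three words, and no locality sensitive hashing step (a real pipeline bands the signatures so that it never has to compare all pairs).

```python
import hashlib

def shingles(text, n=3):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(text, num_hashes=128, n=3):
    """One minimum per salted hash function over the document's shingles."""
    grams = shingles(text, n)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big")
            for g in grams))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the shingle-set Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the quiet river bank tonight"
doc_c = "stock markets closed higher after the central bank left interest rates unchanged again"

sig_a, sig_b, sig_c = (minhash_signature(d) for d in (doc_a, doc_b, doc_c))
print(estimated_jaccard(sig_a, sig_b), estimated_jaccard(sig_a, sig_c))
```

The near-duplicate pair scores high and the unrelated pair scores near zero; a dedup pass would keep one document from each cluster whose estimated similarity exceeds a chosen threshold.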

Notable corpus pipelines

A short tour of decisions that are now public:

C4 (Raffel et al., 2020). Extracted from a single Common Crawl snapshot, filtered with a dense set of heuristic rules: keep only English (langdetect with threshold 0.99), drop pages with curly braces, drop pages with fewer than three sentences, drop pages with the placeholder "lorem ipsum," drop pages containing any word from a public bad-words list. About 750 GB after filtering, used to train T5.^[8]
The Pile (Gao et al., 2020). 825.18 GiB across 22 sub corpora (211 million documents), deliberately diverse: PubMed Central, ArXiv, GitHub, Books3, Wikipedia, FreeLaw, Hacker News, and others. Heavier emphasis on curated sources than on web filtering. Used by EleutherAI for GPT NeoX 20B.^[9]
RefinedWeb (Penedo et al., 2023). The TII Falcon corpus. Strategy: minimal heuristic filtering, very heavy deduplication combining exact and fuzzy passes. The paper, subtitled "Outperforming Curated Corpora with Web Data, and Web Data Only," argues that "properly filtered and deduplicated web data alone can lead to powerful models." Roughly 5 trillion tokens were extracted; 600 billion tokens were released publicly.^[11]
Dolma (Soldaini et al., 2024). 3 trillion tokens (sampled from about 200 TB of raw text, curated down to an 11 TB dataset), designed by AI2 for OLMo. Notable for documenting provenance of every document and publishing the full open toolkit. Mixes Common Crawl, Stack code, peS2o academic papers, project Gutenberg books, Reddit, and Wikipedia.^[12]
FineWeb (Penedo et al., 2024). Hugging Face follow up to RefinedWeb, derived from 96 Common Crawl snapshots. 15 trillion tokens. The paper reports that FineWeb "produces better-performing LLMs than other open pretraining datasets" and publishes sequenced ablations measuring the effect of each individual filtering step.^[13]
FineWeb Edu (2024). The same source filtered down to roughly 1.3 trillion tokens using an educational quality classifier trained on annotations from Llama 3 70B Instruct, keeping documents scored at or above the threshold. Smaller, but produces stronger downstream models for the same training compute.^[13]
DataComp LM (Li et al., 2024). A benchmark style approach: fixed evaluation, variable data filtering, leaderboard for what data choices win. Produced DCLM Baseline, a 4.1 trillion token corpus using model based filtering.^[14]

The broad arc of progress is that filtering has gotten more selective and dedup has gotten more aggressive. The total data after filtering keeps getting smaller relative to the input, and models trained on the filtered data keep getting better.

Tokenization in detail

Tokenization deserves its own treatment but is part of preprocessing here because it is the last step before the corpus becomes integer IDs. The two design axes are the algorithm and the unit.

Algorithms:

Byte pair encoding (BPE). Start with characters, repeatedly merge the most frequent adjacent pair, stop at a target vocabulary size. Used in GPT 2, GPT 3, GPT 4, Llama 2 and 3.^[16]
WordPiece. Like BPE but the merge criterion maximizes likelihood under a unigram language model. Used in BERT and DistilBERT.
Unigram. Start with a large vocabulary, prune low probability tokens. Used in some SentencePiece deployments and in T5.
SentencePiece. A wrapper that runs BPE or unigram directly on raw text, treating spaces as ordinary characters. Avoids language specific pre-tokenization. Used in Llama, mT5, ALBERT.^[17]

Units: most modern tokenizers operate on UTF-8 bytes rather than Unicode codepoints, which gives full coverage of any input text including emoji, code, and rare scripts. Byte-level BPE is now the default for general-purpose LLMs.

The tokenizer is fit once and then frozen. Changing the tokenizer mid training is essentially impossible; you have to start over.
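
The BPE training loop described above can be sketched on a toy word-frequency table. This is a character-level illustration with a hypothetical tie-breaking rule (highest count, then lexicographically largest pair); production tokenizers work on bytes, add pre-tokenization, and are far more optimized.

```python
from collections import Counter

def learn_bpe_merges(word_counts, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair."""
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Deterministic tie-break: count first, then the pair itself.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges

word_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe_merges(word_counts, num_merges=3)
print(merges)
```

The ordered merge list is the tokenizer: applying the merges in the same order to new text reproduces the segmentation, which is why the list has to stay frozen once training starts.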

How are images preprocessed for computer vision?

For images the pipeline is shorter and more standardized.

Decode and convert color space. Read JPEG or PNG into RGB. For tasks that depend on color invariance, sometimes convert to YCbCr or LAB.
Resize. Most architectures expect a fixed input size. Resize while preserving aspect ratio (with letterboxing) for object detection; center crop or random crop for classification.
Normalize pixel values. Scale to [0, 1] by dividing by 255, then subtract per-channel mean and divide by per-channel standard deviation. The exact constants used are part of the model card. For ImageNet-pretrained models the standard constants are mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225] in RGB order; these are computed from the ImageNet training set and have been reused for nearly every downstream task, applied in PyTorch via transforms.Normalize.^[22]
Augmentation. Stochastic transforms applied during training. Common ones: random crop, random horizontal flip, color jitter (brightness, contrast, saturation, hue), random erasing. Stronger pipelines: AutoAugment (a learned policy), RandAugment (a simpler version that replaces the learned policy with two hyperparameters, the number of transforms and their magnitude), TrivialAugment (a parameter-free variant), MixUp (linear interpolation of pairs of images and their labels), CutMix (paste a region from another image and combine labels in proportion to area).^[18]^[19]^[20]
Patchify (for vision transformers). Split the image into non overlapping patches of fixed size, typically 14 by 14 or 16 by 16, and flatten each patch into a vector for the transformer.
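
The normalization and patchify steps above can be sketched in NumPy. The sketch assumes a height × width × channel layout (PyTorch tensors are usually channel-first) and an image size that is divisible by the patch size; the function names are illustrative.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize(image_uint8):
    """Scale an HWC uint8 RGB image to [0, 1], then standardize per channel."""
    scaled = image_uint8.astype(np.float32) / 255.0
    return (scaled - IMAGENET_MEAN) / IMAGENET_STD

def patchify(image, patch=16):
    """Split an HWC image into non-overlapping patches, one flat vector each."""
    h, w, c = image.shape
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

image = np.zeros((224, 224, 3), dtype=np.uint8)  # placeholder black image
normalized = normalize(image)
patches = patchify(normalized)
print(patches.shape)
```

A 224 by 224 image with 16 by 16 patches yields 196 patch vectors of length 768, the sequence a vision transformer then embeds.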

Video preprocessing adds frame sampling decisions: uniform, dense, or learned, and the choice has a measurable effect on downstream action recognition.

How is audio preprocessed?

Audio comes in as a one dimensional waveform sampled at 8 kHz, 16 kHz, 22.05 kHz, 44.1 kHz, or 48 kHz. The first decisions are sample rate normalization and channel mixdown to mono.

From there the pipeline depends on the model. Classical audio models work in a time frequency domain:

Short time Fourier transform (STFT). Slide a windowed FFT across the waveform. Output is a complex spectrogram.
Magnitude or power spectrogram. Drop the phase by taking the magnitude of each complex bin; square the magnitude to get the power spectrogram.
Mel spectrogram. Re-bin frequency axis onto the mel scale, which approximates human auditory frequency perception. Standard input for speech recognition and audio classification.
Log mel. Take the log of the mel spectrogram. Compresses dynamic range, matches human loudness perception.
MFCC. Discrete cosine transform of the log mel spectrogram. Fewer coefficients, classical speech recognition feature.
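
The first steps of that chain (STFT, power, log) fit in a short NumPy sketch. The mel filterbank and the DCT are left out for brevity, and the window size, hop length, and 1 kHz test tone are illustrative assumptions; libraries such as librosa or torchaudio provide the complete transforms.

```python
import numpy as np

def log_power_spectrogram(waveform, n_fft=512, hop=256):
    """Windowed FFT over sliding frames, squared magnitude, then log."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(waveform) - n_fft) // hop
    frames = np.stack([
        waveform[i * hop : i * hop + n_fft] * window for i in range(n_frames)
    ])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # phase is discarded here
    return np.log(power + 1e-10)  # small constant avoids log(0)

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)  # one second of a 1 kHz sine

spec = log_power_spectrogram(tone)
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * sample_rate / 512
print(spec.shape, peak_hz)
```

The energy concentrates in the frequency bin corresponding to 1 kHz, which is a quick sanity check that the framing and FFT are wired correctly.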

Other specialized preprocessing:

Voice activity detection (VAD). Segment the audio into speech and non speech regions. Tools include WebRTC VAD and silero VAD. Used to drop silence or split long recordings.
Speaker diarization. "Who spoke when" labels for multi speaker recordings.
Noise reduction. Spectral subtraction, Wiener filtering, or learned denoisers.
Augmentation. Time stretching, pitch shifting, additive noise from a background corpus, room impulse response convolution to simulate reverberation, SpecAugment (mask random time and frequency bands of the spectrogram).^[21]

Modern end to end speech models like Whisper consume log mel spectrograms directly. Some recent models work on raw waveforms, but spectrograms remain the dominant input representation.

What tools and frameworks are used for preprocessing?

A small zoo of libraries handles the operations above. The right choice depends on data size and the rest of your stack.

Tool	Domain	Notes
pandas	Tabular	Single machine, the de facto standard for in memory dataframe work
polars	Tabular	Single machine or out of core, faster than pandas, lazy execution
dask	Tabular	Distributed pandas like API, scales to a cluster
Spark / PySpark	Tabular	Cluster scale, mature, widely used in industry
Ray Data	Tabular and unstructured	Python native distributed data processing
Apache Beam	Tabular and unstructured	Programming model that runs on Dataflow, Flink, Spark
scikit-learn preprocessing	Tabular ML	Imputers, scalers, encoders, pipelines
TensorFlow Data Validation (TFDV)	Schema and stats	Compute and compare statistics across splits, detect drift
Great Expectations	Validation	Declarative data quality checks with Python or YAML
Hugging Face Datasets	NLP / ML	Loading and streaming, memory mapped Arrow tables
torchvision.transforms	Vision	The standard image augmentation library for PyTorch
albumentations	Vision	Faster augmentation with more transforms, widely used in Kaggle
torchaudio	Audio	Audio I/O, transforms, datasets
librosa	Audio	Mel spectrograms, MFCCs, onset detection
trafilatura	Text extraction	HTML to plain text, used in many LLM corpora
datatrove	Text pipeline	Hugging Face library specifically for LLM data preparation

Most projects end up using two or three of these. A typical modern stack might be polars or Spark for the heavy lifting, scikit-learn for the model facing transforms, and torchvision plus albumentations for the image side.

Pipelines and reproducibility

One of the main reasons to use a structured tool is that preprocessing is the easiest place to introduce subtle, hard to reproduce bugs. A few practices reduce the failure rate.

Use a Pipeline object. In scikit-learn, Pipeline and ColumnTransformer chain transforms with the model, fit each step on training data, apply to validation and test. The point is that everything that has parameters (means for imputation, std for scaling, vocabulary for encoders) is fit only on the training fold. There is no longer a manual step where you might apply the wrong transform to the wrong split.^[5]
Track experiments. MLflow, Weights and Biases, and Neptune log preprocessing parameters along with model hyperparameters and metrics. When a result mysteriously moves, the log is the only way to back out which step changed.
Version data. DVC (Data Version Control) and lakeFS treat datasets the way Git treats source code: content addressable storage, branches, diffs. Crucial when raw data is changing or the cleaning rules are changing or both.
Containerize the pipeline. Reproducibility across machines is harder than it sounds; even a different BLAS version can shift floating-point results enough to change downstream metrics. A pinned container image is the cheapest defense.
Snapshot. For LLM training, store an immutable snapshot of the tokenized corpus. The training run is then reproducible as long as the snapshot survives, which is usually easier to guarantee than reproducing the whole filtering pipeline from scratch.
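
A minimal sketch of the Pipeline practice, using scikit-learn's Pipeline and ColumnTransformer on a made-up table. The column names, values, and the choice of logistic regression are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training and test tables.
train = pd.DataFrame({
    "age": [25, 32, np.nan, 51, 46, 38],
    "city": ["A", "B", "A", "C", "B", "A"],
    "label": [0, 0, 1, 1, 1, 0],
})
test = pd.DataFrame({"age": [np.nan, 29], "city": ["D", "A"]})  # "D" is unseen

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])

# Every fitted parameter (median, mean, std, category list) comes from train.
model.fit(train[["age", "city"]], train["label"])
predictions = model.predict(test)
print(predictions)
```

Because the transforms live inside the same object as the model, cross-validation utilities refit them on each training fold automatically, and the unseen category at prediction time is encoded as all zeros instead of raising an error.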

What are the most common preprocessing pitfalls?

The pitfalls below cost real engineering time. They are mostly in service of one principle: do not let any information from the test set leak into training, even by accident.

Fitting transforms on the full dataset. A StandardScaler fit on the union of train and test before splitting leaks the test set's mean and standard deviation into training. Always fit on training only, then apply to test.
Imputing before splitting. Same problem, applied to imputation. The mean used to fill missing values must come from training data.
Resampling before splitting. SMOTE applied to the full dataset before cross-validation places synthetic samples derived from test set neighbors into training. Apply resampling inside each CV fold, on the training portion only.
Target encoding without out-of-fold estimates. The encoded value for a category in training row r should not see the target of row r. Use an out-of-fold encoder or smooth heavily.
Train-test contamination. In LLMs especially, ensure the corpus does not contain the evaluation benchmarks. Run n-gram decontamination explicitly. Without it, expect some leakage rather than none.
Time series leakage. Random splits across time are almost never appropriate for time series. The model gets to see future information, validation looks great, production looks awful.
Group leakage. If multiple rows belong to the same patient, user, or document, splitting at the row level lets the model recognize the unit instead of generalizing. Use group-aware splitters.
Encoding mismatches between fit and predict. A category that appears at inference time but not in the training data makes scikit-learn's OneHotEncoder raise an error unless you set handle_unknown='ignore'. Decide your policy and bake it in.
Silent type coercion. Pandas can cast columns silently when you concatenate. Schema validation (Great Expectations, pandera, pyarrow) catches these before they reach the model.
Order dependence. Some preprocessing steps are not commutative. Standardizing then PCA gives different results than PCA then standardizing. Pick an order, document it, and stick to it.
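
Two of the pitfalls above, group leakage and time series leakage, can be avoided with splitters that scikit-learn already provides. The sketch below uses `GroupKFold` and `TimeSeriesSplit` on a small hypothetical dataset; the `patient_id` array and the assumption that rows are already sorted by time are illustrative, not part of any real schema.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

# Hypothetical data: 12 rows from 4 patients, assumed to be sorted by time.
X = np.arange(12).reshape(-1, 1)
y = np.array([0, 1] * 6)
patient_id = np.repeat([0, 1, 2, 3], 3)

# Group-aware split: no patient appears on both sides of any fold.
group_folds = list(GroupKFold(n_splits=4).split(X, y, groups=patient_id))
for train_idx, test_idx in group_folds:
    assert set(patient_id[train_idx]).isdisjoint(patient_id[test_idx])

# Time-ordered split: every training row precedes every test row.
time_folds = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, test_idx in time_folds:
    assert train_idx.max() < test_idx.min()
```

Either splitter can be passed as the `cv` argument to scikit-learn's cross-validation helpers in place of an integer; a group-aware splitter additionally needs the `groups` array supplied to the helper.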

Most of these have the same root: the preprocessing pipeline must respect the boundary between training and evaluation, and that boundary is easy to violate in code.

References

Anaconda (2020). *2020 State of Data Science*. Survey of 2,360 respondents across more than 100 countries; finds that on average 45% of working time is spent loading and cleansing data. https://www.anaconda.com/resources/whitepaper/state-of-data-science-2020
Anaconda press release (June 30, 2020). "Anaconda Releases 2020 State of Data Science Survey Results." GlobeNewswire. Quote: "Respondents reported that on average 45% of their time is spent getting data ready (loading and cleansing)."
CrowdFlower (2016). *2016 Data Science Report*. Reports data scientists spend about 60% of time cleaning and organizing data and 19% collecting data sets, roughly 80% on data preparation. https://www.kdnuggets.com/2016/04/crowdflower-2016-data-science-repost.html
"Garbage in, garbage out." First appeared in print in The Hammond Times, November 10, 1957; commonly attributed to IBM instructor George Fuechsel (c. 1958 to 1959). TechTarget definition. https://www.techtarget.com/searchsoftwarequality/definition/garbage-in-garbage-out
scikit-learn user guide, sections on preprocessing and imputation (StandardScaler, MinMaxScaler, RobustScaler, SimpleImputer, KNNImputer, IterativeImputer, Pipeline, ColumnTransformer). https://scikit-learn.org/stable/modules/preprocessing.html
Pedregosa, F. et al. (2011). "Scikit-learn: Machine Learning in Python." *JMLR* 12, 2825 to 2830. The reference implementation for tabular preprocessing.
Chawla, N. V. et al. (2002). "SMOTE: Synthetic Minority Over-sampling Technique." *JAIR* 16, 321 to 357.
Raffel, C. et al. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." *JMLR*. Introduces C4, about 750 GB of English text used to train T5.
Gao, L. et al. (2020). "The Pile: An 800GB Dataset of Diverse Text for Language Modeling." arXiv:2101.00027. 825.18 GiB across 22 sub-datasets.
Rae, J. W. et al. (2021). "Scaling Language Models: Methods, Analysis and Insights from Training Gopher." arXiv:2112.11446. Source of the Gopher quality and repetition filters and 13-gram test-set decontamination.
Penedo, G. et al. (2023). "The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only." arXiv:2306.01116. About 5 trillion tokens extracted; 600 billion released.
Soldaini, L. et al. (2024). "Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research." *ACL 2024*. arXiv:2402.00159.
Penedo, G. et al. (2024). "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale." arXiv:2406.17557. 15 trillion tokens from 96 Common Crawl snapshots; FineWeb-Edu is about 1.3 trillion tokens filtered with a Llama 3 70B trained educational classifier.
Li, J. et al. (2024). "DataComp-LM: In search of the next generation of training sets for language models." arXiv:2406.11794.
Lee, K. et al. (2022). "Deduplicating Training Data Makes Language Models Better." *ACL*. arXiv:2107.06499. Finds a 61-word sentence repeated over 60,000 times in C4, about 1% train-test overlap in standard corpora, and that deduplicated models emit memorized text roughly ten times less often.
Sennrich, R., Haddow, B., Birch, A. (2016). "Neural Machine Translation of Rare Words with Subword Units." *ACL*. Introduces BPE for NMT.
Kudo, T., Richardson, J. (2018). "SentencePiece: A simple and language independent subword tokenizer and detokenizer." *EMNLP*.
Cubuk, E. D. et al. (2020). "RandAugment: Practical automated data augmentation with a reduced search space." *CVPR*.
Yun, S. et al. (2019). "CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features." *ICCV*.
Zhang, H. et al. (2018). "mixup: Beyond Empirical Risk Minimization." *ICLR*.
Park, D. S. et al. (2019). "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition." *Interspeech*.
torchvision documentation, `transforms.Normalize`, and the ImageNet pretrained model normalization constants mean [0.485, 0.456, 0.406] and std [0.229, 0.224, 0.225]. https://pytorch.org/vision/stable/transforms.html
Hastie, T., Tibshirani, R., and Friedman, J. (2009). *The Elements of Statistical Learning*, 2nd ed. Springer. Chapters 3 and 18 cover scaling, basis expansions, and high dimensional preprocessing.
Bishop, C. (2006). *Pattern Recognition and Machine Learning*. Springer. Sections on preprocessing, normalization, and missing data.
Goodfellow, I., Bengio, Y., Courville, A. (2016). *Deep Learning*. MIT Press. Chapter 7 on regularization, including augmentation, and Chapter 14 on autoencoders.
van Buuren, S. (2018). *Flexible Imputation of Missing Data*, 2nd ed. Chapman and Hall. The standard reference for MICE.
