Data preprocessing
Data preprocessing is the set of operations applied to raw data before it is used to train or evaluate a machine learning model. The category covers a long list of activities: removing duplicates, fixing types, imputing missing values, scaling numbers, encoding categories, building features, splitting datasets, augmenting samples, and (for modern web-scale corpora) deduplicating, language filtering, quality scoring, and tokenizing text. None of these steps is glamorous, and most of the time spent on a real project is spent here. Various practitioner surveys put the share of time on data preparation somewhere between half and four fifths of the total work, and that estimate has been remarkably stable since the term "data wrangling" started showing up in the 2010s.
The topic sits at the intersection of statistics, software engineering, and the messy reality of how data actually arrives. Classical preprocessing for tabular data has well established recipes that you will find in scikit-learn and the textbooks of Hastie, Tibshirani and Friedman or Bishop. The deep learning era added a separate body of practice for images, audio, and text, and the large language model era added another layer on top of that, where you are dealing with petabytes of crawled web pages and have to make defensible choices about deduplication, content filtering, and decontamination of evaluation benchmarks. This article covers all three.
Why preprocessing matters
There is an old slogan in computing, garbage in, garbage out, and it survives because it is essentially right. A model is a function that maps inputs to outputs, and if the inputs are encoded in a form the model cannot use, or contain systematic distortions the model cannot see past, no amount of tuning fixes that downstream. Three concrete reasons preprocessing affects results:
- Algorithmic assumptions. Many learning algorithms make implicit assumptions about their inputs. Linear regression with ridge (L2) regularization penalizes all coefficients equally, which is only sensible when features are on comparable scales; k-nearest neighbors and support vector machines compute distances and so depend heavily on scale; gradient descent converges much faster when input features are roughly zero mean and unit variance. Tree based methods are mostly invariant to monotone transformations of features, but even they are sensitive to how categorical variables are encoded.
- Numerical stability. Floating point arithmetic loses precision when values span many orders of magnitude. Standardization, log transforms, and clipping outliers are not just statistical decorations; they keep gradients finite and matrices invertible.
- Statistical validity. Removing leakage, balancing classes, and choosing a sound train/validation/test split are what determine whether your reported numbers generalize. Most of the spectacular failures in deployed ML look in retrospect like preprocessing failures: the test set was contaminated, the imputation was fit on the test data, the encoder leaked the target, the dedup was incomplete and the same example showed up in both splits.
For LLM training the stakes get larger. A single large training run can cost millions of dollars, and the data choices baked in at the start are essentially permanent for that checkpoint. Mistakes get amplified, and by the time you notice that you trained on the test set, you have already trained on the test set.
Categories of preprocessing
There is no canonical taxonomy, but the activities below cover most of what people mean by "preprocessing" in tabular and classical ML contexts. The categories overlap. Feature engineering is sometimes treated as separate because it is more creative than mechanical, and the same with feature extraction for raw signal data, but in practice it all lives in the same pipeline.
Cleaning
Cleaning is the part you cannot skip. Real data has missing entries, mistyped values, duplicate rows, encoding problems, free text that should be a category, dates stored as strings, numbers stored as strings, and the occasional silent unit error (centimeters logged as meters). The first pass usually is:
- Standardize types. Coerce columns to the right dtype. In pandas this means using astype or convert_dtypes; pyarrow and polars are strictly typed, so much of this is enforced when the data is loaded.
- Strip whitespace, normalize case, and fix encoding problems (decode to UTF-8, apply Unicode normalization, remove zero width characters).
- Detect and resolve duplicates. Exact duplicates are easy. Near duplicates need fuzzy matching: Levenshtein distance, MinHash, or domain specific keys.
- Detect outliers. The cheapest screen is the interquartile range rule, which flags values below Q1 minus 1.5 IQR or above Q3 plus 1.5 IQR. More principled options include z-score thresholds (only meaningful for roughly Gaussian data), Isolation Forest, Local Outlier Factor, and One Class SVM. See outlier detection for a longer treatment.
A recurring pitfall is treating outliers as something to delete. Often they are real, just rare, and dropping them silently biases the model.
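As a minimal sketch of the interquartile range screen described above, the following flags values outside the fences and leaves the decision about them to a human. The function name and the sample heights are hypothetical, chosen to mimic a unit error.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical heights in centimeters, with one value logged in meters by mistake.
heights = np.array([172.0, 168.0, 181.0, 175.0, 1.79, 169.0, 177.0])
mask = iqr_outlier_mask(heights)

# Flag for review rather than delete: the value may be real, or fixable.
print(heights[mask])
```

Returning a mask instead of a filtered array keeps the original rows intact, which fits the point above that outliers are often real and should not be dropped silently.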
Missing values and imputation
Missingness has a structure that matters. Statisticians distinguish three regimes:
- MCAR (missing completely at random): the probability of being missing does not depend on either observed or unobserved values. Rare in practice.
- MAR (missing at random): missingness depends on observed values but not on the missing value itself. This is the regime where most imputation methods are valid.
- MNAR (missing not at random): missingness depends on the unobserved value (income missing because high earners refuse to answer). This is the hard case, and no general purpose method handles it.
Common imputation strategies, in roughly increasing order of sophistication:
| Method | How it works | When to use |
|---|---|---|
| Drop rows | Remove rows with any missing value | Tiny missing fraction, MCAR |
| Drop columns | Remove the entire column | Column is mostly missing |
| Mean / median / mode | Fill with column statistic | Quick baseline; loses variance |
| Forward / backward fill | Propagate adjacent values | Time series |
| KNN imputation | Average of k nearest neighbors | Mixed types, moderate size |
| Iterative (MICE) | Model each column from the others, iterate | Multivariate, MAR |
| Model based | Use a regressor or classifier per column | Larger datasets |
| Indicator variables | Add a binary "was missing" feature | Always cheap, often helps |
A practical move that is too rarely done: combine imputation with an explicit "is missing" indicator column. The model can then learn whether missingness itself is informative.
In scikit-learn the relevant tools are SimpleImputer, KNNImputer, and the experimental IterativeImputer, which is inspired by the MICE algorithm but returns a single imputation rather than multiple ones.
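The combination of imputation with a "was missing" indicator can be sketched with SimpleImputer and its add_indicator option. The toy arrays are made up for illustration; the key point is that the median is learned from training rows only and then reused on new rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training and test matrices with missing entries.
X_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan], [5.0, 40.0]])
X_test = np.array([[np.nan, 30.0]])

# add_indicator=True appends one binary column per feature that had
# missing values during fit, so the model can see the missingness itself.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_train_imp = imputer.fit_transform(X_train)  # medians learned here only
X_test_imp = imputer.transform(X_test)        # training medians reused

print(X_train_imp)
print(X_test_imp)
```

The output has the two imputed columns followed by two indicator columns, one for each feature that contained missing values in the training data.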
Encoding categorical variables
Most models cannot consume strings directly. The choice of encoding has a real effect on accuracy, especially for high cardinality categoricals.
| Encoding | Output | Notes |
|---|---|---|
| One hot | One binary column per level | Standard for low cardinality, blows up for high cardinality |
| Ordinal / label | Integer per level | Only meaningful when an order exists |
| Target / mean | Mean of target per level | Powerful but leaks unless done in a CV aware way |
| Frequency / count | Count of each level in training data | Cheap proxy for prevalence |
| Hashing | Hash to fixed bucket count | Fast, no vocabulary, hash collisions |
| Embedding | Learned dense vector per level | Standard in deep learning, see embeddings |
Target encoding is genuinely useful for high cardinality features (zip codes, product IDs, user IDs) but is also one of the easiest places to introduce data leakage. The clean way to do it is to compute encodings out of fold inside a cross validation loop, then refit on full training data for prediction.
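A rough sketch of out of fold target encoding in plain Python follows. It assumes at least two folds and assigns folds by row index modulo the fold count, which is a simplification; a real pipeline would shuffle or stratify, and would usually add smoothing. The function name and data are hypothetical.

```python
from collections import defaultdict

def oof_target_encode(categories, targets, n_folds=3):
    """Encode each row with its category's mean target computed on the other folds."""
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums, counts = defaultdict(float), defaultdict(int)
        total, seen = 0.0, 0
        # Accumulate statistics from every row that is NOT in the held-out fold.
        for i in range(n):
            if i % n_folds != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
                total += targets[i]
                seen += 1
        prior = total / seen  # fallback for categories unseen outside this fold
        # Encode the held-out rows using statistics that never saw their targets.
        for i in range(fold, n, n_folds):
            c = categories[i]
            encoded[i] = sums[c] / counts[c] if counts[c] else prior
    return encoded

cats = ["a", "a", "b", "a", "b", "c"]
ys = [1, 0, 1, 1, 0, 1]
encoded = oof_target_encode(cats, ys)
print(encoded)
```

Row 0 has target 1 but is encoded from the other "a" rows outside its fold, so its own label never reaches its feature value.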
Scaling and normalization
Scaling puts numerical features on comparable ranges. The four common variants:
| Method | Formula | Robust to outliers? |
|---|---|---|
| Standardization (z score) | (x minus mean) divided by std | No |
| Min max | (x minus min) divided by (max minus min) | No |
| Max abs | x divided by max absolute value | No, preserves sparsity |
| Robust scaling | (x minus median) divided by IQR | Yes |
Standardization is the default for most linear methods and for neural networks. Min max scaling maps to [0, 1] and is sometimes used for image pixel intensities or for inputs to algorithms with bounded activation functions. Robust scaling, which uses the median and the interquartile range instead of the mean and standard deviation, is the choice when outliers are present and you do not want to remove them.
The related concept of normalization sometimes refers to row wise rescaling (each sample to unit norm) and sometimes to column wise scaling. Read the documentation carefully; in scikit-learn StandardScaler is column wise, Normalizer is row wise.
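To make the difference between standardization and robust scaling concrete, here is a small NumPy sketch with one extreme value in the training data. The helper names are hypothetical; the statistics are learned from the training array and then applied unchanged to the test array.

```python
import numpy as np

def fit_standard(train):
    """Learn mean and standard deviation from training data only."""
    return train.mean(), train.std()

def fit_robust(train):
    """Learn median and interquartile range from training data only."""
    q1, median, q3 = np.percentile(train, [25, 50, 75])
    return median, q3 - q1

train = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one extreme value
test = np.array([2.5])

mean, std = fit_standard(train)
median, iqr = fit_robust(train)

z_train = (train - mean) / std
robust_train = (train - median) / iqr
z_test = (test - mean) / std          # reuse training statistics
robust_test = (test - median) / iqr   # reuse training statistics

print(z_train.round(2))
print(robust_train.round(2))
```

The single value of 100 drags the mean and inflates the standard deviation, so the four ordinary points are squeezed together after standardization, while the median and IQR barely notice it.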
Transformations
Transformations change the shape of a feature's distribution. They are useful when models assume something approximately Gaussian or when a feature has heavy tails:
- Log transform. log(1 + x) is the workhorse for right skewed nonnegative data: counts, prices, durations, anything power law shaped. Use log1p, which is defined at zero and stays accurate for small x.
- Box-Cox. A parameterized power transform that interpolates between log, square root, and identity. Requires strictly positive inputs.
- Yeo-Johnson. A generalization of Box-Cox that handles zeros and negatives. Implemented in scikit-learn as PowerTransformer(method='yeo-johnson').
- Quantile transform. Maps the empirical distribution onto a uniform or Gaussian target. Strong if you do not care about preserving distances.
- Polynomial features. Add cross products and powers up to a degree. Used to be popular for linear models; mostly replaced by tree based methods and neural nets in modern practice.
- Binning / discretization. Convert a continuous feature into a categorical one by cutting it at fixed or quantile based boundaries. See bucketing for the longer treatment, including the tradeoffs around quantile bucketing for ML serving.
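Two of the transforms in the list above, the log transform and quantile binning, can be sketched in a few lines of NumPy. The arrays are illustrative; the bin edges are computed once from training values and then reused for new values.

```python
import numpy as np

# Log transform: log1p is defined at zero and expm1 inverts it.
counts = np.array([0.0, 1.0, 9.0, 99.0, 999.0])
logged = np.log1p(counts)
restored = np.expm1(logged)

# Quantile binning: learn the quartile boundaries from training values...
train_values = np.arange(1.0, 9.0)  # 1.0, 2.0, ..., 8.0
edges = np.quantile(train_values, [0.25, 0.5, 0.75])

# ...then map any value, seen or unseen, to a bin index in 0..3.
train_bins = np.digitize(train_values, edges)
new_bins = np.digitize(np.array([0.5, 100.0]), edges)

print(logged.round(3))
print(train_bins, new_bins)
```

Values beyond the training range fall into the first or last bin rather than raising an error, which is usually the behavior you want at serving time.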
Feature selection
When you have more features than you need, you can prune the surplus automatically. Three families:
- Filter methods. Score each feature against the target without involving the model. Examples: chi square test for categorical variables, ANOVA F test for numerical features against a categorical target, mutual information, correlation with the target. Cheap, and a good first pass.
- Wrapper methods. Treat feature selection as a search problem and evaluate the model on subsets. Recursive feature elimination (RFE), forward selection, backward elimination. More accurate, much slower.
- Embedded methods. The model itself does selection during fitting. L1 regularized regression (Lasso) drives some coefficients exactly to zero. Tree based models give feature importances based on how often each feature is used in splits or how much those splits reduce impurity, and the importances from boosted trees can be thresholded to select features.
Dimensionality reduction
Distinct from feature selection: dimensionality reduction projects all features into a lower dimensional space, mixing them together. Standard methods:
- Principal component analysis (PCA) finds orthogonal directions that maximize variance. Linear, fast, good for whitening.
- t-SNE preserves local neighborhoods. Mostly used for visualization, not as a preprocessing step for downstream models.
- UMAP preserves both local and global structure better than t-SNE for many datasets, and is fast enough to use as a preprocessing transform.
- Autoencoders (under deep learning) learn nonlinear projections.
For most tabular tasks PCA is the right first try. For modern image and text work, the relevant projections are baked into the model itself and explicit dimensionality reduction during preprocessing is unusual.
Class imbalance
When one class is much rarer than the others (fraud, rare disease, churn), naive training tends to ignore it. Three responses:
- Resampling. Oversample the minority class, undersample the majority, or both. The synthetic minority oversampling technique SMOTE generates new minority samples by interpolating between existing ones; ADASYN is a variant that focuses on harder to learn regions.
- Class weighting. Most learners accept per class weights that scale the loss contribution. In scikit-learn, set class_weight='balanced' and weights inversely proportional to class frequency are computed automatically.
- Threshold adjustment. Train on the original distribution and tune the decision threshold on validation data. Often as effective as resampling and avoids artifacts.
A caveat: resampling should be done inside the cross validation fold, never on the full dataset before splitting, or you will leak.
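For intuition about what inverse frequency weights look like, the sketch below computes them by hand with the formula n_samples / (n_classes * class_count), which is the heuristic scikit-learn documents for class_weight='balanced'. The function name and the 95/5 label split are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * m) for cls, m in counts.items()}

# Hypothetical fraud-style labels: 95 negatives, 5 positives.
labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(labels)
print(weights)
```

With these weights each class contributes the same total weight to the loss, so the 5 positive examples together count as much as the 95 negatives.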
Augmentation
Data augmentation generates additional training examples by transforming existing ones in label preserving ways. Modality specific:
- Images. Flips, crops, color jitter, rotations, scaling, blur, MixUp (linear interpolation of images and labels), CutMix (paste a patch from one image into another and mix labels by area), RandAugment, AutoAugment.
- Text. Synonym replacement, random insertion, deletion, swap (the EDA family), back translation, paraphrase models, span masking.
- Audio. Time shifting, pitch shifting, time stretching, additive noise, SpecAugment for spectrogram domain masking.
- Tabular. Less standardized, but Gaussian noise injection on continuous columns and SMOTE-like interpolation are used.
The principle in all cases: every augmentation must preserve the label. A horizontal flip is fine for cat-vs-dog classification, not for OCR.
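MixUp is simple enough to sketch directly. The version below assumes one-hot labels and takes the mixing coefficient as an argument so the arithmetic is easy to follow; in the original method the coefficient is drawn from a Beta(alpha, alpha) distribution, and the alpha of 0.2 used here is only an illustrative choice.

```python
import numpy as np

def mixup(x1, y1, x2, y2, lam):
    """Linearly interpolate two examples and their one-hot labels."""
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y

# Two hypothetical 2x2 "images" with one-hot labels for two classes.
x_a, y_a = np.zeros((2, 2)), np.array([1.0, 0.0])
x_b, y_b = np.ones((2, 2)), np.array([0.0, 1.0])

# Fixed coefficient, to make the result easy to check by hand.
x_mix, y_mix = mixup(x_a, y_a, x_b, y_b, lam=0.7)

# During training the coefficient would be sampled per batch or per pair.
rng = np.random.default_rng(0)
lam_sampled = rng.beta(0.2, 0.2)
x_rand, y_rand = mixup(x_a, y_a, x_b, y_b, lam_sampled)

print(x_mix, y_mix)
```

Note that the label is mixed with the same coefficient as the input, which is what keeps the augmentation consistent with the label preserving principle stated above.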
Splitting
The last preprocessing decision before training is how to partition data. Common patterns:
- Random train / validation / test split. The default for IID data. Typical ratios 60/20/20 or 80/10/10.
- Stratified split. Preserves the class distribution across splits. Important for imbalanced problems.
- Group K fold. Ensures that all samples from the same group (patient, user, document) end up in the same fold. Necessary whenever group level structure could leak through random splits.
- Time series split. For temporally ordered data, the validation set must come after the training set in time. Use rolling origin evaluation, never a random split.
- Leave one group out. When groups are small and you can afford it.
The cardinal rule is that the test set is touched exactly once, at the end. Anything you tune on it (hyperparameters, preprocessing parameters, threshold, ensemble weights) is no longer a fair generalization estimate.
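The group aware and time ordered patterns above can be sketched without any library. Both helpers are hypothetical and deliberately simple: the group split holds out the last groups in sorted order rather than sampling them, and the time split uses a single cutoff rather than a rolling origin.

```python
def group_train_test_split(groups, test_fraction=0.25):
    """Split row indices so that no group appears on both sides."""
    unique = sorted(set(groups))
    n_test = max(1, round(len(unique) * test_fraction))
    test_groups = set(unique[-n_test:])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx

def time_ordered_split(timestamps, test_fraction=0.25):
    """Split row indices so that every test row is later than every training row."""
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    cut = len(order) - max(1, round(len(order) * test_fraction))
    return order[:cut], order[cut:]

# Hypothetical patient IDs and timestamps for eight rows.
patients = ["p1", "p1", "p2", "p3", "p3", "p4", "p2", "p4"]
times = [5, 1, 4, 2, 8, 3, 7, 6]

g_train, g_test = group_train_test_split(patients)
t_train, t_test = time_ordered_split(times)
print(g_train, g_test)
print(t_train, t_test)
```

In scikit-learn the corresponding splitters are GroupKFold and TimeSeriesSplit, which handle multiple folds and are the better choice in real pipelines.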
Modern NLP and LLM data preparation
The LLM era turned preprocessing into a discipline of its own. The pipeline that produces a training corpus from raw web crawls now has many stages, and the design choices at each stage have measurable downstream effects on model quality. The reference points here are the documented pipelines for C4, The Pile, RefinedWeb, FineWeb and FineWeb Edu, Dolma, and DataComp LM. Each one published enough detail that you can reconstruct what they did.
The shape of a web scale text pipeline
Most pipelines that start from Common Crawl WARC files run something close to the following sequence. The order matters; getting it wrong wastes compute or, worse, produces a corpus that looks fine but trains a worse model.
- Raw extraction. Pull plain text from HTML. Trafilatura and resiliparse are common choices. C4 skipped this step and started from the WET files, the text extracts that Common Crawl itself provides; RefinedWeb and FineWeb instead ran trafilatura over the raw WARC files, reporting that it gives cleaner text than the WET extracts.
- Language identification. Filter to the languages you want. Common tools are CLD3 (Google's compact language detector) and fastText language identification. For a model intended to be English only, you typically keep documents whose detected language is English with confidence above a threshold like 0.65.
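- URL based filtering. Drop pages whose URLs match blocklists of adult content, malware, and spam domains; RefinedWeb and FineWeb both apply such lists. C4 took a related but cruder approach and dropped any page whose text contained a word from the public "List of Dirty, Naughty, Obscene or Otherwise Bad Words".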
- Heuristic quality filters. Rules of thumb that throw out obviously broken or low quality documents. The Gopher paper from DeepMind documented a clean set, since reused widely: drop documents with too few or too many words, with mean word length outside [3, 10], with a high ratio of symbols to words, with too many lines that start with a bullet point or end with an ellipsis, or with too few common stop words. C4 added line level rules: keep only lines that end in terminal punctuation, drop very short lines, and drop pages containing curly braces (a heuristic for code in a text only corpus).
- Repetition filters. Drop documents with a high fraction of repeated lines, paragraphs, or n-grams. These rules catch SEO spam, navigation menus that leaked through extraction, and template generated text.
- Deduplication. This is often the single highest impact step. Two main approaches:
  - Exact deduplication on long substrings. Find spans of roughly 50 tokens or more that occur verbatim in more than one document and remove the repeats. Implemented with suffix arrays.
  - Approximate deduplication. Compute MinHash signatures over n-gram shingles, then use locality sensitive hashing to find near duplicate document pairs and remove all but one. Lee et al. (2022) showed that thorough dedup reduces memorization and lets models reach the same or better accuracy in fewer training steps, and later corpora have generally taken dedup more seriously than their predecessors.
- Model based quality filtering. Optional, but increasingly standard. Train a classifier (often fastText or a small BERT) to discriminate "high quality" reference text (Wikipedia, books, OpenWebText) from random web text, and use the classifier score as a filter. KenLM perplexity on a Wikipedia trained 5-gram model is another option used by CCNet and others.
- Content safety filtering. Remove personally identifiable information, child sexual abuse material indicators, and other clearly disallowed content. This is mostly a hash matching and regex problem at this stage.
- Decontamination. Remove documents that overlap with the evaluation benchmarks you plan to use. The simple version checks for n-gram matches against benchmark inputs and outputs; the careful version normalizes and uses larger n. Decontamination is famously imperfect, and leaked test data keeps being found in published models.
- Tokenization. The corpus is finally fed through a tokenizer and converted to integer IDs. Common algorithms are byte pair encoding (BPE), WordPiece, and unigram; common implementations are SentencePiece (which runs BPE or unigram directly on raw Unicode text) and tiktoken (OpenAI's BPE implementation). Vocabulary sizes are typically in the range 32k to 256k tokens for modern LLMs. The tokenizer is trained on a sample of the cleaned corpus, then applied to all of it.
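The approximate deduplication step in the sequence above rests on MinHash, which can be sketched with the standard library alone. This is a toy version under stated assumptions: word 3-gram shingles, 64 salted BLAKE2b hashes standing in for random permutations, and a direct pairwise comparison instead of the locality sensitive hashing that makes the method scale. All function names and documents are hypothetical.

```python
import hashlib

def shingles(text, n=3):
    """Return the set of word n-grams in a lowercased, whitespace-split text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(shingle_set, num_perm=64):
    """For each salted hash function, keep the minimum hash over all shingles."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river bed"
doc_c = "tokenizers convert cleaned text into integer ids before training starts"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
near = estimated_jaccard(sig_a, sig_b)
far = estimated_jaccard(sig_a, sig_c)
print(near, far)
```

A production pipeline would split each signature into bands and bucket documents by band hash, so that only documents sharing a bucket are ever compared.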
Notable corpus pipelines
A short tour of decisions that are now public:
- C4 (Raffel et al., 2020). Extracted from a single Common Crawl snapshot, filtered with a dense set of heuristic rules: keep only English (langdetect with threshold 0.99), drop pages with curly braces, drop pages with fewer than three sentences, drop pages with the placeholder "lorem ipsum", drop pages containing any word from a bad words list. About 750 GB after filtering, used to train T5.
- The Pile (Gao et al., 2020). 825 GB, deliberately diverse: 22 sub corpora including PubMed, ArXiv, GitHub, Books3, Wikipedia, FreeLaw, Hacker News. Heavier emphasis on curated sources than on web filtering. Used by EleutherAI for GPT NeoX 20B.
- RefinedWeb (Penedo et al., 2023). The TII Falcon corpus. Strategy: heuristic filtering followed by very heavy fuzzy and exact deduplication. The paper argued that properly filtered and deduplicated web data alone is enough for state of the art training. Roughly 5 trillion tokens, of which about 600 billion were released publicly.
- Dolma (Soldaini et al., 2024). 3 trillion tokens, designed by AI2 for OLMo. Notable for documenting provenance of every document and publishing the full toolkit. Mixes Common Crawl, Stack code, peS2o academic papers, project Gutenberg books, Reddit, Wikipedia.
- FineWeb (Penedo et al., 2024). Hugging Face follow up to RefinedWeb. 15 trillion tokens. Sequenced ablations to test individual filtering steps; the paper publishes the actual effect of each rule.
- FineWeb Edu (2024). The same source filtered down to roughly 1.3 trillion tokens. Llama 3 70B Instruct was used to score a sample of documents for educational quality, a small classifier was trained on those scores, and documents above a threshold were kept. Smaller, but produces stronger downstream models for the same training compute.
- DataComp LM (Li et al., 2024). A benchmark style approach: fixed evaluation, variable data filtering, leaderboard for what data choices win. Produced DCLM Baseline, a corpus of roughly 4 trillion tokens built with model based filtering.
The broad arc of progress is that filtering has gotten more selective and dedup has gotten more aggressive. The total data after filtering keeps getting smaller relative to the input, and models trained on the filtered data keep getting better.
Tokenization in detail
Tokenization deserves its own treatment but is part of preprocessing here because it is the last step before the corpus becomes integer IDs. The two design axes are the algorithm and the unit.
Algorithms:
- Byte pair encoding (BPE). Start with characters, repeatedly merge the most frequent adjacent pair, stop at a target vocabulary size. Used in GPT-2, GPT-3, GPT-4, Llama 2 and 3.
- WordPiece. Like BPE but the merge criterion maximizes likelihood under a unigram language model. Used in BERT and DistilBERT.
- Unigram. Start with a large vocabulary, prune low probability tokens. Used in some SentencePiece deployments and in T5.
- SentencePiece. A wrapper that runs BPE or unigram directly on raw text, treating spaces as ordinary characters. Avoids language specific pre-tokenization. Used in Llama, mT5, ALBERT.
Units: most modern tokenizers operate on UTF-8 bytes rather than Unicode codepoints, which gives full coverage of any input text including emoji, code, and rare scripts. Byte level BPE is now the default for general purpose LLMs.
The tokenizer is fit once and then frozen. Changing the tokenizer mid training is essentially impossible; you have to start over.
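The BPE training loop described above fits in a short function. This is a character level toy on a hypothetical word list, not a byte level production tokenizer, and the tie-breaking rule (prefer the lexicographically larger pair) is an arbitrary choice made here so the output is deterministic.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent symbol pair."""
    # Each distinct word is a tuple of symbols, starting from single characters.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties go to the lexicographically larger pair.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(corpus, num_merges=3)
print(merges)
```

The ordered merge list is the tokenizer: encoding new text means applying the same merges in the same order, which is why the tokenizer must be frozen once training starts.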
Computer vision preprocessing
For images the pipeline is shorter and more standardized.
- Decode and convert color space. Read JPEG or PNG into RGB. For tasks that depend on color invariance, sometimes convert to YCbCr or LAB.
- Resize. Most architectures expect a fixed input size. Resize while preserving aspect ratio (with letterboxing) for object detection; center crop or random crop for classification.
- Normalize pixel values. Scale to [0, 1] by dividing by 255, then subtract per channel mean and divide by per channel standard deviation. The exact constants used are part of the model card. For ImageNet pretrained models the mean is around [0.485, 0.456, 0.406] and the std is around [0.229, 0.224, 0.225] in RGB order; these numbers come from the ImageNet training set and have been reused for nearly every downstream task.
- Augmentation. Stochastic transforms applied during training. Common ones: random crop, random horizontal flip, color jitter (brightness, contrast, saturation, hue), random erasing. Stronger pipelines: AutoAugment (a learned policy), RandAugment (a simpler scheme with only two hyperparameters, the number of operations and their magnitude), TrivialAugment (parameter free), MixUp (linear interpolation of pairs of images and their labels), CutMix (paste a region from another image and combine labels in proportion to area).
- Patchify (for vision transformers). Split the image into non overlapping patches of fixed size, typically 14 by 14 or 16 by 16, and flatten each patch into a vector for the transformer.
Video preprocessing adds frame sampling decisions: uniform, dense, or learned, and the choice has a measurable effect on downstream action recognition.
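The normalization and patchify steps for still images can be sketched in NumPy, assuming an image already decoded and resized to 224 by 224 in height, width, channel layout with RGB order. The function names are hypothetical; the constants are the ImageNet values quoted above.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize(image_uint8):
    """Scale uint8 RGB pixels to [0, 1], then standardize each channel."""
    x = image_uint8.astype(np.float32) / 255.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be multiples of the patch size")
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return x.reshape(-1, patch * patch * c)

# A hypothetical random image standing in for a decoded, resized photo.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

normalized = normalize(image)
patches = patchify(normalized)
print(patches.shape)
```

Each row of the result is one 16 by 16 patch flattened to 768 values, in row major order across the image, which is the sequence a vision transformer then projects and embeds.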
Audio preprocessing
Audio comes in as a one dimensional waveform sampled at 8 kHz, 16 kHz, 22.05 kHz, 44.1 kHz, or 48 kHz. The first decisions are sample rate normalization and channel mixdown to mono.
From there the pipeline depends on the model. Classical audio models work in a time frequency domain:
- Short time Fourier transform (STFT). Slide a windowed FFT across the waveform. Output is a complex spectrogram.
- Magnitude or power spectrogram. Drop the phase by taking the magnitude of each complex value; square the magnitude to get the power spectrogram.
- Mel spectrogram. Re-bin frequency axis onto the mel scale, which approximates human auditory frequency perception. Standard input for speech recognition and audio classification.
- Log mel. Take the log of the mel spectrogram. Compresses dynamic range, matches human loudness perception.
- MFCC. Discrete cosine transform of the log mel spectrogram. Fewer coefficients, classical speech recognition feature.
Other specialized preprocessing:
- Voice activity detection (VAD). Segment the audio into speech and non speech regions. Tools include WebRTC VAD and silero VAD. Used to drop silence or split long recordings.
- Speaker diarization. "Who spoke when" labels for multi speaker recordings.
- Noise reduction. Spectral subtraction, Wiener filtering, or learned denoisers.
- Augmentation. Time stretching, pitch shifting, additive noise from a background corpus, room impulse response convolution to simulate reverberation, SpecAugment (mask random time and frequency bands of the spectrogram).
Modern end to end speech models like Whisper consume log mel spectrograms directly. Some recent models work on raw waveforms, but spectrograms remain the dominant input representation.
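The first stages of the time frequency pipeline, from waveform to log power spectrogram, can be sketched in NumPy. The mel filterbank step is omitted here, and the window of 400 samples with a hop of 160 (25 ms and 10 ms at 16 kHz) is an assumed configuration common in speech work. Unlike some libraries, this sketch does not pad the signal at the edges.

```python
import numpy as np

def log_power_spectrogram(waveform, n_fft=400, hop=160, eps=1e-10):
    """Windowed STFT, then squared magnitude, then log for dynamic range compression."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(waveform) - n_fft) // hop
    frames = np.stack([
        waveform[i * hop : i * hop + n_fft] * window for i in range(n_frames)
    ])
    spectrum = np.fft.rfft(frames, axis=1)  # complex, shape (n_frames, n_fft // 2 + 1)
    power = np.abs(spectrum) ** 2           # drop the phase, square the magnitude
    return np.log(power + eps)

# One second of a hypothetical 1 kHz sine tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000.0 * t)

spec = log_power_spectrogram(tone)
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * sample_rate / 400
print(spec.shape, peak_hz)
```

Each frequency bin spans 40 Hz with these settings, so the energy of the tone concentrates in the bin centered on 1000 Hz. Libraries such as librosa and torchaudio add the mel filterbank on top of this step.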
Tools and libraries
A small zoo of libraries handles the operations above. The right choice depends on data size and the rest of your stack.
| Tool | Domain | Notes |
|---|---|---|
| pandas | Tabular | Single machine, the de facto standard for in memory dataframe work |
| polars | Tabular | Single machine or out of core, faster than pandas, lazy execution |
| dask | Tabular | Distributed pandas like API, scales to a cluster |
| Spark / PySpark | Tabular | Cluster scale, mature, widely used in industry |
| Ray Data | Tabular and unstructured | Python native distributed data processing |
| Apache Beam | Tabular and unstructured | Programming model that runs on Dataflow, Flink, Spark |
| scikit-learn preprocessing | Tabular ML | Imputers, scalers, encoders, pipelines |
| TensorFlow Data Validation (TFDV) | Schema and stats | Compute and compare statistics across splits, detect drift |
| Great Expectations | Validation | Declarative data quality checks with Python or YAML |
| Hugging Face Datasets | NLP / ML | Loading and streaming, memory mapped Arrow tables |
| torchvision.transforms | Vision | The standard image augmentation library for PyTorch |
| albumentations | Vision | Faster augmentation with more transforms, widely used in Kaggle |
| torchaudio | Audio | Audio I/O, transforms, datasets |
| librosa | Audio | Mel spectrograms, MFCCs, onset detection |
| trafilatura | Text extraction | HTML to plain text, used in many LLM corpora |
| datatrove | Text pipeline | Hugging Face library specifically for LLM data preparation |
Most projects end up using two or three of these. A typical modern stack might be polars or Spark for the heavy lifting, scikit-learn for the model facing transforms, and torchvision plus albumentations for the image side.
Pipelines and reproducibility
One of the main reasons to use a structured tool is that preprocessing is the easiest place to introduce subtle, hard to reproduce bugs. A few practices reduce the failure rate.
- Use a Pipeline object. In scikit-learn, Pipeline and ColumnTransformer chain transforms with the model, fit each step on training data, and apply the fitted steps to validation and test. The point is that everything that has parameters (means for imputation, std for scaling, vocabulary for encoders) is fit only on the training fold. There is no longer a manual step where you might apply the wrong transform to the wrong split.
- Track experiments. MLflow, Weights and Biases, and Neptune log preprocessing parameters along with model hyperparameters and metrics. When a result mysteriously moves, the log is the only way to back out which step changed.
- Version data. DVC (Data Version Control) and lakeFS treat datasets the way Git treats source code: content addressable storage, branches, diffs. Crucial when raw data is changing or the cleaning rules are changing or both.
- Containerize the pipeline. Reproducibility across machines is harder than it sounds; even a different BLAS version can shift floating point results enough to change downstream metrics. A pinned container image is the cheapest defense.
- Snapshot. For LLM training, store an immutable snapshot of the tokenized corpus. The training run is then reproducible as long as the snapshot survives, which is usually easier to guarantee than reproducing the whole filtering pipeline from scratch.
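As an illustration of the first practice in the list above, here is a small scikit-learn pipeline that imputes, scales, and encodes inside one object. The column names and rows are invented, and the classifier is an arbitrary stand-in; the structure is what matters.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training and test frames; "lille" never appears in training.
train = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 51.0, 46.0, 38.0],
    "city": ["paris", "lyon", "paris", "nice", "lyon", "paris"],
})
y_train = [0, 0, 1, 1, 1, 0]
test = pd.DataFrame({"age": [np.nan, 29.0], "city": ["lille", "paris"]})

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Unknown categories at predict time become an all-zero row instead of an error.
categorical = OneHotEncoder(handle_unknown="ignore")

preprocess = ColumnTransformer([
    ("num", numeric, ["age"]),
    ("cat", categorical, ["city"]),
])
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])

model.fit(train, y_train)   # every statistic is learned from training rows only
pred = model.predict(test)  # the same fitted steps are replayed on test rows

learned_median = (
    model.named_steps["preprocess"]
    .named_transformers_["num"]
    .named_steps["impute"]
    .statistics_[0]
)
print(pred, learned_median)
```

Because the whole chain is one estimator, passing it to a cross validation helper refits the imputer, scaler, and encoder inside each fold automatically.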
Common pitfalls
The pitfalls below cost real engineering time. They are mostly in service of one principle: do not let any information from the test set leak into training, even by accident.
- Fitting transforms on the full dataset. A StandardScaler fit on the union of train and test before splitting leaks the test set's mean and standard deviation into training. Always fit on training only, then apply to test.
- Imputing before splitting. Same problem, applied to imputation. The mean used to fill missing values must come from training data.
- Resampling before splitting. SMOTE applied to the full dataset before cross validation places synthetic samples derived from test set neighbors into training. Apply resampling inside the CV fold.
- Target encoding without out of fold. The encoded value for a category in training row r should not see the target of row r. Use an out of fold encoder or smooth heavily.
- Train test contamination. In LLMs especially, ensure the corpus does not contain the evaluation benchmarks. Run n-gram decontamination explicitly. The default is some leakage, not none.
- Time series leakage. Random splits across time are almost never appropriate for time series. The model gets to see future information, validation looks great, production looks awful.
- Group leakage. If multiple rows belong to the same patient, user, or document, splitting at the row level lets the model recognize the unit instead of generalizing. Use group aware splitters.
- Encoding mismatches between fit and predict. A category that appears at inference time but not in training data makes scikit-learn's OneHotEncoder raise an error unless you set handle_unknown='ignore'. Decide your policy and bake it in.
- Silent type coercion. Pandas can cast columns silently when you concatenate. Schema validation (Great Expectations, pandera, pyarrow) catches these before they reach the model.
- Order dependence. Some preprocessing steps are not commutative. Standardizing then PCA gives different results than PCA then standardizing. Pick an order, document it, and stick to it.
Most of these have the same root: the preprocessing pipeline must respect the boundary between training and evaluation, and that boundary is easy to violate in code.
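The n-gram decontamination check mentioned in the list above can be sketched with the standard library. The normalization (lowercase, strip everything except letters and digits) and the window of 8 tokens are assumptions for illustration; published pipelines differ in both, and the benchmark item here is invented.

```python
import re

def normalize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()

def ngrams(tokens, n):
    """Return the set of all contiguous n-token windows."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(document, benchmark_items, n=8):
    """True if the document shares any normalized n-gram with any benchmark item."""
    doc_grams = ngrams(normalize(document), n)
    return any(doc_grams & ngrams(normalize(item), n) for item in benchmark_items)

# A hypothetical benchmark question with its answer.
benchmark = ["What is the capital of France? The capital of France is Paris."]

leaked = "Quiz answers: what is the capital of France?? the capital of france is PARIS, obviously."
clean = "Paris is a large city in France with many museums and cafes."

print(is_contaminated(leaked, benchmark), is_contaminated(clean, benchmark))
```

At corpus scale the benchmark n-grams would be computed once and stored in a set or Bloom filter rather than rebuilt for every document as this sketch does.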
References
- Hastie, T., Tibshirani, R., and Friedman, J. (2009). *The Elements of Statistical Learning*, 2nd ed. Springer. Chapters 3 and 18 cover scaling, basis expansions, and high dimensional preprocessing.
- Bishop, C. (2006). *Pattern Recognition and Machine Learning*. Springer. Sections on preprocessing, normalization, and missing data.
- Goodfellow, I., Bengio, Y., Courville, A. (2016). *Deep Learning*. MIT Press. Chapter 7 on regularization, including augmentation, and Chapter 14 on autoencoders.
- Pedregosa, F. et al. (2011). "Scikit-learn: Machine Learning in Python." *JMLR* 12, 2825-2830. The reference implementation for tabular preprocessing.
- scikit-learn user guide, sections on preprocessing and on imputation. https://scikit-learn.org/stable/modules/preprocessing.html
- van Buuren, S. (2018). *Flexible Imputation of Missing Data*, 2nd ed. Chapman and Hall. The standard reference for MICE.
- Chawla, N. V. et al. (2002). "SMOTE: Synthetic Minority Over-sampling Technique." *JAIR* 16, 321-357.
- Raffel, C. et al. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." *JMLR*. Introduces C4.
- Gao, L. et al. (2020). "The Pile: An 800GB Dataset of Diverse Text for Language Modeling." arXiv:2101.00027.
- Rae, J. W. et al. (2021). "Scaling Language Models: Methods, Analysis and Insights from Training Gopher." arXiv:2112.11446. Source of the Gopher quality filtering rules.
- Penedo, G. et al. (2023). "The RefinedWeb Dataset for Falcon LLM." arXiv:2306.01116.
- Soldaini, L. et al. (2024). "Dolma: An Open Corpus of Three Trillion Tokens." arXiv:2402.00159.
- Penedo, G. et al. (2024). "The FineWeb Datasets." arXiv:2406.17557.
- Li, J. et al. (2024). "DataComp-LM: In search of the next generation of training sets." arXiv:2406.11794.
- Lee, K. et al. (2022). "Deduplicating Training Data Makes Language Models Better." *ACL*.
- Sennrich, R., Haddow, B., Birch, A. (2016). "Neural Machine Translation of Rare Words with Subword Units." *ACL*. Introduces BPE for NMT.
- Kudo, T., Richardson, J. (2018). "SentencePiece: A simple and language independent subword tokenizer and detokenizer." *EMNLP*.
- Cubuk, E. D. et al. (2020). "RandAugment: Practical automated data augmentation with a reduced search space." *CVPR*.
- Yun, S. et al. (2019). "CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features." *ICCV*.
- Zhang, H. et al. (2018). "mixup: Beyond Empirical Risk Minimization." *ICLR*.
- Park, D. S. et al. (2019). "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition." *Interspeech*.