In the field of machine learning, data augmentation refers to the process of expanding the size and diversity of a training set by applying various transformations and manipulations to existing data, rather than collecting new samples from scratch. The primary goal of data augmentation is to improve generalization, reducing overfitting and enhancing performance on unseen data. Since labeled data is often expensive and time-consuming to acquire, data augmentation has become one of the most widely adopted techniques in modern deep learning pipelines.
The technique is used across virtually every data modality, including images, text, audio, and tabular data. Its applications range from image recognition and object detection to speech recognition, text classification, and medical imaging. This article covers the principles, techniques, tools, and theoretical foundations of data augmentation in machine learning.
Data augmentation is the practice of artificially increasing the size of a training dataset by creating modified copies of existing data points. These modifications are designed to preserve the label or semantic meaning of the original sample while introducing variation that helps the model learn more robust representations.
The motivation for data augmentation comes from several practical challenges: labeled data is expensive and slow to collect, small training sets encourage overfitting, and real-world datasets often lack the diversity or class balance that deployed models will encounter.
Data augmentation can be applied in two ways. Offline augmentation (also called pre-computation) generates augmented copies before training begins and stores them alongside the original data. Online augmentation (also called on-the-fly augmentation) applies random transformations to each sample as it is loaded during training, meaning the model sees different variations every epoch. Online augmentation is preferred in most deep learning workflows because it provides effectively unlimited variation without increasing storage requirements.
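As a concrete illustration, the sketch below sets up online augmentation with torchvision; the particular transforms, dataset, and batch size are illustrative choices rather than a prescribed recipe.

```python
# Online (on-the-fly) augmentation sketch: each time a sample is loaded,
# a fresh random transformation is applied, so the model sees a different
# variant of every image each epoch. All names follow the torchvision API.
import torch
from torchvision import datasets, transforms

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),               # label-preserving mirror
    transforms.RandomRotation(degrees=15),                # small random rotation
    transforms.ColorJitter(brightness=0.2, contrast=0.2), # photometric variation
    transforms.ToTensor(),
])

train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=train_transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
```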
Data augmentation relies on the concept of invariance, which refers to a model's ability to recognize the same object or pattern regardless of its orientation, position, scale, or other surface-level properties. By applying transformations to the training data, data augmentation ensures that a model learns to be invariant to these changes, improving its ability to recognize patterns in novel scenarios.
For example, a convolutional neural network trained to classify images of cats should produce the same prediction whether the cat appears on the left side or the right side of the image, whether the image is slightly brighter or darker, and whether the cat is photographed from a slightly different angle. Training with horizontally flipped, brightness-adjusted, and rotated versions of each image teaches the network these invariances directly.
Data augmentation acts as an implicit form of regularization. Research by Hernandez-Garcia and Konig (2018) showed that data augmentation can replace explicit regularization techniques such as dropout and weight decay in certain settings. A theoretical framework by Dao et al. (2019), published at ICML, demonstrated that data augmentation with kernel classifiers can be decomposed into two components: an averaged version of transformed features (inducing invariance) and a data-dependent variance regularization term (reducing model complexity). Together, these components explain why augmentation improves generalization: it both teaches the model what variations to ignore and penalizes overly complex decision boundaries.
This dual role means that data augmentation is not simply about "having more data." Even when the augmented samples are deterministic transformations of existing samples (and therefore contain no new information in a strict sense), they change the optimization landscape in ways that favor simpler, more generalizable solutions.
A core requirement of data augmentation is that transformations must preserve the label of the original sample. Flipping a photograph of a dog horizontally still yields a photograph of a dog, so horizontal flipping is a valid augmentation for image classification. However, flipping a photograph of handwritten digits can change a "6" into a "9," making it an invalid augmentation for digit recognition. Choosing appropriate augmentations requires domain knowledge about which transformations are label-preserving for a given task.
Image-based data augmentation is the most established and widely used form. Techniques range from simple geometric and photometric transformations to sophisticated learned or mixing-based strategies.
Geometric transformations modify the spatial arrangement of pixels in an image.
| Technique | Description | Typical use case |
|---|---|---|
| Horizontal flip | Mirrors the image along the vertical axis | General image classification, object detection |
| Vertical flip | Mirrors the image along the horizontal axis | Satellite imagery, medical imaging |
| Random rotation | Rotates the image by a random angle within a specified range | Classification tasks where orientation varies |
| Random crop | Extracts a random sub-region of the image and resizes it | Scale-invariant recognition, reducing positional bias |
| Translation (shift) | Shifts the image horizontally or vertically by a random offset | Object detection, reducing center bias |
| Scaling (zoom) | Enlarges or shrinks the image by a random factor | Scale-invariant recognition |
| Shearing | Applies a shear transformation along one axis | Handwriting recognition, document analysis |
| Elastic deformation | Applies smooth, random displacement fields to the image | Medical image segmentation (introduced by Simard et al., 2003) |
Photometric transformations modify pixel intensity values without changing spatial structure.
| Technique | Description | Typical use case |
|---|---|---|
| Brightness adjustment | Increases or decreases overall image brightness | Handling lighting variation |
| Contrast adjustment | Modifies the range between light and dark regions | Handling varied imaging conditions |
| Saturation adjustment | Changes the intensity of colors | Reducing sensitivity to color capture differences |
| Hue shift | Rotates the color wheel of the image | Making models robust to color cast |
| Gaussian noise | Adds random noise sampled from a Gaussian distribution | Simulating sensor noise, improving robustness |
| Gaussian blur | Applies a Gaussian smoothing kernel | Simulating out-of-focus conditions |
| Color jitter | Randomly perturbs brightness, contrast, saturation, and hue together | General-purpose color robustness |
| Channel shuffle | Randomly permutes the RGB channels | Reducing reliance on color-specific features |
Erasing-based methods occlude parts of an image during training, forcing the model to rely on a broader set of features rather than focusing on any single discriminative region.
Cutout was introduced by DeVries and Taylor (2017). It randomly masks out a square region of the input image with zeros (or a constant value). On CIFAR-10, CIFAR-100, and SVHN, Cutout improved test accuracy for several architectures. The method is simple to implement and can be combined with other augmentation and regularization techniques.
Random Erasing, proposed by Zhong et al. (2017), is closely related to Cutout but erases a randomly sized rectangular region and fills it with random pixel values rather than zeros. This added randomness in both the region shape and fill values provides additional variation.
Both methods share the intuition that occluding parts of an object during training simulates real-world occlusion and forces the network to learn from multiple parts of the object simultaneously.
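A minimal sketch of the Cutout operation described above (the patch size and the zero fill are illustrative parameters); the closely related Random Erasing behavior is available as torchvision's built-in RandomErasing transform.

```python
# Cutout sketch: zero out a random square region of a (C, H, W) image tensor.
import torch

def cutout(image: torch.Tensor, size: int = 16) -> torch.Tensor:
    c, h, w = image.shape
    cy = torch.randint(0, h, (1,)).item()           # random square center
    cx = torch.randint(0, w, (1,)).item()
    y1, y2 = max(0, cy - size // 2), min(h, cy + size // 2)
    x1, x2 = max(0, cx - size // 2), min(w, cx + size // 2)
    out = image.clone()
    out[:, y1:y2, x1:x2] = 0.0                       # mask with a constant value
    return out
```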
Mixing-based augmentation methods create new training samples by combining two or more existing samples. These approaches have become foundational techniques in modern computer vision.
Mixup was introduced by Zhang, Cisse, Dauphin, and Lopez-Paz (2017). It creates new training examples by taking a weighted linear interpolation of two randomly sampled training images and their labels. Given two samples (x_i, y_i) and (x_j, y_j), Mixup generates a new sample (x̃, ỹ) as x̃ = λ·x_i + (1 − λ)·x_j and ỹ = λ·y_i + (1 − λ)·y_j, where λ (lambda) is drawn from a Beta distribution. Mixup operates under the principle of Vicinal Risk Minimization (VRM), encouraging the model to behave linearly between training examples and producing smoother decision boundaries. The original paper reported consistent improvements across CIFAR-10, CIFAR-100, and ImageNet benchmarks.
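A minimal batch-level Mixup sketch, assuming one-hot labels and pairing each sample with a random permutation of the batch (a common implementation choice):

```python
# Mixup sketch: blend images and one-hot labels with a Beta-sampled weight.
import torch

def mixup(images: torch.Tensor, labels: torch.Tensor, alpha: float = 0.2):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))            # random pairing within the batch
    mixed_images = lam * images + (1 - lam) * images[perm]
    mixed_labels = lam * labels + (1 - lam) * labels[perm]
    return mixed_images, mixed_labels
```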
CutMix was proposed by Yun, Han, Oh, Chun, Choe, and Yoo (2019) and published at ICCV 2019. Instead of blending entire images like Mixup, CutMix cuts a rectangular patch from one training image and pastes it onto another. The labels are mixed proportionally to the area of the patch. CutMix preserves local pixel structures (unlike Mixup, which creates ghosting artifacts from global blending), and it improves localization accuracy. On CUB200-2011 and ImageNet, CutMix outperformed Mixup on localization metrics by +5.4% and +1.4%, respectively.
Cutout vs. Mixup vs. CutMix can be summarized as follows:
| Method | Input modification | Label modification | Key advantage |
|---|---|---|---|
| Cutout | Masks a square region with zeros | No change | Forces reliance on multiple features |
| Mixup | Blends two full images pixel-by-pixel | Weighted combination of both labels | Smoother decision boundaries |
| CutMix | Pastes a patch from one image onto another | Proportional to patch area | Preserves local structure, improves localization |
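For comparison, a CutMix sketch under the same batch-level assumptions as the Mixup example above (one-hot labels, random in-batch pairing); the patch is sized so that the label mixing ratio matches the pasted area:

```python
# CutMix sketch: paste a random rectangle from a permuted batch and mix
# labels in proportion to the actual (clipped) patch area.
import math
import torch

def cutmix(images: torch.Tensor, labels: torch.Tensor, alpha: float = 1.0):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    b, _, h, w = images.shape
    perm = torch.randperm(b)

    cut_h, cut_w = int(h * math.sqrt(1 - lam)), int(w * math.sqrt(1 - lam))
    cy, cx = torch.randint(0, h, (1,)).item(), torch.randint(0, w, (1,)).item()
    y1, y2 = max(0, cy - cut_h // 2), min(h, cy + cut_h // 2)
    x1, x2 = max(0, cx - cut_w // 2), min(w, cx + cut_w // 2)

    mixed = images.clone()
    mixed[:, :, y1:y2, x1:x2] = images[perm][:, :, y1:y2, x1:x2]
    lam_adj = 1 - ((y2 - y1) * (x2 - x1)) / (h * w)  # recompute from clipped area
    mixed_labels = lam_adj * labels + (1 - lam_adj) * labels[perm]
    return mixed, mixed_labels
```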
Manually selecting augmentation strategies requires domain expertise and extensive trial-and-error. Automated augmentation methods learn effective augmentation policies from the data itself.
AutoAugment, introduced by Cubuk, Zoph, Mane, Vasudevan, and Le (2019) at CVPR, uses reinforcement learning to search for optimal augmentation policies. A policy consists of multiple sub-policies, each containing two image operations (such as rotation, shearing, or color adjustment) along with their associated probabilities and magnitudes. A controller network proposes policies, and a child network is trained with those policies; the child network's validation accuracy serves as the reward signal. AutoAugment achieved state-of-the-art results on CIFAR-10, CIFAR-100, SVHN, and ImageNet. The policies found on one dataset were also shown to transfer well to other datasets.
The main drawback of AutoAugment is its computational cost: the search process requires training thousands of child models, consuming thousands of GPU hours.
RandAugment, proposed by Cubuk, Zoph, Shlens, and Le (2020), dramatically simplifies the search by reducing it to just two hyperparameters: N (the number of augmentation operations to apply sequentially) and M (a shared magnitude for all operations). Rather than learning a complex policy, RandAugment randomly selects N operations from a predefined set and applies each at magnitude M. Despite its simplicity, RandAugment matched or exceeded AutoAugment's performance on CIFAR-10/100, SVHN, and ImageNet, achieving 85.0% top-1 accuracy on ImageNet with EfficientNet. The insight behind RandAugment is that a simple grid search over N and M is sufficient, and the optimal magnitude scales with the dataset size and model capacity.
TrivialAugment, introduced by Muller and Hutter (2021) at ICCV, takes simplification even further. For each image in the training batch, TrivialAugment uniformly samples a single augmentation operation from the predefined set and uniformly samples a magnitude for that operation. There are no hyperparameters to tune at all, making it a truly parameter-free augmentation scheme. Despite (or perhaps because of) this simplicity, TrivialAugment matched or exceeded the performance of AutoAugment and RandAugment across multiple benchmarks, including CIFAR-10/100, SVHN, and ImageNet. The result suggests that random sampling of augmentations during training can be more important for generalization than an extensive search for carefully tuned policies.
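Recent versions of torchvision ship these policies as built-in transforms (AutoAugment, RandAugment, and TrivialAugment as TrivialAugmentWide), so comparing them requires only swapping a transform; a usage sketch, with availability depending on the installed version:

```python
# Built-in policy transforms in torchvision.
from torchvision import transforms

auto_aug    = transforms.AutoAugment(transforms.AutoAugmentPolicy.IMAGENET)
rand_aug    = transforms.RandAugment(num_ops=2, magnitude=9)    # N = 2, M = 9
trivial_aug = transforms.TrivialAugmentWide()                   # parameter-free

train_transform = transforms.Compose([
    trivial_aug,            # swap in rand_aug or auto_aug to compare policies
    transforms.ToTensor(),
])
```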
AugMax, published at NeurIPS 2021 by Wang, Xiao, Kossaifi, Yu, Anandkumar, and Wang, takes a different approach. It combines random augmentation sampling with adversarial optimization, searching for the worst-case mixture of augmentation operations during training. While methods like AugMix randomly sample both operations and mixing weights, AugMax adversarially learns the mixing weights to maximize the training loss, producing harder augmented examples that improve robustness. Combined with a novel Dual-Batch-and-Instance Normalization (DuBIN) scheme, AugMax achieved improvements of 3.03%, 3.49%, 1.82%, and 0.71% over prior methods on CIFAR-10-C, CIFAR-100-C, Tiny ImageNet-C, and ImageNet-C, respectively.
The evolution from AutoAugment to TrivialAugment illustrates a broader trend in augmentation research: simpler, randomized approaches often perform as well as expensive search-based methods.
| Method | Year | Search cost | Hyperparameters | Key idea |
|---|---|---|---|---|
| AutoAugment | 2019 | Thousands of GPU hours | Sub-policy probabilities and magnitudes | RL-based policy search |
| RandAugment | 2020 | Simple grid search | N (number of ops), M (magnitude) | Random selection at fixed magnitude |
| TrivialAugment | 2021 | None | None (parameter-free) | Single random op with random magnitude |
| AugMax | 2021 | Per-batch adversarial step | Mixing weights (learned adversarially) | Adversarial worst-case augmentation |
Text augmentation is more challenging than image augmentation because even small changes to a sentence can alter its meaning. Despite this, a variety of effective techniques have been developed for natural language processing tasks.
Easy Data Augmentation (EDA), introduced by Wei and Zou (2019) at EMNLP, defines four simple token-level operations:
| Operation | Description |
|---|---|
| Synonym replacement (SR) | Randomly select n non-stop words in the sentence and replace each with a randomly chosen synonym from WordNet |
| Random insertion (RI) | Find a random synonym of a random non-stop word and insert it at a random position in the sentence |
| Random swap (RS) | Randomly choose two words in the sentence and swap their positions; repeat n times |
| Random deletion (RD) | Randomly remove each word in the sentence with probability p |
Wei and Zou showed that, on five text classification benchmarks, training with EDA using only 50% of the available data achieved the same average accuracy as normal training with the full dataset. The technique is particularly useful for small datasets and is simple enough to implement without any external models or resources.
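A minimal sketch of two of the EDA operations (random swap and random deletion); synonym replacement and random insertion additionally need a synonym source such as WordNet and are omitted here:

```python
# EDA-style token-level perturbations on a whitespace-tokenized sentence.
import random

def random_swap(words: list[str], n: int = 1) -> list[str]:
    words = words.copy()
    for _ in range(n):
        if len(words) < 2:
            break
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words: list[str], p: float = 0.1) -> list[str]:
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]   # never delete every word

print(" ".join(random_swap("the movie was surprisingly good".split())))
print(" ".join(random_deletion("the movie was surprisingly good".split())))
```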
Back-translation creates paraphrases by translating a sentence into an intermediate language and then translating it back to the original language. Originally proposed by Sennrich, Haddow, and Birch (2016) for improving neural machine translation with monolingual data, back-translation has since been widely adopted as a general-purpose text augmentation technique.
For example, translating the English sentence "The weather is nice today" into French ("Le temps est beau aujourd'hui") and then back to English might yield "The weather is beautiful today." This produces a natural-sounding paraphrase that preserves the original meaning while varying the surface form. Back-translation has been reported to boost F1 scores by up to 1.58% for SVM classifiers using static word embeddings, and larger improvements have been observed in multilingual classification tasks.
The quality of back-translation depends on the translation model used. With modern neural machine translation systems, back-translated paraphrases tend to be fluent and semantically faithful.
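A back-translation sketch using Hugging Face MarianMT checkpoints as the translation models; any English-French pair would work, and the specific model names here are one readily available choice:

```python
# Back-translation: English -> French -> English paraphrase.
from transformers import pipeline

to_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

def back_translate(text: str) -> str:
    french = to_fr(text)[0]["translation_text"]
    return to_en(french)[0]["translation_text"]

print(back_translate("The weather is nice today"))
```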
Beyond back-translation, dedicated paraphrase generation models can be used to create augmented text. Fine-tuned sequence-to-sequence models, such as T5 or BART, can be trained on paraphrase corpora (for example, the MRPC dataset or ParaNMT) to generate diverse rephrasings of input sentences. This approach generally produces higher-quality paraphrases than back-translation but requires a paraphrase model to be trained or available.
Contextual augmentation uses pretrained language models such as BERT to replace words with alternatives that fit the surrounding context. Unlike synonym replacement (which relies on fixed synonym lists), contextual augmentation draws replacements from a language model's vocabulary distribution conditioned on the sentence, producing more natural and contextually appropriate substitutions.
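A sketch of contextual replacement with a masked language model; the checkpoint and the masked position are chosen here purely for illustration, whereas in practice the word to replace is sampled automatically:

```python
# Contextual word replacement: ask a masked LM for in-context substitutes.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
candidates = fill_mask("The acting was [MASK] throughout the film.")
for c in candidates[:3]:
    print(c["token_str"], round(c["score"], 3))
```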
The rise of large language models (LLMs) has opened a powerful new avenue for text augmentation: generating entirely synthetic training data. Rather than applying surface-level perturbations, an LLM can be prompted to generate new examples for a given class or task.
For instance, to augment a sentiment classification dataset, one can prompt an LLM with instructions like "Write a negative movie review about a romantic comedy" and collect the outputs as additional training samples. Studies have shown that LLM-generated synthetic data can significantly improve downstream classifier performance, especially in low-resource settings.
Key considerations when using LLMs for augmentation include the label accuracy and factual quality of the generated text, its diversity relative to the real data distribution, the risk of importing the LLM's biases into the training set, and the cost of generating data at scale.
Audio augmentation techniques modify sound recordings to create variation that improves the robustness of models for tasks such as speech recognition, speaker identification, music genre classification, and audio event detection.
| Technique | Description | Effect |
|---|---|---|
| Time stretching | Changes the playback speed of the audio without altering pitch | Exposes the model to different speaking rates |
| Pitch shifting | Raises or lowers the pitch without changing duration | Simulates variation in speaker vocal characteristics |
| Noise injection | Adds background noise (Gaussian, environmental, babble) to the signal | Builds robustness to noisy real-world environments |
| Time shifting | Shifts the audio forward or backward in time by a random offset | Reduces sensitivity to alignment |
| Volume perturbation | Randomly scales the amplitude of the waveform | Simulates varying recording levels |
| Room impulse response (RIR) convolution | Convolves the audio with a room impulse response | Simulates different acoustic environments |
| Speed perturbation | Resamples audio at a slightly different rate, changing both speed and pitch | Common in speech recognition pipelines (Kaldi toolkit) |
SpecAugment, introduced by Park, Chan, Zhang, Chiu, Zoph, Cubuk, and Le (2019) at Interspeech, applies augmentation directly to the log mel-spectrogram representation of audio rather than to the raw waveform. It uses three operations: time warping (deforming the spectrogram along the time axis), frequency masking (masking blocks of consecutive mel-frequency channels), and time masking (masking blocks of consecutive time steps).
SpecAugment is conceptually analogous to Cutout for images. Applied to Listen, Attend and Spell networks on the LibriSpeech dataset, SpecAugment helped achieve a word error rate (WER) of 6.8% on the test-other split without a language model, and 5.8% with shallow fusion. A follow-up study (Park et al., 2019) confirmed that SpecAugment also scales effectively to larger datasets. The method has since become a standard component in speech recognition training pipelines.
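The masking operations are available as standard torchaudio transforms; a sketch with illustrative tensor shapes and mask widths (time warping is omitted, as it is often skipped in practice):

```python
# SpecAugment-style frequency and time masking on a log mel-spectrogram.
import torch
import torchaudio.transforms as T

spectrogram = torch.randn(1, 80, 400)               # (channel, mel bins, time steps)

freq_mask = T.FrequencyMasking(freq_mask_param=15)  # mask up to 15 mel channels
time_mask = T.TimeMasking(time_mask_param=35)       # mask up to 35 time steps

augmented = time_mask(freq_mask(spectrogram))
```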
Augmenting tabular (structured) data is less straightforward than augmenting images or text because tabular features lack the spatial or sequential structure that makes transformation-based augmentation natural. The most common approaches focus on synthesizing new samples, particularly for addressing class imbalance.
The Synthetic Minority Over-sampling Technique (SMOTE), proposed by Chawla, Bowyer, Hall, and Kegelmeyer (2002), is the most widely used method for oversampling minority classes in tabular datasets. SMOTE works by selecting a minority-class sample, finding its k nearest minority-class neighbors, choosing one of those neighbors at random, and generating a synthetic sample at a random point along the line segment connecting the two.
SMOTE generates samples uniformly across the minority-class feature space. While effective at balancing class distributions, it can also generate noisy samples in regions where minority and majority classes overlap.
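A usage sketch with the imbalanced-learn implementation; the toy imbalanced dataset is generated only for illustration:

```python
# SMOTE oversampling with imbalanced-learn on a synthetic imbalanced dataset.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))                       # roughly 9:1 class ratio

X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))                   # classes balanced
```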
Several variants improve on the original SMOTE algorithm:
| Variant | Modification |
|---|---|
| Borderline-SMOTE | Generates synthetic samples only from minority examples near the class boundary |
| SMOTE-ENN | Combines SMOTE oversampling with Edited Nearest Neighbors undersampling to clean noisy samples |
| SMOTE-Tomek | Combines SMOTE with Tomek link removal to sharpen the decision boundary |
| SVM-SMOTE | Uses a support vector machine to identify the region for synthetic sample generation |
ADASYN (Adaptive Synthetic Sampling), proposed by He, Bai, Garcia, and Li (2008), builds on SMOTE but adapts the number of synthetic samples generated for each minority-class instance based on local difficulty. Minority examples surrounded by many majority-class neighbors (harder to classify) receive more synthetic neighbors, while those in safer regions receive fewer. This focuses the augmentation effort where it is most needed.
In controlled comparisons at identical oversampling ratios, ADASYN has been shown to outperform standard SMOTE by concentrating synthetic samples in difficult-to-learn regions of the feature space, with Borderline-SMOTE occupying an intermediate position in performance.
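In imbalanced-learn, ADASYN shares the same fit_resample interface as the SMOTE sketch above, with n_neighbors controlling the local-difficulty estimate:

```python
# ADASYN oversampling: more synthetic samples in hard-to-classify regions.
from imblearn.over_sampling import ADASYN
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_res, y_res = ADASYN(n_neighbors=5, random_state=0).fit_resample(X, y)
```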
Beyond transformation-based augmentation, generative models can produce entirely new training samples. This approach is particularly useful when the original dataset is very small or when privacy constraints prevent sharing real data.
Generative adversarial networks train a generator network to produce samples that are indistinguishable from real data, as judged by a discriminator network. Once trained, the generator can produce unlimited synthetic examples. GANs have been used to augment training data for medical imaging (generating synthetic X-rays, retinal scans, and histopathology patches), satellite imagery, and face recognition.
GAN-based augmentation faces challenges including mode collapse (the generator producing only a narrow subset of the data distribution), training instability, and the need for sufficient data to train the GAN itself. StyleGAN and its variants have shown strong results in producing perceptually high-quality images with structural coherence.
Diffusion models generate data by learning to reverse a gradual noising process. Text-conditioned diffusion models such as Stable Diffusion and DALL-E can generate diverse, high-fidelity images from text prompts, making them powerful tools for data augmentation. Trabucco et al. (2023) demonstrated in their paper "Effective Data Augmentation With Diffusion Models" (published at ICLR 2024) that using a pretrained text-to-image diffusion model to generate additional training images improved classification accuracy across several benchmarks. In certain out-of-domain tasks (such as ImageNet-Sketch and ImageNet-R), synthetic images generated by diffusion models were actually more efficient per sample than real data.
Diffusion models generally produce more diverse outputs than GANs and do not suffer from mode collapse, though they are slower at generation time and require more computational resources. The shift from GAN-based to diffusion-based synthetic data generation has been a significant trend in the field since 2023.
As discussed in the text augmentation section, LLMs such as GPT-4, Claude, and LLaMA can generate synthetic text data for augmenting NLP datasets. This has proven especially effective for few-shot learning scenarios where only a handful of examples per class are available.
Most data augmentation is applied during training, but test-time augmentation (TTA) applies transformations at inference time as well. Instead of making a single prediction on the original test input, TTA creates multiple augmented versions of the input (for example, the original image plus horizontally flipped, slightly rotated, and cropped variants), runs the model on each, and aggregates the predictions (typically by averaging probabilities or voting).
TTA offers several advantages: it typically improves accuracy and prediction stability without any retraining, it requires no changes to the trained model or its weights, and it reduces sensitivity to small variations in the input.
The tradeoff is increased inference time, since each input requires multiple forward passes. Inference with TTA typically takes 2 to 3 times longer than standard inference. Multi-scale TTA, which also processes the input at multiple resolutions, can further improve results but at an even higher computational cost.
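A minimal TTA sketch for an image classifier, assuming a trained PyTorch model and a single horizontal-flip variant; more variants (crops, rotations, scales) can be added at proportionally higher inference cost:

```python
# Test-time augmentation: average softmax outputs over augmented copies.
import torch

@torch.no_grad()
def predict_with_tta(model: torch.nn.Module, image: torch.Tensor) -> torch.Tensor:
    """image: (C, H, W) tensor; returns averaged class probabilities."""
    variants = torch.stack([image, torch.flip(image, dims=[-1])])  # original + h-flip
    probs = torch.softmax(model(variants), dim=-1)                 # one pass per variant
    return probs.mean(dim=0)                                       # aggregate by averaging
```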
TTA is not limited to images. In NLP, selective test-time augmentation (STTA) has been explored for text classification, where multiple paraphrases of the input text are classified and the results are aggregated.
Certain application domains require specialized augmentation strategies tailored to the unique properties of their data.
Medical imaging is one of the domains where data augmentation is most critical, because labeled datasets are typically small (expert annotation is expensive and time-consuming) and model performance has direct clinical implications. Common domain-specific techniques include elastic deformations, small rotations and flips, intensity and contrast adjustments that mimic scanner and protocol differences, and synthetic image generation with GANs or diffusion models.
Domain-guided augmentation in medical imaging must be applied carefully: transformations that would be harmless for natural images (such as aggressive color jitter) can distort clinically relevant features and degrade model performance.
Satellite imagery has its own augmentation considerations: because overhead scenes have no canonical orientation, rotations and vertical flips are valid label-preserving transformations, and pipelines may also need to handle multispectral bands and variation in atmospheric, seasonal, and illumination conditions.
For NLP tasks in languages with limited training data, back-translation and cross-lingual transfer augmentation are commonly used. Translating training examples from a high-resource language into the target low-resource language (and vice versa) provides additional training signal without requiring native-language annotation.
Data augmentation is not universally beneficial. Understanding when it helps and when it can hurt is important for practitioners: transformations that do not preserve labels (such as flipping handwritten digits) or that push samples far outside the real data distribution can degrade accuracy rather than improve it.
A variety of open-source libraries make it easy to apply data augmentation across different modalities.
| Library | Language | Key features |
|---|---|---|
| Albumentations | Python | Fast (OpenCV-based), 70+ transforms, supports images, masks, bounding boxes, and keypoints. Widely used in Kaggle competitions. |
| imgaug | Python | Extensive set of augmentors, stochastic pipeline support, bounding box and keypoint augmentation. |
| torchvision.transforms | Python (PyTorch) | Native PyTorch integration, supports both PIL and tensor inputs. Includes AutoAugment, RandAugment, and TrivialAugment policies. |
| tf.image / Keras preprocessing | Python (TensorFlow) | Built-in augmentation layers that run on GPU as part of the model graph. |
| Kornia | Python (PyTorch) | Differentiable augmentation on GPU tensors. Includes built-in implementations of AutoAugment, RandAugment, and TrivialAugment. |
In benchmarks, Albumentations consistently outperforms torchvision, Keras, and imgaug in CPU-based processing speed, often by a factor of 2x or more for common transforms.
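A usage sketch of a small Albumentations pipeline on a NumPy image; the chosen transforms are illustrative:

```python
# Albumentations pipeline applied to an (H, W, C) uint8 NumPy image.
import numpy as np
import albumentations as A

transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.3),
    A.GaussNoise(p=0.2),
])

image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
augmented = transform(image=image)["image"]
```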
| Library | Key features |
|---|---|
| nlpaug | Supports character-level, word-level, and sentence-level augmentations. Integrates with WordNet, word embeddings, and pretrained language models for contextual augmentation. Also supports audio augmentation. |
| TextAttack | Primarily an adversarial attack library, but includes augmentation recipes such as EDA, CheckList, and embedding-based transformations. |
| Library | Key features |
|---|---|
| audiomentations | API inspired by Albumentations. Includes AddGaussianNoise, TimeStretch, PitchShift, Shift, and more. CPU-based with a PyTorch variant (torch-audiomentations) for GPU. |
| torchaudio | PyTorch-native audio transforms including spectrogram computation, SpecAugment masking, and waveform effects. |
| librosa | General audio analysis library with functions for time stretching, pitch shifting, and other manipulations that can be used for augmentation. |
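A usage sketch of the audiomentations pipeline listed in the table above, applied to a mono waveform; the parameter values are illustrative:

```python
# audiomentations pipeline on a 1-second mono waveform at 16 kHz.
import numpy as np
from audiomentations import Compose, AddGaussianNoise, TimeStretch, PitchShift, Shift

augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    TimeStretch(min_rate=0.9, max_rate=1.1, p=0.5),
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
    Shift(p=0.5),
])

samples = np.random.uniform(low=-1.0, high=1.0, size=16000).astype(np.float32)
augmented = augment(samples=samples, sample_rate=16000)
```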
| Library | Key features |
|---|---|
| imbalanced-learn | Implements SMOTE, ADASYN, Borderline-SMOTE, SMOTE-ENN, SMOTE-Tomek, and other oversampling/undersampling methods. Compatible with scikit-learn pipelines. |
| SDV (Synthetic Data Vault) | Uses statistical and deep learning models (including CTGAN) to generate synthetic tabular data that preserves the statistical properties of the original dataset. |
The effectiveness of data augmentation can be understood from several theoretical perspectives.
Augmentation increases the effective size of the training set, which reduces the capacity of the model relative to the data. This acts as an implicit regularizer. Hernandez-Garcia and Konig (2018) showed empirically that, for several architectures, data augmentation alone could replace explicit regularization (dropout, weight decay, batch normalization) without loss of accuracy, and in some cases improved it.
By training on augmented data, models learn to produce similar representations for inputs that differ only by the applied transformations. This invariance is a form of inductive bias that encodes prior knowledge about which variations are irrelevant to the task. Benton, Finzi, Izmailov, and Wilson (2020) formalized this idea, showing that learning augmentation-induced invariances can be cast as a constrained optimization problem.
Classical empirical risk minimization (ERM) trains models on the exact training points. Chapelle, Weston, Bottou, and Vapnik (2001) proposed vicinal risk minimization (VRM), which trains on a vicinity distribution around each training point. Data augmentation is a concrete instantiation of VRM: each augmented sample lies in the "vicinity" of its original, and training on these neighbors smooths the learned function. Mixup makes this connection explicit by interpolating between training points.
Dao, Gu, Ratner, Smith, De Sa, and Re (2019) provided a kernel-theoretic framework for understanding data augmentation. They showed that for kernel classifiers, training on augmented data is equivalent to (1) using an averaged feature map that induces invariance and (2) adding a data-dependent variance regularization term. This decomposition provides a principled explanation for why augmentation reduces overfitting.
| Modality | Method | Description | Typical use case |
|---|---|---|---|
| Image | Horizontal/vertical flip | Mirror the image along an axis | General image classification |
| Image | Random rotation | Rotate by a random angle | Orientation-invariant recognition |
| Image | Random crop and resize | Extract and resize a random sub-region | Scale-invariant recognition |
| Image | Color jitter | Randomly vary brightness, contrast, saturation, hue | Robustness to lighting and camera variation |
| Image | Gaussian noise | Add random Gaussian noise to pixels | Robustness to sensor noise |
| Image | Cutout | Mask a random square region with zeros | Forces use of multiple features (DeVries and Taylor, 2017) |
| Image | Random erasing | Mask a random rectangle with random values | Simulates occlusion (Zhong et al., 2017) |
| Image | Mixup | Linearly blend two images and their labels | Smoother decision boundaries (Zhang et al., 2017) |
| Image | CutMix | Paste a patch from one image onto another; mix labels by area | Localization and classification (Yun et al., 2019) |
| Image | AutoAugment | RL-searched augmentation policy | Optimal augmentation without manual tuning (Cubuk et al., 2019) |
| Image | RandAugment | Random selection of N transforms at magnitude M | Simple, strong baseline (Cubuk et al., 2020) |
| Image | TrivialAugment | Single random op with random magnitude, parameter-free | Tuning-free state-of-the-art (Muller and Hutter, 2021) |
| Image | AugMax | Adversarial worst-case augmentation mixing | Robustness to distribution shift (Wang et al., 2021) |
| Text | Synonym replacement | Replace words with WordNet synonyms | Text classification with limited data |
| Text | Random insertion/deletion/swap (EDA) | Token-level perturbations | General text classification (Wei and Zou, 2019) |
| Text | Back-translation | Translate to another language and back | Paraphrase generation, NMT augmentation |
| Text | Contextual word replacement | Replace words using BERT or similar LM predictions | Context-aware substitution |
| Text | LLM-generated synthetic data | Prompt an LLM to generate new labeled examples | Low-resource classification, few-shot tasks |
| Audio | Time stretching | Change speed without altering pitch | Speech recognition with variable speaking rates |
| Audio | Pitch shifting | Change pitch without altering speed | Speaker variation simulation |
| Audio | Noise injection | Add background or Gaussian noise | Noisy environment robustness |
| Audio | SpecAugment | Mask frequency bands and time steps on spectrograms | Speech recognition (Park et al., 2019) |
| Audio | Room impulse response convolution | Convolve with RIR to simulate room acoustics | Acoustic environment variation |
| Tabular | SMOTE | Interpolate between minority-class nearest neighbors | Class imbalance correction (Chawla et al., 2002) |
| Tabular | ADASYN | Adaptive SMOTE focusing on hard-to-classify samples | Class imbalance in difficult regions (He et al., 2008) |
| Tabular | GAN-based generation | Train a GAN on tabular data to produce synthetic rows | Privacy-preserving data sharing, small datasets |
Data augmentation is used across a wide range of machine learning tasks, including image classification and object detection, speech and speaker recognition, text classification and machine translation, medical image analysis, and tabular prediction problems with imbalanced classes.
Data augmentation is like making more and different-looking copies of your favorite toy. Imagine you have a toy car, and you make copies of it in different colors, sizes, and positions. Now, if someone shows you a car you have never seen before, you can still recognize it as a car because you have seen so many different versions of it. In the same way, data augmentation helps computers learn by giving them more examples to learn from, making them better at understanding new things they have never seen before.