# Noise

> Source: https://aiwiki.ai/wiki/noise
> Updated: 2026-06-24
> Categories: Data & Datasets, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Noise** in [machine learning](/wiki/machine_learning) is any unwanted, irrelevant, or random variation in data that obscures the true underlying patterns a model is trying to learn. It is the part of the data that carries no useful signal: measurement errors, mislabeled examples, sensor imprecision, environmental variability, or inherent randomness in the phenomenon being modeled. Noise is usually a problem that degrades accuracy and drives [overfitting](/wiki/overfitting), yet the same randomness, when deliberately injected during [training](/wiki/training), becomes one of the most effective regularizers in deep learning: Bishop (1995) showed that training with small-amplitude input noise is equivalent, to second order in the noise amplitude, to Tikhonov regularization,[10] and modern [diffusion models](/wiki/diffusion_model) generate images by learning to reverse a process that turns data into pure Gaussian noise.[8]

Researchers broadly divide noise by where it lives in the data: **label noise** (errors in the target values) and **feature noise** (errors in the input variables, also called attribute noise). A long-standing empirical finding is that label noise is the more damaging of the two. Frenay and Verleysen (2014), in their widely cited survey, conclude that "label noise can have a more important impact on classifiers than attribute noise."[9]

## What are the types of noise in machine learning?

Noise in machine learning can be categorized into several distinct types based on its source and the component of the data it affects.

| Type | Description | Common Sources |
|------|-------------|----------------|
| **[Label](/wiki/label) Noise** | Incorrect or inconsistent target values in the training data | Human annotation errors, ambiguous examples, crowdsourcing disagreements |
| **Feature Noise** | Errors or random variation in input attributes | Sensor malfunction, data entry mistakes, environmental interference |
| **Measurement Noise** | Inaccuracies introduced during data collection | Instrument limitations, quantization effects, calibration drift |
| **Process Noise** | Errors introduced during data preprocessing or transformation | Incorrect feature engineering, rounding, lossy compression |
| **Algorithm Noise** | Variability from the learning algorithm itself | Random initialization, stochastic optimization, hyperparameter sensitivity |

### What is label noise?

Label noise occurs when the target values assigned to training examples are incorrect or inconsistent. This is one of the most studied forms of noise because even small rates of label corruption can significantly degrade classifier performance.[3] Research has identified three main categories of label noise:[9]

- **Class-independent (symmetric) noise:** Labels are flipped at random with the same probability regardless of the true class. For example, in a 10-class problem, every label has the same chance of being corrupted, and a corrupted label is equally likely to become any of the other nine classes.[3]
- **Class-dependent (asymmetric) noise:** The probability of mislabeling depends on the true class. Similar-looking classes are more likely to be confused with each other, such as labeling a cat image as a dog.
- **Instance-dependent noise:** The probability of mislabeling depends on both the true class and the specific features of the instance. Ambiguous or borderline examples are mislabeled more frequently, which closely mirrors real-world annotation errors in domains like medical imaging.
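
The first two categories above are often simulated with a noise transition matrix. The following is a minimal sketch, assuming NumPy; the function names and the "pair flip" pattern (each class confused only with the next one) are illustrative choices, not a standard API. Instance-dependent noise is not covered, because it would need a per-example flip probability.

```python
import numpy as np

def symmetric_transition(num_classes, noise_rate):
    """Transition matrix T[i, j] = P(noisy label = j | true label = i)."""
    off_diag = noise_rate / (num_classes - 1)
    T = np.full((num_classes, num_classes), off_diag)
    np.fill_diagonal(T, 1.0 - noise_rate)
    return T

def pair_flip_transition(num_classes, noise_rate):
    """Asymmetric example: class i is confused only with class (i + 1) % K."""
    T = np.eye(num_classes) * (1.0 - noise_rate)
    for i in range(num_classes):
        T[i, (i + 1) % num_classes] = noise_rate
    return T

def corrupt_labels(labels, T, rng):
    """Sample one noisy label per example from the row of T for its true class."""
    return np.array([rng.choice(len(T), p=T[y]) for y in labels])

rng = np.random.default_rng(0)
clean = rng.integers(0, 10, size=5000)
noisy = corrupt_labels(clean, symmetric_transition(10, 0.2), rng)
observed_rate = np.mean(noisy != clean)  # should land near the 0.2 noise rate
```

Swapping `symmetric_transition` for `pair_flip_transition` gives class-dependent corruption with the same overall noise rate.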

Studies have shown that the negative impact of label noise on model performance often exceeds that of feature noise, making it a critical concern in practical machine learning systems.[9] Real-world datasets are noisier than they appear: an audit of ten widely used benchmarks, including ImageNet and CIFAR-10/100, found an average label-error rate of at least 3.3% in the test sets, with roughly 6% of ImageNet validation labels estimated to be wrong.[11]

### What is feature noise?

Feature noise involves errors or random variation in the input attributes used by a model. This type of noise can arise from sensor errors, environmental factors affecting measurements, or data entry mistakes. For instance, a temperature sensor with limited precision might introduce small random errors into readings, or a camera might produce grainy images under low-light conditions. Feature noise makes it harder for models to learn precise decision boundaries and can reduce the effective [signal-to-noise ratio](/wiki/signal_to_noise_ratio) in the data.

## How does noise in training data affect models?

The presence of noise in [training](/wiki/training) data poses fundamental challenges for machine learning models, particularly [deep neural networks](/wiki/deep_neural_network), which have enough capacity to memorize noisy examples. A landmark study by Zhang et al. (2017) demonstrated the extent of this capacity: standard convolutional networks can fit CIFAR-10 to zero training error even when every label is replaced with a random one, showing that the effective capacity of these networks "is sufficient for memorizing the entire data set."[12]

### How do models react to noise?

During training, neural networks tend to first learn the general, clean patterns in the data before gradually memorizing noisy samples. As training progresses and the loss on clean samples converges, the gradient in the direction of noise begins to dominate, leading the model to overfit on noisy examples. This memorization degrades the model's generalization ability and is a primary driver of [overfitting](/wiki/overfitting).

### How do you learn with noisy labels?

Learning with noisy labels (LNL) has become a major subfield of machine learning research. A comprehensive survey by Song et al. (2022) categorized approaches into several families:[4]

| Approach | Description | Example Methods |
|----------|-------------|-----------------|
| **Robust Loss Functions** | Design loss functions that are inherently tolerant to label noise | Symmetric cross-entropy, generalized cross-entropy, mean absolute error |
| **Loss Correction** | Estimate the noise transition matrix and correct the loss accordingly | Forward correction, backward correction, T-revision |
| **Sample Selection** | Identify and select likely-clean samples for training | Co-teaching, MentorNet, small-loss selection |
| **Sample Reweighting** | Assign lower weights to suspected noisy samples | Meta-learning-based reweighting, curriculum learning |
| **[Regularization](/wiki/regularization)** | Apply regularization techniques to prevent memorization of noise | Mixup, label smoothing, [dropout](/wiki/dropout_regularization) |
| **Semi-supervised Methods** | Treat suspected noisy samples as unlabeled data | DivideMix, SOP |
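
As an illustration of the loss-correction row, forward correction can be sketched as follows. This is a simplified version, assuming NumPy and a known transition matrix `T`; in practice `T` has to be estimated, and the values below are hypothetical.

```python
import numpy as np

def forward_corrected_loss(clean_probs, noisy_labels, T):
    """Cross-entropy on noisy labels after pushing predictions through T.

    clean_probs[n, i] is the model's estimate of P(true label = i | x_n).
    T[i, j] is the assumed P(noisy label = j | true label = i).
    """
    noisy_probs = clean_probs @ T  # P(noisy label = j | x_n)
    picked = noisy_probs[np.arange(len(noisy_labels)), noisy_labels]
    return -np.mean(np.log(picked + 1e-12))

# Hypothetical transition matrix; normally estimated from data.
T = np.array([[0.8, 0.2],
              [0.3, 0.7]])
clean_probs = np.array([[0.9, 0.1], [0.2, 0.8]])
noisy_labels = np.array([0, 1])
loss = forward_corrected_loss(clean_probs, noisy_labels, T)
```

With `T` set to the identity matrix the function reduces to ordinary cross-entropy, which is a convenient sanity check.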

The small-loss trick, where samples with smaller loss values are assumed to be correctly labeled, has become a foundational technique in this area. It rests on the memorization effect described above: Song et al. (2022) note that "many true-labeled examples tend to exhibit smaller losses than false-labeled examples" early in training.[4] However, this criterion can fail in class-imbalanced settings, prompting researchers to develop alternative strategies such as small-distance criteria that leverage the robustness of learned representations.
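
A minimal sketch of small-loss selection, assuming NumPy and a fixed keep fraction (the helper names and the toy predictions are illustrative; methods such as Co-teaching add further machinery on top of this step):

```python
import numpy as np

def cross_entropy_per_sample(probs, labels):
    """Per-example cross-entropy given predicted class probabilities."""
    eps = 1e-12
    return -np.log(probs[np.arange(len(labels)), labels] + eps)

def select_small_loss(losses, keep_fraction):
    """Return indices of the `keep_fraction` of samples with the smallest loss."""
    num_keep = int(round(keep_fraction * len(losses)))
    return np.argsort(losses)[:num_keep]

probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.05, 0.95]])
labels = np.array([0, 1, 1, 0])  # the last two labels disagree with the model
losses = cross_entropy_per_sample(probs, labels)
kept = select_small_loss(losses, keep_fraction=0.5)  # treated as likely clean
```

Only the samples in `kept` would contribute to the next gradient update; the rest are discarded or, in semi-supervised variants, treated as unlabeled.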

## How is noise used as regularization?

Paradoxically, noise can also be beneficial when introduced deliberately during training. Adding controlled noise acts as a form of [regularization](/wiki/regularization) that helps prevent [overfitting](/wiki/overfitting) and improves generalization. Bishop (1995) gave this idea a precise theoretical footing, showing that training with added input noise is, to second order in the noise amplitude, equivalent to minimizing a regularized error function; the strength of the regularization scales with the noise variance.[10]
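
For a linear model the correspondence is exact and easy to check numerically: the expected squared error under zero-mean input noise of variance σ² equals the clean squared error plus σ² times the squared norm of the weights. The sketch below, assuming NumPy and synthetic data, compares that closed form with a Monte Carlo average. It illustrates the linear special case only, not Bishop's general derivation for nonlinear networks.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 5, 0.3
X = rng.normal(size=(n, d))
w = rng.normal(size=d)  # an arbitrary weight vector to evaluate
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Closed form: clean mean squared error plus an L2 penalty weighted by sigma^2.
clean_mse = np.mean((y - X @ w) ** 2)
regularized = clean_mse + sigma ** 2 * np.sum(w ** 2)

# Monte Carlo: average the squared error over many noisy copies of the inputs.
num_draws = 2000
noisy_mse = np.mean([
    np.mean((y - (X + sigma * rng.normal(size=X.shape)) @ w) ** 2)
    for _ in range(num_draws)
])
```

The two quantities `regularized` and `noisy_mse` should agree up to sampling error, which shows the penalty strength scaling with the noise variance.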

### How does dropout act as noise?

[Dropout](/wiki/dropout_regularization) is one of the most widely used regularization techniques in deep learning. During training, dropout randomly sets a fraction of neuron activations to zero, effectively injecting binary multiplicative noise into the network. In the words of Srivastava et al. (2014), "The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much."[2] This forces the network to learn more robust features, and the paper reports that dropout "significantly reduces overfitting" and yields state-of-the-art results across vision, speech, and document-classification benchmarks.[2] Dropout can also be interpreted as a form of [data augmentation](/wiki/data_augmentation), since randomly dropping units has an effect comparable to training on noise-perturbed versions of the data.[2]
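
The multiplicative-noise view can be sketched in a few lines. This example assumes NumPy and uses the "inverted" variant common in modern frameworks, which rescales activations during training; the original paper instead rescales weights at test time, so this is an equivalent-in-expectation sketch and not the paper's exact procedure.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero units at random, rescale the survivors."""
    if not training or drop_prob == 0.0:
        return activations  # no noise at inference time
    keep_prob = 1.0 - drop_prob
    # Binary mask: each unit is kept independently with probability keep_prob.
    mask = rng.random(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((1000, 100))
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
```

With `drop_prob=0.5` every surviving unit is doubled, so the mean of `dropped` stays close to the mean of `hidden`.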

### Where can noise be injected during training?

Noise can be injected at multiple points in the training pipeline, each with distinct effects:

| Injection Point | Mechanism | Effect |
|-----------------|-----------|--------|
| **Input noise** | Add Gaussian or other noise to input features | Acts as data augmentation; makes the model robust to input perturbations |
| **Weight noise** | Add noise to network weights during training | Pushes the model toward flat minima; under simplifying assumptions it behaves like a form of L2 regularization |
| **Gradient noise** | Add noise to computed gradients | Helps escape sharp local minima; can improve convergence |
| **Activation noise** | Add noise to hidden layer activations | Similar to dropout; encourages distributed representations |
| **Output noise** | Add noise to target labels (label smoothing) | Prevents overconfident predictions; improves calibration |

Adding noise to weights can be interpreted as a traditional form of regularization that encourages the model to be relatively insensitive to small variations in weights, finding points that are not merely minima but minima surrounded by flat regions in the loss landscape.[10]
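
The output-noise row of the table can be illustrated with label smoothing. The sketch below assumes NumPy and one common formulation, in which a fraction `epsilon` of the probability mass is spread uniformly over all classes; the function name and default value are illustrative.

```python
import numpy as np

def smooth_labels(labels, num_classes, epsilon=0.1):
    """Turn integer labels into softened one-hot targets."""
    one_hot = np.eye(num_classes)[labels]
    # Keep (1 - epsilon) on the labeled class, spread epsilon uniformly.
    return one_hot * (1.0 - epsilon) + epsilon / num_classes

targets = smooth_labels(np.array([0, 2]), num_classes=3, epsilon=0.1)
```

Each row of `targets` still sums to one, but no class receives a target of exactly 0 or 1, which discourages overconfident predictions.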

### What is Gaussian noise injection?

Gaussian noise injection involves adding random values drawn from a normal distribution (typically with zero mean and a tunable standard deviation) to inputs, weights, or activations during training. This technique is simple to implement and provides effective regularization. Parametric Noise Injection (PNI), proposed by He et al. (2019), extends this idea by making the noise parameters trainable, allowing the network to learn the optimal noise level at each layer.[6] PNI has been shown to improve robustness against adversarial attacks, including PGD, FGSM, and C&W attacks.[6]
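
A minimal sketch of the fixed (non-trainable) version applied to inputs, assuming NumPy; the function name and the `training` flag are illustrative. PNI's learned noise scale is not shown here.

```python
import numpy as np

def add_gaussian_noise(x, std, rng, training=True):
    """Add zero-mean Gaussian noise with standard deviation `std` to x."""
    if not training:
        return x  # inputs are left untouched at inference time
    return x + rng.normal(0.0, std, size=x.shape)

rng = np.random.default_rng(0)
batch = np.zeros((1000, 50))
noisy_batch = add_gaussian_noise(batch, std=0.1, rng=rng)
```

The standard deviation is the tunable knob: larger values regularize more strongly but risk degrading accuracy on clean data.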

However, naive noise injection can degrade accuracy on clean data. More recent approaches address this trade-off by using class-wise feature alignment mechanisms that bring noisy data clusters closer to their clean counterparts.

### How does data augmentation relate to noise?

[Data augmentation](/wiki/data_augmentation) is closely related to noise injection. Techniques such as random cropping, flipping, color jittering, and adding synthetic noise to images all introduce controlled perturbations that expand the effective training set and improve model robustness. In text domains, augmentation strategies include synonym replacement, random insertion, and back-translation, all of which introduce a form of noise into the training data.

## Why does noise in gradient estimation help?

[Stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd) (SGD), the workhorse optimization algorithm of modern deep learning, operates on noisy gradient estimates by design. Instead of computing the gradient over the entire training set (full-batch gradient descent), SGD computes gradients on small random subsets (mini-batches), producing an approximation of the true gradient that includes sampling noise.
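
The sampling noise can be measured directly. The sketch below, assuming NumPy and a synthetic linear-regression problem, compares mini-batch gradients against the full-batch gradient at a fixed point and shows that smaller batches give noisier estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.5 * rng.normal(size=n)
w = np.zeros(d)  # point at which gradients are evaluated

def mse_gradient(Xb, yb, w):
    """Gradient of 0.5 * mean((Xb @ w - yb)^2) with respect to w."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

full_grad = mse_gradient(X, y, w)

def minibatch_error(batch_size, num_trials=500):
    """Average squared distance between mini-batch and full-batch gradients."""
    errs = []
    for _ in range(num_trials):
        idx = rng.choice(n, size=batch_size, replace=False)
        errs.append(np.sum((mse_gradient(X[idx], y[idx], w) - full_grad) ** 2))
    return np.mean(errs)

err_small, err_large = minibatch_error(8), minibatch_error(128)
```

Here `err_small` should be noticeably larger than `err_large`, reflecting the roughly inverse relationship between batch size and gradient noise.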

### What are the benefits of gradient noise?

The noise in stochastic gradient estimates provides several important benefits:

- **Generalization improvement:** Empirical and theoretical studies have shown that the noise in SGD helps models generalize better. Smith and Le (2018) characterized an SGD "noise scale" that grows with the learning rate and the dataset-to-batch-size ratio, and showed there is an optimal batch size that maximizes test accuracy because it sets this noise scale to an optimal value.[5]
- **Escape from sharp minima:** Sharp minima (narrow valleys in the loss surface) are unstable under stochastic updates. The noise causes the optimizer to bounce around and eventually escape these regions. Flat minima (wide valleys) are more stable and empirically generalize better.
- **Implicit regularization:** The stochasticity from random mini-batch sampling acts as an implicit regularizer, biasing the optimization toward flatter, more generalizable solutions.

Neelakantan et al. (2017) showed that deliberately adding decaying Gaussian noise to computed gradients during training can improve learning, particularly in very deep and complex networks where the standard gradient signal becomes weaker.[7]
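
A sketch of annealed gradient noise, assuming NumPy. The variance schedule `eta / (1 + step) ** gamma` and the default values are assumptions modeled on the cited paper and should be checked against it before use.

```python
import numpy as np

def noise_std(step, eta=0.3, gamma=0.55):
    """Standard deviation of the gradient noise at a given training step."""
    # Assumed schedule: variance = eta / (1 + step) ** gamma, decaying over time.
    return np.sqrt(eta / (1.0 + step) ** gamma)

def noisy_gradient(grad, step, rng, eta=0.3, gamma=0.55):
    """Return the gradient plus zero-mean Gaussian noise for this step."""
    return grad + rng.normal(0.0, noise_std(step, eta, gamma), size=grad.shape)

rng = np.random.default_rng(0)
grad = np.array([0.5, -1.0, 0.25])
early = noisy_gradient(grad, step=0, rng=rng)      # strong perturbation
late = noisy_gradient(grad, step=10_000, rng=rng)  # much weaker perturbation
```

The decay means exploration dominates early in training while later updates follow the true gradient more closely.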

## What is the signal-to-noise ratio?

The signal-to-noise ratio (SNR) is a concept borrowed from signal processing that quantifies the strength of meaningful information (signal) relative to unwanted variation (noise) in the data. In machine learning, a high SNR means the underlying patterns are strong relative to the noise, making them easier for models to learn. Conversely, a low SNR indicates that noise dominates, making pattern detection more difficult.

Several factors influence the effective SNR in a machine learning pipeline:

- **Data quality:** Cleaner data collection processes produce higher SNR.
- **Feature engineering:** Well-designed features can amplify the signal relative to noise.
- **Dimensionality reduction:** Techniques like [PCA](/wiki/dimension_reduction) can remove noise-dominated dimensions.
- **Sample size:** Larger datasets tend to average out noise, effectively increasing SNR.
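
The sample-size effect can be checked with a small experiment. This sketch, assuming NumPy and a synthetic sine-wave signal, uses the usual signal-processing definition of SNR in decibels (ratio of mean signal power to mean noise power) and shows the gain from averaging 16 independent noisy measurements.

```python
import numpy as np

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels from separate signal and noise arrays."""
    return 10.0 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))

rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0.0, 20.0 * np.pi, 5000))
noise = rng.normal(size=(16, 5000))  # 16 independent noisy measurements

snr_single = snr_db(signal, noise[0])
snr_averaged = snr_db(signal, noise.mean(axis=0))  # noise left after averaging
gain = snr_averaged - snr_single
```

Averaging k independent measurements cuts the noise variance by a factor of k, so `gain` should be close to 10·log10(16), about 12 dB.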

## What is denoising?

[Denoising](/wiki/denoising) is the process of recovering clean signals from noisy observations. It has become a fundamental concept in modern deep learning, underpinning several important model architectures.

### What are denoising autoencoders?

Denoising autoencoders (DAEs), introduced by Vincent et al. (2008), are neural networks trained to reconstruct clean inputs from deliberately corrupted versions.[1] The authors describe their method as "a new training principle for unsupervised learning of a representation based on the idea of making the learned representations robust to partial corruption of the input pattern."[1] The encoder receives a noisy input and must produce a latent representation that captures only the essential features, while the decoder reconstructs the original clean input. By learning to remove noise, the model learns robust and meaningful feature representations that transfer well to downstream tasks.[1]
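
The training principle can be sketched with a tiny linear autoencoder. This is a toy illustration assuming NumPy, synthetic data, masking noise, and plain gradient descent; the sizes, learning rate, and linear layers are illustrative choices, not the architecture used by Vincent et al.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 256, 8, 2
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))  # clean data on a 2-D subspace

def mask_corrupt(x, drop_prob, rng):
    """Masking noise: zero each input entry independently with prob drop_prob."""
    return x * (rng.random(x.shape) >= drop_prob)

W_enc = 0.1 * rng.normal(size=(d, k))  # encoder weights
W_dec = 0.1 * rng.normal(size=(k, d))  # decoder weights

def reconstruction_loss(x_noisy, x_clean):
    """Mean squared error between the reconstruction and the CLEAN input."""
    return np.mean((x_noisy @ W_enc @ W_dec - x_clean) ** 2)

X_eval = mask_corrupt(X, 0.3, rng)  # fixed corrupted copy for evaluation
loss_before = reconstruction_loss(X_eval, X)

lr = 0.05
for _ in range(300):
    X_noisy = mask_corrupt(X, 0.3, rng)   # fresh corruption each step
    H = X_noisy @ W_enc                   # latent code from the noisy input
    G = 2.0 * (H @ W_dec - X) / X.size    # gradient of loss w.r.t. reconstruction
    grad_dec = H.T @ G
    grad_enc = X_noisy.T @ (G @ W_dec.T)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

loss_after = reconstruction_loss(X_eval, X)
```

The key detail is that the loss compares the reconstruction with the clean `X`, not with the corrupted input, so the model cannot succeed by copying its input.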

### How do diffusion models use noise?

[Diffusion models](/wiki/diffusion_model) represent one of the most significant developments in generative AI. These models work by learning to reverse a gradual noising process:

1. **Forward process:** Gaussian noise is incrementally added to data over a series of timesteps until the data becomes pure noise.
2. **Reverse process:** A neural network learns to denoise at each timestep, gradually transforming random noise back into coherent data samples.

In the foundational Denoising Diffusion Probabilistic Models paper, Ho et al. (2020) used a Markov chain of T = 1000 timesteps with a fixed linear variance schedule, and the trained model achieved an Inception Score of 9.46 and a state-of-the-art FID of 3.17 on unconditional CIFAR-10 image generation.[8] Research has shown that the representation capability of diffusion models is primarily gained through the denoising-driven process, not the diffusion process itself, and that using multiple noise levels during training is analogous to data augmentation.[8] Diffusion models have since achieved state-of-the-art results in image generation, audio synthesis, video generation, and molecular design.
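
The forward process has a closed form that lets a sample at any timestep be drawn directly from the clean data. The sketch below assumes NumPy and the linear schedule endpoints reported in the DDPM paper (1e-4 to 0.02); the function and variable names are illustrative.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)    # linear variance schedule
alpha_bars = np.cumprod(1.0 - betas)  # cumulative fraction of signal retained

def q_sample(x0, t, rng):
    """Draw x_t from the forward process given clean data x0 and timestep t."""
    noise = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return x_t, noise

rng = np.random.default_rng(0)
x0 = np.ones((4, 4))
x_early, _ = q_sample(x0, 0, rng)      # almost identical to x0
x_late, _ = q_sample(x0, T - 1, rng)   # almost pure Gaussian noise
```

During training, the network is given `x_t` and the timestep and is typically trained to predict the returned `noise` with a squared-error loss.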

## What is irreducible error (Bayes error)?

Every prediction task has a minimum achievable error rate known as the Bayes error rate or irreducible error. This error floor exists because of inherent noise and ambiguity in the data that no model, no matter how powerful, can overcome.

The Bayes error rate arises from two primary sources:

- **Feature overlap:** When instances from different classes share identical or very similar feature values, the best any classifier can do is predict the most likely class, accepting some unavoidable misclassification.
- **Measurement noise:** Random inconsistencies, sensor limitations, or labeling errors introduce irreducible uncertainty into the data.

In a regression setting with squared error, the irreducible error equals the noise variance. This noise term is one of the three components of the bias-variance decomposition of expected prediction error, alongside the squared bias and the variance. Understanding the Bayes error rate is practically important: if a model's error is close to the estimated Bayes error, there is little room for algorithmic improvement, and further gains must come from collecting better data or extracting more informative features.
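
The decomposition can be verified by simulation. This sketch, assuming NumPy, uses a deliberately biased toy estimator (a shrunken sample mean) so that all three terms are non-zero; the constants are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, shrink, trials = 2.0, 1.0, 10, 0.8, 200_000

# Each trial: draw a training set of n noisy observations, predict a shrunken mean.
train = mu + sigma * rng.normal(size=(trials, n))
preds = shrink * train.mean(axis=1)
new_y = mu + sigma * rng.normal(size=trials)  # fresh noisy targets to predict

bias_sq = (preds.mean() - mu) ** 2   # theory: ((shrink - 1) * mu)^2 = 0.16
variance = preds.var()               # theory: shrink^2 * sigma^2 / n = 0.064
noise = sigma ** 2                   # irreducible error
expected_error = np.mean((new_y - preds) ** 2)
```

Up to sampling error, `expected_error` matches `bias_sq + variance + noise`, and no choice of estimator can push it below `noise`.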

## What strategies reduce noise?

Practitioners employ a variety of strategies to mitigate the effects of noise:

| Strategy | Description |
|----------|-------------|
| **Data Cleaning** | Identify and correct errors, remove duplicates, fill missing values, and handle outliers |
| **Feature Selection** | Remove noisy or irrelevant features using mutual information, recursive elimination, or correlation analysis |
| **Dimensionality Reduction** | Use PCA, t-SNE, or autoencoders to project data into lower-dimensional spaces that preserve signal while discarding noise |
| **Ensemble Methods** | Combine predictions from multiple models to average out individual model noise; bagging is the typical choice, while boosting can be sensitive to label noise |
| **Robust Algorithms** | Use algorithms inherently resistant to noise, such as [random forests](/wiki/random_forest), SVMs with soft margins, or Huber loss regression |
| **Cross-validation** | Use [cross-validation](/wiki/cross-validation) to detect and prevent overfitting to noisy training data |
| **[Regularization](/wiki/regularization)** | Apply L1, L2, or dropout regularization to constrain model complexity |

## What are the types of noise in sensors and images?

In [computer vision](/wiki/computer_vision) and sensor-based applications, noise is a pervasive challenge. Common types of image noise include:

- **Gaussian noise:** Random intensity variations following a normal distribution, common in low-light photography and electronic sensors.
- **Salt-and-pepper noise:** Random white and black pixels caused by transmission errors or faulty sensor elements.
- **Shot noise (Poisson noise):** Caused by the quantum nature of light; particularly significant in low-photon-count imaging such as astronomy and microscopy.
- **Speckle noise:** Granular noise common in radar, ultrasound, and coherent imaging systems.
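
These noise types are often simulated when testing denoising methods. The sketch below assumes NumPy and a grayscale image with values in [0, 1]; the function names are illustrative, and speckle is approximated as multiplicative Gaussian noise, which simplifies the statistics of real coherent imaging systems.

```python
import numpy as np

def add_gaussian(img, std, rng):
    """Additive zero-mean Gaussian noise."""
    return np.clip(img + rng.normal(0.0, std, img.shape), 0.0, 1.0)

def add_salt_and_pepper(img, amount, rng):
    """Set a fraction `amount` of pixels to black or white, half each."""
    out = img.copy()
    r = rng.random(img.shape)
    out[r < amount / 2] = 0.0        # pepper
    out[r > 1.0 - amount / 2] = 1.0  # salt
    return out

def add_shot(img, photons, rng):
    """Poisson noise: `photons` is the expected count for a fully bright pixel."""
    return np.clip(rng.poisson(img * photons) / photons, 0.0, 1.0)

def add_speckle(img, std, rng):
    """Multiplicative noise (simplified Gaussian model of speckle)."""
    return np.clip(img * (1.0 + rng.normal(0.0, std, img.shape)), 0.0, 1.0)

rng = np.random.default_rng(0)
img = np.full((64, 64), 0.5)  # flat gray test image
noisy_images = {
    "gaussian": add_gaussian(img, 0.1, rng),
    "salt_and_pepper": add_salt_and_pepper(img, 0.1, rng),
    "shot": add_shot(img, 50, rng),
    "speckle": add_speckle(img, 0.2, rng),
}
```

Lowering `photons` in `add_shot` makes the noise stronger, mirroring the low-photon-count regimes mentioned above.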

Deep learning-based denoising methods, including convolutional neural networks and diffusion-based approaches, have largely surpassed traditional filtering methods (Gaussian blur, median filters, bilateral filters) for image noise reduction.

## Explain Like I'm 5 (ELI5)

Imagine you are trying to listen to your favorite song on the radio, but there is static and crackling mixed in with the music. The music is the "signal" (the real pattern), and the static is the "noise" (the unwanted stuff). In machine learning, the computer is trying to listen to the "music" hidden inside messy data, but noise makes it harder to hear the real tune. Sometimes, a little bit of static can actually help the computer learn better, because it forces the computer to focus on the loudest, most important parts of the music instead of memorizing every tiny detail, including the crackles.

## References

1. Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. (2008). "Extracting and Composing Robust Features with Denoising Autoencoders." *Proceedings of the 25th International Conference on Machine Learning (ICML)*.
2. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." *Journal of Machine Learning Research*, 15, 1929-1958.
3. Natarajan, N., Dhillon, I. S., Ravikumar, P., & Tewari, A. (2013). "Learning with Noisy Labels." *Advances in Neural Information Processing Systems (NeurIPS)*.
4. Song, H., Kim, M., Park, D., Shin, Y., & Lee, J.-G. (2022). "Learning from Noisy Labels with Deep Neural Networks: A Survey." *IEEE Transactions on Neural Networks and Learning Systems*, 34(11), 8135-8153.
5. Smith, S. L., & Le, Q. V. (2018). "A Bayesian Perspective on Generalization and Stochastic Gradient Descent." *International Conference on Learning Representations (ICLR)*.
6. He, Z., Rakin, A. S., & Fan, D. (2019). "Parametric Noise Injection: Trainable Randomness to Improve Deep Neural Network Robustness Against Adversarial Attack." *IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*.
7. Neelakantan, A., Vilnis, L., Le, Q. V., Sutskever, I., Kaiser, L., Kurach, K., & Martens, J. (2017). "Adding Gradient Noise Improves Learning for Very Deep Networks." *ICLR Workshop*.
8. Ho, J., Jain, A., & Abbeel, P. (2020). "Denoising Diffusion Probabilistic Models." *Advances in Neural Information Processing Systems (NeurIPS)*.
9. Frenay, B., & Verleysen, M. (2014). "Classification in the Presence of Label Noise: A Survey." *IEEE Transactions on Neural Networks and Learning Systems*, 25(5), 845-869.
10. Bishop, C. M. (1995). "Training with Noise is Equivalent to Tikhonov Regularization." *Neural Computation*, 7(1), 108-116.
11. Northcutt, C. G., Athalye, A., & Mueller, J. (2021). "Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks." *NeurIPS Datasets and Benchmarks Track*.
12. Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017). "Understanding Deep Learning Requires Rethinking Generalization." *International Conference on Learning Representations (ICLR)*.