# Training Set

> Source: https://aiwiki.ai/wiki/training_set
> Updated: 2026-06-20
> Categories: Data & Datasets, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

A **training set** is the portion of a [dataset](/wiki/dataset) that a [machine learning](/wiki/machine_learning) model learns from: the labeled examples a model processes during training to adjust its internal [parameters](/wiki/parameter) and build the mathematical relationships it uses to make predictions on new, unseen data [1]. In Tom Mitchell's classic formulation, a program "learn[s] from experience E with respect to some task T and some performance measure P" [31], and the training set is that experience E. It is one of three standard data partitions in [supervised machine learning](/wiki/supervised_machine_learning), alongside the [validation set](/wiki/validation_set) (used to tune the model) and the [test set](/wiki/test_set) (used to score it once at the end).

The quality, size, and representativeness of the training set have a direct impact on a model's accuracy, generalization ability, and fairness. Poorly constructed training sets can lead to [overfitting](/wiki/overfitting), [bias](/wiki/bias), and unreliable predictions in production [3]. The phrase "garbage in, garbage out" is a useful summary: a model can only learn patterns that exist in its training data, so practitioners spend a significant fraction of any project on collecting, cleaning, labeling, and validating the data that goes into training rather than on the model architecture itself. Andrew Ng, who popularized this emphasis, defines the broader practice as "the discipline of systematically engineering the data needed to successfully build an AI system" [32].

## How do training sets work?

In a typical [supervised machine learning](/wiki/supervised_machine_learning) workflow, a practitioner starts with a full dataset of [labeled examples](/wiki/labeled_example). Each example consists of input [features](/wiki/feature) paired with a known output (the label or target). The dataset is then split into non-overlapping subsets:

| Subset | Purpose | When Used |
|---|---|---|
| **Training set** | Teaches the model by exposing it to labeled examples so it can learn patterns and adjust weights | During model training |
| **[Validation set](/wiki/validation_set)** | Provides feedback for tuning [hyperparameters](/wiki/hyperparameter) and detecting [overfitting](/wiki/overfitting) | During model development |
| **[Test set](/wiki/test_set)** | Gives an unbiased estimate of final model performance on data it has never seen | After training is complete |

The model iterates over the training set multiple times (each full pass is called an [epoch](/wiki/epoch)). In each pass, it computes a [loss function](/wiki/loss_function) that measures prediction errors, then uses an [optimizer](/wiki/optimizer) (such as [stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd)) to update its [parameters](/wiki/parameter) in the direction that reduces the loss. This cycle repeats until the model converges or a stopping criterion is met.
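The epoch, loss, and update cycle described above can be sketched in a few lines. This is a minimal illustration, assuming a one-feature linear model, a squared-error loss, and a hypothetical noiseless training set; the function name and hyperparameter values are illustrative choices, not part of any library.

```python
import random

def train_linear_model(examples, epochs=200, learning_rate=0.05, seed=0):
    """Fit y ≈ w * x + b with per-example stochastic gradient descent."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(examples)
    history = []  # mean squared error measured after each epoch
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a fresh random order each epoch
        for x, y in data:
            error = (w * x + b) - y
            # Gradients of the loss 0.5 * error**2 with respect to w and b.
            w -= learning_rate * error * x
            b -= learning_rate * error
        mse = sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)
        history.append(mse)
    return w, b, history

# Hypothetical training set generated from y = 2x + 1 (assumption for illustration).
training_set = [(x / 10, 2 * (x / 10) + 1) for x in range(-10, 11)]
w, b, history = train_linear_model(training_set)
```

Here the stopping criterion is simply a fixed number of epochs; practical training loops usually also monitor validation loss.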

For most supervised tasks the training set is fixed once the project starts. In [reinforcement learning](/wiki/reinforcement_learning) the equivalent is a stream of episodes generated by interaction with an environment, and in [self-supervised learning](/wiki/self-supervised_learning) the labels are derived from the data itself (such as predicting the next token in a sequence). The discussion below focuses on supervised settings, though most of the same principles apply once the training distribution is fixed.

## How are training, validation, and test splits chosen?

### What are common split ratios?

There is no single "correct" ratio for dividing data. The best split depends on the total dataset size, the complexity of the model, and the number of [hyperparameters](/wiki/hyperparameter) to tune. The following ratios are widely used in practice:

| Split Scheme | Training | Validation | Test | Best For |
|---|---|---|---|---|
| 80 / 10 / 10 | 80% | 10% | 10% | Large datasets (100k+ samples) |
| 70 / 15 / 15 | 70% | 15% | 15% | Medium datasets (10k to 100k samples) |
| 60 / 20 / 20 | 60% | 20% | 20% | Smaller datasets or complex models |
| 98 / 1 / 1 | 98% | 1% | 1% | Very large datasets (millions of examples) |

When only a train/test split is needed (without a separate validation set), common ratios are 80/20 or 75/25. In deep learning with very large datasets, practitioners sometimes allocate as much as 98% to training, because even 1% of a massive dataset provides thousands of validation or test examples. Joseph (2022) studied the question of optimal ratios formally and showed that, for linear regression, the optimal training-to-test ratio is roughly √p : 1, where p is the number of model parameters. The test fraction is then about 1/(√p + 1), which falls below the rule-of-thumb 20% once p exceeds 16 [4].
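As a rough sketch of how one of the ratios in the table might be applied, the function below shuffles a dataset once and cuts it into three non-overlapping parts. The function name, default ratios, and seed are assumptions made for illustration.

```python
import random

def three_way_split(examples, train=0.8, validation=0.1, seed=0):
    """Shuffle once, then cut into non-overlapping train/validation/test lists."""
    if not 0 < train + validation < 1:
        raise ValueError("train + validation must leave room for a test set")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])  # the remainder becomes the test set

# 80 / 10 / 10 split of a hypothetical dataset of 1,000 examples.
train_set, val_set, test_set = three_way_split(range(1000))
```

Fixing the seed keeps the split reproducible, so later experiments are compared on the same held-out examples.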

### Why three splits and not two?

The purpose of the [validation set](/wiki/validation_set) is to provide a feedback signal for choices made by the human or by an automated search procedure: model architecture, learning rate, regularization strength, the number of training epochs, and so on. Every time a [hyperparameter](/wiki/hyperparameter) is changed because of validation performance, the validation set leaks a small amount of information into the model. After many iterations the validation score becomes optimistically biased, in much the same way that running thousands of significance tests will eventually find a spurious result. The [test set](/wiki/test_set) is held back and ideally consulted only once at the end, so its score is a faithful estimate of generalization to new data [2]. When test performance is repeatedly checked during development, practitioners say the test set has been "burned" and a new held-out set must be collected.

### What are the main splitting strategies?

How data is divided matters as much as the ratio itself. The four main strategies are random splitting, stratified splitting, temporal splitting, and group-aware splitting. The right choice depends on the structure of the data and the prediction task.

#### Random splitting

Random splitting shuffles the dataset and assigns samples to each subset based on the chosen ratio. It is the simplest and most common approach, and works well when the data is large and roughly balanced across classes. However, random splitting can produce subsets with skewed class distributions, especially when working with imbalanced datasets, and it ignores any group structure that may be present.

#### Stratified splitting

Stratified splitting forces each subset (training, validation, test) to preserve the same class distribution as the original dataset. For example, if the full dataset contains 90% negative examples and 10% positive examples, each split will maintain that 90/10 ratio. Stratified splitting is essential for imbalanced [classification](/wiki/classification) problems, because a purely random split could place very few minority-class examples in the validation or test sets, producing unreliable performance estimates. Scikit-learn provides `StratifiedShuffleSplit` and `StratifiedKFold` for this purpose. Stratification can also be done on continuous targets by binning the values first, and on multi-label data via iterative stratification.
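The idea can be illustrated without any library: split each class separately, then merge the pieces. This is a simplified sketch under assumed names and defaults, not a reproduction of scikit-learn's implementation.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random choice within each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# The 90/10 example from the text: 90 negatives (0) and 10 positives (1).
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels)
```

With these inputs the test split holds 18 negatives and 2 positives, the same 90/10 ratio as the full dataset.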

#### Temporal splitting for time series

For [time series](/wiki/time_series_analysis) data, random shuffling is not appropriate because it would cause data leakage: the model would train on future data and be tested on past data. Temporal splitting uses a chronological cutoff point. All data before the cutoff becomes the training set, and all data after the cutoff forms the validation or test set. Scikit-learn's `TimeSeriesSplit` implements an expanding-window variant where each fold adds more historical data to the training window while testing on the next time period. A walk-forward variant uses a sliding window of fixed length, simulating a model that is retrained periodically and used only on the immediately following period.
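An expanding-window split can be sketched as a small generator over chronologically ordered indices. The function below is a hypothetical illustration of the idea; its name and the way fold sizes are chosen are assumptions rather than the exact behavior of any library.

```python
def expanding_window_splits(n_samples, n_splits=3, test_size=None):
    """Yield (train_indices, test_indices) pairs in chronological order."""
    if test_size is None:
        test_size = n_samples // (n_splits + 1)
    for fold in range(n_splits):
        test_start = n_samples - (n_splits - fold) * test_size
        train = list(range(0, test_start))  # everything before the cutoff
        test = list(range(test_start, test_start + test_size))
        yield train, test

# Twelve time-ordered observations, three folds of three test points each.
splits = list(expanding_window_splits(n_samples=12, n_splits=3))
```

Each fold trains only on indices that precede its test window, so no future observation reaches the model. A sliding-window variant would instead drop the oldest indices to keep the training window at a fixed length.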

#### Group-aware splitting

Many datasets contain natural groups: multiple measurements per patient in a medical dataset, multiple sentences per document in NLP, multiple frames per video in computer vision. If a single group is split across training and test sets, the model can effectively memorize group-specific signals during training and recognize them in the test set, producing optimistic but misleading scores. The fix is group-aware splitting, where every example from a given group is assigned to exactly one subset. Scikit-learn provides `GroupKFold` and `GroupShuffleSplit` for this purpose, and `GroupTimeSeriesSplit` (in the mlxtend library) combines group-aware splitting with chronological ordering.
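A minimal sketch of group-aware splitting follows: the split is decided per group, and each example inherits the assignment of its group. The patient identifiers and the function name are hypothetical.

```python
import random

def group_split(groups, test_fraction=0.25, seed=0):
    """Assign whole groups to train or test so no group straddles the boundary."""
    unique_groups = sorted(set(groups))
    random.Random(seed).shuffle(unique_groups)
    n_test_groups = max(1, round(len(unique_groups) * test_fraction))
    test_groups = set(unique_groups[:n_test_groups])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx

# Hypothetical dataset: one entry per measurement, labeled with its patient.
patient_ids = ["p1", "p1", "p2", "p2", "p2", "p3", "p4", "p4"]
train_idx, test_idx = group_split(patient_ids)
```

Because groups differ in size, the resulting example-level ratio only approximates the requested fraction.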

## What is data leakage and how do you prevent it?

Data leakage occurs when information from outside the training set sneaks into the model during training, producing test scores that overestimate real-world performance. Leakage is one of the most common and damaging errors in applied machine learning, and it can be subtle enough that even experienced practitioners miss it.

### Common forms of leakage

| Leakage Type | Description | Example |
|---|---|---|
| **Train/test contamination** | Examples from the test set appear in the training set, often through duplicates or near-duplicates | Web-scraped data containing reposts and mirrors of the same article |
| **Target leakage** | A feature carries information about the target that would not be available at prediction time | Including "customer churned date" as a feature when predicting churn |
| **Preprocessing leakage** | Statistics computed on the full dataset are used to transform training data | Standardizing features using the mean of the entire dataset before splitting |
| **Group leakage** | Different examples from the same group appear in both training and test | One patient's MRI scan in training, another in test, with identifying signal in both |
| **Temporal leakage** | The training set contains examples from after the test period | Random splits of time-stamped data |
| **Label leakage** | The label is encoded indirectly in another feature | A free-text note that says "approved" when predicting loan approval |

### Preventing leakage

The core principle is to split first, then process. Any imputation of missing values, scaling, encoding, feature selection, or oversampling must be fit on the training set only and then applied to the validation and test sets [29]. Scikit-learn `Pipeline` objects automate this for many transformations. Other safeguards include deduplicating before splitting (with hash-based or near-duplicate detection), checking that timestamps and group identifiers respect the split boundaries, and inspecting any feature that makes validation scores look suspiciously good.
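The "split first, then process" rule can be shown with feature standardization. In this sketch the mean and standard deviation are learned from the training values alone and then reused unchanged on the test values; the helper names and the numbers are assumptions for illustration.

```python
import statistics

def fit_standardizer(train_values):
    """Learn mean and standard deviation from the training split only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values) or 1.0  # guard against constant features
    return mean, std

def apply_standardizer(values, mean, std):
    """Transform any split with statistics that were fit on the training split."""
    return [(v - mean) / std for v in values]

train_values = [1.0, 2.0, 3.0, 4.0, 5.0]
test_values = [100.0, 200.0]  # extreme test values must not influence the fit

mean, std = fit_standardizer(train_values)
train_scaled = apply_standardizer(train_values, mean, std)
test_scaled = apply_standardizer(test_values, mean, std)
```

Had the statistics been computed on all seven values together, the two test values would have shifted the mean and scale seen during training, which is the preprocessing leakage described in the table above.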

## How do you handle class imbalance in a training set?

Many real classification problems involve a dominant majority class and one or more minority classes. Fraud detection, rare-disease diagnosis, defect detection, and click prediction all share this pattern. A model that always predicts the majority class can achieve high accuracy while being useless for the actual task. Several techniques address this in the training set itself.

### Resampling techniques

| Technique | What It Does | Trade-off |
|---|---|---|
| **Random oversampling** | Duplicates random minority-class examples | Simple, but can overfit to the duplicated samples |
| **Random undersampling** | Discards random majority-class examples | Reduces overfitting risk, but loses information |
| **SMOTE** | Generates synthetic minority examples by interpolating between nearest neighbors | Creates new examples, but can blur class boundaries |
| **Borderline-SMOTE** | Synthesizes only near the decision boundary | Focuses learning where it matters, more sensitive to noise |
| **ADASYN** | Adaptive synthetic sampling that emphasizes harder minority examples | Adapts to local difficulty, but amplifies outliers |
| **Tomek links / ENN** | Cleans noisy and overlapping examples after over- or under-sampling | Improves boundary clarity, but is computationally heavier |

SMOTE (Synthetic Minority Over-sampling Technique) was introduced by Chawla, Bowyer, Hall, and Kegelmeyer in 2002 [5]. The algorithm picks a minority-class example, finds its k nearest neighbors in the same class (typically k=5), and creates a synthetic example along the line segment to a randomly chosen neighbor. The original paper, published in the Journal of Artificial Intelligence Research, showed that combining SMOTE with undersampling of the majority class produced higher area under the ROC curve than undersampling alone [5]. SMOTE remains a baseline for tabular imbalanced classification today, with implementations in the imbalanced-learn library.
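The interpolation step described above can be sketched as follows. This is a simplified illustration using brute-force nearest-neighbor search on a hypothetical set of two-dimensional points; it is not the imbalanced-learn implementation, and the function name is an assumption.

```python
import math
import random

def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create synthetic minority points by interpolating toward near neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest same-class neighbors of the base point, excluding itself.
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Hypothetical minority-class points in a 2-D feature space.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (0.2, 0.8)]
new_points = smote_sketch(minority_points, n_synthetic=10)
```

Every synthetic point lies on a segment between two real minority examples, which is why the method can blur class boundaries when minority and majority regions overlap. Consistent with the leakage rules above, it should be applied to the training split only.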

Cost-sensitive learning is an alternative to resampling: instead of changing the data, the loss function is reweighted so that errors on minority examples cost more. Many classifiers (logistic regression, gradient boosting, neural networks) accept class weights or sample weights for this purpose. Resampling and cost weighting can also be combined.
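One common weighting heuristic, assumed here for illustration, sets each class weight to the number of samples divided by the number of classes times the class count. The sketch below computes those weights and shows how they change the error of a trivial majority-class predictor; the function names are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

def weighted_error_rate(y_true, y_pred, weights):
    """Misclassification rate where each example counts by its class weight."""
    total = sum(weights[y] for y in y_true)
    wrong = sum(weights[y] for y, p in zip(y_true, y_pred) if y != p)
    return wrong / total

y_true = [0] * 90 + [1] * 10
weights = balanced_class_weights(y_true)
always_majority = [0] * 100
error = weighted_error_rate(y_true, always_majority, weights)
```

Unweighted, the always-majority predictor is wrong on only 10% of examples; under these weights its error is about 50%, so a loss reweighted this way no longer rewards ignoring the minority class.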

## How does data augmentation expand a training set?

[Data augmentation](/wiki/data_augmentation) is a family of techniques that create new training examples by applying transformations to existing ones. Unlike synthetic data, which is generated from scratch, augmentation modifies real samples to increase diversity and to encode invariances the model should respect (such as the fact that a rotated cat is still a cat) [10].

### Common augmentation techniques

| Domain | Techniques |
|---|---|
| **Images** | Horizontal/vertical flip, rotation, random crop, scaling, color jitter, Gaussian noise, cutout, random erasing, mixup, CutMix, AutoAugment, RandAugment |
| **Text** | Synonym replacement, random insertion, random deletion, random swap (EDA), back-translation, paraphrase generation, token masking, prompt-based augmentation with LLMs |
| **Audio** | Time stretching, pitch shifting, noise injection, speed perturbation, SpecAugment (time and frequency masking on spectrograms), reverberation |
| **Tabular** | SMOTE and variants, Gaussian noise on numeric columns, swap-noise, permutation of independent columns |
| **Graph** | Edge dropping, node dropping, attribute masking, subgraph sampling |

Mixup, introduced by Zhang and colleagues in 2017, trains the model on linear combinations of pairs of inputs and their labels, encouraging the model to behave linearly between training examples [7]. CutMix, proposed by Yun and colleagues in 2019, replaces a patch of one image with a patch from another and mixes the labels in proportion to the patch area [8]. Both techniques act as strong [regularizers](/wiki/regularization) and have become standard in image classification recipes. SpecAugment plays an analogous role in speech recognition, masking time and frequency bands directly on the input spectrogram.
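The mixup combination can be written directly from that description. The sketch below blends one pair of feature vectors and their one-hot labels with a weight drawn from a Beta distribution; the value `alpha=0.2` and the toy inputs are assumptions chosen for illustration.

```python
import random

def mixup_pair(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two examples and their one-hot labels with a Beta(alpha, alpha) weight."""
    lam = rng.betavariate(alpha, alpha)
    x_mixed = [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]
    y_mixed = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mixed, y_mixed, lam

rng = random.Random(0)
# Two hypothetical examples from different classes.
x_mixed, y_mixed, lam = mixup_pair([1.0, 0.0], [1.0, 0.0],
                                   [0.0, 1.0], [0.0, 1.0], rng=rng)
```

The mixed label remains a valid probability distribution because the two weights sum to one. CutMix follows the same labeling logic but sets the weight from the area of the pasted patch instead of sampling it independently of the pixels.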

For text, Easy Data Augmentation (EDA) by Wei and Zou (2019) combines four simple operations (synonym replacement, random insertion, random swap, random deletion) and reports consistent gains on small classification datasets [9]. Back-translation, where text is translated into another language and back to the original, produces fluent paraphrases and is particularly useful for low-resource languages and informal text.

Data augmentation is bounded by the quality and diversity of the original data. If the original training set contains systematic biases, augmented copies will carry those same biases, and aggressive augmentation can introduce label noise when transformations break the assumption of label invariance.

## How does cross-validation differ from a fixed split?

[Cross-validation](/wiki/cross_validation) provides a way to use all available data for both training and evaluation, which is valuable when the dataset is too small to afford a large held-out set [2]. In k-fold cross-validation, the dataset is divided into k equal-sized folds. The model is trained k times, each time using a different fold as the validation set and the remaining k minus 1 folds as the training set. The final performance estimate is the average across all k runs.
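The fold rotation can be sketched as an index generator plus an averaging helper. This is a minimal illustration that assumes the examples have already been shuffled; `score_fold` is a hypothetical callback standing in for "train on these indices, evaluate on those".

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for each of k folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

def cross_validate(n_samples, k, score_fold):
    """Average a user-supplied score over all k folds."""
    scores = [score_fold(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

folds = list(k_fold_indices(10, k=5))
```

Every example is used for validation exactly once and for training k − 1 times, which is what lets the procedure reuse all of the data.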

### Cross-validation variants

| Variant | How It Splits | When to Use |
|---|---|---|
| **K-fold** | k disjoint folds; common k is 5 or 10 | Standard choice for moderate datasets |
| **Stratified K-fold** | Preserves class proportions in each fold | Imbalanced classification |
| **Leave-one-out (LOO)** | One example per fold, n folds total | Very small datasets; computationally expensive |
| **Leave-p-out** | All possible p-sample test sets | Theoretical analyses; rarely used at scale |
| **Repeated K-fold** | K-fold repeated multiple times with different random seeds | Noisy estimates that need tighter confidence intervals |
| **Group K-fold** | Each group goes entirely into one fold | Patient data, document data, video frames |
| **Time-series split** | Expanding or sliding windows in chronological order | Forecasting; any time-stamped data |
| **Nested cross-validation** | Inner loop tunes hyperparameters, outer loop estimates generalization | Honest performance estimates when many hyperparameters are tuned |

The scikit-learn user guide recommends 5-fold or 10-fold cross-validation as a default, noting that leave-one-out tends to have high variance because the training sets across folds are nearly identical [28]. Nested cross-validation is the most rigorous option for small datasets where both model selection and final evaluation must come from the same pool, at the cost of on the order of k × m training runs per hyperparameter configuration, where k and m are the numbers of outer and inner folds.

## How does training set size affect performance?

The relationship between training set size and model performance is one of the most studied topics in machine learning. Understanding this relationship helps practitioners decide whether to invest in collecting more data or to focus on improving the model architecture.

### Learning curves

A learning curve plots model performance (such as accuracy or loss) on the y-axis against the number of training examples on the x-axis. Two curves are typically drawn together: one for training performance and one for validation performance.

| Observation | What It Means |
|---|---|
| Training score is high, validation score is low | The model is [overfitting](/wiki/overfitting); it memorizes training examples but fails to generalize |
| Both scores are low | The model is [underfitting](/wiki/underfitting); it lacks the capacity to learn the patterns |
| Both scores converge to a high value | The model generalizes well; additional data may not help much |
| Validation score keeps climbing as training size grows | More data is likely to improve the model further |

Empirical research shows that learning curves often follow a power law: performance improves rapidly at first, then the rate of improvement slows as data grows. Viering and Loog (2021) reviewed the shape of learning curves across many tasks and found that this power-law form is robust, with the exponent depending on the difficulty of the task and the capacity of the model [11]. For neural [language models](/wiki/large_language_model), Kaplan and colleagues (2020) and Hoffmann and colleagues (2022, the Chinchilla paper) showed that loss decreases as a power law in both data size and model size, with an optimal ratio between the two for a fixed compute budget [12] [13]. The Chinchilla authors found that model size and training-set size should grow in lockstep: "for every doubling of model size the number of training tokens should also be doubled" [13]. Their compute-optimal 70B-parameter model was trained on roughly 1.4 trillion tokens, about 20 tokens per parameter, and outperformed the 280B-parameter Gopher despite being four times smaller [13].
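The Chinchilla figures quoted above can be checked with simple arithmetic. The sketch treats "about 20 tokens per parameter" as an exact constant, which is a simplification; the function name is hypothetical.

```python
TOKENS_PER_PARAMETER = 20  # approximate ratio quoted above for Chinchilla

def compute_optimal_tokens(n_parameters, tokens_per_parameter=TOKENS_PER_PARAMETER):
    """Rule-of-thumb training-set size in tokens for a given model size."""
    return n_parameters * tokens_per_parameter

chinchilla_params = 70_000_000_000
gopher_params = 280_000_000_000

chinchilla_tokens = compute_optimal_tokens(chinchilla_params)  # 1.4 trillion tokens
size_ratio = gopher_params / chinchilla_params                  # Gopher is 4x larger
# Doubling the parameter count doubles the token budget under this rule.
doubled_tokens = compute_optimal_tokens(2 * chinchilla_params)
```

Under this linear rule, a model twice the size of Chinchilla would call for about 2.8 trillion training tokens.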

### When more data helps

More training data generally helps when the model has high variance (i.e., it overfits on small datasets), the problem is complex with many possible patterns to learn, or the feature space is high-dimensional. Deep [neural networks](/wiki/neural_network), which can have millions or billions of [parameters](/wiki/parameter), are particularly data-hungry and tend to improve steadily with larger training sets.

### When more data stops helping

Models eventually reach a plateau where additional data yields diminishing returns. This happens when the model has already captured the key patterns and further examples are largely redundant. At this point, improving data quality, feature engineering, or switching to a better model architecture may produce larger gains than simply adding more data. Recent evidence from [large language model](/wiki/large_language_model) scaling research suggests that high-quality data can outperform raw quantity, with smaller models trained on curated data sometimes matching the performance of larger models trained on noisier corpora.

## Does the order of training examples matter? (curriculum learning)

Classical training treats the examples in the training set as exchangeable: the model sees them in a random order, often reshuffled each epoch. [Curriculum learning](/wiki/curriculum_learning), proposed by Yoshua Bengio and colleagues in 2009, asks whether ordering matters. Their hypothesis was that humans and animals learn better when examples are presented in increasing order of difficulty, and that the same may hold for machine learning models. The 2009 ICML paper showed that on shape recognition and language modeling tasks, starting with easier examples and gradually adding harder ones improved both convergence speed and the quality of the local minimum found [6].

Curriculum learning has since been applied throughout machine learning. Self-paced learning lets the model itself decide which examples are easy enough to include at each stage. Reverse curricula start with hard examples and add easier ones for fine-tuning. In [large language model](/wiki/large_language_model) training, curricula are used to schedule data mixtures over the course of pretraining, often starting with short, clean sequences and adding longer or noisier ones later. Variants such as anti-curriculum and uniform sampling sometimes outperform straightforward easy-to-hard schedules, and the effect size depends heavily on how difficulty is measured.
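An easy-to-hard schedule can be sketched as a sequence of growing training pools. The difficulty measure used here (sentence length) and the stage count are assumptions for illustration; real curricula use task-specific difficulty scores.

```python
def curriculum_stages(examples, difficulty, n_stages=3):
    """Yield growing training pools, easiest examples first."""
    ordered = sorted(examples, key=difficulty)
    for stage in range(1, n_stages + 1):
        cutoff = round(len(ordered) * stage / n_stages)
        yield ordered[:cutoff]  # each stage keeps earlier examples and adds harder ones

# Hypothetical difficulty measure: longer sentences are treated as harder.
sentences = ["a cat", "the quick brown fox jumps", "hi",
             "dogs bark loudly", "to be or not to be here", "ok then"]
stages = list(curriculum_stages(sentences, difficulty=len, n_stages=3))
```

Negating the difficulty score (for example `difficulty=lambda s: -len(s)`) yields the hard-to-easy ordering of an anti-curriculum from the same function.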

## How do active and semi-supervised learning shape the training set?

When labeled examples are expensive but unlabeled examples are cheap, the structure of the training set itself becomes a design decision. Two related families of methods address this regime.

[Active learning](/wiki/active_learning) iteratively grows the training set by querying a human oracle for labels on the examples the model is most uncertain about. The model is trained, scores all unlabeled examples by uncertainty (or by expected information gain, query-by-committee disagreement, or another acquisition function), and the top examples are sent for labeling. Studies in semantic segmentation and single-cell biology have shown that active learning can roughly halve the amount of labeled data required to reach a target accuracy, compared to random sampling.
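One round of uncertainty sampling can be sketched as below, using prediction entropy as the acquisition function. The pool of model outputs is hypothetical, and the function names are illustrative.

```python
import math

def entropy(probabilities):
    """Shannon entropy of a predicted class distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

def select_queries(unlabeled_predictions, batch_size):
    """Return ids of the unlabeled examples with the most uncertain predictions."""
    ranked = sorted(unlabeled_predictions,
                    key=lambda item: entropy(item[1]),
                    reverse=True)
    return [example_id for example_id, _ in ranked[:batch_size]]

# Hypothetical model outputs: (example id, predicted class probabilities).
pool = [("a", [0.99, 0.01]), ("b", [0.55, 0.45]),
        ("c", [0.80, 0.20]), ("d", [0.50, 0.50])]
to_label = select_queries(pool, batch_size=2)
```

The two selected examples are the ones whose predictions are closest to a coin flip; after an annotator labels them, they join the training set and the model is retrained.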

[Semi-supervised learning](/wiki/semi_supervised_learning) trains on a mix of labeled and unlabeled examples, typically by generating pseudo-labels for the unlabeled portion or by enforcing consistency under augmentation. Methods like FixMatch combine both ideas: a strong-augmentation prediction is required to match a confident pseudo-label from a weak augmentation. Self-supervised pretraining followed by supervised fine-tuning can be viewed as an extreme form of this strategy, where the unlabeled stage produces a strong representation that the labeled stage refines.
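The pseudo-labeling step can be sketched as a confidence filter over predictions on weakly augmented inputs. The threshold of 0.95 and the probability values are assumptions for illustration, and the sketch omits the loss on the strongly augmented view.

```python
def confident_pseudo_labels(weak_predictions, threshold=0.95):
    """Map example index -> pseudo-label for predictions above the threshold."""
    pseudo = {}
    for index, probabilities in enumerate(weak_predictions):
        confidence = max(probabilities)
        if confidence >= threshold:
            pseudo[index] = probabilities.index(confidence)
    return pseudo

# Hypothetical class probabilities predicted on weakly augmented unlabeled inputs.
weak_predictions = [[0.97, 0.02, 0.01],
                    [0.40, 0.35, 0.25],
                    [0.01, 0.03, 0.96]]
targets = confident_pseudo_labels(weak_predictions)
```

Only the first and third examples pass the filter. In a FixMatch-style setup, their pseudo-labels would serve as targets for the model's predictions on strongly augmented versions of the same inputs, while the uncertain second example contributes nothing to the unlabeled loss.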

## What makes a high-quality training set?

The principle of "garbage in, garbage out" applies directly to training sets. A model can only learn patterns that exist in its training data, so the quality of that data determines the ceiling on model performance. This is the core insight behind the data-centric AI movement: as Andrew Ng put it, the field's "dominant paradigm over the last decade was to download the data set while you focus on improving the code," and the proposed shift is toward "systematically engineering the data needed to successfully build an AI system" [32].

### Key quality dimensions

| Dimension | Description |
|---|---|
| **Accuracy** | Labels must be correct. Mislabeled examples teach the model wrong associations. |
| **Completeness** | Missing values or incomplete records can introduce noise and reduce learning effectiveness. |
| **Consistency** | Contradictory examples (same input mapped to different outputs) confuse the learning algorithm. |
| **Relevance** | The training data should reflect the actual conditions and distribution the model will encounter in production. |
| **Freshness** | Outdated data can lead to models that do not reflect current patterns or trends. |
| **Diversity** | Coverage of edge cases, rare classes, and underrepresented subgroups. |
| **Provenance** | A clear, auditable record of where each example came from and how it was labeled. |

### Data preprocessing

Before training, data typically goes through several [preprocessing](/wiki/preprocessing) steps: cleaning (removing duplicates, correcting errors), handling missing values (imputation or removal), [normalization](/wiki/normalization) or standardization of numerical features, encoding categorical variables, and outlier detection. These steps can significantly improve the effectiveness of the training set without adding a single new example. As noted above, all preprocessing must be fit only on the training portion of the data to avoid leakage.

### Labeling and label noise

For supervised tasks, labels are usually produced by humans, which introduces both cost and noise. Inter-annotator agreement (often measured with Cohen's kappa or Fleiss' kappa) gives a rough ceiling on the accuracy a model can achieve, since systematic disagreement between humans means the labels themselves are not consistent. Common strategies for managing label noise include using multiple annotators per example with majority voting, training on noisy labels with loss functions robust to label noise, and post-hoc cleaning with confident learning. Tools such as cleanlab automate the detection of likely label errors in standard datasets like ImageNet, where studies have estimated that several percent of labels are incorrect.
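Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance from their label frequencies. The sketch below computes it for a hypothetical pair of annotators; it assumes the two do not agree purely by construction (expected agreement below 1), otherwise the denominator would be zero.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same class independently.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical annotations of ten images by two people.
annotator_1 = ["cat", "cat", "dog", "dog", "cat", "dog", "cat", "dog", "cat", "dog"]
annotator_2 = ["cat", "cat", "dog", "dog", "cat", "dog", "cat", "dog", "dog", "cat"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

The annotators agree on 8 of 10 items (0.8 observed), chance agreement is 0.5, and kappa is therefore (0.8 − 0.5) / (1 − 0.5) = 0.6.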

## How does training set bias arise, and how is it mitigated?

A training set is biased when it does not accurately represent the population or conditions the model will encounter in deployment. Bias in training data leads to models that perform well on certain groups or scenarios but poorly on others [24].

### Common types of bias

| Bias Type | Description | Example |
|---|---|---|
| **Selection bias** | The training data is not randomly sampled from the target population | A hiring model trained only on data from one company |
| **Sampling bias** | Certain groups are over- or underrepresented | A facial recognition model trained mostly on light-skinned faces |
| **Temporal bias** | The training data reflects conditions from a specific time period that may not hold in the future | A credit scoring model trained on pre-pandemic financial data |
| **Measurement bias** | Systematic errors in how data was collected or labeled | Inconsistent labeling criteria across different annotators |
| **Historical bias** | The training data reflects existing societal prejudices | A language model trained on text that contains gender stereotypes |
| **Confirmation bias** | The data is filtered through a model whose mistakes propagate forward | Active learning queries that miss a region of input space the model never asks about |

### Mitigating bias

Practitioners can reduce training set bias by collecting more diverse and representative data, applying stratified sampling to ensure all subgroups are proportionally included, auditing datasets for demographic and distributional imbalances, using [data augmentation](/wiki/data_augmentation) to synthetically increase representation of underrepresented groups, and implementing fairness-aware preprocessing techniques. Documentation practices such as Datasheets for Datasets (Gebru et al., 2018) and Data Cards (Pushkarna et al., 2022) encourage explicit reporting of dataset provenance, intended use, and known limitations [25].

## What training data do large language models use?

The training sets used for modern [large language models](/wiki/large_language_model) (LLMs) are orders of magnitude larger than those used in traditional machine learning. These models consume trillions of tokens from diverse text sources during pretraining.

### Primary data sources

| Source | Description | Scale |
|---|---|---|
| **[Common Crawl](/wiki/common_crawl)** | Petabytes of raw web data extracted from billions of web pages, updated monthly | Hundreds of billions of tokens per snapshot |
| **Wikipedia** | Structured encyclopedia articles across hundreds of languages | About 6.8 million English articles, around 4.7 billion words |
| **Books** | BookCorpus (around 11,000 books), Project Gutenberg (around 70,000 public domain books) | Tens of billions of tokens |
| **Code** | GitHub repositories, Stack Overflow, Jupyter notebooks | StarCoder dataset: 783 GB across 86 languages |
| **Scientific papers** | arXiv, PubMed, Semantic Scholar | Billions of tokens of technical text |
| **Curated collections** | [The Pile](/wiki/pile) (825 GiB), [RedPajama](/wiki/red_pajama) (1.2 trillion tokens), [FineWeb](/wiki/fineweb) | Purpose-built for LLM training |

### Major open pretraining corpora

| Corpus | Year | Size | Notes |
|---|---|---|---|
| **C4** | 2019 | About 156 billion tokens | Filtered Common Crawl, built for T5 |
| **The Pile** | 2020 | 825 GiB | 22 sub-datasets including ArXiv, PubMed, GitHub |
| **RedPajama** | 2023 | 1.2 trillion tokens | Open replication of the LLaMA data mix |
| **RefinedWeb** | 2023 | 5 trillion tokens (600B public) | Web-only, used for Falcon |
| **Dolma** | 2024 | 3 trillion tokens | Used for OLMo, AI2 release with full transparency |
| **FineWeb** | 2024 | About 15 trillion tokens | Cleaned and deduplicated Common Crawl, 96 dumps |
| **FineWeb-Edu** | 2024 | 1.3 trillion tokens | Subset of FineWeb filtered for educational value |
| **DCLM-Baseline** | 2024 | About 4 trillion tokens | Built from a 240T token DataComp-LM pool with model-based filtering |

GPT-3 was trained on a mixture of approximately 300 billion tokens drawn from Common Crawl (filtered), WebText2, Books1, Books2, and Wikipedia [14]. LLaMA drew from Common Crawl, C4, GitHub, Wikipedia, books, arXiv, and Stack Exchange. These training sets are carefully weighted: higher-quality sources like Wikipedia and books are often sampled multiple times per epoch, while noisier web data is downsampled.

The Dolma paper (Soldaini et al., 2024, arXiv:2402.00159) was notable for releasing both the corpus and the full data pipeline, allowing other researchers to reproduce filtering decisions [17]. FineWeb (Penedo et al., 2024, arXiv:2406.17557) demonstrated that careful filtering of Common Crawl can match or beat curated mixes [18]. The DataComp-LM benchmark (Li et al., 2024, arXiv:2406.11794) treats data curation as an experimental science: participants run controlled comparisons of filtering and mixing strategies on a fixed 240 trillion token pool, with model quality on 53 downstream tasks as the metric [19].

Research projections by Villalobos and colleagues (2022) suggest that publicly available, high-quality human-generated text could be largely exhausted between 2026 and 2032, which is driving interest in synthetic data generation and more efficient data curation methods [26].

### Instruction tuning datasets

After pretraining, large language models are usually fine-tuned on instruction-following data. The training set in this stage is much smaller (tens of thousands to a few million examples) but its quality has an outsized effect on the resulting model.

| Dataset | Year | Approximate Size | Notes |
|---|---|---|---|
| **FLAN / FLAN v2** | 2021, 2022 | About 1.4 million examples | Mixture of NLP tasks reformulated as instructions |
| **Self-Instruct / Alpaca** | 2022, 2023 | 52,000 examples | Generated with GPT-3 (Self-Instruct) and text-davinci-003 (Alpaca) |
| **OpenHermes** | 2023 | About 1 million examples | Curated mix of instruction sources |
| **OpenOrca** | 2023 | About 4.1 million examples | FLAN augmented with GPT-4 reasoning traces |
| **Tülu 2 / Tülu 3** | 2023, 2024 | Hundreds of thousands of examples | Open recipes for instruction tuning at AI2 |
| **No Robots** | 2023 | 10,000 examples | Entirely human-written |

Self-Instruct (Wang et al., 2022, arXiv:2212.10560) showed that a strong base model can generate its own instruction data. Starting from 175 seed tasks, the authors used GPT-3 to bootstrap 52,000 instructions; fine-tuning GPT-3 on this data improved it by 33 percentage points over the original model on Super-NaturalInstructions [20]. Stanford Alpaca followed the same recipe with text-davinci-003 and released the resulting instruction set, which became a template for many later projects.
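
The bootstrapping loop can be sketched as follows. This is a simplified illustration, not the authors' pipeline: `generate` is a hypothetical stand-in for a call to a language model, `toy_generate` is a fixed stub used only to make the example runnable, and the string-similarity filter with its 0.7 threshold is an assumed way of keeping the pool diverse.

```python
import random
from difflib import SequenceMatcher


def too_similar(candidate, pool, threshold=0.7):
    """True if the candidate closely matches an instruction already in the pool."""
    return any(
        SequenceMatcher(None, candidate.lower(), existing.lower()).ratio() >= threshold
        for existing in pool
    )


def bootstrap(seed_tasks, generate, target_size, num_demos=3, max_rounds=100, seed=0):
    """Grow an instruction pool by prompting a generator with sampled demonstrations."""
    rng = random.Random(seed)
    pool = list(seed_tasks)
    for _ in range(max_rounds):
        if len(pool) >= target_size:
            break
        demos = rng.sample(pool, k=min(num_demos, len(pool)))
        added = 0
        for candidate in generate(demos):
            candidate = candidate.strip()
            if len(pool) < target_size and candidate and not too_similar(candidate, pool):
                pool.append(candidate)
                added += 1
        if added == 0:  # the generator produced nothing new, so stop early
            break
    return pool


def toy_generate(demos):
    """Hypothetical stub for a model call: ignores the demos, returns fixed candidates."""
    verbs = ["Summarize", "Translate", "Classify", "Rewrite", "Explain"]
    objects = ["this paragraph", "the following email", "a product review"]
    return [f"{verb} {obj}." for verb in verbs for obj in objects]


seeds = ["Write a haiku about autumn.", "List three prime numbers."]
pool = bootstrap(seeds, toy_generate, target_size=6)
print(pool)
```

In a real pipeline the generator would be prompted with the sampled demonstrations, and the accepted instructions would then be paired with generated outputs before fine-tuning.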

### Preference datasets for alignment

Reinforcement learning from human feedback ([RLHF](/wiki/reinforcement_learning_from_human_feedback)) and direct preference optimization ([DPO](/wiki/direct_preference_optimization)) require a different kind of training set: pairs of model responses with a label indicating which one is preferred [23]. Major preference datasets include:

| Dataset | Year | Size | Notes |
|---|---|---|---|
| **Anthropic HH-RLHF** | 2022 | About 161,000 comparisons | Human preferences over helpfulness and harmlessness |
| **OpenAssistant** | 2023 | About 161,000 messages | Human-generated conversations and rankings |
| **UltraFeedback** | 2023 | About 64,000 prompts, 256,000 completions | Multi-dimensional GPT-4 feedback (instruction following, truthfulness, honesty, helpfulness) |
| **Nectar** | 2023 | About 183,000 prompts | Used for the Starling models, with GPT-4 ranking responses from many models |
| **PKU-SafeRLHF** | 2023 | About 30,000 examples | Safety-focused harmlessness preferences |

UltraFeedback was widely adopted as a training set for direct preference optimization because its multi-dimensional scores allow more nuanced reward signals than binary preferences alone [22].
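
One way multi-dimensional scores can be turned into the chosen/rejected pairs that preference optimization expects is sketched below. The record layout, the field names, the averaging of aspect scores, and the best-versus-worst pairing rule are all assumptions made for illustration; they are not the official UltraFeedback schema or processing recipe.

```python
from statistics import mean

# Hypothetical record: one prompt, several completions, each scored per aspect.
record = {
    "prompt": "Explain what a training set is.",
    "completions": [
        {"text": "answer A", "scores": {"instruction_following": 5, "truthfulness": 5,
                                        "honesty": 4, "helpfulness": 5}},
        {"text": "answer B", "scores": {"instruction_following": 3, "truthfulness": 4,
                                        "honesty": 3, "helpfulness": 3}},
        {"text": "answer C", "scores": {"instruction_following": 1, "truthfulness": 2,
                                        "honesty": 2, "helpfulness": 1}},
    ],
}


def overall(completion):
    """Collapse the per-aspect scores of one completion into a single number."""
    return mean(completion["scores"].values())


def to_preference_pair(record, min_margin=0.5):
    """Return a chosen/rejected pair, or None if the completions are too close."""
    ranked = sorted(record["completions"], key=overall, reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    margin = overall(chosen) - overall(rejected)
    if margin < min_margin:
        return None
    return {
        "prompt": record["prompt"],
        "chosen": chosen["text"],
        "rejected": rejected["text"],
        "margin": margin,  # could be kept as a graded signal instead of a binary label
    }


pair = to_preference_pair(record)
print(pair)
```

Keeping the margin alongside the pair preserves some of the graded information that a plain binary label would discard.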

## What is synthetic training data?

Synthetic training data is artificially generated data that mimics the statistical properties of real-world data. It is produced using rule-based systems, simulation engines, or generative models such as [GANs](/wiki/generative_adversarial_network_gan) and [diffusion models](/wiki/diffusion_model).

Synthetic data is useful when real data is scarce, expensive, or restricted by privacy rules. Medical imaging datasets can be augmented with synthetic scans to cover rare pathologies, and autonomous driving systems use simulated environments to generate training data for edge cases that are dangerous or impractical to capture on real roads.

### Model-generated training data

A distinct strand of synthetic data uses one model to produce training data for another. The Microsoft Phi series, beginning with the 2023 paper "Textbooks Are All You Need" (Gunasekar et al., arXiv:2306.11644), trained a 1.3 billion parameter code model on 6 billion tokens of "textbook quality" web data plus 1 billion tokens of synthetic textbooks and exercises generated with GPT-3.5. The resulting model, Phi-1, reached 50.6% pass@1 on HumanEval despite being more than ten times smaller than competitors trained on far more data [21]. The follow-on Phi-1.5 and Phi-2 models extended the synthetic data approach, and similar techniques are now standard in fine-tuning pipelines for instruction following and reasoning.

Knowledge distillation is another form of model-generated training data: a smaller "student" model is trained on the soft outputs of a larger "teacher" model rather than on hard labels, often with much smaller datasets.
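
A minimal sketch of training on soft outputs is shown below. It uses a common formulation, assumed here rather than taken from this article: both sets of logits are softened with a temperature, and the student is penalized by the KL divergence from the teacher's distribution. The logits and the temperature value are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature gives softer outputs."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - largest) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) computed on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


teacher = [4.0, 1.5, 0.2]   # hypothetical teacher logits for one example
student = [2.0, 2.0, 0.5]   # hypothetical student logits for the same example

print(softmax(teacher, temperature=2.0))    # soft label, not a one-hot hard label
print(distillation_loss(student, teacher))  # positive while the two disagree
```

The soft label carries information about how the teacher ranks the wrong classes, which is part of why distillation can work with fewer examples than training on hard labels.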

### Limitations of synthetic data

If the generative process does not capture the full complexity of real-world data, models trained on synthetic data may underperform in deployment. Validating the quality of synthetic data is itself a non-trivial challenge. Recent research on "model collapse" (Shumailov et al., 2024) has shown that recursively training on outputs of earlier model generations can degrade quality and diversity over time, since rare modes of the original distribution are gradually lost [27].
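
The loss of rare modes can be illustrated with a toy simulation, which is an assumption-laden sketch and not the experimental setup of the cited paper. Each "generation" samples a finite dataset from the current categorical distribution and re-estimates the distribution from that sample, so any category that happens to receive zero samples disappears for good. The category names, probabilities, and sample size are hypothetical.

```python
import random
from collections import Counter


def next_generation(distribution, sample_size, rng):
    """Sample from the current distribution, then re-estimate it from the sample."""
    categories = list(distribution)
    weights = [distribution[c] for c in categories]
    sample = rng.choices(categories, weights=weights, k=sample_size)
    counts = Counter(sample)
    # Categories with zero samples are dropped and can never reappear.
    return {c: n / sample_size for c, n in counts.items()}


def simulate(distribution, generations, sample_size, seed=0):
    """Return the number of surviving categories after each generation."""
    rng = random.Random(seed)
    support_sizes = [len(distribution)]
    for _ in range(generations):
        distribution = next_generation(distribution, sample_size, rng)
        support_sizes.append(len(distribution))
    return support_sizes


# One common mode plus ten rare modes (hypothetical values).
original = {"common": 0.90, **{f"rare_{i}": 0.01 for i in range(10)}}
sizes = simulate(original, generations=20, sample_size=50)
print(sizes)
```

The number of surviving categories can only stay the same or shrink from one generation to the next, which mirrors the gradual loss of tail behavior described above.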

## What are the legal and ethical issues with training data?

The data that goes into a training set is not just a technical asset; it is also a legal and ethical one. Several issues have come to the foreground as training datasets have grown.

### Copyright and fair use

The most prominent legal dispute is The New York Times v. OpenAI and Microsoft, filed in December 2023 [30]. The Times alleges that millions of its articles were used without license to train OpenAI models, and that those models can produce near-verbatim reproductions of its journalism. OpenAI and Microsoft argue fair use. The case entered discovery in 2024, and in May 2025 a preservation order required OpenAI to retain output logs that might evidence reproduction. As of early 2026, summary judgment on the fair-use question is not expected before the summer of 2026 at the earliest. Dozens of similar suits have been filed by authors, music publishers, and other rights holders, making training-set provenance a significant business and legal concern.

### Privacy and the right to erasure

GDPR Article 17 (the right to erasure) gives individuals the right to demand deletion of personal data. Applied to a trained model, this right is technically difficult to honor: model parameters encode something about every example seen during training, but there is no straightforward way to remove a single individual's contribution short of retraining from scratch. The CCPA in California offers a comparable right to deletion. "[Machine unlearning](/wiki/machine_unlearning)" is an active research area aimed at approximate erasure: techniques include retraining only affected subsets, certified unlearning with formal guarantees, and influence-function-based corrections. None of these is fully general yet, and most commercial systems handle erasure requests by removing source data and committing to retraining at the next scheduled cycle.
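
The idea of retraining only affected subsets can be sketched with a toy sharded ensemble. Everything here is a simplifying assumption: the "model" for each shard is just the mean of its values, users are identified by hypothetical integer IDs, and the class name `ShardedModel` is invented for this example. The point is only that an erasure request triggers retraining of one shard and leaves the others untouched.

```python
from statistics import mean


class ShardedModel:
    """Toy ensemble with one 'model' (here just a mean) per disjoint data shard."""

    def __init__(self, examples, num_shards):
        # examples: list of (user_id, value) pairs; user_id is an integer
        self.shards = [[] for _ in range(num_shards)]
        for user_id, value in examples:
            self.shards[self._shard_of(user_id)].append((user_id, value))
        self.models = [self._train(shard) for shard in self.shards]

    def _shard_of(self, user_id):
        # Deterministic assignment, so a user's data always lives in one shard.
        return user_id % len(self.shards)

    @staticmethod
    def _train(shard):
        return mean(value for _, value in shard) if shard else None

    def predict(self):
        """Aggregate the per-shard models (assumes at least one shard is non-empty)."""
        return mean(m for m in self.models if m is not None)

    def forget(self, user_id):
        """Drop one user's examples and retrain only the shard that held them."""
        idx = self._shard_of(user_id)
        self.shards[idx] = [(u, v) for u, v in self.shards[idx] if u != user_id]
        self.models[idx] = self._train(self.shards[idx])
        return idx


examples = [(0, 1.0), (1, 2.0), (2, 3.0), (3, 4.0), (4, 5.0)]
model = ShardedModel(examples, num_shards=2)
retrained_shard = model.forget(4)
print(retrained_shard, model.models)
```

After `forget`, the per-shard models match what training from scratch on the remaining data would produce, at the cost of retraining a single shard.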

### Opt-out and data governance

Crawler operators and data publishers have introduced several opt-out mechanisms for AI training. Site owners can use robots.txt rules addressed to specific crawler user agents, such as `CCBot` (Common Crawl) and `GPTBot` (OpenAI), to block collection; the IETF has discussed an `ai.txt` extension; and the Hugging Face hub supports per-dataset access controls. Many large web platforms (Reddit, Stack Exchange, X/Twitter) have moved to license access to their archives rather than allowing free scraping. For teams building training sets, the practical implication is that the legal status of a corpus has to be tracked example by example, not just at the dataset level.
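
As a sketch of how robots.txt opt-outs are evaluated, Python's standard-library `urllib.robotparser` can check whether a given crawler may fetch a URL. The robots.txt content and the example.com URLs below are hypothetical; `GPTBot` and `CCBot` are the user-agent tokens of OpenAI's and Common Crawl's crawlers, and `SomeOtherBot` is a made-up name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler entirely, restrict another
# to public pages, and leave everything open to all other agents.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /private/

User-agent: *
Allow: /
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("GPTBot", "CCBot", "SomeOtherBot"):
    for path in ("/article.html", "/private/data.html"):
        url = f"https://example.com{path}"
        print(agent, path, allowed(ROBOTS_TXT, agent, url))
```

A dataset builder could record the result of such a check alongside each collected document, which is one way to keep provenance at the example level.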

## Data versioning and tooling

Training sets are rarely static. New examples are added, labels are corrected, and earlier subsets are deprecated. Reproducible machine learning therefore requires tracking which exact version of a dataset was used to train each model. Several tools and practices have emerged:

| Tool | Role |
|---|---|
| **[DVC](/wiki/dvc)** (Data Version Control) | Git-like versioning of large data and model files, with cloud storage backends |
| **[Hugging Face Datasets](/wiki/hugging_face_datasets)** | Hub-hosted datasets with revision SHAs, dataset cards, and streaming access |
| **lakeFS** | Git-like branching, merging, and rollback for object stores |
| **MLflow / Weights & Biases** | Experiment tracking that logs dataset hashes alongside model artifacts |
| **Pachyderm** | Data pipelines with content-addressed storage and lineage |
| **Datasheets and Data Cards** | Structured documentation of dataset purpose, composition, and limitations |

DVC and Hugging Face Datasets can be used together: DVC exposes an fsspec-compatible filesystem, so DVC-tracked files can be loaded into the Datasets library through a `dvc://` URL. The combination of Git for code, DVC or lakeFS for data, and a model registry for trained artifacts is a common pattern for reproducible training pipelines.
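
The underlying idea of pinning a model to an exact dataset version can be illustrated without any of these tools. The sketch below is a hypothetical helper, not the hashing scheme used by DVC, lakeFS, or MLflow: it hashes each example in a canonical JSON form and combines the sorted hashes, so the fingerprint ignores example order but changes when any label or field changes.

```python
import hashlib
import json


def example_hash(example):
    """Hash one example in a canonical form (sorted keys, UTF-8)."""
    canonical = json.dumps(example, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dataset_fingerprint(examples):
    """Return a SHA-256 hex digest that is stable under example reordering."""
    digest = hashlib.sha256()
    for h in sorted(example_hash(ex) for ex in examples):
        digest.update(h.encode("ascii"))
    return digest.hexdigest()


v1 = [
    {"text": "a photo of a cat", "label": "cat"},
    {"text": "a photo of a dog", "label": "dog"},
]
v2 = [
    {"text": "a photo of a cat", "label": "cat"},
    {"text": "a photo of a dog", "label": "cat"},  # one label changed
]

print(dataset_fingerprint(v1))
print(dataset_fingerprint(v2))
```

Logging such a fingerprint next to each trained model makes it possible to tell later whether two runs saw the same data.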

## Explain like I'm 5 (ELI5)

Imagine you are learning to tell the difference between cats and dogs by looking at pictures. Your parent shows you a big stack of photos, each labeled "cat" or "dog." That stack is the training set. You study the photos and start to notice things: cats have pointy ears, dogs have floppy ears, and so on.

After you finish studying, your parent gives you a new batch of photos you have never seen before and asks, "Is this a cat or a dog?" Those new photos are the test set. The better and more varied your study stack was, the better you will be on the new batch. If all the cats in your study stack were orange tabbies, you might not recognize a black cat. And if your parent accidentally mixed one of the test photos into your study stack, you would ace that one photo only because you had already seen the answer, which is why training and test photos must stay strictly separate.

## References

1. Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press. Chapter 5: Machine Learning Basics. https://www.deeplearningbook.org/
2. Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning*, 2nd ed. Springer. Chapter 7: Model Assessment and Selection.
3. Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer.
4. Joseph, V. R. (2022). "Optimal ratio for data splitting." *Statistical Analysis and Data Mining*, 15(4), 531-538. https://doi.org/10.1002/sam.11583
5. Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). "SMOTE: Synthetic Minority Over-sampling Technique." *Journal of Artificial Intelligence Research*, 16, 321-357. https://www.jair.org/index.php/jair/article/view/10302
6. Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). "Curriculum learning." *Proceedings of the 26th International Conference on Machine Learning*. https://dl.acm.org/doi/10.1145/1553374.1553380
7. Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2018). "mixup: Beyond Empirical Risk Minimization." *International Conference on Learning Representations*.
8. Yun, S. et al. (2019). "CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features." *ICCV 2019*.
9. Wei, J., & Zou, K. (2019). "EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks." *EMNLP-IJCNLP 2019*.
10. Shorten, C., & Khoshgoftaar, T. M. (2019). "A survey on Image Data Augmentation for Deep Learning." *Journal of Big Data*, 6, 60.
11. Viering, T., & Loog, M. (2021). "The Shape of Learning Curves: A Review." *arXiv:2103.10948*.
12. Kaplan, J. et al. (2020). "Scaling Laws for Neural Language Models." *arXiv:2001.08361*.
13. Hoffmann, J. et al. (2022). "Training Compute-Optimal Large Language Models." *arXiv:2203.15556*. (the Chinchilla paper)
14. Brown, T. et al. (2020). "Language Models are Few-Shot Learners." *Advances in Neural Information Processing Systems*, 33, 1877-1901.
15. Gao, L. et al. (2020). "The Pile: An 800GB Dataset of Diverse Text for Language Modeling." *arXiv:2101.00027*.
16. Penedo, G. et al. (2023). "The RefinedWeb Dataset for Falcon LLM." *arXiv:2306.01116*.
17. Soldaini, L. et al. (2024). "Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research." *arXiv:2402.00159*.
18. Penedo, G. et al. (2024). "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale." *arXiv:2406.17557*.
19. Li, J. et al. (2024). "DataComp-LM: In search of the next generation of training sets for language models." *arXiv:2406.11794*.
20. Wang, Y. et al. (2022). "Self-Instruct: Aligning Language Models with Self-Generated Instructions." *arXiv:2212.10560*.
21. Gunasekar, S. et al. (2023). "Textbooks Are All You Need." *arXiv:2306.11644*.
22. Cui, G. et al. (2023). "UltraFeedback: Boosting Language Models with High-quality Feedback." *arXiv:2310.01377*.
23. Bai, Y. et al. (2022). "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback." *arXiv:2204.05862*. (HH-RLHF)
24. Suresh, H., & Guttag, J. (2021). "A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle." *Equity and Access in Algorithms, Mechanisms, and Optimization*.
25. Gebru, T. et al. (2018). "Datasheets for Datasets." *arXiv:1803.09010*.
26. Villalobos, P. et al. (2022). "Will we run out of data? Limits of LLM scaling based on human-generated data." *arXiv:2211.04325*.
27. Shumailov, I. et al. (2024). "AI models collapse when trained on recursively generated data." *Nature*, 631, 755-759.
28. Scikit-learn developers (2026). "3.1. Cross-validation: evaluating estimator performance." *scikit-learn 1.8 documentation*. https://scikit-learn.org/stable/modules/cross_validation.html
29. Scikit-learn developers (2026). "11. Common pitfalls and recommended practices." *scikit-learn 1.8 documentation*. https://scikit-learn.org/stable/common_pitfalls.html
30. The New York Times Company v. Microsoft Corp. and OpenAI, Inc., No. 1:23-cv-11195 (S.D.N.Y. filed Dec. 27, 2023).
31. Mitchell, T. M. (1997). *Machine Learning*. McGraw-Hill, p. 2.
32. Strickland, E. (2022). "Andrew Ng: Unbiggen AI." *IEEE Spectrum*, February 9, 2022. https://spectrum.ieee.org/andrew-ng-data-centric-ai

