Training Set
Last reviewed
May 9, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v5 · 6,491 words
See also: Machine learning terms
A training set is the portion of a dataset used to fit a machine learning model. During training, the model processes the examples in this set to learn patterns, adjust its internal parameters, and build the mathematical relationships needed to make predictions on new, unseen data. The training set is one of three standard data partitions in supervised machine learning, alongside the validation set and the test set.
The quality, size, and representativeness of the training set have a direct impact on a model's accuracy, generalization ability, and fairness. Poorly constructed training sets can lead to overfitting, bias, and unreliable predictions in production. The phrase "garbage in, garbage out" is a useful summary: a model can only learn patterns that exist in its training data, so practitioners spend a significant fraction of any project on collecting, cleaning, labeling, and validating the data that goes into training rather than on the model architecture itself.
In a typical supervised machine learning workflow, a practitioner starts with a full dataset of labeled examples. Each example consists of input features paired with a known output (the label or target). The dataset is then split into non-overlapping subsets:
| Subset | Purpose | When Used |
|---|---|---|
| Training set | Teaches the model by exposing it to labeled examples so it can learn patterns and adjust weights | During model training |
| Validation set | Provides feedback for tuning hyperparameters and detecting overfitting | During model development |
| Test set | Gives an unbiased estimate of final model performance on data it has never seen | After training is complete |
The model iterates over the training set multiple times (each full pass is called an epoch). In each pass, it computes a loss function that measures prediction errors, then uses an optimizer (such as stochastic gradient descent) to update its parameters in the direction that reduces the loss. This cycle repeats until the model converges or a stopping criterion is met.
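The epoch, loss, and optimizer cycle described above can be sketched with mini-batch stochastic gradient descent on a linear model. This is a minimal illustration on synthetic data using NumPy; the learning rate, batch size, and epoch count are arbitrary assumptions chosen for the example, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set: y = 3*x0 - 2*x1 + noise (illustrative data only).
X_train = rng.normal(size=(200, 2))
true_w = np.array([3.0, -2.0])
y_train = X_train @ true_w + rng.normal(scale=0.1, size=200)

w = np.zeros(2)        # model parameters, initialised at zero
learning_rate = 0.1    # assumed value for this sketch
batch_size = 20
n_epochs = 50

for epoch in range(n_epochs):                   # one epoch = one full pass
    order = rng.permutation(len(X_train))       # reshuffle the examples each epoch
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        error = X_train[idx] @ w - y_train[idx]
        grad = 2 * X_train[idx].T @ error / len(idx)   # gradient of mean squared error
        w -= learning_rate * grad                      # step that reduces the loss

final_loss = np.mean((X_train @ w - y_train) ** 2)
print(w, final_loss)
```

A real training loop would usually also track validation loss after each epoch and stop when it no longer improves.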
For most supervised tasks the training set is fixed once the project starts. In reinforcement learning the equivalent is a stream of episodes generated by interaction with an environment, and in self-supervised learning the labels are derived from the data itself (such as predicting the next token in a sequence). The discussion below focuses on supervised settings, though most of the same principles apply once the training distribution is fixed.
There is no single "correct" ratio for dividing data. The best split depends on the total dataset size, the complexity of the model, and the number of hyperparameters to tune. The following ratios are widely used in practice:
| Split Scheme | Training | Validation | Test | Best For |
|---|---|---|---|---|
| 80 / 10 / 10 | 80% | 10% | 10% | Large datasets (100k+ samples) |
| 70 / 15 / 15 | 70% | 15% | 15% | Medium datasets (10k to 100k samples) |
| 60 / 20 / 20 | 60% | 20% | 20% | Smaller datasets or complex models |
| 98 / 1 / 1 | 98% | 1% | 1% | Very large datasets (millions of examples) |
When only a train/test split is needed (without a separate validation set), common ratios are 80/20 or 75/25. In deep learning with very large datasets, practitioners sometimes allocate as much as 98% to training, because even 1% of a massive dataset provides thousands of validation or test examples. Joseph (2022) studied the question of optimal ratios formally and showed that for linear regression the optimal training-to-test ratio is roughly √p : 1, where p is the number of model parameters. For a model with 100 parameters this gives a test fraction of about 9%, smaller than the rule-of-thumb 20%.
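A three-way 80 / 10 / 10 split can be produced by calling scikit-learn's `train_test_split` twice. The sketch below uses placeholder arrays and passes integer counts rather than fractions so the subset sizes are exact; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)   # placeholder features
y = np.arange(1000) % 2              # placeholder labels

n_test = len(X) // 10                # 10% for the test set
n_val = len(X) // 10                 # 10% for the validation set

# First carve off the test set, then split the remainder into train and validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=n_test, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=n_val, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```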
The purpose of the validation set is to provide a feedback signal for choices made by the human or by an automated search procedure: model architecture, learning rate, regularization strength, the number of training epochs, and so on. Every time a hyperparameter is changed because of validation performance, the validation set leaks a small amount of information into the model. After many iterations the validation score becomes optimistically biased, in much the same way that running thousands of significance tests will eventually find a spurious result. The test set is held back and ideally consulted only once at the end, so its score is a faithful estimate of generalization to new data. When test performance is repeatedly checked during development, practitioners say the test set has been "burned" and a new held-out set must be collected.
How data is divided matters as much as the ratio itself. The three main strategies are random splitting, stratified splitting, and temporal splitting. The right choice depends on the structure of the data and the prediction task.
Random splitting shuffles the dataset and assigns samples to each subset based on the chosen ratio. It is the simplest and most common approach, and works well when the data is large and roughly balanced across classes. However, random splitting can produce subsets with skewed class distributions, especially when working with imbalanced datasets, and it ignores any group structure that may be present.
Stratified splitting forces each subset (training, validation, test) to preserve the same class distribution as the original dataset. For example, if the full dataset contains 90% negative examples and 10% positive examples, each split will maintain that 90/10 ratio. Stratified splitting is essential for imbalanced classification problems, because a purely random split could place very few minority-class examples in the validation or test sets, producing unreliable performance estimates. Scikit-learn provides StratifiedShuffleSplit and StratifiedKFold for this purpose. Stratification can also be done on continuous targets by binning the values first, and on multi-label data via iterative stratification.
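As a small illustration of the 90/10 example, `train_test_split` accepts a `stratify` argument that preserves class proportions in both subsets. The data here is a toy placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 90% negative, 10% positive.
y = np.array([0] * 900 + [1] * 100)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Both subsets keep the original share of positives.
print(y_train.mean(), y_test.mean())
```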
For time series data, random shuffling is not appropriate because it would cause data leakage: the model would train on future data and be tested on past data. Temporal splitting uses a chronological cutoff point. All data before the cutoff becomes the training set, and all data after the cutoff forms the validation or test set. Scikit-learn's TimeSeriesSplit implements an expanding-window variant where each fold adds more historical data to the training window while testing on the next time period. A walk-forward variant uses a sliding window of fixed length, simulating a model that is retrained periodically and used only on the immediately following period.
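The expanding-window behaviour can be seen by printing the fold indices from `TimeSeriesSplit`. Passing `max_train_size` caps the training window, which approximates the fixed-length sliding window; the 12-point series is a placeholder.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)   # 12 observations in chronological order

# Expanding window: each fold trains on everything before its test period.
folds = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)

# Sliding window: the training window is capped at a fixed length.
sliding = list(TimeSeriesSplit(n_splits=3, max_train_size=3).split(X))
for train_idx, test_idx in sliding:
    print("train:", train_idx, "test:", test_idx)
```

In every fold the largest training index is smaller than the smallest test index, so no future observation reaches the model.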
Many datasets contain natural groups: multiple measurements per patient in a medical dataset, multiple sentences per document in NLP, multiple frames per video in computer vision. If a single group is split across training and test sets, the model can effectively memorize group-specific signals during training and recognize them in the test set, producing optimistic but misleading scores. The fix is group-aware splitting, where every example from a given group is assigned to exactly one subset. Scikit-learn provides GroupKFold and GroupShuffleSplit for this purpose, and GroupTimeSeriesSplit (in the mlxtend library) combines group-aware splitting with chronological ordering.
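A short sketch of group-aware splitting with `GroupKFold` follows. The patient identifiers are hypothetical; the check at the end confirms that no group appears on both sides of any split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(-1, 1)
y = np.zeros(12)
# Hypothetical patient IDs: three measurements per patient.
groups = np.repeat(["p1", "p2", "p3", "p4"], 3)

splits = list(GroupKFold(n_splits=4).split(X, y, groups=groups))
for train_idx, test_idx in splits:
    overlap = set(groups[train_idx]) & set(groups[test_idx])
    print("test groups:", sorted(set(groups[test_idx])), "overlap:", overlap)
```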
Data leakage occurs when information from outside the training set sneaks into the model during training, producing test scores that overestimate real-world performance. Leakage is one of the most common and damaging errors in applied machine learning, and it can be subtle enough that even experienced practitioners miss it.
| Leakage Type | Description | Example |
|---|---|---|
| Train/test contamination | Examples from the test set appear in the training set, often through duplicates or near-duplicates | Web-scraped data containing reposts and mirrors of the same article |
| Target leakage | A feature carries information about the target that would not be available at prediction time | Including "customer churned date" as a feature when predicting churn |
| Preprocessing leakage | Statistics computed on the full dataset are used to transform training data | Standardizing features using the mean of the entire dataset before splitting |
| Group leakage | Different examples from the same group appear in both training and test | One patient's MRI scan in training, another in test, with identifying signal in both |
| Temporal leakage | The training set contains examples from after the test period | Random splits of time-stamped data |
| Label leakage | The label is encoded indirectly in another feature | A free-text note that says "approved" when predicting loan approval |
The core principle is to split first, then process. Any imputation of missing values, scaling, encoding, feature selection, or oversampling must be fit on the training set only and then applied to the validation and test sets. Scikit-learn Pipeline objects automate this for many transformations. Other safeguards include deduplicating before splitting (with hash-based or near-duplicate detection), checking that timestamps and group identifiers respect the split boundaries, and inspecting any feature whose validation score looks suspiciously good.
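The split-first rule can be illustrated with a scikit-learn pipeline. In this sketch on synthetic data, the scaler's statistics are computed from the training rows only, and the same fitted transformation is then reused on the test rows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# fit() learns the scaling statistics and the classifier from X_train only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

scaler = model.named_steps["standardscaler"]
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))   # statistics come from train only
print(model.score(X_test, y_test))                       # test rows are only transformed
```

Passing the whole pipeline to a cross-validation routine gives the same guarantee per fold, since the scaler is refit on each training fold.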
Many real classification problems involve a dominant majority class and one or more minority classes. Fraud detection, rare-disease diagnosis, defect detection, and click prediction all share this pattern. A model that always predicts the majority class can achieve high accuracy while being useless for the actual task. Several techniques address this in the training set itself.
| Technique | What It Does | Trade-off |
|---|---|---|
| Random oversampling | Duplicates random minority-class examples | Simple, but can overfit to the duplicated samples |
| Random undersampling | Discards random majority-class examples | Reduces overfitting risk, but loses information |
| SMOTE | Generates synthetic minority examples by interpolating between nearest neighbors | Creates new examples, but can blur class boundaries |
| Borderline-SMOTE | Synthesizes only near the decision boundary | Focuses learning where it matters, more sensitive to noise |
| ADASYN | Adaptive synthetic sampling that emphasizes harder minority examples | Adapts to local difficulty, but amplifies outliers |
| Tomek links / ENN | Cleans noisy and overlapping examples after over- or under-sampling | Improves boundary clarity, but is computationally heavier |
SMOTE (Synthetic Minority Over-sampling Technique) was introduced by Chawla, Bowyer, Hall, and Kegelmeyer in 2002. The algorithm picks a minority-class example, finds its k nearest neighbors in the same class (typically k=5), and creates a synthetic example along the line segment to a randomly chosen neighbor. The original paper, published in the Journal of Artificial Intelligence Research, showed that combining SMOTE with undersampling of the majority class produced higher area under the ROC curve than undersampling alone. SMOTE remains a baseline for tabular imbalanced classification today, with implementations in the imbalanced-learn library.
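The interpolation step can be written in a few lines of NumPy. This is a simplified sketch of the idea described above, not the reference implementation: the function name `smote_sample` and its parameters are illustrative, and it uses brute-force distances that would not scale to large datasets. For real projects, the imbalanced-learn library's SMOTE class is the usual choice.

```python
import numpy as np

def smote_sample(X_minority, k=5, n_new=10, rng=None):
    """Create n_new synthetic points by interpolating toward same-class neighbors."""
    rng = np.random.default_rng(rng)
    n = len(X_minority)
    # Pairwise Euclidean distances within the minority class.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)                  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :min(k, n - 1)]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                       # pick a minority example
        nb = rng.choice(neighbors[base])             # pick one of its nearest neighbors
        gap = rng.random()                           # position along the segment, in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[nb] - X_minority[base])
    return synthetic

X_min = np.random.default_rng(0).normal(size=(20, 2))   # toy minority class
X_new = smote_sample(X_min, k=5, n_new=10, rng=1)
print(X_new.shape)
```

Because each synthetic point lies on a segment between two real minority points, it stays within the minority class's bounding box. When the classes overlap, some synthetic points can land in majority territory, which is how the boundary gets blurred.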
Cost-sensitive learning is an alternative to resampling: instead of changing the data, the loss function is reweighted so that errors on minority examples cost more. Many classifiers (logistic regression, gradient boosting, neural networks) accept class weights or sample weights for this purpose. Resampling and cost weighting can also be combined.
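One common weighting heuristic, used by scikit-learn's `class_weight="balanced"` option, sets each class weight inversely proportional to its frequency. The sketch below computes those weights by hand on toy labels so the arithmetic is visible.

```python
import numpy as np

y = np.array([0] * 950 + [1] * 50)   # toy labels: 95% majority, 5% minority

classes, counts = np.unique(y, return_counts=True)
# Balanced heuristic: weight_c = n_samples / (n_classes * count_c)
class_weight = {
    int(c): len(y) / (len(classes) * n) for c, n in zip(classes, counts)
}
# Per-example weights, usable wherever an estimator accepts sample weights.
sample_weight = np.array([class_weight[int(label)] for label in y])

print(class_weight)
```

With these weights the two classes contribute equal total weight to the loss, even though one has nineteen times as many examples.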
Data augmentation is a family of techniques that create new training examples by applying transformations to existing ones. Unlike synthetic data, which is generated from scratch, augmentation modifies real samples to increase diversity and to encode invariances the model should respect (such as the fact that a rotated cat is still a cat).
| Domain | Techniques |
|---|---|
| Images | Horizontal/vertical flip, rotation, random crop, scaling, color jitter, Gaussian noise, cutout, random erasing, mixup, CutMix, AutoAugment, RandAugment |
| Text | Synonym replacement, random insertion, random deletion, random swap (EDA), back-translation, paraphrase generation, token masking, prompt-based augmentation with LLMs |
| Audio | Time stretching, pitch shifting, noise injection, speed perturbation, SpecAugment (time and frequency masking on spectrograms), reverberation |
| Tabular | SMOTE and variants, Gaussian noise on numeric columns, swap-noise, permutation of independent columns |
| Graph | Edge dropping, node dropping, attribute masking, subgraph sampling |
Mixup, introduced by Zhang and colleagues in 2017, trains the model on linear combinations of pairs of inputs and their labels, encouraging the model to behave linearly between training examples. CutMix, proposed by Yun and colleagues in 2019, replaces a patch of one image with a patch from another and mixes the labels in proportion to the patch area. Both techniques act as strong regularizers and have become standard in image classification recipes. SpecAugment plays an analogous role in speech recognition, masking time and frequency bands directly on the input spectrogram.
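A minimal NumPy sketch of mixup on one batch is shown below. It draws the mixing coefficient from a Beta(alpha, alpha) distribution as in the mixup paper. Using a single coefficient per batch and pairing examples by a random permutation is a common implementation choice assumed here. The function name and the default `alpha` are illustrative.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=0.2, rng=None):
    """Return convex combinations of a batch with a shuffled copy of itself."""
    rng = np.random.default_rng(rng)
    lam = rng.beta(alpha, alpha)            # mixing coefficient in [0, 1]
    perm = rng.permutation(len(x))          # partner example for each row
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_mixed = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mixed, y_mixed, lam

x = np.arange(12, dtype=float).reshape(4, 3)   # toy batch of 4 inputs
y = np.eye(2)[[0, 1, 0, 1]]                    # one-hot labels for 2 classes
x_m, y_m, lam = mixup_batch(x, y, alpha=0.2, rng=0)
print(lam, y_m)
```

The mixed labels are soft targets that still sum to one per example, so they can be used directly with a cross-entropy loss.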
For text, Easy Data Augmentation (EDA) by Wei and Zou (2019) combines four simple operations (synonym replacement, random insertion, random swap, random deletion) and reports consistent gains on small classification datasets. Back-translation, where text is translated into another language and back to the original, produces fluent paraphrases and is particularly useful for low-resource languages and informal text.
Data augmentation is bounded by the quality and diversity of the original data. If the original training set contains systematic biases, augmented copies will carry those same biases, and aggressive augmentation can introduce label noise when transformations break the assumption of label invariance.
Cross-validation provides a way to use all available data for both training and evaluation, which is valuable when the dataset is too small to afford a large held-out set. In k-fold cross-validation, the dataset is divided into k folds of roughly equal size. The model is trained k times, each time using a different fold as the validation set and the remaining k − 1 folds as the training set. The final performance estimate is the average across all k runs.
| Variant | How It Splits | When to Use |
|---|---|---|
| K-fold | k disjoint folds; common k is 5 or 10 | Standard choice for moderate datasets |
| Stratified K-fold | Preserves class proportions in each fold | Imbalanced classification |
| Leave-one-out (LOO) | One example per fold, n folds total | Very small datasets; computationally expensive |
| Leave-p-out | All possible p-sample test sets | Theoretical analyses; rarely used at scale |
| Repeated K-fold | K-fold repeated multiple times with different random seeds | Noisy estimates that need tighter confidence intervals |
| Group K-fold | Each group goes entirely into one fold | Patient data, document data, video frames |
| Time-series split | Expanding or sliding windows in chronological order | Forecasting; any time-stamped data |
| Nested cross-validation | Inner loop tunes hyperparameters, outer loop estimates generalization | Honest performance estimates when many hyperparameters are tuned |
The scikit-learn user guide recommends 5-fold or 10-fold cross-validation as a default, noting that leave-one-out tends to have high variance because the training sets across folds are nearly identical. Nested cross-validation is the most rigorous option for small datasets where both model selection and final evaluation must come from the same pool. The cost is compute: with k outer folds and m inner folds, every candidate hyperparameter setting is fit on the order of k × m times.
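The difference between plain and nested cross-validation can be sketched with scikit-learn. The dataset is synthetic and the model and hyperparameter grid are arbitrary choices for illustration. The nested version wraps a `GridSearchCV` inside `cross_val_score`, so tuning only ever sees the outer training folds.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Plain k-fold: one fixed model, one fit per outer fold.
plain_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=outer)

# Nested: the hyperparameter search is rerun inside each outer training fold.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=inner,
)
nested_scores = cross_val_score(search, X, y, cv=outer)

print(plain_scores.mean(), nested_scores.mean())
```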
The relationship between training set size and model performance is one of the most studied topics in machine learning. Understanding this relationship helps practitioners decide whether to invest in collecting more data or to focus on improving the model architecture.
A learning curve plots model performance (such as accuracy or loss) on the y-axis against the number of training examples on the x-axis. Two curves are typically drawn together: one for training performance and one for validation performance.
| Observation | What It Means |
|---|---|
| Training score is high, validation score is low | The model is overfitting; it memorizes training examples but fails to generalize |
| Both scores are low | The model is underfitting; it lacks the capacity to learn the patterns |
| Both scores converge to a high value | The model generalizes well; additional data may not help much |
| Validation score keeps climbing as training size grows | More data is likely to improve the model further |
Empirical research shows that learning curves often follow a power law: performance improves rapidly at first, then the rate of improvement slows as data grows. Viering and Loog (2021) reviewed the shape of learning curves across many tasks and found that this power-law form is common, with the exponent depending on the difficulty of the task and the capacity of the model. For neural language models, Kaplan and colleagues (2020) and Hoffmann and colleagues (2022, the Chinchilla paper) showed that loss decreases as a power law in both data size and model size, with an optimal ratio between the two for a fixed compute budget.
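A power law of the form error ≈ a · n^(−b) becomes a straight line in log-log space, so its parameters can be estimated with a linear fit. The measurements below are synthetic and noise-free, generated from assumed values a = 2.0 and b = 0.35 purely to show the procedure.

```python
import numpy as np

# Hypothetical validation errors measured at increasing training-set sizes.
n_examples = np.array([100, 200, 400, 800, 1600, 3200], dtype=float)
val_error = 2.0 * n_examples ** -0.35          # synthetic values for illustration

# error = a * n^(-b)   =>   log(error) = log(a) - b * log(n)
slope, intercept = np.polyfit(np.log(n_examples), np.log(val_error), deg=1)
a, b = np.exp(intercept), -slope

# Extrapolate to a larger training set to judge whether more data is worth it.
predicted_at_10k = a * 10_000 ** -b
print(a, b, predicted_at_10k)
```

With real measurements the fit is noisy, and extrapolating far beyond the observed sizes should be treated with caution.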
More training data generally helps when the model has high variance (i.e., it overfits on small datasets), the problem is complex with many possible patterns to learn, or the feature space is high-dimensional. Deep neural networks, which can have millions or billions of parameters, are particularly data-hungry and tend to improve steadily with larger training sets.
Models eventually reach a plateau where additional data yields diminishing returns. This happens when the model has already captured the key patterns and further examples are largely redundant. At this point, improving data quality, feature engineering, or switching to a better model architecture may produce larger gains than simply adding more data. Recent evidence from large language model scaling research suggests that high-quality data can outperform raw quantity, with smaller models trained on curated data sometimes matching the performance of larger models trained on noisier corpora.
Classical training treats the examples in the training set as exchangeable: the model sees them in a random order, often reshuffled each epoch. Curriculum learning, proposed by Yoshua Bengio and colleagues in 2009, asks whether ordering matters. Their hypothesis was that humans and animals learn better when examples are presented in increasing order of difficulty, and that the same may hold for machine learning models. The 2009 ICML paper showed that on shape recognition and language modeling tasks, starting with easier examples and gradually adding harder ones improved both convergence speed and the quality of the local minimum found.
Curriculum learning has since been applied throughout machine learning. Self-paced learning lets the model itself decide which examples are easy enough to include at each stage. Reverse curricula start with hard examples and add easier ones for fine-tuning. In large language model training, curricula are used to schedule data mixtures over the course of pretraining, often starting with short, clean sequences and adding longer or noisier ones later. Variants such as anti-curriculum and uniform sampling sometimes outperform straightforward easy-to-hard schedules, and the effect size depends heavily on how difficulty is measured.
When labeled examples are expensive but unlabeled examples are cheap, the structure of the training set itself becomes a design decision. Two related families of methods address this regime.
Active learning iteratively grows the training set by querying a human oracle for labels on the examples the model is most uncertain about. The model is trained, scores all unlabeled examples by uncertainty (or by expected information gain, query-by-committee disagreement, or another acquisition function), and the top examples are sent for labeling. Studies in semantic segmentation and single-cell biology have shown that active learning can roughly halve the amount of labeled data required to reach a target accuracy, compared to random sampling.
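The train, score, and query loop can be sketched with least-confidence uncertainty sampling. This example uses synthetic data and a logistic regression model, and the known labels stand in for the human oracle. The seed-pool size, batch size, and number of rounds are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)

# Seed the labeled pool with a few examples from each class.
labeled = np.concatenate([
    rng.choice(np.flatnonzero(y == c), size=5, replace=False) for c in (0, 1)
])
unlabeled = np.setdiff1d(np.arange(len(X)), labeled)

for round_ in range(5):
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[unlabeled])
    uncertainty = 1.0 - proba.max(axis=1)               # least-confidence score
    query = unlabeled[np.argsort(uncertainty)[-10:]]    # 10 most uncertain examples
    # In practice a human annotator labels `query`; here y plays the oracle.
    labeled = np.concatenate([labeled, query])
    unlabeled = np.setdiff1d(unlabeled, query)

print(len(labeled), len(unlabeled))
```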
Semi-supervised learning trains on a mix of labeled and unlabeled examples, typically by generating pseudo-labels for the unlabeled portion or by enforcing consistency under augmentation. Methods like FixMatch combine both ideas: a strong-augmentation prediction is required to match a confident pseudo-label from a weak augmentation. Self-supervised pretraining followed by supervised fine-tuning can be viewed as an extreme form of this strategy, where the unlabeled stage produces a strong representation that the labeled stage refines.
Because a model can only learn patterns that exist in its training data, the quality of that data sets the ceiling on model performance. Quality is usually assessed along several dimensions:
| Dimension | Description |
|---|---|
| Accuracy | Labels must be correct. Mislabeled examples teach the model wrong associations. |
| Completeness | Missing values or incomplete records can introduce noise and reduce learning effectiveness. |
| Consistency | Contradictory examples (same input mapped to different outputs) confuse the learning algorithm. |
| Relevance | The training data should reflect the actual conditions and distribution the model will encounter in production. |
| Freshness | Outdated data can lead to models that do not reflect current patterns or trends. |
| Diversity | Coverage of edge cases, rare classes, and underrepresented subgroups. |
| Provenance | A clear, auditable record of where each example came from and how it was labeled. |
Before training, data typically goes through several preprocessing steps: cleaning (removing duplicates, correcting errors), handling missing values (imputation or removal), normalization or standardization of numerical features, encoding categorical variables, and outlier detection. These steps can significantly improve the effectiveness of the training set without adding a single new example. As noted above, all preprocessing must be fit only on the training portion of the data to avoid leakage.
For supervised tasks, labels are usually produced by humans, which introduces both cost and noise. Inter-annotator agreement (often measured with Cohen's kappa or Fleiss' kappa) gives a rough ceiling on the accuracy a model can achieve, since systematic disagreement between humans means the labels themselves are not consistent. Common strategies for managing label noise include using multiple annotators per example with majority voting, training on noisy labels with loss functions robust to label noise, and post-hoc cleaning with confident learning. Tools such as cleanlab automate the detection of likely label errors in standard datasets like ImageNet, where studies have estimated that several percent of labels are incorrect.
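Cohen's kappa and majority voting are both simple enough to compute by hand. The sketch below uses only the standard library; the annotator labels are made up, and the helper names are illustrative.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same class independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    # Undefined (division by zero) if both annotators always use one identical label.
    return (observed - expected) / (1 - expected)

def majority_vote(*annotations):
    """Pick the most common label per example across annotators."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*annotations)]

ann_1 = ["cat", "cat", "dog", "dog", "cat", "dog"]
ann_2 = ["cat", "dog", "dog", "dog", "cat", "dog"]
ann_3 = ["cat", "cat", "dog", "cat", "cat", "dog"]

kappa = cohen_kappa(ann_1, ann_2)
consensus = majority_vote(ann_1, ann_2, ann_3)
print(round(kappa, 3), consensus)
```

Using an odd number of annotators avoids ties in binary tasks. With more classes, ties are still possible and need an explicit tie-breaking rule.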
A training set is biased when it does not accurately represent the population or conditions the model will encounter in deployment. Bias in training data leads to models that perform well on certain groups or scenarios but poorly on others.
| Bias Type | Description | Example |
|---|---|---|
| Selection bias | The training data is not randomly sampled from the target population | A hiring model trained only on data from one company |
| Sampling bias | Certain groups are over- or underrepresented | A facial recognition model trained mostly on light-skinned faces |
| Temporal bias | The training data reflects conditions from a specific time period that may not hold in the future | A credit scoring model trained on pre-pandemic financial data |
| Measurement bias | Systematic errors in how data was collected or labeled | Inconsistent labeling criteria across different annotators |
| Historical bias | The training data reflects existing societal prejudices | A language model trained on text that contains gender stereotypes |
| Confirmation bias | The data is filtered through a model whose mistakes propagate forward | Active learning queries that miss a region of input space the model never asks about |
Practitioners can reduce training set bias by collecting more diverse and representative data, applying stratified sampling to ensure all subgroups are proportionally included, auditing datasets for demographic and distributional imbalances, using data augmentation to synthetically increase representation of underrepresented groups, and implementing fairness-aware preprocessing techniques. Documentation practices such as Datasheets for Datasets (Gebru et al., 2018) and Data Cards (Pushkarna et al., 2022) encourage explicit reporting of dataset provenance, intended use, and known limitations.
The training sets used for modern large language models (LLMs) are orders of magnitude larger than those used in traditional machine learning. These models consume trillions of tokens from diverse text sources during pretraining.
| Source | Description | Scale |
|---|---|---|
| Common Crawl | Petabytes of raw web data extracted from billions of web pages, updated monthly | Hundreds of billions of tokens per snapshot |
| Wikipedia | Structured encyclopedia articles across hundreds of languages | About 6.8 million English articles, around 4.7 billion words |
| Books | BookCorpus (around 11,000 books), Project Gutenberg (around 70,000 public domain books) | Tens of billions of tokens |
| Code | GitHub repositories, Stack Overflow, Jupyter notebooks | StarCoder dataset: 783 GB across 86 languages |
| Scientific papers | arXiv, PubMed, Semantic Scholar | Billions of tokens of technical text |
| Curated collections | The Pile (825 GiB), RedPajama (1.2 trillion tokens), FineWeb | Purpose-built for LLM training |
| Corpus | Year | Size | Notes |
|---|---|---|---|
| C4 | 2019 | About 156 billion tokens | Filtered Common Crawl, built for T5 |
| The Pile | 2020 | 825 GiB | 22 sub-datasets including ArXiv, PubMed, GitHub |
| RedPajama | 2023 | 1.2 trillion tokens | Open replication of the LLaMA data mix |
| RefinedWeb | 2023 | 5 trillion tokens (600B public) | Web-only, used for Falcon |
| Dolma | 2024 | 3 trillion tokens | Used for OLMo, AI2 release with full transparency |
| FineWeb | 2024 | About 15 trillion tokens | Cleaned and deduplicated Common Crawl, 96 dumps |
| FineWeb-Edu | 2024 | 1.3 trillion tokens | Subset of FineWeb filtered for educational value |
| DCLM-Baseline | 2024 | About 4 trillion tokens | Built from a 240T token DataComp-LM pool with model-based filtering |
GPT-3 was trained on a mixture of approximately 300 billion tokens drawn from Common Crawl (filtered), WebText2, Books1, Books2, and Wikipedia. LLaMA drew from Common Crawl, C4, GitHub, Wikipedia, books, arXiv, and Stack Exchange. These training sets are carefully weighted: higher-quality sources like Wikipedia and books are upsampled, so they may be seen several times over the course of training, while noisier web data is downsampled and may be seen less than once.
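The effect of mixture weights can be made concrete with a little arithmetic: the number of passes over a source equals the tokens drawn from it divided by the tokens it contains. All figures below are hypothetical and do not describe any real model's data mix.

```python
# Hypothetical source sizes (in tokens) and sampling weights.
source_tokens = {"web": 400e9, "books": 20e9, "wikipedia": 4e9}
mix_weight = {"web": 0.80, "books": 0.15, "wikipedia": 0.05}
training_budget = 300e9   # total tokens consumed during training (assumed)

# Effective passes over each source = tokens drawn from it / tokens available.
epochs_seen = {
    name: mix_weight[name] * training_budget / source_tokens[name]
    for name in source_tokens
}

for name, passes in epochs_seen.items():
    print(f"{name}: {passes:.2f} passes")
```

In this made-up mix the small, high-quality sources are repeated several times while the web data is never seen in full.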
The Dolma paper (Soldaini et al., 2024, arXiv:2402.00159) was notable for releasing both the corpus and the full data pipeline, allowing other researchers to reproduce filtering decisions. FineWeb (Penedo et al., 2024, arXiv:2406.17557) demonstrated that careful filtering of Common Crawl can match or beat curated mixes. The DataComp-LM benchmark (Li et al., 2024, arXiv:2406.11794) treats data curation as an experimental science: participants run controlled comparisons of filtering and mixing strategies on a fixed 240 trillion token pool, with model quality on 53 downstream tasks as the metric.
Research projections by Villalobos and colleagues (2022) suggest that publicly available, high-quality human-generated text could be largely exhausted between 2026 and 2032, which is driving interest in synthetic data generation and more efficient data curation methods.
After pretraining, large language models are usually fine-tuned on instruction-following data. The training set in this stage is much smaller (tens of thousands to a few million examples) but its quality has an outsized effect on the resulting model.
| Dataset | Year | Approximate Size | Notes |
|---|---|---|---|
| FLAN / FLAN v2 | 2021, 2022 | About 1.4 million examples | Mixture of NLP tasks reformulated as instructions |
| Self-Instruct / Alpaca | 2022, 2023 | 52,000 examples each | Generated with GPT-3 (Self-Instruct) and text-davinci-003 (Alpaca) |
| OpenHermes | 2023 | About 1 million examples | Curated mix of instruction sources |
| OpenOrca | 2023 | About 4.1 million examples | FLAN augmented with GPT-4 reasoning traces |
| Tülu 2 / Tülu 3 | 2023, 2024 | Hundreds of thousands of examples | Open recipes for instruction tuning at AI2 |
| No Robots | 2023 | 10,000 examples | Entirely human-written |
Self-Instruct (Wang et al., 2022, arXiv:2212.10560) showed that a strong base model can generate its own instruction data. Starting from 175 seed tasks, the authors used GPT-3 to bootstrap about 52,000 instructions. Fine-tuning GPT-3 on this data improved it by 33 percentage points on Super-NaturalInstructions relative to the original model. Stanford Alpaca followed the same recipe with text-davinci-003 and released the resulting instruction set, which became a template for many later projects.
Reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) require a different kind of training set: pairs of model responses with a label indicating which one is preferred. Major preference datasets include:
| Dataset | Year | Size | Notes |
|---|---|---|---|
| Anthropic HH-RLHF | 2022 | About 161,000 comparisons | Human preferences over helpfulness and harmlessness |
| OpenAssistant | 2023 | About 161,000 messages | Human-generated conversations and rankings |
| UltraFeedback | 2023 | About 64,000 prompts, 256,000 completions | Multi-dimensional GPT-4 feedback (instruction following, truthfulness, honesty, helpfulness) |
| Nectar | 2023 | About 183,000 prompts | Used for the Starling models, with GPT-4 ranking responses from many models |
| PKU-SafeRLHF | 2023 | About 30,000 examples | Safety-focused harmlessness preferences |
UltraFeedback was widely adopted as a training set for direct preference optimization because its multi-dimensional scores allow more nuanced reward signals than binary preferences alone.
Synthetic training data is artificially generated data that mimics the statistical properties of real-world data. It is produced using rule-based systems, simulation engines, or generative models such as GANs and diffusion models.
Synthetic data is useful when real data is scarce, expensive, or restricted by privacy rules. Medical imaging datasets can be augmented with synthetic scans to cover rare pathologies, and autonomous driving systems use simulated environments to generate training data for edge cases that are dangerous or impractical to capture on real roads.
A distinct strand of synthetic data uses one model to produce training data for another. The Microsoft Phi series, beginning with the 2023 paper Textbooks Are All You Need (Gunasekar et al., arXiv:2306.11644), trained a 1.3 billion parameter code model on 6 billion tokens of "textbook quality" web data plus 1 billion tokens of synthetic textbooks and exercises generated with GPT-3.5. The resulting model, Phi-1, reached 50.6% pass@1 on HumanEval despite being more than ten times smaller than competitors trained on far more data. The follow-on Phi-1.5 and Phi-2 models extended the synthetic data approach, and similar techniques are now standard in fine-tuning pipelines for instruction following and reasoning.
Knowledge distillation is another form of model-generated training data: a smaller "student" model is trained on the soft outputs of a larger "teacher" model rather than on hard labels, often with much smaller datasets.
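A minimal sketch of a distillation objective follows, assuming the commonly used formulation in which both teacher and student logits are softened with a temperature and compared with a KL divergence. The temperature value, the T² scaling, and the function names are assumptions chosen for illustration, not details given in this article.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_total for z in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    log_p = log_softmax(teacher_logits, temperature)  # soft targets
    log_q = log_softmax(student_logits, temperature)  # student predictions
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * kl


# The loss is zero when the student matches the teacher exactly and grows
# as the two distributions diverge.
matched = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mismatched = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

The soft targets carry the teacher's relative confidence across all classes, which is information a hard label discards.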
If the generative process does not capture the full complexity of real-world data, models trained on synthetic data may underperform in deployment. Validating the quality of synthetic data is itself a non-trivial challenge. Recent research on "model collapse" (Shumailov et al., 2024) has shown that recursively training on outputs of earlier model generations can degrade quality and diversity over time, since rare modes of the original distribution are gradually lost.
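The loss of rare modes can be illustrated with a toy simulation. This is a deliberately simplified stand-in, not the experimental setup of the cited paper: the "model" is just a table of token frequencies, and each generation is trained only on samples drawn from the previous generation's table. The vocabulary size, corpus size, and long-tailed starting weights are arbitrary assumptions.

```python
import random
from collections import Counter


def next_generation(samples, rng):
    """'Train' by counting token frequencies, then 'generate' a new corpus."""
    counts = Counter(samples)
    tokens = list(counts)
    weights = [counts[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=len(samples))


def simulate(generations=20, corpus_size=200, seed=0):
    """Return the number of distinct token types present at each generation."""
    rng = random.Random(seed)
    # Generation 0: a long-tailed "real" distribution over 50 token types.
    vocab = list(range(50))
    weights = [1.0 / (rank + 1) for rank in vocab]
    samples = rng.choices(vocab, weights=weights, k=corpus_size)
    support = [len(set(samples))]
    for _ in range(generations):
        samples = next_generation(samples, rng)
        support.append(len(set(samples)))
    return support


support_sizes = simulate()
```

Once a rare token fails to appear in one generation's corpus, the next model assigns it zero probability and it can never return, so the number of distinct types can only stay flat or shrink.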
The data that goes into a training set is not just a technical asset; it is also a legal and ethical one. Several issues have come to the foreground as training datasets have grown.
The most prominent legal dispute is The New York Times v. OpenAI and Microsoft, filed in December 2023. The Times alleges that millions of its articles were used without license to train OpenAI models, and that those models can produce near-verbatim reproductions of its journalism. OpenAI and Microsoft argue fair use. The case entered discovery in 2024, and in May 2025 a preservation order required OpenAI to retain output logs that might evidence reproduction. As of early 2026, summary judgment on the fair-use question is not expected before the summer of 2026 at the earliest. Dozens of similar suits have been filed by authors, music publishers, and other rights holders, making training-set provenance a significant business and legal concern.
GDPR Article 17 (the right to erasure) gives individuals the right to demand deletion of personal data. Applied to a trained model, this right is technically difficult: model parameters encode something about every example seen during training, but there is no straightforward way to remove a single individual's contribution short of retraining from scratch. The CCPA in California offers similar opt-out rights. "Machine unlearning" is an active research area aimed at approximate erasure: techniques include retraining only affected subsets, certified unlearning with formal guarantees, and influence-function-based corrections. None of these is fully general yet, and most commercial systems handle erasure requests by removing source data and committing to retraining at the next scheduled cycle.
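The idea of retraining only affected subsets can be sketched with a toy ensemble. Everything here is a simplifying assumption: the "sub-model" is just the mean of its shard, examples are assigned to shards by integer ID, and the class and method names are hypothetical. Real systems shard far larger models and must also handle checkpoints and aggregation.

```python
from statistics import mean


class ShardedMeanModel:
    """Toy ensemble: one sub-model (a simple mean) per disjoint data shard."""

    def __init__(self, examples, num_shards=4):
        # examples: iterable of (integer example_id, numeric value) pairs
        self.shards = [dict() for _ in range(num_shards)]
        for example_id, value in examples:
            self.shards[self._shard_of(example_id)][example_id] = value
        self.sub_models = [self._fit(shard) for shard in self.shards]
        self.retrain_count = 0

    def _shard_of(self, example_id):
        # Deterministic assignment so a deletion request can find its shard.
        return example_id % len(self.shards)

    @staticmethod
    def _fit(shard):
        return mean(list(shard.values())) if shard else None

    def predict(self):
        """Aggregate the sub-models by averaging their outputs."""
        return mean([m for m in self.sub_models if m is not None])

    def forget(self, example_id):
        """Drop one example and refit only the shard that contained it."""
        idx = self._shard_of(example_id)
        self.shards[idx].pop(example_id, None)
        self.sub_models[idx] = self._fit(self.shards[idx])
        self.retrain_count += 1


model = ShardedMeanModel([(i, float(i)) for i in range(8)], num_shards=4)
before = list(model.sub_models)
model.forget(7)  # only the shard holding example 7 is refit
```

Because no sub-model ever saw data outside its own shard, the refit shard contains no trace of the deleted example, and the cost of erasure is bounded by the size of one shard rather than the whole training set.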
Several crawler operators and data publishers have introduced opt-out mechanisms for AI training. Crawlers such as OpenAI's GPTBot and Common Crawl's CCBot honor User-agent rules in robots.txt; the IETF has discussed an ai.txt extension; and the Hugging Face hub supports per-dataset access controls. Many large web platforms (Reddit, Stack Exchange, X/Twitter) have moved to license access to their archives rather than allowing free scraping. For teams building training sets, the practical implication is that the legal status of a corpus has to be tracked example by example, not just at the dataset level.
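A robots.txt opt-out can be checked before a page is added to a corpus. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content and the example.com URLs are made up for illustration, and `CCBot` is the user-agent token used by Common Crawl's crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks GPTBot entirely, blocks CCBot from one
# directory, and allows every other crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /private/

User-agent: *
Allow: /
"""


def crawl_permissions(robots_txt, url, agents):
    """Return {agent: may_fetch} for one URL under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


agents = ["GPTBot", "CCBot", "SomeOtherBot"]
article = crawl_permissions(ROBOTS_TXT, "https://example.com/articles/1", agents)
private = crawl_permissions(ROBOTS_TXT, "https://example.com/private/x", agents)
```

A check like this only reflects the publisher's current robots.txt. It says nothing about licenses or about pages that were collected before the rule was added, which is one reason provenance has to be recorded per example.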
Training sets are rarely static. New examples are added, labels are corrected, and earlier subsets are deprecated. Reproducible machine learning therefore requires tracking which exact version of a dataset was used to train each model. Several tools and practices have emerged:
| Tool | Role |
|---|---|
| DVC (Data Version Control) | Git-like versioning of large data and model files, with cloud storage backends |
| Hugging Face Datasets | Hub-hosted datasets with revision SHAs, dataset cards, and streaming access |
| lakeFS | Git-like branching, merging, and rollback for object stores |
| MLflow / Weights & Biases | Experiment tracking that logs dataset hashes alongside model artifacts |
| Pachyderm | Data pipelines with content-addressed storage and lineage |
| Datasheets and Data Cards | Structured documentation of dataset purpose, composition, and limitations |
DVC integrates with the Hugging Face Datasets library: DVC-tracked files can be loaded through a dvc:// filesystem URL. The combination of Git for code, DVC or lakeFS for data, and a model registry for trained artifacts is a common pattern for reproducible training pipelines.
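The idea behind logging a dataset hash alongside a model can be sketched with the standard library alone. This is not how any of the tools above compute their identifiers; the function name and the choice of canonical JSON plus SHA-256 are assumptions for illustration.

```python
import hashlib
import json


def dataset_fingerprint(examples):
    """Order-independent SHA-256 fingerprint of a list of JSON-like examples."""
    # Hash each example in a canonical form so key order does not matter.
    digests = sorted(
        hashlib.sha256(
            json.dumps(ex, sort_keys=True, separators=(",", ":")).encode("utf-8")
        ).hexdigest()
        for ex in examples
    )
    # Sorting the per-example digests makes the result independent of row order.
    combined = hashlib.sha256()
    for digest in digests:
        combined.update(digest.encode("ascii"))
    return combined.hexdigest()


v1 = [{"text": "a cat", "label": "cat"}, {"text": "a dog", "label": "dog"}]
v1_shuffled = [{"label": "dog", "text": "a dog"}, {"label": "cat", "text": "a cat"}]
v2 = [{"text": "a cat", "label": "cat"}, {"text": "a dog", "label": "cat"}]  # one label changed

fp_v1 = dataset_fingerprint(v1)
fp_v1_shuffled = dataset_fingerprint(v1_shuffled)
fp_v2 = dataset_fingerprint(v2)
```

Storing such a fingerprint with each trained model makes it possible to tell later whether two runs saw the same data, even if a single label was corrected in between.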
Imagine you are learning to tell the difference between cats and dogs by looking at pictures. Your parent shows you a big stack of photos, each labeled "cat" or "dog." That stack is the training set. You study the photos and start to notice things: cats have pointy ears, dogs have floppy ears, and so on.
After you finish studying, your parent gives you a new batch of photos you have never seen before and asks, "Is this a cat or a dog?" Those new photos are the test set. The better and more varied your study stack was, the better you will be on the new batch. If all the cats in your study stack were orange tabbies, you might not recognize a black cat. And if your parent accidentally mixed one of the test photos into your study stack, you would ace that one photo only because you had already seen the answer, which is why training and test photos must stay strictly separate.