Training Set

Updated Jun 20, 2026

See also: Machine learning terms

A training set is the portion of a dataset that a machine learning model learns from: the labeled examples a model processes during training to adjust its internal parameters and build the mathematical relationships it uses to make predictions on new, unseen data ^[1]. In Tom Mitchell's classic formulation, a program "learn[s] from experience E with respect to some task T and some performance measure P" ^[31], and the training set is that experience E. It is one of three standard data partitions in supervised machine learning, alongside the validation set (used to tune the model) and the test set (used to score it once at the end).

The quality, size, and representativeness of the training set have a direct impact on a model's accuracy, generalization ability, and fairness. Poorly constructed training sets can lead to overfitting, bias, and unreliable predictions in production ^[3]. The phrase "garbage in, garbage out" is a useful summary: a model can only learn patterns that exist in its training data, so practitioners spend a significant fraction of any project on collecting, cleaning, labeling, and validating the data that goes into training rather than on the model architecture itself. Andrew Ng, who popularized this emphasis, defines the broader practice as "the discipline of systematically engineering the data needed to successfully build an AI system" ^[32].

How do training sets work?

In a typical supervised machine learning workflow, a practitioner starts with a full dataset of labeled examples. Each example consists of input features paired with a known output (the label or target). The dataset is then split into non-overlapping subsets:

Subset	Purpose	When Used
Training set	Teaches the model by exposing it to labeled examples so it can learn patterns and adjust weights	During model training
Validation set	Provides feedback for tuning hyperparameters and detecting overfitting	During model development
Test set	Gives an unbiased estimate of final model performance on data it has never seen	After training is complete

The model iterates over the training set multiple times (each full pass is called an epoch). In each pass, it computes a loss function that measures prediction errors, then uses an optimizer (such as stochastic gradient descent) to update its parameters in the direction that reduces the loss. This cycle repeats until the model converges or a stopping criterion is met.
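The loop described above can be sketched in plain Python. This is a minimal illustration, assuming a one-feature linear model, a squared-error loss, and a hypothetical noise-free toy dataset; the learning rate and epoch count are arbitrary values chosen for the example, not recommendations.

```python
import random

# Hypothetical toy training set generated from y = 2x + 1 (no noise).
train = [(k / 10, 2 * (k / 10) + 1) for k in range(20)]

w, b = 0.0, 0.0            # model parameters, initialised at zero
lr = 0.1                   # learning rate (assumed value)
rng = random.Random(0)

for epoch in range(200):   # one epoch = one full pass over the training set
    rng.shuffle(train)     # reshuffle the examples each epoch
    for x, y in train:
        pred = w * x + b
        error = pred - y   # derivative of 0.5 * (pred - y)**2 with respect to pred
        w -= lr * error * x   # stochastic gradient step for the weight
        b -= lr * error       # stochastic gradient step for the bias

# Mean squared error on the training set after training.
mse = sum((w * x + b - y) ** 2 for x, y in train) / len(train)
```

On this toy problem the parameters should approach w = 2 and b = 1. A real project would stop on a validation-based criterion rather than a fixed epoch count.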

For most supervised tasks the training set is fixed once the project starts. In reinforcement learning the equivalent is a stream of episodes generated by interaction with an environment, and in self-supervised learning the labels are derived from the data itself (such as predicting the next token in a sequence). The discussion below focuses on supervised settings, though most of the same principles apply once the training distribution is fixed.

How are training, validation, and test splits chosen?

What are common split ratios?

There is no single "correct" ratio for dividing data. The best split depends on the total dataset size, the complexity of the model, and the number of hyperparameters to tune. The following ratios are widely used in practice:

Split Scheme	Training	Validation	Test	Best For
80 / 10 / 10	80%	10%	10%	Large datasets (100k+ samples)
70 / 15 / 15	70%	15%	15%	Medium datasets (10k to 100k samples)
60 / 20 / 20	60%	20%	20%	Smaller datasets or complex models
98 / 1 / 1	98%	1%	1%	Very large datasets (millions of examples)

When only a train/test split is needed (without a separate validation set), common ratios are 80/20 or 75/25. In deep learning with very large datasets, practitioners sometimes allocate as much as 98% to training, because even 1% of a massive dataset provides thousands of validation or test examples. Joseph (2022) studied the question of optimal ratios formally and showed that, for linear regression, the optimal training-to-test ratio is roughly √p : 1, where p is the number of model parameters. The implied test fraction, 1/(√p + 1), equals 20% only when p is about 16 and shrinks as the model grows, which often justifies test fractions much smaller than the rule-of-thumb 20% ^[4].
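The arithmetic behind these ratios is easy to check. The helper below is a hypothetical sketch (the function name and defaults are assumptions for illustration) that turns fractions into example counts and shows why a 1% slice of a very large dataset is still sizeable.

```python
def split_sizes(n, train=0.8, val=0.1):
    """Return (n_train, n_val, n_test) for a dataset of n examples."""
    n_train = round(n * train)
    n_val = round(n * val)
    n_test = n - n_train - n_val   # the remainder goes to test so sizes sum to n
    return n_train, n_val, n_test

sizes_80_10_10 = split_sizes(100_000)               # 80 / 10 / 10 scheme
sizes_98_1_1 = split_sizes(5_000_000, 0.98, 0.01)   # 98 / 1 / 1 scheme
```

With five million examples, the 98 / 1 / 1 scheme still leaves 50,000 examples each for validation and testing.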

Why three splits and not two?

The purpose of the validation set is to provide a feedback signal for choices made by the human or by an automated search procedure: model architecture, learning rate, regularization strength, the number of training epochs, and so on. Every time a hyperparameter is changed because of validation performance, the validation set leaks a small amount of information into the model. After many iterations the validation score becomes optimistically biased, in much the same way that running thousands of significance tests will eventually find a spurious result. The test set is held back and ideally consulted only once at the end, so its score is a faithful estimate of generalization to new data ^[2]. When test performance is repeatedly checked during development, practitioners say the test set has been "burned" and a new held-out set must be collected.

What are the main splitting strategies?

How data is divided matters as much as the ratio itself. The main strategies are random splitting, stratified splitting, temporal splitting, and group-aware splitting. The right choice depends on the structure of the data and the prediction task.

random splitting

Random splitting shuffles the dataset and assigns samples to each subset based on the chosen ratio. It is the simplest and most common approach, and works well when the data is large and roughly balanced across classes. However, random splitting can produce subsets with skewed class distributions, especially when working with imbalanced datasets, and it ignores any group structure that may be present.

stratified splitting

Stratified splitting forces each subset (training, validation, test) to preserve the same class distribution as the original dataset. For example, if the full dataset contains 90% negative examples and 10% positive examples, each split will maintain that 90/10 ratio. Stratified splitting is essential for imbalanced classification problems, because a purely random split could place very few minority-class examples in the validation or test sets, producing unreliable performance estimates. Scikit-learn provides StratifiedShuffleSplit and StratifiedKFold for this purpose. Stratification can also be done on continuous targets by binning the values first, and on multi-label data via iterative stratification.
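To make the idea concrete, here is a minimal standard-library sketch of a stratified train/test split. The function name and parameters are hypothetical; in practice the scikit-learn classes named above would normally be used instead.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) that preserve class proportions."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)          # collect example indices per class
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)              # shuffle within each class
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [0] * 90 + [1] * 10           # 90% negative, 10% positive
train_idx, test_idx = stratified_split(labels)
```

Because the split is done class by class, the 20-example test set here contains exactly 2 positives, matching the 90/10 ratio of the full dataset.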

temporal splitting for time series

For time series data, random shuffling is not appropriate because it would cause data leakage: the model would train on future data and be tested on past data. Temporal splitting uses a chronological cutoff point. All data before the cutoff becomes the training set, and all data after the cutoff forms the validation or test set. Scikit-learn's TimeSeriesSplit implements an expanding-window variant where each fold adds more historical data to the training window while testing on the next time period. A walk-forward variant uses a sliding window of fixed length, simulating a model that is retrained periodically and used only on the immediately following period.
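The two window schemes can be sketched as index generators. These are hypothetical helpers written for illustration (they are not the scikit-learn implementation) and assume the samples are already sorted chronologically.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_idx, test_idx) where the training window grows each fold."""
    for k in range(n_splits):
        test_start = n_samples - (n_splits - k) * test_size
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx

def sliding_window_splits(n_samples, train_size, test_size):
    """Yield walk-forward splits with a fixed-length training window."""
    start = 0
    while start + train_size + test_size <= n_samples:
        train_end = start + train_size
        yield (list(range(start, train_end)),
               list(range(train_end, train_end + test_size)))
        start += test_size             # advance by one test period

expanding = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
sliding = list(sliding_window_splits(n_samples=10, train_size=4, test_size=2))
```

In both schemes every training index precedes every test index within a fold, which is the property that prevents the model from seeing the future.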

group-aware splitting

Many datasets contain natural groups: multiple measurements per patient in a medical dataset, multiple sentences per document in NLP, multiple frames per video in computer vision. If a single group is split across training and test sets, the model can effectively memorize group-specific signals during training and recognize them in the test set, producing optimistic but misleading scores. The fix is group-aware splitting, where every example from a given group is assigned to exactly one subset. Scikit-learn provides GroupKFold and GroupShuffleSplit for this purpose, and GroupTimeSeriesSplit (in the mlxtend library) combines group-aware splitting with chronological ordering.
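A minimal sketch of the idea, assuming each example carries a group identifier such as a patient ID. The function and the sample IDs are hypothetical; the library classes named above handle this in practice.

```python
import random

def group_split(groups, test_fraction=0.25, seed=0):
    """Assign whole groups to train or test; return (train_idx, test_idx)."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)                # shuffle groups, not individual examples
    n_test_groups = max(1, round(len(unique) * test_fraction))
    test_groups = set(unique[:n_test_groups])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx

# Hypothetical patient IDs with several scans per patient.
groups = ["p1", "p1", "p2", "p2", "p2", "p3", "p4", "p4"]
train_idx, test_idx = group_split(groups)
```

Because the fraction applies to groups rather than examples, the resulting test set size varies with how many examples each selected group contains.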

What is data leakage and how do you prevent it?

Data leakage occurs when information from outside the training set sneaks into the model during training, producing test scores that overestimate real-world performance. Leakage is one of the most common and damaging errors in applied machine learning, and it can be subtle enough that even experienced practitioners miss it.

common forms of leakage

Leakage Type	Description	Example
Train/test contamination	Examples from the test set appear in the training set, often through duplicates or near-duplicates	Web-scraped data containing reposts and mirrors of the same article
Target leakage	A feature carries information about the target that would not be available at prediction time	Including "customer churned date" as a feature when predicting churn
Preprocessing leakage	Statistics computed on the full dataset are used to transform training data	Standardizing features using the mean of the entire dataset before splitting
Group leakage	Different examples from the same group appear in both training and test	One patient's MRI scan in training, another in test, with identifying signal in both
Temporal leakage	The training set contains examples from after the test period	Random splits of time-stamped data
Label leakage	The label is encoded indirectly in another feature	A free-text note that says "approved" when predicting loan approval

preventing leakage

The core principle is to split first, then process. Any imputation of missing values, scaling, encoding, feature selection, or oversampling must be fit on the training set only and then applied to the validation and test sets ^[29]. Scikit-learn Pipeline objects automate this for many transformations. Other safeguards include deduplicating before splitting (with hash-based or near-duplicate detection), checking that timestamps and group identifiers respect the split boundaries, and inspecting any feature whose validation score looks suspiciously good.
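The "split first, then process" rule can be illustrated with feature standardization. This is a sketch with made-up values and hypothetical helper names; the leaky variant is computed only for contrast.

```python
from statistics import mean, pstdev

def fit_standardizer(values):
    """Learn the mean and standard deviation from the given values."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0      # guard against zero variance
    return mu, sigma

def transform(values, mu, sigma):
    """Apply previously learned statistics to any subset."""
    return [(v - mu) / sigma for v in values]

feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0, 120.0]
train, test = feature[:8], feature[8:]       # split first ...

mu, sigma = fit_standardizer(train)          # ... then fit on training data only
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)     # reuse the training statistics

leaky_mu, _ = fit_standardizer(feature)      # leaky: statistics include test rows
```

Here the training mean is 4.5, while the mean over the full dataset is 25.6. Fitting on the full dataset would let the two large test values shift how the training data is scaled.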

How do you handle class imbalance in a training set?

Many real classification problems involve a dominant majority class and one or more minority classes. Fraud detection, rare-disease diagnosis, defect detection, and click prediction all share this pattern. A model that always predicts the majority class can achieve high accuracy while being useless for the actual task. Several techniques address this in the training set itself.

resampling techniques

Technique	What It Does	Trade-off
Random oversampling	Duplicates random minority-class examples	Simple, but can overfit to the duplicated samples
Random undersampling	Discards random majority-class examples	Reduces overfitting risk, but loses information
SMOTE	Generates synthetic minority examples by interpolating between nearest neighbors	Creates new examples, but can blur class boundaries
Borderline-SMOTE	Synthesizes only near the decision boundary	Focuses learning where it matters, more sensitive to noise
ADASYN	Adaptive synthetic sampling that emphasizes harder minority examples	Adapts to local difficulty, but amplifies outliers
Tomek links / ENN	Cleans noisy and overlapping examples after over- or under-sampling	Improves boundary clarity, but is computationally heavier

SMOTE (Synthetic Minority Over-sampling Technique) was introduced by Chawla, Bowyer, Hall, and Kegelmeyer in 2002 ^[5]. The algorithm picks a minority-class example, finds its k nearest neighbors in the same class (typically k=5), and creates a synthetic example along the line segment to a randomly chosen neighbor. The original paper, published in the Journal of Artificial Intelligence Research, showed that combining SMOTE with undersampling of the majority class produced higher area under the ROC curve than undersampling alone ^[5]. SMOTE remains a baseline for tabular imbalanced classification today, with implementations in the imbalanced-learn library.
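The interpolation step described above can be sketched as follows. This is a simplified illustration using Euclidean distance and brute-force neighbor search, with made-up points; it is not the imbalanced-learn implementation.

```python
import math
import random

def smote_sample(minority, k=5, rng=random):
    """Create one synthetic point from a list of minority-class vectors."""
    base = rng.choice(minority)
    # k nearest neighbours of `base` within the minority class, excluding itself
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()                 # position along the segment, in [0, 1)
    return [b + gap * (n - b) for b, n in zip(base, neighbour)]

rng = random.Random(0)
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8], [1.3, 1.0]]
synthetic = [smote_sample(minority, k=5, rng=rng) for _ in range(4)]
```

Each synthetic point lies on a segment between two real minority examples, which is also why SMOTE can blur class boundaries when minority and majority regions overlap.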

Cost-sensitive learning is an alternative to resampling: instead of changing the data, the loss function is reweighted so that errors on minority examples cost more. Many classifiers (logistic regression, gradient boosting, neural networks) accept class weights or sample weights for this purpose. Resampling and cost weighting can also be combined.
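One common weighting heuristic sets each class weight inversely proportional to its frequency. The sketch below assumes that heuristic and a binary cross-entropy loss; the function names are hypothetical.

```python
import math
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}

def weighted_log_loss(labels, probs, weights):
    """Weighted binary cross-entropy; probs are predicted P(label == 1)."""
    total = sum(
        weights[y] * -(math.log(p) if y == 1 else math.log(1 - p))
        for y, p in zip(labels, probs)
    )
    return total / sum(weights[y] for y in labels)

labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
```

With a 90/10 split the minority class receives weight 5.0 and the majority class about 0.56, so both classes contribute equally to the total loss.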

How does data augmentation expand a training set?

Data augmentation is a family of techniques that create new training examples by applying transformations to existing ones. Unlike synthetic data, which is generated from scratch, augmentation modifies real samples to increase diversity and to encode invariances the model should respect (such as the fact that a rotated cat is still a cat) ^[10].

common augmentation techniques

Domain	Techniques
Images	Horizontal/vertical flip, rotation, random crop, scaling, color jitter, Gaussian noise, cutout, random erasing, mixup, CutMix, AutoAugment, RandAugment
Text	Synonym replacement, random insertion, random deletion, random swap (EDA), back-translation, paraphrase generation, token masking, prompt-based augmentation with LLMs
Audio	Time stretching, pitch shifting, noise injection, speed perturbation, SpecAugment (time and frequency masking on spectrograms), reverberation
Tabular	SMOTE and variants, Gaussian noise on numeric columns, swap-noise, permutation of independent columns
Graph	Edge dropping, node dropping, attribute masking, subgraph sampling

Mixup, introduced by Zhang and colleagues in 2017, trains the model on linear combinations of pairs of inputs and their labels, encouraging the model to behave linearly between training examples ^[7]. CutMix, proposed by Yun and colleagues in 2019, replaces a patch of one image with a patch from another and mixes the labels in proportion to the patch area ^[8]. Both techniques act as strong regularizers and have become standard in image classification recipes. SpecAugment plays an analogous role in speech recognition, masking time and frequency bands directly on the input spectrogram.
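As an illustration, mixup on a single pair of examples can be sketched like this. The mixing coefficient is drawn from a Beta(alpha, alpha) distribution; the alpha value here is an assumed setting and the inputs are toy vectors rather than images.

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two examples; y1 and y2 are one-hot (or soft) label vectors."""
    lam = rng.betavariate(alpha, alpha)        # mixing coefficient in [0, 1]
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

x, y, lam = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0],
                  rng=random.Random(0))
```

The mixed label remains a valid probability vector because the same coefficient weights both inputs and both labels.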

For text, Easy Data Augmentation (EDA) by Wei and Zou (2019) combines four simple operations (synonym replacement, random insertion, random swap, random deletion) and reports consistent gains on small classification datasets ^[9]. Back-translation, where text is translated into another language and back to the original, produces fluent paraphrases and is particularly useful for low-resource languages and informal text.
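Two of the four EDA operations need no external resources and can be sketched directly. The helpers below are simplified illustrations with hypothetical names; synonym replacement and random insertion require a thesaurus and are omitted.

```python
import random

def random_swap(words, n_swaps, rng):
    """Swap the positions of two randomly chosen words, n_swaps times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p, rng):
    """Drop each word with probability p, keeping at least one word."""
    kept = [w for w in words if rng.random() > p]
    return kept or [rng.choice(words)]

rng = random.Random(0)
sentence = "the quick brown fox jumps over the lazy dog".split()
swapped = random_swap(sentence, n_swaps=2, rng=rng)
deleted = random_deletion(sentence, p=0.2, rng=rng)
```

Both operations can change meaning on short or sentiment-bearing sentences, so small values of `n_swaps` and `p` are the safer choice.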

Data augmentation is bounded by the quality and diversity of the original data. If the original training set contains systematic biases, augmented copies will carry those same biases, and aggressive augmentation can introduce label noise when transformations break the assumption of label invariance.

How does cross-validation differ from a fixed split?

Cross-validation provides a way to use all available data for both training and evaluation, which is valuable when the dataset is too small to afford a large held-out set ^[2]. In k-fold cross-validation, the dataset is divided into k folds of roughly equal size. The model is trained k times, each time using a different fold as the validation set and the remaining k − 1 folds as the training set. The final performance estimate is the average across all k runs.
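The fold construction can be sketched as an index generator. This is a minimal illustration with contiguous folds and a hypothetical function name; in practice the data is usually shuffled first unless it is ordered in time.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) for each of k contiguous, near-equal folds."""
    # The first (n_samples % k) folds get one extra example.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = (list(range(0, start))
                     + list(range(start + size, n_samples)))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
```

Every example appears in exactly one validation fold and in the training set of the other folds.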

cross-validation variants

Variant	How It Splits	When to Use
K-fold	k disjoint folds; common k is 5 or 10	Standard choice for moderate datasets
Stratified K-fold	Preserves class proportions in each fold	Imbalanced classification
Leave-one-out (LOO)	One example per fold, n folds total	Very small datasets; computationally expensive
Leave-p-out	All possible p-sample test sets	Theoretical analyses; rarely used at scale
Repeated K-fold	K-fold repeated multiple times with different random seeds	Noisy estimates that need tighter confidence intervals
Group K-fold	Each group goes entirely into one fold	Patient data, document data, video frames
Time-series split	Expanding or sliding windows in chronological order	Forecasting; any time-stamped data
Nested cross-validation	Inner loop tunes hyperparameters, outer loop estimates generalization	Honest performance estimates when many hyperparameters are tuned

The scikit-learn user guide recommends 5-fold or 10-fold cross-validation as a default, noting that leave-one-out tends to have high variance because the training sets across folds are nearly identical ^[28]. Nested cross-validation is the most rigorous option for small datasets where both model selection and final evaluation must come from the same pool, at the cost of roughly k × m model fits per hyperparameter configuration, where k is the number of outer folds and m the number of inner folds.

How does training set size affect performance?

The relationship between training set size and model performance is one of the most studied topics in machine learning. Understanding this relationship helps practitioners decide whether to invest in collecting more data or to focus on improving the model architecture.

learning curves

A learning curve plots model performance (such as accuracy or loss) on the y-axis against the number of training examples on the x-axis. Two curves are typically drawn together: one for training performance and one for validation performance.

Observation	What It Means
Training score is high, validation score is low	The model is overfitting; it memorizes training examples but fails to generalize
Both scores are low	The model is underfitting; it lacks the capacity to learn the patterns
Both scores converge to a high value	The model generalizes well; additional data may not help much
Validation score keeps climbing as training size grows	More data is likely to improve the model further

Empirical research shows that learning curves often follow a power law: performance improves rapidly at first, then the rate of improvement slows as data grows. Perrone and colleagues (2021) reviewed the shape of learning curves across many tasks and found that this power-law form is robust, with the exponent depending on the difficulty of the task and the capacity of the model ^[11]. For neural language models, Kaplan and colleagues (2020) and Hoffmann and colleagues (2022, the Chinchilla paper) showed that loss decreases as a power law in both data size and model size, with an optimal ratio between the two for a fixed compute budget ^[12] ^[13]. The Chinchilla authors found that model size and training-set size should grow in lockstep: "for every doubling of model size the number of training tokens should also be doubled" ^[13]. Their compute-optimal 70B-parameter model was trained on roughly 1.4 trillion tokens, about 20 tokens per parameter, and outperformed the 280B-parameter Gopher despite being four times smaller ^[13].
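Two pieces of arithmetic from this paragraph can be checked in a few lines. The sketch assumes a pure power law of the form err = a · n^(−b) and uses the roughly 20 tokens-per-parameter ratio quoted above as a rule of thumb; the learning-curve numbers are hypothetical.

```python
import math

def power_law_exponent(n1, err1, n2, err2):
    """Exponent b in err = a * n**(-b), estimated from two curve points."""
    return math.log(err1 / err2) / math.log(n2 / n1)

def compute_optimal_tokens(n_params, tokens_per_param=20):
    """Training tokens implied by a fixed tokens-per-parameter ratio."""
    return n_params * tokens_per_param

# Hypothetical learning curve: error halves when the data grows fourfold.
b = power_law_exponent(1_000, 0.20, 4_000, 0.10)

tokens_70b = compute_optimal_tokens(70_000_000_000)
```

An exponent of 0.5 means four times more data is needed to halve the error. For a 70B-parameter model the ratio gives 1.4 trillion tokens, in line with the figure in the text.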

when more data helps

More training data generally helps when the model has high variance (i.e., it overfits on small datasets), the problem is complex with many possible patterns to learn, or the feature space is high-dimensional. Deep neural networks, which can have millions or billions of parameters, are particularly data-hungry and tend to improve steadily with larger training sets.

when more data stops helping

Models eventually reach a plateau where additional data yields diminishing returns. This happens when the model has already captured the key patterns and further examples are largely redundant. At this point, improving data quality, feature engineering, or switching to a better model architecture may produce larger gains than simply adding more data. Recent evidence from large language model scaling research suggests that high-quality data can outperform raw quantity, with smaller models trained on curated data sometimes matching the performance of larger models trained on noisier corpora.

Does the order of training examples matter? (curriculum learning)

Classical training treats the examples in the training set as exchangeable: the model sees them in a random order, often reshuffled each epoch. Curriculum learning, proposed by Yoshua Bengio and colleagues in 2009, asks whether ordering matters. Their hypothesis was that humans and animals learn better when examples are presented in increasing order of difficulty, and that the same may hold for machine learning models. The 2009 ICML paper showed that on shape recognition and language modeling tasks, starting with easier examples and gradually adding harder ones improved both convergence speed and the quality of the local minimum found ^[6].
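An easy-to-hard schedule can be sketched as a sequence of growing training pools. The example is a simplified illustration: sentence length stands in for difficulty, which is only one of many possible measures, and the function name is hypothetical.

```python
def curriculum_stages(examples, difficulty, n_stages):
    """Yield cumulative training pools, easiest examples first."""
    ordered = sorted(examples, key=difficulty)
    for stage in range(1, n_stages + 1):
        cutoff = len(ordered) * stage // n_stages
        yield ordered[:cutoff]         # each stage adds harder examples

sentences = [
    "a cat",
    "the dog ran far away",
    "birds fly",
    "it rained heavily all through the night",
]
stages = list(curriculum_stages(
    sentences, difficulty=lambda s: len(s.split()), n_stages=2))
```

The first stage contains only the two shortest sentences, and the final stage contains the full training set.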

Curriculum learning has since been applied throughout machine learning. Self-paced learning lets the model itself decide which examples are easy enough to include at each stage. Reverse curricula start with hard examples and add easier ones for fine-tuning. In large language model training, curricula are used to schedule data mixtures over the course of pretraining, often starting with short, clean sequences and adding longer or noisier ones later. Variants such as anti-curriculum and uniform sampling sometimes outperform straightforward easy-to-hard schedules, and the effect size depends heavily on how difficulty is measured.

How do active and semi-supervised learning shape the training set?

When labeled examples are expensive but unlabeled examples are cheap, the structure of the training set itself becomes a design decision. Two related families of methods address this regime.

Active learning iteratively grows the training set by querying a human oracle for labels on the examples the model is most uncertain about. The model is trained, scores all unlabeled examples by uncertainty (or by expected information gain, query-by-committee disagreement, or another acquisition function), and the top examples are sent for labeling. Studies in semantic segmentation and single-cell biology have shown that active learning can roughly halve the amount of labeled data required to reach a target accuracy, compared to random sampling.
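One round of uncertainty-based selection can be sketched as follows, assuming predictive entropy as the acquisition function and made-up model outputs for a pool of four unlabeled examples.

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(pool_probs, budget):
    """Return indices of the `budget` most uncertain unlabeled examples."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:budget]

# Hypothetical model predictions on an unlabeled pool.
pool_probs = [[0.98, 0.02], [0.55, 0.45], [0.70, 0.30], [0.50, 0.50]]
queries = select_for_labeling(pool_probs, budget=2)
```

The examples with predictions closest to 50/50 are selected first. After the oracle labels them, they are added to the training set and the model is retrained.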

Semi-supervised learning trains on a mix of labeled and unlabeled examples, typically by generating pseudo-labels for the unlabeled portion or by enforcing consistency under augmentation. Methods like FixMatch combine both ideas: a strong-augmentation prediction is required to match a confident pseudo-label from a weak augmentation. Self-supervised pretraining followed by supervised fine-tuning can be viewed as an extreme form of this strategy, where the unlabeled stage produces a strong representation that the labeled stage refines.
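The confidence filter at the heart of pseudo-labeling can be sketched in a few lines. The threshold value is an assumed setting for illustration, and the function name is hypothetical.

```python
def pseudo_label(probs, threshold=0.95):
    """Return (class_index, keep) for one prediction on an unlabeled example."""
    confidence = max(probs)
    return probs.index(confidence), confidence >= threshold

confident = pseudo_label([0.97, 0.02, 0.01])   # kept as a training target
uncertain = pseudo_label([0.60, 0.30, 0.10])   # ignored for this step
```

Only predictions above the threshold become training targets, which limits how many wrong pseudo-labels enter the training set.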

What makes a high-quality training set?

The principle of "garbage in, garbage out" applies directly to training sets. A model can only learn patterns that exist in its training data, so the quality of that data determines the ceiling on model performance. This is the core insight behind the data-centric AI movement: as Andrew Ng put it, the field's "dominant paradigm over the last decade was to download the data set while you focus on improving the code," and the proposed shift is toward "systematically engineering the data needed to successfully build an AI system" ^[32].

key quality dimensions

Dimension	Description
Accuracy	Labels must be correct. Mislabeled examples teach the model wrong associations.
Completeness	Missing values or incomplete records can introduce noise and reduce learning effectiveness.
Consistency	Contradictory examples (same input mapped to different outputs) confuse the learning algorithm.
Relevance	The training data should reflect the actual conditions and distribution the model will encounter in production.
Freshness	Outdated data can lead to models that do not reflect current patterns or trends.
Diversity	Coverage of edge cases, rare classes, and underrepresented subgroups.
Provenance	A clear, auditable record of where each example came from and how it was labeled.

data preprocessing

Before training, data typically goes through several preprocessing steps: cleaning (removing duplicates, correcting errors), handling missing values (imputation or removal), normalization or standardization of numerical features, encoding categorical variables, and outlier detection. These steps can significantly improve the effectiveness of the training set without adding a single new example. As noted above, all preprocessing must be fit only on the training portion of the data to avoid leakage.

labeling and label noise

For supervised tasks, labels are usually produced by humans, which introduces both cost and noise. Inter-annotator agreement (often measured with Cohen's kappa or Fleiss' kappa) gives a rough ceiling on the accuracy a model can achieve, since systematic disagreement between humans means the labels themselves are not consistent. Common strategies for managing label noise include using multiple annotators per example with majority voting, training on noisy labels with loss functions robust to label noise, and post-hoc cleaning with confident learning. Tools such as cleanlab automate the detection of likely label errors in standard datasets like ImageNet, where studies have estimated that several percent of labels are incorrect.
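Cohen's kappa for two annotators compares observed agreement with the agreement expected by chance. The sketch below uses made-up annotations and assumes the annotators do not both assign a single label to every item (which would make the denominator zero).

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators' label lists."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: probability both pick the same class independently.
    expected = sum(ca[c] * cb[c] for c in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

annotator_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "neg", "pos"]
annotator_b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "pos"]
kappa = cohens_kappa(annotator_a, annotator_b)
```

Here the annotators agree on 8 of 10 items, chance agreement is 0.5, and kappa is approximately 0.6.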

How does training set bias arise, and how is it mitigated?

A training set is biased when it does not accurately represent the population or conditions the model will encounter in deployment. Bias in training data leads to models that perform well on certain groups or scenarios but poorly on others ^[24].

common types of bias

Bias Type	Description	Example
Selection bias	The training data is not randomly sampled from the target population	A hiring model trained only on data from one company
Sampling bias	Certain groups are over- or underrepresented	A facial recognition model trained mostly on light-skinned faces
Temporal bias	The training data reflects conditions from a specific time period that may not hold in the future	A credit scoring model trained on pre-pandemic financial data
Measurement bias	Systematic errors in how data was collected or labeled	Inconsistent labeling criteria across different annotators
Historical bias	The training data reflects existing societal prejudices	A language model trained on text that contains gender stereotypes
Confirmation bias	The data is filtered through a model whose mistakes propagate forward	Active learning queries that miss a region of input space the model never asks about

mitigating bias

Practitioners can reduce training set bias by collecting more diverse and representative data, applying stratified sampling to ensure all subgroups are proportionally included, auditing datasets for demographic and distributional imbalances, using data augmentation to synthetically increase representation of underrepresented groups, and implementing fairness-aware preprocessing techniques. Documentation practices such as Datasheets for Datasets (Gebru et al., 2018) and Data Cards (Pushkarna et al., 2022) encourage explicit reporting of dataset provenance, intended use, and known limitations ^[25].

What training data do large language models use?

The training sets used for modern large language models (LLMs) are orders of magnitude larger than those used in traditional machine learning. These models consume trillions of tokens from diverse text sources during pretraining.

primary data sources

Source	Description	Scale
Common Crawl	Petabytes of raw web data extracted from billions of web pages, updated monthly	Hundreds of billions of tokens per snapshot
Wikipedia	Structured encyclopedia articles across hundreds of languages	About 6.8 million English articles, around 4.7 billion words
Books	BookCorpus (around 11,000 books), Project Gutenberg (around 70,000 public domain books)	Tens of billions of tokens
Code	GitHub repositories, Stack Overflow, Jupyter notebooks	StarCoder dataset: 783 GB across 86 languages
Scientific papers	arXiv, PubMed, Semantic Scholar	Billions of tokens of technical text
Curated collections	The Pile (825 GiB), RedPajama (1.2 trillion tokens), FineWeb	Purpose-built for LLM training

major open pretraining corpora

Corpus	Year	Size	Notes
C4	2019	About 156 billion tokens	Filtered Common Crawl, built for T5
The Pile	2020	825 GiB	22 sub-datasets including ArXiv, PubMed, GitHub
RedPajama	2023	1.2 trillion tokens	Open replication of the LLaMA data mix
RefinedWeb	2023	5 trillion tokens (600B public)	Web-only, used for Falcon
Dolma	2024	3 trillion tokens	Used for OLMo, AI2 release with full transparency
FineWeb	2024	About 15 trillion tokens	Cleaned and deduplicated Common Crawl, 96 dumps
FineWeb-Edu	2024	1.3 trillion tokens	Subset of FineWeb filtered for educational value
DCLM-Baseline	2024	About 4 trillion tokens	Built from a 240T token DataComp-LM pool with model-based filtering

GPT-3 was trained on a mixture of approximately 300 billion tokens drawn from Common Crawl (filtered), WebText2, Books1, Books2, and Wikipedia ^[14]. LLaMA drew from Common Crawl, C4, GitHub, Wikipedia, books, arXiv, and Stack Exchange. These training sets are carefully weighted: higher-quality sources like Wikipedia and books are often upsampled, so their examples are seen several times over the course of training, while noisier web data is downsampled and may be seen less than once.
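Source weighting can be sketched as weighted sampling over a mixture. All weights and source sizes below are hypothetical values chosen for illustration; they are not the proportions used by any real model.

```python
import random

# Hypothetical mixture weights and source sizes (in billions of tokens).
mixture = {"web": 0.60, "books": 0.15, "wikipedia": 0.05, "code": 0.20}
source_tokens = {"web": 1000, "books": 30, "wikipedia": 5, "code": 100}

def sample_sources(mixture, n, seed=0):
    """Draw the source of each of n training documents by mixture weight."""
    rng = random.Random(seed)
    names = list(mixture)
    return rng.choices(names, weights=[mixture[s] for s in names], k=n)

def effective_epochs(mixture, source_tokens, total_training_tokens):
    """How many times each source is seen over the whole training run."""
    return {s: mixture[s] * total_training_tokens / source_tokens[s]
            for s in mixture}

draws = sample_sources(mixture, n=1000)
epochs = effective_epochs(mixture, source_tokens, total_training_tokens=300)
```

With these made-up numbers the small Wikipedia source is seen about three times, while the large web source is seen less than once.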

The Dolma paper (Soldaini et al., 2024, arXiv:2402.00159) was notable for releasing both the corpus and the full data pipeline, allowing other researchers to reproduce filtering decisions ^[17]. FineWeb (Penedo et al., 2024, arXiv:2406.17557) demonstrated that careful filtering of Common Crawl can match or beat curated mixes ^[18]. The DataComp-LM benchmark (Li et al., 2024, arXiv:2406.11794) treats data curation as an experimental science: participants run controlled comparisons of filtering and mixing strategies on a fixed 240 trillion token pool, with model quality on 53 downstream tasks as the metric ^[19].

Research projections by Villalobos and colleagues (2022) suggest that publicly available, high-quality human-generated text could be largely exhausted between 2026 and 2032, which is driving interest in synthetic data generation and more efficient data curation methods ^[26].

instruction tuning datasets

After pretraining, large language models are usually fine-tuned on instruction-following data. The training set in this stage is much smaller (tens of thousands to a few million examples) but its quality has an outsized effect on the resulting model.

Dataset	Year	Approximate Size	Notes
FLAN / FLAN v2	2021, 2022	About 1.4 million examples	Mixture of NLP tasks reformulated as instructions
Self-Instruct / Alpaca	2022, 2023	52,000 examples	Generated with GPT-3 (Self-Instruct) and text-davinci-003 (Alpaca)
OpenHermes	2023	About 1 million examples	Curated mix of instruction sources
OpenOrca	2023	About 4.1 million examples	FLAN augmented with GPT-4 reasoning traces
Tülu 2 / Tülu 3	2023, 2024	Hundreds of thousands of examples	Open recipes for instruction tuning at AI2
No Robots	2023	10,000 examples	Entirely human-written

Self-Instruct (Wang et al., 2022, arXiv:2212.10560) showed that a strong base model can generate its own instruction data: starting from 175 seed tasks, the authors used GPT-3 to bootstrap 52,000 instructions, with the model fine-tuned on this data improving by 33 percentage points on Super-NaturalInstructions ^[20]. Stanford Alpaca followed the same recipe with text-davinci-003 and released the resulting instruction set, which became a template for many later projects.

preference datasets for alignment

Reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) require a different kind of training set: pairs of model responses with a label indicating which one is preferred ^[23]. Major preference datasets include:

Dataset	Year	Size	Notes
Anthropic HH-RLHF	2022	About 161,000 comparisons	Human preferences over helpfulness and harmlessness
OpenAssistant	2023	About 161,000 messages	Human-generated conversations and rankings
UltraFeedback	2023	About 64,000 prompts, 256,000 completions	Multi-dimensional GPT-4 feedback (instruction following, truthfulness, honesty, helpfulness)
Nectar	2023	About 183,000 prompts	Used for the Starling models, with GPT-4 ranking responses from many models
PKU-SafeRLHF	2023	About 30,000 examples	Safety-focused harmlessness preferences

UltraFeedback was widely adopted as a training set for direct preference optimization because its multi-dimensional scores allow more nuanced reward signals than binary preferences alone ^[22].
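As a rough illustration of how scored completions become preference training data, the sketch below collapses per-aspect scores into a single (chosen, rejected) pair and computes the DPO loss for that pair. The `PreferencePair` type, the unweighted mean used for ranking, and the default `beta` are assumptions for this example; real pipelines choose pairs and hyperparameters in different ways.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def binarize(prompt: str, scored: Dict[str, Dict[str, float]]) -> PreferencePair:
    """Turn per-aspect scores into one (chosen, rejected) pair.

    `scored` maps each completion to its aspect scores. Ranking by the
    unweighted mean is an illustrative choice, not a prescribed recipe.
    """
    def mean_score(completion: str) -> float:
        aspects = scored[completion]
        return sum(aspects.values()) / len(aspects)

    ranked = sorted(scored, key=mean_score, reverse=True)
    return PreferencePair(prompt=prompt, chosen=ranked[0], rejected=ranked[-1])


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one pair: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = (policy_logp_chosen - policy_logp_rejected) - (
        ref_logp_chosen - ref_logp_rejected
    )
    # -log(sigmoid(x)) equals log(1 + exp(-x)).
    return math.log1p(math.exp(-beta * margin))
```

The loss falls as the policy assigns relatively more log-probability to the chosen response than the reference model does, which is the signal a preference dataset supplies.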

What is synthetic training data?

Synthetic training data is artificially generated data that mimics the statistical properties of real-world data. It is produced using rule-based systems, simulation engines, or generative models such as GANs and diffusion models.

Synthetic data is useful when real data is scarce, expensive, or restricted by privacy rules. Medical imaging datasets can be augmented with synthetic scans to cover rare pathologies, and autonomous driving systems use simulated environments to generate training data for edge cases that are dangerous or impractical to capture on real roads.

Model-generated training data

A distinct strand of synthetic data uses one model to produce training data for another. The Microsoft Phi series, beginning with the 2023 paper Textbooks Are All You Need (Gunasekar et al., arXiv:2306.11644), trained a 1.3 billion parameter code model on 6 billion tokens of "textbook quality" web data plus 1 billion tokens of synthetic textbooks and exercises generated with GPT-3.5. The resulting model, Phi-1, reached 50.6% pass@1 on HumanEval despite being more than ten times smaller than competitors trained on far more data ^[21]. The follow-on Phi-1.5 and Phi-2 models extended the synthetic data approach, and similar techniques are now standard in fine-tuning pipelines for instruction following and reasoning.

Knowledge distillation is another form of model-generated training data: a smaller "student" model is trained on the soft outputs of a larger "teacher" model rather than on hard labels, often with much smaller datasets.
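The difference between soft outputs and hard labels can be made concrete with a small sketch. It computes the KL divergence between temperature-softened teacher and student distributions, a common form of distillation loss; the function names and the default temperature are assumptions for this example, and real training code would typically combine this term with a hard-label loss.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Convert logits to probabilities; a higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    teacher_logits: List[float],
    student_logits: List[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

Because the teacher's distribution carries information about how similar the wrong classes are to the right one, each example conveys more than a single hard label would.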

Limitations of synthetic data

If the generative process does not capture the full complexity of real-world data, models trained on synthetic data may underperform in deployment. Validating the quality of synthetic data is itself a non-trivial challenge. Recent research on "model collapse" (Shumailov et al., 2024) has shown that recursively training on outputs of earlier model generations can degrade quality and diversity over time, since rare modes of the original distribution are gradually lost ^[27].
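The loss of rare modes can be seen in a toy simulation that is far simpler than the experiments in the cited work. Each "generation" draws a finite sample from the current categorical distribution and refits the distribution from the empirical counts; the category names, sample sizes, and function names are assumptions for this example.

```python
import random
from collections import Counter
from typing import Dict, List


def resample_generation(
    dist: Dict[str, float], n_samples: int, rng: random.Random
) -> Dict[str, float]:
    """Draw n_samples from dist, then refit a distribution from the empirical counts."""
    categories = list(dist)
    weights = [dist[c] for c in categories]
    draws = rng.choices(categories, weights=weights, k=n_samples)
    counts = Counter(draws)
    return {c: counts[c] / n_samples for c in categories}


def support_sizes(
    dist: Dict[str, float], generations: int, n_samples: int, seed: int = 0
) -> List[int]:
    """Track how many categories still have non-zero probability after each generation."""
    rng = random.Random(seed)
    sizes = [sum(1 for p in dist.values() if p > 0)]
    for _ in range(generations):
        dist = resample_generation(dist, n_samples, rng)
        sizes.append(sum(1 for p in dist.values() if p > 0))
    return sizes


history = support_sizes(
    {"common": 0.90, "uncommon": 0.09, "rare": 0.01},
    generations=50,
    n_samples=20,
)
```

Once a category receives zero samples in some generation its estimated probability is zero, so it can never reappear: the support in `history` can only shrink, which mirrors the gradual loss of rare modes described above.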

What are the legal and ethical issues with training data?

The data that goes into a training set is not just a technical asset; it is also a legal and ethical one. Several issues have come to the foreground as training datasets have grown.

Copyright and fair use

The most prominent legal dispute is The New York Times v. OpenAI and Microsoft, filed in December 2023 ^[30]. The Times alleges that millions of its articles were used without license to train OpenAI models, and that those models can produce near-verbatim reproductions of its journalism. OpenAI and Microsoft argue fair use. The case entered discovery in 2024, and in May 2025 a preservation order required OpenAI to retain output logs that might evidence reproduction. As of early 2026, summary judgment on the fair-use question is not expected before the summer of 2026 at the earliest. Dozens of similar suits have been filed by authors, music publishers, and other rights holders, making training-set provenance a significant business and legal concern.

Privacy and the right to erasure

GDPR Article 17 (the right to erasure) gives individuals the right to demand deletion of personal data. Applied to a trained model, this right is technically difficult to honor: model parameters encode something about every example seen during training, and there is no straightforward way to remove a single individual's contribution short of retraining from scratch. The CCPA in California grants a comparable right to request deletion. "Machine unlearning" is an active research area aimed at approximate erasure: techniques include retraining only affected subsets, certified unlearning with formal guarantees, and influence-function-based corrections. None of these is fully general yet, and most commercial systems handle erasure requests by removing source data and committing to retraining at the next scheduled cycle.
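The "retraining only affected subsets" idea can be sketched with a deliberately tiny model. Here the training set is split into shards, each shard trains its own sub-model (just a mean, for illustration), and an erasure request retrains only the shards that held the user's examples. The class name, the round-robin sharding, and the mean-as-model are assumptions for this example, not a production unlearning method.

```python
from statistics import mean
from typing import List, Optional, Tuple

Example = Tuple[str, float]  # (user_id, value)


class ShardedMeanModel:
    """Toy ensemble: one 'sub-model' (a mean) per shard of the training set."""

    def __init__(self, examples: List[Example], num_shards: int):
        self.shards: List[List[Example]] = [[] for _ in range(num_shards)]
        for i, example in enumerate(examples):
            self.shards[i % num_shards].append(example)
        self.retrain_count = 0
        self.sub_models: List[Optional[float]] = [self._train(s) for s in self.shards]

    def _train(self, shard: List[Example]) -> Optional[float]:
        """'Train' a sub-model on one shard; an empty shard yields no model."""
        self.retrain_count += 1
        return mean(value for _, value in shard) if shard else None

    def predict(self) -> float:
        """Aggregate the sub-models by averaging them."""
        return mean(m for m in self.sub_models if m is not None)

    def forget(self, user_id: str) -> int:
        """Drop a user's examples and retrain only the shards that held them."""
        retrained = 0
        for idx, shard in enumerate(self.shards):
            kept = [ex for ex in shard if ex[0] != user_id]
            if len(kept) != len(shard):
                self.shards[idx] = kept
                self.sub_models[idx] = self._train(kept)
                retrained += 1
        return retrained
```

The cost of an erasure request scales with the size of the affected shards rather than the whole dataset, at the price of training several smaller models instead of one.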

Opt-out and data governance

Several opt-out mechanisms for AI training have emerged. Crawlers such as OpenAI's GPTBot and Common Crawl's CCBot honor User-agent directives in robots.txt; the IETF has discussed an ai.txt extension; and the Hugging Face hub supports per-dataset access controls. Many large web platforms (Reddit, Stack Exchange, X/Twitter) have moved to license access to their archives rather than allowing free scraping. For teams building training sets, the practical implication is that the legal status of a corpus has to be tracked example by example, not just at the dataset level.
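A robots.txt opt-out can be checked with Python's standard library before a page is added to a corpus. In this sketch the robots.txt content and the `may_crawl` helper are made up for illustration; GPTBot and CCBot are the user-agent tokens used by OpenAI's and Common Crawl's crawlers respectively.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt that opts out of two AI-related crawlers
# while leaving the site open to everything else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A robots.txt rule expresses the publisher's preference only; it does not settle licensing or copyright questions, so a check like this is one input to provenance tracking rather than a substitute for it.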

Data versioning and tooling

Training sets are rarely static. New examples are added, labels are corrected, and earlier subsets are deprecated. Reproducible machine learning therefore requires tracking which exact version of a dataset was used to train each model. Several tools and practices have emerged:

Tool	Role
DVC (Data Version Control)	Git-like versioning of large data and model files, with cloud storage backends
Hugging Face Datasets	Hub-hosted datasets with revision SHAs, dataset cards, and streaming access
lakeFS	Git-like branching, merging, and rollback for object stores
MLflow / Weights & Biases	Experiment tracking that logs dataset hashes alongside model artifacts
Pachyderm	Data pipelines with content-addressed storage and lineage
Datasheets and Data Cards	Structured documentation of dataset purpose, composition, and limitations

DVC and the Hugging Face Datasets library interoperate: DVC-tracked files can be loaded into the library through a dvc:// filesystem URL. The combination of Git for code, DVC or lakeFS for data, and a model registry for trained artifacts is a common pattern for reproducible training pipelines.
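The core idea behind logging a dataset hash alongside a model can be shown without any of the tools above. The sketch below computes an order-independent SHA-256 fingerprint over JSON-serializable records; the `dataset_fingerprint` function and the record layout are assumptions for this example and do not reproduce how DVC, lakeFS, or MLflow compute their own hashes.

```python
import hashlib
import json
from typing import Dict, Iterable


def dataset_fingerprint(records: Iterable[Dict]) -> str:
    """Order-independent SHA-256 fingerprint of JSON-serializable records."""
    # Hash each record in a canonical form (sorted keys), then sort the
    # per-record hashes so that row order does not change the result.
    record_hashes = sorted(
        hashlib.sha256(
            json.dumps(record, sort_keys=True, ensure_ascii=False).encode("utf-8")
        ).hexdigest()
        for record in records
    )
    digest = hashlib.sha256()
    for record_hash in record_hashes:
        digest.update(record_hash.encode("ascii"))
    return digest.hexdigest()


v1 = [{"text": "a cat", "label": "cat"}, {"text": "a dog", "label": "dog"}]
v2 = [{"text": "a cat", "label": "cat"}, {"text": "a dog", "label": "cat"}]  # one label changed

fingerprint_v1 = dataset_fingerprint(v1)
fingerprint_v2 = dataset_fingerprint(v2)
```

Storing the fingerprint with each trained model makes it possible to tell later whether two models saw the same data, even when a single label was corrected between runs.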

Explain like I'm 5 (ELI5)

Imagine you are learning to tell the difference between cats and dogs by looking at pictures. Your parent shows you a big stack of photos, each labeled "cat" or "dog." That stack is the training set. You study the photos and start to notice things: cats have pointy ears, dogs have floppy ears, and so on.

After you finish studying, your parent gives you a new batch of photos you have never seen before and asks, "Is this a cat or a dog?" Those new photos are the test set. The better and more varied your study stack was, the better you will be on the new batch. If all the cats in your study stack were orange tabbies, you might not recognize a black cat. And if your parent accidentally mixed one of the test photos into your study stack, you would ace that one photo only because you had already seen the answer, which is why training and test photos must stay strictly separate.

References

1. Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press. Chapter 5: Machine Learning Basics. https://www.deeplearningbook.org/
2. Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning*, 2nd ed. Springer. Chapter 7: Model Assessment and Selection.
3. Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer.
4. Joseph, V. R. (2022). "Optimal ratio for data splitting." *Statistical Analysis and Data Mining*, 15(4), 531–538. https://doi.org/10.1002/sam.11583
5. Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). "SMOTE: Synthetic Minority Over-sampling Technique." *Journal of Artificial Intelligence Research*, 16, 321–357. https://www.jair.org/index.php/jair/article/view/10302
6. Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). "Curriculum learning." *Proceedings of the 26th International Conference on Machine Learning*. https://dl.acm.org/doi/10.1145/1553374.1553380
7. Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2018). "mixup: Beyond Empirical Risk Minimization." *International Conference on Learning Representations*.
8. Yun, S. et al. (2019). "CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features." *ICCV 2019*.
9. Wei, J., & Zou, K. (2019). "EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks." *EMNLP-IJCNLP 2019*.
10. Shorten, C., & Khoshgoftaar, T. M. (2019). "A survey on Image Data Augmentation for Deep Learning." *Journal of Big Data*, 6, 60.
11. Viering, T., & Loog, M. (2021). "The Shape of Learning Curves: A Review." *arXiv:2103.10948*.
12. Kaplan, J. et al. (2020). "Scaling Laws for Neural Language Models." *arXiv:2001.08361*.
13. Hoffmann, J. et al. (2022). "Training Compute-Optimal Large Language Models." *arXiv:2203.15556*. (the Chinchilla paper)
14. Brown, T. et al. (2020). "Language Models are Few-Shot Learners." *Advances in Neural Information Processing Systems*, 33, 1877–1901.
15. Gao, L. et al. (2020). "The Pile: An 800GB Dataset of Diverse Text for Language Modeling." *arXiv:2101.00027*.
16. Penedo, G. et al. (2023). "The RefinedWeb Dataset for Falcon LLM." *arXiv:2306.01116*.
17. Soldaini, L. et al. (2024). "Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research." *arXiv:2402.00159*.
18. Penedo, G. et al. (2024). "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale." *arXiv:2406.17557*.
19. Li, J. et al. (2024). "DataComp-LM: In search of the next generation of training sets for language models." *arXiv:2406.11794*.
20. Wang, Y. et al. (2022). "Self-Instruct: Aligning Language Models with Self-Generated Instructions." *arXiv:2212.10560*.
21. Gunasekar, S. et al. (2023). "Textbooks Are All You Need." *arXiv:2306.11644*.
22. Cui, G. et al. (2023). "UltraFeedback: Boosting Language Models with High-quality Feedback." *arXiv:2310.01377*.
23. Bai, Y. et al. (2022). "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback." *arXiv:2204.05862*. (HH-RLHF)
24. Suresh, H., & Guttag, J. (2021). "A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle." *Equity and Access in Algorithms, Mechanisms, and Optimization*.
25. Gebru, T. et al. (2018). "Datasheets for Datasets." *arXiv:1803.09010*.
26. Villalobos, P. et al. (2022). "Will we run out of data? Limits of LLM scaling based on human-generated data." *arXiv:2211.04325*.
27. Shumailov, I. et al. (2024). "AI models collapse when trained on recursively generated data." *Nature*, 631, 755–759.
28. Scikit-learn developers (2026). "3.1. Cross-validation: evaluating estimator performance." *scikit-learn 1.8 documentation*. https://scikit-learn.org/stable/modules/cross_validation.html
29. Scikit-learn developers (2026). "11. Common pitfalls and recommended practices." *scikit-learn 1.8 documentation*. https://scikit-learn.org/stable/common_pitfalls.html
30. The New York Times Company v. Microsoft Corp. and OpenAI, Inc., No. 1:23-cv-11195 (S.D.N.Y. filed Dec. 27, 2023).
31. Mitchell, T. M. (1997). *Machine Learning*. McGraw-Hill, p. 2.
32. Strickland, E. (2022). "Andrew Ng: Unbiggen AI." *IEEE Spectrum*, February 9, 2022. https://spectrum.ieee.org/andrew-ng-data-centric-ai
