See also: Machine learning terms
A training set is the portion of a dataset used to fit a machine learning model. During training, the model processes the examples in this set to learn patterns, adjust its internal parameters, and build the mathematical relationships needed to make predictions on new, unseen data. The training set is one of three standard data partitions in supervised machine learning, alongside the validation set and the test set.
The quality, size, and representativeness of the training set have a direct impact on a model's accuracy, generalization ability, and fairness. Poorly constructed training sets can lead to overfitting, bias, and unreliable predictions in production.
In a typical supervised machine learning workflow, a practitioner starts with a full dataset of labeled examples. Each example consists of input features paired with a known output (the label or target). The dataset is then split into non-overlapping subsets:
| Subset | Purpose | When Used |
|---|---|---|
| Training set | Teaches the model by exposing it to labeled examples so it can learn patterns and adjust weights | During model training |
| Validation set | Provides feedback for tuning hyperparameters and detecting overfitting | During model development |
| Test set | Gives an unbiased estimate of final model performance on data it has never seen | After training is complete |
The model iterates over the training set multiple times (each full pass is called an epoch). In each pass, it evaluates a loss function that measures its prediction error, then uses an optimizer (such as stochastic gradient descent) to update its parameters in the direction that reduces the loss. This cycle repeats until the model converges or a stopping criterion is met.
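As a minimal sketch of that loop, the following fits a linear model with full-batch gradient descent (a simpler stand-in for stochastic gradient descent) on synthetic data; every name and value here is illustrative:

```python
import numpy as np

# Toy training set: 100 labeled examples with 3 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)  # labels with a little noise

w = np.zeros(3)              # model parameters, initialized to zero
learning_rate = 0.1
for epoch in range(50):      # each full pass over the training set is one epoch
    predictions = X @ w
    loss = np.mean((predictions - y) ** 2)          # mean squared error loss
    gradient = 2 * X.T @ (predictions - y) / len(y) # gradient of the loss w.r.t. w
    w -= learning_rate * gradient                   # step in the direction that reduces the loss

print(w)  # converges toward true_w
```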
There is no single "correct" ratio for dividing data. The best split depends on the total dataset size, the complexity of the model, and the number of hyperparameters to tune. The following ratios are widely used in practice:
| Split Scheme | Training | Validation | Test | Best For |
|---|---|---|---|---|
| 80 / 10 / 10 | 80% | 10% | 10% | Large datasets (100k+ samples) |
| 70 / 15 / 15 | 70% | 15% | 15% | Medium datasets (10k–100k samples) |
| 60 / 20 / 20 | 60% | 20% | 20% | Smaller datasets or complex models |
When only a train/test split is needed (without a separate validation set), common ratios are 80/20 or 75/25. In deep learning with very large datasets (millions of examples), practitioners sometimes allocate as much as 98% to training, because even 1% of a massive dataset provides thousands of validation or test examples.
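One common way to produce a three-way split is to call scikit-learn's train_test_split twice; the sketch below assumes a toy dataset from make_classification and yields a 70/15/15 partition:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in for a real labeled dataset.
X, y = make_classification(n_samples=1000, random_state=42)

# First hold out 30%, then split that holdout evenly into validation and test.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42)
# Result: 70% training, 15% validation, 15% test.
```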
How data is divided matters as much as the ratio itself. The three main strategies are random splitting, stratified splitting, and temporal splitting.
Random splitting shuffles the dataset and assigns samples to each subset based on the chosen ratio. It is the simplest and most common approach, and works well when the data is large and roughly balanced across classes. However, random splitting can produce subsets with skewed class distributions, especially when working with imbalanced datasets.
Stratified splitting forces each subset (training, validation, test) to preserve the same class distribution as the original dataset. For example, if the full dataset contains 90% negative examples and 10% positive examples, each split will maintain that 90/10 ratio. Stratified splitting is essential for imbalanced classification problems, because a purely random split could place very few minority-class examples in the validation or test sets, producing unreliable performance estimates. Scikit-learn provides StratifiedShuffleSplit and StratifiedKFold for this purpose.
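A minimal sketch of stratified splitting with StratifiedShuffleSplit, using a made-up 90/10 imbalanced dataset:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Imbalanced toy data: 90% class 0, 10% class 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([0] * 900 + [1] * 100)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y))

# Both subsets preserve the original 90/10 class ratio.
print(np.bincount(y[train_idx]) / len(train_idx))  # ~[0.9, 0.1]
print(np.bincount(y[test_idx]) / len(test_idx))    # ~[0.9, 0.1]
```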
For time series data, random shuffling is not appropriate because it would cause data leakage: the model would train on future data and be tested on past data. Temporal splitting uses a chronological cutoff point. All data before the cutoff becomes the training set, and all data after the cutoff forms the validation or test set. Scikit-learn's TimeSeriesSplit implements an expanding-window variant where each fold adds more historical data to the training window while testing on the next time period.
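The following sketch shows TimeSeriesSplit's expanding window on a dozen chronologically ordered observations (the data here is just a placeholder):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 observations in time order (e.g., monthly measurements).
X = np.arange(12).reshape(-1, 1)

for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    # The training window grows; the test fold is always strictly later in time.
    print("train:", train_idx, "test:", test_idx)
```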
The relationship between training set size and model performance is one of the most studied topics in machine learning. Understanding this relationship helps practitioners decide whether to invest in collecting more data or to focus on improving the model architecture.
A learning curve plots model performance (such as accuracy or loss) on the y-axis against the number of training examples on the x-axis. Two curves are typically drawn together: one for training performance and one for validation performance.
| Observation | What It Means |
|---|---|
| Training score is high, validation score is low | The model is overfitting; it memorizes training examples but fails to generalize |
| Both scores are low | The model is underfitting; it lacks the capacity to learn the patterns |
| Both scores converge to a high value | The model generalizes well; additional data may not help much |
| Validation score keeps climbing as training size grows | More data is likely to improve the model further |
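scikit-learn's learning_curve utility computes both curves directly; a minimal sketch, assuming a toy classification dataset and logistic regression as the model:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, random_state=0)

# Accuracy at five increasing training-set sizes, cross-validated 5 ways.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

print(sizes)                      # number of training examples per point
print(train_scores.mean(axis=1))  # training curve
print(val_scores.mean(axis=1))    # validation curve
```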
Empirical research shows that learning curves often follow a power law: performance improves rapidly at first, then the rate of improvement slows as data grows. This is consistent with the observation that early training examples contribute the most new information, while later examples increasingly duplicate patterns the model has already seen.
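A common parametric form for such curves is error(n) ≈ a·n^(−b) + c, where n is the training-set size and c is the irreducible error floor. As an illustration, the parameters can be fitted with SciPy; the measurements below are entirely made up:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    # Validation error falls as a * n^(-b), flattening toward the floor c.
    return a * n ** (-b) + c

# Hypothetical (training size, validation error) measurements.
n = np.array([100.0, 300.0, 1000.0, 3000.0, 10000.0])
err = np.array([0.30, 0.22, 0.15, 0.11, 0.09])

params, _ = curve_fit(power_law, n, err, p0=(1.0, 0.5, 0.05))
print(params)  # fitted a, b, c; extrapolate cautiously beyond observed sizes
```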
More training data generally helps when the model has high variance (i.e., it overfits on small datasets), the problem is complex with many possible patterns to learn, or the feature space is high-dimensional. Deep neural networks, which can have millions or billions of parameters, are particularly data-hungry and tend to improve steadily with larger training sets.
Models eventually reach a plateau where additional data yields diminishing returns. This happens when the model has already captured the key patterns and further examples are largely redundant. As one researcher noted: "When you've read a million reviews on Yelp, maybe the next reviews don't give you that much." At this point, improving data quality, feature engineering, or switching to a better model architecture may produce larger gains than simply adding more data. Recent evidence from large language model scaling research suggests that high-quality data can outperform raw quantity, with smaller models trained on curated data sometimes matching the performance of larger models trained on noisier corpora.
The principle of "garbage in, garbage out" applies directly to training sets. A model can only learn patterns that exist in its training data, so the quality of that data determines the ceiling on model performance.
| Dimension | Description |
|---|---|
| Accuracy | Labels must be correct. Mislabeled examples teach the model wrong associations. |
| Completeness | Missing values or incomplete records can introduce noise and reduce learning effectiveness. |
| Consistency | Contradictory examples (same input mapped to different outputs) confuse the learning algorithm. |
| Relevance | The training data should reflect the actual conditions and distribution the model will encounter in production. |
| Freshness | Outdated data can lead to models that do not reflect current patterns or trends. |
Before training, data typically goes through several preprocessing steps: cleaning (removing duplicates, correcting errors), handling missing values (imputation or removal), normalization or standardization of numerical features, encoding categorical variables, and outlier detection. These steps can significantly improve the effectiveness of the training set without adding a single new example.
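Several of these steps can be chained in a single scikit-learn pipeline; the sketch below assumes a hypothetical table with one numeric and one categorical column:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical raw data with a missing value and a categorical feature.
df = pd.DataFrame({
    "age": [34, None, 51, 29],           # numeric column with a gap to impute
    "city": ["NYC", "LA", "NYC", "SF"],  # categorical column to encode
})

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # fill missing values
        ("scale", StandardScaler()),                   # standardize numerics
    ]), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

X_ready = preprocess.fit_transform(df)  # cleaned, scaled, encoded features
```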
A training set is biased when it does not accurately represent the population or conditions the model will encounter in deployment. Bias in training data leads to models that perform well on certain groups or scenarios but poorly on others.
| Bias Type | Description | Example |
|---|---|---|
| Selection bias | The training data is not randomly sampled from the target population | A hiring model trained only on data from one company |
| Sampling bias | Certain groups are over- or underrepresented | A facial recognition model trained mostly on light-skinned faces |
| Temporal bias | The training data reflects conditions from a specific time period that may not hold in the future | A credit scoring model trained on pre-pandemic financial data |
| Measurement bias | Systematic errors in how data was collected or labeled | Inconsistent labeling criteria across different annotators |
| Historical bias | The training data reflects existing societal prejudices | A language model trained on text that contains gender stereotypes |
Practitioners can reduce training set bias by collecting more diverse and representative data, applying stratified sampling to ensure all subgroups are proportionally included, auditing datasets for demographic and distributional imbalances, using data augmentation to synthetically increase representation of underrepresented groups, and implementing fairness-aware preprocessing techniques.
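A simple distributional audit can catch some of these imbalances before training; the sketch below compares subgroup shares in a hypothetical training table against assumed population shares (the data, column name, and tolerance are all illustrative):

```python
import pandas as pd

# Stand-in training data and an assumed target distribution.
train = pd.DataFrame({"group": ["A", "A", "A", "B"]})
population_share = {"A": 0.5, "B": 0.5}

train_share = train["group"].value_counts(normalize=True)
for group, expected in population_share.items():
    observed = train_share.get(group, 0.0)
    if abs(observed - expected) > 0.1:  # arbitrary tolerance for illustration
        print(f"{group}: {observed:.0%} in training vs {expected:.0%} expected")
```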
The training sets used for modern large language models (LLMs) are orders of magnitude larger than those used in traditional machine learning. These models consume trillions of tokens from diverse text sources during pretraining.
| Source | Description | Scale |
|---|---|---|
| Common Crawl | Petabytes of raw web data extracted from billions of web pages, updated monthly | Hundreds of billions of tokens per snapshot |
| Wikipedia | Structured encyclopedia articles across hundreds of languages | ~6.8 million English articles, ~4.7 billion words |
| Books | BookCorpus (~11,000 books), Project Gutenberg (~70,000 public domain books) | Tens of billions of tokens |
| Code | GitHub repositories, Stack Overflow, Jupyter notebooks | StarCoder dataset: 783 GB across 86 languages |
| Scientific papers | arXiv, PubMed, Semantic Scholar | Billions of tokens of technical text |
| Curated collections | The Pile (825 GiB), RedPajama (1.2 trillion tokens), FineWeb | Purpose-built for LLM training |
GPT-3 was trained on a mixture of approximately 300 billion tokens drawn from Common Crawl (filtered), WebText2, Books1, Books2, and Wikipedia. LLaMA drew from Common Crawl, C4, GitHub, Wikipedia, books, arXiv, and Stack Exchange. These training sets are carefully weighted: higher-quality sources like Wikipedia and books are often sampled multiple times per epoch, while noisier web data is downsampled.
Research projections suggest that publicly available, high-quality human-generated text could be largely exhausted sometime between 2026 and 2032, which is driving interest in synthetic data generation and more efficient data curation methods.
Synthetic training data is artificially generated data that mimics the statistical properties of real-world data. It is produced using rule-based systems, simulation engines, or generative models such as GANs and diffusion models.
Synthetic data is particularly useful when real data is scarce, expensive to collect, or restricted by privacy regulations. For example, medical imaging datasets can be augmented with synthetic scans to help models learn rare pathologies that appear infrequently in real patient data. Similarly, autonomous driving systems use simulated environments to generate training data for edge cases that are dangerous or impractical to capture on real roads.
However, synthetic data has limitations. If the generative process does not capture the full complexity of real-world data, models trained on synthetic data may underperform in deployment. Validating the quality of synthetic data is itself a non-trivial challenge.
Data augmentation is a family of techniques that create new training examples by applying transformations to existing ones. Unlike synthetic data, which is generated from scratch, augmentation modifies real samples to increase diversity.
| Domain | Techniques |
|---|---|
| Images | Rotation, flipping, cropping, scaling, color jitter, Gaussian noise, cutout, mixup |
| Text | Synonym replacement, random insertion, random deletion, back-translation, paraphrasing |
| Audio | Time stretching, pitch shifting, noise injection, speed perturbation |
| Tabular data | SMOTE (Synthetic Minority Over-sampling Technique), random noise addition |
Data augmentation can reduce overfitting by making the training set appear larger and more varied. It is especially valuable when the original dataset is small or imbalanced. However, augmentation is bounded by the quality and diversity of the original data. If the original training set contains systematic biases, augmented copies will carry those same biases.
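As one concrete sketch of image augmentation, the following uses torchvision (one common choice among several); the specific transforms and parameters are illustrative, and a recent torchvision version is assumed so the transforms accept plain tensors:

```python
import torch
from torchvision import transforms

# A typical augmentation recipe combining flips, rotation, and color jitter.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

image = torch.rand(3, 224, 224)  # stand-in for a real RGB training image
new_example = augment(image)     # a transformed copy; the label stays the same
```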
Cross-validation provides a way to use all available data for both training and evaluation. In k-fold cross-validation, the dataset is divided into k equal-sized folds. The model is trained k times, each time using a different fold as the validation set and the remaining k−1 folds as the training set. The final performance estimate is the average across all k runs.
Cross-validation is particularly useful when data is limited, because every example eventually serves in both the training and validation roles. Common values for k are 5 and 10. Stratified k-fold cross-validation combines folding with stratified sampling to maintain class proportions in each fold.
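A minimal sketch of stratified 5-fold cross-validation in scikit-learn, using a toy imbalanced dataset and logistic regression as the model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Toy 90/10 imbalanced classification data.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# Each of the 5 folds preserves the 90/10 class ratio.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean())  # average performance across all 5 runs
```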
Imagine you are learning to tell the difference between cats and dogs by looking at pictures. Your parent shows you a big stack of photos, each with a label that says "cat" or "dog." That stack of photos is the training set. You study the photos and start to notice things: cats have pointy ears, dogs have floppy ears, and so on.
After you finish studying, your parent gives you a new batch of photos you have never seen before and asks, "Is this a cat or a dog?" Those new photos are the test set. The better and more varied the photos in your study stack were, the better you will be at identifying cats and dogs in the new batch.
If all the cat photos in your study stack showed only orange tabby cats, you might not recognize a black cat when you see one. That is why the training set needs to include lots of different examples.