See also: Machine learning, Training set, Validation set, Test set
A dataset (also written as "data set") is a structured collection of data organized for analysis, processing, or machine learning tasks. In the context of artificial intelligence and machine learning, a dataset refers specifically to the body of information used to train, validate, and evaluate models. Datasets typically consist of individual data points (also called samples, instances, or examples), each described by a set of features (input variables) and, in supervised learning, corresponding labels (output variables or target values).
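The relationship between samples, features, and labels can be illustrated with a toy supervised dataset. The feature values and label names below are invented for illustration:

```python
# A toy supervised dataset: each row of X is one sample (its features),
# and the entry of y at the same index is that sample's label.
X = [
    [5.1, 3.5],  # sample 0: two numeric features
    [4.9, 3.0],  # sample 1
    [6.2, 2.9],  # sample 2
]
y = ["setosa", "setosa", "versicolor"]  # one label per sample

assert len(X) == len(y)  # supervised learning requires a label per sample
n_samples, n_features = len(X), len(X[0])
print(n_samples, n_features)  # → 3 2
```

In an unsupervised setting, the same samples would appear without `y`.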
Datasets are the foundation of every machine learning pipeline. The quality, size, diversity, and representativeness of a dataset directly influence the performance and reliability of the models built from it. As the field has matured, the creation, curation, documentation, and governance of datasets have become research areas in their own right.
The term "dataset" is now standard in most technical writing, though "data set" (two words) still appears in some formal and statistical contexts. Both forms are considered acceptable.
Datasets can be categorized by their modality, labeling status, and degree of structure. The following tables summarize the most common types encountered in machine learning, one table per grouping.
| Type | Description | Common Formats | Example Tasks |
|---|---|---|---|
| Tabular | Data organized in rows and columns, where each row is a sample and each column is a feature | CSV, Parquet, SQL databases | Classification, regression, fraud detection |
| Image | Collections of photographs or generated images, typically stored as pixel arrays | JPEG, PNG, TIFF | Image recognition, object detection, image segmentation |
| Text | Documents, sentences, or token sequences used for language tasks | Plain text, JSON, XML | Sentiment analysis, machine translation, question answering |
| Audio | Sound recordings or waveforms used for speech and acoustic tasks | WAV, MP3, FLAC | Speech recognition, speaker identification, music generation |
| Video | Sequential frames of visual data, sometimes with accompanying audio | MP4, AVI, WebM | Action recognition, video captioning, autonomous driving |
| Graph | Data representing nodes and edges, capturing relationships between entities | Adjacency matrices, edge lists, GraphML | Social network analysis, molecular property prediction, knowledge graphs |
| Type | Description | Typical Use |
|---|---|---|
| Labeled | Each sample is annotated with a ground-truth label or target value | Supervised learning |
| Unlabeled | Samples have no associated target values | Unsupervised learning, pretraining |
| Semi-labeled | A small portion of samples are labeled, while the majority are not | Semi-supervised learning |
| Type | Description | Examples |
|---|---|---|
| Structured | Data follows a predefined schema with clearly defined rows and columns | Spreadsheets, relational databases, CSV files |
| Unstructured | Data lacks a rigid format and requires preprocessing before use | Raw text documents, images, audio recordings |
| Semi-structured | Data has some organizational properties but does not conform to a strict tabular schema | JSON documents, XML files, HTML pages |
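The difference between structured and semi-structured data is easy to see in code. A sketch using Python's standard library, with invented example records:

```python
import csv
import io
import json

# Structured: CSV rows conform to a fixed schema of named columns.
csv_text = "id,age,label\n1,34,yes\n2,29,no\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["age"])  # → 34 (as a string; CSV itself carries no types)

# Semi-structured: JSON has nesting and named fields but no rigid
# tabular schema; fields may vary from record to record.
json_text = '{"id": 1, "tags": ["cat", "pet"], "meta": {"source": "web"}}'
record = json.loads(json_text)
print(record["meta"]["source"])  # → web
```

Note that the CSV reader returns every value as a string; recovering numeric types is itself a preprocessing step.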
Several datasets have played pivotal roles in advancing the field of machine learning and deep learning. These benchmark datasets serve as standard reference points for comparing model architectures and training techniques.
| Dataset | Year | Size | Description |
|---|---|---|---|
| MNIST | 1998 | 70,000 grayscale images (28x28 pixels) | Handwritten digit recognition dataset created by Yann LeCun, Corinna Cortes, and Chris Burges. Contains 60,000 training and 10,000 test images of digits 0 through 9. Often considered the "hello world" of machine learning. |
| CIFAR-10 | 2009 | 60,000 color images (32x32 pixels) | Created by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton at the Canadian Institute for Advanced Research. Contains 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck) with 6,000 images per class. |
| CIFAR-100 | 2009 | 60,000 color images (32x32 pixels) | An extension of CIFAR-10 with 100 fine-grained classes grouped into 20 superclasses, containing 600 images per class. |
| ImageNet | 2009 | 14+ million images | A large-scale image database organized according to the WordNet hierarchy, containing more than 14 million hand-annotated images across 20,000+ categories. The associated ImageNet Large Scale Visual Recognition Challenge (ILSVRC) catalyzed the deep learning revolution when AlexNet won the 2012 competition with a top-5 error rate of 15.3%, outperforming the runner-up by over 10 percentage points. |
| COCO (Common Objects in Context) | 2014 | 330,000+ images | A large-scale dataset for object detection, segmentation, and captioning. Each image is annotated with 80 object categories and 5 descriptive captions, making it one of the most widely used benchmarks in computer vision. |
| Dataset | Year | Description |
|---|---|---|
| SQuAD (Stanford Question Answering Dataset) | 2016 | A reading comprehension dataset consisting of questions posed on Wikipedia articles, where the answer to each question is a span of text from the corresponding passage. SQuAD 2.0 added unanswerable questions to increase difficulty. |
| GLUE (General Language Understanding Evaluation) | 2018 | A benchmark suite of nine natural language understanding tasks, including sentiment analysis, textual entailment, and paraphrase detection. GLUE became the standard evaluation framework for pretrained language models like BERT and RoBERTa. SuperGLUE was later introduced as a more challenging successor. |
| Common Crawl | 2008 (ongoing) | A nonprofit project that crawls the web and provides free archives of web page data. The corpus contains over 300 billion web pages, with monthly crawls adding billions more. Common Crawl data has been used to train many large language models, and a 2024 Mozilla report found that two-thirds of 47 generative LLMs released between 2019 and 2023 relied on Common Crawl data. |
Building a high-quality dataset involves multiple stages, each requiring careful planning and execution.
Data can be gathered from a variety of sources, including web scraping, sensors and IoT devices, user-generated content, public repositories, surveys, and existing organizational records.
For supervised learning tasks, raw data must be annotated with ground-truth labels. Annotation methods include manual labeling by domain experts, crowdsourcing platforms, programmatic (weak) labeling with heuristic rules, and model-assisted pre-labeling reviewed by humans.
Raw datasets often contain noise, missing values, duplicates, and inconsistencies. Preprocessing steps typically include removing duplicates, handling missing values, correcting inconsistent formats, and filtering out corrupted samples.
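Cleaning steps like deduplication and missing-value imputation can be sketched in plain Python (real pipelines typically use libraries such as pandas). The toy records below are invented:

```python
raw = [
    {"age": 34, "income": 52000},
    {"age": 34, "income": 52000},   # exact duplicate
    {"age": 29, "income": None},    # missing value
    {"age": 41, "income": 67000},
]

# 1. Drop exact duplicates while preserving order.
seen, deduped = set(), []
for row in raw:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(row)

# 2. Impute missing incomes with the mean of the observed values.
observed = [r["income"] for r in deduped if r["income"] is not None]
mean_income = sum(observed) / len(observed)
for r in deduped:
    if r["income"] is None:
        r["income"] = mean_income

print(len(deduped), mean_income)  # → 3 59500.0
```

Mean imputation is only one strategy; rows with missing values can also be dropped, or the gap can be filled by a model.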
Before training a model, a dataset is typically divided into three subsets, each serving a distinct purpose.
| Subset | Purpose | Typical Share |
|---|---|---|
| Training set | Used to fit the model's parameters during training | 60% to 80% |
| Validation set | Used to tune hyperparameters and monitor for overfitting during training | 10% to 20% |
| Test set | Used to evaluate the final model's performance on unseen data | 10% to 20% |
Common split ratios include 60/20/20, 70/15/15, and 80/10/10. For very large datasets (millions of samples), even smaller validation and test proportions (such as 98/1/1) can be sufficient because each subset still contains tens of thousands of examples.
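An 80/10/10 split can be sketched with the standard library. Shuffling before splitting avoids ordering artifacts, such as data sorted by class or by date:

```python
import random

samples = list(range(1000))  # stand-ins for 1,000 data points
rng = random.Random(42)      # fixed seed so the split is reproducible
rng.shuffle(samples)

n = len(samples)
n_train, n_val = int(0.8 * n), int(0.1 * n)
train = samples[:n_train]
val = samples[n_train:n_train + n_val]
test = samples[n_train + n_val:]

print(len(train), len(val), len(test))  # → 800 100 100
```

The three subsets are disjoint by construction, which is essential: any overlap between training and test data inflates the measured performance.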
Dataset bias occurs when the data used to train a model does not accurately represent the population or phenomenon the model is intended to serve. Biased datasets can lead to models that systematically disadvantage certain groups or produce inaccurate results for underrepresented populations.
| Bias Type | Description |
|---|---|
| Selection bias | The data collection process systematically excludes certain groups or scenarios |
| Measurement bias | The way data is recorded introduces systematic errors (e.g., different sensor calibrations across demographic groups) |
| Historical bias | The data reflects existing societal inequalities, and models trained on it perpetuate those inequalities |
| Representation bias | Certain groups are underrepresented in the dataset relative to their real-world prevalence |
| Label bias | Annotators apply labels inconsistently, or cultural biases influence the labeling process |
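A simple check for representation bias compares each group's share of the dataset against a reference share, such as census proportions. The group names, counts, and reference shares below are invented for illustration:

```python
from collections import Counter

dataset_groups = ["A"] * 700 + ["B"] * 250 + ["C"] * 50
reference_share = {"A": 0.5, "B": 0.3, "C": 0.2}

counts = Counter(dataset_groups)
total = sum(counts.values())
for group, ref in reference_share.items():
    actual = counts[group] / total
    # Flag groups at less than half their reference share (an arbitrary
    # threshold chosen for this sketch).
    flag = "UNDERREPRESENTED" if actual < 0.5 * ref else "ok"
    print(f"{group}: dataset {actual:.2f} vs reference {ref:.2f} -> {flag}")
```

Here group C makes up 5% of the dataset against a 20% reference share, so it would be flagged. Such checks only detect representation bias; selection, measurement, historical, and label bias require different audits.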
Several high-profile cases have demonstrated the consequences of dataset bias. The 2018 Gender Shades study found that commercial facial recognition systems had substantially higher error rates for darker-skinned women than for lighter-skinned men, and Amazon reportedly scrapped an internal recruiting tool after discovering that, having been trained on historical hiring data, it penalized resumes associated with women.
Addressing dataset bias requires intervention at multiple stages: collecting diverse and representative data, auditing datasets during curation, rebalancing or reweighting samples during training, and evaluating models on disaggregated metrics across demographic groups.
Synthetic datasets are artificially generated by algorithms to mimic the statistical properties of real-world data without containing actual real-world information. They have become increasingly important in modern machine learning workflows.
| Method | Description |
|---|---|
| Generative Adversarial Networks (GANs) | Two neural networks (a generator and a discriminator) compete to produce realistic synthetic samples |
| Variational Autoencoders (VAEs) | Encoder-decoder architectures that learn a latent representation and sample new data points from it |
| Large Language Models (LLMs) | Pretrained language models generate synthetic text data for NLP tasks |
| Statistical simulation | Traditional statistical models replicate the distributions and correlations found in real data |
| Rule-based generation | Domain-specific rules and templates produce structured synthetic records |
| Data augmentation | Existing samples are transformed (rotated, cropped, paraphrased) to create new training examples |
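The statistical-simulation row of the table can be sketched in a few lines: fit a Gaussian to a real numeric feature, then sample synthetic values with the same mean and standard deviation. The "real" measurements below are invented, and a single-feature Gaussian is the simplest possible case; practical generators also preserve correlations between features:

```python
import random
import statistics

real = [52.1, 48.7, 50.3, 49.9, 51.6, 47.8, 50.9, 49.2]
mu = statistics.mean(real)
sigma = statistics.stdev(real)

rng = random.Random(0)  # fixed seed for reproducibility
synthetic = [rng.gauss(mu, sigma) for _ in range(1000)]

# The synthetic sample's statistics approximate the real ones.
print(round(mu, 2), round(statistics.mean(synthetic), 2))
```

The synthetic values mimic the distribution of the original feature without reproducing any actual record, which is the core privacy argument for synthetic data.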
Some estimates suggest that over 60% of data used for AI applications by 2024 was synthetic, and this proportion is expected to continue growing.
Benchmark datasets are standardized, high-quality collections designed to evaluate and compare machine learning models in a fair and reproducible manner. They function as shared reference points: when every researcher measures their model against the same test data, it becomes straightforward to compare results and track progress.
Benchmark datasets serve several purposes:
Benchmarks also have notable limitations:
The relationship between dataset size and model performance is a central concern in machine learning. While more data generally leads to better models, the quality of that data matters at least as much as its quantity.
Research on neural scaling laws has shown that model performance improves predictably as training data, model size, and compute increase together. However, these gains depend on data quality. The Chinchilla scaling laws, published by DeepMind in 2022, emphasized that many large language models were undertrained relative to their size and that increasing the amount of high-quality training data could be more efficient than simply scaling model parameters.
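A widely cited rule of thumb distilled from the Chinchilla results is that compute-optimal training uses roughly 20 tokens of training data per model parameter; Chinchilla itself was a 70-billion-parameter model trained on about 1.4 trillion tokens. A rough illustration (the exact exponents come from fitted scaling laws, and 20x is only an approximation):

```python
TOKENS_PER_PARAM = 20  # approximate compute-optimal ratio

def compute_optimal_tokens(n_params: float) -> float:
    """Approximate training-token budget for a compute-optimal run."""
    return TOKENS_PER_PARAM * n_params

# Chinchilla: 70B parameters -> ~1.4T tokens.
print(f"{compute_optimal_tokens(70e9):.2e}")  # → 1.40e+12
```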
Proper documentation of datasets has become a recognized best practice in the machine learning community. Two prominent frameworks have emerged for this purpose.
Proposed by Timnit Gebru and colleagues in 2018, the "Datasheets for Datasets" framework draws an analogy to datasheets in the electronics industry, where every component is accompanied by documentation describing its characteristics and recommended uses. A dataset datasheet addresses questions about the dataset's motivation, composition, collection process, preprocessing and labeling, recommended uses, distribution, and maintenance.
Data Cards, developed at Google, provide a structured summary of essential information about a dataset across its life cycle. Beyond basic metadata, Data Cards include explanations, rationales, and instructions related to a dataset's provenance, representation, intended uses, and fairness evaluations. Google published Data Cards alongside the Open Images dataset, and the approach has been adopted by organizations including Hugging Face.
Companies including Microsoft, Google, and IBM have begun piloting datasheets for datasets within their product teams, and academic researchers increasingly publish datasets with accompanying documentation.
As datasets evolve over time (with corrections, additions, and schema changes), tracking those changes becomes essential for reproducibility and debugging.
DVC is an open-source version control system designed for data science and machine learning projects. It extends Git workflows to handle large files, datasets, and models without storing them directly in the Git repository.
Key features of DVC include Git-compatible workflows, support for remote storage backends such as Amazon S3 and Google Cloud Storage, reproducible pipeline definitions, and lightweight metafiles that reference data by content hash while the data itself is stored outside the Git repository.
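The core idea behind DVC-style versioning (store the large file in a content-addressed cache, track only a small hash pointer in version control) can be sketched in a few lines. The file and directory names below are invented for illustration, and real DVC metafiles use a YAML format rather than JSON:

```python
import hashlib
import json
import os
import shutil
import tempfile

def snapshot(path: str, cache_dir: str) -> str:
    """Copy `path` into a hash-addressed cache; return a small pointer."""
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    os.makedirs(cache_dir, exist_ok=True)
    shutil.copy(path, os.path.join(cache_dir, digest))
    return json.dumps({"path": os.path.basename(path), "md5": digest})

workdir = tempfile.mkdtemp()
data_file = os.path.join(workdir, "data.csv")
with open(data_file, "w") as f:
    f.write("id,label\n1,cat\n")

pointer = snapshot(data_file, os.path.join(workdir, "cache"))
print(pointer)  # the pointer stays tiny even if data.csv were gigabytes
```

Because the cache is keyed by content hash, re-snapshotting an unchanged file stores nothing new, and any historical version can be restored from its hash.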
Other data versioning tools include Git LFS, Dolt, lakeFS, and MLflow. In November 2025, lakeFS announced its acquisition of DVC.
The license attached to a dataset determines how it can be used, shared, and modified. Understanding dataset licenses is critical for both legal compliance and ethical practice.
| License | Permissions | Key Restrictions |
|---|---|---|
| CC0 (Public Domain) | Unrestricted use, modification, and distribution | None |
| CC BY (Attribution) | Use, modify, and distribute with credit | Must attribute the original creator |
| CC BY-SA (Attribution-ShareAlike) | Use, modify, and distribute with credit | Derivative works must use the same license |
| CC BY-NC (Attribution-NonCommercial) | Use and modify with credit | No commercial use |
| CC BY-NC-SA | Use and modify with credit | No commercial use; derivatives must use the same license |
| ODC-PDDL (Public Domain) | Unrestricted use | None |
| ODC-By (Attribution) | Use and modify with credit | Must attribute the source |
| ODbL (Open Database License) | Use, modify, and distribute | Must attribute, share alike, and keep open |
| CDLA-Permissive | Use, modify, and redistribute | Minimal restrictions, similar to MIT license for software |
For open data, CC0 and ODC-PDDL are the most permissive options. Researchers should note that software licenses (such as MIT or Apache 2.0) are not always appropriate for datasets, and purpose-built data licenses like those from Creative Commons or Open Data Commons are generally preferred.
Hugging Face operates the largest public hub for machine learning datasets, hosting over 600,000 datasets as of 2024. The platform supports datasets in more than 8,000 languages, spanning tasks in natural language processing, computer vision, and audio processing.
Key features of the Hugging Face Datasets Hub include one-line programmatic loading through the open-source `datasets` library, community-contributed dataset cards that document provenance and intended use, and streaming support: the `datasets` library allows users to process datasets that are too large to fit in memory by streaming data on the fly.

The Hugging Face Datasets Hub has become a central resource for the machine learning community, providing easy access to datasets for research, education, and application development.
Before using a dataset to train a machine learning model, it must be preprocessed. Common preprocessing tasks include cleaning the data, normalizing numerical features, encoding categorical variables, and performing feature engineering. Preprocessing is a critical step in the machine learning pipeline because the quality of input data directly affects model accuracy and generalization.
Additional preprocessing techniques include feature scaling (standardization or min-max normalization), encoding categorical variables, dimensionality reduction, resampling to address class imbalance, and tokenization for text data.
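Two of these techniques, one-hot encoding of a categorical feature and min-max normalization of a numeric one, can be sketched in plain Python. The feature values below are invented:

```python
colors = ["red", "green", "blue", "green"]
sizes = [10.0, 20.0, 15.0, 30.0]

# One-hot encoding: each category becomes its own binary indicator column.
categories = sorted(set(colors))  # ['blue', 'green', 'red']
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]

# Min-max normalization: rescale numeric values into the range [0, 1].
lo, hi = min(sizes), max(sizes)
scaled = [(s - lo) / (hi - lo) for s in sizes]

print(one_hot[0])  # → [0, 0, 1]  (the encoding of 'red')
print(scaled)      # → [0.0, 0.5, 0.25, 1.0]
```

In practice these transformations must be fitted on the training set only and then applied unchanged to the validation and test sets, to avoid leaking information about held-out data.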
Imagine you have a big box of flash cards. Each flash card has a picture on one side and the name of what is in the picture on the other side. If you show a computer thousands of these flash cards, it starts to learn what a cat looks like versus what a dog looks like. That box of flash cards is a dataset.
The more flash cards you have, and the more different kinds of pictures they show, the better the computer gets at recognizing things it has never seen before. But if all your flash cards only show golden retrievers, the computer might think every dog is a golden retriever. That is why it is important to have lots of different examples in your dataset.
Before you start teaching the computer, you split the flash cards into three piles: a big pile for learning (the training set), a medium pile for checking its progress (the validation set), and a small pile you hide away for a final test (the test set). That way, you can be sure the computer actually learned and is not just memorizing the answers.