Datasets
Last reviewed
Sources
21 citations
Review status
Source-backed
Revision
v4 · 2,654 words
In machine learning, a dataset is a structured collection of examples used to fit, tune, and evaluate models, where each example pairs input data (features) with an optional target (label or response). Datasets range from the 150-row Iris file used in introductory statistics to the multi-terabyte web crawls behind modern language models, and they are typically partitioned into training, validation, and test splits so a model can be trained on one portion and measured on data it has never seen. The reliability of almost every machine learning result depends on how a dataset is built, documented, and split: leakage between these splits is one of the most common reasons measured accuracy collapses in production.
See also: Machine learning terms, dataset
In machine learning, datasets are collections of examples used to fit, tune, and evaluate models. Each example usually combines input data (features) with a target value (label or response), although unsupervised learning tasks work with unlabeled examples. A dataset can be as small as the 150-row Iris file used in introductory statistics courses or as large as the multi-terabyte web crawls behind modern language models. The word "dataset" is used loosely: sometimes it means a single CSV file, sometimes a labeled benchmark like MNIST, sometimes a corpus of raw text scraped from the open web.
This article covers datasets as a general concept; the Hugging Face datasets library, a Python package built on Apache Arrow for loading and processing datasets at scale, is covered separately below. Famous datasets such as ImageNet and Common Crawl are introduced briefly below and have their own pages where applicable.
At the most basic level, a dataset is a table. Rows are examples (also called samples, instances, or records); columns are features (also called attributes or variables); and one or more columns may serve as the target the model is learning to predict. The notation X is conventional for the feature matrix and y for the label vector, following the linear algebra convention used in scikit-learn and most teaching materials.
Features come in several types. Numeric features carry continuous or discrete numbers such as age or pixel intensity. Categorical features carry discrete labels such as country or product code, often encoded with one-hot vectors or learned embeddings. Text features are sequences of tokens that may be processed by a tokenizer before training. Image features are arrays of pixel values, typically of shape (height, width, channels). Audio features are waveforms or spectrograms. Time series features are ordered sequences with a timestamp index.
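One-hot encoding of a categorical feature can be sketched in a few lines. This is a minimal illustration in plain Python, not the encoder of any particular library; the function name `one_hot_encode` and the toy country codes are assumptions made for the example.

```python
def one_hot_encode(values):
    """Map each categorical value to a one-hot vector.

    Returns the sorted category list and one vector per input value.
    """
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        vector = [0] * len(categories)
        vector[index[value]] = 1  # exactly one position is set per example
        vectors.append(vector)
    return categories, vectors


# Hypothetical toy column; the country codes are illustrative only.
countries = ["DE", "FR", "DE", "JP"]
categories, encoded = one_hot_encode(countries)
print(categories)  # ['DE', 'FR', 'JP']
print(encoded[0])  # [1, 0, 0]
```

The vector length grows with the number of distinct categories, which is one reason high-cardinality columns are often handled with learned embeddings instead.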
Many real datasets mix modalities. A product catalog might pair a numeric price column, a categorical brand column, a short text description, and a product photo. Multimodal models depend on datasets that align modalities carefully so the same row describes the same product.
A single dataset is rarely used in one piece. To produce a model that generalizes to new inputs, practitioners split the data into three subsets with different roles, summarized in the table below.[1][2]
| Split | Typical share | Purpose |
|---|---|---|
| Training set | 60 to 80 percent | Used to fit model parameters by gradient descent or another optimization method. |
| Validation set | 10 to 20 percent | Used during development to tune hyperparameters, choose between architectures, and detect overfitting through early stopping. |
| Test set | 10 to 20 percent | Used once, at the end, to estimate how the chosen model will behave on unseen data. |
The ratios are conventions, not rules. With very large datasets the test set can be small in relative terms and still contain tens of thousands of examples. With small datasets, holding back 30 percent for evaluation may waste too much training signal; researchers often use k-fold cross-validation instead, where the data is split into k folds, the model is trained k times, each time on k minus one folds with the remaining fold held out for evaluation, and the k scores are averaged. Scikit-learn provides train_test_split and KFold utilities for these patterns.[3]
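The two splitting patterns described above can be sketched with the standard library alone. This is an illustration of the idea rather than scikit-learn's implementation; the helper names `three_way_split` and `k_fold_indices` and the default shares are assumptions chosen for the example.

```python
import random


def three_way_split(examples, val_share=0.15, test_share=0.15, seed=0):
    """Shuffle once, then cut into train, validation, and test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_share)
    n_val = round(len(shuffled) * val_share)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


def k_fold_indices(n_examples, k):
    """Yield (train_indices, held_out_indices) for each of k folds."""
    indices = list(range(n_examples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        held_out = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, held_out
        start += size


train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))  # 70 15 15

folds = list(k_fold_indices(10, k=5))
print(folds[0][1])  # [0, 1]
```

Every example appears in exactly one held-out fold, so averaging the k scores uses all of the data for evaluation without ever scoring a model on examples it was trained on.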
Terminology varies. Some papers use "dev set" for what others call validation, and some communities reserve "test" for a held out benchmark that is only evaluated once at the end of a project.
Splits are easy to get wrong. Data leakage happens when information from outside the training set, especially from the test set, sneaks into the model during fitting.[4] Two common patterns are target leakage, where a feature reveals the answer (for example a column filled in only after the outcome was known), and train-test contamination, where preprocessing such as feature scaling, imputation, or feature selection is fitted on the full dataset before splitting. Means and standard deviations used for normalization should come only from the training portion.
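The rule about normalization statistics can be made concrete with a small sketch. The helper names `fit_standardizer` and `standardize` and the toy age values are assumptions for illustration; the point is only that the mean and standard deviation are learned from the training portion and then reused unchanged on held-out data.

```python
import statistics


def fit_standardizer(train_values):
    """Learn mean and standard deviation from the training portion only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    if std == 0:
        std = 1.0  # avoid division by zero for a constant feature
    return mean, std


def standardize(values, mean, std):
    """Apply previously fitted statistics; never refit on validation or test data."""
    return [(v - mean) / std for v in values]


# Hypothetical numeric feature, already split.
train_ages = [20.0, 30.0, 40.0]
test_ages = [50.0]

mean, std = fit_standardizer(train_ages)
train_scaled = standardize(train_ages, mean, std)
test_scaled = standardize(test_ages, mean, std)
```

Fitting the statistics on `train_ages + test_ages` instead would shift the mean toward the test value, which is exactly the contamination the paragraph above warns about.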
Leakage inflates measured performance during development and collapses it in production. The scikit-learn documentation flags this as one of the most common pitfalls in the field and recommends pipelines that fit preprocessing inside cross-validation folds.[3] For time series, the split must respect time order, so the model is never trained on data from later than the period it is evaluated on. The related risk for large language models is benchmark contamination, where evaluation questions appear in the pretraining corpus; because web-scale corpora such as Common Crawl overlap with many public benchmarks, decontamination filtering has become a standard step before training.
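A time-ordered split can be sketched as a cutoff on the timestamp rather than a random shuffle. The function name `time_ordered_split`, the record fields, and the sales figures below are hypothetical and serve only to illustrate the idea.

```python
from datetime import date


def time_ordered_split(records, cutoff):
    """Train on records strictly before the cutoff; evaluate on the rest."""
    ordered = sorted(records, key=lambda r: r["day"])
    train = [r for r in ordered if r["day"] < cutoff]
    test = [r for r in ordered if r["day"] >= cutoff]
    return train, test


# Hypothetical daily sales records; field names are illustrative.
sales = [
    {"day": date(2024, 1, 3), "units": 12},
    {"day": date(2024, 1, 1), "units": 9},
    {"day": date(2024, 1, 4), "units": 15},
    {"day": date(2024, 1, 2), "units": 11},
]
past, future = time_ordered_split(sales, cutoff=date(2024, 1, 3))
```

Every training record here predates every evaluation record, so the measured score reflects forecasting rather than interpolation between known points.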
Datasets are stored in many formats, chosen for size, speed, and tooling support. The major options are summarized below.[5]
| Format | Origin | Strengths | Typical use |
|---|---|---|---|
| CSV | Plain text, decades old | Human readable, opens in any tool | Small tabular data, teaching, sharing |
| JSON / JSONL | Web standard | Schemaless, easy to extend | NLP datasets where each line is one example |
| Parquet | Apache, columnar | Compression, partial column reads, fast scans | Large tabular and analytics workloads |
| Apache Arrow / Feather | Apache, columnar in memory | Zero copy reads, memory mapping | Cross language interchange, Hugging Face cache |
| TFRecord | Google, protocol buffers | Streamable shards, native to TensorFlow | Large image and audio pipelines |
| HDF5 | NCSA, hierarchical binary | Stores high dimensional arrays in one file | Scientific data, large image stacks |
| WebDataset | tar based | Streams from object storage, plays well with shards | Web scale vision and audio training |
CSV remains popular because it is text and works everywhere, but it has real limits: no schema, no compression by default, and slow to parse at scale. Parquet stores values column by column, which lets the reader skip columns and compress each column efficiently. TFRecord packages examples as serialized protocol buffers and supports sharding, which matters when several training workers read the same data in parallel.
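The "no schema" limitation of CSV is easy to see in a round trip. The sketch below uses only Python's standard `csv` and `json` modules with a made-up two-row table; it compares CSV with JSONL and does not touch Parquet or TFRecord, which need third-party libraries.

```python
import csv
import io
import json

# Hypothetical rows; the column names are illustrative.
rows = [{"id": 1, "price": 9.5}, {"id": 2, "price": 12.0}]

# CSV round trip: values are written as text and read back as strings.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["id", "price"])
writer.writeheader()
writer.writerows(rows)
csv_buffer.seek(0)
csv_rows = list(csv.DictReader(csv_buffer))

# JSONL round trip: one JSON object per line, numeric types preserved.
jsonl_text = "\n".join(json.dumps(row) for row in rows)
jsonl_rows = [json.loads(line) for line in jsonl_text.splitlines()]

print(csv_rows[0])    # {'id': '1', 'price': '9.5'}
print(jsonl_rows[0])  # {'id': 1, 'price': 9.5}
```

With CSV the reader has to know, or guess, that `price` is a number; formats that carry type information remove that guesswork.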
HDF5 is widely used in scientific computing for high-dimensional arrays that do not fit a tabular layout, but it has well-known problems with concurrent reads.[5] Apache Arrow was designed in part to address those issues.[6] The Hugging Face datasets library stores its on-disk cache as Arrow tables, which lets it memory-map huge files and serve random samples to the GPU without loading everything into RAM.
The UCI Machine Learning Repository, created as an FTP archive in 1987 by PhD student David Aha, has hosted classic benchmarks such as Iris, Wine, and Adult Census for nearly four decades. As of 2024 it lists 689 datasets and is still a default landing place for teaching examples.[7]
Kaggle launched in 2010 as a competition platform and now hosts a large catalog of public datasets contributed by users, governments, and companies. Many real industrial problems first reach a wide audience as a Kaggle competition.
The Hugging Face Hub has become the de facto sharing layer for modern NLP, vision, and audio datasets. It pairs each dataset with a README-based dataset card that describes source, licensing, and known biases.[13] The Hub hosted over 450,000 public datasets as of mid-2025 and surpassed 730,000 by 2026, making it the largest single catalog of machine learning datasets.[18]
TensorFlow Datasets and torchvision.datasets ship curated catalogs that download, verify, and split common benchmarks with a single line of code, keeping reproductions consistent across labs.
Government open data portals (data.gov, EU Open Data, national statistics offices) and academic archives such as Zenodo and Figshare carry datasets used in research papers, often with persistent DOIs.
A few datasets keep showing up because they shaped the field.
MNIST, assembled around 1994 by Yann LeCun, Corinna Cortes, and Christopher Burges from earlier NIST collections, contains 60,000 training and 10,000 test images of handwritten digits at 28 by 28 grayscale resolution.[8] The digits were size normalized into a 20 by 20 box and centered by center of mass within the 28 by 28 field, a preprocessing recipe documented on the original MNIST page.[8] For two decades it was the default benchmark for digit recognition.
CIFAR-10 and CIFAR-100 were published in 2009 by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton as labeled subsets of the 80 Million Tiny Images collection.[17] CIFAR-10 has 60,000 colour images at 32 by 32 across 10 classes; CIFAR-100 has the same total across 100 classes grouped into 20 superclasses.[9] They bridged MNIST and harder natural image benchmarks.
ImageNet, conceived by Fei-Fei Li in 2006 and labeled between 2008 and 2010 with help from about 49,000 Mechanical Turk workers, contains roughly 14 million images organized under the WordNet hierarchy across 21,841 populated synsets.[10] The ImageNet Large Scale Visual Recognition Challenge ran from 2010 to 2017 on a 1,000 class subset.[10] The 2012 result, where AlexNet cut top-5 error to 15.3 percent versus 26.2 percent for the runner up, is often cited as the start of the deep learning era.[19]
Common Crawl, founded by Gil Elbaz in 2007, publishes monthly snapshots of the public web, each typically containing roughly 2 to 2.5 billion pages; the cumulative archive across all snapshots is far larger than any single crawl.[11] It is one of the main raw corpora behind large language models from Anthropic, OpenAI, Google DeepMind, and others, usually in the form of heavily filtered derivatives such as C4 and RefinedWeb. A 2024 Mozilla Foundation study found that at least 64 percent of 47 large language models released between 2019 and 2023 were trained on filtered versions of Common Crawl.[20]
Other frequently cited collections include SQuAD for question answering, COCO for object detection, LibriSpeech for speech recognition, and The Pile and RedPajama for open language model pretraining.
Raw datasets rarely go straight into a model. Preprocessing cleans missing values, removes duplicates, encodes categorical variables, and rescales numeric features. Feature engineering builds new columns that make the prediction task easier, for example extracting hour of day from a timestamp or combining latitude and longitude into a city cluster.
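The cleaning and feature-engineering steps named above can be sketched on a toy table. The function name `preprocess`, the field names, and the choice of mean imputation are assumptions made for this example, not a recommended recipe.

```python
from datetime import datetime


def preprocess(records):
    """Drop exact duplicates, fill missing amounts, and add an hour-of-day feature."""
    seen = set()
    unique = []
    for record in records:
        key = (record["timestamp"], record["amount"])
        if key not in seen:
            seen.add(key)
            unique.append(dict(record))  # copy so the raw rows stay untouched

    # Mean of the known amounts, used to fill the gaps.
    known = [r["amount"] for r in unique if r["amount"] is not None]
    fill_value = sum(known) / len(known) if known else 0.0

    for record in unique:
        if record["amount"] is None:
            record["amount"] = fill_value
        record["hour"] = datetime.fromisoformat(record["timestamp"]).hour
    return unique


# Hypothetical raw rows; field names are illustrative.
raw = [
    {"timestamp": "2024-03-01T08:15:00", "amount": 10.0},
    {"timestamp": "2024-03-01T08:15:00", "amount": 10.0},  # duplicate
    {"timestamp": "2024-03-01T17:40:00", "amount": None},  # missing value
    {"timestamp": "2024-03-02T23:05:00", "amount": 30.0},
]
clean = preprocess(raw)
```

In a real pipeline the fill value would be computed from the training split only and then reused on validation and test data, for the leakage reasons discussed earlier.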
For deep learning the preprocessing logic is usually built into the training pipeline and runs on the fly. Tokenization, image augmentation (random crops, flips, color jitter), and mixup are typical steps. Each transformation must be reproducible: the same input yields the same output for a given random seed.
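Seeded augmentation can be illustrated with a random horizontal flip. This is a plain-Python sketch with hypothetical helper names (`augment`, `augmented_batch`) and tiny list-based images; real pipelines would use a tensor library, but the reproducibility argument is the same.

```python
import random


def augment(image, rng):
    """Flip a 2D image (list of rows) left to right with probability 0.5."""
    if rng.random() < 0.5:
        return [list(reversed(row)) for row in image]
    return [list(row) for row in image]


def augmented_batch(images, seed):
    """Augment every image using a generator seeded once per call."""
    rng = random.Random(seed)
    return [augment(image, rng) for image in images]


# Hypothetical 2x2 grayscale images.
images = [[[1, 2], [3, 4]], [[5, 6], [7, 8]], [[9, 0], [1, 2]]]
first_run = augmented_batch(images, seed=42)
second_run = augmented_batch(images, seed=42)
```

Using a dedicated `random.Random(seed)` instance rather than the module-level generator keeps the augmentation stream independent of any other code that draws random numbers.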
Reproducing a result a year later requires the same code, the same model weights, and the same data. Code lives in Git; weights live in artifact stores; data is harder, because it can be too large for Git and may change as new examples arrive. Tools such as DVC (Data Version Control), Git LFS, and lakeFS address this, typically by tracking a hash of each dataset version in Git while storing the actual files in cloud or local object storage. DVC's dvc.yaml and dvc.lock files record which raw data and which pipeline produced a given experiment, so a teammate can run dvc repro and recreate the artifact.[14]
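The idea of tracking a hash in place of the data itself can be sketched with `hashlib`. This is an illustration of content fingerprinting only: the pointer-file layout, the function names, and the choice of SHA-256 are assumptions for the example and do not reproduce the file format of DVC, Git LFS, or lakeFS.

```python
import hashlib
import json
import os
import tempfile


def dataset_fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path, pointer_path):
    """Write a small JSON pointer file that could be committed to Git."""
    pointer = {"path": os.path.basename(data_path), "sha256": dataset_fingerprint(data_path)}
    with open(pointer_path, "w", encoding="utf-8") as handle:
        json.dump(pointer, handle, indent=2)
    return pointer


with tempfile.TemporaryDirectory() as workdir:
    data_path = os.path.join(workdir, "train.csv")
    with open(data_path, "wb") as handle:
        handle.write(b"id,label\n1,cat\n")
    before = write_pointer(data_path, os.path.join(workdir, "train.csv.pointer.json"))

    # Appending one row changes the fingerprint, so the change is visible in Git.
    with open(data_path, "ab") as handle:
        handle.write(b"2,dog\n")
    after = dataset_fingerprint(data_path)
```

The pointer file is a few hundred bytes regardless of how large the dataset is, which is what makes it practical to version alongside code.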
Documentation is a separate axis of reproducibility. The Datasheets for Datasets framework (Gebru et al., 2018, later published in Communications of the ACM in December 2021) proposed standardized questions about provenance, intended use, collection process, and known limitations.[15] The authors framed the goal by analogy to the electronics industry: "we propose that every dataset be accompanied with a datasheet that documents its motivation, composition, collection process, recommended uses, and so on."[15] Hugging Face dataset cards build on the same idea: each public dataset is expected to ship a README covering contents, splits, license, and ethical considerations. A large scale analysis of all 7,433 dataset cards on Hugging Face (Yang, Liang, and Zou, ICLR 2024) found documentation quality varies widely, with 86.0 percent of the top 100 most downloaded datasets completing every section while completion drops sharply across the long tail.[16]
The datasets Python package (often referenced as huggingface/datasets) is the most widely used loader for modern AI datasets. A single call such as load_dataset("glue", "sst2") pulls a public dataset from the Hub and returns a Dataset or DatasetDict object that supports map, filter, train_test_split, and integration with PyTorch and TensorFlow data loaders.[12]
Under the hood it stores examples as Apache Arrow tables and memory-maps them, which keeps RAM usage low even for hundred-gigabyte corpora. The official documentation explains that "all datasets are memory-mapped using an efficient zero-serialization cost backend (Apache Arrow)," which lets large datasets run on machines with relatively small device memory.[21] Streaming mode (streaming=True) goes further: examples are pulled lazily from remote storage so training can start before the data finishes downloading. The library is distinct from the broader concept of a dataset; the concept came first, and the library is one popular implementation of how to load and process datasets.
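What memory mapping buys can be shown with Python's standard `mmap` module. This sketch uses a made-up fixed-width binary record and is not how Apache Arrow or the datasets library lay out data; it only illustrates reading one example from a file without loading the whole file into RAM.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-width record: little-endian int32 id followed by float32 score.
RECORD = struct.Struct("<if")


def write_records(path, records):
    """Write records back to back in a flat binary file."""
    with open(path, "wb") as handle:
        for record_id, score in records:
            handle.write(RECORD.pack(record_id, score))


def read_record(mapped, index):
    """Decode one record directly from the mapped bytes."""
    return RECORD.unpack_from(mapped, index * RECORD.size)


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "records.bin")
    write_records(path, [(i, i / 2) for i in range(1000)])

    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # The operating system pages in only the region that is accessed.
            record_750 = read_record(mapped, 750)
            n_records = len(mapped) // RECORD.size
```

Random access by index costs a single offset calculation, which is the property that lets a loader serve shuffled samples from a file much larger than available memory.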
A dataset is like a big box of flashcards. Each card has a question on the front (the features, like a picture of an animal or a sentence) and an answer on the back (the label, like "cat" or "happy"). To teach the computer, you let it study most of the cards (the training set), quiz it on a smaller pile while it studies (the validation set), and at the very end you give it a final pile it has never seen (the test set) to find out how well it really learned. If you accidentally let the computer peek at the final pile while studying, it will look like a genius in practice and then fail on real questions, which is why people are careful about how they split the box.