Datasets
Last reviewed
May 11, 2026
Sources
17 citations
Review status
Source-backed
Revision
v2 · 2,200 words
See also: Machine learning terms, dataset
In machine learning, datasets are collections of examples used to fit, tune, and evaluate models. Each example usually combines input data (features) with a target value (label or response), although unsupervised learning tasks work with unlabeled examples. A dataset can be as small as the 150-row Iris file used in introductory statistics courses or as large as the multi-terabyte web crawl behind modern language models. The word "dataset" is used loosely: sometimes it means a single CSV file, sometimes a labeled benchmark like MNIST, sometimes a corpus of raw text scraped from the open web.
This article covers datasets as a general concept. The Hugging Face datasets library is a Python package built on Apache Arrow for loading and processing them at scale. Famous datasets such as ImageNet and Common Crawl are introduced briefly below and have their own pages where applicable.
At the most basic level, a dataset is a table. Rows are examples (also called samples, instances, or records); columns are features (also called attributes or variables); and one or more columns may serve as the target the model is learning to predict. The notation X is conventional for the feature matrix and y for the label vector, following the linear algebra convention used in scikit-learn and most teaching materials.
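As a minimal sketch of the X and y convention, the snippet below separates a tiny table into a feature matrix and a label vector. The column names and values are illustrative only, and plain Python lists stand in for the array types a real project would use.

```python
# Illustrative toy table: each tuple is one example, the last field is the target.
rows = [
    # sepal_length, sepal_width, species
    (5.1, 3.5, "setosa"),
    (7.0, 3.2, "versicolor"),
    (6.3, 3.3, "virginica"),
]

# X: feature matrix, one inner list per example.
X = [list(row[:-1]) for row in rows]
# y: label vector, one entry per example, aligned with X by position.
y = [row[-1] for row in rows]

n_examples, n_features = len(X), len(X[0])
```

Row i of X and entry i of y always describe the same example, which is why any shuffling or splitting has to be applied to both together.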
Features come in several types. Numeric features carry continuous or discrete numbers such as age or pixel intensity. Categorical features carry discrete labels such as country or product code, often encoded as one-hot vectors or learned embeddings. Text features are sequences of tokens that may be processed by a tokenizer before training. Image features are arrays of pixel values, typically of shape (height, width, channels). Audio features are waveforms or spectrograms. Time series features are ordered sequences with a timestamp index.
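One-hot encoding of a categorical feature can be sketched in a few lines. The helper name `one_hot_encode` and the country codes are hypothetical, chosen for illustration. Libraries such as scikit-learn ship their own encoders, which this sketch does not try to reproduce.

```python
def one_hot_encode(values):
    """Map each categorical value to a 0/1 vector containing a single 1."""
    categories = sorted(set(values))  # fixed, deterministic column order
    index = {cat: i for i, cat in enumerate(categories)}
    vectors = []
    for value in values:
        vec = [0] * len(categories)
        vec[index[value]] = 1
        vectors.append(vec)
    return categories, vectors


countries = ["FR", "DE", "FR", "JP"]
categories, encoded = one_hot_encode(countries)
# categories is ["DE", "FR", "JP"]; the first example "FR" becomes [0, 1, 0]
```

The vector length grows with the number of distinct categories, which is one reason high-cardinality features are often handled with learned embeddings instead.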
Many real datasets mix modalities. A product catalog might pair a numeric price column, a categorical brand column, a short text description, and a product photo. Multimodal models depend on datasets that align modalities carefully so the same row describes the same product.
A single dataset is rarely used in one piece. To produce a model that generalizes to new inputs, practitioners split the data into three subsets with different roles, summarized in the table below.
| Split | Typical share | Purpose |
|---|---|---|
| Training set | 60 to 80 percent | Used to fit model parameters by gradient descent or another optimization method. |
| Validation set | 10 to 20 percent | Used during development to tune hyperparameters, choose between architectures, and detect overfitting through early stopping. |
| Test set | 10 to 20 percent | Used once, at the end, to estimate how the chosen model will behave on unseen data. |
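The three-way split in the table can be sketched with the standard library alone. The function name `three_way_split` and the 60/20/20 default are assumptions for illustration. The key idea is to shuffle once with a fixed seed and then cut the shuffled list into non-overlapping pieces.

```python
import random


def three_way_split(examples, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then cut into train / validation / test subsets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n = len(shuffled)
    n_test = round(n * test_frac)
    n_val = round(n * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = three_way_split(range(100))
```

Because the three slices come from one shuffled list, no example can land in more than one subset.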
The ratios are conventions, not rules. With very large datasets the test set can be small in relative terms and still contain tens of thousands of examples. With small datasets, holding back 30 percent for evaluation may waste too much training signal; researchers often use k-fold cross-validation instead, where the data is split into k folds, the model is trained k times, each time on k minus one folds and evaluated on the remaining fold, and the k scores are averaged. Scikit-learn provides train_test_split and KFold utilities for these patterns.
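The fold bookkeeping behind k-fold cross-validation can be sketched as follows. This is a standard-library illustration, not scikit-learn's implementation. `k_fold_indices`, `cross_validate`, and the `fit_and_score` callback are hypothetical names, and the folds are contiguous and unshuffled for simplicity.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_idx, val_idx) pairs; each example is held out exactly once."""
    indices = list(range(n_examples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def cross_validate(n_examples, k, fit_and_score):
    """Average the score returned by fit_and_score(train_idx, val_idx) over k folds."""
    scores = [fit_and_score(tr, va) for tr, va in k_fold_indices(n_examples, k)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(10, 3))
```

In practice `fit_and_score` would train a fresh model on the training indices and return its metric on the held-out fold.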
Terminology varies. Some papers use "dev set" for what others call validation, and some communities reserve "test" for a held out benchmark that is only evaluated once at the end of a project.
Splits are easy to get wrong. Data leakage happens when information from outside the training set, especially from the test set, sneaks into the model during fitting. Two common patterns are target leakage, where a feature reveals the answer (for example a column filled in only after the outcome was known), and train-test contamination, where preprocessing such as feature scaling, imputation, or feature selection is fitted on the full dataset before splitting. Means and standard deviations used for normalization should come only from the training portion.
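A minimal sketch of leakage-free normalization: the scaling statistics are computed from the training values and then reused, unchanged, on the test values. The helper names `fit_scaler` and `transform` and the age values are hypothetical.

```python
from statistics import mean, pstdev


def fit_scaler(train_values):
    """Compute normalization statistics from the training portion only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # fall back to 1.0 for constant features
    return mu, sigma


def transform(values, mu, sigma):
    """Apply previously fitted statistics; never refit on validation or test data."""
    return [(v - mu) / sigma for v in values]


train_ages = [20.0, 30.0, 40.0]
test_ages = [50.0]

mu, sigma = fit_scaler(train_ages)              # statistics come from train only
train_scaled = transform(train_ages, mu, sigma)
test_scaled = transform(test_ages, mu, sigma)   # reuse the train statistics
```

Fitting the scaler on `train_ages + test_ages` instead would shift `mu` toward the test value, which is exactly the contamination described above.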
Leakage inflates measured performance during development, and that performance then collapses in production. The scikit-learn documentation flags this as one of the most common pitfalls in the field and recommends pipelines that fit preprocessing inside cross-validation folds. For time series, the split must respect time order, so the model is never trained on data from later than the period it is evaluated on.
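For time series, a split that respects time order can be sketched by cutting at a date rather than shuffling. The record layout (`day`, `sales`) and the function name `chronological_split` are assumptions for illustration.

```python
from datetime import date


def chronological_split(records, cutoff):
    """Train on records strictly before the cutoff, evaluate on the rest."""
    ordered = sorted(records, key=lambda r: r["day"])
    train = [r for r in ordered if r["day"] < cutoff]
    test = [r for r in ordered if r["day"] >= cutoff]
    return train, test


records = [
    {"day": date(2024, 1, 3), "sales": 7},
    {"day": date(2024, 1, 1), "sales": 5},
    {"day": date(2024, 1, 4), "sales": 9},
    {"day": date(2024, 1, 2), "sales": 6},
]
ts_train, ts_test = chronological_split(records, cutoff=date(2024, 1, 3))
```

Every training record is older than every test record, so evaluation mimics predicting the future from the past.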
Datasets are stored in many formats, chosen for size, speed, and tooling support. The major options are summarized below.
| Format | Origin | Strengths | Typical use |
|---|---|---|---|
| CSV | Plain text, decades old | Human readable, opens in any tool | Small tabular data, teaching, sharing |
| JSON / JSONL | Web standard | Schemaless, easy to extend | NLP datasets where each line is one example |
| Parquet | Apache, columnar | Compression, partial column reads, fast scans | Large tabular and analytics workloads |
| Apache Arrow / Feather | Apache, columnar in memory | Zero copy reads, memory mapping | Cross language interchange, Hugging Face cache |
| TFRecord | Google, protocol buffers | Streamable shards, native to TensorFlow | Large image and audio pipelines |
| HDF5 | NCSA, hierarchical binary | Stores high dimensional arrays in one file | Scientific data, large image stacks |
| WebDataset | tar based | Streams from object storage, plays well with shards | Web scale vision and audio training |
CSV remains popular because it is text and works everywhere, but it has real limits: no schema, no compression by default, and slow to parse at scale. Parquet stores values column by column, which lets the reader skip columns and compress each column efficiently. TFRecord packages examples as serialized protocol buffers and supports sharding, which matters when several training workers read the same data in parallel.
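The "no schema" limitation of CSV is easy to see in a round trip. The sketch below writes the same made-up records to CSV and to JSONL using only the standard library, with in-memory buffers standing in for files. CSV returns every field as a string, while JSONL preserves numbers and booleans.

```python
import csv
import io
import json

records = [
    {"id": 1, "price": 9.5, "in_stock": True},
    {"id": 2, "price": 20.0, "in_stock": False},
]

# CSV round trip: type information is lost because everything is text.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["id", "price", "in_stock"])
writer.writeheader()
writer.writerows(records)
csv_buffer.seek(0)
from_csv = list(csv.DictReader(csv_buffer))

# JSONL round trip: one JSON object per line, basic types survive.
jsonl_text = "\n".join(json.dumps(r) for r in records)
from_jsonl = [json.loads(line) for line in jsonl_text.splitlines()]
```

A CSV reader therefore needs an external convention, or type inference, to recover that `price` is a float and `in_stock` is a boolean.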
HDF5 is widely used in scientific computing for high dimensional arrays that do not fit a tabular layout, but it has well known problems with concurrent access, particularly when several processes work on the same file. Apache Arrow takes a different approach: it defines a standardized columnar in-memory layout that can be memory mapped and shared without copying. The Hugging Face datasets library stores its on disk cache as Arrow tables, which lets it memory map huge files and serve random samples to the GPU without loading everything into RAM.
The UCI Machine Learning Repository, created as an FTP archive in 1987 by PhD student David Aha, has hosted classic benchmarks such as Iris, Wine, and Adult Census for nearly four decades. As of 2024 it lists 689 datasets and is still a default landing place for teaching examples.
Kaggle launched in 2010 as a competition platform and now hosts a large catalog of public datasets contributed by users, governments, and companies. Many real industrial problems first reach a wide audience as a Kaggle competition.
The Hugging Face Hub has become the de facto sharing layer for modern NLP, vision, and audio datasets. It pairs each dataset with a README based dataset card that describes source, licensing, and known biases.
TensorFlow Datasets and torchvision.datasets ship curated catalogs that download, verify, and split common benchmarks with a single line of code, keeping reproductions consistent across labs.
Government open data portals (data.gov, EU Open Data, national statistics offices) and academic archives such as Zenodo and Figshare carry datasets used in research papers, often with persistent DOIs.
A few datasets keep showing up because they shaped the field.
MNIST, assembled around 1994 by Yann LeCun, Corinna Cortes, and Christopher Burges from earlier NIST collections, contains 60,000 training and 10,000 test images of handwritten digits at 28 by 28 grayscale resolution. For two decades it was the default benchmark for digit recognition.
CIFAR-10 and CIFAR-100 were published in 2009 by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton as labeled subsets of the 80 Million Tiny Images collection. CIFAR-10 has 60,000 colour images at 32 by 32 across 10 classes; CIFAR-100 has the same total across 100 classes grouped into 20 superclasses. They bridged MNIST and harder natural image benchmarks.
ImageNet, conceived by Fei-Fei Li in 2006 and labeled between 2008 and 2010 with help from about 49,000 Mechanical Turk workers, contains roughly 14 million images organized under the WordNet hierarchy. The ImageNet Large Scale Visual Recognition Challenge ran from 2010 to 2017 on a 1,000 class subset. The 2012 result, where AlexNet cut top-5 error by more than 10 percentage points over the runner up, is often cited as the start of the deep learning era.
Common Crawl, founded by Gil Elbaz in 2007, publishes monthly snapshots of the public web. By 2023 it had captured more than three billion pages. It is one of the main raw corpora behind large language models from Anthropic, OpenAI, Google DeepMind, and others, usually combined with curated subsets such as C4 and RefinedWeb after heavy filtering.
Other frequently cited collections include SQuAD for question answering, COCO for object detection, LibriSpeech for speech recognition, and The Pile and RedPajama for open language model pretraining.
Raw datasets rarely go straight into a model. Preprocessing cleans missing values, removes duplicates, encodes categorical variables, and rescales numeric features. Feature engineering builds new columns that make the prediction task easier, for example extracting hour of day from a timestamp or combining latitude and longitude into a city cluster.
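The preprocessing and feature engineering steps above can be sketched on a toy table. The field names (`ts`, `amount`), the values, and the mean-imputation choice are assumptions for illustration. In a real pipeline the fill value would be computed from the training split only.

```python
from datetime import datetime

raw = [
    {"ts": "2024-03-01T08:15:00", "amount": 12.0},
    {"ts": "2024-03-01T08:15:00", "amount": 12.0},  # exact duplicate
    {"ts": "2024-03-01T19:40:00", "amount": None},  # missing value
    {"ts": "2024-03-02T23:05:00", "amount": 30.0},
]


def preprocess(rows):
    # 1. Remove exact duplicates while keeping first-seen order.
    seen, unique = set(), []
    for row in rows:
        key = (row["ts"], row["amount"])
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))  # copy so the raw input is not mutated
    # 2. Impute missing amounts with the mean of the observed amounts.
    observed = [r["amount"] for r in unique if r["amount"] is not None]
    fill = sum(observed) / len(observed)
    for r in unique:
        if r["amount"] is None:
            r["amount"] = fill
        # 3. Feature engineering: derive hour of day from the timestamp.
        r["hour"] = datetime.fromisoformat(r["ts"]).hour
    return unique


clean = preprocess(raw)
```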
For deep learning the preprocessing logic is usually built into the training pipeline and runs on the fly. Tokenization, image augmentation (random crops, flips, color jitter), and mixup are typical steps. Each transformation must be reproducible: the same input yields the same output for a given random seed.
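Seeded, on-the-fly augmentation can be sketched with a local random generator. The image here is a small nested list rather than a real pixel array, and the function names are hypothetical. The point is that two runs with the same seed make identical flip decisions.

```python
import random


def random_horizontal_flip(image, rng, p=0.5):
    """Flip a 2D image (list of rows) left to right with probability p."""
    if rng.random() < p:
        return [list(reversed(row)) for row in image]
    return [list(row) for row in image]


def augment_batch(images, seed):
    rng = random.Random(seed)  # local generator, no hidden global state
    return [random_horizontal_flip(img, rng) for img in images]


images = [[[1, 2, 3], [4, 5, 6]] for _ in range(8)]
run_a = augment_batch(images, seed=42)
run_b = augment_batch(images, seed=42)
```

Passing the generator explicitly, instead of calling the module-level `random` functions, keeps the augmentation independent of whatever else in the program consumes random numbers.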
Reproducing a result a year later requires the same code, the same model weights, and the same data. Code lives in Git; weights live in artifact stores; data is harder, because it can be too large for Git and may change as new examples arrive. Tools such as DVC (Data Version Control), Git LFS, and lakeFS address this by tracking a small reference to each dataset version, typically a content hash, in version control while storing the actual files in cloud or local object storage. DVC's dvc.yaml and dvc.lock files record which raw data and which pipeline produced a given experiment, so a teammate can run dvc repro and recreate the artifact.
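The idea of tracking a hash instead of the data itself can be sketched as follows. This illustrates content addressing in general, not the file format of DVC, Git LFS, or lakeFS. The pointer record, its field names, and the choice of SHA-256 are assumptions.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return a hex digest that changes whenever the bytes change."""
    return hashlib.sha256(data).hexdigest()


v1 = b"id,label\n1,cat\n2,dog\n"
v2 = b"id,label\n1,cat\n2,dog\n3,bird\n"  # one new example appended

# Hypothetical pointer record: small enough to commit to Git,
# while the data bytes themselves live in object storage.
pointer_v1 = {"path": "data/train.csv", "sha256": content_hash(v1), "size": len(v1)}
pointer_v2 = {"path": "data/train.csv", "sha256": content_hash(v2), "size": len(v2)}
```

Two commits that reference the same hash are guaranteed to refer to byte-identical data, which is what makes an experiment reproducible a year later.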
Documentation is a separate axis of reproducibility. The Datasheets for Datasets framework (Gebru et al., 2018) proposed standardized questions about provenance, intended use, collection process, and known limitations. Hugging Face dataset cards build on the same idea: each public dataset is expected to ship a README covering contents, splits, license, and ethical considerations. A 2024 analysis of 7,433 dataset cards found documentation quality varies widely across the ecosystem.
The datasets Python package (often referenced as huggingface/datasets) is the most widely used loader for modern AI datasets. A single call such as load_dataset("glue", "sst2") pulls a public dataset from the Hub and returns a Dataset or DatasetDict object. Both support map and filter, a Dataset can be split further with train_test_split, and the results integrate with PyTorch and TensorFlow data loaders.
Under the hood it stores examples as Apache Arrow tables and memory maps them, which keeps RAM usage low even for hundred gigabyte corpora. Streaming mode (streaming=True) goes further: examples are pulled lazily from remote storage so training can start before the data finishes downloading. The library is distinct from the broader concept of a dataset; the concept came first, and the library is one popular way to load and process datasets.
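The lazy behaviour behind streaming can be illustrated with a plain Python generator. This is a conceptual sketch, not the Hugging Face library's implementation or API. `stream_examples` is a hypothetical helper, and an in-memory buffer stands in for remote storage.

```python
import io
import json
from itertools import islice


def stream_examples(lines):
    """Yield one parsed example at a time instead of building a full list."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in for a large remote JSONL file with one example per line.
source = io.StringIO(
    "".join(json.dumps({"id": i, "text": f"example {i}"}) + "\n" for i in range(1000))
)

# Only the first three lines are parsed; the rest are never touched.
first_three = list(islice(stream_examples(source), 3))
```

A training loop that consumes such an iterator can start on the first example immediately, which is the property streaming mode provides at much larger scale.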
A dataset is like a big box of flashcards. Each card has a question on the front (the features, like a picture of an animal or a sentence) and an answer on the back (the label, like "cat" or "happy"). To teach the computer, you let it study most of the cards (the training set), quiz it on a smaller pile while it studies (the validation set), and at the very end you give it a final pile it has never seen (the test set) to find out how well it really learned. If you accidentally let the computer peek at the final pile while studying, it will look like a genius in practice and then fail on real questions, which is why people are careful about how they split the box.