# Data Set or Dataset

> Source: https://aiwiki.ai/wiki/data_set_or_dataset
> Updated: 2026-06-23
> Categories: Data & Datasets, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning](/wiki/machine_learning), [Training set](/wiki/training_set), [Validation set](/wiki/validation_set), [Test set](/wiki/test_set)*

A dataset (also written as "data set") is a structured collection of data points used to train, validate, and evaluate machine learning models. In artificial intelligence and [machine learning](/wiki/machine_learning), each data point (also called a sample, instance, or example) is described by a set of [features](/wiki/feature) (input variables) and, in [supervised learning](/wiki/supervised_learning), a corresponding [label](/wiki/label) (the target output). Datasets are the foundation of every machine learning pipeline: a model can only learn the patterns present in its data, so the size, quality, diversity, and representativeness of a dataset directly determine the performance and reliability of the model built from it.

Before training, a dataset is normally divided into three disjoint subsets: a [training set](/wiki/training_set) used to fit the model's parameters, a [validation set](/wiki/validation_set) used to tune [hyperparameters](/wiki/hyperparameter) and detect [overfitting](/wiki/overfitting), and a [test set](/wiki/test_set) held back to measure final performance on unseen data. Landmark datasets have repeatedly driven progress in the field: benchmarks such as [MNIST](/wiki/mnist) and [ImageNet](/wiki/imagenet) give researchers a shared, reproducible basis for comparison, while web-scale corpora such as [Common Crawl](/wiki/common_crawl) supply raw text for training large language models.

## Introduction

In the broadest sense, a dataset is a structured collection of data organized for analysis, processing, or [machine learning](/wiki/machine_learning) tasks. In artificial intelligence and machine learning, the term refers specifically to the body of information used to [train](/wiki/training), validate, and evaluate models. A dataset typically consists of individual data points (also called samples, instances, or examples), each described by a set of [features](/wiki/feature) (input variables) and, in [supervised learning](/wiki/supervised_learning), corresponding [labels](/wiki/label) (output variables or target values).
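As a minimal sketch of this vocabulary, the snippet below represents a tiny labeled dataset in plain Python and separates its features from its labels. The field names (`sepal_length`, `petal_length`, `species`) and all values are made up for illustration.

```python
# A tiny, made-up labeled dataset: each dict is one data point (sample).
samples = [
    {"sepal_length": 5.1, "petal_length": 1.4, "species": "setosa"},
    {"sepal_length": 7.0, "petal_length": 4.7, "species": "versicolor"},
    {"sepal_length": 6.3, "petal_length": 6.0, "species": "virginica"},
]

FEATURE_NAMES = ["sepal_length", "petal_length"]  # input variables
LABEL_NAME = "species"                            # target output

# Separate the features (X) from the labels (y), as supervised learning expects.
X = [[row[name] for name in FEATURE_NAMES] for row in samples]
y = [row[LABEL_NAME] for row in samples]

print(X)  # one feature vector per sample
print(y)  # one label per sample
```

An unlabeled dataset would simply have no `species` field, leaving only `X`.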

Datasets are the foundation of every machine learning pipeline. The quality, size, diversity, and representativeness of a dataset directly influence the performance and reliability of the models built from it. As the field has matured, the creation, curation, documentation, and governance of datasets have become research areas in their own right.

The term "dataset" is now standard in most technical writing, though "data set" (two words) still appears in some formal and statistical contexts. Both forms are considered acceptable.

## What types of datasets are there?

Datasets can be categorized by their modality, labeling status, and structure. The following tables summarize the most common types encountered in machine learning.

### By Data Modality

| Type | Description | Common Formats | Example Tasks |
|------|-------------|----------------|---------------|
| Tabular | Data organized in rows and columns, where each row is a sample and each column is a [feature](/wiki/feature) | CSV, Parquet, SQL databases | [Classification](/wiki/classification), [regression](/wiki/regression), fraud detection |
| Image | Collections of photographs or generated images, typically stored as pixel arrays | JPEG, PNG, TIFF | [Image recognition](/wiki/image_recognition), [object detection](/wiki/object_detection), [image segmentation](/wiki/image_segmentation) |
| Text | Documents, sentences, or token sequences used for language tasks | Plain text, JSON, XML | [Sentiment analysis](/wiki/sentiment_analysis), [machine translation](/wiki/machine_translation), question answering |
| Audio | Sound recordings or waveforms used for speech and acoustic tasks | WAV, MP3, FLAC | [Speech recognition](/wiki/speech_recognition), speaker identification, music generation |
| Video | Sequential frames of visual data, sometimes with accompanying audio | MP4, AVI, WebM | Action recognition, video captioning, autonomous driving |
| Graph | Data representing nodes and edges, capturing relationships between entities | Adjacency matrices, edge lists, GraphML | Social network analysis, molecular property prediction, knowledge graphs |

### By Labeling Status

| Type | Description | Typical Use |
|------|-------------|-------------|
| Labeled | Each sample is annotated with a ground-truth [label](/wiki/label) or target value | [Supervised learning](/wiki/supervised_learning) |
| Unlabeled | Samples have no associated target values | [Unsupervised learning](/wiki/unsupervised_learning), pretraining |
| Semi-labeled | A small portion of samples are labeled, while the majority are not | [Semi-supervised learning](/wiki/semi-supervised_learning) |

### By Structure

| Type | Description | Examples |
|------|-------------|----------|
| Structured | Data follows a predefined schema with clearly defined rows and columns | Spreadsheets, relational databases, CSV files |
| Unstructured | Data lacks a rigid format and requires preprocessing before use | Raw text documents, images, audio recordings |
| Semi-structured | Data has some organizational properties but does not conform to a strict tabular schema | JSON documents, XML files, HTML pages |

## What are the most famous machine learning datasets?

Several datasets have played pivotal roles in advancing the field of machine learning and [deep learning](/wiki/deep_learning). These benchmark datasets serve as standard reference points for comparing model architectures and training techniques.

### Computer Vision Datasets

| Dataset | Year | Size | Description |
|---------|------|------|-------------|
| [MNIST](/wiki/mnist) | 1998 | 70,000 grayscale images (28x28 pixels) | Handwritten digit recognition dataset created by Yann LeCun, Corinna Cortes, and Chris Burges. Contains 60,000 training and 10,000 test images of digits 0 through 9. Often considered the "hello world" of machine learning. [3] |
| [CIFAR-10](/wiki/cifar_10) | 2009 | 60,000 color images (32x32 pixels) | Created by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton at the Canadian Institute for Advanced Research. Contains 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck) with 6,000 images per class. [4] |
| CIFAR-100 | 2009 | 60,000 color images (32x32 pixels) | An extension of CIFAR-10 with 100 fine-grained classes grouped into 20 superclasses, containing 600 images per class. [4] |
| [ImageNet](/wiki/imagenet) | 2009 | 14+ million images | A large-scale image database organized according to the WordNet hierarchy, containing more than 14 million hand-annotated images across 20,000+ categories. The associated ImageNet Large Scale Visual Recognition Challenge (ILSVRC) catalyzed the deep learning revolution when [AlexNet](/wiki/alexnet) won the 2012 competition with a top-5 error rate of 15.3%, more than 10 percentage points lower than the runner-up's 26.2%. [2][11] |
| [COCO](/wiki/coco_dataset) (Common Objects in Context) | 2014 | 330,000+ images | A large-scale dataset for [object detection](/wiki/object_detection), segmentation, and captioning. Its annotations span 80 object categories, and captioned images carry 5 descriptive captions each, making it one of the most widely used benchmarks in [computer vision](/wiki/computer_vision). [5] |

The ImageNet result was a watershed moment. AlexNet itself contained roughly 60 million parameters and was trained on GPUs, an approach that became a defining ingredient of the deep learning era. [11]

### Natural Language Processing Datasets

| Dataset | Year | Description |
|---------|------|-------------|
| [SQuAD](/wiki/squad) (Stanford Question Answering Dataset) | 2016 | A reading comprehension dataset consisting of questions posed on Wikipedia articles, where the answer to each question is a span of text from the corresponding passage. SQuAD 2.0 added unanswerable questions to increase difficulty. [6] |
| [GLUE](/wiki/glue_benchmark) (General Language Understanding Evaluation) | 2018 | A benchmark suite of nine [natural language understanding](/wiki/natural_language_understanding) tasks, including sentiment analysis, textual entailment, and paraphrase detection. GLUE became the standard evaluation framework for pretrained language models like [BERT](/wiki/bert) and [RoBERTa](/wiki/roberta). SuperGLUE was later introduced as a more challenging successor. [7] |
| [Common Crawl](/wiki/common_crawl) | 2008 (ongoing) | A nonprofit project that crawls the web and provides free archives of web page data. Operating since 2008, it has amassed more than 9.5 petabytes of data and on the order of hundreds of billions of web pages, with monthly crawls typically adding 3 to 5 billion more pages. Common Crawl data has been used to train many [large language models](/wiki/large_language_model); a 2024 Mozilla Foundation report found that of 47 generative LLMs released between 2019 and 2023, around two-thirds were trained on Common Crawl data. [12] |

## How is a dataset created and curated?

Building a high-quality dataset involves multiple stages, each requiring careful planning and execution.

### Data Collection

Data can be gathered from a variety of sources:

- **Manual collection:** Researchers design experiments, surveys, or observation protocols to gather data from scratch.
- **Web scraping:** Automated tools extract data from websites, APIs, or public databases. Projects like Common Crawl provide large-scale web data freely.
- **Existing databases:** Public repositories, government records, scientific archives, and enterprise databases serve as data sources.
- **Sensor data:** IoT devices, cameras, microphones, and other instruments capture real-world measurements.

### Data Annotation and Labeling

For [supervised learning](/wiki/supervised_learning) tasks, raw data must be annotated with ground-truth [labels](/wiki/label). Annotation methods include:

- **In-house labeling:** Domain experts annotate data directly, providing high accuracy but at greater cost and lower throughput.
- **Crowdsourcing:** Platforms like Amazon Mechanical Turk distribute labeling tasks to large numbers of workers. Quality control measures such as redundant annotations, gold-standard questions, and inter-annotator agreement scores help maintain accuracy.
- **Automated labeling:** [Machine learning](/wiki/machine_learning) models pre-label data, which human reviewers then verify and correct (sometimes called "human-in-the-loop" annotation).
- **Programmatic labeling:** Frameworks like Snorkel use labeling functions to generate weak labels at scale, which are then combined and denoised.
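The quality-control ideas mentioned for crowdsourcing can be sketched in a few lines. The example below aggregates redundant annotations by majority vote and computes Cohen's kappa as one possible inter-annotator agreement score. The annotator data is invented, and real pipelines often use more elaborate aggregation than a simple vote.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most common label among redundant annotations for one item.

    Ties are broken in favor of the label that appears first in `votes`.
    """
    return Counter(votes).most_common(1)[0][0]

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    Undefined (division by zero) when chance agreement is exactly 1.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, from each annotator's label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Made-up labels from three annotators for the same six items.
annotator_1 = ["cat", "cat", "dog", "dog", "cat", "dog"]
annotator_2 = ["cat", "dog", "dog", "dog", "cat", "dog"]
annotator_3 = ["cat", "cat", "dog", "cat", "cat", "dog"]

consensus = [majority_vote(v) for v in zip(annotator_1, annotator_2, annotator_3)]
kappa = cohen_kappa(annotator_1, annotator_2)

print(consensus)
print(f"kappa between annotators 1 and 2: {kappa:.3f}")
```

A kappa near 1 indicates strong agreement beyond chance, while a value near 0 suggests the annotators agree no more often than random labeling would.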

### Data Cleaning and Preprocessing

Raw datasets often contain noise, missing values, duplicates, and inconsistencies. [Preprocessing](/wiki/preprocessing) steps typically include:

- Removing or imputing missing values
- Eliminating duplicate records
- Correcting mislabeled samples
- Normalizing or standardizing numerical [features](/wiki/feature)
- Encoding categorical variables (e.g., [one-hot encoding](/wiki/one-hot_encoding))
- Tokenizing and cleaning text data
- Resizing and normalizing images
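Several of the steps above can be illustrated on a toy tabular dataset. The following is a minimal sketch using only the standard library; the column names and values are hypothetical, and mean imputation is just one of several possible strategies.

```python
from statistics import mean, pstdev

raw = [
    {"age": 34, "income": 52000.0, "city": "Lyon"},
    {"age": 51, "income": None, "city": "Oslo"},      # missing value
    {"age": 34, "income": 52000.0, "city": "Lyon"},   # exact duplicate
    {"age": 27, "income": 38000.0, "city": "Lima"},
]

# 1. Eliminate duplicate records while preserving order.
seen, rows = set(), []
for r in raw:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        rows.append(dict(r))

# 2. Impute missing incomes with the mean of the observed values.
observed = [r["income"] for r in rows if r["income"] is not None]
fill = mean(observed)
for r in rows:
    if r["income"] is None:
        r["income"] = fill

# 3. Standardize numerical features to zero mean and unit variance.
for col in ("age", "income"):
    values = [r[col] for r in rows]
    mu, sigma = mean(values), pstdev(values)
    for r in rows:
        r[col] = (r[col] - mu) / sigma

# 4. One-hot encode the categorical feature.
cities = sorted({r["city"] for r in rows})
for r in rows:
    city = r.pop("city")
    for c in cities:
        r[f"city_{c}"] = 1 if city == c else 0

print(rows[0])
```

In a real pipeline, the imputation value and the standardization statistics would normally be computed on the training split only and then reused for the validation and test splits, so that no information leaks from held-out data.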

## How is a dataset split into train, validation, and test sets?

Before training a model, a dataset is typically divided into three subsets, each serving a distinct purpose.

| Subset | Purpose | Typical Share |
|--------|---------|---------------|
| [Training set](/wiki/training_set) | Used to fit the model's parameters during training | 60% to 80% |
| [Validation set](/wiki/validation_set) | Used to tune [hyperparameters](/wiki/hyperparameter) and monitor for [overfitting](/wiki/overfitting) during training | 10% to 20% |
| [Test set](/wiki/test_set) | Used to evaluate the final model's performance on unseen data | 10% to 20% |

Common split ratios include 60/20/20, 70/15/15, and 80/10/10. For very large datasets (millions of samples), much smaller validation and test proportions (such as 98/1/1) can be sufficient, because even 1% of the data still amounts to tens of thousands of examples: with 2 million samples, a 98/1/1 split leaves 20,000 examples each for validation and testing.
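A ratio-based split can be sketched as follows. The helper name `split_dataset` and its parameters are illustrative rather than taken from any particular library; it shuffles the samples once and then cuts the shuffled list by the requested ratios.

```python
import random

def split_dataset(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle and cut into train/validation/test subsets by the given ratios."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n = len(shuffled)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, so no sample is dropped
    return train, val, test

train, val, test = split_dataset(range(1000), ratios=(0.7, 0.15, 0.15))
print(len(train), len(val), len(test))
```

Fixing the random seed makes the split reproducible, which matters when results from different experiments need to be compared on the same held-out data.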

### Splitting Methods

- **Random splitting:** The dataset is shuffled, and samples are assigned to each subset at random. This is the simplest approach but may not preserve class distributions.
- **Stratified splitting:** Maintains the original proportion of classes across all subsets. This is particularly important for [class-imbalanced datasets](/wiki/class-imbalanced_dataset).
- **[Cross-validation](/wiki/cross-validation):** The dataset is divided into K folds, and the model is trained K times, each time using a different fold as the validation set and the remaining folds for training. K-fold cross-validation provides more robust performance estimates, especially for smaller datasets.
- **Time-based splitting:** For [time series](/wiki/time_series_analysis) data, the split respects temporal ordering so that the model never trains on future data.
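The splitting methods above differ mainly in how indices are assigned to subsets. The sketch below gives simplified, standard-library versions of stratified, K-fold, and time-based splitting. The function names are hypothetical, and the K-fold helper assumes the data has already been shuffled.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) keeping each class's share roughly equal."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * test_fraction)
        test_idx.extend(indices[:cut])
        train_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    for fold in range(k):
        val_idx = indices[fold::k]  # every k-th sample, starting at `fold`
        train_idx = [i for i in indices if i % k != fold]
        yield train_idx, val_idx

def time_based_split(timestamps, cutoff):
    """Train on everything strictly before `cutoff`, test on the rest."""
    train_idx = [i for i, t in enumerate(timestamps) if t < cutoff]
    test_idx = [i for i, t in enumerate(timestamps) if t >= cutoff]
    return train_idx, test_idx

labels = ["neg"] * 90 + ["pos"] * 10   # a 90/10 class-imbalanced dataset
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
pos_in_test = sum(labels[i] == "pos" for i in test_idx)
print(len(test_idx), pos_in_test)
```

With a purely random split of the same data, the test set could by chance contain very few or no positive examples, which is the failure mode stratification is meant to avoid.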

## What is dataset bias, and how is it addressed?

Dataset bias occurs when the data used to train a model does not accurately represent the population or phenomenon the model is intended to serve. Biased datasets can lead to models that systematically disadvantage certain groups or produce inaccurate results for underrepresented populations.

### Common Types of Dataset Bias

| Bias Type | Description |
|-----------|-------------|
| Selection bias | The data collection process systematically excludes certain groups or scenarios |
| Measurement bias | The way data is recorded introduces systematic errors (e.g., different sensor calibrations across demographic groups) |
| Historical bias | The data reflects existing societal inequalities, and models trained on it perpetuate those inequalities |
| Representation bias | Certain groups are underrepresented in the dataset relative to their real-world prevalence |
| Label bias | Annotators apply labels inconsistently, or cultural biases influence the labeling process |

### Real-World Examples

Several high-profile cases have demonstrated the consequences of dataset bias:

- In 2018, Amazon discontinued an AI recruiting tool after discovering it penalized resumes from women. The model had been trained on a decade of hiring data from a predominantly male applicant pool.
- Facial recognition systems have been shown to have significantly higher error rates for darker-skinned individuals, largely because training datasets contained a disproportionate number of lighter-skinned faces.
- Medical AI models trained primarily on data from one demographic group may produce unreliable predictions when applied to patients from other groups.

### Mitigation Strategies

Addressing dataset bias requires intervention at multiple stages:

- **Pre-processing:** Reweighting or resampling data to balance representation across groups.
- **In-processing:** Embedding fairness constraints directly into the model's training objective.
- **Post-processing:** Adjusting model predictions after training to equalize performance across groups.
- **Auditing:** Regularly evaluating models for disparate impact across demographic groups using fairness metrics.
- **Diverse data collection:** Intentionally seeking data from underrepresented populations and contexts.
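Two of these strategies, pre-processing by reweighting and auditing by group, lend themselves to a short illustration. The sketch below uses invented group labels and predictions; per-group accuracy is only one of many possible fairness metrics.

```python
from collections import Counter

def balancing_weights(groups):
    """Weight each sample inversely to its group's frequency (pre-processing)."""
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    # Each group ends up with the same total weight, n / k.
    return [n / (k * counts[g]) for g in groups]

def accuracy_by_group(y_true, y_pred, groups):
    """Per-group accuracy, a simple auditing metric."""
    correct, total = Counter(), Counter()
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}

# Made-up data: group "B" is underrepresented (2 of 10 samples).
groups = ["A"] * 8 + ["B"] * 2
weights = balancing_weights(groups)

y_true = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0, 0, 0]
audit = accuracy_by_group(y_true, y_pred, groups)

print(weights[0], weights[-1])
print(audit)
```

The weights could be passed to a training procedure that supports per-sample weighting, and a large gap between the per-group accuracies is the kind of signal an audit is meant to surface.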

## What are synthetic datasets?

Synthetic datasets are artificially generated by algorithms to mimic the statistical properties of real-world data without containing actual real-world information. They have become increasingly important in modern machine learning workflows.

### Generation Methods

| Method | Description |
|--------|-------------|
| [Generative Adversarial Networks](/wiki/generative_adversarial_network_gan) (GANs) | Two neural networks (a generator and a discriminator) compete to produce realistic synthetic samples |
| [Variational Autoencoders](/wiki/variational_autoencoder) (VAEs) | Encoder-decoder architectures that learn a latent representation and sample new data points from it |
| [Large Language Models](/wiki/large_language_model) (LLMs) | Pretrained language models generate synthetic text data for NLP tasks |
| Statistical simulation | Traditional statistical models replicate the distributions and correlations found in real data |
| Rule-based generation | Domain-specific rules and templates produce structured synthetic records |
| [Data augmentation](/wiki/data_augmentation) | Existing samples are transformed (rotated, cropped, paraphrased) to create new training examples |
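The two simplest rows of the table, statistical simulation and data augmentation, can be sketched without any machine learning library. The example below fits a normal distribution to a handful of made-up measurements and samples from it, then creates jittered copies of a feature vector. Assuming a Gaussian is a deliberate simplification; real data often needs a richer model.

```python
import random
from statistics import mean, stdev

def simulate_gaussian(real_values, n_samples, seed=0):
    """Statistical simulation: fit a normal distribution, then sample from it."""
    mu, sigma = mean(real_values), stdev(real_values)
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n_samples)]

def augment_with_noise(sample, scale=0.01, copies=3, seed=0):
    """Data augmentation: jittered copies of one numeric feature vector."""
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, scale) for x in sample] for _ in range(copies)]

real_heights_cm = [162.0, 175.5, 168.2, 181.0, 159.4, 171.3]  # made-up values
synthetic = simulate_gaussian(real_heights_cm, n_samples=1000)
augmented = augment_with_noise([0.5, 0.2, 0.9])

print(f"real mean {mean(real_heights_cm):.1f}, synthetic mean {mean(synthetic):.1f}")
print(augmented)
```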

### Advantages

- **Privacy protection:** Well-constructed synthetic data does not reproduce real individuals' records, which reduces privacy and compliance risks.
- **Scalability:** Large volumes of data can be generated quickly and at low cost.
- **Addressing data scarcity:** Synthetic data can fill gaps when real-world data is scarce, expensive, or difficult to collect.
- **Balancing datasets:** Synthetic samples can be generated for underrepresented classes to address class imbalance.

### Limitations

- **Bias propagation:** If synthetic data is generated from biased real data, the bias carries over into the synthetic version.
- **Quality gaps:** Synthetic data may not capture the full complexity and noise of real-world data, potentially leading to models that underperform in production.
- **Validation challenges:** Evaluating whether synthetic data faithfully represents the target distribution requires careful statistical testing.

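A widely cited industry forecast, attributed to Gartner, predicted that by 2024 about 60% of the data used for AI development would be synthetically generated. This was a projection rather than a measured share, and the proportion is expected to continue growing.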

## What is a benchmark dataset?

Benchmark datasets are standardized, high-quality collections designed to evaluate and compare machine learning models in a fair and reproducible manner. They function as shared reference points: when every researcher measures their model against the same test data, it becomes straightforward to compare results and track progress.

### Role and Importance

Benchmark datasets serve several purposes:

- **Standardized comparison:** They provide a common evaluation ground for different model architectures and training approaches.
- **Progress tracking:** Leaderboards built around benchmark datasets track improvements in model performance over time.
- **Reproducibility:** Shared datasets enable other researchers to reproduce and verify published results.
- **Task definition:** Benchmarks help define and standardize machine learning tasks, establishing clear metrics and evaluation protocols.

### Limitations of Benchmarks

Benchmarks also have notable limitations:

- **Overfitting to the benchmark:** Researchers may optimize specifically for benchmark performance rather than general capability, a phenomenon sometimes called "teaching to the test."
- **Dataset artifacts:** Models can exploit annotation patterns or shortcuts in benchmark datasets rather than learning genuine understanding.
- **Mislabeled data:** Even established benchmark datasets have been found to contain mislabeled samples that can distort evaluation results.
- **Real-world gap:** Strong benchmark performance does not always translate to reliable real-world deployment.

## Does dataset size or quality matter more?

The relationship between dataset size and model performance is a central concern in machine learning. While more data generally leads to better models, the quality of that data matters at least as much as its quantity.

### Key Principles

- **Quality over quantity:** Models trained on high-quality data can generalize effectively even with fewer examples. Conversely, models trained on noisy or inaccurate data may fail to make reliable predictions even with vast amounts of data.
- **Diminishing returns from noisy data:** As noise levels increase, the relationship between sample size and model accuracy weakens. Beyond a certain point, adding more low-quality data provides minimal improvement.
- **Minimum viable dataset size:** Very small datasets risk [overfitting](/wiki/overfitting), where the model memorizes the few examples it has seen instead of learning patterns that generalize. The minimum required size depends on the complexity of the task, the number of [features](/wiki/feature), and the model architecture.
- **Data-centric AI:** An emerging paradigm that prioritizes improving data quality over changing model architectures. Advocates argue that curating better data often yields greater performance gains per unit of effort than designing more complex models.

### Scaling Laws

Research on neural [scaling laws](/wiki/scaling_laws) has shown that model performance improves predictably as training data, model size, and compute increase together. However, these gains depend on data quality. The [Chinchilla](/wiki/chinchilla) scaling laws, published by DeepMind in 2022, showed that many large language models were undertrained relative to their size, and that for a fixed compute budget, training a smaller model on more high-quality data could be more efficient than simply scaling model parameters. To demonstrate the point, DeepMind trained a 70 billion parameter model called Chinchilla on 1.4 trillion tokens, using the same compute budget as the 280 billion parameter Gopher. Despite being four times smaller, Chinchilla outperformed Gopher as well as GPT-3, Jurassic-1, and Megatron-Turing NLG. [13]
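The arithmetic behind this comparison can be checked with a rough calculation. The sketch below relies on two assumptions that are not stated in the text above: the common rule of thumb that training costs about 6 floating-point operations per parameter per token, and Gopher's reported training set of roughly 300 billion tokens. It is a back-of-the-envelope estimate, not the paper's own accounting.

```python
# Figures quoted above: parameter counts and Chinchilla's token count.
CHINCHILLA_PARAMS = 70e9
CHINCHILLA_TOKENS = 1.4e12
GOPHER_PARAMS = 280e9
# Assumption not stated above: Gopher's reported training set of ~300B tokens.
GOPHER_TOKENS = 300e9

def approx_training_flops(params, tokens):
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per training token."""
    return 6 * params * tokens

size_ratio = GOPHER_PARAMS / CHINCHILLA_PARAMS            # 4.0
tokens_per_param = CHINCHILLA_TOKENS / CHINCHILLA_PARAMS  # 20.0
chinchilla_flops = approx_training_flops(CHINCHILLA_PARAMS, CHINCHILLA_TOKENS)
gopher_flops = approx_training_flops(GOPHER_PARAMS, GOPHER_TOKENS)

print(f"Gopher / Chinchilla size ratio: {size_ratio:.1f}")
print(f"Chinchilla tokens per parameter: {tokens_per_param:.1f}")
print(f"Approx. FLOPs: Chinchilla {chinchilla_flops:.2e}, Gopher {gopher_flops:.2e}")
```

Under these assumptions the two estimates land within about 20% of each other, which is consistent with the two models having been trained on a comparable compute budget.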

## How are datasets documented?

Proper documentation of datasets has become a recognized best practice in the machine learning community. Two prominent frameworks have emerged for this purpose.

### Datasheets for Datasets

Proposed by Timnit Gebru and colleagues in 2018, the "Datasheets for Datasets" framework draws an analogy to datasheets in the electronics industry, where every component is accompanied by documentation describing its characteristics and recommended uses. The authors motivated the work by observing that "the machine learning community currently has no standardized process for documenting datasets, which can lead to severe consequences in high-stakes domains." [1] A dataset datasheet addresses:

- **Motivation:** Why the dataset was created and who funded it
- **Composition:** What the dataset contains, including data types, label distributions, and any known errors
- **Collection process:** How the data was gathered, including any filtering or preprocessing applied
- **Preprocessing and cleaning:** What transformations were applied to the raw data
- **Uses:** Recommended tasks and known limitations
- **Distribution:** How the dataset is shared and under what license
- **Maintenance:** Who maintains the dataset and how errors can be reported
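One way to make these sections machine-checkable is to store them as a structured record. The following is a hypothetical schema, not part of the Datasheets for Datasets proposal itself: a dataclass with one free-text field per section and a helper that reports which sections are still empty. The example answers are placeholders.

```python
from dataclasses import dataclass, fields

@dataclass
class Datasheet:
    """One free-text answer per datasheet section (hypothetical schema)."""
    motivation: str = ""
    composition: str = ""
    collection_process: str = ""
    preprocessing: str = ""
    uses: str = ""
    distribution: str = ""
    maintenance: str = ""

    def missing_sections(self):
        """Names of sections that have not been filled in yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

# Placeholder answers for illustration only.
sheet = Datasheet(
    motivation="Benchmark handwritten digit classifiers.",
    composition="70,000 grayscale 28x28 images, digits 0-9.",
    distribution="Public download; license to be confirmed.",
)
print(sheet.missing_sections())
```

A check like `missing_sections()` could run before a dataset is published, so that incomplete documentation is caught early.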

### Data Cards

Data Cards, developed at Google, provide a structured summary of essential information about a dataset across its life cycle. Beyond basic metadata, Data Cards include explanations, rationales, and instructions related to a dataset's provenance, representation, intended uses, and fairness evaluations. [8] Google published Data Cards alongside the Open Images dataset, and the approach has been adopted by organizations including [Hugging Face](/wiki/hugging_face).

Companies including Microsoft, Google, and IBM have begun piloting datasheets for datasets within their product teams, and academic researchers increasingly publish datasets with accompanying documentation.

## Data Versioning

As datasets evolve over time (with corrections, additions, and schema changes), tracking those changes becomes essential for reproducibility and debugging.

### DVC (Data Version Control)

DVC is an open-source version control system designed for data science and machine learning projects. It extends [Git](/wiki/git) workflows to handle large files, datasets, and models without storing them directly in the Git repository.

Key features of DVC include:

- **Data and model versioning:** Large files are replaced with small metadata files tracked by Git, while the actual data is stored in configurable remote storage (Amazon S3, Google Cloud Storage, Azure Blob Storage, or local storage).
- **Data pipelines:** DVC defines computational graphs that connect code, data, and configuration, specifying all steps required to produce a model.
- **Experiment tracking:** Developers can explore, iterate, and compare different machine learning experiments, with each experiment defined by changes in the workspace.
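The idea of replacing a large file with a small tracked metadata file can be illustrated independently of any tool. The sketch below is not DVC's actual file format or hashing scheme: the `.pointer.json` suffix, the field names, and the choice of SHA-256 are all assumptions made for this example.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def write_pointer(data_path: Path) -> Path:
    """Write a small JSON pointer next to `data_path` describing its contents."""
    # Reads the whole file at once; a real tool would hash large files in chunks.
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
    pointer = {"path": data_path.name, "sha256": digest,
               "size": data_path.stat().st_size}
    pointer_path = data_path.with_name(data_path.name + ".pointer.json")
    pointer_path.write_text(json.dumps(pointer, indent=2))
    return pointer_path

def is_unchanged(data_path: Path, pointer_path: Path) -> bool:
    """True if the data file still matches the hash recorded in the pointer."""
    recorded = json.loads(pointer_path.read_text())["sha256"]
    return hashlib.sha256(data_path.read_bytes()).hexdigest() == recorded

with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "train.csv"
    data.write_text("id,label\n1,cat\n2,dog\n")
    pointer = write_pointer(data)           # this small file would go into Git
    unchanged_before = is_unchanged(data, pointer)
    data.write_text("id,label\n1,cat\n2,dog\n3,bird\n")  # dataset is edited
    unchanged_after = is_unchanged(data, pointer)

print(unchanged_before, unchanged_after)
```

Because the pointer is tiny and text-based, it can be committed alongside code, while the data itself lives in separate storage and is identified by its content hash.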

Other data versioning tools include Git LFS, Dolt, lakeFS, and [MLflow](/wiki/mlflow). In November 2025, lakeFS announced its acquisition of DVC.

## How are datasets licensed?

The license attached to a dataset determines how it can be used, shared, and modified. Understanding dataset licenses is critical for both legal compliance and ethical practice.

### Common License Types

| License | Permissions | Key Restrictions |
|---------|-------------|------------------|
| CC0 (Public Domain) | Unrestricted use, modification, and distribution | None |
| CC BY (Attribution) | Use, modify, and distribute with credit | Must attribute the original creator |
| CC BY-SA (Attribution-ShareAlike) | Use, modify, and distribute with credit | Derivative works must use the same license |
| CC BY-NC (Attribution-NonCommercial) | Use and modify with credit | No commercial use |
| CC BY-NC-SA | Use and modify with credit | No commercial use; derivatives must use the same license |
| ODC-PDDL (Public Domain) | Unrestricted use | None |
| ODC-By (Attribution) | Use and modify with credit | Must attribute the source |
| ODbL (Open Database License) | Use, modify, and distribute | Must attribute, share alike, and keep open |
| CDLA-Permissive | Use, modify, and redistribute | Minimal restrictions, similar to MIT license for software |

For open data, CC0 and ODC-PDDL are the most permissive options. Researchers should note that software licenses (such as MIT or Apache 2.0) are not always appropriate for datasets, and purpose-built data licenses like those from Creative Commons or Open Data Commons are generally preferred.

## The Hugging Face Datasets Hub

[Hugging Face](/wiki/hugging_face) operates the largest public hub for machine learning datasets, hosting over 600,000 datasets as of 2024. The platform supports datasets in more than 8,000 languages, spanning tasks in [natural language processing](/wiki/natural_language_processing), [computer vision](/wiki/computer_vision), and audio processing. [10]

Key features of the Hugging Face Datasets Hub include:

- **Dataset Viewer:** An interactive browser-based tool for exploring dataset contents without downloading.
- **Dataset Cards:** Structured documentation accompanying each dataset, describing its composition, intended uses, and limitations.
- **Streaming support:** The `datasets` library allows users to process datasets that are too large to fit in memory by streaming data on the fly.
- **Efficient storage:** The platform uses Xet, a deduplication-based storage system that accelerates uploads and downloads by transferring duplicate data only once.
- **Privacy options:** Both public and private datasets are supported, enabling organizations to comply with licensing and privacy requirements.
- **One-line loading:** Any hosted dataset can be loaded into a Python environment with a single line of code using the `datasets` library.

The Hugging Face Datasets Hub has become a central resource for the machine learning community, providing easy access to datasets for research, education, and application development.
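One-line loading and streaming can be sketched with the library's `load_dataset` function. The wrapper names `load_split` and `first_examples` below are hypothetical helpers written for this example; running them requires the `datasets` package and network access.

```python
def load_split(name, split="train", streaming=False):
    """Load one split of a Hub dataset with the `datasets` library.

    The import is done lazily so this module can be imported even when the
    optional dependency (`pip install datasets`) is not installed.
    """
    from datasets import load_dataset
    return load_dataset(name, split=split, streaming=streaming)

def first_examples(name, n=3, split="train"):
    """Stream the first `n` examples without downloading the full dataset."""
    from itertools import islice
    return list(islice(load_split(name, split=split, streaming=True), n))
```

For instance, `first_examples("rajpurkar/squad")` would be expected to return the first three training examples of SQuAD as dictionaries, assuming that identifier is available on the Hub. The dataset identifier is given only as an example.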

## Data Preprocessing

Before a dataset is used to train a [machine learning model](/wiki/machine_learning_model), it usually must be preprocessed. Common preprocessing tasks include cleaning the data, normalizing numerical features, encoding categorical variables, and performing [feature engineering](/wiki/feature_engineering). Preprocessing is a critical step in the machine learning pipeline because the quality of input data directly affects model accuracy and generalization.

Additional preprocessing techniques include:

- **[Data augmentation](/wiki/data_augmentation):** Creating modified copies of existing samples (through rotations, translations, noise injection, or paraphrasing) to artificially expand the training set and improve model robustness.
- **[Dimensionality reduction](/wiki/dimensionality_reduction):** Reducing the number of features while preserving the most important information, using techniques like PCA. Methods such as t-SNE serve a similar purpose but are used mainly for visualization.
- **[Feature selection](/wiki/feature_selection):** Identifying and retaining only the most relevant features for the task at hand.
- **Handling class imbalance:** Applying [oversampling](/wiki/oversampling) (e.g., SMOTE) or undersampling techniques to balance the distribution of classes in the dataset.
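As an illustration of handling class imbalance, the sketch below implements plain random oversampling, which duplicates existing minority-class samples. This is a simpler technique than SMOTE, which instead synthesizes new samples by interpolating between neighbors. The function name and the data are made up for the example.

```python
import random
from collections import Counter, defaultdict

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until all classes are equal."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    target = max(len(items) for items in by_class.values())
    rng = random.Random(seed)
    out_x, out_y = [], []
    for y, items in by_class.items():
        # Draw extra copies (with replacement) to reach the largest class size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_x.extend(items + extra)
        out_y.extend([y] * target)
    return out_x, out_y

samples = list(range(10))
labels = ["majority"] * 8 + ["minority"] * 2
balanced_x, balanced_y = random_oversample(samples, labels)
print(Counter(balanced_y))
```

Oversampling is normally applied to the training split only, since duplicating samples in the validation or test sets would distort the evaluation.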

## Explain Like I'm 5 (ELI5)

Imagine you have a big box of flash cards. Each flash card has a picture on one side and the name of what is in the picture on the other side. If you show a computer thousands of these flash cards, it starts to learn what a cat looks like versus what a dog looks like. That box of flash cards is a dataset.

The more flash cards you have, and the more different kinds of pictures they show, the better the computer gets at recognizing things it has never seen before. But if all your flash cards only show golden retrievers, the computer might think every dog is a golden retriever. That is why it is important to have lots of different examples in your dataset.

Before you start teaching the computer, you split the flash cards into three piles: a big pile for learning (the training set), a medium pile for checking its progress (the validation set), and a small pile you hide away for a final test (the test set). That way, you can be sure the computer actually learned and is not just memorizing the answers.

## References

1. Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J. W., Wallach, H., Daumé III, H., & Crawford, K. (2021). "Datasheets for Datasets." *Communications of the ACM*, 64(12), 86-92. https://arxiv.org/abs/1803.09010

2. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). "ImageNet: A Large-Scale Hierarchical Image Database." *IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*.

3. LeCun, Y., Cortes, C., & Burges, C. J. C. (1998). "The MNIST Database of Handwritten Digits." http://yann.lecun.com/exdb/mnist/

4. Krizhevsky, A. (2009). "Learning Multiple Layers of Features from Tiny Images." Technical Report, University of Toronto. (CIFAR-10/100)

5. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). "Microsoft COCO: Common Objects in Context." *European Conference on Computer Vision (ECCV)*.

6. Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). "SQuAD: 100,000+ Questions for Machine Comprehension of Text." *Proceedings of EMNLP*.

7. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2018). "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding." *Proceedings of EMNLP*.

8. Pushkarna, M., Zaldivar, A., & Kjartansson, O. (2022). "Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI." *ACM Conference on Fairness, Accountability, and Transparency (FAccT)*.

9. "What is a Dataset?" IBM. https://www.ibm.com/think/topics/dataset

10. Hugging Face Datasets Hub Documentation. https://huggingface.co/docs/datasets/index

11. "AlexNet." Wikipedia. https://en.wikipedia.org/wiki/AlexNet

12. Baack, S., & Mozilla Insights (2024). "Training Data for the Price of a Sandwich: Common Crawl's Impact on Generative AI." Mozilla Foundation. https://www.mozillafoundation.org/en/research/library/generative-ai-training-data/common-crawl/

13. Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). "Training Compute-Optimal Large Language Models." DeepMind. https://arxiv.org/abs/2203.15556

