See also: Machine learning, Training set, Validation set, Test set
A dataset (also written as "data set") is a structured collection of data organized for analysis, processing, or machine learning tasks. In the context of artificial intelligence and machine learning, a dataset refers specifically to the body of information used to train, validate, and evaluate models. Datasets typically consist of individual data points (also called samples, instances, or examples), each described by a set of features (input variables) and, in supervised learning, corresponding labels (output variables or target values).
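The relationship between samples, features, and labels can be illustrated with a toy supervised dataset. The feature values and label names below are invented for illustration:

```python
# A toy supervised dataset: each row of X is one sample (its features),
# and the entry of y at the same index is that sample's label.
X = [
    [5.1, 3.5],  # sample 0: two numeric features
    [4.9, 3.0],  # sample 1
    [6.2, 2.9],  # sample 2
]
y = ["setosa", "setosa", "versicolor"]  # one label per sample

assert len(X) == len(y)  # supervised learning requires a label per sample
n_samples, n_features = len(X), len(X[0])
print(n_samples, n_features)  # → 3 2
```

In an unsupervised setting, the same samples would appear without `y`.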
Datasets are the foundation of every machine learning pipeline. The quality, size, diversity, and representativeness of a dataset directly influence the performance and reliability of the models built from it. As the field has matured, the creation, curation, documentation, and governance of datasets have become research areas in their own right.
The term "dataset" is now standard in most technical writing, though "data set" (two words) still appears in some formal and statistical contexts. Both forms are considered acceptable.
Datasets can be categorized by their modality, labeling status, and degree of structure. The following tables summarize the most common types encountered in machine learning, one table per grouping.
| Type | Description | Common Formats | Example Tasks |
|---|---|---|---|
| Tabular | Data organized in rows and columns, where each row is a sample and each column is a feature | CSV, Parquet, SQL databases | Classification, regression, fraud detection |
| Image | Collections of photographs or generated images, typically stored as pixel arrays | JPEG, PNG, TIFF | Image recognition, object detection, image segmentation |
| Text | Documents, sentences, or token sequences used for language tasks | Plain text, JSON, XML | Sentiment analysis, machine translation, question answering |
| Audio | Sound recordings or waveforms used for speech and acoustic tasks | WAV, MP3, FLAC | Speech recognition, speaker identification, music generation |
| Video | Sequential frames of visual data, sometimes with accompanying audio | MP4, AVI, WebM | Action recognition, video captioning, autonomous driving |
| Graph | Data representing nodes and edges, capturing relationships between entities | Adjacency matrices, edge lists, GraphML | Social network analysis, molecular property prediction, knowledge graphs |
| Type | Description | Typical Use |
|---|---|---|
| Labeled | Each sample is annotated with a ground-truth label or target value | Supervised learning |
| Unlabeled | Samples have no associated target values | Unsupervised learning, pretraining |
| Semi-labeled | A small portion of samples are labeled, while the majority are not | Semi-supervised learning |
| Type | Description | Examples |
|---|---|---|
| Structured | Data follows a predefined schema with clearly defined rows and columns | Spreadsheets, relational databases, CSV files |
| Unstructured | Data lacks a rigid format and requires preprocessing before use | Raw text documents, images, audio recordings |
| Semi-structured | Data has some organizational properties but does not conform to a strict tabular schema | JSON documents, XML files, HTML pages |
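The difference between structured and semi-structured data is easy to see in code. A sketch using Python's standard library, with invented example records:

```python
import csv
import io
import json

# Structured: CSV rows conform to a fixed schema of named columns.
csv_text = "id,age,label\n1,34,yes\n2,29,no\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["age"])  # → 34 (as a string; CSV itself carries no types)

# Semi-structured: JSON has nesting and named fields but no rigid
# tabular schema; fields may vary from record to record.
json_text = '{"id": 1, "tags": ["cat", "pet"], "meta": {"source": "web"}}'
record = json.loads(json_text)
print(record["meta"]["source"])  # → web
```

Note that the CSV reader returns every value as a string; recovering numeric types is itself a preprocessing step.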
Several datasets have played pivotal roles in advancing the field of machine learning and deep learning. These benchmark datasets serve as standard reference points for comparing model architectures and training techniques.
| Dataset | Year | Size | Description |
|---|---|---|---|
| MNIST | 1998 | 70,000 grayscale images (28x28 pixels) | Handwritten digit recognition dataset created by Yann LeCun, Corinna Cortes, and Chris Burges. Contains 60,000 training and 10,000 test images of digits 0 through 9. Often considered the "hello world" of machine learning. |
| CIFAR-10 | 2009 | 60,000 color images (32x32 pixels) | Created by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton at the Canadian Institute for Advanced Research. Contains 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck) with 6,000 images per class. |
| CIFAR-100 | 2009 | 60,000 color images (32x32 pixels) | An extension of CIFAR-10 with 100 fine-grained classes grouped into 20 superclasses, containing 600 images per class. |
| ImageNet | 2009 | 14+ million images | A large-scale image database organized according to the WordNet hierarchy, containing more than 14 million hand-annotated images across 20,000+ categories. The associated ImageNet Large Scale Visual Recognition Challenge (ILSVRC) catalyzed the deep learning revolution when AlexNet won the 2012 competition with a top-5 error rate of 15.3%, outperforming the runner-up by over 10 percentage points. |
| COCO (Common Objects in Context) | 2014 | 330,000+ images | A large-scale dataset for object detection, segmentation, and captioning. Each image is annotated with 80 object categories and 5 descriptive captions, making it one of the most widely used benchmarks in computer vision. |
| Dataset | Year | Description |
|---|---|---|
| SQuAD (Stanford Question Answering Dataset) | 2016 | A reading comprehension dataset consisting of questions posed on Wikipedia articles, where the answer to each question is a span of text from the corresponding passage. SQuAD 2.0 added unanswerable questions to increase difficulty. |
| GLUE (General Language Understanding Evaluation) | 2018 | A benchmark suite of nine natural language understanding tasks, including sentiment analysis, textual entailment, and paraphrase detection. GLUE became the standard evaluation framework for pretrained language models like BERT and RoBERTa. SuperGLUE was later introduced as a more challenging successor. |
| Common Crawl | 2008 (ongoing) | A nonprofit project that crawls the web and provides free archives of web page data. The corpus contains over 300 billion web pages, with monthly crawls adding billions more. Common Crawl data has been used to train many large language models, and a 2024 Mozilla report found that two-thirds of 47 generative LLMs released between 2019 and 2023 relied on Common Crawl data. |
Building a high-quality dataset involves multiple stages, each requiring careful planning and execution.
Data can be gathered from a variety of sources, including web scraping, sensors and IoT devices, user-generated content, public repositories, surveys, and existing organizational records.
For supervised learning tasks, raw data must be annotated with ground-truth labels. Annotation methods include manual labeling by domain experts, crowdsourcing platforms, programmatic (weak) labeling with heuristic rules, and model-assisted pre-labeling reviewed by humans.
Raw datasets often contain noise, missing values, duplicates, and inconsistencies. Preprocessing steps typically include removing duplicates, handling missing values, correcting inconsistent formats, and filtering out corrupted samples.
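Cleaning steps like deduplication and missing-value imputation can be sketched in plain Python (real pipelines typically use libraries such as pandas). The toy records below are invented:

```python
raw = [
    {"age": 34, "income": 52000},
    {"age": 34, "income": 52000},   # exact duplicate
    {"age": 29, "income": None},    # missing value
    {"age": 41, "income": 67000},
]

# 1. Drop exact duplicates while preserving order.
seen, deduped = set(), []
for row in raw:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(row)

# 2. Impute missing incomes with the mean of the observed values.
observed = [r["income"] for r in deduped if r["income"] is not None]
mean_income = sum(observed) / len(observed)
for r in deduped:
    if r["income"] is None:
        r["income"] = mean_income

print(len(deduped), mean_income)  # → 3 59500.0
```

Mean imputation is only one strategy; rows with missing values can also be dropped, or the gap can be filled by a model.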
Before training a model, a dataset is typically divided into three subsets, each serving a distinct purpose.
| Subset | Purpose | Typical Share |
|---|---|---|
| Training set | Used to fit the model's parameters during training | 60% to 80% |
| Validation set | Used to tune hyperparameters and monitor for overfitting during training | 10% to 20% |
| Test set | Used to evaluate the final model's performance on unseen data | 10% to 20% |
Common split ratios include 60/20/20, 70/15/15, and 80/10/10. For very large datasets (millions of samples), even smaller validation and test proportions (such as 98/1/1) can be sufficient because each subset still contains tens of thousands of examples.
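An 80/10/10 split can be sketched with the standard library. Shuffling before splitting avoids ordering artifacts, such as data sorted by class or by date:

```python
import random

samples = list(range(1000))  # stand-ins for 1,000 data points
rng = random.Random(42)      # fixed seed so the split is reproducible
rng.shuffle(samples)

n = len(samples)
n_train, n_val = int(0.8 * n), int(0.1 * n)
train = samples[:n_train]
val = samples[n_train:n_train + n_val]
test = samples[n_train + n_val:]

print(len(train), len(val), len(test))  # → 800 100 100
```

The three subsets are disjoint by construction, which is essential: any overlap between training and test data inflates the measured performance.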
Dataset bias occurs when the data used to train a model does not accurately represent the population or phenomenon the model is intended to serve. Biased datasets can lead to models that systematically disadvantage certain groups or produce inaccurate results for underrepresented populations.
| Bias Type | Description |
|---|---|
| Selection bias | The data collection process systematically excludes certain groups or scenarios |
| Measurement bias | The way data is recorded introduces systematic errors (e.g., different sensor calibrations across demographic groups) |
| Historical bias | The data reflects existing societal inequalities, and models trained on it perpetuate those inequalities |
| Representation bias | Certain groups are underrepresented in the dataset relative to their real-world prevalence |
| Label bias | Annotators apply labels inconsistently, or cultural biases influence the labeling process |
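A simple check for representation bias compares each group's share of the dataset against a reference share, such as census proportions. The group names, counts, and reference shares below are invented for illustration:

```python
from collections import Counter

dataset_groups = ["A"] * 700 + ["B"] * 250 + ["C"] * 50
reference_share = {"A": 0.5, "B": 0.3, "C": 0.2}

counts = Counter(dataset_groups)
total = sum(counts.values())
for group, ref in reference_share.items():
    actual = counts[group] / total
    # Flag groups at less than half their reference share (an arbitrary
    # threshold chosen for this sketch).
    flag = "UNDERREPRESENTED" if actual < 0.5 * ref else "ok"
    print(f"{group}: dataset {actual:.2f} vs reference {ref:.2f} -> {flag}")
```

Here group C makes up 5% of the dataset against a 20% reference share, so it would be flagged. Such checks only detect representation bias; selection, measurement, historical, and label bias require different audits.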
Several high-profile cases have demonstrated the consequences of dataset bias. The 2018 Gender Shades study found that commercial facial recognition systems had substantially higher error rates for darker-skinned women than for lighter-skinned men, and Amazon reportedly scrapped an internal recruiting tool after discovering that, having been trained on historical hiring data, it penalized resumes associated with women.
Addressing dataset bias requires intervention at multiple stages: collecting diverse and representative data, auditing datasets during curation, rebalancing or reweighting samples during training, and evaluating models on disaggregated metrics across demographic groups.
Synthetic datasets are artificially generated by algorithms to mimic the statistical properties of real-world data without containing actual real-world information. They have become increasingly important in modern machine learning workflows.
| Method | Description |
|---|---|
| Generative Adversarial Networks (GANs) | Two neural networks (a generator and a discriminator) compete to produce realistic synthetic samples |
| Variational Autoencoders (VAEs) | Encoder-decoder architectures that learn a latent representation and sample new data points from it |
| Large Language Models (LLMs) | Pretrained language models generate synthetic text data for NLP tasks |
| Statistical simulation | Traditional statistical models replicate the distributions and correlations found in real data |
| Rule-based generation | Domain-specific rules and templates produce structured synthetic records |
| Data augmentation | Existing samples are transformed (rotated, cropped, paraphrased) to create new training examples |
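The statistical-simulation row of the table can be sketched in a few lines: fit a Gaussian to a real numeric feature, then sample synthetic values with the same mean and standard deviation. The "real" measurements below are invented, and a single-feature Gaussian is the simplest possible case; practical generators also preserve correlations between features:

```python
import random
import statistics

real = [52.1, 48.7, 50.3, 49.9, 51.6, 47.8, 50.9, 49.2]
mu = statistics.mean(real)
sigma = statistics.stdev(real)

rng = random.Random(0)  # fixed seed for reproducibility
synthetic = [rng.gauss(mu, sigma) for _ in range(1000)]

# The synthetic sample's statistics approximate the real ones.
print(round(mu, 2), round(statistics.mean(synthetic), 2))
```

The synthetic values mimic the distribution of the original feature without reproducing any actual record, which is the core privacy argument for synthetic data.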
Some estimates suggest that over 60% of data used for AI applications by 2024 was synthetic, and this proportion is expected to continue growing.
Benchmark datasets are standardized, high-quality collections designed to evaluate and compare machine learning models in a fair and reproducible manner. They function as shared reference points: when every researcher measures their model against the same test data, it becomes straightforward to compare results and track progress.
Benchmark datasets serve several purposes:
Benchmarks also have notable limitations:
The relationship between dataset size and model performance is a central concern in machine learning. While more data generally leads to better models, the quality of that data matters at least as much as its quantity.
Research on neural scaling laws has shown that model performance improves predictably as training data, model size, and compute increase together. However, these gains depend on data quality. The Chinchilla scaling laws, published by DeepMind in 2022, emphasized that many large language models were undertrained relative to their size and that increasing the amount of high-quality training data could be more efficient than simply scaling model parameters.
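A widely cited rule of thumb distilled from the Chinchilla results is that compute-optimal training uses roughly 20 tokens of training data per model parameter; Chinchilla itself was a 70-billion-parameter model trained on about 1.4 trillion tokens. A rough illustration (the exact exponents come from fitted scaling laws, and 20x is only an approximation):

```python
TOKENS_PER_PARAM = 20  # approximate compute-optimal ratio

def compute_optimal_tokens(n_params: float) -> float:
    """Approximate training-token budget for a compute-optimal run."""
    return TOKENS_PER_PARAM * n_params

# Chinchilla: 70B parameters -> ~1.4T tokens.
print(f"{compute_optimal_tokens(70e9):.2e}")  # → 1.40e+12
```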
Proper documentation of datasets has become a recognized best practice in the machine learning community. Two prominent frameworks have emerged for this purpose.
Proposed by Timnit Gebru and colleagues in 2018, the "Datasheets for Datasets" framework draws an analogy to datasheets in the electronics industry, where every component is accompanied by documentation describing its characteristics and recommended uses. A dataset datasheet addresses questions about the dataset's motivation, composition, collection process, preprocessing and labeling, recommended uses, distribution, and maintenance.
Data Cards, developed at Google, provide a structured summary of essential information about a dataset across its life cycle. Beyond basic metadata, Data Cards include explanations, rationales, and instructions related to a dataset's provenance, representation, intended uses, and fairness evaluations. Google published Data Cards alongside the Open Images dataset, and the approach has been adopted by organizations including Hugging Face.
Companies including Microsoft, Google, and IBM have begun piloting datasheets for datasets within their product teams, and academic researchers increasingly publish datasets with accompanying documentation.
As datasets evolve over time (with corrections, additions, and schema changes), tracking those changes becomes essential for reproducibility and debugging.
DVC is an open-source version control system designed for data science and machine learning projects. It extends Git workflows to handle large files, datasets, and models without storing them directly in the Git repository.
Key features of DVC include Git-compatible workflows, support for remote storage backends such as Amazon S3 and Google Cloud Storage, reproducible pipeline definitions, and lightweight metafiles that reference data by content hash while the data itself is stored outside the Git repository.
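The core idea behind DVC-style versioning (store the large file in a content-addressed cache, track only a small hash pointer in version control) can be sketched in a few lines. The file and directory names below are invented for illustration, and real DVC metafiles use a YAML format rather than JSON:

```python
import hashlib
import json
import os
import shutil
import tempfile

def snapshot(path: str, cache_dir: str) -> str:
    """Copy `path` into a hash-addressed cache; return a small pointer."""
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    os.makedirs(cache_dir, exist_ok=True)
    shutil.copy(path, os.path.join(cache_dir, digest))
    return json.dumps({"path": os.path.basename(path), "md5": digest})

workdir = tempfile.mkdtemp()
data_file = os.path.join(workdir, "data.csv")
with open(data_file, "w") as f:
    f.write("id,label\n1,cat\n")

pointer = snapshot(data_file, os.path.join(workdir, "cache"))
print(pointer)  # the pointer stays tiny even if data.csv were gigabytes
```

Because the cache is keyed by content hash, re-snapshotting an unchanged file stores nothing new, and any historical version can be restored from its hash.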
Other data versioning tools include Git LFS, Dolt, lakeFS, and MLflow. In November 2025, lakeFS announced its acquisition of DVC.
The license attached to a dataset determines how it can be used, shared, and modified. Understanding dataset licenses is critical for both legal compliance and ethical practice.
| License | Permissions | Key Restrictions |
|---|---|---|
| CC0 (Public Domain) | Unrestricted use, modification, and distribution | None |
| CC BY (Attribution) | Use, modify, and distribute with credit | Must attribute the original creator |
| CC BY-SA (Attribution-ShareAlike) | Use, modify, and distribute with credit | Derivative works must use the same license |
| CC BY-NC (Attribution-NonCommercial) | Use and modify with credit | No commercial use |
| CC BY-NC-SA | Use and modify with credit | No commercial use; derivatives must use the same license |
| ODC-PDDL (Public Domain) | Unrestricted use | None |
| ODC-By (Attribution) | Use and modify with credit | Must attribute the source |
| ODbL (Open Database License) | Use, modify, and distribute | Must attribute, share alike, and keep open |
| CDLA-Permissive | Use, modify, and redistribute | Minimal restrictions, similar to MIT license for software |
For open data, CC0 and ODC-PDDL are the most permissive options. Researchers should note that software licenses (such as MIT or Apache 2.0) are not always appropriate for datasets, and purpose-built data licenses like those from Creative Commons or Open Data Commons are generally preferred.
Hugging Face operates the largest public hub for machine learning datasets, hosting over 600,000 datasets as of 2024. The platform supports datasets in more than 8,000 languages, spanning tasks in natural language processing, computer vision, and audio processing.
Key features of the Hugging Face Datasets Hub include one-line programmatic loading through the open-source `datasets` library, community-contributed dataset cards that document provenance and intended use, and streaming support: the `datasets` library allows users to process datasets that are too large to fit in memory by streaming data on the fly.

The Hugging Face Datasets Hub has become a central resource for the machine learning community, providing easy access to datasets for research, education, and application development.
Before using a dataset to train a machine learning model, it must be preprocessed. Common preprocessing tasks include cleaning the data, normalizing numerical features, encoding categorical variables, and performing feature engineering. Preprocessing is a critical step in the machine learning pipeline because the quality of input data directly affects model accuracy and generalization.
Additional preprocessing techniques include feature scaling (standardization or min-max normalization), encoding categorical variables, dimensionality reduction, resampling to address class imbalance, and tokenization for text data.
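Two of these techniques, one-hot encoding of a categorical feature and min-max normalization of a numeric one, can be sketched in plain Python. The feature values below are invented:

```python
colors = ["red", "green", "blue", "green"]
sizes = [10.0, 20.0, 15.0, 30.0]

# One-hot encoding: each category becomes its own binary indicator column.
categories = sorted(set(colors))  # ['blue', 'green', 'red']
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]

# Min-max normalization: rescale numeric values into the range [0, 1].
lo, hi = min(sizes), max(sizes)
scaled = [(s - lo) / (hi - lo) for s in sizes]

print(one_hot[0])  # → [0, 0, 1]  (the encoding of 'red')
print(scaled)      # → [0.0, 0.5, 0.25, 1.0]
```

In practice these transformations must be fitted on the training set only and then applied unchanged to the validation and test sets, to avoid leaking information about held-out data.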
Imagine you have a big box of flash cards. Each flash card has a picture on one side and the name of what is in the picture on the other side. If you show a computer thousands of these flash cards, it starts to learn what a cat looks like versus what a dog looks like. That box of flash cards is a dataset.
The more flash cards you have, and the more different kinds of pictures they show, the better the computer gets at recognizing things it has never seen before. But if all your flash cards only show golden retrievers, the computer might think every dog is a golden retriever. That is why it is important to have lots of different examples in your dataset.
Before you start teaching the computer, you split the flash cards into three piles: a big pile for learning (the training set), a medium pile for checking its progress (the validation set), and a small pile you hide away for a final test (the test set). That way, you can be sure the computer actually learned and is not just memorizing the answers.