Datasets: Difference between revisions

195 bytes added ,  21 February 2023
{{see also|Machine learning terms}}
==Introduction==
[[Datasets]] in [[machine learning]] are collections of [[data]] gathered for [[training]], testing, and evaluating a model. They typically consist of [[input]] data ([[features]]) and their corresponding [[output]] or [[label]] data. Datasets vary in size, format, and complexity depending on the problem being addressed.

==Importance==
Datasets are essential in machine learning because they are used to [[train]], [[test]] and [[evaluate]] models. The quality and quantity of the data can have a considerable effect on the [[accuracy]] and effectiveness of a model's predictions. Furthermore, a dataset that lacks diversity or is not representative of the target problem may restrict the model's ability to generalize to new or unseen scenarios.

==Types of Datasets==

===Structured Datasets===
Structured datasets contain data organized in a fixed format of rows and columns. Examples include spreadsheets, SQL databases and CSV files. These datasets are commonly used for [[classification]] or [[regression]] tasks.
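As a minimal sketch of the row-and-column layout described above, the following Python snippet parses a small CSV table and separates it into feature vectors and labels. The column names (<code>height_cm</code>, <code>weight_kg</code>, <code>species</code>), the sample values, and the helper names are hypothetical and exist only for illustration.

```python
import csv
import io

# Hypothetical CSV content: two numeric feature columns and one label column.
CSV_TEXT = """height_cm,weight_kg,species
30.0,4.1,cat
55.0,20.3,dog
28.5,3.8,cat
"""


def load_rows(csv_text):
    """Parse CSV text into a list of dicts, one dict per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def split_features_labels(rows, label_column):
    """Separate each row into a numeric feature vector and its label."""
    features = [
        [float(value) for name, value in row.items() if name != label_column]
        for row in rows
    ]
    labels = [row[label_column] for row in rows]
    return features, labels


rows = load_rows(CSV_TEXT)
features, labels = split_features_labels(rows, "species")
print(features)
print(labels)
```

Each row is one example and each column is one attribute, which is why tabular data of this kind maps directly onto classification (predicting a category such as <code>species</code>) or regression (predicting a number).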


===Unstructured Datasets===
Unstructured datasets lack a predefined format and consist of content such as text, images or audio. Examples include social media posts, images and audio recordings. These datasets are used for tasks such as [[natural language processing]], [[computer vision]] and [[speech recognition]].

===Labeled Datasets===
[[Label]]ed datasets have already been annotated with the correct output, or [[label]], for each example. They are frequently used in [[supervised learning]] tasks, in which a model learns to predict the [[output]] from the [[input]] data.

===Unlabeled Datasets===
Unlabeled datasets contain no predetermined output or label data. They are frequently used in [[unsupervised learning]] tasks, which aim to detect patterns or structure within the data.

===Training Datasets===
[[Training]] datasets are used to fit a [[machine learning model]]. They typically make up the majority of the available data and are ideally curated so that the model sees a wide range of inputs.

===Validation Datasets===
[[Validation]] datasets are used to assess a model's performance during training. They are kept separate from the training data and help detect whether the model is starting to [[overfit]] or [[underfit]] the training data.

===Testing Datasets===
[[Testing]] datasets are used to evaluate the final performance of a machine learning model. They are held out from both the training and validation data, so that the evaluation indicates how well the model generalizes to new, unseen data.
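The relationship between training, validation and testing datasets can be sketched as a single shuffled split into three disjoint parts. The function name <code>split_dataset</code> and the 70/15/15 proportions are assumptions chosen for illustration; the article does not prescribe specific ratios.

```python
import random


def split_dataset(examples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle a copy of `examples` and cut it into three disjoint parts.

    The default fractions are illustrative assumptions, not fixed rules.
    Whatever remains after the training and validation parts is the test set.
    """
    if not 0 < train_frac < 1 or not 0 <= val_frac < 1:
        raise ValueError("fractions must lie between 0 and 1")
    if train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a test set")

    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility

    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


examples = list(range(100))
train, val, test = split_dataset(examples)
print(len(train), len(val), len(test))
```

Because the three parts never share an example, performance on the test part is measured on data the model has not seen during training or validation.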


==Data Preprocessing==
Before a dataset is used to train a machine learning model, it usually needs to be [[preprocess]]ed. This may involve tasks such as cleaning the data, [[normalizing]] it and [[feature engineering]]. [[Preprocessing]] plays an integral role in the machine learning pipeline because it can significantly affect the model's accuracy and effectiveness.


==Explain Like I'm 5 (ELI5)==
A dataset is like a large bag of different objects, such as pictures, words and numbers, that a computer can study and learn from. It's essential that there be plenty of variety in the bag so the computer can recognize similar items when they appear later. Before the computer can use this knowledge to make decisions, though, we need to organize everything inside so it's easy to understand, like tidying up our room before playing with toys!
[[Category:Terms]] [[Category:Machine learning terms]]