
Datasets: Difference between revisions

134 bytes added, 21 February 2023
no edit summary
m (Alpha5 moved page Data set or dataset to Datasets)
No edit summary
Line 1: Line 1:
==Introduction==
[[Datasets]] in [[machine learning]] are collections of raw [[data]] gathered for [[training]], testing, and evaluating a model. They typically consist of [[input]] data ([[features]]) and, where available, the corresponding [[output]] or [[label]] data. Datasets vary in size, format, and complexity depending on the problem being addressed.
 
Datasets are often organized as a spreadsheet or CSV (comma-separated values) file, with one row per example and one column per feature or label.
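
As a minimal sketch of the features-and-label layout described above, the following reads a small CSV dataset with Python's standard <code>csv</code> module. The column names, the values, and the helper <code>load_dataset</code> are hypothetical and chosen only for illustration.

```python
import csv
import io

# Hypothetical toy data: two feature columns and one label column.
CSV_TEXT = """sepal_length,sepal_width,species
5.1,3.5,setosa
7.0,3.2,versicolor
6.3,3.3,virginica
"""


def load_dataset(text, label_column):
    """Split CSV text into a list of feature rows and a list of labels."""
    features, labels = [], []
    for row in csv.DictReader(io.StringIO(text)):
        # Remove the label from the row; what remains are the features.
        labels.append(row.pop(label_column))
        # Assumption: every remaining column holds a numeric value.
        features.append({name: float(value) for name, value in row.items()})
    return features, labels


features, labels = load_dataset(CSV_TEXT, label_column="species")
print(features[0])
print(labels)
```

Each row becomes one example, and the column named by <code>label_column</code> is separated from the remaining feature columns.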


==Importance==
Line 8: Line 10:
Datasets can be classified based on characteristics such as their size, source, format and labeling. Common types of datasets include:


===Structured Datasets===
Structured datasets contain data organized in a fixed format of rows and columns. Examples include spreadsheets, SQL databases, and CSV files. They are commonly used for classification and regression tasks.
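
To illustrate the row-and-column layout, here is a small sketch that stores a structured dataset in an in-memory SQLite database and reads it back as features and a regression target. The table name, column names, and values are made up for this example.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE houses (area_m2 REAL, bedrooms INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO houses VALUES (?, ?, ?)",
    [(50.0, 1, 150000.0), (80.0, 2, 240000.0), (120.0, 3, 360000.0)],
)

# Every row has the same columns, which is what makes the data "structured".
rows = conn.execute(
    "SELECT area_m2, bedrooms, price FROM houses ORDER BY area_m2"
).fetchall()
conn.close()

X = [row[:2] for row in rows]  # feature columns: area and bedroom count
y = [row[2] for row in rows]   # target column for a regression task: price
print(X)
print(y)
```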


===Unstructured Datasets===
Unstructured datasets lack a predefined format and consist of content such as text, images, or audio. Examples include social media posts, photographs, and audio recordings. They are used for tasks such as natural language processing, computer vision, and speech recognition.
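
Unstructured data usually has to be converted into numeric features before a model can use it. One simple option for text, shown here only as a sketch, is a bag-of-words count. The example posts and the helper <code>bag_of_words</code> are hypothetical, and real pipelines typically use more careful tokenization.

```python
import re
from collections import Counter

# Hypothetical social media posts with no fixed schema.
posts = [
    "Loving the new phone, great camera!",
    "The camera is great, the battery is not.",
]


def bag_of_words(text):
    """Count lowercase word tokens in a piece of free text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


vectors = [bag_of_words(post) for post in posts]

# Shared vocabulary, so that every post maps to a vector of the same length.
vocabulary = sorted(set().union(*vectors))
matrix = [[vector[word] for word in vocabulary] for vector in vectors]

print(vocabulary)
print(matrix)
```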


===Labeled Datasets===
Labeled datasets have been annotated with the correct output, or label, for each input. They are used in supervised learning tasks, in which a model learns to predict the output from the input data.
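
The sketch below shows how labels drive a supervised prediction, using a 1-nearest-neighbor rule as a deliberately simple stand-in for a trained model. The data points, the labels, and the function <code>predict</code> are illustrative assumptions.

```python
import math

# Hypothetical labeled dataset: (features, label) pairs.
labeled = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((8.0, 8.0), "large"),
    ((9.0, 7.5), "large"),
]


def predict(point, examples):
    """Return the label of the labeled example closest to `point`."""
    _, nearest_label = min(examples, key=lambda ex: math.dist(point, ex[0]))
    return nearest_label


print(predict((2.0, 1.5), labeled))
print(predict((7.0, 9.0), labeled))
```

Without the labels, the same points could still be grouped, but there would be no known output to predict.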


===Unlabeled Datasets===
Unlabeled datasets have no predetermined output or label data. They are frequently used in unsupervised learning tasks, which aim to detect patterns or structure within the data.
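
As one example of finding structure without labels, the following is a minimal one-dimensional k-means sketch. The input values, the starting centroids, and the fixed iteration count are assumptions chosen to keep the example deterministic.

```python
def k_means_1d(values, centroids, iterations=10):
    """Group numbers around the given starting centroids (1-D k-means)."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(
                range(len(centroids)), key=lambda i: abs(value - centroids[i])
            )
            clusters[nearest].append(value)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabeled measurements.
values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centroids, clusters = k_means_1d(values, centroids=[0.0, 5.0])
print(centroids)
print(clusters)
```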


===Training Datasets===
Training datasets are used to fit a machine learning model. They typically make up the majority of the available data and are curated so that the model is exposed to a wide range of inputs.


===Validation Datasets===
Validation datasets are used to assess a model's performance during training. They are held out from the training data and help detect overfitting or underfitting, for example when tuning hyperparameters.


===Testing Datasets===
Testing datasets are used to evaluate the final performance of a machine learning model. They are kept separate from both the training and validation data, so that the evaluation estimates how well the model generalizes to new, unseen data.
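
A common way to obtain training, validation, and testing datasets is to shuffle one dataset and cut it into three non-overlapping parts. The sketch below assumes a 70/15/15 split and a fixed random seed. Both are illustrative choices rather than a rule, and <code>split_dataset</code> is a hypothetical helper.

```python
import random


def split_dataset(examples, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle examples and split them into train, validation, and test lists."""
    shuffled = list(examples)  # copy, so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)

    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is held out for testing
    return train, val, test


# Hypothetical dataset of 100 examples, represented by their indices.
examples = list(range(100))
train, val, test = split_dataset(examples)
print(len(train), len(val), len(test))
```

Because the three slices never overlap, no example used for fitting or tuning the model appears in the final evaluation.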