In machine learning and statistics, categorical data (also called qualitative data) refers to variables that take on a discrete set of values representing categories, groups, or labels rather than numerical quantities. Unlike continuous or numerical data, categorical values have no inherent mathematical meaning: you cannot meaningfully add, subtract, or average them. Categorical data plays a central role in many machine learning tasks, including classification, clustering, and regression, where it often appears as input features, target labels, or both.
Examples of categorical data include colors (red, blue, green), country names (USA, France, Japan), blood types (A, B, AB, O), and product ratings (1 star through 5 stars). Since most machine learning algorithms operate on numerical inputs, converting categorical data into a suitable numeric representation is a fundamental step in data preprocessing.
The psychologist Stanley Smith Stevens introduced the classic typology of measurement scales in his 1946 paper "On the Theory of Scales of Measurement," published in Science. Stevens identified four levels: nominal, ordinal, interval, and ratio. Categorical data falls under the first two levels.
| Level | Ordered | Equal spacing | True zero | Examples |
|---|---|---|---|---|
| Nominal | No | No | No | Colors, blood types, country codes |
| Ordinal | Yes | No | No | Education level, satisfaction ratings, T-shirt sizes |
| Interval | Yes | Yes | No | Temperature in Celsius, calendar years |
| Ratio | Yes | Yes | Yes | Height, weight, income |
The interval and ratio levels describe quantitative (numerical) data. The distinction between nominal and ordinal categorical data has direct consequences for which encoding methods are appropriate and which statistical tests can be applied.
Categorical data falls into two main subtypes: nominal and ordinal.
Nominal data consists of categories with no natural order or ranking. The categories are simply different labels, and no category is "greater than" or "less than" another. Examples include colors (red, blue, green), blood types (A, B, AB, O), and country names.
Because there is no ordering relationship among nominal categories, encoding methods that impose an artificial numeric order (such as label encoding) can mislead certain models into assuming a ranking that does not exist.
Ordinal data consists of categories with a meaningful order or ranking, but the distances between categories are not necessarily equal or known. Examples include education levels (high school, bachelor's, master's, doctorate), satisfaction ratings, T-shirt sizes (S, M, L, XL), and star-based product ratings.
The key difference from nominal data is that ordinal categories carry information about relative position. A "5-star" rating is better than a "3-star" rating, even though the exact numerical gap between them may not be well defined.
Binary data is a special case of nominal data with exactly two categories. Examples include yes/no, true/false, male/female, and pass/fail. Binary features are the simplest categorical type and can be represented directly as 0 or 1 without any risk of imposing a false ordering. Many encoding methods reduce multi-category features into sets of binary columns.
The cardinality of a categorical feature is the number of unique categories it contains. The level of cardinality has major practical implications for encoding and modeling.
| Cardinality level | Typical range | Examples | Encoding considerations |
|---|---|---|---|
| Low cardinality | 2 to ~20 unique values | Gender, color, day of week, country (small set) | One-hot encoding works well; most encoding methods are viable |
| Medium cardinality | ~20 to ~100 unique values | US state, product category, job title | One-hot encoding starts creating many columns; target or binary encoding may be preferable |
| High cardinality | 100+ unique values | ZIP code, user ID, product SKU, IP address | One-hot encoding is impractical; feature hashing, target encoding, or entity embeddings are needed |
High-cardinality features pose a particular challenge. Applying one-hot encoding to a feature with 10,000 unique values would create 10,000 new binary columns, leading to extreme dimensionality, increased memory usage, and potential overfitting. Specialized encoding strategies are essential for handling these cases effectively.
Since most machine learning algorithms require numerical inputs, categorical features must be transformed into numbers through a process called encoding. The choice of encoding method depends on the type of categorical data, its cardinality, and the model being used.
One-hot encoding creates a new binary column for each unique category. Each observation gets a 1 in the column corresponding to its category and 0 in all other columns. For a color feature with values red, blue, and green:
| Original | is_red | is_blue | is_green |
|---|---|---|---|
| Red | 1 | 0 | 0 |
| Blue | 0 | 1 | 0 |
| Green | 0 | 0 | 1 |
One-hot encoding is the most widely used method for nominal data with low to moderate cardinality. It avoids introducing any false ordinal relationship between categories. However, it suffers from the curse of dimensionality when applied to high-cardinality features, as it creates one column per category.
A common variant is dummy encoding, which drops one of the K columns (producing K-1 columns) to avoid perfect multicollinearity in linear regression and other linear models. The dropped category becomes the implicit reference level.
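As a quick illustration, here is a minimal sketch of one-hot and dummy encoding using pandas' get_dummies; the toy color values mirror the example above.

```python
import pandas as pd

df = pd.DataFrame({'color': ['Red', 'Blue', 'Green', 'Red']})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df['color'], prefix='is')

# Dummy encoding: drop one column so the dropped category
# becomes the implicit reference level
dummy = pd.get_dummies(df['color'], prefix='is', drop_first=True)
```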
Label encoding assigns each category a unique integer (for example, red = 0, blue = 1, green = 2). This produces a single numeric column, making it memory-efficient. However, it introduces an implicit ordinal relationship between categories. A model might incorrectly interpret green (2) as being "twice" blue (1) or "greater than" red (0).
Label encoding is appropriate for ordinal data where the integer ordering reflects the true ranking. For nominal data, it should generally be avoided with linear models and neural networks, though decision tree-based models are less affected because they split on specific threshold values rather than assuming linear relationships.
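For illustration, a minimal sketch of label encoding with pandas' factorize, which assigns integers in order of first appearance (the color values are illustrative):

```python
import pandas as pd

colors = pd.Series(['red', 'blue', 'green', 'red'])

# factorize assigns integers by first appearance: red=0, blue=1, green=2
codes, uniques = pd.factorize(colors)
print(codes)    # [0 1 2 0]
print(uniques)  # Index(['red', 'blue', 'green'], dtype='object')
```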
Ordinal encoding is similar to label encoding but explicitly maps categories to integers according to a known, meaningful order. For education level, a mapping like high school = 1, bachelor's = 2, master's = 3, doctorate = 4 preserves the natural ranking. In scikit-learn, the OrdinalEncoder class accepts a user-specified category order, ensuring the mapping aligns with domain knowledge rather than arbitrary alphabetical or appearance-based sorting.
Best practices for ordinal encoding include fitting the encoder on training data only (to prevent data leakage) and handling unknown categories at inference time by assigning them a default value or raising an error.
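A short sketch of how this might look with scikit-learn's OrdinalEncoder, using the education-level ordering above; mapping unknown categories to a sentinel value is one of the options the class supports.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({'education': ["high school", "bachelor's", "master's", "doctorate"]})

# Explicit category order prevents arbitrary alphabetical sorting;
# unknown categories at inference time are mapped to -1
encoder = OrdinalEncoder(
    categories=[["high school", "bachelor's", "master's", "doctorate"]],
    handle_unknown='use_encoded_value',
    unknown_value=-1,
)
encoder.fit(train[['education']])

# "vocational" was never seen during fit, so it becomes -1
print(encoder.transform(pd.DataFrame({'education': ["master's", "vocational"]})))
```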
Target encoding replaces each category with the mean of the target variable for that category. For a binary classification task, each category value is replaced by the proportion of positive examples observed for that category in the training data.
Target encoding is powerful because it captures the relationship between the category and the target directly. It works well with high-cardinality features since it always produces a single numeric column. The major risk is target leakage: if the encoding is computed using the same data that the model trains on, the model can memorize the target through the encoded values. To mitigate this, practitioners use several techniques: smoothing the category mean toward the global mean, computing the encoding out-of-fold with K-fold cross-fitting, adding random noise to the encoded values, and leave-one-out encoding (described below).
Scikit-learn introduced a built-in TargetEncoder class (since version 1.3) that applies internal cross-fitting to reduce leakage.
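A minimal sketch using that class (requires scikit-learn 1.3 or later); the city feature and target values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import TargetEncoder

# Toy binary-classification data: one nominal feature and a 0/1 target
X = np.array([['London'], ['Paris'], ['London'], ['Tokyo'], ['Paris'], ['Tokyo']])
y = np.array([1, 0, 1, 0, 1, 0])

# fit_transform uses internal cross-fitting, so no row is encoded with
# its own target value; transform on new data uses the full-training encoding
encoder = TargetEncoder(smooth='auto')
X_encoded = encoder.fit_transform(X, y)
```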
Frequency encoding replaces each category with its frequency or proportion of occurrence in the dataset. If "blue" appears in 30 out of 100 rows, it is encoded as 30 (count encoding) or 0.30 (frequency encoding).
This method is simple and produces a single column per feature. It works well when the frequency of a category is genuinely correlated with the target variable. A limitation is that categories with the same frequency receive identical encoded values, causing a loss of distinguishing information.
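A possible pandas implementation of frequency encoding, using the 30-out-of-100 "blue" example from above:

```python
import pandas as pd

df = pd.DataFrame({'color': ['blue'] * 30 + ['red'] * 50 + ['green'] * 20})

# Map each category to its relative frequency in the training data
freq = df['color'].value_counts(normalize=True)   # blue -> 0.30, red -> 0.50, green -> 0.20
df['color_freq'] = df['color'].map(freq)
```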
Binary encoding is a compromise between one-hot encoding and label encoding. First, each category is assigned an integer. Then, each integer is converted to its binary (base-2) representation, and each bit becomes a separate column. For a feature with 8 categories, binary encoding produces only 3 columns (since log2(8) = 3), compared to 8 columns for one-hot encoding.
Binary encoding reduces dimensionality significantly for high-cardinality features. However, it can introduce misleading distances between categories: two categories whose binary representations differ by a single bit may appear closer than categories differing by multiple bits, even when no such proximity exists in reality.
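As a sketch, binary encoding is available in the third-party category_encoders library (discussed later in this article); the city names below are illustrative.

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'city': ['Tokyo', 'Paris', 'London', 'Berlin',
                            'Madrid', 'Rome', 'Oslo', 'Vienna']})

# Categories are first mapped to integers, then each integer's bits are
# spread across a small number of binary columns
encoder = ce.BinaryEncoder(cols=['city'])
encoded = encoder.fit_transform(df)
```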
Feature hashing applies a hash function (such as MurmurHash3) to map each category to one of a fixed number of output columns. This approach is useful for extremely high-cardinality features or situations where the full set of categories is not known in advance (for example, streaming data with new categories appearing over time).
The number of output columns is a parameter chosen by the practitioner, providing direct control over dimensionality. The main drawback is hash collisions: different categories may map to the same column, mixing unrelated information. Despite this, feature hashing is widely used in large-scale production systems where memory efficiency and speed are needed. The scikit-learn library provides FeatureHasher for this purpose.
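A minimal sketch with FeatureHasher, assuming string-valued categories and an illustrative output width of 8 columns:

```python
from sklearn.feature_extraction import FeatureHasher

# Each sample is a list of strings; the hasher maps every string to one of
# n_features columns, so the output width is fixed regardless of cardinality
hasher = FeatureHasher(n_features=8, input_type='string')
X = hasher.transform([['user_12345'], ['user_67890'], ['user_12345']])
print(X.toarray())
```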
Weinberger et al. (2009) introduced the hashing trick for large-scale multitask learning and proved exponential tail bounds showing that hash collisions have negligible impact on learning performance with high probability.
Weight of evidence encoding originated in the credit scoring industry and is designed specifically for binary classification problems. For each category, WoE is calculated as the natural logarithm of the ratio of the proportion of positive cases to the proportion of negative cases:
WoE = ln(Distribution of Positives / Distribution of Negatives)
Positive WoE values indicate a category with more positive cases than expected; negative values indicate more negative cases. A WoE of zero means the category has an equal proportion of both classes. WoE encoding is popular in financial risk modeling, fraud detection, and credit scoring because it produces a monotonic relationship between the encoded feature and the log-odds of the target.
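A rough pandas sketch of the WoE calculation above, using a made-up binary "default" target; real implementations add smoothing to avoid the zero-frequency problem.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'segment': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
    'default': [1, 1, 0, 0, 0, 0, 1, 1, 0],
})

counts = pd.crosstab(df['segment'], df['default'])
dist_pos = counts[1] / counts[1].sum()   # share of all positives falling in each category
dist_neg = counts[0] / counts[0].sum()   # share of all negatives falling in each category

# WoE = ln(Distribution of Positives / Distribution of Negatives)
woe = np.log(dist_pos / dist_neg)
df['segment_woe'] = df['segment'].map(woe)
```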
James-Stein encoding is a Bayesian shrinkage technique that computes a weighted average between the category-specific target mean and the overall (global) target mean. The weight depends on the variance and sample size within each category: categories with fewer observations are shrunk more heavily toward the global mean, while categories with many observations retain values closer to their observed mean.
This approach is based on the James-Stein estimator, which was originally defined for normally distributed data. It naturally regularizes against overfitting on rare categories and is available in the category_encoders library for Python.
Leave-one-out (LOO) encoding is closely related to target encoding. For each observation, the encoded value is the mean of the target variable for all other observations sharing the same category, excluding the current observation. By leaving out each row's own target value, LOO encoding reduces the direct target leakage present in naive target encoding. However, it can still overfit on small categories where excluding a single observation produces large fluctuations in the mean.
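A compact pandas sketch of the leave-one-out calculation; note that a category with a single row divides by zero, which reflects the small-category fragility mentioned above.

```python
import pandas as pd

df = pd.DataFrame({
    'city':   ['Paris', 'Paris', 'Paris', 'Tokyo', 'Tokyo'],
    'target': [1, 0, 1, 1, 0],
})

# For each row: (category target sum - own target) / (category count - 1)
grp = df.groupby('city')['target']
df['city_loo'] = (grp.transform('sum') - df['target']) / (grp.transform('count') - 1)
```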
Entity embeddings, introduced by Guo and Berkhahn in 2016, use a neural network to learn a dense, low-dimensional vector representation for each category. During training, each category is mapped to an embedding vector (similar to word embeddings in natural language processing), and the embedding weights are updated through backpropagation.
Entity embeddings capture semantic relationships between categories. For example, an embedding for geographic regions might learn that "France" and "Germany" are closer together than "France" and "Japan." This technique excels with high-cardinality features and large datasets, and the learned embeddings can be reused as input features for other models, including tree-based methods. The downside is that it requires a neural network training step and sufficient data to learn meaningful representations. Entity embeddings were popularized by their success in the Kaggle "Rossmann Store Sales" competition and have since become standard practice in deep learning for tabular data.
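As an illustration, a minimal PyTorch sketch of an embedding layer for a single categorical feature; the cardinality and embedding dimension are arbitrary.

```python
import torch
import torch.nn as nn

# 1,000 distinct categories mapped into 16-dimensional dense vectors;
# the embedding weights are learned by backpropagation with the rest of the network
num_categories, embedding_dim = 1000, 16
embedding = nn.Embedding(num_categories, embedding_dim)

# Categories must first be label-encoded as integer indices
category_ids = torch.tensor([3, 17, 3, 999])
vectors = embedding(category_ids)   # shape: (4, 16)
```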
In the statistics and social sciences community, several contrast coding systems are used to represent categorical variables in regression models. These methods encode a K-level categorical variable into K-1 contrast vectors with specific mathematical properties.
| Scheme | Reference point | Interpretation of coefficients |
|---|---|---|
| Treatment (dummy) coding | One reference category | Difference between each category and the reference category |
| Sum (deviation) coding | Grand mean | Deviation of each category from the overall mean |
| Helmert coding | Mean of subsequent levels | Difference between each level and the average of all subsequent levels |
| Backward difference coding | Previous level | Difference between each level and the preceding level |
All contrast coding schemes yield the same model predictions; they differ only in how the regression coefficients are interpreted. Treatment coding is the default in most software and is equivalent to dummy encoding. Sum coding is preferred when the goal is to compare each category against the overall average rather than against a single reference category.
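For illustration, the patsy library (which statsmodels uses for formula handling) can generate treatment and sum contrasts directly; the column name below is hypothetical.

```python
import pandas as pd
from patsy import dmatrix

df = pd.DataFrame({'group': ['a', 'b', 'c', 'a', 'c']})

# Treatment (dummy) coding: coefficients compare each level to the reference level
treatment = dmatrix('C(group, Treatment)', df, return_type='dataframe')

# Sum (deviation) coding: coefficients compare each level to the grand mean
deviation = dmatrix('C(group, Sum)', df, return_type='dataframe')
```

The table below compares the encoding methods discussed in this article.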
| Method | Output columns | Handles high cardinality | Preserves order | Risks / drawbacks | Best for |
|---|---|---|---|---|---|
| One-hot encoding | K (one per category) | No (dimensionality explosion) | No | Sparse, high memory for many categories | Low-cardinality nominal features |
| Label encoding | 1 | Yes | Yes (imposed) | False ordinal relationship for nominal data | Ordinal features; tree-based models |
| Ordinal encoding | 1 | Yes | Yes (user-defined) | Requires domain knowledge of ordering | Ordinal features with known ranking |
| Target encoding | 1 | Yes | N/A | Target leakage without regularization | High-cardinality features with supervised tasks |
| Frequency encoding | 1 | Yes | N/A | Categories with same frequency become identical | Frequency-correlated features |
| Binary encoding | log2(K) | Partially | No | Misleading distances between categories | Medium to high-cardinality features |
| Feature hashing | Fixed (user-chosen) | Yes | No | Hash collisions; irreversible | Very high cardinality; streaming data |
| WoE encoding | 1 | Yes | N/A | Only for binary targets; zero-frequency issues | Credit scoring; risk modeling |
| James-Stein encoding | 1 | Yes | N/A | Assumes normality of target | Target-based tasks with rare categories |
| Leave-one-out encoding | 1 | Yes | N/A | Overfitting on small categories | Moderate-cardinality supervised tasks |
| Entity embeddings | D (embedding dimension) | Yes | Learned | Requires neural net training; needs large data | High-cardinality features with deep learning |
Different families of machine learning models interact with categorical features in fundamentally different ways. Choosing the right encoding depends heavily on the model.
Decision trees, random forests, and gradient boosting models split features at specific threshold values. Because they do not assume any linear or continuous relationship between feature values, they are more tolerant of label encoding even for nominal data. A tree can isolate the category coded as 2 with splits such as "Is feature X ≤ 1.5?" and "Is feature X ≤ 2.5?", rather than treating the value 2 as numerically meaningful.
Some implementations can handle categorical features natively without any preprocessing:
| Library | Native categorical support | Method used |
|---|---|---|
| CatBoost | Yes (built-in) | Ordered target statistics |
| LightGBM | Yes (built-in) | Optimal split finding on categories |
| XGBoost | Experimental (since v1.6) | Optimal partitioning of categories |
| Scikit-learn trees | No | Requires manual encoding |
For libraries without native support, label encoding or target encoding typically works well.
Linear regression, logistic regression, and support vector machines learn by assigning weights to each feature and computing predictions through weighted sums. These models assume a linear relationship between feature values and the target, so label encoding nominal data is problematic: the model would treat the numeric gaps between category codes as meaningful. One-hot encoding is the standard approach for linear models with low-cardinality features. For high-cardinality features, target encoding or feature hashing are better alternatives.
Neural networks are flexible enough to learn complex nonlinear relationships, but they still require numerical input. One-hot encoding works for low-cardinality features, but entity embeddings are the preferred approach for medium to high-cardinality features. Embedding layers (available in frameworks like PyTorch and TensorFlow) learn dense representations during training, capturing relationships between categories that one-hot encoding cannot express.
Naive Bayes classifiers can work directly with categorical features without any encoding. The algorithm estimates the conditional probability of each category given each class label from the training data. For this reason, Naive Bayes is sometimes used as a baseline for classification tasks with many categorical features.
CatBoost (short for "Categorical Boosting") is a gradient boosting library developed by Yandex that was designed specifically to handle categorical features without manual preprocessing. It uses a technique called ordered target statistics to encode categories during training.
The core idea is to compute a target-based encoding for each category, but in an order-dependent way that prevents target leakage. For each training example, CatBoost calculates the average target value only from examples that appeared before it in a random permutation of the data. This "look only at the past" approach ensures that no example's target value is used to compute its own encoding.
The formula for the ordered target statistic for observation i with category value k is:
TS(i, k) = (sum of target values for category k before position i + prior) / (count of category k before position i + 1)
The prior term is typically the global target mean multiplied by a smoothing parameter a (default 1), which regularizes the estimate for rare categories.
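A simplified Python sketch of this calculation, using a single random permutation and the global target mean as the prior; the real CatBoost implementation uses several permutations and additional machinery.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'cat': ['a', 'b', 'a', 'a', 'b', 'a'],
                   'y':   [1,   0,   1,   0,   1,   1]})

prior = df['y'].mean()               # global target mean used as the prior
order = rng.permutation(len(df))     # random permutation of the rows

sums, counts, ts = {}, {}, np.empty(len(df))
for pos in order:
    c = df['cat'].iloc[pos]
    # Encode using only the "past": rows seen earlier in the permutation
    ts[pos] = (sums.get(c, 0.0) + prior) / (counts.get(c, 0) + 1)
    sums[c] = sums.get(c, 0.0) + df['y'].iloc[pos]
    counts[c] = counts.get(c, 0) + 1

df['cat_ordered_ts'] = ts
```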
CatBoost also has a parameter called one_hot_max_size (default is typically 2) that determines a threshold: features with fewer unique values than this threshold are one-hot encoded, while features with more unique values use ordered target statistics. Additionally, CatBoost can automatically generate and encode feature combinations (interactions between pairs of categorical features), expanding the effective feature space without manual feature engineering.
Prokhorenkova et al. (2018) showed that standard target statistics (used in other gradient boosting implementations) introduce a prediction shift, and CatBoost's ordered approach eliminates this bias.
Beyond encoding for machine learning, categorical data has a rich set of statistical tools for analysis and hypothesis testing.
The Pearson chi-squared test is the most widely used test for determining whether there is a statistically significant association between two categorical variables. It compares observed frequencies in a contingency table against the frequencies expected under the null hypothesis of independence.
The test statistic is:
chi-squared = sum of ((O - E)^2 / E)
where O is the observed frequency and E is the expected frequency for each cell. If the resulting p-value is below a chosen significance level (commonly 0.05), the null hypothesis of independence is rejected.
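For example, SciPy's chi2_contingency performs this test on a contingency table of observed counts (the counts below are made up):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table: rows = product category, columns = churned / retained
observed = np.array([[30, 70],
                     [45, 55],
                     [25, 75]])

chi2, p_value, dof, expected = chi2_contingency(observed)
if p_value < 0.05:
    print('Reject independence: the variables appear to be associated')
```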
While the chi-squared test determines whether an association exists, it does not measure the strength of that association. Cramer's V fills this gap. It is derived from the chi-squared statistic and ranges from 0 (no association) to 1 (perfect association).
| Cramer's V | Interpretation |
|---|---|
| 0.00 to 0.10 | Negligible association |
| 0.10 to 0.30 | Weak association |
| 0.30 to 0.50 | Moderate association |
| 0.50 to 1.00 | Strong association |
These thresholds should be adjusted based on the degrees of freedom of the contingency table.
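A small helper that derives Cramer's V from the chi-squared statistic, following the standard formula V = sqrt(chi2 / (n * (min(r, c) - 1))):

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(observed):
    """Cramer's V from a contingency table (rows x columns of counts)."""
    chi2, _, _, _ = chi2_contingency(observed)
    n = observed.sum()
    r, c = observed.shape
    return np.sqrt(chi2 / (n * (min(r, c) - 1)))

table = np.array([[30, 70], [45, 55], [25, 75]])
print(cramers_v(table))
```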
For small sample sizes where expected cell counts fall below 5, the chi-squared approximation becomes unreliable. Fisher's exact test computes the exact probability of obtaining the observed (or a more extreme) distribution under the null hypothesis, making it suitable for 2x2 contingency tables with small samples.
Mutual information (MI) measures the amount of information that one variable provides about another. Unlike the chi-squared test, MI can capture nonlinear dependencies between variables. MI equals zero when two variables are independent and increases with stronger dependence. In scikit-learn, mutual_info_classif and mutual_info_regression compute MI scores for feature selection with categorical data.
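A brief sketch of computing MI scores for a categorical feature with scikit-learn; the feature must be integer-encoded first, and discrete_features=True marks it as discrete rather than continuous.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import OrdinalEncoder

X_cat = np.array([['red'], ['blue'], ['red'], ['green'], ['blue'], ['red']])
y = np.array([1, 0, 1, 0, 0, 1])

# Integer-encode the categories, then score them against the target
X_enc = OrdinalEncoder().fit_transform(X_cat)
mi = mutual_info_classif(X_enc, y, discrete_features=True, random_state=0)
```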
Selecting the most informative categorical features before training a model can improve performance and reduce computation. Common approaches include:
| Method | Type | Handles mixed types | Notes |
|---|---|---|---|
| Chi-squared test | Filter | No (categorical only) | Tests association between each feature and the target; available via SelectKBest in scikit-learn |
| Mutual information | Filter | Yes (categorical and numerical) | Captures nonlinear relationships; more general than chi-squared |
| Cramer's V | Filter | No (categorical only) | Measures association strength; useful for feature-feature correlation analysis |
| Permutation importance | Model-based | Yes | Measures the drop in model performance when a feature's values are shuffled |
| Tree-based importance | Model-based | Yes | Uses split-based or gain-based importance from tree models |
Missing values are common in real-world categorical data and require careful handling. Several strategies are used in practice:
| Strategy | Description | When to use |
|---|---|---|
| Mode imputation | Replace missing values with the most frequent category | Small percentage of missing values (under 5 to 10%); values are missing at random |
| Dedicated "Missing" category | Treat missing as its own category | Missingness itself carries information (for example, a customer who did not provide their occupation) |
| Model-based imputation (KNN, MICE) | Predict the missing category using other features | Moderate missingness; when relationships between features can inform imputation |
| Drop rows or columns | Remove observations or features with missing values | Very small number of affected rows, or a feature with an extremely high missing rate |
Mode imputation is the simplest approach and works when the proportion of missing values is small. However, it can distort the distribution of the feature by inflating the count of the most frequent category. Creating a dedicated "Missing" category is often the most practical approach because it preserves all data and allows the model to learn whether missingness is predictive. For more complex scenarios, K-nearest neighbors (KNN) imputation and Multiple Imputation by Chained Equations (MICE) use relationships among features to predict missing values, though these methods are computationally more expensive.
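A short pandas sketch of the two simplest strategies from the table above, mode imputation and a dedicated "Missing" category; the occupation column is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'occupation': ['engineer', np.nan, 'teacher', 'engineer', np.nan]})

# Mode imputation: fill with the most frequent category
df['occupation_mode'] = df['occupation'].fillna(df['occupation'].mode()[0])

# Dedicated category: preserve the fact that the value was missing
df['occupation_flagged'] = df['occupation'].fillna('Missing')
```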
Data leakage occurs when information from the test set or target variable improperly influences the training process, leading to artificially inflated performance metrics that do not generalize to new data. Categorical encoding is one of the most common sources of data leakage in machine learning pipelines.
The fundamental rule is to fit all encoding transformations on the training data only, then apply (transform) to the validation and test sets. In scikit-learn, wrapping encoders inside a Pipeline or ColumnTransformer and using cross-validation ensures correct fit/transform separation. For target-based encoders specifically, internal cross-fitting (as implemented in scikit-learn's TargetEncoder) provides an additional layer of protection.
The Python library pandas provides a dedicated Categorical dtype for representing categorical data efficiently. Converting a string column to the Categorical dtype can yield substantial benefits.
Memory savings: Internally, pandas stores categories as integer codes rather than repeating the full string for each row. For a column with 1 million rows but only 50 unique values, converting from object dtype to Categorical can reduce memory usage by over 95% (for example, from 64 MB to under 1 MB).
Performance improvements: Operations such as groupby, value_counts, and comparisons run faster on Categorical columns because they operate on small integer codes rather than variable-length strings. Speedups of 5x to 10x are common for groupby operations on large datasets.
Usage example in pandas:
```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'red', 'blue'] * 200000})
df['color'] = df['color'].astype('category')  # Convert to Categorical dtype

# Optional: specify a custom order for ordinal data
df['size'] = pd.Categorical(
    ['S', 'M', 'L', 'XL', 'S'] * 200000,
    categories=['S', 'M', 'L', 'XL'],
    ordered=True
)
```
The Categorical dtype also enforces a fixed set of allowed values, catching data entry errors such as misspellings. Scikit-learn and other libraries increasingly support Categorical columns directly, reducing the need for manual encoding.
Scikit-learn provides a standardized workflow for encoding categorical features within a machine learning pipeline. The ColumnTransformer class allows different encoding strategies to be applied to different columns in a single step.
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# Define which columns get which encoding
preprocessor = ColumnTransformer(
    transformers=[
        ('nominal', OneHotEncoder(handle_unknown='ignore'), ['color', 'country']),
        ('ordinal', OrdinalEncoder(categories=[['S', 'M', 'L', 'XL']]), ['size']),
    ],
    remainder='passthrough'  # keep numeric columns as-is
)

# Combine preprocessing and model in a single pipeline
pipe = Pipeline([
    ('preprocess', preprocessor),
    ('model', RandomForestClassifier())
])

# X_train, y_train, X_test, y_test are assumed to be defined elsewhere
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
```
Using a pipeline ensures that encoding is fit only on training data and applied consistently to test data, preventing data leakage.
Beyond pandas and scikit-learn, several tools provide specialized support for categorical data.
| Tool / library | Categorical support |
|---|---|
| PyTorch | nn.Embedding layer for entity embeddings; manual encoding for other methods |
| TensorFlow / Keras | tf.keras.layers.Embedding, tf.feature_column.categorical_column_* family of functions |
| category_encoders | Python library with 15+ encoding methods including WoE, James-Stein, CatBoost, and leave-one-out encoders |
| Feature-engine | Scikit-learn-compatible library with ordinal, target, count, decision tree, and mean encoding transformers |
| R (base) | Native factor type with ordered and unordered variants; contrast coding built into lm() and glm() |
| Apache Spark MLlib | StringIndexer (label encoding) and OneHotEncoder for distributed pipelines |
Imagine you have a box of crayons. Each crayon has a different color: red, blue, green, yellow. These colors are categorical data because they are just names for different groups. You cannot say red is "bigger" than blue or add green plus yellow together.
Now, there are two kinds of groups: groups with no order, like crayon colors, where no color comes "first"; and groups with an order, like T-shirt sizes (small, medium, large), where one really is bigger than another.
Computers only understand numbers, not color names. So before a computer can learn from this data, we need to turn the labels into numbers. There are different ways to do this. One way is to give each color its own column and mark it with a 1 or 0. Another way is to replace the labels with their average score. Picking the right way to turn labels into numbers helps the computer learn better.