# Categorical Data

> Source: https://aiwiki.ai/wiki/categorical_data
> Updated: 2026-06-22
> Categories: Data & Datasets, Machine Learning, Statistics
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

Categorical data, also called qualitative data, is data whose values are discrete labels or groups (such as colors, country names, or blood types) rather than measurable quantities, so they cannot be meaningfully added, subtracted, or averaged. In [machine learning](/wiki/machine_learning) and statistics it is split into two families, nominal (unordered) and ordinal (ordered), and because most algorithms accept only numeric input, categorical features must be converted into numbers through a step called encoding before a model can use them.[10] Categorical data appears as input features, target labels, or both across [classification](/wiki/classification_model), [clustering](/wiki/clustering), and [regression](/wiki/regression_model) tasks.

Examples of categorical data include colors (red, blue, green), country names (USA, France, Japan), blood types (A, B, AB, O), and product ratings (1 star through 5 stars). Converting such values into a suitable numeric representation is a fundamental step in [data preprocessing](/wiki/preprocessing).[10] The choice of encoding interacts strongly with the model: a 2022 benchmark across 24 datasets and five learning algorithms found that regularized versions of target encoding "consistently provided the best results" for high-cardinality features, outperforming traditional integer and one-hot schemes.[5]

## What are the levels of measurement?

The psychologist Stanley Smith Stevens introduced the classic typology of measurement scales in his 1946 paper "On the Theory of Scales of Measurement," published in *Science*.[1] Stevens identified four levels: nominal, ordinal, interval, and ratio.[1] Categorical data falls under the first two levels.

| Level | Ordered | Equal spacing | True zero | Examples |
|---|---|---|---|---|
| Nominal | No | No | No | Colors, blood types, country codes |
| Ordinal | Yes | No | No | Education level, satisfaction ratings, T-shirt sizes |
| Interval | Yes | Yes | No | Temperature in Celsius, calendar years |
| Ratio | Yes | Yes | Yes | Height, weight, income |

The interval and ratio levels describe quantitative (numerical) data. The distinction between nominal and ordinal categorical data has direct consequences for which encoding methods are appropriate and which statistical tests can be applied.

## What are the types of categorical data?

Categorical data falls into two main subtypes: **nominal** and **ordinal**.

### Nominal data

Nominal data consists of categories with no natural order or ranking. The categories are simply different labels, and no category is "greater than" or "less than" another. Examples include:

- Car colors: red, blue, green, white
- Country names: USA, France, China, Brazil
- Blood types: A, B, AB, O
- Programming languages: Python, Java, C++, Rust

Because there is no ordering relationship among nominal categories, encoding methods that impose an artificial numeric order (such as label encoding) can mislead certain models into assuming a ranking that does not exist.

### Ordinal data

Ordinal data consists of categories with a meaningful order or ranking, but the distances between categories are not necessarily equal or known. Examples include:

- Education level: high school, bachelor's degree, master's degree, doctorate
- Product ratings: 1 star, 2 stars, 3 stars, 4 stars, 5 stars
- Customer satisfaction: very dissatisfied, dissatisfied, neutral, satisfied, very satisfied
- T-shirt sizes: S, M, L, XL

The key difference from nominal data is that ordinal categories carry information about relative position. A "5-star" rating is better than a "3-star" rating, even though the exact numerical gap between them may not be well defined.

### Binary (dichotomous) data

Binary data is a special case of nominal data with exactly two categories. Examples include yes/no, true/false, male/female, and pass/fail. Binary features are the simplest categorical type and can be represented directly as 0 or 1 without any risk of imposing a false ordering. Many encoding methods reduce multi-category features into sets of binary columns.

## What is cardinality and why does high cardinality matter?

The **cardinality** of a categorical [feature](/wiki/feature) refers to the number of unique categories it contains. This distinction has major practical implications for encoding and modeling.

| Cardinality level | Typical range | Examples | Encoding considerations |
|---|---|---|---|
| Low cardinality | 2 to ~20 unique values | Gender, color, day of week, country (small set) | [One-hot encoding](/wiki/one-hot_encoding) works well; most encoding methods are viable |
| Medium cardinality | ~20 to ~100 unique values | US state, product category, job title | One-hot encoding starts creating many columns; target or binary encoding may be preferable |
| High cardinality | 100+ unique values | ZIP code, user ID, product SKU, IP address | One-hot encoding is impractical; feature hashing, target encoding, or entity embeddings are needed |

High-cardinality features pose a particular challenge. Applying one-hot encoding to a feature with 10,000 unique values would create 10,000 new binary columns, leading to extreme dimensionality, increased memory usage, and potential [overfitting](/wiki/overfitting). Specialized encoding strategies are essential for handling these cases effectively.[9] Cerda and Varoquaux (2022) frame the problem directly: high-cardinality string features make "the one-hot encoding scheme impractical," motivating similarity- and embedding-based encoders that scale to thousands of categories.[9]

## How is categorical data encoded into numbers?

Since most machine learning algorithms require numerical inputs, categorical features must be transformed into numbers through a process called **encoding**.[10] The choice of encoding method depends on the type of categorical data, its cardinality, and the model being used.[7]

### One-hot encoding

[One-hot encoding](/wiki/one-hot_encoding) creates a new binary column for each unique category. Each observation gets a 1 in the column corresponding to its category and 0 in all other columns. For a color feature with values red, blue, and green:

| Original | is_red | is_blue | is_green |
|---|---|---|---|
| Red | 1 | 0 | 0 |
| Blue | 0 | 1 | 0 |
| Green | 0 | 0 | 1 |

One-hot encoding is the most widely used method for nominal data with low to moderate cardinality.[14] It avoids introducing any false ordinal relationship between categories. However, it suffers from the **curse of dimensionality** when applied to high-cardinality features, as it creates one column per category.[12]

A common variant is **dummy encoding**, which drops one of the K columns (producing K-1 columns) to avoid perfect multicollinearity in [linear regression](/wiki/linear_regression) and other linear models. The dropped category becomes the implicit reference level.
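The difference between one-hot and dummy encoding can be sketched in a few lines of plain Python. This is an illustrative sketch, not a library API: the helper name `one_hot`, the `is_` column prefix, and the choice to drop the first category as the reference level are all assumptions made for the example.

```python
def one_hot(values, categories, drop_first=False):
    """Return (column_names, rows) for a one-hot or dummy (K-1) encoding."""
    # With drop_first=True the first category becomes the implicit reference level
    kept = categories[1:] if drop_first else categories
    names = [f"is_{c}" for c in kept]
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return names, rows


colors = ["red", "blue", "green"]
categories = ["red", "blue", "green"]  # fixed column order, assumed known up front

names, rows = one_hot(colors, categories)
dummy_names, dummy_rows = one_hot(colors, categories, drop_first=True)

print(names, rows)              # K columns
print(dummy_names, dummy_rows)  # K-1 columns; "red" is encoded as all zeros
```

In the dummy version, a row of all zeros identifies the reference category, which is why no information is lost by dropping its column.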

### Label (integer) encoding

Label encoding assigns each category a unique integer (for example, red = 0, blue = 1, green = 2). This produces a single numeric column, making it memory-efficient. However, it introduces an implicit ordinal relationship between categories. A model might incorrectly interpret green (2) as being "twice" blue (1) or "greater than" red (0).

Label encoding is appropriate for ordinal data where the integer ordering reflects the true ranking. For nominal data, it should generally be avoided with linear models and [neural networks](/wiki/neural_network), though [decision tree](/wiki/decision_tree)-based models are less affected because they split on specific threshold values rather than assuming linear relationships.

### Ordinal encoding

Ordinal encoding is similar to label encoding but explicitly maps categories to integers according to a known, meaningful order. For education level, a mapping like high school = 1, bachelor's = 2, master's = 3, doctorate = 4 preserves the natural ranking. In [scikit-learn](/wiki/scikit-learn), the `OrdinalEncoder` class accepts a user-specified category order, ensuring the mapping aligns with domain knowledge rather than arbitrary alphabetical or appearance-based sorting.[14]

Best practices for ordinal encoding include fitting the encoder on training data only (to prevent data leakage) and handling unknown categories at inference time by assigning them a default value or raising an error.
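A minimal pure-Python sketch of these practices follows. The function names and the sentinel value `-1` for unknown categories are assumptions chosen for illustration; a library encoder would expose its own options for this.

```python
EDUCATION_ORDER = ["high school", "bachelor's", "master's", "doctorate"]


def fit_ordinal(order):
    """Build the category -> rank mapping from a known, domain-defined order."""
    return {category: rank for rank, category in enumerate(order, start=1)}


def transform_ordinal(values, mapping, unknown_value=-1):
    """Apply a fitted mapping; categories not seen at fit time get unknown_value."""
    return [mapping.get(v, unknown_value) for v in values]


# Fit once (on the training side only), then reuse the same mapping everywhere
mapping = fit_ordinal(EDUCATION_ORDER)
train_codes = transform_ordinal(["bachelor's", "high school", "doctorate"], mapping)
test_codes = transform_ordinal(["master's", "associate"], mapping)

print(train_codes)  # [2, 1, 4]
print(test_codes)   # [3, -1]  ("associate" was not part of the fitted order)
```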

### Target (mean) encoding

Target encoding replaces each category with the mean of the target variable for that category. For a binary classification task, each category value is replaced by the proportion of positive examples observed for that category in the [training](/wiki/training) data.

Target encoding is powerful because it captures the relationship between the category and the target directly. It works well with high-cardinality features since it always produces a single numeric column.[5] The major risk is **target leakage**: if the encoding is computed using the same data that the model trains on, the model can memorize the target through the encoded values. To mitigate this, practitioners use several techniques:

- **Cross-validation encoding.** Compute the encoding on out-of-fold data so that each sample's encoding never depends on its own target value.
- **Smoothing (Bayesian shrinkage).** Blend the category mean with the global mean, weighted by the number of samples in each category. Categories with few observations are pulled toward the global mean, reducing variance.[5]
- **Additive noise.** Add small random noise to the encoded values during training to prevent overfitting.

Scikit-learn introduced a built-in `TargetEncoder` class in version 1.3 (released June 2023) whose `fit_transform` applies an internal cross-fitting scheme: the data is split into k folds and each fold is encoded using statistics learned from the other k-1 folds, preventing the encoded value of a row from depending on its own target.[14]
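The combination of smoothing and out-of-fold encoding can be sketched without any library. This is a simplified illustration and not scikit-learn's implementation: the smoothing weight `m`, the `i % k` fold assignment, and the fallback to the global mean for unseen categories are assumptions made to keep the example short and deterministic.

```python
from collections import defaultdict


def smoothed_means(cats, targets, m=2.0):
    """Per-category target mean, shrunk toward the global mean with weight m."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(cats, targets):
        sums[c] += y
        counts[c] += 1
    means = {c: (sums[c] + m * global_mean) / (counts[c] + m) for c in counts}
    return means, global_mean


def cross_fit_encode(cats, targets, k=2, m=2.0):
    """Encode each row using statistics computed from the other folds only."""
    encoded = [0.0] * len(cats)
    for fold in range(k):
        other = [i for i in range(len(cats)) if i % k != fold]
        means, global_mean = smoothed_means(
            [cats[i] for i in other], [targets[i] for i in other], m
        )
        for i in range(fold, len(cats), k):
            # Categories absent from the other folds fall back to their global mean
            encoded[i] = means.get(cats[i], global_mean)
    return encoded


cats = ["a", "a", "b", "b", "a", "b"]
targets = [1, 1, 0, 0, 1, 1]
encoded = cross_fit_encode(cats, targets)
print([round(e, 3) for e in encoded])
```

Because every row is encoded from folds that exclude it, a row's own target never contributes to its encoded value, which is the property the cross-fitting scheme is meant to guarantee.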

### Frequency (count) encoding

Frequency encoding replaces each category with its frequency or proportion of occurrence in the dataset. If "blue" appears in 30 out of 100 rows, it is encoded as 30 (count encoding) or 0.30 (frequency encoding).

This method is simple and produces a single column per feature. It works well when the frequency of a category is genuinely correlated with the target variable. A limitation is that categories with the same frequency receive identical encoded values, causing a loss of distinguishing information.
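As a rough sketch, count and frequency encoding reduce to a lookup in a table of training-set counts. The helper names and the choice to encode unseen categories as 0 are assumptions for illustration.

```python
from collections import Counter


def fit_counts(train_values):
    """Count how often each category occurs in the training data."""
    return Counter(train_values)


def count_encode(values, counts):
    """Replace each category with its training count (0 if never seen)."""
    return [counts.get(v, 0) for v in values]


def frequency_encode(values, counts):
    """Replace each category with its training proportion (0.0 if never seen)."""
    total = sum(counts.values())
    return [counts.get(v, 0) / total for v in values]


train = ["blue"] * 30 + ["red"] * 50 + ["green"] * 20
counts = fit_counts(train)

print(count_encode(["blue", "green", "purple"], counts))      # [30, 20, 0]
print(frequency_encode(["blue", "green", "purple"], counts))  # [0.3, 0.2, 0.0]
```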

### Binary encoding

Binary encoding is a compromise between one-hot encoding and label encoding. First, each category is assigned an integer. Then, each integer is converted to its binary (base-2) representation, and each bit becomes a separate column. For a feature with 8 categories, binary encoding produces only 3 columns (since log2(8) = 3), compared to 8 columns for one-hot encoding.

Binary encoding reduces dimensionality significantly for high-cardinality features.[12] However, it can introduce misleading distances between categories: two categories whose binary representations differ by a single bit may appear closer than categories differing by multiple bits, even when no such proximity exists in reality.
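The two steps (integer assignment, then bit expansion) can be sketched as follows. The sketch assumes integer codes start at 0 and writes the most significant bit first; library implementations may number categories differently and therefore produce a different column count or bit pattern.

```python
def binary_encode(values, categories):
    """Map each category to the bits of its integer index, most significant first."""
    index = {c: i for i, c in enumerate(categories)}
    # Number of bits needed to represent the largest index (at least one column)
    n_bits = max(1, (len(categories) - 1).bit_length())
    return [
        [(index[v] >> bit) & 1 for bit in range(n_bits - 1, -1, -1)]
        for v in values
    ]


categories = list("ABCDEFGH")  # 8 categories -> 3 bit columns
encoded = binary_encode(["A", "D", "H"], categories)
print(encoded)  # [[0, 0, 0], [0, 1, 1], [1, 1, 1]]
```

The output also illustrates the distance caveat: "D" and "H" differ by a single bit even though nothing about the categories suggests they are related.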

### Feature hashing (the hashing trick)

Feature hashing applies a hash function (such as MurmurHash3) to map each category to one of a fixed number of output columns. This approach is useful for extremely high-cardinality features or situations where the full set of categories is not known in advance (for example, streaming data with new categories appearing over time).

The number of output columns is a parameter chosen by the practitioner, providing direct control over dimensionality. The main drawback is **hash collisions**: different categories may map to the same column, mixing unrelated information. Despite this, feature hashing is widely used in large-scale production systems where memory efficiency and speed are needed. The scikit-learn library provides `FeatureHasher` for this purpose.[14]

Weinberger et al. (2009) introduced the hashing trick for large-scale multitask learning and proved exponential tail bounds showing that hash collisions have negligible impact on learning performance with high probability.[4]
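A minimal sketch of the idea is shown below. It deliberately swaps in MD5 from Python's standard library in place of MurmurHash3, purely so the example has no dependencies, and it omits refinements such as signed hashing. The function names and the default of 8 buckets are assumptions.

```python
import hashlib


def hash_bucket(category, n_buckets):
    """Map a category string to a stable bucket index in [0, n_buckets)."""
    digest = hashlib.md5(category.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def hash_encode(values, n_buckets=8):
    """One row per value, with a single 1 in the hashed column."""
    rows = []
    for v in values:
        row = [0] * n_buckets
        row[hash_bucket(v, n_buckets)] = 1
        rows.append(row)
    return rows


rows = hash_encode(["user_1", "user_2", "user_1"], n_buckets=8)
for row in rows:
    print(row)
```

No vocabulary is stored, so a category that appears for the first time at inference is handled exactly like any other. The price is that two different categories can land in the same bucket.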

### Weight of evidence (WoE) encoding

Weight of evidence encoding originated in the credit scoring industry and is designed specifically for binary classification problems. For each category, WoE is calculated as the natural logarithm of the ratio of the proportion of positive cases to the proportion of negative cases:

```
WoE = ln(Distribution of Positives / Distribution of Negatives)
```

Positive WoE values indicate a category with more positive cases than expected; negative values indicate more negative cases. A WoE of zero means the category holds the same share of all positives as of all negatives, so its class balance matches that of the dataset as a whole. WoE encoding is popular in financial risk modeling, fraud detection, and credit scoring because it produces a monotonic relationship between the encoded feature and the log-odds of the target.
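The formula can be traced on a small example. This sketch assumes a 0/1 target and simply raises an error when a category has no positives or no negatives; real implementations usually add a small adjustment to avoid that zero-frequency case. The function name `woe_table` is hypothetical.

```python
import math
from collections import Counter


def woe_table(cats, targets):
    """Weight of evidence per category for a binary (0/1) target."""
    pos, neg = Counter(), Counter()
    for c, y in zip(cats, targets):
        (pos if y == 1 else neg)[c] += 1
    total_pos, total_neg = sum(pos.values()), sum(neg.values())
    table = {}
    for c in sorted(set(cats)):
        if pos[c] == 0 or neg[c] == 0:
            raise ValueError(f"category {c!r} has an empty class; WoE is undefined")
        # Share of all positives in this category vs. share of all negatives
        table[c] = math.log((pos[c] / total_pos) / (neg[c] / total_neg))
    return table


cats = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
targets = [1, 1, 1, 0,  1, 0, 0, 0,  1, 1, 0, 0]
table = woe_table(cats, targets)
print({c: round(w, 3) for c, w in table.items()})
```

Here "A" holds 3 of 6 positives but only 1 of 6 negatives, so its WoE is ln(3); "B" is the mirror image with ln(1/3); "C" matches the overall balance and scores approximately zero.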

### James-Stein encoding

James-Stein encoding is a Bayesian shrinkage technique that computes a weighted average between the category-specific target mean and the overall (global) target mean. The weight depends on the variance and sample size within each category: categories with fewer observations are shrunk more heavily toward the global mean, while categories with many observations retain values closer to their observed mean.

This approach is based on the James-Stein estimator, which was originally defined for normally distributed data. It naturally regularizes against overfitting on rare categories and is available in the `category_encoders` library for Python.[11]

### Leave-one-out encoding

Leave-one-out (LOO) encoding is closely related to target encoding. For each observation, the encoded value is the mean of the target variable for all other observations sharing the same category, excluding the current observation. By leaving out each row's own target value, LOO encoding reduces the direct target leakage present in naive target encoding.[11] However, it can still overfit on small categories where excluding a single observation produces large fluctuations in the mean.
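A short sketch makes the computation concrete. The fallback to the global mean for categories that occur only once is an assumption of this example, since the leave-one-out mean is undefined when no other rows share the category.

```python
from collections import defaultdict


def leave_one_out_encode(cats, targets):
    """Encode each row as the target mean of the *other* rows in its category."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, y in zip(cats, targets):
        sums[c] += y
        counts[c] += 1
    global_mean = sum(targets) / len(targets)

    encoded = []
    for c, y in zip(cats, targets):
        if counts[c] > 1:
            # Remove this row's own target before averaging
            encoded.append((sums[c] - y) / (counts[c] - 1))
        else:
            # Assumed fallback for singleton categories
            encoded.append(global_mean)
    return encoded


cats = ["a", "a", "a", "b", "b", "c"]
targets = [1, 0, 1, 1, 0, 1]
encoded = leave_one_out_encode(cats, targets)
print([round(e, 3) for e in encoded])  # [0.5, 1.0, 0.5, 0.0, 1.0, 0.667]
```

The two rows of category "b" show the small-category instability mentioned above: removing a single observation swings the encoded value between 0 and 1.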

### Entity embeddings

Entity embeddings, introduced by Cheng Guo and Felix Berkhahn in 2016, use a neural network to learn a dense, low-dimensional vector representation for each category.[2] During training, each category is mapped to an [embedding](/wiki/embeddings) vector (similar to [word embeddings](/wiki/word_embedding) in natural language processing), and the embedding weights are updated through [backpropagation](/wiki/backpropagation).

Entity embeddings capture semantic relationships between categories. As Guo and Berkhahn put it, "by mapping similar values close to each other in the embedding space it reveals the intrinsic properties of the categorical variables," while also reducing memory usage and speeding up neural networks compared with one-hot encoding.[2] For example, an embedding for geographic regions might learn that "France" and "Germany" are closer together than "France" and "Japan."[6] This technique excels with high-cardinality features and large datasets, and the learned embeddings can be reused as input features for other models, including tree-based methods.[9] The downside is that it requires a neural network training step and sufficient data to learn meaningful representations. The approach was popularized by the authors' result in the Kaggle "Rossmann Store Sales" competition, where they "were able to reach the third position with relative simple features," and it has since become standard practice in [deep learning](/wiki/deep_learning) for tabular data.[2]

### Contrast coding schemes

In the statistics and social sciences community, several contrast coding systems are used to represent categorical variables in regression models. These methods encode a K-level categorical variable into K-1 contrast vectors with specific mathematical properties.

| Scheme | Reference point | Interpretation of coefficients |
|---|---|---|
| Treatment (dummy) coding | One reference category | Difference between each category and the reference category |
| Sum (deviation) coding | Grand mean | Deviation of each category from the overall mean |
| Helmert coding | Mean of subsequent levels | Difference between each level and the average of all subsequent levels |
| Backward difference coding | Previous level | Difference between each level and the preceding level |

All contrast coding schemes yield the same model predictions; they differ only in how the regression coefficients are interpreted. Treatment coding is the default in most software and is equivalent to dummy encoding. Sum coding is preferred when the goal is to compare each category against the overall average rather than against a single reference category.
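To make the first two schemes concrete, the sketch below builds the K x (K-1) coding for a three-level variable. Which level serves as the reference (the first for treatment coding, the last for sum coding) is a convention that varies by software and is an assumption here, as are the function names.

```python
def treatment_coding(levels):
    """Treatment (dummy) coding: the first level is the all-zero reference."""
    k = len(levels)
    return {
        level: [1 if j == i - 1 else 0 for j in range(k - 1)]
        for i, level in enumerate(levels)
    }


def sum_coding(levels):
    """Sum (deviation) coding: the last level is coded -1 in every column."""
    k = len(levels)
    rows = {}
    for i, level in enumerate(levels):
        if i == k - 1:
            rows[level] = [-1] * (k - 1)
        else:
            rows[level] = [1 if j == i else 0 for j in range(k - 1)]
    return rows


levels = ["low", "mid", "high"]
treatment = treatment_coding(levels)
deviation = sum_coding(levels)
print(treatment)  # {'low': [0, 0], 'mid': [1, 0], 'high': [0, 1]}
print(deviation)  # {'low': [1, 0], 'mid': [0, 1], 'high': [-1, -1]}
```

Each column of the sum coding adds up to zero across levels, which is what makes its coefficients read as deviations from the grand mean rather than from a reference category.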

### Comparison of encoding methods

| Method | Output columns | Handles high cardinality | Preserves order | Risks / drawbacks | Best for |
|---|---|---|---|---|---|
| One-hot encoding | K (one per category) | No (dimensionality explosion) | No | Sparse, high memory for many categories | Low-cardinality nominal features |
| Label encoding | 1 | Yes | Yes (imposed) | False ordinal relationship for nominal data | Ordinal features; tree-based models |
| Ordinal encoding | 1 | Yes | Yes (user-defined) | Requires domain knowledge of ordering | Ordinal features with known ranking |
| Target encoding | 1 | Yes | N/A | Target leakage without regularization | High-cardinality features with supervised tasks |
| Frequency encoding | 1 | Yes | N/A | Categories with same frequency become identical | Frequency-correlated features |
| Binary encoding | ceil(log2(K)) | Partially | No | Misleading distances between categories | Medium to high-cardinality features |
| Feature hashing | Fixed (user-chosen) | Yes | No | Hash collisions; irreversible | Very high cardinality; streaming data |
| WoE encoding | 1 | Yes | N/A | Only for binary targets; zero-frequency issues | Credit scoring; risk modeling |
| James-Stein encoding | 1 | Yes | N/A | Assumes normality of target | Target-based tasks with rare categories |
| Leave-one-out encoding | 1 | Yes | N/A | Overfitting on small categories | Moderate-cardinality supervised tasks |
| Entity embeddings | D (embedding dimension) | Yes | Learned | Requires neural net training; needs large data | High-cardinality features with deep learning |

## How do different models handle categorical features?

Different families of machine learning models interact with categorical features in fundamentally different ways. Choosing the right encoding depends heavily on the model.

### Tree-based models

[Decision trees](/wiki/decision_tree), [random forests](/wiki/random_forest), and [gradient boosting](/wiki/gradient_boosting) models split features at specific threshold values. Because they do not assume any linear or continuous relationship between feature values, they are more tolerant of label encoding even for nominal data.[8] A decision tree asks a question such as "Is feature X at most 2?" and can isolate an individual category through successive threshold splits, so the integer codes act as arbitrary split points rather than as meaningful magnitudes.

Some implementations can handle categorical features natively without any preprocessing:

| Library | Native categorical support | Method used |
|---|---|---|
| CatBoost | Yes (built-in) | Ordered target statistics |
| LightGBM | Yes (built-in) | Optimal split finding on categories |
| XGBoost | Experimental from v1.5, optimal partitioning added in v1.6 | Optimal partitioning of categories |
| Scikit-learn trees | No | Requires manual encoding |

XGBoost enables native handling through the `enable_categorical=True` parameter when the input columns use the pandas `category` dtype. For partition-based splits, the condition is expressed as membership in a set of categories rather than as a single threshold; the `max_cat_to_onehot` parameter sets the number of categories below which XGBoost uses one-hot-style splits instead of partitioning.[15] For libraries without native support, label encoding or target encoding typically works well.

### Linear models

[Linear regression](/wiki/linear_regression), [logistic regression](/wiki/logistic_regression), and support vector machines learn by assigning weights to each feature and computing predictions through weighted sums. These models assume a linear relationship between feature values and the target, so label encoding nominal data is problematic: the model would treat the numeric gaps between category codes as meaningful. One-hot encoding is the standard approach for linear models with low-cardinality features. For high-cardinality features, target encoding or feature hashing are better alternatives.

### Neural networks

[Neural networks](/wiki/neural_network) are flexible enough to learn complex nonlinear relationships, but they still require numerical input. One-hot encoding works for low-cardinality features, but entity embeddings are the preferred approach for medium to high-cardinality features.[7] Embedding layers (available in frameworks like [PyTorch](/wiki/pytorch) and [TensorFlow](/wiki/tensorflow)) learn dense representations during training, capturing relationships between categories that one-hot encoding cannot express.[6]

### Naive Bayes

[Naive Bayes](/wiki/naive_bayes) classifiers can model categorical features directly, without one-hot or target encoding. The algorithm estimates the conditional probability of each category given each class label from the training data, although some implementations, such as scikit-learn's `CategoricalNB`, expect the categories to be supplied as integer codes. Because no feature expansion is needed, Naive Bayes is sometimes used as a baseline for classification tasks with many categorical features.

## How does CatBoost handle categorical features natively?

CatBoost (short for "Categorical Boosting") is a gradient boosting library developed by Yandex that was designed specifically to handle categorical features without manual preprocessing. It uses a technique called **ordered target statistics** to encode categories during training.

The core idea is to compute a target-based encoding for each category, but in an order-dependent way that prevents target leakage. For each training example, CatBoost calculates the average target value only from examples that appeared before it in a random permutation of the data.[3] This "look only at the past" approach ensures that no example's target value is used to compute its own encoding.

The formula for the ordered target statistic for observation i with category value k is:

```
TS(i, k) = (sum of target values for category k before position i + prior) / (count of category k before position i + 1)
```

The `prior` term is typically the global target mean multiplied by a smoothing parameter `a` (default 1), which regularizes the estimate for rare categories.
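The formula can be traced with a short pure-Python sketch. It is a simplification and not CatBoost's actual implementation: it uses a single permutation and one fixed prior, and the function name and the optional `order` argument (added so the example is reproducible) are assumptions.

```python
import random
from collections import defaultdict


def ordered_target_statistics(cats, targets, prior, order=None, seed=0):
    """Encode each row from rows of the same category that precede it in `order`."""
    if order is None:
        order = list(range(len(cats)))
        random.Random(seed).shuffle(order)  # random permutation of the rows

    sums, counts = defaultdict(float), defaultdict(int)
    encoded = [0.0] * len(cats)
    for i in order:
        c = cats[i]
        # Only "past" rows contribute; the row's own target is added afterwards
        encoded[i] = (sums[c] + prior) / (counts[c] + 1)
        sums[c] += targets[i]
        counts[c] += 1
    return encoded


cats = ["a", "a", "b", "a"]
targets = [1, 0, 1, 1]
prior = sum(targets) / len(targets)  # global target mean with a = 1

# A fixed identity order is used here only to make the trace easy to follow
encoded = ordered_target_statistics(cats, targets, prior, order=[0, 1, 2, 3])
print([round(e, 3) for e in encoded])  # [0.75, 0.875, 0.75, 0.583]
```

The first row of each category has no history and receives the prior alone, which is why the prior matters most for rare categories.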

CatBoost also has a parameter called `one_hot_max_size` (default is typically 2) that sets a threshold: features whose number of unique values is less than or equal to this value are one-hot encoded, while features with more unique values use ordered target statistics. Additionally, CatBoost can automatically generate and encode **feature combinations** (interactions between pairs of categorical features), expanding the effective feature space without manual [feature engineering](/wiki/feature_engineering).

Prokhorenkova et al. (2018) motivate the design by noting that ordered boosting and ordered target statistics "were created to fight a prediction shift caused by a special kind of target leakage present in all currently existing implementations of gradient boosting algorithms," a bias that CatBoost's ordered approach eliminates.[3]

## What statistical tests apply to categorical data?

Beyond encoding for machine learning, categorical data has a rich set of statistical tools for analysis and hypothesis testing.

### Chi-squared test of independence

The Pearson chi-squared test is the most widely used test for determining whether there is a statistically significant association between two categorical variables. It compares observed frequencies in a contingency table against the frequencies expected under the null hypothesis of independence.

The test statistic is:

```
chi-squared = sum of ((O - E)^2 / E)
```

where O is the observed frequency and E is the expected frequency for each cell. If the resulting p-value is below a chosen significance level (commonly 0.05), the null hypothesis of independence is rejected.
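As a worked example, the statistic can be computed directly from a contingency table of observed counts. The sketch below covers only the statistic and the degrees of freedom, not the p-value, and applies no continuity correction; the function name is hypothetical.

```python
def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


observed = [[20, 30],
            [30, 20]]
stat, dof = chi_squared(observed)
print(stat, dof)  # 4.0 1
```

Every expected count in this table is 25, so each of the four cells contributes 1 to the statistic. With one degree of freedom, 4.0 exceeds the 0.05 critical value of about 3.84, so independence would be rejected at that level.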

### Cramer's V

While the chi-squared test determines whether an association exists, it does not measure the strength of that association. Cramer's V fills this gap. It is derived from the chi-squared statistic and ranges from 0 (no association) to 1 (perfect association).

| Cramer's V | Interpretation |
|---|---|
| 0.00 to 0.10 | Negligible association |
| 0.10 to 0.30 | Weak association |
| 0.30 to 0.50 | Moderate association |
| 0.50 to 1.00 | Strong association |

These thresholds are rules of thumb and should be adjusted for the degrees of freedom of the contingency table: for larger tables, smaller values of V correspond to the same effect size.
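The relationship to the chi-squared statistic can be sketched as a one-line function, using the common definition V = sqrt(chi-squared / (n * min(r - 1, c - 1))), where n is the total count and r and c are the numbers of rows and columns. The function name and argument layout are assumptions, and no bias correction is applied.

```python
import math


def cramers_v(chi2_stat, n, n_rows, n_cols):
    """Cramer's V from a chi-squared statistic, sample size, and table shape."""
    return math.sqrt(chi2_stat / (n * min(n_rows - 1, n_cols - 1)))


# Example: a 2x2 table with 100 observations and a chi-squared statistic of 4.0
v = cramers_v(4.0, n=100, n_rows=2, n_cols=2)
print(round(v, 3))  # 0.2
```

A statistically significant result can therefore still correspond to a weak association: the value 0.2 falls in the "weak" band of the table above.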

### Fisher's exact test

For small sample sizes where expected cell counts fall below 5, the chi-squared approximation becomes unreliable. Fisher's exact test computes the exact probability of obtaining the observed (or a more extreme) distribution under the null hypothesis, making it suitable for 2x2 contingency tables with small samples.
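For a 2x2 table the exact computation is small enough to write out. The sketch below uses one common two-sided convention (summing the probabilities of all tables that are no more likely than the observed one, with the margins held fixed); the function name and the small tolerance used for floating-point ties are assumptions.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)  # tolerance for float ties
    )


p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(round(p_value, 4))  # 0.4857
```

With only eight observations, even a table that looks lopsided (3 versus 1 in each row) is entirely compatible with independence.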

### Mutual information

Mutual information (MI) measures the amount of information that one variable provides about another. It captures any kind of statistical dependence, including nonlinear relationships that linear correlation measures miss. MI equals zero when two variables are independent and increases with stronger dependence. In scikit-learn, `mutual_info_classif` and `mutual_info_regression` compute MI scores for [feature selection](/wiki/feature_importances); integer-coded categorical columns should be flagged through the `discrete_features` argument so that they are not treated as continuous.
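For two discrete variables, MI can be estimated directly from observed counts. The sketch below is a plug-in estimate in nats (natural logarithm) and is meant only to illustrate the definition; it is not how scikit-learn's estimators work internally, and the function name is hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of MI (in nats) between two discrete sequences."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))

    mi = 0.0
    for (x, y), joint in count_xy.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written in terms of counts
        mi += (joint / n) * math.log(joint * n / (count_x[x] * count_y[y]))
    return mi


independent = mutual_information(["a", "a", "b", "b"], ["u", "v", "u", "v"])
dependent = mutual_information(["a", "a", "b", "b"], ["u", "u", "v", "v"])
print(independent)          # 0.0
print(round(dependent, 4))  # 0.6931, i.e. ln(2)
```

In the second case each variable fully determines the other, so the MI equals the entropy of a fair two-way split, ln(2).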

## How do you select the most useful categorical features?

Selecting the most informative categorical features before training a model can improve performance and reduce computation. Common approaches include:

| Method | Type | Handles mixed types | Notes |
|---|---|---|---|
| Chi-squared test | Filter | No (categorical only) | Tests association between each feature and the target; available via `SelectKBest` with the `chi2` score function in scikit-learn |
| Mutual information | Filter | Yes (categorical and numerical) | Captures nonlinear relationships; more general than chi-squared |
| Cramer's V | Filter | No (categorical only) | Measures association strength; useful for feature-feature correlation analysis |
| Permutation importance | Model-based | Yes | Measures the drop in model performance when a feature's values are shuffled |
| Tree-based importance | Model-based | Yes | Uses split-based or gain-based importance from tree models |

## How do you handle missing values in categorical data?

Missing values are common in real-world categorical data and require careful handling. Several strategies are used in practice:

| Strategy | Description | When to use |
|---|---|---|
| Mode imputation | Replace missing values with the most frequent category | Small percentage of missing values (under 5 to 10%); values are missing at random |
| Dedicated "Missing" category | Treat missing as its own category | Missingness itself carries information (for example, a customer who did not provide their occupation) |
| Model-based imputation (KNN, MICE) | Predict the missing category using other features | Moderate missingness; when relationships between features can inform imputation |
| Drop rows or columns | Remove observations or features with missing values | Very small number of affected rows, or a feature with an extremely high missing rate |

Mode imputation is the simplest approach and works when the proportion of missing values is small. However, it can distort the distribution of the feature by inflating the count of the most frequent category. Creating a dedicated "Missing" category is often the most practical approach because it preserves all data and allows the model to learn whether missingness is predictive. For more complex scenarios, K-nearest neighbors (KNN) imputation and Multiple Imputation by Chained Equations (MICE) use relationships among features to predict missing values, though these methods are computationally more expensive.
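The two simplest strategies can be contrasted in a few lines. In this sketch missing values are represented as `None`, and the function names and the "Missing" label are illustrative assumptions; in a real pipeline the mode would be computed on the training data only.

```python
from collections import Counter


def impute_mode(values):
    """Replace missing entries with the most frequent observed category."""
    mode = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def flag_missing(values, label="Missing"):
    """Treat missingness as its own category instead of guessing a value."""
    return [label if v is None else v for v in values]


occupation = ["nurse", None, "teacher", "nurse", None]

print(impute_mode(occupation))   # ['nurse', 'nurse', 'teacher', 'nurse', 'nurse']
print(flag_missing(occupation))  # ['nurse', 'Missing', 'teacher', 'nurse', 'Missing']
```

The first output shows the distortion described above: "nurse" grows from two of three observed values to four of five, while the second output keeps the original distribution and exposes missingness to the model.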

## How do you avoid data leakage when encoding categorical data?

Data leakage occurs when information from the test set or target variable improperly influences the training process, leading to artificially inflated performance metrics that do not generalize to new data. Categorical encoding is one of the most common sources of data leakage in machine learning pipelines.

### Common leakage scenarios

- **One-hot encoding leakage.** Fitting the encoder on the entire dataset (including the test set) means the model knows which categories exist in the test data. If a category appears only in the test set, the encoder built on all data will create a column for it, while an encoder fit only on training data would not.
- **Target encoding leakage.** Computing category means using the full dataset (including samples the model will be evaluated on) gives the model direct access to target information it should not have. This is the most severe form of leakage for supervised encoders.
- **Frequency encoding leakage.** Counting category frequencies across the entire dataset inflates or deflates counts compared to what the model would see in production.

### Prevention strategies

The fundamental rule is to fit all encoding transformations on the training data only, then apply (transform) to the validation and test sets. In scikit-learn, wrapping encoders inside a `Pipeline` or `ColumnTransformer` and using [cross-validation](/wiki/cross-validation) ensures correct fit/transform separation. For target-based encoders specifically, internal cross-fitting (as implemented in scikit-learn's `TargetEncoder`) provides an additional layer of protection.

## How is categorical data represented in pandas?

The Python library [pandas](/wiki/pandas) provides a dedicated `Categorical` dtype for representing categorical data efficiently.[13] Converting a string column to the Categorical dtype can yield substantial benefits.

**Memory savings:** Internally, pandas stores categories as integer codes rather than repeating the full string for each row. For a column with 1 million rows but only 50 unique values, converting from `object` dtype to `Categorical` can reduce memory usage by over 95% (for example, from 64 MB to under 1 MB).[13]

**Performance improvements:** Operations such as `groupby`, `value_counts`, and comparisons run faster on Categorical columns because they operate on small integer codes rather than variable-length strings. Speedups of 5x to 10x are common for groupby operations on large datasets, though the actual gain depends on the number of categories and the length of the strings.

**Usage example in pandas:**

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'red', 'blue'] * 200000})
df['color'] = df['color'].astype('category')  # Convert to Categorical dtype

# Optional: specify a custom order for ordinal data
df['size'] = pd.Categorical(
    ['S', 'M', 'L', 'XL', 'S'] * 200000,
    categories=['S', 'M', 'L', 'XL'],
    ordered=True
)
```

The `Categorical` dtype also enforces a fixed set of allowed values, catching data entry errors such as misspellings. Scikit-learn and other libraries increasingly support Categorical columns directly, reducing the need for manual encoding.

## Scikit-learn encoding workflow

Scikit-learn provides a standardized workflow for encoding categorical features within a machine learning pipeline.[14] The `ColumnTransformer` class allows different encoding strategies to be applied to different columns in a single step.

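```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Small made-up dataset so the example runs end to end
X = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue', 'green', 'red', 'blue'],
    'country': ['USA', 'France', 'Japan', 'Japan', 'USA', 'France', 'USA', 'Japan'],
    'size': ['S', 'M', 'L', 'XL', 'S', 'M', 'L', 'XL'],
    'price': [10.0, 12.5, 15.0, 20.0, 9.5, 13.0, 16.0, 21.0],
})
y = [0, 0, 1, 1, 0, 0, 1, 1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Define which columns get which encoding
preprocessor = ColumnTransformer(
    transformers=[
        ('nominal', OneHotEncoder(handle_unknown='ignore'), ['color', 'country']),
        ('ordinal', OrdinalEncoder(categories=[['S', 'M', 'L', 'XL']]), ['size']),
    ],
    remainder='passthrough'  # keep numeric columns as-is
)

# Combine preprocessing and model in a single pipeline
pipe = Pipeline([
    ('preprocess', preprocessor),
    ('model', RandomForestClassifier(random_state=0))
])

pipe.fit(X_train, y_train)          # encoders are fit on the training rows only
score = pipe.score(X_test, y_test)  # test rows are transformed, then scored
```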

Using a pipeline ensures that encoding is fit only on training data and applied consistently to test data, preventing data leakage.
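The same property carries over to cross-validation. In the sketch below, which uses made-up data and an arbitrary choice of `LogisticRegression` as the model, `cross_val_score` clones the pipeline for each fold, so the encoder is refit on that fold's training rows and never sees the held-out rows.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up data: one nominal feature and a balanced binary target
X = pd.DataFrame({"color": ["red", "blue", "green", "red", "blue", "green"] * 5})
y = [1, 0, 0, 1, 0, 1] * 5

pipe = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), ["color"])]
    )),
    ("model", LogisticRegression()),
])

# 5-fold cross-validation: encoder and model are refit inside every fold
scores = cross_val_score(pipe, X, y, cv=5)
print(scores)
```

Encoding the full dataset first and cross-validating only the model would let information from the held-out rows reach the encoder, which matters most for target-based encoders.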

## Categorical data in other tools and frameworks

Beyond pandas and scikit-learn, several tools provide specialized support for categorical data.

| Tool / library | Categorical support |
|---|---|
| [PyTorch](/wiki/pytorch) | `nn.Embedding` layer for entity embeddings; manual encoding for other methods |
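| [TensorFlow](/wiki/tensorflow) / Keras | `tf.keras.layers.Embedding`; Keras preprocessing layers such as `StringLookup` and `CategoryEncoding`; the legacy `tf.feature_column.categorical_column_*` family of functions (not recommended for new code) |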
| category_encoders | Python library with 15+ encoding methods including WoE, James-Stein, CatBoost, and leave-one-out encoders |
| Feature-engine | Scikit-learn-compatible library with ordinal, count, decision tree, and mean (target) encoding transformers |
| R (base) | Native `factor` type with ordered and unordered variants; contrast coding built into `lm()` and `glm()` |
| Apache Spark MLlib | `StringIndexer` (label encoding) and `OneHotEncoder` for distributed pipelines |

## Explain like I'm 5 (ELI5)

Imagine you have a box of crayons. Each crayon has a different color: red, blue, green, yellow. These colors are categorical data because they are just names for different groups. You cannot say red is "bigger" than blue or add green plus yellow together.

Now, there are two kinds of groups:

- **Nominal** groups have no order. The colors of crayons are nominal because no color is "first" or "last."
- **Ordinal** groups do have an order. Shirt sizes (small, medium, large) are ordinal because small comes before medium, which comes before large.

Computers only understand numbers, not color names. So before a computer can learn from this data, we need to turn the labels into numbers. There are different ways to do this. One way is to give each color its own column and mark it with a 1 or 0. Another way is to replace each label with the average result the computer has seen for that label. Picking the right way to turn labels into numbers helps the computer learn better.

## See also

- [Numerical data](/wiki/numerical_data)
- [One-hot encoding](/wiki/one-hot_encoding)
- [Feature engineering](/wiki/feature_engineering)
- [Preprocessing](/wiki/preprocessing)
- [Overfitting](/wiki/overfitting)
- [Decision tree](/wiki/decision_tree)
- [Embeddings](/wiki/embeddings)

## References

1. Stevens, S. S. "On the Theory of Scales of Measurement." *Science*, 103(2684), 677-680, 1946.
2. Guo, C. and Berkhahn, F. "Entity Embeddings of Categorical Variables." *arXiv preprint arXiv:1604.06737*, 2016. https://arxiv.org/abs/1604.06737
3. Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., and Gulin, A. "CatBoost: unbiased boosting with categorical features." *Advances in Neural Information Processing Systems (NeurIPS)*, 2018. https://arxiv.org/abs/1706.09516
4. Weinberger, K., Dasgupta, A., Langford, J., Smola, A., and Attenberg, J. "Feature hashing for large scale multitask learning." *Proceedings of the 26th International Conference on Machine Learning (ICML)*, 2009.
5. Pargent, F., Pfisterer, F., Thomas, J., and Bischl, B. "Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features." *Computational Statistics*, 37(5), 2022. https://link.springer.com/article/10.1007/s00180-022-01207-6
6. Hancock, J. T. and Khoshgoftaar, T. M. "Survey on categorical data for neural networks." *Journal of Big Data*, 7(1), 2020.
7. Potdar, K., Pardawala, T., and Pai, C. "A Comparative Study of Categorical Variable Encoding Techniques for Neural Network Classifiers." *International Journal of Computer Applications*, 175(4), 2017.
8. Grinsztajn, L., Oyallon, E., and Varoquaux, G. "Why do tree-based models still outperform deep learning on typical tabular data?" *NeurIPS Datasets and Benchmarks Track*, 2022.
9. Cerda, P. and Varoquaux, G. "Encoding high-cardinality string categorical variables." *IEEE Transactions on Knowledge and Data Engineering*, 34(3), 2022.
10. Dahouda, M. K. and Joe, I. "Effective Methods of Categorical Data Encoding for Artificial Intelligence Algorithms." *Mathematics*, 12(16), 2553, 2024.
11. McGinnis, W. D., Siu, C., Andre, S., and Huang, H. "Category Encoders: a scikit-learn-contrib package of transformers for encoding categorical data." *Journal of Open Source Software*, 3(21), 501, 2018.
12. Seger, C. "An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing." *KTH Royal Institute of Technology*, 2018.
13. pandas documentation. "Categorical data." https://pandas.pydata.org/docs/user_guide/categorical.html
14. scikit-learn documentation. "Preprocessing: Encoding categorical features." https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features
15. XGBoost documentation. "Categorical Data." https://xgboost.readthedocs.io/en/stable/tutorials/categorical.html