In machine learning and statistics, categorical data (also called qualitative data) refers to variables that take on a discrete set of values representing categories, groups, or labels rather than numerical quantities. Unlike continuous or numerical data, categorical values have no inherent mathematical meaning: you cannot meaningfully add, subtract, or average them. Categorical data plays a central role in many machine learning tasks, including classification, clustering, and regression, where it often appears as input features, target labels, or both.
Examples of categorical data include colors (red, blue, green), country names (USA, France, Japan), blood types (A, B, AB, O), and product ratings (1 star through 5 stars). Since most machine learning algorithms operate on numerical inputs, converting categorical data into a suitable numeric representation is a fundamental step in data preprocessing.
The psychologist Stanley Smith Stevens introduced the classic typology of measurement scales in his 1946 paper "On the Theory of Scales of Measurement," published in Science. Stevens identified four levels: nominal, ordinal, interval, and ratio. Categorical data falls under the first two levels.
| Level | Ordered | Equal spacing | True zero | Examples |
|---|---|---|---|---|
| Nominal | No | No | No | Colors, blood types, country codes |
| Ordinal | Yes | No | No | Education level, satisfaction ratings, T-shirt sizes |
| Interval | Yes | Yes | No | Temperature in Celsius, calendar years |
| Ratio | Yes | Yes | Yes | Height, weight, income |
The interval and ratio levels describe quantitative (numerical) data. The distinction between nominal and ordinal categorical data has direct consequences for which encoding methods are appropriate and which statistical tests can be applied.
Categorical data falls into two main subtypes: nominal and ordinal.
Nominal data consists of categories with no natural order or ranking. The categories are simply different labels, and no category is "greater than" or "less than" another. Examples include colors (red, blue, green), blood types (A, B, AB, O), and country names.
Because there is no ordering relationship among nominal categories, encoding methods that impose an artificial numeric order (such as label encoding) can mislead certain models into assuming a ranking that does not exist.
Ordinal data consists of categories with a meaningful order or ranking, but the distances between categories are not necessarily equal or known. Examples include education levels (high school, bachelor's, master's, doctorate), satisfaction ratings, T-shirt sizes (S, M, L, XL), and star-based product ratings.
The key difference from nominal data is that ordinal categories carry information about relative position. A "5-star" rating is better than a "3-star" rating, even though the exact numerical gap between them may not be well defined.
Binary data is a special case of nominal data with exactly two categories. Examples include yes/no, true/false, male/female, and pass/fail. Binary features are the simplest categorical type and can be represented directly as 0 or 1 without any risk of imposing a false ordering. Many encoding methods reduce multi-category features into sets of binary columns.
The cardinality of a categorical feature is the number of unique categories it contains. The level of cardinality has major practical implications for encoding and modeling.
| Cardinality level | Typical range | Examples | Encoding considerations |
|---|---|---|---|
| Low cardinality | 2 to ~20 unique values | Gender, color, day of week, country (small set) | One-hot encoding works well; most encoding methods are viable |
| Medium cardinality | ~20 to ~100 unique values | US state, product category, job title | One-hot encoding starts creating many columns; target or binary encoding may be preferable |
| High cardinality | 100+ unique values | ZIP code, user ID, product SKU, IP address | One-hot encoding is impractical; feature hashing, target encoding, or entity embeddings are needed |
High-cardinality features pose a particular challenge. Applying one-hot encoding to a feature with 10,000 unique values would create 10,000 new binary columns, leading to extreme dimensionality, increased memory usage, and potential overfitting. Specialized encoding strategies are essential for handling these cases effectively.
Since most machine learning algorithms require numerical inputs, categorical features must be transformed into numbers through a process called encoding. The choice of encoding method depends on the type of categorical data, its cardinality, and the model being used.
One-hot encoding creates a new binary column for each unique category. Each observation gets a 1 in the column corresponding to its category and 0 in all other columns. For a color feature with values red, blue, and green:
| Original | is_red | is_blue | is_green |
|---|---|---|---|
| Red | 1 | 0 | 0 |
| Blue | 0 | 1 | 0 |
| Green | 0 | 0 | 1 |
One-hot encoding is the most widely used method for nominal data with low to moderate cardinality. It avoids introducing any false ordinal relationship between categories. However, it suffers from the curse of dimensionality when applied to high-cardinality features, as it creates one column per category.
A common variant is dummy encoding, which drops one of the K columns (producing K-1 columns) to avoid perfect multicollinearity in linear regression and other linear models. The dropped category becomes the implicit reference level.
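As a quick illustration, here is a minimal sketch of one-hot and dummy encoding using pandas' get_dummies; the toy color values mirror the example above.

```python
import pandas as pd

df = pd.DataFrame({'color': ['Red', 'Blue', 'Green', 'Red']})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df['color'], prefix='is')

# Dummy encoding: drop one column so the dropped category
# becomes the implicit reference level
dummy = pd.get_dummies(df['color'], prefix='is', drop_first=True)
```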
Label encoding assigns each category a unique integer (for example, red = 0, blue = 1, green = 2). This produces a single numeric column, making it memory-efficient. However, it introduces an implicit ordinal relationship between categories. A model might incorrectly interpret green (2) as being "twice" blue (1) or "greater than" red (0).
Label encoding is appropriate for ordinal data where the integer ordering reflects the true ranking. For nominal data, it should generally be avoided with linear models and neural networks, though decision tree-based models are less affected because they split on specific threshold values rather than assuming linear relationships.
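For illustration, a minimal sketch of label encoding with pandas' factorize, which assigns integers in order of first appearance (the color values are illustrative):

```python
import pandas as pd

colors = pd.Series(['red', 'blue', 'green', 'red'])

# factorize assigns integers by first appearance: red=0, blue=1, green=2
codes, uniques = pd.factorize(colors)
print(codes)    # [0 1 2 0]
print(uniques)  # Index(['red', 'blue', 'green'], dtype='object')
```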
Ordinal encoding is similar to label encoding but explicitly maps categories to integers according to a known, meaningful order. For education level, a mapping like high school = 1, bachelor's = 2, master's = 3, doctorate = 4 preserves the natural ranking. In scikit-learn, the OrdinalEncoder class accepts a user-specified category order, ensuring the mapping aligns with domain knowledge rather than arbitrary alphabetical or appearance-based sorting.
Best practices for ordinal encoding include fitting the encoder on training data only (to prevent data leakage) and handling unknown categories at inference time by assigning them a default value or raising an error.
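A short sketch of how this might look with scikit-learn's OrdinalEncoder, using the education-level ordering above; mapping unknown categories to a sentinel value is one of the options the class supports.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({'education': ["high school", "bachelor's", "master's", "doctorate"]})

# Explicit category order prevents arbitrary alphabetical sorting;
# unknown categories at inference time are mapped to -1
encoder = OrdinalEncoder(
    categories=[["high school", "bachelor's", "master's", "doctorate"]],
    handle_unknown='use_encoded_value',
    unknown_value=-1,
)
encoder.fit(train[['education']])

# "vocational" was never seen during fit, so it becomes -1
print(encoder.transform(pd.DataFrame({'education': ["master's", "vocational"]})))
```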
Target encoding replaces each category with the mean of the target variable for that category. For a binary classification task, each category value is replaced by the proportion of positive examples observed for that category in the training data.
Target encoding is powerful because it captures the relationship between the category and the target directly. It works well with high-cardinality features since it always produces a single numeric column. The major risk is target leakage: if the encoding is computed using the same data that the model trains on, the model can memorize the target through the encoded values. To mitigate this, practitioners use several techniques: smoothing the category mean toward the global mean, computing the encoding out-of-fold with K-fold cross-fitting, adding random noise to the encoded values, and leave-one-out encoding (described below).
Scikit-learn introduced a built-in TargetEncoder class (since version 1.3) that applies internal cross-fitting to reduce leakage.
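A minimal sketch using that class (requires scikit-learn 1.3 or later); the city feature and target values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import TargetEncoder

# Toy binary-classification data: one nominal feature and a 0/1 target
X = np.array([['London'], ['Paris'], ['London'], ['Tokyo'], ['Paris'], ['Tokyo']])
y = np.array([1, 0, 1, 0, 1, 0])

# fit_transform uses internal cross-fitting, so no row is encoded with
# its own target value; transform on new data uses the full-training encoding
encoder = TargetEncoder(smooth='auto')
X_encoded = encoder.fit_transform(X, y)
```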
Frequency encoding replaces each category with its frequency or proportion of occurrence in the dataset. If "blue" appears in 30 out of 100 rows, it is encoded as 30 (count encoding) or 0.30 (frequency encoding).
This method is simple and produces a single column per feature. It works well when the frequency of a category is genuinely correlated with the target variable. A limitation is that categories with the same frequency receive identical encoded values, causing a loss of distinguishing information.
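A possible pandas implementation of frequency encoding, using the 30-out-of-100 "blue" example from above:

```python
import pandas as pd

df = pd.DataFrame({'color': ['blue'] * 30 + ['red'] * 50 + ['green'] * 20})

# Map each category to its relative frequency in the training data
freq = df['color'].value_counts(normalize=True)   # blue -> 0.30, red -> 0.50, green -> 0.20
df['color_freq'] = df['color'].map(freq)
```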
Binary encoding is a compromise between one-hot encoding and label encoding. First, each category is assigned an integer. Then, each integer is converted to its binary (base-2) representation, and each bit becomes a separate column. For a feature with 8 categories, binary encoding produces only 3 columns (since log2(8) = 3), compared to 8 columns for one-hot encoding.
Binary encoding reduces dimensionality significantly for high-cardinality features. However, it can introduce misleading distances between categories: two categories whose binary representations differ by a single bit may appear closer than categories differing by multiple bits, even when no such proximity exists in reality.
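As a sketch, binary encoding is available in the third-party category_encoders library (discussed later in this article); the city names below are illustrative.

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'city': ['Tokyo', 'Paris', 'London', 'Berlin',
                            'Madrid', 'Rome', 'Oslo', 'Vienna']})

# Categories are first mapped to integers, then each integer's bits are
# spread across a small number of binary columns
encoder = ce.BinaryEncoder(cols=['city'])
encoded = encoder.fit_transform(df)
```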
Feature hashing applies a hash function (such as MurmurHash3) to map each category to one of a fixed number of output columns. This approach is useful for extremely high-cardinality features or situations where the full set of categories is not known in advance (for example, streaming data with new categories appearing over time).
The number of output columns is a parameter chosen by the practitioner, providing direct control over dimensionality. The main drawback is hash collisions: different categories may map to the same column, mixing unrelated information. Despite this, feature hashing is widely used in large-scale production systems where memory efficiency and speed are needed. The scikit-learn library provides FeatureHasher for this purpose.
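A minimal sketch with FeatureHasher, assuming string-valued categories and an illustrative output width of 8 columns:

```python
from sklearn.feature_extraction import FeatureHasher

# Each sample is a list of strings; the hasher maps every string to one of
# n_features columns, so the output width is fixed regardless of cardinality
hasher = FeatureHasher(n_features=8, input_type='string')
X = hasher.transform([['user_12345'], ['user_67890'], ['user_12345']])
print(X.toarray())
```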
Weinberger et al. (2009) introduced the hashing trick for large-scale multitask learning and proved exponential tail bounds showing that hash collisions have negligible impact on learning performance with high probability.
Weight of evidence encoding originated in the credit scoring industry and is designed specifically for binary classification problems. For each category, WoE is calculated as the natural logarithm of the ratio of the proportion of positive cases to the proportion of negative cases:
WoE = ln(Distribution of Positives / Distribution of Negatives)
Positive WoE values indicate a category with more positive cases than expected; negative values indicate more negative cases. A WoE of zero means the category has an equal proportion of both classes. WoE encoding is popular in financial risk modeling, fraud detection, and credit scoring because it produces a monotonic relationship between the encoded feature and the log-odds of the target.
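A rough pandas sketch of the WoE calculation above, using a made-up binary "default" target; real implementations add smoothing to avoid the zero-frequency problem.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'segment': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
    'default': [1, 1, 0, 0, 0, 0, 1, 1, 0],
})

counts = pd.crosstab(df['segment'], df['default'])
dist_pos = counts[1] / counts[1].sum()   # share of all positives falling in each category
dist_neg = counts[0] / counts[0].sum()   # share of all negatives falling in each category

# WoE = ln(Distribution of Positives / Distribution of Negatives)
woe = np.log(dist_pos / dist_neg)
df['segment_woe'] = df['segment'].map(woe)
```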
James-Stein encoding is a Bayesian shrinkage technique that computes a weighted average between the category-specific target mean and the overall (global) target mean. The weight depends on the variance and sample size within each category: categories with fewer observations are shrunk more heavily toward the global mean, while categories with many observations retain values closer to their observed mean.
This approach is based on the James-Stein estimator, which was originally defined for normally distributed data. It naturally regularizes against overfitting on rare categories and is available in the category_encoders library for Python.
Leave-one-out (LOO) encoding is closely related to target encoding. For each observation, the encoded value is the mean of the target variable for all other observations sharing the same category, excluding the current observation. By leaving out each row's own target value, LOO encoding reduces the direct target leakage present in naive target encoding. However, it can still overfit on small categories where excluding a single observation produces large fluctuations in the mean.
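A compact pandas sketch of the leave-one-out calculation; note that a category with a single row divides by zero, which reflects the small-category fragility mentioned above.

```python
import pandas as pd

df = pd.DataFrame({
    'city':   ['Paris', 'Paris', 'Paris', 'Tokyo', 'Tokyo'],
    'target': [1, 0, 1, 1, 0],
})

# For each row: (category target sum - own target) / (category count - 1)
grp = df.groupby('city')['target']
df['city_loo'] = (grp.transform('sum') - df['target']) / (grp.transform('count') - 1)
```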
Entity embeddings, introduced by Guo and Berkhahn in 2016, use a neural network to learn a dense, low-dimensional vector representation for each category. During training, each category is mapped to an embedding vector (similar to word embeddings in natural language processing), and the embedding weights are updated through backpropagation.
Entity embeddings capture semantic relationships between categories. For example, an embedding for geographic regions might learn that "France" and "Germany" are closer together than "France" and "Japan." This technique excels with high-cardinality features and large datasets, and the learned embeddings can be reused as input features for other models, including tree-based methods. The downside is that it requires a neural network training step and sufficient data to learn meaningful representations. Entity embeddings were popularized by their success in the Kaggle "Rossmann Store Sales" competition and have since become standard practice in deep learning for tabular data.
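As an illustration, a minimal PyTorch sketch of an embedding layer for a single categorical feature; the cardinality and embedding dimension are arbitrary.

```python
import torch
import torch.nn as nn

# 1,000 distinct categories mapped into 16-dimensional dense vectors;
# the embedding weights are learned by backpropagation with the rest of the network
num_categories, embedding_dim = 1000, 16
embedding = nn.Embedding(num_categories, embedding_dim)

# Categories must first be label-encoded as integer indices
category_ids = torch.tensor([3, 17, 3, 999])
vectors = embedding(category_ids)   # shape: (4, 16)
```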
In the statistics and social sciences community, several contrast coding systems are used to represent categorical variables in regression models. These methods encode a K-level categorical variable into K-1 contrast vectors with specific mathematical properties.
| Scheme | Reference point | Interpretation of coefficients |
|---|---|---|
| Treatment (dummy) coding | One reference category | Difference between each category and the reference category |
| Sum (deviation) coding | Grand mean | Deviation of each category from the overall mean |
| Helmert coding | Mean of subsequent levels | Difference between each level and the average of all subsequent levels |
| Backward difference coding | Previous level | Difference between each level and the preceding level |
All contrast coding schemes yield the same model predictions; they differ only in how the regression coefficients are interpreted. Treatment coding is the default in most software and is equivalent to dummy encoding. Sum coding is preferred when the goal is to compare each category against the overall average rather than against a single reference category.
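For illustration, the patsy library (which statsmodels uses for formula handling) can generate treatment and sum contrasts directly; the column name below is hypothetical.

```python
import pandas as pd
from patsy import dmatrix

df = pd.DataFrame({'group': ['a', 'b', 'c', 'a', 'c']})

# Treatment (dummy) coding: coefficients compare each level to the reference level
treatment = dmatrix('C(group, Treatment)', df, return_type='dataframe')

# Sum (deviation) coding: coefficients compare each level to the grand mean
deviation = dmatrix('C(group, Sum)', df, return_type='dataframe')
```

The table below compares the encoding methods discussed in this article.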
| Method | Output columns | Handles high cardinality | Preserves order | Risks / drawbacks | Best for |
|---|---|---|---|---|---|
| One-hot encoding | K (one per category) | No (dimensionality explosion) | No | Sparse, high memory for many categories | Low-cardinality nominal features |
| Label encoding | 1 | Yes | Yes (imposed) | False ordinal relationship for nominal data | Ordinal features; tree-based models |
| Ordinal encoding | 1 | Yes | Yes (user-defined) | Requires domain knowledge of ordering | Ordinal features with known ranking |
| Target encoding | 1 | Yes | N/A | Target leakage without regularization | High-cardinality features with supervised tasks |
| Frequency encoding | 1 | Yes | N/A | Categories with same frequency become identical | Frequency-correlated features |
| Binary encoding | log2(K) | Partially | No | Misleading distances between categories | Medium to high-cardinality features |
| Feature hashing | Fixed (user-chosen) | Yes | No | Hash collisions; irreversible | Very high cardinality; streaming data |
| WoE encoding | 1 | Yes | N/A | Only for binary targets; zero-frequency issues | Credit scoring; risk modeling |
| James-Stein encoding | 1 | Yes | N/A | Assumes normality of target | Target-based tasks with rare categories |
| Leave-one-out encoding | 1 | Yes | N/A | Overfitting on small categories | Moderate-cardinality supervised tasks |
| Entity embeddings | D (embedding dimension) | Yes | Learned | Requires neural net training; needs large data | High-cardinality features with deep learning |
Different families of machine learning models interact with categorical features in fundamentally different ways. Choosing the right encoding depends heavily on the model.
Decision trees, random forests, and gradient boosting models split features at specific threshold values. Because they do not assume any linear or continuous relationship between feature values, they are more tolerant of label encoding even for nominal data. A tree can isolate the category coded as 2 with splits such as "Is feature X ≤ 1.5?" and "Is feature X ≤ 2.5?", rather than treating the value 2 as numerically meaningful.
Some implementations can handle categorical features natively without any preprocessing:
| Library | Native categorical support | Method used |
|---|---|---|
| CatBoost | Yes (built-in) | Ordered target statistics |
| LightGBM | Yes (built-in) | Optimal split finding on categories |
| XGBoost | Experimental (since v1.6) | Optimal partitioning of categories |
| Scikit-learn trees | No | Requires manual encoding |
For libraries without native support, label encoding or target encoding typically works well.
Linear regression, logistic regression, and support vector machines learn by assigning weights to each feature and computing predictions through weighted sums. These models assume a linear relationship between feature values and the target, so label encoding nominal data is problematic: the model would treat the numeric gaps between category codes as meaningful. One-hot encoding is the standard approach for linear models with low-cardinality features. For high-cardinality features, target encoding or feature hashing are better alternatives.
Neural networks are flexible enough to learn complex nonlinear relationships, but they still require numerical input. One-hot encoding works for low-cardinality features, but entity embeddings are the preferred approach for medium to high-cardinality features. Embedding layers (available in frameworks like PyTorch and TensorFlow) learn dense representations during training, capturing relationships between categories that one-hot encoding cannot express.
Naive Bayes classifiers can work directly with categorical features without any encoding. The algorithm estimates the conditional probability of each category given each class label from the training data. For this reason, Naive Bayes is sometimes used as a baseline for classification tasks with many categorical features.
CatBoost (short for "Categorical Boosting") is a gradient boosting library developed by Yandex that was designed specifically to handle categorical features without manual preprocessing. It uses a technique called ordered target statistics to encode categories during training.
The core idea is to compute a target-based encoding for each category, but in an order-dependent way that prevents target leakage. For each training example, CatBoost calculates the average target value only from examples that appeared before it in a random permutation of the data. This "look only at the past" approach ensures that no example's target value is used to compute its own encoding.
The formula for the ordered target statistic for observation i with category value k is:
TS(i, k) = (sum of target values for category k before position i + prior) / (count of category k before position i + 1)
The prior term is typically the global target mean multiplied by a smoothing parameter a (default 1), which regularizes the estimate for rare categories.
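A simplified Python sketch of this calculation, using a single random permutation and the global target mean as the prior; the real CatBoost implementation uses several permutations and additional machinery.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'cat': ['a', 'b', 'a', 'a', 'b', 'a'],
                   'y':   [1,   0,   1,   0,   1,   1]})

prior = df['y'].mean()               # global target mean used as the prior
order = rng.permutation(len(df))     # random permutation of the rows

sums, counts, ts = {}, {}, np.empty(len(df))
for pos in order:
    c = df['cat'].iloc[pos]
    # Encode using only the "past": rows seen earlier in the permutation
    ts[pos] = (sums.get(c, 0.0) + prior) / (counts.get(c, 0) + 1)
    sums[c] = sums.get(c, 0.0) + df['y'].iloc[pos]
    counts[c] = counts.get(c, 0) + 1

df['cat_ordered_ts'] = ts
```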
CatBoost also has a parameter called one_hot_max_size (default is typically 2) that determines a threshold: features with fewer unique values than this threshold are one-hot encoded, while features with more unique values use ordered target statistics. Additionally, CatBoost can automatically generate and encode feature combinations (interactions between pairs of categorical features), expanding the effective feature space without manual feature engineering.
Prokhorenkova et al. (2018) showed that standard target statistics (used in other gradient boosting implementations) introduce a prediction shift, and CatBoost's ordered approach eliminates this bias.
Beyond encoding for machine learning, categorical data has a rich set of statistical tools for analysis and hypothesis testing.
The Pearson chi-squared test is the most widely used test for determining whether there is a statistically significant association between two categorical variables. It compares observed frequencies in a contingency table against the frequencies expected under the null hypothesis of independence.
The test statistic is:
chi-squared = sum of ((O - E)^2 / E)
where O is the observed frequency and E is the expected frequency for each cell. If the resulting p-value is below a chosen significance level (commonly 0.05), the null hypothesis of independence is rejected.
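For example, SciPy's chi2_contingency performs this test on a contingency table of observed counts (the counts below are made up):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table: rows = product category, columns = churned / retained
observed = np.array([[30, 70],
                     [45, 55],
                     [25, 75]])

chi2, p_value, dof, expected = chi2_contingency(observed)
if p_value < 0.05:
    print('Reject independence: the variables appear to be associated')
```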
While the chi-squared test determines whether an association exists, it does not measure the strength of that association. Cramer's V fills this gap. It is derived from the chi-squared statistic and ranges from 0 (no association) to 1 (perfect association).
| Cramer's V | Interpretation |
|---|---|
| 0.00 to 0.10 | Negligible association |
| 0.10 to 0.30 | Weak association |
| 0.30 to 0.50 | Moderate association |
| 0.50 to 1.00 | Strong association |
These thresholds should be adjusted based on the degrees of freedom of the contingency table.
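A small helper that derives Cramer's V from the chi-squared statistic, following the standard formula V = sqrt(chi2 / (n * (min(r, c) - 1))):

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(observed):
    """Cramer's V from a contingency table (rows x columns of counts)."""
    chi2, _, _, _ = chi2_contingency(observed)
    n = observed.sum()
    r, c = observed.shape
    return np.sqrt(chi2 / (n * (min(r, c) - 1)))

table = np.array([[30, 70], [45, 55], [25, 75]])
print(cramers_v(table))
```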
For small sample sizes where expected cell counts fall below 5, the chi-squared approximation becomes unreliable. Fisher's exact test computes the exact probability of obtaining the observed (or a more extreme) distribution under the null hypothesis, making it suitable for 2x2 contingency tables with small samples.
Mutual information (MI) measures the amount of information that one variable provides about another. Unlike the chi-squared test, MI can capture nonlinear dependencies between variables. MI equals zero when two variables are independent and increases with stronger dependence. In scikit-learn, mutual_info_classif and mutual_info_regression compute MI scores for feature selection with categorical data.
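A brief sketch of computing MI scores for a categorical feature with scikit-learn; the feature must be integer-encoded first, and discrete_features=True marks it as discrete rather than continuous.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import OrdinalEncoder

X_cat = np.array([['red'], ['blue'], ['red'], ['green'], ['blue'], ['red']])
y = np.array([1, 0, 1, 0, 0, 1])

# Integer-encode the categories, then score them against the target
X_enc = OrdinalEncoder().fit_transform(X_cat)
mi = mutual_info_classif(X_enc, y, discrete_features=True, random_state=0)
```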
Selecting the most informative categorical features before training a model can improve performance and reduce computation. Common approaches include:
| Method | Type | Handles mixed types | Notes |
|---|---|---|---|
| Chi-squared test | Filter | No (categorical only) | Tests association between each feature and the target; available via SelectKBest in scikit-learn |
| Mutual information | Filter | Yes (categorical and numerical) | Captures nonlinear relationships; more general than chi-squared |
| Cramer's V | Filter | No (categorical only) | Measures association strength; useful for feature-feature correlation analysis |
| Permutation importance | Model-based | Yes | Measures the drop in model performance when a feature's values are shuffled |
| Tree-based importance | Model-based | Yes | Uses split-based or gain-based importance from tree models |
Missing values are common in real-world categorical data and require careful handling. Several strategies are used in practice:
| Strategy | Description | When to use |
|---|---|---|
| Mode imputation | Replace missing values with the most frequent category | Small percentage of missing values (under 5 to 10%); values are missing at random |
| Dedicated "Missing" category | Treat missing as its own category | Missingness itself carries information (for example, a customer who did not provide their occupation) |
| Model-based imputation (KNN, MICE) | Predict the missing category using other features | Moderate missingness; when relationships between features can inform imputation |
| Drop rows or columns | Remove observations or features with missing values | Very small number of affected rows, or a feature with an extremely high missing rate |
Mode imputation is the simplest approach and works when the proportion of missing values is small. However, it can distort the distribution of the feature by inflating the count of the most frequent category. Creating a dedicated "Missing" category is often the most practical approach because it preserves all data and allows the model to learn whether missingness is predictive. For more complex scenarios, K-nearest neighbors (KNN) imputation and Multiple Imputation by Chained Equations (MICE) use relationships among features to predict missing values, though these methods are computationally more expensive.
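A short pandas sketch of the two simplest strategies from the table above, mode imputation and a dedicated "Missing" category; the occupation column is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'occupation': ['engineer', np.nan, 'teacher', 'engineer', np.nan]})

# Mode imputation: fill with the most frequent category
df['occupation_mode'] = df['occupation'].fillna(df['occupation'].mode()[0])

# Dedicated category: preserve the fact that the value was missing
df['occupation_flagged'] = df['occupation'].fillna('Missing')
```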
Data leakage occurs when information from the test set or target variable improperly influences the training process, leading to artificially inflated performance metrics that do not generalize to new data. Categorical encoding is one of the most common sources of data leakage in machine learning pipelines.
The fundamental rule is to fit all encoding transformations on the training data only, then apply (transform) to the validation and test sets. In scikit-learn, wrapping encoders inside a Pipeline or ColumnTransformer and using cross-validation ensures correct fit/transform separation. For target-based encoders specifically, internal cross-fitting (as implemented in scikit-learn's TargetEncoder) provides an additional layer of protection.
The Python library pandas provides a dedicated Categorical dtype for representing categorical data efficiently. Converting a string column to the Categorical dtype can yield substantial benefits.
Memory savings: Internally, pandas stores categories as integer codes rather than repeating the full string for each row. For a column with 1 million rows but only 50 unique values, converting from object dtype to Categorical can reduce memory usage by over 95% (for example, from 64 MB to under 1 MB).
Performance improvements: Operations such as groupby, value_counts, and comparisons run faster on Categorical columns because they operate on small integer codes rather than variable-length strings. Speedups of 5x to 10x are common for groupby operations on large datasets.
Usage example in pandas:
```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'green', 'red', 'blue'] * 200000})
df['color'] = df['color'].astype('category')  # Convert to Categorical dtype

# Optional: specify a custom order for ordinal data
df['size'] = pd.Categorical(
    ['S', 'M', 'L', 'XL', 'S'] * 200000,
    categories=['S', 'M', 'L', 'XL'],
    ordered=True
)
```
The Categorical dtype also enforces a fixed set of allowed values, catching data entry errors such as misspellings. Scikit-learn and other libraries increasingly support Categorical columns directly, reducing the need for manual encoding.
Scikit-learn provides a standardized workflow for encoding categorical features within a machine learning pipeline. The ColumnTransformer class allows different encoding strategies to be applied to different columns in a single step.
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# Define which columns get which encoding
preprocessor = ColumnTransformer(
    transformers=[
        ('nominal', OneHotEncoder(handle_unknown='ignore'), ['color', 'country']),
        ('ordinal', OrdinalEncoder(categories=[['S', 'M', 'L', 'XL']]), ['size']),
    ],
    remainder='passthrough'  # keep numeric columns as-is
)

# Combine preprocessing and model in a single pipeline
pipe = Pipeline([
    ('preprocess', preprocessor),
    ('model', RandomForestClassifier())
])

# X_train, y_train, X_test, y_test are assumed to be defined elsewhere
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)
```
Using a pipeline ensures that encoding is fit only on training data and applied consistently to test data, preventing data leakage.
Beyond pandas and scikit-learn, several tools provide specialized support for categorical data.
| Tool / library | Categorical support |
|---|---|
| PyTorch | nn.Embedding layer for entity embeddings; manual encoding for other methods |
| TensorFlow / Keras | tf.keras.layers.Embedding, tf.feature_column.categorical_column_* family of functions |
| category_encoders | Python library with 15+ encoding methods including WoE, James-Stein, CatBoost, and leave-one-out encoders |
| Feature-engine | Scikit-learn-compatible library with ordinal, target, count, decision tree, and mean encoding transformers |
| R (base) | Native factor type with ordered and unordered variants; contrast coding built into lm() and glm() |
| Apache Spark MLlib | StringIndexer (label encoding) and OneHotEncoder for distributed pipelines |
Imagine you have a box of crayons. Each crayon has a different color: red, blue, green, yellow. These colors are categorical data because they are just names for different groups. You cannot say red is "bigger" than blue or add green plus yellow together.
Now, there are two kinds of groups: groups with no order, like crayon colors, where no color comes "first"; and groups with an order, like T-shirt sizes (small, medium, large), where one really is bigger than another.
Computers only understand numbers, not color names. So before a computer can learn from this data, we need to turn the labels into numbers. There are different ways to do this. One way is to give each color its own column and mark it with a 1 or 0. Another way is to replace the labels with their average score. Picking the right way to turn labels into numbers helps the computer learn better.