# One-Hot Encoding

> Source: https://aiwiki.ai/wiki/one-hot_encoding
> Updated: 2026-06-21
> Categories: Data & Datasets, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

One-hot encoding is a data preprocessing technique that converts a [categorical variable](/wiki/categorical_variable) with $k$ distinct categories into $k$ binary columns, where each category is represented by a vector that contains exactly one 1 and all other entries set to 0. It is the standard way to feed nominal (unordered) [categorical data](/wiki/categorical_data) into [machine learning](/wiki/machine_learning) algorithms that require numerical input, and its defining advantage is that it introduces no spurious order or magnitude between categories: every pair of distinct one-hot vectors is exactly the same distance apart ($\sqrt{2}$ in Euclidean distance) and has a cosine similarity of 0. The technique is a core part of [feature engineering](/wiki/feature_engineering) pipelines and is built into every major data science library, including [pandas](/wiki/pandas), [scikit-learn](/wiki/scikit_learn), [TensorFlow](/wiki/tensorflow), and [PyTorch](/wiki/pytorch).

The term "one-hot" originates from digital circuit design, where it describes a group of bits in which only one bit is "hot" (set to 1) at any given time. In machine learning, the concept was adopted to represent discrete categories as mutually exclusive binary columns. The same encoding is sometimes called a 1-of-k representation or indicator variable representation in classical statistics, where it has been used in regression analysis since at least the mid-twentieth century.[1]

## How does one-hot encoding work?

Most machine learning algorithms, including [linear models](/wiki/linear_model), [neural networks](/wiki/neural_network), and support vector machines, operate on numerical input. Categorical variables such as color, country, or product type cannot be directly processed by these algorithms. One-hot encoding solves this by creating a new binary column for each unique category in the original variable.

The encoding process follows these steps:

1. Identify all unique categories in the categorical variable.
2. Create a new binary column (feature) for each unique category.
3. For each observation, set the column corresponding to its category to 1 and all other columns to 0.
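The three steps above can be sketched in plain Python. This is a minimal illustration rather than a library implementation: the helper name `one_hot_encode` is hypothetical, and sorting the categories alphabetically is an assumption made only to get a stable column order.

```python
def one_hot_encode(values):
    """Return (categories, rows) for a list of categorical values."""
    # Step 1: identify the unique categories (sorted for a stable column order).
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}

    # Steps 2 and 3: one column per category, with a single 1 in each row.
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows


categories, rows = one_hot_encode(["Red", "Green", "Blue", "Red"])
print(categories)  # ['Blue', 'Green', 'Red']
print(rows)        # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

Library encoders follow the same logic but also remember the fitted category list so that new data is mapped to the same columns.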

### Mathematical formulation

Let $X$ be a [categorical variable](/wiki/categorical_variable) taking values in a set of $k$ distinct levels $C = \{c_1, c_2, \ldots, c_k\}$. The one-hot encoding function $\phi : C \to \{0, 1\}^k$ maps each level $c_i$ to the standard basis vector $e_i$, which has a 1 in position $i$ and 0 elsewhere. Formally,

$\phi(c_i)_j = \begin{cases} 1 & \text{if } j = i \\ 0 & \text{otherwise} \end{cases}$

This representation has several useful mathematical properties. The Euclidean distance between any two distinct one-hot vectors is $\sqrt{2}$, the dot product between any two distinct vectors is 0, and the L1 norm (the sum of the entries) of every encoded vector equals 1. These properties guarantee that no spurious order or magnitude is introduced by the encoding itself.
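These properties are easy to check numerically. The sketch below assumes NumPy and an illustrative $k = 4$; it also prints the distances produced by plain integer codes for contrast.

```python
import itertools
import numpy as np

k = 4                      # illustrative number of categories
E = np.eye(k)              # row i is the one-hot vector e_i

pairs = list(itertools.combinations(range(k), 2))
distances = [float(np.linalg.norm(E[i] - E[j])) for i, j in pairs]
dots = [float(E[i] @ E[j]) for i, j in pairs]
l1_norms = np.abs(E).sum(axis=1)

print(distances)  # every entry is sqrt(2), about 1.4142
print(dots)       # every entry is 0.0
print(l1_norms)   # every entry is 1.0

# For contrast, integer codes 0..k-1 give unequal pairwise distances.
code_distances = [abs(i - j) for i, j in pairs]
print(code_distances)  # [1, 2, 3, 1, 2, 1]
```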

### Worked example

Consider a dataset with a "Color" feature containing three categories: Red, Green, and Blue.

| Original value | Red | Green | Blue |
|---|---|---|---|
| Red | 1 | 0 | 0 |
| Green | 0 | 1 | 0 |
| Blue | 0 | 0 | 1 |
| Red | 1 | 0 | 0 |
| Blue | 0 | 0 | 1 |

The single "Color" column has been replaced by three binary columns. Each row has exactly one 1, indicating its category. This representation ensures the algorithm treats each color independently without assuming any ordinal relationship between them.

For a variable with $k$ unique categories, one-hot encoding produces $k$ binary columns. Each resulting vector has dimensionality $k$, with a single 1 at the position corresponding to the category index.

## Where did one-hot encoding come from?

The phrase "one-hot" came from digital electronics, where finite state machines often use a one-hot register: a $k$-bit register that holds exactly one active bit at a time, with each bit representing a separate state. This pattern was popular because it simplifies state-decoding logic at the cost of using more flip-flops than a binary-encoded counter. By the 1980s, the same idea had migrated into the symbolic AI and connectionist communities for representing discrete symbols in neural networks. Geoffrey Hinton and David Rumelhart's work on distributed representations, and later David Touretzky's symbolic-to-vector encoders, all relied on one-of-$k$ input units.

In classical statistics, the same construction is older still. Regression textbooks have called these "dummy variables" since at least Suits (1957) and have used them to incorporate qualitative predictors into ordinary least squares regression.[1] Daniel Suits framed the central requirement that one-hot encoding still works around today: "The use of dummy variables requires the imposition of additional constraints on the parameters of regression equations if determinate estimates are to be obtained," the most useful of which is to "omit one of the dummy variables from the equation."[1] Modern machine learning inherits both vocabularies: the term dummy variable is more common in statistics and econometrics, while one-hot is preferred in machine learning, deep learning, and software engineering contexts.

## One-hot encoding vs label encoding vs ordinal encoding

Several methods exist for encoding categorical variables. The right choice depends on the nature of the data and the model being used.[2][10]

| Property | One-hot encoding | Label encoding | [Ordinal encoding](/wiki/ordinal_encoding) |
|---|---|---|---|
| Output format | $k$ binary columns (one per category) | Single integer column | Single integer column |
| Assumes order | No | No (but model may infer one) | Yes |
| Best suited for | Nominal data with no inherent order | Tree-based models, low cardinality | Data with a natural ranking |
| Dimensionality impact | Replaces 1 column with $k$ (net increase of $k - 1$) | No increase | No increase |
| Risk of spurious ordering | None | High for linear models | None if order is real |
| Example | Red -> [1,0,0], Green -> [0,1,0] | Red -> 0, Green -> 1, Blue -> 2 | Low -> 0, Medium -> 1, High -> 2 |

**Label encoding** assigns each category a unique integer. While compact, it introduces a numerical ordering that linear models and neural networks may interpret as meaningful. For example, encoding "Red" as 0, "Green" as 1, and "Blue" as 2 could cause the model to treat "Blue" as numerically "greater" than "Red," which is meaningless for nominal categories.[2]

**Ordinal encoding** is appropriate when the categories have a genuine rank order, such as education levels (High School < Bachelor's < Master's < PhD) or satisfaction ratings (Low < Medium < High). In these cases, the integer mapping preserves the meaningful ordering.

**One-hot encoding** is the safest default for nominal categorical variables because it does not impose any ordering. However, it comes with a higher dimensionality cost.[10]

## When should you use one-hot encoding?

One-hot encoding is most appropriate in the following situations:

- The categorical variable is **nominal** (categories have no inherent order), such as country, color, or product type.
- The model is a **linear model**, logistic regression, SVM, or neural network that interprets numeric input as having magnitude and order.[2]
- The number of unique categories is **relatively small** (typically fewer than 15 to 20 categories).
- The dataset is large enough that the added dimensionality does not cause overfitting.

One-hot encoding is generally not the best choice when dealing with [high cardinality](/wiki/high_cardinality) features (hundreds or thousands of unique categories), ordinal variables, or tree-based models that handle integer-encoded categories natively.

## Does every model need one-hot encoding?

The necessity of one-hot encoding depends heavily on the algorithm being used.

### Linear models and neural networks

Linear regression, logistic regression, SVMs, and neural networks all compute weighted sums of input features. If a categorical variable is encoded as a single integer column, the model treats those integers as continuous values on a number line, creating spurious relationships. One-hot encoding prevents this by giving each category its own independent coefficient or weight.[7]

In a linear regression with $k - 1$ dummy variables and an intercept, the coefficient on each dummy column gives the average difference in the response between that category and the reference category, holding all other features constant.[1][15] This direct interpretation is one reason dummy coding remains the default in econometric software such as Stata, R, and statsmodels.

### Tree-based models

[Decision tree](/wiki/decision_tree) algorithms, [random forests](/wiki/random_forest), and gradient-boosted trees (such as [XGBoost](/wiki/xgboost), [LightGBM](/wiki/lightgbm), and [CatBoost](/wiki/catboost)) split features based on threshold comparisons. These models can work with integer-encoded categorical variables because each threshold split on the integer column partitions the categories into two groups, and a sequence of such splits can isolate any subset of categories, so the arbitrary numeric order matters far less than it does in a linear model. In fact, one-hot encoding can be detrimental for tree-based models: it produces many sparse binary columns, each of which carries very little information individually. The tree algorithm may undervalue these columns relative to continuous features, leading to worse splits.[10] Libraries like LightGBM and CatBoost provide native support for categorical features without requiring one-hot encoding.

### Distance-based and probabilistic models

Algorithms that rely on distance metrics, such as $k$-nearest neighbors, $k$-means clustering, and Gaussian mixture models, also benefit from one-hot encoding. With one-hot vectors, the distance between two observations differing only in a categorical attribute is the same regardless of which two categories are involved, preserving the symmetry of the original problem. Naive Bayes classifiers can also work directly with one-hot encoded features under the multinomial or Bernoulli assumption, although they typically use raw counts or indicator variables rather than full one-hot tables.

## What is the dummy variable trap?

The dummy variable trap is a form of perfect [multicollinearity](/wiki/multicollinearity) that arises when all $k$ one-hot encoded columns are included as predictors in a linear model that also has an intercept. Because the columns always sum to 1, they add up to the intercept column, and any one column can be perfectly predicted from the remaining $k - 1$ columns. This makes the design matrix rank-deficient (so $X^\top X$ is singular), preventing the model from computing unique coefficient estimates.[15]

The standard solution is to drop one of the $k$ binary columns, producing $k - 1$ "dummy variables." The dropped category becomes the **reference category**, and the model's coefficients for the remaining categories are interpreted relative to it. In Python, both pandas `get_dummies(drop_first=True)`[12] and [scikit-learn](/wiki/scikit_learn)'s `OneHotEncoder(drop='first')`[9] support this directly. Scikit-learn added the `drop` parameter in version 0.21 and the `drop='if_binary'` option in version 0.23.[9]
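The rank deficiency, and the effect of dropping one column, can be seen on a tiny design matrix. This is a sketch assuming NumPy and a made-up four-row dataset with three categories.

```python
import numpy as np

# Four observations over three categories, one-hot encoded (k = 3).
one_hot = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
], dtype=float)
intercept = np.ones((one_hot.shape[0], 1))

# Intercept plus all k dummies: 4 columns but only rank 3.
full = np.hstack([intercept, one_hot])
print(full.shape[1], np.linalg.matrix_rank(full))        # 4 3

# Intercept plus k - 1 dummies (first category dropped): full column rank.
reduced = np.hstack([intercept, one_hot[:, 1:]])
print(reduced.shape[1], np.linalg.matrix_rank(reduced))  # 3 3
```

In the reduced matrix the dropped first category acts as the reference level, absorbed by the intercept.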

The dummy variable trap is specific to linear models with an intercept term. Tree-based models and neural networks (which typically use regularization) are not affected by this issue. In regularized regressions such as ridge or lasso, dropping a category is not strictly required either, because the L2 or L1 penalty resolves the singularity by shrinking redundant coefficients toward zero.[15]

## Why does one-hot encoding cause the curse of dimensionality?

One-hot encoding can dramatically increase the number of features in a dataset. A single categorical column with 1,000 unique values expands into 1,000 binary columns. This phenomenon, related to the [curse of dimensionality](/wiki/curse_of_dimensionality), creates several practical problems:

- **Increased memory usage:** The resulting matrix is mostly zeros, consuming storage without carrying proportional information.
- **Overfitting risk:** With many more features than observations, models may memorize noise rather than learn generalizable patterns.[7]
- **Slower training:** More columns mean larger weight matrices, longer gradient computations, and slower convergence.
- **Diluted signal:** In tree-based models, the many low-information binary columns compete with genuinely informative features for selection at each split.

When a categorical variable has high cardinality, practitioners typically turn to alternative encoding methods such as [target encoding](/wiki/target_encoding), hash encoding, or [entity embedding](/wiki/entity_embedding) rather than using one-hot encoding.[11]

## Sparse vs dense representation

One-hot encoded matrices are inherently sparse: in a vector of length $k$, only one element is nonzero. For high-cardinality features, storing the full dense matrix wastes significant memory. [Sparse representation](/wiki/sparse_representation) formats such as compressed sparse row (CSR) or compressed sparse column (CSC) store only the nonzero entries, reducing memory consumption by orders of magnitude.

Scikit-learn's `OneHotEncoder` returns a sparse matrix by default (controlled by `sparse_output=True`, the parameter that was renamed from `sparse` in version 1.2), which is compatible with most scikit-learn estimators.[9] Pandas' `get_dummies()` returns a dense DataFrame by default, though it supports a `sparse=True` option that uses pandas' `SparseArray` internally.[12] When working with large datasets or high-cardinality features, using sparse representations is critical for keeping memory usage manageable.

A simple memory comparison illustrates the gap. A dense float64 matrix with one million rows and 10,000 one-hot columns would consume roughly 80 GB of RAM. The same data stored in CSR format with one nonzero per row uses only about 12 MB for the indices and values, more than three orders of magnitude smaller. This is why production pipelines almost always carry one-hot data in sparse form until just before it enters a model that requires a dense input.
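The arithmetic behind this comparison is short enough to write out. The sketch assumes 8-byte float64 values and 4-byte integer indices; it also counts the CSR row-pointer array, which the 12 MB figure leaves out.

```python
rows, cols = 1_000_000, 10_000

# Dense float64 matrix: 8 bytes per entry.
dense_bytes = rows * cols * 8

# CSR with one nonzero per row (assumed float64 values, 4-byte indices).
values_bytes = rows * 8          # one stored value per row
indices_bytes = rows * 4         # one column index per row
indptr_bytes = (rows + 1) * 4    # row-pointer array
csr_bytes = values_bytes + indices_bytes + indptr_bytes

print(dense_bytes / 1e9)                     # 80.0 (GB)
print((values_bytes + indices_bytes) / 1e6)  # 12.0 (MB)
print(csr_bytes / 1e6)                       # about 16.0 (MB) including row pointers
```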

## How is one-hot encoding used in natural language processing?

One-hot encoding has historically played an important role in [natural language processing](/wiki/natural_language_processing) (NLP). In its simplest form, each word in a vocabulary is represented as a one-hot vector whose length equals the vocabulary size. A vocabulary of 50,000 words produces 50,000-dimensional vectors with a single 1 per word.

### Bag of words

The [bag of words](/wiki/bag_of_words) model extends one-hot encoding to entire documents. Rather than a single 1 per vector, a bag-of-words vector counts the occurrences of each vocabulary word in a document. This can be viewed as the sum of one-hot vectors for all words in the document. While simple and effective for basic text classification, bag of words shares the sparsity and high dimensionality problems of one-hot encoding.
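The "sum of one-hot vectors" view can be made concrete with a toy vocabulary. The four-word vocabulary and the document below are illustrative assumptions, and NumPy is used only for the row selection and the sum.

```python
import numpy as np

vocab = ["cat", "dog", "sat", "the"]           # illustrative vocabulary
index = {word: i for i, word in enumerate(vocab)}
document = ["the", "cat", "sat", "the"]

# Stack one one-hot row per token, then sum over the token axis.
one_hots = np.eye(len(vocab), dtype=int)[[index[w] for w in document]]
bag = one_hots.sum(axis=0)

print(one_hots.shape)  # (4, 4): one row per token
print(bag)             # [1 0 1 2]
```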

### From one-hot vectors to word embeddings

One-hot encoding treats every word as equally distant from every other word; the cosine similarity between any two one-hot vectors is zero. This means "king" and "queen" are no more similar than "king" and "banana." This limitation motivated the development of dense [word embeddings](/wiki/word_embedding) such as Word2Vec (Mikolov et al., 2013),[3] GloVe (Pennington et al., 2014),[4] and later contextual embeddings from models like BERT and GPT.

Word embeddings map words into a lower-dimensional continuous vector space (typically 100 to 300 dimensions) where semantically similar words are close together. In many neural NLP architectures, the first layer is an [embedding](/wiki/embedding) layer that effectively learns to transform one-hot input vectors into dense embedding vectors during training. Modern transformer-based [language models](/wiki/language_model) have made standalone one-hot representations obsolete for most NLP tasks, though one-hot encoding remains the conceptual starting point from which embeddings are derived.

### Embedding lookup as one-hot multiplication

It is useful to view a learned embedding layer as the product of a one-hot input vector and an embedding matrix. If $W \in \mathbb{R}^{V \times d}$ is the embedding matrix for a vocabulary of size $V$ with dimensionality $d$, then the embedding for token $i$ is $e_i^\top W$, which simply selects row $i$ of $W$. In practice, frameworks implement this as a memory lookup rather than a matrix multiplication, but the conceptual identity is the reason embedding layers in PyTorch and TensorFlow accept integer token IDs instead of one-hot tensors.
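The identity can be verified directly. In this sketch the matrix `W` is filled with random numbers as a stand-in for learned weights, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                      # vocabulary size and embedding dimension
W = rng.normal(size=(V, d))      # stand-in for a learned embedding matrix

token_id = 2
e = np.zeros(V)
e[token_id] = 1.0                # one-hot vector e_i

via_matmul = e @ W               # e_i^T W
via_lookup = W[token_id]         # direct row lookup

print(np.allclose(via_matmul, via_lookup))  # True
```

The lookup touches $d$ numbers, while the explicit product touches $V \times d$, which is why the lookup form is preferred for large vocabularies.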

## One-hot encoding in classification labels

In standard multi-class classification, each sample belongs to exactly one class, and the target can be represented as a one-hot vector. For instance, in a three-class problem, the label for class 2 would be [0, 1, 0]. When this target vector is paired with a softmax output and the [cross-entropy](/wiki/cross_entropy) loss, the loss reduces to $-\log p_{y}$, where $p_y$ is the predicted probability of the true class.[14] This is why frameworks such as PyTorch implement `CrossEntropyLoss` as a function of integer class indices: the one-hot encoding is implicit, and the framework only needs to look up a single output probability per sample.
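The reduction to $-\log p_y$ can be checked with a few lines of NumPy. The logits are arbitrary example values, and the class index is zero-based here, so index 1 corresponds to the one-hot target [0, 1, 0].

```python
import numpy as np

logits = np.array([2.0, 0.5, -1.0])   # arbitrary example scores
true_class = 1                        # zero-based; one-hot target [0, 1, 0]

# Numerically stable softmax.
shifted = logits - logits.max()
probs = np.exp(shifted) / np.exp(shifted).sum()

one_hot = np.zeros_like(probs)
one_hot[true_class] = 1.0

full_loss = -(one_hot * np.log(probs)).sum()   # cross-entropy with one-hot target
short_loss = -np.log(probs[true_class])        # -log p_y

print(np.isclose(full_loss, short_loss))  # True
```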

### Multi-label classification and multi-hot encoding

In [multi-label classification](/wiki/multi_label_classification), a sample can belong to multiple classes simultaneously. The target becomes a "multi-hot" vector where multiple positions can be set to 1. For example, a movie tagged as both "Action" and "Comedy" would have the label vector [1, 0, 1, 0] if the classes are [Action, Drama, Comedy, Horror]. This is sometimes called binary relevance encoding. In scikit-learn, the `MultiLabelBinarizer` class handles this transformation.[9] Multi-hot targets are typically paired with a sigmoid output and a binary cross-entropy loss applied independently to each label.
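The movie example can be reproduced with a small helper. The function name `multi_hot` is hypothetical and the sketch uses plain Python rather than a library class.

```python
classes = ["Action", "Drama", "Comedy", "Horror"]


def multi_hot(tags, classes):
    """Return a 0/1 vector with a 1 for every class present in `tags`."""
    tag_set = set(tags)
    return [1 if c in tag_set else 0 for c in classes]


print(multi_hot(["Action", "Comedy"], classes))  # [1, 0, 1, 0]
print(multi_hot([], classes))                    # [0, 0, 0, 0]
```

Unlike a one-hot vector, a multi-hot vector may sum to any value from 0 to the number of classes.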

### Soft labels and label smoothing

One-hot label vectors place all probability mass on a single class. This sharp encoding can cause neural networks to become overconfident, especially when training data contains some labeling noise. [Label smoothing](/wiki/label_smoothing), introduced by Christian Szegedy and colleagues in the Inception-v3 paper (2016), softens the one-hot target by redistributing a small amount of probability mass to the other classes.[6] With smoothing factor $\alpha$ and a uniform prior over $k$ classes, the smoothed target is $(1 - \alpha)$ times the one-hot vector plus $\alpha / k$ in every position, so the true class receives $1 - \alpha + \alpha / k$ and each other class receives $\alpha / k$. A common variant instead assigns $1 - \alpha$ to the true class and $\alpha / (k - 1)$ to each other class. In the original ImageNet experiments with $k = 1{,}000$ classes, the authors used a uniform prior and $\alpha = 0.1$, and reported a consistent improvement of about 0.2% absolute on both top-1 and top-5 error.[6]
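Two conventions for spreading the smoothed mass are in circulation, and a short sketch shows how they differ. Both helper names are hypothetical: `smooth_uniform` mixes the one-hot target with a uniform distribution over all $k$ classes, while `smooth_others` gives $1 - \alpha$ to the true class and shares $\alpha$ among the remaining $k - 1$ classes.

```python
def smooth_uniform(true_class, k, alpha):
    """Mix the one-hot target with a uniform distribution over k classes."""
    return [(1 - alpha) * (1.0 if j == true_class else 0.0) + alpha / k
            for j in range(k)]


def smooth_others(true_class, k, alpha):
    """Give 1 - alpha to the true class and share alpha among the other k - 1."""
    return [1 - alpha if j == true_class else alpha / (k - 1)
            for j in range(k)]


u = smooth_uniform(true_class=0, k=4, alpha=0.1)
o = smooth_others(true_class=0, k=4, alpha=0.1)
print(u)  # approximately [0.925, 0.025, 0.025, 0.025]
print(o)  # approximately [0.9, 0.0333, 0.0333, 0.0333]
```

Both targets sum to 1, and for large $k$ the two conventions are nearly indistinguishable.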

Label smoothing improves calibration of the predicted probabilities, often improves test accuracy, and is now a default option in many large-scale classification pipelines including ImageNet training and machine translation. Other soft-label techniques include knowledge distillation, where a student network is trained to match a teacher network's full output distribution rather than a one-hot label, and mixup, which trains on convex combinations of pairs of samples and their one-hot labels.

## Implementation in popular libraries

Several Python libraries provide built-in support for one-hot encoding.

| Library | Function/class | Key features |
|---|---|---|
| pandas | `pd.get_dummies()` | Quick and simple; works on DataFrames; supports `drop_first` and `sparse` options |
| [scikit-learn](/wiki/scikit_learn) | `OneHotEncoder` | Fits and transforms; handles unseen categories (`handle_unknown='ignore'`); returns sparse matrices; integrates with `Pipeline` and `ColumnTransformer` |
| [TensorFlow](/wiki/tensorflow)/[Keras](/wiki/keras) | `tf.one_hot` and `tf.keras.utils.to_categorical()` | Converts integer class labels to one-hot tensors for neural network targets |
| [PyTorch](/wiki/pytorch) | `torch.nn.functional.one_hot()` | Converts integer tensor to one-hot tensor; useful for loss computation |
| Category Encoders | `ce.OneHotEncoder` | Drop-in replacement with additional options for handling missing values and rare categories |

### Pandas example

The simplest way to one-hot encode a column in a [pandas](/wiki/pandas) DataFrame is the `get_dummies` function:[12]

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "red"]})
encoded = pd.get_dummies(df, columns=["color"], prefix="color")
print(encoded)
#    color_blue  color_green  color_red
# 0       False        False       True
# 1       False         True      False
# 2        True        False      False
# 3       False        False       True
```

Passing `drop_first=True` produces $k - 1$ dummy columns, suitable for linear regression. Passing `sparse=True` returns a DataFrame backed by `SparseArray` columns to save memory on wide encodings.

### Scikit-learn example

The scikit-learn `OneHotEncoder` follows the standard fit-transform API and integrates with the rest of the library:[9]

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# dtype=object keeps the fitted categories as an object array (see output below).
X = np.array([["red"], ["green"], ["blue"], ["red"]], dtype=object)
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
encoder.fit(X)
print(encoder.categories_)
# [array(['blue', 'green', 'red'], dtype=object)]

print(encoder.transform(np.array([["green"], ["yellow"]])))
# [[0. 1. 0.]
#  [0. 0. 0.]]
```

For real workflows, the encoder is typically wrapped in a `ColumnTransformer` and a `Pipeline` so that fitting on training data and transforming on new data follows the same code path:

```python
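import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny illustrative training set; replace with real data.
X_train = pd.DataFrame({
    "color": ["red", "green", "blue", "red"],
    "country": ["US", "FR", "US", "DE"],
    "price": [1.0, 2.5, 0.5, 3.0],
})
y_train = [0, 1, 0, 1]

# One-hot encode the categorical columns; pass numeric columns through unchanged.
pre = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["color", "country"]),
], remainder="passthrough")

model = Pipeline([("pre", pre), ("clf", LogisticRegression())])
model.fit(X_train, y_train)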
```

### TensorFlow example

In TensorFlow, integer class indices are converted to one-hot tensors with `tf.one_hot`:

```python
import tensorflow as tf

labels = tf.constant([0, 2, 1])
one_hot = tf.one_hot(labels, depth=3)
print(one_hot)
# tf.Tensor(
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]], shape=(3, 3), dtype=float32)
```

The Keras helper `tf.keras.utils.to_categorical` performs the same operation on NumPy arrays and is widely used to prepare classification targets before calling `model.fit`.

### PyTorch example

PyTorch exposes a similar utility under the functional namespace. The `num_classes` argument defaults to -1, in which case the number of classes is inferred as one greater than the largest index in the input tensor:

```python
import torch
import torch.nn.functional as F

labels = torch.tensor([0, 2, 1])
one_hot = F.one_hot(labels, num_classes=3)
print(one_hot)
# tensor([[1, 0, 0],
#         [0, 0, 1],
#         [0, 1, 0]])
```

For most classification training loops, however, PyTorch users skip the explicit one-hot conversion and feed the integer label tensor directly into `nn.CrossEntropyLoss`, which combines log-softmax and negative log-likelihood without ever materializing the one-hot vector.

### Handling unseen categories at inference time

A common challenge in production systems is encountering categories during inference that were not present in the training data. Different tools handle this differently:

- **pandas `get_dummies()`** does not retain knowledge of the training categories. New categories produce extra columns, and missing categories lose their columns entirely, causing shape mismatches. This makes it unsuitable for production pipelines without additional safeguards.
- **scikit-learn `OneHotEncoder`** provides the `handle_unknown` parameter. Setting it to `'ignore'` produces an all-zeros row for unseen categories. Setting it to `'infrequent_if_exist'` maps unseen categories to the infrequent-category column when one has been configured (through `min_frequency` or `max_categories`) and otherwise behaves like `'ignore'`. The encoder must be fitted on training data and persisted (for example, with `joblib`) for use at inference time.
- **Category Encoders** and **feature-engine** also provide configurable handling for unknown categories.

Designing a robust encoding strategy that gracefully handles unseen categories is critical for deploying machine learning models in real-world applications.
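One possible safeguard for the `get_dummies` case is to store the training-time column list and reindex every later encoding against it. The sketch below assumes pandas and made-up data; the variable names are illustrative, and a stateful encoder is usually the more robust option.

```python
import pandas as pd

train = pd.DataFrame({"color": ["red", "green", "blue"]})
serve = pd.DataFrame({"color": ["green", "yellow"]})   # "yellow" is unseen

train_encoded = pd.get_dummies(train, columns=["color"], dtype=int)
train_columns = list(train_encoded.columns)            # persist with the model

# Re-encode at inference time and force the training-time column layout:
# unseen categories are dropped, missing categories are filled with 0.
serve_encoded = (
    pd.get_dummies(serve, columns=["color"], dtype=int)
    .reindex(columns=train_columns, fill_value=0)
)

print(train_columns)                  # ['color_blue', 'color_green', 'color_red']
print(serve_encoded.values.tolist())  # [[0, 1, 0], [0, 0, 0]]
```

The unseen "yellow" row becomes all zeros, which mirrors the behavior of `handle_unknown='ignore'`.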

## What are the alternatives to one-hot encoding?

When one-hot encoding is impractical due to high cardinality or other constraints, several alternative encoding techniques are available.

| Method | Description | Pros | Cons |
|---|---|---|---|
| [Target encoding](/wiki/target_encoding) | Replaces each category with the mean of the target variable for that category | Compact (single column); captures target relationship | Prone to overfitting; requires careful regularization |
| Hash encoding | Applies a hash function to categories and maps them to a fixed number of columns | Fixed output dimensionality; handles unseen categories naturally | Hash collisions can mix unrelated categories |
| [Entity embedding](/wiki/entity_embedding) | Learns dense vector representations for categories via a neural network embedding layer | Captures relationships between categories; compact | Requires a neural network; more complex to implement |
| Binary encoding | Converts category index to binary digits, each digit becoming a column | Fewer columns than one-hot (about $\lceil \log_2 k \rceil$ for $k$ categories) | Introduces artificial ordering in bit patterns |
| Frequency encoding | Replaces each category with its frequency or proportion in the dataset | Single column; no dimensionality increase | Categories with similar frequencies become indistinguishable |
| Leave-one-out encoding | Variant of target encoding that excludes the current sample's target when computing the category mean | Reduces target leakage compared with naive target encoding | Still prone to overfitting on small categories |
| Weight of evidence encoding | Replaces category with $\log(P(X \mid y=1) / P(X \mid y=0))$ | Common in credit scoring; monotone in predicted probability | Limited to binary classification; sensitive to small samples |

[Entity embeddings](/wiki/entity_embedding), introduced by Guo and Berkhahn (2016), have become particularly popular for high-cardinality features in deep learning. They transform categories into trainable dense vectors, analogous to how word embeddings work in NLP. The authors reported that the method generalized well even with little data: "We applied it successfully in a recent Kaggle competition and were able to reach the third position with relative simple features," referring to the Rossmann store sales challenge, where entity embeddings combined with a simple neural network outperformed gradient-boosted trees on tabular data with hundreds of store and product categories.[5]
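Of the methods in the table above, hash encoding is the simplest to sketch without a library. The helper `hash_encode`, the choice of MD5, and the bucket count of 8 are all illustrative assumptions; production implementations of the hashing trick differ in details such as the hash function and the use of signed values.

```python
import hashlib


def hash_encode(category, n_buckets=8):
    """Map a category string to a one-hot vector over a fixed number of buckets."""
    # A stable hash (unlike Python's built-in hash()) gives the same bucket
    # in every process, which matters when training and serving are separate.
    digest = hashlib.md5(category.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % n_buckets
    row = [0] * n_buckets
    row[bucket] = 1
    return row


for code in ["90210", "10001", "never-seen-before"]:
    print(code, hash_encode(code))
```

The output width stays at `n_buckets` no matter how many distinct values appear, and a previously unseen value is encoded without refitting; the cost is that unrelated values can land in the same bucket.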

### Comparing encodings on a high-cardinality feature

A practitioner deciding how to encode a feature like postal_code, which might contain tens of thousands of distinct values, can refer to the following rough guidelines.

| Cardinality | Linear models | Tree models | Neural networks |
|---|---|---|---|
| Up to ~10 | One-hot (or dummy) | One-hot or integer | One-hot or embedding (small) |
| 10 to ~50 | One-hot | Integer or one-hot | Embedding (small) |
| 50 to ~1,000 | Target or hash | Native categorical or target | Embedding |
| 1,000+ | Hash or target | Native categorical | Embedding |

These suggestions are not absolute rules. The right choice depends on dataset size, the target signal in each category, training time budget, and downstream interpretability requirements.

## Practical considerations and pitfalls

Several practical pitfalls catch practitioners off guard when applying one-hot encoding in real projects.

### Data leakage when fitting before splitting

Fitting an encoder on the full dataset before splitting into training and validation sets can leak information about the validation distribution into the model. If a category appears only in validation but is present in the encoder's vocabulary fitted on the full dataset, the model implicitly knows the category exists. The correct workflow fits the encoder only on the training partition and uses `handle_unknown='ignore'` (or an explicit "other" bucket) at inference time.

### Rare category handling

Categories with very few observations contribute weak signal and increase the risk of overfitting. Common strategies include grouping rare levels into a single "other" bucket, removing them entirely if domain knowledge allows, or using scikit-learn's `min_frequency` and `max_categories` parameters in `OneHotEncoder` (added in version 1.1) to fold infrequent levels automatically.
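Grouping rare levels into an "other" bucket before encoding can be sketched in a few lines. The helper name `group_rare` and the threshold of 2 are illustrative; in practice the counts should come from the training data only.

```python
from collections import Counter


def group_rare(values, min_count=2, other_label="other"):
    """Replace categories seen fewer than `min_count` times with `other_label`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


cities = ["Paris", "Paris", "Lyon", "Nice", "Paris", "Lyon"]
print(group_rare(cities))
# ['Paris', 'Paris', 'Lyon', 'other', 'Paris', 'Lyon']
```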

### Missing values

Missing categorical values can be encoded as their own dedicated column (essentially treating NaN as a valid category) or imputed with the most frequent value before encoding. The first approach is preferable when missingness itself is informative. The scikit-learn encoder treats `np.nan` as a separate category by default, while pandas' `get_dummies` ignores NaN unless `dummy_na=True` is passed.

### Multicollinearity diagnostics

Even outside the strict dummy variable trap, near-collinear dummy columns can inflate the variance of estimated coefficients in linear models. Statisticians often check the variance inflation factor (VIF) for each dummy and consider dropping or combining levels with high VIF. This concern does not apply to regularized linear models, tree models, or neural networks.

### Train and serve skew

A classic deployment bug occurs when the training pipeline uses pandas `get_dummies` and the serving pipeline reconstructs columns by hand or in a different order. The model receives column 7 instead of column 5 and silently produces wrong predictions. Persisting the fitted scikit-learn encoder (or another stateful encoder) and using the same object at training and inference time prevents this class of bug.

## One-hot encoding for ordinal targets

When the target variable itself is ordinal (for example, star ratings from 1 to 5), one-hot encoding the target throws away the ordering information. Two specialized alternatives are common. Cumulative encoding represents the $j$-th of $k$ ordered classes as a vector of 1s in positions $1$ through $j$ followed by 0s in the remaining positions; this preserves order and is often used with the proportional odds model. The CORAL framework (Cao et al., 2020) extends this idea to neural networks and yields well-calibrated ordinal classifiers.[13] Both approaches use a non-trivial generalization of one-hot encoding rather than the standard form.
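Cumulative encoding as described here can be sketched for the star-rating example. The helper name `cumulative_encode` is hypothetical and ranks are taken to be 1-based; some formulations drop the first position, which is always 1, and use one fewer output.

```python
def cumulative_encode(rank, num_classes):
    """Encode a 1-based ordinal rank as `rank` ones followed by zeros."""
    if not 1 <= rank <= num_classes:
        raise ValueError("rank must be between 1 and num_classes")
    return [1 if position <= rank else 0
            for position in range(1, num_classes + 1)]


for stars in range(1, 6):
    print(stars, cumulative_encode(stars, 5))
# 1 [1, 0, 0, 0, 0]
# 2 [1, 1, 0, 0, 0]
# 3 [1, 1, 1, 0, 0]
# 4 [1, 1, 1, 1, 0]
# 5 [1, 1, 1, 1, 1]
```

Adjacent ratings differ in exactly one position, so the distance between encoded targets grows with the gap between ratings, which one-hot targets cannot express.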

## Comparison with binary indicator variables in statistics

In classical statistics, the term **dummy variable** is essentially synonymous with one-hot encoding minus one column. R's `model.matrix` function applies a contrast scheme (`contr.treatment` by default) that produces the same $k - 1$ dummies as `OneHotEncoder(drop='first')`. Other contrast schemes such as deviation coding, Helmert coding, and orthogonal polynomial coding produce different sets of $k - 1$ columns that are linear combinations of the standard dummies.[14] These richer encodings can produce more interpretable coefficients in analysis-of-variance contexts but generally do not change the predictive performance of the model.

## How does one-hot encoding compare with other methods empirically?

A 2018 study by Cerda, Varoquaux, and Kégl on "dirty" categorical variables found that one-hot encoding remained competitive on datasets with up to a few hundred categories but lost significantly to similarity encoding and entity embeddings on noisier real-world datasets where category strings contain typos and inconsistent capitalization.[8] Pargent and colleagues' 2022 study compared regularized target encoding with one-hot, hash, frequency, and ordinal encodings across 24 OpenML benchmark datasets and reported that regularized target encoding outperformed all alternatives on high-cardinality features for both linear and tree-based models.[11] These results match the rough guideline that one-hot is the safest choice up to a few dozen categories, after which target encoding or learned embeddings tend to dominate.

## Use cases by domain

One-hot encoding shows up across many machine learning domains, each with its own conventions.

### Tabular data

In tabular data tasks such as Kaggle competitions and business analytics, one-hot encoding is applied to features such as country, department, gender, payment method, and product category. Combined with continuous features, the resulting design matrix is fed to logistic regression, gradient-boosted trees, or a small multilayer perceptron. Tools such as scikit-learn's `ColumnTransformer` make it convenient to apply one-hot encoding to categorical columns while leaving numerical columns untouched.[9]
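
As a minimal sketch of this pattern, the following applies `OneHotEncoder` to two categorical columns and passes a numerical column through unchanged. The column names and values are invented for illustration, and `sparse_threshold=0` is set only so that the toy result comes back as a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: two categorical features and one numerical feature.
df = pd.DataFrame({
    "country": ["US", "DE", "US", "JP"],
    "payment_method": ["card", "cash", "card", "card"],
    "amount": [12.5, 40.0, 7.25, 99.0],
})

preprocess = ColumnTransformer(
    transformers=[
        # One-hot encode the categorical columns; unseen categories at
        # prediction time become all-zero rows instead of raising an error.
        ("onehot", OneHotEncoder(handle_unknown="ignore"),
         ["country", "payment_method"]),
    ],
    remainder="passthrough",  # leave "amount" untouched
    sparse_threshold=0,       # always return a dense array
)

X = preprocess.fit_transform(df)
```

The result has six columns: three for `country` (DE, JP, US in sorted order), two for `payment_method` (card, cash), and the original `amount`.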

### Computer vision

In [computer vision](/wiki/computer_vision), one-hot encoding mostly appears on the output side. Image classification networks produce class probabilities through a softmax over $k$ classes, and the training target for each image is the one-hot indicator vector of its true class. Datasets such as ImageNet (1,000 classes) and CIFAR-10 (10 classes) follow this convention. Object detection and segmentation extend the idea to per-pixel or per-anchor one-hot targets.
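
The relationship between softmax outputs and one-hot targets can be sketched in NumPy as follows. The logits here are random stand-ins for a network's output, and the class count of 10 simply mirrors CIFAR-10; nothing below is tied to a particular framework.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

num_classes = 10
labels = np.array([3, 0, 9])           # true class index per image
targets = np.eye(num_classes)[labels]  # one-hot targets, shape (3, 10)

rng = np.random.default_rng(0)
logits = rng.normal(size=(len(labels), num_classes))  # placeholder outputs
probs = softmax(logits)

# Cross-entropy against one-hot targets: only the true-class term survives.
loss = -(targets * np.log(probs)).sum(axis=1).mean()
```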

### Natural language processing

NLP uses one-hot encoding for tokens at the conceptual level, but in practice token IDs are stored as integers and turned into dense embeddings on the fly. The output side of language models (next-token prediction) computes a softmax over a vocabulary of size $V$, and the training target is the one-hot vector of the actual next token. Frameworks typically compute the loss directly from the integer index of the target token, which is mathematically equivalent to cross-entropy against the one-hot vector even though that vector is never materialized in memory.
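
Both equivalences can be checked numerically. The sketch below uses a tiny made-up vocabulary and random values in place of a trained model; `embedding`, `token_ids`, and `next_ids` are illustrative names only.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 8, 4                          # toy vocabulary size and embedding width
embedding = rng.normal(size=(V, d))  # stand-in for a learned embedding table
token_ids = np.array([2, 5, 5, 0])

# Input side: multiplying a one-hot matrix by the table equals a row lookup.
one_hot_tokens = np.eye(V)[token_ids]
via_matmul = one_hot_tokens @ embedding
via_lookup = embedding[token_ids]

# Output side: cross-entropy against one-hot targets equals picking the
# log-probability at the target index.
logits = rng.normal(size=(len(token_ids), V))
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
next_ids = np.array([5, 5, 0, 3])

loss_one_hot = -(np.eye(V)[next_ids] * log_probs).sum(axis=1).mean()
loss_indexed = -log_probs[np.arange(len(next_ids)), next_ids].mean()
```

The indexed versions avoid allocating arrays whose width is the vocabulary size, which matters when $V$ is in the tens of thousands or more.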

### Reinforcement learning

In [reinforcement learning](/wiki/reinforcement_learning), discrete action spaces are commonly represented with one-hot vectors. A policy network outputs a softmax over actions, and the action chosen during exploration is encoded as a one-hot vector for use in policy gradient updates. State spaces with discrete components (for example, the type of a unit in a strategy game) are also frequently one-hot encoded before being concatenated with continuous state features.
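
One reason the one-hot action vector appears in policy gradient updates is that, for a softmax policy, the gradient of the log-probability of the chosen action with respect to the logits is the one-hot action vector minus the action probabilities. The sketch below verifies this with finite differences; the logits and the chosen action are arbitrary example values.

```python
import numpy as np

def softmax(z):
    exp = np.exp(z - z.max())
    return exp / exp.sum()

logits = np.array([0.5, -1.0, 2.0, 0.0])  # example policy logits, 4 actions
probs = softmax(logits)
action = 2                                # index of the sampled action
action_one_hot = np.eye(len(logits))[action]

# Analytic gradient of log pi(action) with respect to the logits.
grad_log_prob = action_one_hot - probs

# Finite-difference check of the same gradient.
eps = 1e-6
numeric = np.zeros_like(logits)
for i in range(len(logits)):
    bumped = logits.copy()
    bumped[i] += eps
    numeric[i] = (np.log(softmax(bumped)[action]) - np.log(probs[action])) / eps

# Discrete state component (hypothetical unit type 1 of 3) concatenated
# with continuous state features.
state = np.concatenate([np.eye(3)[1], np.array([0.25, -1.5])])
```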

### Recommender systems

In recommendation, user IDs and item IDs are conceptually one-hot vectors. Classical matrix factorization can be viewed as projecting a one-hot user vector and a one-hot item vector through learned embedding matrices and scoring the pair by the dot product of the two resulting embeddings. Modern recommender architectures such as deep learning recommendation models (DLRM) keep this structure: dense numerical features pass through an MLP, while categorical features are looked up through embedding tables that are mathematically equivalent to multiplying a one-hot vector by an embedding matrix.
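
The matrix factorization view can be written out directly. In this sketch the factor matrices `P` and `Q` are random placeholders for learned user and item embeddings, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 5, 7, 3
P = rng.normal(size=(n_users, d))  # user embedding matrix (placeholder)
Q = rng.normal(size=(n_items, d))  # item embedding matrix (placeholder)

user, item = 3, 6
x_user = np.eye(n_users)[user]     # one-hot user vector
y_item = np.eye(n_items)[item]     # one-hot item vector

# Score through explicit one-hot products: (x^T P) . (y^T Q)
score_one_hot = (x_user @ P) @ (y_item @ Q)

# Same score through embedding-table lookups.
score_lookup = P[user] @ Q[item]
```

Production systems use the lookup form because the one-hot vectors would have one entry per user or item, which can run into the millions.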

### Genomics and bioinformatics

In genomics, DNA sequences are represented as one-hot encoded matrices over the alphabet $\{A, C, G, T\}$, producing a $4 \times L$ matrix for a sequence of length $L$. Convolutional neural networks operating on this matrix have become standard for tasks such as transcription factor binding prediction (DeepBind, Alipanahi et al., 2015) and chromatin accessibility prediction (DeepSEA, Zhou and Troyanskaya, 2015). Protein sequences are similarly one-hot encoded over the 20-letter amino acid alphabet, although learned embeddings from large protein language models such as ESM have largely supplanted plain one-hot input in recent years.
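
A minimal encoder for the $4 \times L$ representation might look like the following. The function name `one_hot_dna` is hypothetical, and leaving unknown bases such as `N` as all-zero columns is one common convention assumed here, not something the cited papers are claimed to prescribe.

```python
import numpy as np

ALPHABET = "ACGT"  # row order of the encoded matrix

def one_hot_dna(sequence):
    """Encode a DNA string as a 4 x L matrix with one 1 per known base."""
    row_of = {base: row for row, base in enumerate(ALPHABET)}
    matrix = np.zeros((len(ALPHABET), len(sequence)), dtype=np.int8)
    for col, base in enumerate(sequence.upper()):
        row = row_of.get(base)
        if row is not None:  # assumption: ambiguous bases stay all-zero
            matrix[row, col] = 1
    return matrix

encoded = one_hot_dna("GATTACA")
with_unknown = one_hot_dna("AN")
```

Each column of `encoded` contains a single 1 in the row of its base, so a convolution filter of width $w$ sliding along the sequence axis sees a $4 \times w$ window.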

## Explain like I'm 5 (ELI5)

Imagine you have a box of different colored balls: red, blue, and green. You want to tell a robot which color ball to pick up, but the robot only understands numbers, not colors. So you come up with a plan: you create a small chart with three columns, one for each color. When you want the robot to pick up a red ball, you put a 1 in the red column and 0s in the other columns. For a blue ball, you put a 1 in the blue column and 0s elsewhere, and the same for green. This way, you have turned colors (categories) into a set of numbers (binary columns) that the robot can understand. That is how one-hot encoding helps machine learning algorithms work with categories.

## Summary

One-hot encoding is the workhorse method for turning small to mid-sized categorical features into numeric input that any machine learning algorithm can consume. Its strength is its simplicity and the fact that it imposes no spurious order on the categories. Its weakness is the explosion in dimensionality when cardinality grows, which can hurt both memory usage and statistical efficiency. For high-cardinality features, target encoding, hash encoding, and learned embeddings have largely taken over. For everything else, one-hot encoding remains a sound default and is built into every major data science library.

## References

1. Suits, D. B. (1957). "Use of Dummy Variables in Regression Equations." *Journal of the American Statistical Association*, 52(280), 548-551.
2. Potdar, K., Pardawala, T. S., and Pai, C. D. (2017). "A Comparative Study of Categorical Variable Encoding Techniques for Neural Network Classifiers." *International Journal of Computer Applications*, 175(4), 7-9.
3. Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). "Efficient Estimation of Word Representations in Vector Space." *Proceedings of ICLR Workshop*. arXiv:1301.3781.
4. Pennington, J., Socher, R., and Manning, C. D. (2014). "GloVe: Global Vectors for Word Representation." *Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)*, 1532-1543.
5. Guo, C. and Berkhahn, F. (2016). "Entity Embeddings of Categorical Variables." *arXiv preprint* arXiv:1604.06737.
6. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2016). "Rethinking the Inception Architecture for Computer Vision." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, 2818-2826.
7. Hancock, J. T. and Khoshgoftaar, T. M. (2020). "Survey on Categorical Data for Neural Networks." *Journal of Big Data*, 7, 28.
8. Cerda, P., Varoquaux, G., and Kegl, B. (2018). "Similarity Encoding for Learning with Dirty Categorical Variables." *Machine Learning*, 107(8-10), 1477-1494.
9. Pedregosa, F., Varoquaux, G., Gramfort, A., et al. (2011). "Scikit-learn: Machine Learning in Python." *Journal of Machine Learning Research*, 12, 2825-2830.
10. Harris, C. G. and Srinivasan, P. (2012). "Comparing Approaches to Encoding Nominal Features in Classification Tasks." *Proceedings of the 2012 International Conference on Information and Knowledge Engineering*.
11. Pargent, F., Pfisterer, F., Thomas, J., and Bischl, B. (2022). "Regularized Target Encoding Outperforms Traditional Methods in Supervised Machine Learning with High Cardinality Features." *Computational Statistics*, 37, 2671-2692.
12. McKinney, W. (2010). "Data Structures for Statistical Computing in Python." *Proceedings of the 9th Python in Science Conference*, 51-56.
13. Cao, W., Mirjalili, V., and Raschka, S. (2020). "Rank Consistent Ordinal Regression for Neural Networks with Application to Age Estimation." *Pattern Recognition Letters*, 140, 325-331.
14. Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer.
15. Hastie, T., Tibshirani, R., and Friedman, J. (2009). *The Elements of Statistical Learning*, 2nd ed. Springer.

