One-hot encoding is a fundamental data preprocessing technique used in machine learning and statistics to convert categorical data into a numerical representation. Each category in a categorical variable is transformed into a binary vector where exactly one element is set to 1 and all remaining elements are set to 0. The technique is a core part of feature engineering pipelines and is essential for feeding non-numeric data into algorithms that require numerical input.
The term "one-hot" originates from digital circuit design, where it describes a group of bits in which only one bit is "hot" (set to 1) at any given time. In machine learning, the concept was adopted to represent discrete categories as mutually exclusive binary columns.
Most machine learning algorithms, including linear models, neural networks, and support vector machines, operate on numerical input. Categorical variables such as color, country, or product type cannot be directly processed by these algorithms. One-hot encoding solves this by creating a new binary column for each unique category in the original variable.
The encoding process follows these steps:
1. Identify the unique categories in the variable.
2. Create one new binary column for each category.
3. For each row, set the column corresponding to that row's category to 1 and all other columns to 0.
Consider a dataset with a "Color" feature containing three categories: Red, Green, and Blue.
| Original value | Red | Green | Blue |
|---|---|---|---|
| Red | 1 | 0 | 0 |
| Green | 0 | 1 | 0 |
| Blue | 0 | 0 | 1 |
| Red | 1 | 0 | 0 |
| Blue | 0 | 0 | 1 |
The single "Color" column has been replaced by three binary columns. Each row has exactly one 1, indicating its category. This representation ensures the algorithm treats each color independently without assuming any ordinal relationship between them.
For a variable with k unique categories, one-hot encoding produces k binary columns. Each resulting vector has dimensionality k, with a single 1 at the position corresponding to the category index.
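The table above can be reproduced directly with pandas; a minimal sketch (the dtype=int argument forces 0/1 integers, since recent pandas versions return booleans by default):

```python
import pandas as pd

# Toy dataset matching the Color example above
df = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Red", "Blue"]})

# One binary column per unique category (columns come out in sorted order)
encoded = pd.get_dummies(df["Color"], dtype=int)
print(encoded)
#    Blue  Green  Red
# 0     0      0    1
# 1     0      1    0
# 2     1      0    0
# 3     0      0    1
# 4     1      0    0
```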
Several methods exist for encoding categorical variables. The right choice depends on the nature of the data and the model being used.
| Property | One-hot encoding | Label encoding | Ordinal encoding |
|---|---|---|---|
| Output format | k binary columns (one per category) | Single integer column | Single integer column |
| Assumes order | No | No (but model may infer one) | Yes |
| Best suited for | Nominal data with no inherent order | Tree-based models, low cardinality | Data with a natural ranking |
| Dimensionality impact | Increases by k columns | No increase | No increase |
| Risk of spurious ordering | None | High for linear models | None if order is real |
| Example | Red -> [1,0,0], Green -> [0,1,0] | Red -> 0, Green -> 1, Blue -> 2 | Low -> 0, Medium -> 1, High -> 2 |
Label encoding assigns each category a unique integer. While compact, it introduces a numerical ordering that linear models and neural networks may interpret as meaningful. For example, encoding "Red" as 0, "Green" as 1, and "Blue" as 2 could cause the model to treat "Blue" as numerically "greater" than "Red," which is meaningless for nominal categories.
Ordinal encoding is appropriate when the categories have a genuine rank order, such as education levels (High School < Bachelor's < Master's < PhD) or satisfaction ratings (Low < Medium < High). In these cases, the integer mapping preserves the meaningful ordering.
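For comparison, here is a sketch of ordinal encoding with scikit-learn's OrdinalEncoder, passing the category order explicitly so the integers reflect the true ranking rather than alphabetical order:

```python
from sklearn.preprocessing import OrdinalEncoder

# Explicit ordering: Low < Medium < High
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
ratings = [["Medium"], ["Low"], ["High"], ["Low"]]

print(encoder.fit_transform(ratings))
# [[1.]
#  [0.]
#  [2.]
#  [0.]]
```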
One-hot encoding is the safest default for nominal categorical variables because it does not impose any ordering. However, it comes with a higher dimensionality cost.
One-hot encoding is most appropriate in the following situations:
- The variable is nominal, with no inherent order among its categories.
- The number of unique categories is small enough that the added columns remain manageable.
- The model computes weighted sums or distances over its inputs, as linear models, neural networks, and SVMs do.
One-hot encoding is generally not the best choice when dealing with high-cardinality features (hundreds or thousands of unique categories), ordinal variables, or tree-based models that handle integer-encoded categories natively.
The necessity of one-hot encoding depends heavily on the algorithm being used.
Linear regression, logistic regression, SVMs, and neural networks all compute weighted sums of input features. If a categorical variable is encoded as a single integer column, the model treats those integers as continuous values on a number line, creating spurious relationships. One-hot encoding prevents this by giving each category its own independent coefficient or weight.
Decision tree algorithms, random forests, and gradient-boosted trees (such as XGBoost, LightGBM, and CatBoost) split features based on threshold comparisons. These models can handle integer-encoded categorical variables without assuming an ordering, because any split on an integer column effectively partitions the categories into two groups. In fact, one-hot encoding can be detrimental for tree-based models: it produces many sparse binary columns, each of which carries very little information individually. The tree algorithm may undervalue these columns relative to continuous features, leading to worse splits. Libraries like LightGBM and CatBoost provide native support for categorical features without requiring one-hot encoding.
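As a sketch of native categorical handling, LightGBM's scikit-learn API treats pandas category-dtype columns as categorical by default, so no one-hot step is needed (the tiny dataset and min_child_samples setting here are purely illustrative):

```python
import pandas as pd
from lightgbm import LGBMClassifier

X = pd.DataFrame({
    "color": pd.Categorical(["Red", "Green", "Blue", "Red", "Blue", "Green"]),
    "price": [1.0, 2.5, 3.0, 1.2, 2.9, 2.4],
})
y = [0, 1, 1, 0, 1, 1]

# The category-dtype column is split on directly, without one-hot encoding
model = LGBMClassifier(min_child_samples=1)
model.fit(X, y)
```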
The dummy variable trap is a form of perfect multicollinearity that arises when all k one-hot encoded columns are included as predictors in a linear model. Because the columns always sum to 1, any one column can be perfectly predicted from the remaining k - 1 columns. This makes the design matrix singular, preventing the model from computing unique coefficient estimates.
The standard solution is to drop one of the k binary columns, producing k - 1 "dummy variables." The dropped category becomes the reference category, and the model's coefficients for the remaining categories are interpreted relative to it. In Python, both pandas get_dummies(drop_first=True) and scikit-learn's OneHotEncoder(drop='first') support this directly.
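Both approaches are sketched below. With the categories sorted alphabetically (Blue, Green, Red), Blue is dropped and becomes the reference (the sparse_output parameter assumes scikit-learn 1.2 or later; older versions call it sparse):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

colors = pd.DataFrame({"Color": ["Red", "Green", "Blue"]})

# pandas: k - 1 dummy columns, dropping the alphabetically first category
dummies = pd.get_dummies(colors["Color"], drop_first=True, dtype=int)
print(dummies.columns.tolist())  # ['Green', 'Red'] -- Blue is the reference

# scikit-learn equivalent
encoder = OneHotEncoder(drop="first", sparse_output=False)
print(encoder.fit_transform(colors[["Color"]]))
# [[0. 1.]
#  [1. 0.]
#  [0. 0.]]  <- Blue encodes as all zeros (the reference category)
```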
The dummy variable trap is specific to linear models with an intercept term. Tree-based models are not affected, and neural networks tolerate the redundant column because they are trained iteratively by gradient descent (typically with regularization) rather than by solving for unique coefficients.
One-hot encoding can dramatically increase the number of features in a dataset. A single categorical column with 1,000 unique values expands into 1,000 binary columns. This phenomenon, related to the curse of dimensionality, creates several practical problems:
- Memory consumption and training time grow with the number of added columns.
- Each binary column is sparse and carries little information on its own, which can dilute feature importance.
- The ratio of features to training examples worsens, increasing the risk of overfitting.
When a categorical variable has high cardinality, practitioners typically turn to alternative encoding methods such as target encoding, hash encoding, or entity embeddings rather than using one-hot encoding.
One-hot encoded matrices are inherently sparse: in a vector of length k, only one element is nonzero. For high-cardinality features, storing the full dense matrix wastes significant memory. Sparse representation formats such as compressed sparse row (CSR) or compressed sparse column (CSC) store only the nonzero entries, reducing memory consumption by orders of magnitude.
Scikit-learn's OneHotEncoder returns a sparse matrix by default (using sparse_output=True), which is compatible with most scikit-learn estimators. Pandas' get_dummies() returns a dense DataFrame by default, though it supports a sparse=True option that uses pandas' SparseArray internally. When working with large datasets or high-cardinality features, using sparse representations is critical for keeping memory usage manageable.
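The memory difference can be sketched as follows (again assuming the sparse_output parameter of scikit-learn 1.2 or later):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
# 10,000 rows of a feature with roughly 1,000 unique categories
X = rng.integers(0, 1000, size=(10_000, 1)).astype(str)

X_sparse = OneHotEncoder().fit_transform(X)                    # CSR matrix
X_dense = OneHotEncoder(sparse_output=False).fit_transform(X)  # dense ndarray

sparse_bytes = X_sparse.data.nbytes + X_sparse.indices.nbytes + X_sparse.indptr.nbytes
print(sparse_bytes)    # on the order of hundreds of kilobytes
print(X_dense.nbytes)  # ~80 MB: 10,000 x 1,000 float64 cells
```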
One-hot encoding has historically played an important role in natural language processing (NLP). In its simplest form, each word in a vocabulary is represented as a one-hot vector whose length equals the vocabulary size. A vocabulary of 50,000 words produces 50,000-dimensional vectors with a single 1 per word.
The bag of words model extends one-hot encoding to entire documents. Rather than a single 1 per vector, a bag-of-words vector counts the occurrences of each vocabulary word in a document. This can be viewed as the sum of one-hot vectors for all words in the document. While simple and effective for basic text classification, bag of words shares the sparsity and high dimensionality problems of one-hot encoding.
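A minimal bag-of-words sketch using scikit-learn's CountVectorizer:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse document-term count matrix

print(vectorizer.get_feature_names_out())
# ['cat' 'dog' 'mat' 'on' 'sat' 'the']
print(counts.toarray())
# [[1 0 1 1 1 2]
#  [0 1 0 0 1 1]]
```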
One-hot encoding treats every word as equally distant from every other word; the cosine similarity between any two distinct one-hot vectors is zero. This means "king" and "queen" are no more similar than "king" and "banana." This limitation motivated the development of dense word embeddings such as Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and later contextual embeddings from models like BERT and GPT.
Word embeddings map words into a lower-dimensional continuous vector space (typically 100 to 300 dimensions) where semantically similar words are close together. In many neural NLP architectures, the first layer is an embedding layer that effectively learns to transform one-hot input vectors into dense embedding vectors during training. Modern transformer-based language models have made standalone one-hot representations obsolete for most NLP tasks, though one-hot encoding remains the conceptual starting point from which embeddings are derived.
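The equivalence between an embedding lookup and a one-hot matrix multiplication can be sketched in PyTorch (the vocabulary and embedding sizes are arbitrary):

```python
import torch
import torch.nn.functional as F

vocab_size, embed_dim = 50_000, 300
embedding = torch.nn.Embedding(vocab_size, embed_dim)

word_id = torch.tensor([42])

# Direct lookup, which is what the embedding layer actually performs
via_lookup = embedding(word_id)

# Equivalent formulation: one-hot vector times the embedding weight matrix
one_hot = F.one_hot(word_id, num_classes=vocab_size).float()
via_matmul = one_hot @ embedding.weight

print(torch.allclose(via_lookup, via_matmul))  # True
```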
In standard multi-class classification, each sample belongs to exactly one class, and the target can be represented as a one-hot vector. For instance, in a three-class problem, the second class would be represented as [0, 1, 0].
In multi-label classification, a sample can belong to multiple classes simultaneously. The target becomes a "multi-hot" vector where multiple positions can be set to 1. For example, a movie tagged as both "Action" and "Comedy" would have the label vector [1, 0, 1, 0] if the classes are [Action, Drama, Comedy, Horror]. This is sometimes called binary relevance encoding. In scikit-learn, the MultiLabelBinarizer class handles this transformation.
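A sketch of the movie example with MultiLabelBinarizer, fixing the class order to match the text above:

```python
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(classes=["Action", "Drama", "Comedy", "Horror"])
labels = [("Action", "Comedy"), ("Drama",)]

print(mlb.fit_transform(labels))
# [[1 0 1 0]
#  [0 1 0 0]]
```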
One-hot (and multi-hot) encoding of targets is also standard practice in neural network classifiers, though frameworks differ in what their loss functions expect: Keras's categorical_crossentropy loss requires one-hot targets (its sparse_categorical_crossentropy variant accepts integer labels instead), while PyTorch's CrossEntropyLoss takes integer class indices directly; multi-hot targets appear with binary cross-entropy losses in multi-label settings.
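Both framework utilities can be sketched briefly (the import paths assume TensorFlow 2.x and a recent PyTorch):

```python
import torch
import torch.nn.functional as F
from tensorflow.keras.utils import to_categorical

labels = [0, 2, 1]

# Keras: integer labels -> one-hot float matrix
print(to_categorical(labels, num_classes=3))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]

# PyTorch: integer tensor -> one-hot integer tensor
print(F.one_hot(torch.tensor(labels), num_classes=3))
# tensor([[1, 0, 0],
#         [0, 0, 1],
#         [0, 1, 0]])
```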
Several Python libraries provide built-in support for one-hot encoding.
| Library | Function/class | Key features |
|---|---|---|
| pandas | pd.get_dummies() | Quick and simple; works on DataFrames; supports drop_first and sparse options |
| scikit-learn | OneHotEncoder | Fits and transforms; handles unseen categories (handle_unknown='ignore'); returns sparse matrices; integrates with Pipeline and ColumnTransformer |
| TensorFlow/Keras | tf.keras.utils.to_categorical() | Converts integer class labels to one-hot vectors for neural network targets |
| PyTorch | torch.nn.functional.one_hot() | Converts integer tensor to one-hot tensor; useful for loss computation |
| Category Encoders | ce.OneHotEncoder | Drop-in replacement with additional options for handling missing values and rare categories |
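A typical scikit-learn pattern wires the encoder into a ColumnTransformer alongside numeric preprocessing (the column names here are hypothetical):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame({
    "color": ["Red", "Green", "Blue", "Red"],
    "price": [1.0, 2.5, 3.0, 1.2],
})
y = [0, 1, 1, 0]

preprocess = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["color"]),
    ("numeric", StandardScaler(), ["price"]),
])

model = Pipeline([("preprocess", preprocess), ("classifier", LogisticRegression())])
model.fit(X, y)
```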
A common challenge in production systems is encountering categories during inference that were not present in the training data. Different tools handle this differently:
- Pandas get_dummies() does not retain knowledge of the training categories. New categories produce extra columns, and missing categories lose their columns entirely, causing shape mismatches. This makes it unsuitable for production pipelines without additional safeguards.
- Scikit-learn's OneHotEncoder provides the handle_unknown parameter. Setting it to 'ignore' produces an all-zeros row for unseen categories. Setting it to 'infrequent_if_exist' maps unseen categories to an infrequent category bin. The encoder must be fitted on training data and persisted (for example, with joblib) for use at inference time.

Designing a robust encoding strategy that gracefully handles unseen categories is critical for deploying machine learning models in real-world applications.
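The all-zeros behavior for unseen categories can be sketched as follows:

```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
encoder.fit([["Red"], ["Green"], ["Blue"]])

# "Purple" was never seen during fitting, so it encodes as all zeros
print(encoder.transform([["Green"], ["Purple"]]))
# [[0. 1. 0.]
#  [0. 0. 0.]]
```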
When one-hot encoding is impractical due to high cardinality or other constraints, several alternative encoding techniques are available.
| Method | Description | Pros | Cons |
|---|---|---|---|
| Target encoding | Replaces each category with the mean of the target variable for that category | Compact (single column); captures target relationship | Prone to overfitting; requires careful regularization |
| Hash encoding | Applies a hash function to categories and maps them to a fixed number of columns | Fixed output dimensionality; handles unseen categories naturally | Hash collisions can mix unrelated categories |
| Entity embeddings | Learns dense vector representations for categories via a neural network embedding layer | Captures relationships between categories; compact | Requires a neural network; more complex to implement |
| Binary encoding | Converts category index to binary digits, each digit becoming a column | Fewer columns than one-hot (log2 of category count) | Introduces artificial ordering in bit patterns |
| Frequency encoding | Replaces each category with its frequency or proportion in the dataset | Single column; no dimensionality increase | Categories with similar frequencies become indistinguishable |
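As one concrete illustration, a bare-bones version of target encoding takes only a few lines of pandas (real implementations add smoothing or cross-fitting to limit the overfitting noted in the table):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", "C"],
    "target": [1, 0, 1, 1, 0, 1],
})

# Replace each category with the mean target value observed for it
means = df.groupby("city")["target"].mean()
df["city_encoded"] = df["city"].map(means)
print(df[["city", "city_encoded"]].drop_duplicates())
#   city  city_encoded
# 0    A      0.500000
# 2    B      0.666667
# 5    C      1.000000
```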
Entity embeddings, introduced by Guo and Berkhahn (2016), have become particularly popular for high-cardinality features in deep learning. They transform categories into trainable dense vectors, analogous to how word embeddings work in NLP.
Imagine you have a box of different colored balls: red, blue, and green. You want to tell a robot which color ball to pick up, but the robot only understands numbers, not colors. So you come up with a plan: you create a small chart with three columns, one for each color. When you want the robot to pick up a red ball, you put a 1 in the red column and 0s in the other columns. For a blue ball, you put a 1 in the blue column and 0s elsewhere, and the same for green. This way, you have turned colors (categories) into a set of numbers (binary columns) that the robot can understand. That is how one-hot encoding helps machine learning algorithms work with categories.