One-Hot Encoding
Last reviewed
May 9, 2026
Sources
15 citations
Review status
Source-backed
Revision
v4 · 5,430 words
See also: Machine learning terms
One-hot encoding is a fundamental data preprocessing technique used in machine learning and statistics to convert categorical data into a numerical representation. Each category in a categorical variable is transformed into a binary vector where exactly one element is set to 1 and all remaining elements are set to 0. The technique is a core part of feature engineering pipelines and is essential for feeding non-numeric data into algorithms that require numerical input.
The term "one-hot" originates from digital circuit design, where it describes a group of bits in which only one bit is "hot" (set to 1) at any given time. In machine learning, the concept was adopted to represent discrete categories as mutually exclusive binary columns. The same encoding is sometimes called a 1-of-k representation or indicator variable representation in classical statistics, where it has been used in regression analysis since at least the mid-twentieth century.
Most machine learning algorithms, including linear models, neural networks, and support vector machines, operate on numerical input. Categorical variables such as color, country, or product type cannot be directly processed by these algorithms. One-hot encoding solves this by creating a new binary column for each unique category in the original variable.
The encoding process follows these steps:
Let $X$ be a categorical variable with $k$ distinct levels $\{c_1, c_2, \ldots, c_k\}$. The one-hot encoding function $\phi : X \to \{0, 1\}^k$ maps each value $x = c_i$ to the standard basis vector $e_i \in \mathbb{R}^k$, where $e_i$ has a 1 in position $i$ and 0 elsewhere. Formally,
$\phi(c_i)_j = \begin{cases} 1 & \text{if } j = i \\ 0 & \text{otherwise} \end{cases}$
This representation has several useful mathematical properties. The Euclidean distance between any two distinct one-hot vectors is $\sqrt{2}$, the dot product between any two distinct vectors is 0, and the L1 norm (the sum of the entries) of every encoded vector equals 1. Because every pair of categories is equally far apart, the encoding itself introduces no spurious order or magnitude.
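As a quick check of the definition and the properties above, the following NumPy sketch builds the encoding by selecting rows of an identity matrix. The helper name `one_hot` and the example levels are illustrative assumptions, not part of any library.

```python
import numpy as np

def one_hot(values, levels):
    """Map each value to the standard basis vector of its level (hypothetical helper)."""
    index = {level: i for i, level in enumerate(levels)}
    # Row i of the identity matrix is exactly the basis vector e_i.
    return np.eye(len(levels), dtype=int)[[index[v] for v in values]]

levels = ["red", "green", "blue"]
encoded = one_hot(["red", "blue", "green"], levels)

# Distinct one-hot vectors are orthogonal and sqrt(2) apart; every row sums to 1.
dot = int(encoded[0] @ encoded[1])
dist = float(np.linalg.norm(encoded[0] - encoded[1]))
row_sums = encoded.sum(axis=1)
print(encoded, dot, dist, row_sums)
```

The column order is fixed by the `levels` list, so the same list has to be reused whenever new data is encoded.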
Consider a dataset with a "Color" feature containing three categories: Red, Green, and Blue.
| Original value | Red | Green | Blue |
|---|---|---|---|
| Red | 1 | 0 | 0 |
| Green | 0 | 1 | 0 |
| Blue | 0 | 0 | 1 |
| Red | 1 | 0 | 0 |
| Blue | 0 | 0 | 1 |
The single "Color" column has been replaced by three binary columns. Each row has exactly one 1, indicating its category. This representation ensures the algorithm treats each color independently without assuming any ordinal relationship between them.
For a variable with $k$ unique categories, one-hot encoding produces $k$ binary columns. Each resulting vector has dimensionality $k$, with a single 1 at the position corresponding to the category index.
The phrase "one-hot" came from digital electronics, where finite state machines often use a one-hot register: a $k$-bit register that holds exactly one active bit at a time, with each bit representing a separate state. This pattern was popular because it simplifies state-decoding logic at the cost of using more flip-flops than a binary-encoded counter. By the 1980s, the same idea had migrated into the symbolic AI and connectionist communities for representing discrete symbols in neural networks. Geoffrey Hinton and David Rumelhart's work on distributed representations, and later David Touretzky's symbolic-to-vector encoders, all relied on one-of-k input units.
In classical statistics, the same construction is older still. Regression textbooks have called these "dummy variables" since at least Suits (1957) and have used them to incorporate qualitative predictors into ordinary least squares regression. Modern machine learning inherits both vocabularies: the term dummy variable is more common in statistics and econometrics, while one-hot is preferred in machine learning, deep learning, and software engineering contexts.
Several methods exist for encoding categorical variables. The right choice depends on the nature of the data and the model being used.
| Property | One-hot encoding | Label encoding | Ordinal encoding |
|---|---|---|---|
| Output format | $k$ binary columns (one per category) | Single integer column | Single integer column |
| Assumes order | No | No (but model may infer one) | Yes |
| Best suited for | Nominal data with no inherent order | Tree-based models, low cardinality | Data with a natural ranking |
| Dimensionality impact | Replaces one column with $k$ columns | No increase | No increase |
| Risk of spurious ordering | None | High for linear models | None if order is real |
| Example | Red -> [1,0,0], Green -> [0,1,0] | Red -> 0, Green -> 1, Blue -> 2 | Low -> 0, Medium -> 1, High -> 2 |
Label encoding assigns each category a unique integer. While compact, it introduces a numerical ordering that linear models and neural networks may interpret as meaningful. For example, encoding "Red" as 0, "Green" as 1, and "Blue" as 2 could cause the model to treat "Blue" as numerically "greater" than "Red," which is meaningless for nominal categories.
Ordinal encoding is appropriate when the categories have a genuine rank order, such as education levels (High School < Bachelor's < Master's < PhD) or satisfaction ratings (Low < Medium < High). In these cases, the integer mapping preserves the meaningful ordering.
One-hot encoding is the safest default for nominal categorical variables because it does not impose any ordering. However, it comes with a higher dimensionality cost.
One-hot encoding is most appropriate in the following situations:
One-hot encoding is generally not the best choice when dealing with high cardinality features (hundreds or thousands of unique categories), ordinal variables, or tree-based models that handle integer-encoded categories natively.
The necessity of one-hot encoding depends heavily on the algorithm being used.
Linear regression, logistic regression, SVMs, and neural networks all compute weighted sums of input features. If a categorical variable is encoded as a single integer column, the model treats those integers as continuous values on a number line, creating spurious relationships. One-hot encoding prevents this by giving each category its own independent coefficient or weight.
In a linear regression with $k - 1$ dummy variables and an intercept, the coefficient on each dummy column gives the average difference in the response between that category and the reference category, holding all other features constant. This direct interpretation is one reason dummy coding remains the default in econometric software such as Stata, R, and statsmodels.
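The interpretation above can be checked on a toy example. The sketch below assumes a single categorical predictor with made-up response values, so there are no other features to hold constant; it fits ordinary least squares with `numpy.linalg.lstsq` and compares the coefficients with the group means.

```python
import numpy as np

# Toy data (made up for illustration): three categories, "a" is the reference.
cats = np.array(["a", "a", "b", "b", "c", "c"])
y = np.array([1.0, 3.0, 5.0, 7.0, 10.0, 12.0])

# Design matrix: intercept plus k - 1 = 2 dummy columns for "b" and "c".
X = np.column_stack([
    np.ones(len(cats)),
    (cats == "b").astype(float),
    (cats == "c").astype(float),
])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Group means for comparison: the intercept should match the mean of "a",
# and each dummy coefficient the difference from that reference mean.
means = {c: y[cats == c].mean() for c in "abc"}
print(coef, means)
```

Here the intercept recovers the mean response of the reference category, and each dummy coefficient recovers the gap between its category mean and the reference mean.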
Decision tree algorithms, random forests, and gradient-boosted trees (such as XGBoost, LightGBM, and CatBoost) split features based on threshold comparisons. These models can often work with integer-encoded categorical variables even though the integer order is arbitrary, because a sequence of threshold splits can isolate any individual category or group of categories, although each single split still separates the codes at one threshold. In fact, one-hot encoding can be detrimental for tree-based models: it produces many sparse binary columns, each of which carries very little information individually. The tree algorithm may undervalue these columns relative to continuous features, leading to worse splits. Libraries like LightGBM and CatBoost provide native support for categorical features without requiring one-hot encoding.
Algorithms that rely on distance metrics, such as $k$-nearest neighbors, $k$-means clustering, and Gaussian mixture models, also benefit from one-hot encoding. With one-hot vectors, the distance between two observations differing only in a categorical attribute is the same regardless of which two categories are involved, preserving the symmetry of the original problem. Naive Bayes classifiers can also work directly with one-hot encoded features under the multinomial or Bernoulli assumption, although they typically use raw counts or indicator variables rather than full one-hot tables.
The dummy variable trap is a form of perfect multicollinearity that arises when all $k$ one-hot encoded columns are included as predictors in a linear model. Because the columns always sum to 1, any one column can be perfectly predicted from the remaining $k - 1$ columns. This makes the design matrix singular, preventing the model from computing unique coefficient estimates.
The standard solution is to drop one of the $k$ binary columns, producing $k - 1$ "dummy variables." The dropped category becomes the reference category, and the model's coefficients for the remaining categories are interpreted relative to it. In Python, both pandas get_dummies(drop_first=True) and scikit-learn's OneHotEncoder(drop='first') support this directly.
The dummy variable trap is specific to linear models with an intercept term. Tree-based models and neural networks (which typically use regularization) are not affected by this issue. In regularized regressions such as ridge or lasso, dropping a category is not strictly required either, because the L2 or L1 penalty resolves the singularity by shrinking redundant coefficients toward zero.
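The rank deficiency behind the dummy variable trap is easy to observe numerically. This is a minimal sketch with made-up category codes; it compares the rank of a design matrix that keeps all $k$ one-hot columns next to an intercept with one where the first category has been dropped.

```python
import numpy as np

codes = np.array([0, 1, 2, 0, 2, 1])               # integer codes for k = 3 categories
dummies = np.eye(3)[codes]                         # full one-hot block, shape (6, 3)
intercept = np.ones((len(codes), 1))

full = np.hstack([intercept, dummies])             # intercept + all k columns
reduced = np.hstack([intercept, dummies[:, 1:]])   # first category dropped

rank_full = np.linalg.matrix_rank(full)            # fewer than the 4 columns present
rank_reduced = np.linalg.matrix_rank(reduced)      # equals the 3 columns present
print(full.shape[1], rank_full, reduced.shape[1], rank_reduced)
```

The full matrix has four columns but only rank three, because the one-hot columns sum to the intercept column. The reduced matrix has full column rank, so least squares has a unique solution.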
One-hot encoding can dramatically increase the number of features in a dataset. A single categorical column with 1,000 unique values expands into 1,000 binary columns. This phenomenon, related to the curse of dimensionality, creates several practical problems:
When a categorical variable has high cardinality, practitioners typically turn to alternative encoding methods such as target encoding, hash encoding, or entity embedding rather than using one-hot encoding.
One-hot encoded matrices are inherently sparse: in a vector of length $k$, only one element is nonzero. For high-cardinality features, storing the full dense matrix wastes significant memory. Sparse representation formats such as compressed sparse row (CSR) or compressed sparse column (CSC) store only the nonzero entries, reducing memory consumption by orders of magnitude.
Scikit-learn's OneHotEncoder returns a sparse matrix by default (using sparse_output=True), which is compatible with most scikit-learn estimators. Pandas' get_dummies() returns a dense DataFrame by default, though it supports a sparse=True option that uses pandas' SparseArray internally. When working with large datasets or high-cardinality features, using sparse representations is critical for keeping memory usage manageable.
A simple memory comparison illustrates the gap. A dense float64 matrix with one million rows and 10,000 one-hot columns would consume roughly 80 GB of RAM. The same data stored in CSR format with one nonzero per row uses only about 12 MB for the values and column indices (assuming float64 values and 32-bit indices), plus roughly 4 MB for the row-pointer array, which is more than three orders of magnitude smaller. This is why production pipelines almost always carry one-hot data in sparse form until just before it enters a model that requires a dense input.
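The figures in this comparison follow from simple arithmetic. The sketch below assumes 8-byte float64 values and 4-byte int32 indices, which is a typical but not universal CSR layout, and it counts the row-pointer array separately.

```python
n_rows, n_cols = 1_000_000, 10_000
float64_bytes, int32_bytes = 8, 4        # assumed element sizes

dense_bytes = n_rows * n_cols * float64_bytes

# CSR stores one value and one column index per nonzero,
# plus n_rows + 1 row pointers.
nnz = n_rows                             # exactly one nonzero per one-hot row
values_bytes = nnz * float64_bytes
indices_bytes = nnz * int32_bytes
indptr_bytes = (n_rows + 1) * int32_bytes

print(dense_bytes / 1e9)                                    # dense size in GB
print((values_bytes + indices_bytes) / 1e6)                 # values + indices in MB
print((values_bytes + indices_bytes + indptr_bytes) / 1e6)  # full CSR in MB
```

Storing the values in a narrower dtype such as uint8 or bool would shrink the sparse footprint further, since every stored value is simply 1.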
One-hot encoding has historically played an important role in natural language processing (NLP). In its simplest form, each word in a vocabulary is represented as a one-hot vector whose length equals the vocabulary size. A vocabulary of 50,000 words produces 50,000-dimensional vectors with a single 1 per word.
The bag of words model extends one-hot encoding to entire documents. Rather than a single 1 per vector, a bag-of-words vector counts the occurrences of each vocabulary word in a document. This can be viewed as the sum of one-hot vectors for all words in the document. While simple and effective for basic text classification, bag of words shares the sparsity and high dimensionality problems of one-hot encoding.
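The "sum of one-hot vectors" view can be written out directly. The vocabulary and document below are toy assumptions chosen for illustration.

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]     # toy vocabulary (assumed)
index = {word: i for i, word in enumerate(vocab)}
document = "the cat sat on the mat".split()

# One one-hot row per token, then sum over tokens to get word counts.
token_one_hots = np.eye(len(vocab), dtype=int)[[index[w] for w in document]]
bag_of_words = token_one_hots.sum(axis=0)
print(bag_of_words)
```

Word order is lost in the sum, which is the defining simplification of the bag-of-words model.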
One-hot encoding treats every word as equally distant from every other word; the cosine similarity between any two distinct one-hot vectors is zero. This means "king" and "queen" are no more similar than "king" and "banana." This limitation motivated the development of dense word embeddings such as Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and later contextual embeddings from models like BERT and GPT.
Word embeddings map words into a lower-dimensional continuous vector space (typically 100 to 300 dimensions) where semantically similar words are close together. In many neural NLP architectures, the first layer is an embedding layer that effectively learns to transform one-hot input vectors into dense embedding vectors during training. Modern transformer-based language models have made standalone one-hot representations obsolete for most NLP tasks, though one-hot encoding remains the conceptual starting point from which embeddings are derived.
It is useful to view a learned embedding layer as the product of a one-hot input vector and an embedding matrix. If $W \in \mathbb{R}^{V \times d}$ is the embedding matrix for a vocabulary of size $V$ with dimensionality $d$, then the embedding for token $i$ is $e_i^\top W$, which simply selects row $i$ of $W$. In practice, frameworks implement this as a memory lookup rather than a matrix multiplication, but the conceptual identity is the reason embedding layers in PyTorch and TensorFlow accept integer token IDs instead of one-hot tensors.
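The identity between a one-hot matrix product and a row lookup can be verified in a few lines. In this sketch the matrix `W` is filled with random numbers as a stand-in for learned weights, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 3                          # vocabulary size and embedding dimension
W = rng.normal(size=(V, d))          # stand-in for a learned embedding matrix
token_ids = np.array([4, 0, 4, 2])

one_hot = np.eye(V)[token_ids]       # shape (4, V), one row per token
via_matmul = one_hot @ W             # e_i^T W for each token
via_lookup = W[token_ids]            # plain row lookup by integer ID
print(np.allclose(via_matmul, via_lookup))
```

The lookup touches only `d` numbers per token, whereas the explicit product multiplies through all `V` rows, which is why the one-hot form is rarely materialized.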
In standard multi-class classification, each sample belongs to exactly one class, and the target can be represented as a one-hot vector. For instance, in a three-class problem, the label for class 2 would be [0, 1, 0]. When this target vector is paired with a softmax output and the cross-entropy loss, the loss reduces to $-\log p_{y}$, where $p_y$ is the predicted probability of the true class. This is why frameworks such as PyTorch implement CrossEntropyLoss as a function of integer class indices: the one-hot encoding is implicit, and the framework only needs to look up a single output probability per sample.
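The reduction to $-\log p_y$ can be confirmed numerically. The logits below are made-up scores, and the true class is written as the zero-based index 1, which corresponds to "class 2" in the example above.

```python
import numpy as np

logits = np.array([2.0, 0.5, -1.0])      # made-up scores for k = 3 classes
y = 1                                    # true class as a zero-based index
target = np.eye(3)[y]                    # explicit one-hot target [0, 1, 0]

# Numerically stable softmax.
shifted = logits - logits.max()
probs = np.exp(shifted) / np.exp(shifted).sum()

full_loss = -np.sum(target * np.log(probs))   # cross-entropy against the one-hot vector
index_loss = -np.log(probs[y])                # same value from a single lookup
print(full_loss, index_loss)
```

All terms with a zero target vanish from the sum, so only the log-probability of the true class contributes.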
In multi-label classification, a sample can belong to multiple classes simultaneously. The target becomes a "multi-hot" vector where multiple positions can be set to 1. For example, a movie tagged as both "Action" and "Comedy" would have the label vector [1, 0, 1, 0] if the classes are [Action, Drama, Comedy, Horror]. This is sometimes called binary relevance encoding. In scikit-learn, the MultiLabelBinarizer class handles this transformation. Multi-hot targets are typically paired with a sigmoid output and a binary cross-entropy loss applied independently to each label.
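A short sketch of the movie example using scikit-learn's `MultiLabelBinarizer`. Passing `classes` explicitly fixes the column order to match the list in the text; the second sample is an added assumption for illustration.

```python
from sklearn.preprocessing import MultiLabelBinarizer

classes = ["Action", "Drama", "Comedy", "Horror"]
mlb = MultiLabelBinarizer(classes=classes)     # fix the column order explicitly

# Each sample is a collection of labels; the output has one column per class.
targets = mlb.fit_transform([["Action", "Comedy"], ["Drama"]])
print(targets)
```

Without the `classes` argument the binarizer orders columns by sorted label, so the positions would differ from the list used in the text.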
One-hot label vectors place all probability mass on a single class. This sharp encoding can cause neural networks to become overconfident, especially when training data contains some labeling noise. Label smoothing, introduced by Christian Szegedy and colleagues in the Inception-v3 paper (2016), softens the one-hot target by redistributing a small amount of probability mass to the other classes. With smoothing factor $\alpha$, the target for the true class becomes $1 - \alpha$ and the target for each other class becomes $\alpha / (k - 1)$.
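The smoothing rule as stated here can be sketched as follows. The helper name `smooth_labels` is hypothetical. Some formulations instead spread $\alpha$ uniformly over all $k$ classes, including the true one, so the exact values depend on the convention.

```python
import numpy as np

def smooth_labels(y, k, alpha):
    """Soft targets: 1 - alpha on the true class, alpha / (k - 1) elsewhere."""
    targets = np.full((len(y), k), alpha / (k - 1))
    targets[np.arange(len(y)), y] = 1.0 - alpha
    return targets

soft = smooth_labels(np.array([0, 2]), k=3, alpha=0.1)
print(soft)
```

Each row still sums to 1, so the smoothed targets remain valid probability distributions.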
Label smoothing improves calibration of the predicted probabilities, often improves test accuracy, and is now a default option in many large-scale classification pipelines including ImageNet training and machine translation. Other soft-label techniques include knowledge distillation, where a student network is trained to match a teacher network's full output distribution rather than a one-hot label, and mixup, which trains on convex combinations of pairs of samples and their one-hot labels.
Several Python libraries provide built-in support for one-hot encoding.
| Library | Function/class | Key features |
|---|---|---|
| pandas | pd.get_dummies() | Quick and simple; works on DataFrames; supports drop_first and sparse options |
| scikit-learn | OneHotEncoder | Fits and transforms; handles unseen categories (handle_unknown='ignore'); returns sparse matrices; integrates with Pipeline and ColumnTransformer |
| TensorFlow/Keras | tf.one_hot and tf.keras.utils.to_categorical() | Converts integer class labels to one-hot tensors for neural network targets |
| PyTorch | torch.nn.functional.one_hot() | Converts integer tensor to one-hot tensor; useful for loss computation |
| Category Encoders | ce.OneHotEncoder | Drop-in replacement with additional options for handling missing values and rare categories |
The simplest way to one-hot encode a column in a pandas DataFrame is the get_dummies function:
```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "red"]})
encoded = pd.get_dummies(df, columns=["color"], prefix="color")
print(encoded)
#    color_blue  color_green  color_red
# 0       False        False       True
# 1       False         True      False
# 2        True        False      False
# 3       False        False       True
```
Passing drop_first=True produces $k - 1$ dummy columns, suitable for linear regression. Passing sparse=True returns a DataFrame backed by SparseArray columns to save memory on wide encodings.
The scikit-learn OneHotEncoder follows the standard fit-transform API and integrates with the rest of the library:
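```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# dtype=object keeps the fitted categories as Python strings,
# matching the dtype shown in the printed output below.
X = np.array([["red"], ["green"], ["blue"], ["red"]], dtype=object)
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
encoder.fit(X)
print(encoder.categories_)
# [array(['blue', 'green', 'red'], dtype=object)]

# "yellow" was not seen during fit, so it becomes an all-zeros row.
print(encoder.transform(np.array([["green"], ["yellow"]])))
# [[0. 1. 0.]
#  [0. 0. 0.]]
```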
For real workflows, the encoder is typically wrapped in a ColumnTransformer and a Pipeline so that fitting on training data and transforming on new data follows the same code path:
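```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the categorical columns; pass all other columns through unchanged.
pre = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["color", "country"]),
], remainder="passthrough")
model = Pipeline([("pre", pre), ("clf", LogisticRegression())])

# X_train is assumed to be a DataFrame with "color" and "country" columns
# (plus numeric features), and y_train the matching labels.
model.fit(X_train, y_train)
```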
In TensorFlow, integer class indices are converted to one-hot tensors with tf.one_hot:
```python
import tensorflow as tf

labels = tf.constant([0, 2, 1])
one_hot = tf.one_hot(labels, depth=3)
print(one_hot)
# tf.Tensor(
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]], shape=(3, 3), dtype=float32)
```
The Keras helper tf.keras.utils.to_categorical performs the same operation on NumPy arrays and is widely used to prepare classification targets before calling model.fit.
PyTorch exposes a similar utility under the functional namespace:
```python
import torch
import torch.nn.functional as F

labels = torch.tensor([0, 2, 1])
one_hot = F.one_hot(labels, num_classes=3)
print(one_hot)
# tensor([[1, 0, 0],
#         [0, 0, 1],
#         [0, 1, 0]])
```
For most classification training loops, however, PyTorch users skip the explicit one-hot conversion and feed the integer label tensor directly into nn.CrossEntropyLoss, which combines log-softmax and negative log-likelihood without ever materializing the one-hot vector.
A common challenge in production systems is encountering categories during inference that were not present in the training data. Different tools handle this differently:
- get_dummies() does not retain knowledge of the training categories. New categories produce extra columns, and missing categories lose their columns entirely, causing shape mismatches. This makes it unsuitable for production pipelines without additional safeguards.
- OneHotEncoder provides the handle_unknown parameter. Setting it to 'ignore' produces an all-zeros row for unseen categories. Setting it to 'infrequent_if_exist' maps unseen categories to an infrequent category bin. The encoder must be fitted on training data and persisted (for example, with joblib) for use at inference time.
Designing a robust encoding strategy that gracefully handles unseen categories is critical for deploying machine learning models in real-world applications.
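One possible safeguard for the pandas case is to store the training column layout and align new data to it with `DataFrame.reindex`. This is a sketch of that idea on made-up data, not a recommendation drawn from the sources above.

```python
import pandas as pd

train = pd.DataFrame({"color": ["red", "green", "blue"]})
new = pd.DataFrame({"color": ["green", "yellow"]})   # "yellow" was never seen in training

train_encoded = pd.get_dummies(train, columns=["color"], dtype=int)
train_columns = train_encoded.columns                # persist this alongside the model

# Align inference-time columns to the training layout: columns for unseen
# categories are dropped, and columns missing from the new data are zero-filled.
new_encoded = (
    pd.get_dummies(new, columns=["color"], dtype=int)
    .reindex(columns=train_columns, fill_value=0)
)
print(new_encoded)
```

The unseen value ends up as an all-zeros row, which mirrors what OneHotEncoder does with handle_unknown='ignore'.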
When one-hot encoding is impractical due to high cardinality or other constraints, several alternative encoding techniques are available.
| Method | Description | Pros | Cons |
|---|---|---|---|
| Target encoding | Replaces each category with the mean of the target variable for that category | Compact (single column); captures target relationship | Prone to overfitting; requires careful regularization |
| Hash encoding | Applies a hash function to categories and maps them to a fixed number of columns | Fixed output dimensionality; handles unseen categories naturally | Hash collisions can mix unrelated categories |
| Entity embedding | Learns dense vector representations for categories via a neural network embedding layer | Captures relationships between categories; compact | Requires a neural network; more complex to implement |
| Binary encoding | Converts category index to binary digits, each digit becoming a column | Fewer columns than one-hot (about $\lceil \log_2 k \rceil$ for $k$ categories) | Introduces artificial ordering in bit patterns |
| Frequency encoding | Replaces each category with its frequency or proportion in the dataset | Single column; no dimensionality increase | Categories with similar frequencies become indistinguishable |
| Leave-one-out encoding | Variant of target encoding that excludes the current sample's target when computing the category mean | Reduces target leakage compared with naive target encoding | Still prone to overfitting on small categories |
| Weight of evidence encoding | Replaces category with $\log(P(X \mid y=1) / P(X \mid y=0))$ | Common in credit scoring; monotone in predicted probability | Limited to binary classification; sensitive to small samples |
Entity embeddings, introduced by Guo and Berkhahn (2016), have become particularly popular for high-cardinality features in deep learning. They transform categories into trainable dense vectors, analogous to how word embeddings work in NLP. In their original Kaggle Rossmann store sales experiment, entity embeddings combined with a simple multilayer perceptron outperformed gradient-boosted trees on tabular data with hundreds of store and product categories.
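To make the hash encoding row of the table concrete, here is a minimal sketch of the hashing trick. The helper `hash_encode`, the choice of MD5 as a stable hash, and the bucket count are all illustrative assumptions; library implementations differ in details such as signed hashing.

```python
import hashlib
import numpy as np

def hash_encode(values, n_buckets):
    """One indicator column per hash bucket (hypothetical helper, unsigned variant)."""
    out = np.zeros((len(values), n_buckets), dtype=int)
    for row, value in enumerate(values):
        # md5 serves only as a stable, process-independent hash, not for security.
        digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
        out[row, int(digest, 16) % n_buckets] = 1
    return out

encoded = hash_encode(["10115", "94103", "10115", "never-seen-before"], n_buckets=8)
print(encoded.shape)
```

The output width stays at `n_buckets` no matter how many distinct values appear, and a previously unseen value still lands in some bucket. The cost is that unrelated values may collide in the same column.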
A practitioner deciding how to encode a feature like postal_code, which might contain tens of thousands of distinct values, can refer to the following rough guidelines.
| Cardinality | Linear models | Tree models | Neural networks |
|---|---|---|---|
| Up to ~10 | One-hot (or dummy) | One-hot or integer | One-hot or embedding (small) |
| 10 to ~50 | One-hot | Integer or one-hot | Embedding (small) |
| 50 to ~1,000 | Target or hash | Native categorical or target | Embedding |
| 1,000+ | Hash or target | Native categorical | Embedding |
These suggestions are not absolute rules. The right choice depends on dataset size, the target signal in each category, training time budget, and downstream interpretability requirements.
Several practical pitfalls catch practitioners off guard when applying one-hot encoding in real projects.
Fitting an encoder on the full dataset before splitting into training and validation sets can leak information about the validation distribution into the model. If a category appears only in validation but is present in the encoder's vocabulary fitted on the full dataset, the model implicitly knows the category exists. The correct workflow fits the encoder only on the training partition and uses handle_unknown='ignore' (or an explicit "other" bucket) at inference time.
Categories with very few observations contribute weak signal and increase the risk of overfitting. Common strategies include grouping rare levels into a single "other" bucket, removing them entirely if domain knowledge allows, or using scikit-learn's min_frequency and max_categories parameters in OneHotEncoder (added in version 1.1) to fold infrequent levels automatically.
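The manual "other" bucket strategy might look like the sketch below. The data and the count threshold of 2 are arbitrary assumptions; in practice the set of rare levels would be determined on the training partition only.

```python
import pandas as pd

s = pd.Series(["a"] * 5 + ["b"] * 4 + ["c", "d"], name="level")
counts = s.value_counts()
rare = counts[counts < 2].index          # threshold of 2 is an arbitrary choice

# Keep frequent levels as they are and replace rare ones with "other".
grouped = s.where(~s.isin(rare), "other")
encoded = pd.get_dummies(grouped, prefix="level", dtype=int)
print(list(encoded.columns))
```

Four original levels collapse into three columns, with the two singleton levels sharing the "other" column.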
Missing categorical values can be encoded as their own dedicated column (essentially treating NaN as a valid category) or imputed with the most frequent value before encoding. The first approach is preferable when missingness itself is informative. The scikit-learn encoder treats np.nan as a separate category by default, while pandas' get_dummies ignores NaN unless dummy_na=True is passed.
Even outside the strict dummy variable trap, near-collinear dummy columns can inflate the variance of estimated coefficients in linear models. Statisticians often check the variance inflation factor (VIF) for each dummy and consider dropping or combining levels with high VIF. This concern does not apply to regularized linear models, tree models, or neural networks.
A classic deployment bug occurs when the training pipeline uses pandas get_dummies and the serving pipeline reconstructs columns by hand or in a different order. The model receives column 7 instead of column 5 and silently produces wrong predictions. Persisting the fitted scikit-learn encoder (or another stateful encoder) and using the same object at training and inference time prevents this class of bug.
When the target variable itself is ordinal (for example, star ratings from 1 to 5), one-hot encoding the target throws away the ordering information. Two specialized alternatives are common. Cumulative encoding represents class $c$ as a vector of 1s in positions $1$ through $c$ followed by 0s in higher positions; this preserves order and is often used with the proportional odds model. The CORAL framework (Cao et al., 2020) extends this idea to neural networks and yields rank-consistent ordinal classifiers. Both approaches use a non-trivial generalization of one-hot encoding rather than the standard form.
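Cumulative encoding as described here can be sketched with a single broadcasted comparison. The helper name `cumulative_encode` is hypothetical. Many implementations drop the first position, which is always 1, and keep only one column per threshold between adjacent levels.

```python
import numpy as np

def cumulative_encode(ratings, n_levels):
    """Rating r (1-based) -> ones in positions 1..r, zeros above (hypothetical helper)."""
    positions = np.arange(1, n_levels + 1)
    # Entry (i, j) is 1 when position j is at or below rating i.
    return (positions[None, :] <= np.asarray(ratings)[:, None]).astype(int)

encoded = cumulative_encode([1, 3, 5], n_levels=5)
print(encoded)
```

Unlike one-hot targets, neighboring ratings differ in only one position, so the encoding reflects how far apart two ratings are.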
In classical statistics, the term dummy variable is essentially synonymous with one-hot encoding minus one column. R's model.matrix function applies a contrast scheme (contr.treatment by default) that produces the same $k - 1$ dummies as OneHotEncoder(drop='first'). Other contrast schemes such as deviation coding, Helmert coding, and orthogonal polynomial coding produce different sets of $k - 1$ columns that are linear combinations of the standard dummies. These richer encodings can produce more interpretable coefficients in analysis-of-variance contexts but generally do not change the predictive performance of the model.
A 2018 study by Cerda, Varoquaux, and Kegl on "dirty" categorical variables found that one-hot encoding remained competitive on datasets with up to a few hundred categories but lost significantly to similarity encoding and entity embeddings on noisier real-world datasets where category strings contain typos and inconsistent capitalization. Pargent and colleagues' 2022 study compared regularized target encoding with one-hot, hash, frequency, and ordinal encodings across 24 OpenML benchmark datasets and reported that regularized target encoding outperformed all alternatives on high-cardinality features for both linear and tree-based models. These results match the rough guideline that one-hot is the safest choice up to a few dozen categories, after which target encoding or learned embeddings tend to dominate.
One-hot encoding shows up across many machine learning domains, each with its own conventions.
In tabular data tasks such as Kaggle competitions and business analytics, one-hot encoding is applied to features such as country, department, gender, payment method, and product category. Combined with continuous features, the resulting design matrix is fed to logistic regression, gradient-boosted trees, or a small multilayer perceptron. Tools such as scikit-learn's ColumnTransformer make it convenient to apply one-hot encoding to categorical columns while leaving numerical columns untouched.
In computer vision, one-hot encoding mostly appears on the output side. Image classification networks produce class probabilities through a softmax over $k$ classes, and the training target for each image is the one-hot indicator vector of its true class. Datasets such as ImageNet (1,000 classes) and CIFAR-10 (10 classes) follow this convention. Object detection and segmentation extend the idea to per-pixel or per-anchor one-hot targets.
NLP uses one-hot encoding for tokens at the conceptual level, but in practice token IDs are stored as integers and turned into dense embeddings on the fly. The output side of language models (next-token prediction) computes a softmax over vocabulary size $V$, and the training target is the one-hot vector of the actual next token. This is mathematically equivalent to standard cross-entropy training even though the one-hot is never materialized in memory.
In reinforcement learning, discrete action spaces are commonly represented with one-hot vectors. A policy network outputs a softmax over actions, and the action chosen during exploration is encoded as a one-hot vector for use in policy gradient updates. State spaces with discrete components (for example, the type of a unit in a strategy game) are also frequently one-hot encoded before being concatenated with continuous state features.
In recommendation, user IDs and item IDs are conceptually one-hot vectors. Classical matrix factorization can be viewed as decomposing a one-hot user vector and a one-hot item vector through learned embedding matrices. Modern recommender architectures such as deep learning recommendation models (DLRM) keep this structure: dense numerical features pass through an MLP, while categorical features are looked up through embedding tables that are mathematically equivalent to multiplying a one-hot vector by an embedding matrix.
In genomics, DNA sequences are represented as one-hot encoded matrices over the alphabet $\{A, C, G, T\}$, producing a $4 \times L$ matrix for a sequence of length $L$. Convolutional neural networks operating on this matrix have become standard for tasks such as transcription factor binding prediction (DeepBind, Alipanahi et al., 2015) and chromatin accessibility prediction (DeepSEA, Zhou and Troyanskaya, 2015). Protein sequences are similarly one-hot encoded over the 20-letter amino acid alphabet, although learned embeddings from large protein language models such as ESM have largely supplanted plain one-hot input in recent years.
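A minimal sketch of the $4 \times L$ DNA encoding follows. The helper `encode_dna` and the row order A, C, G, T are assumptions for illustration, and ambiguous bases such as N are not handled (they raise a ValueError here).

```python
import numpy as np

ALPHABET = "ACGT"

def encode_dna(seq):
    """Return a 4 x L one-hot matrix with rows ordered A, C, G, T (hypothetical helper)."""
    idx = np.array([ALPHABET.index(base) for base in seq.upper()])
    # Entry (r, j) is 1 when base j of the sequence is the r-th letter of the alphabet.
    return (np.arange(4)[:, None] == idx[None, :]).astype(int)

m = encode_dna("GATTACA")
print(m)
```

Each column contains exactly one 1, so a convolution over the length axis sees one active channel per sequence position.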
Imagine you have a box of different colored balls: red, blue, and green. You want to tell a robot which color ball to pick up, but the robot only understands numbers, not colors. So you come up with a plan: you create a small chart with three columns, one for each color. When you want the robot to pick up a red ball, you put a 1 in the red column and 0s in the other columns. For a blue ball, you put a 1 in the blue column and 0s elsewhere, and the same for green. This way, you have turned colors (categories) into a set of numbers (binary columns) that the robot can understand. That is how one-hot encoding helps machine learning algorithms work with categories.
One-hot encoding is the workhorse method for turning small to mid-sized categorical features into numeric input that any machine learning algorithm can consume. Its strength is its simplicity and the fact that it imposes no spurious order on the categories. Its weakness is the explosion in dimensionality when cardinality grows, which can hurt both memory usage and statistical efficiency. For high-cardinality features, target encoding, hash encoding, and learned embeddings have largely taken over. For everything else, one-hot encoding remains a sound default and is built into every major data science library.