In machine learning and statistics, a feature is an individual measurable property or characteristic of a phenomenon being observed. Features serve as the input variables that a model uses to learn patterns and make predictions. The concept goes by several other names depending on the field: variable in statistics, attribute in database systems, predictor or covariate in regression analysis, and input or independent variable in experimental design. Selecting informative, discriminating, and independent features is one of the most important steps in building effective models for pattern recognition, classification, and regression.
Imagine you are playing a guessing game where your friend has to figure out which animal you are thinking of. You give clues like "it has fur," "it is really big," and "it lives in the ocean." Each of those clues is like a feature. In machine learning, we give a computer a bunch of clues (features) about something, and the computer uses those clues to figure out the answer. The better and more useful the clues are, the easier it is for the computer to guess correctly.
Given a dataset with n observations, each observation can be described by a set of d measurable properties. Each such property is a feature. Formally, if an observation is represented as a vector x = (x₁, x₂, ..., x_d), then each x_i is a feature. The complete set of features used by a model is often called the feature set, and the vector x is called the feature vector. The space spanned by all possible feature vectors is the feature space, a d-dimensional space where each axis corresponds to one feature.
Features can be classified along several axes. The table below summarizes the major categories.
| Type | Subtype | Description | Examples |
|---|---|---|---|
| Numerical | Continuous | Takes any real value within a range | Height (1.65 m), temperature (36.7 °C), income ($52,000) |
| Numerical | Discrete | Takes countable integer values | Number of children (3), word count (450), page views (12,000) |
| Categorical | Nominal | Unordered categories with no intrinsic ranking | Color (red, blue, green), country (USA, Japan, Brazil) |
| Categorical | Ordinal | Ordered categories with a meaningful ranking | Education level (high school < bachelor's < master's < PhD), satisfaction (low < medium < high) |
| Binary | - | A special case of categorical with exactly two values | Spam or not spam (0/1), male or female (0/1) |
| Text | - | Natural language strings requiring tokenization | Product reviews, tweets, medical notes |
Most algorithms can use numerical features directly, while categorical features typically require encoding (such as one-hot encoding or label encoding) before they can be fed into a model. Text features usually require further preprocessing, such as conversion into a bag-of-words matrix or a TF-IDF representation.
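As a minimal sketch of these conversions (assuming pandas and scikit-learn; the column names and values are hypothetical), the snippet below one-hot encodes a nominal column and turns a text column into a TF-IDF matrix.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical raw data with a numerical, a categorical, and a text column
df = pd.DataFrame({
    "age": [34, 28, 45],
    "color": ["red", "blue", "red"],
    "review": ["great product", "poor quality", "great value"],
})

# One-hot encode the nominal "color" column; "age" passes through unchanged
encoded = pd.get_dummies(df[["age", "color"]], columns=["color"])
print(encoded)

# Convert the text column into a TF-IDF matrix (one column per vocabulary term)
tfidf = TfidfVectorizer()
text_features = tfidf.fit_transform(df["review"])
print(tfidf.get_feature_names_out())
print(text_features.toarray().round(2))
```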
A feature vector is a d-dimensional vector of numerical values that represents a single observation. For example, a house might be represented by the feature vector (3, 2, 1500, 1), corresponding to 3 bedrooms, 2 bathrooms, 1500 square feet, and a binary indicator for having a garage.
The feature space is the geometric space defined by all possible feature vectors. Each feature corresponds to one axis, and each data point occupies a position in this space. Many machine learning algorithms, including k-nearest neighbors, support vector machines, and k-means clustering, operate by computing distances or boundaries within the feature space. The structure of the feature space therefore has a direct impact on model performance. The table below shows two houses represented as points in a four-dimensional feature space.
| Feature | House A | House B |
|---|---|---|
| Bedrooms | 3 | 2 |
| Bathrooms | 2 | 2 |
| Square footage | 1,500 | 1,100 |
| Has garage | 1 | 0 |
| Price (label) | $800,000 | $500,000 |
In this example, the first four columns are features and the last column is the label (target variable) the model is trained to predict.
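To make the feature-space idea concrete, the short sketch below (using NumPy and the house vectors from the table) computes the Euclidean distance between the two houses. Note how the unscaled square-footage feature dominates the distance, which foreshadows the need for feature scaling discussed later.

```python
import numpy as np

# Feature vectors from the table: bedrooms, bathrooms, square footage, garage
house_a = np.array([3, 2, 1500, 1])
house_b = np.array([2, 2, 1100, 0])

# Euclidean distance in the raw (unscaled) feature space
distance = np.linalg.norm(house_a - house_b)
print(distance)  # ~400.0, driven almost entirely by square footage
```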
Feature engineering is the process of using domain knowledge to create, transform, or select features that make machine learning algorithms work more effectively. It is widely regarded as one of the most impactful steps in the modeling pipeline, and skilled feature engineering can often improve model accuracy more than switching to a more complex algorithm.
Common feature engineering techniques include binning continuous variables, applying log or power transforms to skewed distributions, decomposing timestamps into components such as hour and day of week, creating ratios and aggregations from existing columns, and encoding categorical variables, as sketched below.
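A brief sketch of a few of these transforms, assuming pandas and NumPy; the transaction columns here are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical transaction data used to illustrate a few common transforms
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-05 09:30", "2024-01-06 22:15"]),
    "amount": [120.0, 9800.0],
    "items": [3, 7],
})

# Decompose the timestamp into simpler numeric features
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek

# Log-transform a heavily skewed amount and build a ratio feature
df["log_amount"] = np.log1p(df["amount"])
df["amount_per_item"] = df["amount"] / df["items"]

print(df.drop(columns="timestamp"))
```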
Feature selection is the process of identifying and retaining only the most relevant features for a given modeling task, discarding those that are redundant or irrelevant. Reducing the number of features can improve model accuracy, decrease training time, and enhance interpretability.
Three broad families of methods exist.
| Method family | How it works | Examples |
|---|---|---|
| Filter methods | Rank features using statistical measures independent of any model | Pearson correlation, chi-squared test, mutual information, variance threshold |
| Wrapper methods | Evaluate subsets of features by training a model and measuring performance | Forward selection, backward elimination, recursive feature elimination (RFE) |
| Embedded methods | Perform feature selection as part of the model training process | LASSO (L1 regularization), random forest feature importance, gradient boosting importance scores |
Filter methods are computationally cheap but ignore feature interactions. Wrapper methods capture interactions but are expensive. Embedded methods offer a practical middle ground for many applications.
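As an illustration of the three families, the sketch below (assuming scikit-learn and its built-in breast cancer dataset) applies a filter method, a wrapper method, and an embedded method to the same data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # scale so the linear models behave well

# Filter: rank features by mutual information with the target, keep the top 10
filter_sel = SelectKBest(score_func=mutual_info_classif, k=10).fit(X, y)

# Wrapper: recursive feature elimination around a logistic regression model
wrapper_sel = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y)

# Embedded: L1-regularized logistic regression zeroes out irrelevant coefficients
embedded = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)

print("Filter keeps: ", filter_sel.get_support().sum(), "features")
print("Wrapper keeps:", wrapper_sel.get_support().sum(), "features")
print("L1 keeps:     ", int((embedded.coef_ != 0).sum()), "features")
```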
Feature extraction transforms raw data into a new, typically lower-dimensional set of features that retains the most important information. Unlike feature selection, which picks a subset of existing features, feature extraction creates entirely new features.
Prominent techniques include principal component analysis (PCA), linear discriminant analysis (LDA), autoencoders, and manifold methods such as t-SNE and UMAP.
Feature extraction is a core component of dimension reduction and is especially valuable when the original feature space is very high-dimensional.
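A minimal sketch of feature extraction with scikit-learn: projecting the 64-pixel digits dataset onto ten principal components that retain much of the original variance.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)  # 64 original pixel features per image

# Standardize, then extract 10 new features as linear combinations of pixels
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=10).fit(X_scaled)
X_reduced = pca.transform(X_scaled)

print(X.shape, "->", X_reduced.shape)  # (1797, 64) -> (1797, 10)
print("Variance retained:", pca.explained_variance_ratio_.sum().round(3))
```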
Feature importance quantifies how much each feature contributes to a model's predictions. Understanding feature importance helps with model interpretability, debugging, and further feature selection.
| Method | Description | Scope |
|---|---|---|
| Gini importance (MDI) | Measures average reduction in impurity across tree splits | Model-specific (tree-based) |
| Permutation importance | Measures drop in model performance when a feature's values are randomly shuffled | Model-agnostic |
| SHAP values | Based on Shapley values from cooperative game theory; assigns each feature a contribution to each individual prediction | Model-agnostic |
| Coefficient magnitude | In linear models, the absolute value of a feature's coefficient (after scaling) indicates importance | Model-specific (linear) |
| LIME | Builds a local interpretable model around a single prediction to estimate feature contributions | Model-agnostic |
Permutation importance and SHAP are particularly popular because they work with any model type. SHAP provides both global importance (averaged across all predictions) and local importance (for a single prediction), making it a versatile tool for model explanation.
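A sketch of model-agnostic permutation importance using scikit-learn, assuming a random forest fit on the built-in diabetes regression dataset.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0
)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Shuffle each feature on held-out data and measure the drop in R^2
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for name, score in sorted(zip(data.feature_names, result.importances_mean),
                          key=lambda pair: -pair[1]):
    print(f"{name:>5s}: {score:.3f}")
```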
Many machine learning algorithms, especially those that rely on distance calculations (such as k-nearest neighbors and support vector machines) or gradient-based optimization (such as neural networks and logistic regression), are sensitive to the scale of input features. Normalization and standardization bring features to a comparable scale.
| Technique | Formula | Output range | When to use |
|---|---|---|---|
| Min-max scaling | x' = (x - x_min) / (x_max - x_min) | [0, 1] | When data has no significant outliers and a bounded range is needed |
| Z-score standardization | x' = (x - mean) / std | Unbounded (mean = 0, std = 1) | When data may contain outliers; required by many linear models and neural networks |
| Robust scaling | x' = (x - median) / IQR | Unbounded | When data contains many outliers |
| Unit vector (L2 norm) | x' = x / ‖x‖ | Unit length | When only the direction of the feature vector matters (for example, in text classification with TF-IDF) |
Tree-based algorithms like decision trees, random forests, and gradient boosting are generally invariant to feature scaling because they make split decisions based on thresholds rather than distances.
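A short sketch of min-max scaling and z-score standardization applied to the house features from the earlier table (assuming scikit-learn; the same transforms can be written by hand directly from the formulas above).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Rows are observations, columns are features: bedrooms, bathrooms, sqft, garage
X = np.array([[3, 2, 1500, 1],
              [2, 2, 1100, 0],
              [4, 3, 2200, 1]], dtype=float)

print(MinMaxScaler().fit_transform(X))    # every column mapped to [0, 1]
print(StandardScaler().fit_transform(X))  # every column: mean 0, std 1
```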
A feature interaction occurs when the combined effect of two or more features on the target variable differs from the sum of their individual effects. Capturing interactions can significantly improve model performance for algorithms that do not inherently model them (such as linear regression).
Polynomial features expand the feature space by generating all polynomial combinations of features up to a specified degree. For two features a and b, degree-2 polynomial expansion produces: 1, a, b, a², ab, b². The interaction term ab captures the joint effect of the two features.
Feature crosses are a related technique used primarily with categorical features. A feature cross combines two or more categorical features into a single composite feature. For example, crossing "city" and "device type" creates a new feature "city_device" that captures location-specific device preferences.
Polynomial and interaction features can dramatically increase the dimensionality of the dataset. Careful use of regularization and feature selection is recommended to avoid overfitting when employing these techniques.
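A sketch of degree-2 polynomial expansion with scikit-learn, reproducing the 1, a, b, a², ab, b² terms described above, followed by a simple string-based feature cross for two hypothetical categorical columns.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Degree-2 expansion of two numerical features a and b
X = np.array([[2.0, 3.0]])
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))                   # [[1. 2. 3. 4. 6. 9.]]
print(poly.get_feature_names_out(["a", "b"]))  # 1, a, b, a^2, a b, b^2

# A feature cross of two hypothetical categorical columns
df = pd.DataFrame({"city": ["Tokyo", "Paris"], "device": ["mobile", "desktop"]})
df["city_device"] = df["city"] + "_" + df["device"]
print(df)
```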
As the number of features grows, the volume of the feature space increases exponentially. This phenomenon, known as the curse of dimensionality (a term coined by Richard Bellman in 1961), creates several problems: the available data becomes increasingly sparse relative to the space it must cover, distances between points become less informative because all points start to look roughly equidistant, models overfit more easily, and computation and storage costs rise. The short simulation below illustrates the distance problem.
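A small NumPy simulation of distance concentration: as the dimensionality grows, the nearest and farthest neighbors of a random query point become almost indistinguishable.

```python
import numpy as np

rng = np.random.default_rng(0)

for d in [2, 10, 100, 1000]:
    points = rng.random((1000, d))  # 1,000 random points in the unit hypercube
    query = rng.random(d)
    distances = np.linalg.norm(points - query, axis=1)
    # Relative gap between the farthest and nearest point shrinks as d grows
    contrast = (distances.max() - distances.min()) / distances.min()
    print(f"d={d:>4}: relative distance contrast = {contrast:.2f}")
```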
The primary remedies are dimension reduction (via feature selection or feature extraction), regularization techniques (L1 and L2 penalties), and collecting more training data.
Features can be characterized by how many of their values are nonzero.
| Property | Sparse features | Dense features |
|---|---|---|
| Definition | Vectors where most elements are zero | Vectors where most or all elements are nonzero |
| Typical representation | One-hot encoding, bag-of-words, TF-IDF | Word embeddings, neural network hidden states |
| Dimensionality | Often very high (thousands to millions) | Typically low to moderate (50 to 1024) |
| Interpretability | High; each dimension usually corresponds to a specific known feature | Lower; dimensions are learned and may not have obvious meanings |
| Storage | Efficient with sparse matrix formats (CSR, CSC) | Requires full matrix storage |
| Semantic capture | Limited; does not encode relationships between features | Strong; similar items have similar vectors |
In natural language processing, sparse representations like bag-of-words have been largely supplanted by dense embeddings produced by models such as Word2Vec, GloVe, and BERT for most downstream tasks, though sparse representations remain useful in information retrieval and certain hybrid search architectures.
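A short sketch contrasting a sparse bag-of-words matrix with a dense embedding-style vector (assumes scikit-learn; the 4-dimensional "embedding" values are made up purely for illustration, not produced by any real model).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]

# Sparse representation: one column per vocabulary word, stored as a CSR matrix
bow = CountVectorizer().fit_transform(docs)
print(type(bow).__name__, bow.shape)
print(bow.toarray())

# Dense representation: a low-dimensional vector per document
dense = np.array([[0.12, -0.48, 0.33, 0.90],
                  [0.10, -0.52, 0.41, 0.87]])
print(dense.shape)
```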
Traditional machine learning relies on handcrafted features: domain experts manually design and extract features from raw data before training a model. This approach requires significant expertise and effort, and the resulting features may not capture all relevant patterns.
Representation learning (also called feature learning) automates this process. Deep learning models, particularly convolutional neural networks (CNNs) and transformers, learn hierarchical feature representations directly from raw data during training. Early layers typically learn low-level features (edges, textures in images; character n-grams in text), while deeper layers learn increasingly abstract, high-level features (object parts, semantic concepts).
| Aspect | Handcrafted features | Learned features |
|---|---|---|
| Creation | Designed manually by domain experts | Learned automatically during model training |
| Domain knowledge required | High | Low (though architecture design still requires expertise) |
| Adaptability | Fixed once designed; must be redesigned for new domains | Adapt to data; can transfer across tasks via transfer learning |
| Performance ceiling | Limited by the engineer's insight | Can discover patterns humans might miss |
| Interpretability | Generally high | Often low ("black box" representations) |
| Data requirements | Works with smaller datasets | Typically requires large datasets to learn effective representations |
The success of deep learning in computer vision, natural language processing, and speech recognition is largely attributed to its ability to learn powerful feature representations without manual engineering. Techniques like transfer learning allow features learned on large datasets (such as ImageNet for vision or large text corpora for language models) to be reused for related tasks with limited data.
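A hedged sketch of reusing learned features via transfer learning, assuming PyTorch and torchvision (version 0.13 or later for the weights API): a pretrained ResNet-18 backbone is used purely as a fixed feature extractor that maps each image to a 512-dimensional learned feature vector.

```python
import torch
from torchvision import models

# Load a CNN pretrained on ImageNet and drop its classification head
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

for p in backbone.parameters():  # freeze the learned feature extractor
    p.requires_grad = False

# A batch of 4 RGB images (224x224) -> 512-dimensional learned feature vectors
images = torch.randn(4, 3, 224, 224)
with torch.no_grad():
    features = backbone(images)
print(features.shape)  # torch.Size([4, 512])
```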
The nature and design of features varies significantly across application areas.
| Domain | Typical features | Notes |
|---|---|---|
| Computer vision | Pixel intensities, edge histograms (HOG), SIFT descriptors, CNN activations | Deep learning has largely replaced hand-designed visual features |
| Natural language processing | Bag-of-words, TF-IDF, n-grams, word embeddings, contextual embeddings | Transformer-based models learn contextualized features |
| Speech recognition | Mel-frequency cepstral coefficients (MFCCs), spectrograms, filter banks | Modern end-to-end models learn features from raw audio |
| Tabular data | Numerical columns, encoded categorical columns, engineered ratios and aggregations | Feature engineering remains highly impactful for tabular data |
| Recommender systems | User demographics, item attributes, interaction history, collaborative signals | Hybrid features combining content and behavior are common |
| Bioinformatics | Gene expression levels, protein sequence motifs, molecular descriptors | High-dimensional and often sparse |
The table below summarizes the key concepts covered in this article.

| Concept | Definition |
|---|---|
| Feature | A measurable input property used by a model |
| Feature vector | A numerical vector representing one observation |
| Feature space | The multidimensional space of all possible feature vectors |
| Feature engineering | Creating and transforming features using domain knowledge |
| Feature selection | Choosing the most relevant subset of features |
| Feature extraction | Deriving new (often lower-dimensional) features from raw data |
| Feature importance | Quantifying each feature's contribution to predictions |
| Feature scaling | Normalizing features to a common scale |
| Feature interaction | Combined effect of multiple features that differs from their individual effects |
| Curse of dimensionality | Problems arising from having too many features relative to data |