# Feature

> Source: https://aiwiki.ai/wiki/feature
> Updated: 2026-06-21
> Categories: Data & Datasets, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

In [machine learning](/wiki/machine_learning) and statistics, a **feature** is an individual measurable property or characteristic of a phenomenon being observed, used as an input variable from which a model learns patterns and makes predictions.[1] In a dataset, the features are the columns the model reads, and the target it predicts is the label. The choice of features is widely considered the single biggest driver of model quality: in his 2012 survey, machine learning researcher Pedro Domingos wrote, "At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used."[11]

The term has several synonyms depending on the field: **variable** in statistics, **attribute** in database systems, **predictor** or **covariate** in regression analysis, and **input** or **independent variable** in experimental design.[2] Selecting informative, discriminating, and independent features is one of the most important steps in building effective models for pattern recognition, classification, and regression.[3]

## Explain like I'm 5 (ELI5)

Imagine you are playing a guessing game where your friend has to figure out which animal you are thinking of. You give clues like "it has fur," "it is really big," and "it lives in the ocean." Each of those clues is like a feature. In machine learning, we give a computer a bunch of clues (features) about something, and the computer uses those clues to figure out the answer. The better and more useful the clues are, the easier it is for the computer to guess correctly.

## Formal definition

Given a dataset with *n* observations, each observation can be described by a set of *d* measurable properties. Each such property is a feature. Formally, if an observation is represented as a vector **x** = (x₁, x₂, ..., x_d), then each x_i is a feature.[1] The complete set of features used by a model is often called the [feature set](/wiki/feature_set), and the vector **x** is called the [feature vector](/wiki/feature_vector). The space spanned by all possible feature vectors is the **feature space**, a *d*-dimensional space where each axis corresponds to one feature.

## Types of features

Features can be classified along several axes. The table below summarizes the major categories.

| Type | Subtype | Description | Examples |
|---|---|---|---|
| [Numerical](/wiki/numerical_data) | Continuous | Takes any real value within a range | Height (1.65 m), temperature (36.7 °C), income ($52,000) |
| Numerical | Discrete | Takes countable integer values | Number of children (3), word count (450), page views (12,000) |
| [Categorical](/wiki/categorical_data) | Nominal | Unordered categories with no intrinsic ranking | Color (red, blue, green), country (USA, Japan, Brazil) |
| Categorical | Ordinal | Ordered categories with a meaningful ranking | Education level (high school < bachelor's < master's < PhD), satisfaction (low < medium < high) |
| Binary | - | A special case of categorical with exactly two values | Spam or not spam (0/1), male or female (0/1) |
| Text | - | Natural language strings requiring tokenization | Product reviews, tweets, medical notes |

Numerical features can be used directly by most algorithms, while [categorical data](/wiki/categorical_data) typically requires encoding (such as [one-hot encoding](/wiki/one-hot_encoding) or label encoding) before it can be fed into a model.[7] Text features usually undergo further preprocessing, such as conversion into a bag-of-words matrix or a TF-IDF representation.
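As a rough illustration of the encoding step, the sketch below one-hot encodes a nominal feature and maps an ordinal feature to integer ranks using only the Python standard library. The helper names `one_hot_encode` and `ordinal_encode` are hypothetical, and the alphabetical column order is an assumption made for reproducibility; libraries such as scikit-learn provide production-grade encoders.

```python
def one_hot_encode(values):
    """Map each category to a 0/1 indicator vector (one column per category)."""
    categories = sorted(set(values))  # fixed, alphabetical column order (assumption)
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # exactly one "hot" position per observation
        rows.append(row)
    return categories, rows


def ordinal_encode(values, order):
    """Map ordered categories to integer ranks, preserving the given order."""
    rank = {cat: i for i, cat in enumerate(order)}
    return [rank[v] for v in values]


colors = ["red", "blue", "green", "blue"]
categories, encoded = one_hot_encode(colors)
print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0]]

satisfaction = ["low", "high", "medium"]
ranks = ordinal_encode(satisfaction, order=["low", "medium", "high"])
print(ranks)       # [0, 2, 1]
```

One-hot encoding suits nominal features because it imposes no order on the categories, whereas integer ranks suit ordinal features whose order carries meaning.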

## Feature vectors and feature spaces

A [feature vector](/wiki/feature_vector) is a *d*-dimensional vector of numerical values that represents an observation, where *d* is the number of features. For example, a house might be represented by the feature vector (3, 2, 1500, 1), corresponding to 3 bedrooms, 2 bathrooms, 1500 square feet, and a binary indicator for having a garage.

The **feature space** is the geometric space defined by all possible feature vectors. Each feature corresponds to one axis, and each data point occupies a position in this space. Many machine learning algorithms, including [k-nearest neighbors](/wiki/k_nearest_neighbors), [support vector machines](/wiki/support_vector_machine_svm), and [k-means clustering](/wiki/k-means), operate by computing distances or boundaries within the feature space.[1] The structure of the feature space therefore has a direct impact on model performance.

### Illustrative example

| Feature | House A | House B |
|---|---|---|
| Bedrooms | 3 | 2 |
| Bathrooms | 2 | 2 |
| Square footage | 1,500 | 1,100 |
| Has garage | 1 | 0 |
| **Price (label)** | **$800,000** | **$500,000** |

In this example, the first four columns are features and the last column is the label (target variable) the model is trained to predict.
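To make the geometric view concrete, the following sketch treats House A and House B from the table as points in a four-dimensional feature space and computes the Euclidean distance between them. The function name `euclidean` is illustrative. Note how the unscaled square-footage axis accounts for almost the entire distance, which is one reason distance-based algorithms are sensitive to feature scale.

```python
import math

# Feature vectors from the table: (bedrooms, bathrooms, square footage, has garage)
house_a = [3, 2, 1500, 1]
house_b = [2, 2, 1100, 0]


def euclidean(u, v):
    """Straight-line distance between two points in feature space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


distance = euclidean(house_a, house_b)
sq_ft_gap = abs(house_a[2] - house_b[2])

print(round(distance, 4))  # 400.0025
print(sq_ft_gap)           # 400
```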

## What is feature engineering and why does it matter?

[Feature engineering](/wiki/feature_engineering) is the process of using domain knowledge to create, transform, or select features that make machine learning algorithms work more effectively.[7] It is widely regarded as one of the most impactful steps in the modeling pipeline, and skilled feature engineering can often improve model accuracy more than switching to a more complex algorithm.[8] Domingos frames it as the part of machine learning that is hardest to automate, observing that "much of the effort in building a machine learning application is in the design of features," because feature engineering is domain-specific while learning algorithms are largely general purpose.[11]

Common feature engineering techniques include:

- **Mathematical transformations.** Applying log, square root, or power transforms to reduce skewness.
- **Binning (discretization).** Converting continuous variables into categorical bins (for example, grouping ages into ranges like 18 to 25, 26 to 35, and so on).
- **Date and time decomposition.** Extracting day of week, month, hour, or "is weekend" flags from timestamps.
- **Aggregation.** Computing summary statistics (mean, sum, count) over grouped records.
- **Domain-specific creation.** Constructing new features based on expert knowledge, such as body mass index (BMI) from height and weight.[7]
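The techniques listed above can be sketched in a few lines of standard-library Python. The record fields (`income`, `age`, `timestamp`, `height_m`, `weight_kg`), the bin boundaries, and the helper names are all hypothetical choices made for illustration, not a prescribed schema.

```python
import math
from collections import defaultdict
from datetime import datetime


def engineer_features(record):
    """Derive new features from one raw record (field names are hypothetical)."""
    ts = datetime.fromisoformat(record["timestamp"])
    age = record["age"]  # assumes adult ages
    if age <= 25:
        age_bin = "18-25"
    elif age <= 35:
        age_bin = "26-35"
    else:
        age_bin = "36+"
    return {
        "log_income": math.log1p(record["income"]),  # log transform reduces skew
        "age_bin": age_bin,                          # binning (discretization)
        "day_of_week": ts.weekday(),                 # Monday = 0 ... Sunday = 6
        "hour": ts.hour,
        "is_weekend": int(ts.weekday() >= 5),
        "bmi": record["weight_kg"] / record["height_m"] ** 2,  # domain-specific
    }


def mean_by_group(rows, key, value):
    """Aggregation: mean of `value` for each distinct `key`."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in rows:
        totals[row[key]][0] += row[value]
        totals[row[key]][1] += 1
    return {k: s / n for k, (s, n) in totals.items()}


record = {"income": 52000, "age": 31, "timestamp": "2024-03-16T14:30:00",
          "height_m": 1.65, "weight_kg": 60.0}
features = engineer_features(record)
print(features["age_bin"], features["is_weekend"], round(features["bmi"], 2))
# 26-35 1 22.04

purchases = [{"customer": "a", "amount": 10.0},
             {"customer": "a", "amount": 30.0},
             {"customer": "b", "amount": 5.0}]
avg_spend = mean_by_group(purchases, "customer", "amount")
print(avg_spend)  # {'a': 20.0, 'b': 5.0}
```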

## Feature selection

Feature selection is the process of identifying and retaining only the most relevant features for a given modeling task, discarding those that are redundant or irrelevant.[3] Reducing the number of features can improve model accuracy, decrease training time, and enhance interpretability.[3] Guyon and Elisseeff state the objectives of feature selection as "improving the prediction performance of the predictors, providing faster and more cost-effective predictors, and providing a better understanding of the underlying process that generated the data."[3]

Three broad families of methods exist.

| Method family | How it works | Examples |
|---|---|---|
| Filter methods | Rank features using statistical measures independent of any model | Pearson correlation, chi-squared test, mutual information, variance threshold |
| Wrapper methods | Evaluate subsets of features by training a model and measuring performance | Forward selection, backward elimination, recursive feature elimination (RFE) |
| Embedded methods | Perform feature selection as part of the model training process | [LASSO](/wiki/lasso_regression) (L1 regularization), [random forest](/wiki/random_forest) feature importance, [gradient boosting](/wiki/gradient_boosting) importance scores |

Filter methods are computationally cheap but ignore feature interactions. Wrapper methods capture interactions but are expensive. Embedded methods offer a practical middle ground for many applications.[8]
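A minimal sketch of a filter method is shown below: each feature is scored by the absolute value of its Pearson correlation with the target, independently of any model, and the features are ranked by that score. The data values and the names `pearson` and `rank_features` are made up for illustration. Because every feature is scored in isolation, the sketch also shows why univariate filters cannot detect interactions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def rank_features(columns, target):
    """Rank features by |correlation with the target|, strongest first."""
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical housing data; prices in thousands of dollars.
columns = {
    "square_feet": [1100, 1500, 1800, 2400, 3000],
    "bedrooms": [2, 3, 3, 4, 4],
    "house_number": [12, 7, 45, 3, 28],  # expected to be uninformative
}
price = [500, 800, 850, 1100, 1400]

ranking = rank_features(columns, price)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
# square_feet: 0.99
# bedrooms: 0.94
# house_number: 0.17
```

A practitioner would then keep the top-scoring features or apply a threshold; the cutoff is a judgment call rather than something the method dictates.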

## Feature extraction

[Feature extraction](/wiki/feature_extraction) transforms raw data into a new, typically lower-dimensional set of features that retains the most important information. Unlike feature selection, which picks a subset of existing features, feature extraction creates entirely new features.

Prominent techniques include:

- **Principal component analysis (PCA).** A linear method that projects data onto the directions of maximum variance, producing uncorrelated components.[9]
- **Linear discriminant analysis (LDA).** Finds linear combinations that maximize class separability.
- **t-SNE and UMAP.** Nonlinear methods used primarily for visualization of high-dimensional data in two or three dimensions.
- **[Autoencoders](/wiki/autoencoder).** [Neural networks](/wiki/neural_network) trained to compress data into a lower-dimensional latent space and then reconstruct it, learning a compact representation in the process.[4]
- **Independent component analysis (ICA).** Separates a multivariate signal into additive, statistically independent components.

Feature extraction is a core component of [dimension reduction](/wiki/dimension_reduction) and is especially valuable when the original feature space is very high-dimensional.[9]
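To illustrate how feature extraction creates new features rather than selecting existing ones, the sketch below finds the first principal component of a small two-feature dataset and projects each observation onto it, replacing two correlated features with one. It uses power iteration on the covariance matrix, a simple stand-in chosen to keep the example dependency-free; numerical libraries normally compute PCA through an eigendecomposition or singular value decomposition. The function name and data are illustrative.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return the direction of maximum variance and each row's score along it."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeated multiplication converges to the top eigenvector.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Two strongly correlated features (second is roughly twice the first).
points = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0], [5.0, 10.1]]
component, scores = first_principal_component(points)

print([round(x, 2) for x in component])  # approximately [0.45, 0.9]
print(len(scores))                       # 5 observations, now 1 feature each
```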

## Feature importance

Feature importance quantifies how much each feature contributes to a model's predictions. Understanding feature importance helps with model interpretability, debugging, and further feature selection.[10]

| Method | Description | Scope |
|---|---|---|
| Gini importance (MDI) | Measures average reduction in impurity across tree splits | Model-specific (tree-based) |
| Permutation importance | Measures drop in model performance when a feature's values are randomly shuffled | Model-agnostic |
| SHAP values | Based on Shapley values from cooperative game theory; assigns each feature a contribution to each individual prediction | Model-agnostic |
| Coefficient magnitude | In linear models, the absolute value of a feature's coefficient (after scaling) indicates importance | Model-specific (linear) |
| LIME | Builds a local interpretable model around a single prediction to estimate feature contributions | Model-agnostic |

Permutation importance and SHAP are particularly popular because they work with any model type.[10] SHAP (SHapley Additive exPlanations), introduced by Scott Lundberg and Su-In Lee at NeurIPS 2017, unifies several earlier attribution methods under the framework of Shapley values and provides both global importance (averaged across all predictions) and local importance (for a single prediction), making it a versatile tool for model explanation.[6]
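Permutation importance is simple enough to sketch directly. The example below shuffles one feature column at a time and records how much the mean squared error rises. The `model` function is a hypothetical, already-fitted predictor that depends only on the first feature, and the data are synthetic; the function names and the choice of error metric are assumptions for illustration.

```python
import random


def mse(model, X, y):
    """Mean squared error of a callable model on rows X with targets y."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Average increase in MSE when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def model(row):
    """Hypothetical fitted model: uses feature 0, ignores feature 1."""
    return 3.0 * row[0]


data_rng = random.Random(42)
X = [[data_rng.uniform(0, 10), data_rng.uniform(0, 10)] for _ in range(50)]
y = [3.0 * row[0] for row in X]

importances = permutation_importance(model, X, y)
print(importances[0] > 0)  # True: shuffling feature 0 hurts predictions
print(importances[1])      # 0.0: the model never reads feature 1
```

Because the procedure only needs predictions, it works with any model, which is what "model-agnostic" means in the table above.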

## When does feature scaling matter?

Many machine learning algorithms, especially those that rely on distance calculations (such as k-nearest neighbors and [support vector machines](/wiki/support_vector_machine_svm)) or gradient-based optimization (such as [neural networks](/wiki/neural_network) and [logistic regression](/wiki/logistic_regression)), are sensitive to the scale of input features.[7] [Normalization](/wiki/normalization) and standardization bring features to a comparable scale.

| Technique | Formula | Output range | When to use |
|---|---|---|---|
| Min-max scaling | x' = (x - x_min) / (x_max - x_min) | [0, 1] | When data has no significant outliers and a bounded range is needed |
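| Z-score standardization | x' = (x - mean) / std | Unbounded (mean = 0, std = 1) | When features are on different scales and mild outliers are present (less distorted than min-max scaling, though the mean and standard deviation are still outlier-sensitive); commonly used with linear models and neural networks |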
| Robust scaling | x' = (x - median) / IQR | Unbounded | When data contains many outliers |
| Unit vector (L2 norm) | x' = x / ‖x‖ | Unit length | When only the direction of the feature vector matters (for example, in text classification with TF-IDF) |

Tree-based algorithms like [decision trees](/wiki/decision_tree), [random forests](/wiki/random_forest), and [gradient boosting](/wiki/gradient_boosting) are generally invariant to feature scaling because they make split decisions based on thresholds rather than distances.[8]
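The four formulas in the table can be written out as short functions. The sketch below applies them to a small, made-up income column containing one extreme outlier; the function names are illustrative, and the quartiles use the standard library's "inclusive" method, which is one of several common conventions.

```python
import math
import statistics


def min_max_scale(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def z_score(xs):
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)  # population standard deviation
    return [(x - mean) / std for x in xs]


def robust_scale(xs):
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - median) / (q3 - q1) for x in xs]


def l2_normalize(vec):
    """Scale one feature vector (a single observation) to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


incomes = [30, 40, 50, 60, 1000]  # hypothetical values with one outlier

scaled_minmax = min_max_scale(incomes)
scaled_z = z_score(incomes)
scaled_robust = robust_scale(incomes)

print([round(v, 3) for v in scaled_minmax])  # [0.0, 0.01, 0.021, 0.031, 1.0]
print(scaled_robust)                         # [-1.0, -0.5, 0.0, 0.5, 47.5]
print(l2_normalize([3.0, 4.0]))              # [0.6, 0.8]
```

With min-max scaling the outlier squeezes the four typical values into about 3% of the output range, while robust scaling keeps them evenly spread and leaves the outlier visibly extreme.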

## Feature interactions and polynomial features

A feature interaction occurs when the combined effect of two or more features on the target variable differs from the sum of their individual effects. Capturing interactions can significantly improve model performance for algorithms that do not inherently model them (such as linear regression).[7]

**Polynomial features** expand the feature space by generating all polynomial combinations of features up to a specified degree. For two features *a* and *b*, degree-2 polynomial expansion produces: 1, a, b, a², ab, b². The interaction term *ab* captures the joint effect of the two features.

**Feature crosses** are a related technique used primarily with categorical features. A feature cross combines two or more categorical features into a single composite feature. For example, crossing "city" and "device type" creates a new feature "city_device" that captures location-specific device preferences.

Polynomial and interaction features can dramatically increase the dimensionality of the dataset. Careful use of regularization and feature selection is recommended to avoid overfitting when employing these techniques.[7]
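As a small sketch of both techniques, the code below generates the degree-2 polynomial expansion in the same order as the list above (1, a, b, a², ab, b²) and builds a "city_device" feature cross. The function names are hypothetical. The final lines show the dimensionality growth: with *d* input features and degree *k*, the expansion has C(*d* + *k*, *k*) terms.

```python
from itertools import combinations_with_replacement
from math import comb, prod


def polynomial_features(row, degree=2):
    """All monomials of the input features up to `degree`, including the bias 1."""
    features = []
    for deg in range(degree + 1):
        for combo in combinations_with_replacement(range(len(row)), deg):
            features.append(prod(row[i] for i in combo))  # empty product is 1
    return features


def feature_cross(*categories):
    """Combine several categorical values into one composite category."""
    return "_".join(categories)


expanded = polynomial_features([2, 3])  # a = 2, b = 3
print(expanded)                         # [1, 2, 3, 4, 6, 9]

crossed = feature_cross("tokyo", "mobile")
print(crossed)                          # tokyo_mobile

# Dimensionality growth for 10 input features.
n_terms_deg2 = len(polynomial_features([1] * 10, degree=2))
n_terms_deg3 = len(polynomial_features([1] * 10, degree=3))
print(n_terms_deg2, n_terms_deg3)       # 66 286
```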

## High-dimensional feature spaces and the curse of dimensionality

As the number of features grows, the volume of the feature space increases exponentially. This phenomenon, known as the **curse of dimensionality** (a term coined by Richard Bellman in his 1957 book *Dynamic Programming* and popularized in his 1961 *Adaptive Control Processes*), creates several problems.[12][5]

1. **Data sparsity.** In high dimensions, data points become increasingly spread out, making it difficult for algorithms to find meaningful patterns without enormous amounts of training data.[5]
2. **Distance concentration.** As dimensionality increases, the distances from a point to its nearest and farthest neighbors become nearly equal relative to their magnitude, which reduces the effectiveness of distance-based algorithms.
3. **Overfitting risk.** Models with many features relative to the number of observations can memorize noise rather than learning genuine patterns.[2]
4. **Computational cost.** Training time and memory requirements grow with the number of features, sometimes prohibitively.

The primary remedies are [dimension reduction](/wiki/dimension_reduction) (via feature selection or feature extraction), regularization techniques (L1 and L2 penalties), and collecting more training data.[2]
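Distance concentration can be observed with a short simulation. The sketch below draws uniformly random points in the unit hypercube and measures the relative contrast, (farthest distance - nearest distance) / nearest distance, from a random query point. The function name, sample size, and seed are arbitrary choices for illustration; exact values depend on them, but the contrast should shrink sharply as the dimension grows.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min of Euclidean distances from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)


contrast_low = relative_contrast(2)
contrast_high = relative_contrast(1000)

print(f"d = 2:    {contrast_low:.2f}")
print(f"d = 1000: {contrast_high:.2f}")
```

In two dimensions the nearest neighbor is typically many times closer than the farthest point, whereas in 1,000 dimensions all points sit at nearly the same distance, so "nearest" carries much less information.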

## Sparse vs. dense features

Features can be characterized by how many of their values are nonzero.

| Property | Sparse features | Dense features |
|---|---|---|
| Definition | Vectors where most elements are zero | Vectors where most or all elements are nonzero |
| Typical representation | One-hot encoding, bag-of-words, TF-IDF | [Word embeddings](/wiki/word_embedding), neural network hidden states |
| Dimensionality | Often very high (thousands to millions of dimensions) | Typically low to moderate (roughly 50 to 1,024 dimensions) |
| Interpretability | High; each dimension usually corresponds to a specific known feature | Lower; dimensions are learned and may not have obvious meanings |
| Storage | Efficient with sparse matrix formats (CSR, CSC) | Requires full matrix storage |
| Semantic capture | Limited; does not encode relationships between features | Strong; similar items have similar vectors |

In natural language processing, sparse representations like bag-of-words have been largely supplanted by dense [embeddings](/wiki/embedding_vector) produced by models such as Word2Vec, GloVe, and [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers) for most downstream tasks, though sparse representations remain useful in information retrieval and certain hybrid search architectures.[4]
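The storage difference in the table can be seen in a toy bag-of-words example. The sketch below represents the same document as a dense count vector (one slot per vocabulary word) and as a sparse mapping that stores only nonzero entries. The two documents and the helper names are made up; real sparse formats such as CSR follow the same idea with arrays instead of dictionaries.

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog chased the cat"]

# Vocabulary in alphabetical order; each word gets a fixed column index.
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}


def bow_dense(doc):
    """Count vector with one entry per vocabulary word, zeros included."""
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocab]


def bow_sparse(doc):
    """Mapping from column index to count, storing nonzero entries only."""
    counts = Counter(doc.split())
    return {index[word]: count for word, count in sorted(counts.items())}


dense_vec = bow_dense(docs[0])
sparse_vec = bow_sparse(docs[0])

print(vocab)       # ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the']
print(dense_vec)   # [1, 0, 0, 1, 1, 1, 2]
print(sparse_vec)  # {0: 1, 3: 1, 4: 1, 5: 1, 6: 2}
```

With a realistic vocabulary of tens of thousands of words, a single document touches only a tiny fraction of the columns, so the sparse form stores far fewer values than the dense one.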

## How do learned features differ from handcrafted features?

Traditional machine learning relies on **handcrafted features**: domain experts manually design and extract features from raw data before training a model. This approach requires significant expertise and effort, and the resulting features may not capture all relevant patterns.[4]

**Representation learning** (also called feature learning) automates this process.[4] As Yoshua Bengio, Aaron Courville, and Pascal Vincent put it in their 2013 review, "The success of machine learning algorithms generally depends on data representation," because different representations can hide or expose the explanatory factors of variation behind the data.[4] [Deep learning](/wiki/deep_learning) models, particularly [convolutional neural networks](/wiki/convolutional_neural_network) (CNNs) and [transformers](/wiki/transformer), learn hierarchical feature representations directly from raw data during training. Early layers typically learn low-level features (edges, textures in images; character n-grams in text), while deeper layers learn increasingly abstract, high-level features (object parts, semantic concepts).[4]

| Aspect | Handcrafted features | Learned features |
|---|---|---|
| Creation | Designed manually by domain experts | Learned automatically during model training |
| Domain knowledge required | High | Low (though architecture design still requires expertise) |
| Adaptability | Fixed once designed; must be redesigned for new domains | Adapt to data; can transfer across tasks via [transfer learning](/wiki/transfer_learning) |
| Performance ceiling | Limited by the engineer's insight | Can discover patterns humans might miss |
| Interpretability | Generally high | Often low ("black box" representations) |
| Data requirements | Works with smaller datasets | Typically requires large datasets to learn effective representations |

The success of deep learning in computer vision, natural language processing, and speech recognition is largely attributed to its ability to learn powerful feature representations without manual engineering.[4] Techniques like [transfer learning](/wiki/transfer_learning) allow features learned on large datasets (such as ImageNet for vision or large text corpora for language models) to be reused for related tasks with limited data.

## Features in different domains

The nature and design of features varies significantly across application areas.

| Domain | Typical features | Notes |
|---|---|---|
| Computer vision | Pixel intensities, edge histograms (HOG), SIFT descriptors, CNN activations | Deep learning has largely replaced hand-designed visual features |
| Natural language processing | Bag-of-words, TF-IDF, n-grams, word embeddings, contextual embeddings | Transformer-based models learn contextualized features |
| Speech recognition | Mel-frequency cepstral coefficients (MFCCs), spectrograms, filter banks | Modern end-to-end models learn features from raw audio |
| Tabular data | Numerical columns, encoded categorical columns, engineered ratios and aggregations | Feature engineering remains highly impactful for tabular data |
| Recommender systems | User demographics, item attributes, interaction history, collaborative signals | Hybrid features combining content and behavior are common |
| Bioinformatics | Gene expression levels, protein sequence motifs, molecular descriptors | High-dimensional and often sparse |

## Summary of key concepts

| Concept | Definition |
|---|---|
| Feature | A measurable input property used by a model |
| Feature vector | A numerical vector representing one observation |
| Feature space | The multidimensional space of all possible feature vectors |
| Feature engineering | Creating and transforming features using domain knowledge |
| Feature selection | Choosing the most relevant subset of features |
| Feature extraction | Deriving new (often lower-dimensional) features from raw data |
| Feature importance | Quantifying each feature's contribution to predictions |
| Feature scaling | Normalizing features to a common scale |
| Feature interaction | Combined effect of multiple features that differs from their individual effects |
| Curse of dimensionality | Problems arising from having too many features relative to the amount of training data |

## References

1. Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer. ISBN 978-0-387-31073-2.
2. Hastie, T., Tibshirani, R., & Friedman, J. (2009). *The Elements of Statistical Learning: Data Mining, Inference, and Prediction* (2nd ed.). Springer. ISBN 978-0-387-84857-0.
3. Guyon, I., & Elisseeff, A. (2003). "An Introduction to Variable and Feature Selection." *Journal of Machine Learning Research*, 3, 1157-1182.
4. Bengio, Y., Courville, A., & Vincent, P. (2013). "Representation Learning: A Review and New Perspectives." *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 35(8), 1798-1828.
5. Bellman, R. E. (1961). *Adaptive Control Processes: A Guided Tour*. Princeton University Press.
6. Lundberg, S. M., & Lee, S.-I. (2017). "A Unified Approach to Interpreting Model Predictions." *Advances in Neural Information Processing Systems*, 30.
7. Zheng, A., & Casari, A. (2018). *Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists*. O'Reilly Media. ISBN 978-1-491-95324-2.
8. Kuhn, M., & Johnson, K. (2019). *Feature Engineering and Selection: A Practical Approach for Predictive Models*. CRC Press. ISBN 978-1-138-07922-3.
9. Jolliffe, I. T. (2002). *Principal Component Analysis* (2nd ed.). Springer. ISBN 978-0-387-95442-4.
10. Molnar, C. (2022). *Interpretable Machine Learning: A Guide for Making Black Box Models Explainable* (2nd ed.). Available at christophm.github.io/interpretable-ml-book/.
11. Domingos, P. (2012). "A Few Useful Things to Know about Machine Learning." *Communications of the ACM*, 55(10), 78-87.
12. Bellman, R. E. (1957). *Dynamic Programming*. Princeton University Press.