In machine learning and statistics, a feature is an individual measurable property or characteristic of a phenomenon being observed. Features serve as the input variables that a model uses to learn patterns and make predictions. The concept goes by several other names depending on the field: variable in statistics, attribute in database systems, predictor or covariate in regression analysis, and input or independent variable in experimental design. Selecting informative, discriminating, and independent features is one of the most important steps in building effective models for pattern recognition, classification, and regression.
Imagine you are playing a guessing game where your friend has to figure out which animal you are thinking of. You give clues like "it has fur," "it is really big," and "it lives in the ocean." Each of those clues is like a feature. In machine learning, we give a computer a bunch of clues (features) about something, and the computer uses those clues to figure out the answer. The better and more useful the clues are, the easier it is for the computer to guess correctly.
Given a dataset with n observations, each observation can be described by a set of d measurable properties. Each such property is a feature. Formally, if an observation is represented as a vector x = (x₁, x₂, ..., x_d), then each x_i is a feature. The complete set of features used by a model is often called the feature set, and the vector x is called the feature vector. The space spanned by all possible feature vectors is the feature space, a d-dimensional space where each axis corresponds to one feature.
Features can be classified along several axes. The table below summarizes the major categories.
| Type | Subtype | Description | Examples |
|---|---|---|---|
| Numerical | Continuous | Takes any real value within a range | Height (1.65 m), temperature (36.7 °C), income ($52,000) |
| Numerical | Discrete | Takes countable integer values | Number of children (3), word count (450), page views (12,000) |
| Categorical | Nominal | Unordered categories with no intrinsic ranking | Color (red, blue, green), country (USA, Japan, Brazil) |
| Categorical | Ordinal | Ordered categories with a meaningful ranking | Education level (high school < bachelor's < master's < PhD), satisfaction (low < medium < high) |
| Binary | - | A special case of categorical with exactly two values | Spam or not spam (0/1), male or female (0/1) |
| Text | - | Natural language strings requiring tokenization | Product reviews, tweets, medical notes |
Most algorithms can use numerical features directly, while categorical features typically require encoding (such as one-hot encoding or label encoding) before they can be fed into a model. Text features usually require further preprocessing, such as conversion into a bag-of-words matrix or a TF-IDF representation.
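As a minimal sketch of these conversions (assuming pandas and scikit-learn; the column names and values are hypothetical), the snippet below one-hot encodes a nominal column and turns a text column into a TF-IDF matrix.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical raw data with a numerical, a categorical, and a text column
df = pd.DataFrame({
    "age": [34, 28, 45],
    "color": ["red", "blue", "red"],
    "review": ["great product", "poor quality", "great value"],
})

# One-hot encode the nominal "color" column; "age" passes through unchanged
encoded = pd.get_dummies(df[["age", "color"]], columns=["color"])
print(encoded)

# Convert the text column into a TF-IDF matrix (one column per vocabulary term)
tfidf = TfidfVectorizer()
text_features = tfidf.fit_transform(df["review"])
print(tfidf.get_feature_names_out())
print(text_features.toarray().round(2))
```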
A feature vector is a d-dimensional vector of numerical values that represents a single observation. For example, a house might be represented by the feature vector (3, 2, 1500, 1), corresponding to 3 bedrooms, 2 bathrooms, 1500 square feet, and a binary indicator for having a garage.
The feature space is the geometric space defined by all possible feature vectors. Each feature corresponds to one axis, and each data point occupies a position in this space. Many machine learning algorithms, including k-nearest neighbors, support vector machines, and k-means clustering, operate by computing distances or boundaries within the feature space. The structure of the feature space therefore has a direct impact on model performance. The table below shows two houses represented as points in a four-dimensional feature space.
| Feature | House A | House B |
|---|---|---|
| Bedrooms | 3 | 2 |
| Bathrooms | 2 | 2 |
| Square footage | 1,500 | 1,100 |
| Has garage | 1 | 0 |
| Price (label) | $800,000 | $500,000 |
In this example, the first four columns are features and the last column is the label (target variable) the model is trained to predict.
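To make the feature-space idea concrete, the short sketch below (using NumPy and the house vectors from the table) computes the Euclidean distance between the two houses. Note how the unscaled square-footage feature dominates the distance, which foreshadows the need for feature scaling discussed later.

```python
import numpy as np

# Feature vectors from the table: bedrooms, bathrooms, square footage, garage
house_a = np.array([3, 2, 1500, 1])
house_b = np.array([2, 2, 1100, 0])

# Euclidean distance in the raw (unscaled) feature space
distance = np.linalg.norm(house_a - house_b)
print(distance)  # ~400.0, driven almost entirely by square footage
```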
Feature engineering is the process of using domain knowledge to create, transform, or select features that make machine learning algorithms work more effectively. It is widely regarded as one of the most impactful steps in the modeling pipeline, and skilled feature engineering can often improve model accuracy more than switching to a more complex algorithm.
Common feature engineering techniques include binning continuous variables, applying log or power transforms to skewed distributions, decomposing timestamps into components such as hour and day of week, creating ratios and aggregations from existing columns, and encoding categorical variables, as sketched below.
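A brief sketch of a few of these transforms, assuming pandas and NumPy; the transaction columns here are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical transaction data used to illustrate a few common transforms
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-05 09:30", "2024-01-06 22:15"]),
    "amount": [120.0, 9800.0],
    "items": [3, 7],
})

# Decompose the timestamp into simpler numeric features
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek

# Log-transform a heavily skewed amount and build a ratio feature
df["log_amount"] = np.log1p(df["amount"])
df["amount_per_item"] = df["amount"] / df["items"]

print(df.drop(columns="timestamp"))
```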
Feature selection is the process of identifying and retaining only the most relevant features for a given modeling task, discarding those that are redundant or irrelevant. Reducing the number of features can improve model accuracy, decrease training time, and enhance interpretability.
Three broad families of methods exist.
| Method family | How it works | Examples |
|---|---|---|
| Filter methods | Rank features using statistical measures independent of any model | Pearson correlation, chi-squared test, mutual information, variance threshold |
| Wrapper methods | Evaluate subsets of features by training a model and measuring performance | Forward selection, backward elimination, recursive feature elimination (RFE) |
| Embedded methods | Perform feature selection as part of the model training process | LASSO (L1 regularization), random forest feature importance, gradient boosting importance scores |
Filter methods are computationally cheap but ignore feature interactions. Wrapper methods capture interactions but are expensive. Embedded methods offer a practical middle ground for many applications.
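As an illustration of the three families, the sketch below (assuming scikit-learn and its built-in breast cancer dataset) applies a filter method, a wrapper method, and an embedded method to the same data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # scale so the linear models behave well

# Filter: rank features by mutual information with the target, keep the top 10
filter_sel = SelectKBest(score_func=mutual_info_classif, k=10).fit(X, y)

# Wrapper: recursive feature elimination around a logistic regression model
wrapper_sel = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y)

# Embedded: L1-regularized logistic regression zeroes out irrelevant coefficients
embedded = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)

print("Filter keeps: ", filter_sel.get_support().sum(), "features")
print("Wrapper keeps:", wrapper_sel.get_support().sum(), "features")
print("L1 keeps:     ", int((embedded.coef_ != 0).sum()), "features")
```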
Feature extraction transforms raw data into a new, typically lower-dimensional set of features that retains the most important information. Unlike feature selection, which picks a subset of existing features, feature extraction creates entirely new features.
Prominent techniques include principal component analysis (PCA), linear discriminant analysis (LDA), autoencoders, and manifold methods such as t-SNE and UMAP.
Feature extraction is a core component of dimension reduction and is especially valuable when the original feature space is very high-dimensional.
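A minimal sketch of feature extraction with scikit-learn: projecting the 64-pixel digits dataset onto ten principal components that retain much of the original variance.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)  # 64 original pixel features per image

# Standardize, then extract 10 new features as linear combinations of pixels
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=10).fit(X_scaled)
X_reduced = pca.transform(X_scaled)

print(X.shape, "->", X_reduced.shape)  # (1797, 64) -> (1797, 10)
print("Variance retained:", pca.explained_variance_ratio_.sum().round(3))
```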
Feature importance quantifies how much each feature contributes to a model's predictions. Understanding feature importance helps with model interpretability, debugging, and further feature selection.
| Method | Description | Scope |
|---|---|---|
| Gini importance (MDI) | Measures average reduction in impurity across tree splits | Model-specific (tree-based) |
| Permutation importance | Measures drop in model performance when a feature's values are randomly shuffled | Model-agnostic |
| SHAP values | Based on Shapley values from cooperative game theory; assigns each feature a contribution to each individual prediction | Model-agnostic |
| Coefficient magnitude | In linear models, the absolute value of a feature's coefficient (after scaling) indicates importance | Model-specific (linear) |
| LIME | Builds a local interpretable model around a single prediction to estimate feature contributions | Model-agnostic |
Permutation importance and SHAP are particularly popular because they work with any model type. SHAP provides both global importance (averaged across all predictions) and local importance (for a single prediction), making it a versatile tool for model explanation.
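A sketch of model-agnostic permutation importance using scikit-learn, assuming a random forest fit on the built-in diabetes regression dataset.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0
)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Shuffle each feature on held-out data and measure the drop in R^2
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for name, score in sorted(zip(data.feature_names, result.importances_mean),
                          key=lambda pair: -pair[1]):
    print(f"{name:>5s}: {score:.3f}")
```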
Many machine learning algorithms, especially those that rely on distance calculations (such as k-nearest neighbors and support vector machines) or gradient-based optimization (such as neural networks and logistic regression), are sensitive to the scale of input features. Normalization and standardization bring features to a comparable scale.
| Technique | Formula | Output range | When to use |
|---|---|---|---|
| Min-max scaling | x' = (x - x_min) / (x_max - x_min) | [0, 1] | When data has no significant outliers and a bounded range is needed |
| Z-score standardization | x' = (x - mean) / std | Unbounded (mean = 0, std = 1) | When data may contain outliers; required by many linear models and neural networks |
| Robust scaling | x' = (x - median) / IQR | Unbounded | When data contains many outliers |
| Unit vector (L2 norm) | x' = x / ‖x‖ | Unit length | When only the direction of the feature vector matters (for example, in text classification with TF-IDF) |
Tree-based algorithms like decision trees, random forests, and gradient boosting are generally invariant to feature scaling because they make split decisions based on thresholds rather than distances.
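A short sketch of min-max scaling and z-score standardization applied to the house features from the earlier table (assuming scikit-learn; the same transforms can be written by hand directly from the formulas above).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Rows are observations, columns are features: bedrooms, bathrooms, sqft, garage
X = np.array([[3, 2, 1500, 1],
              [2, 2, 1100, 0],
              [4, 3, 2200, 1]], dtype=float)

print(MinMaxScaler().fit_transform(X))    # every column mapped to [0, 1]
print(StandardScaler().fit_transform(X))  # every column: mean 0, std 1
```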
A feature interaction occurs when the combined effect of two or more features on the target variable differs from the sum of their individual effects. Capturing interactions can significantly improve model performance for algorithms that do not inherently model them (such as linear regression).
Polynomial features expand the feature space by generating all polynomial combinations of features up to a specified degree. For two features a and b, degree-2 polynomial expansion produces: 1, a, b, a², ab, b². The interaction term ab captures the joint effect of the two features.
Feature crosses are a related technique used primarily with categorical features. A feature cross combines two or more categorical features into a single composite feature. For example, crossing "city" and "device type" creates a new feature "city_device" that captures location-specific device preferences.
Polynomial and interaction features can dramatically increase the dimensionality of the dataset. Careful use of regularization and feature selection is recommended to avoid overfitting when employing these techniques.
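A sketch of degree-2 polynomial expansion with scikit-learn, reproducing the 1, a, b, a², ab, b² terms described above, followed by a simple string-based feature cross for two hypothetical categorical columns.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Degree-2 expansion of two numerical features a and b
X = np.array([[2.0, 3.0]])
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X))                   # [[1. 2. 3. 4. 6. 9.]]
print(poly.get_feature_names_out(["a", "b"]))  # 1, a, b, a^2, a b, b^2

# A feature cross of two hypothetical categorical columns
df = pd.DataFrame({"city": ["Tokyo", "Paris"], "device": ["mobile", "desktop"]})
df["city_device"] = df["city"] + "_" + df["device"]
print(df)
```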
As the number of features grows, the volume of the feature space increases exponentially. This phenomenon, known as the curse of dimensionality (a term coined by Richard Bellman in 1961), creates several problems: the available data becomes increasingly sparse relative to the space it must cover, distances between points become less informative because all points start to look roughly equidistant, models overfit more easily, and computation and storage costs rise. The short simulation below illustrates the distance problem.
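A small NumPy simulation of distance concentration: as the dimensionality grows, the nearest and farthest neighbors of a random query point become almost indistinguishable.

```python
import numpy as np

rng = np.random.default_rng(0)

for d in [2, 10, 100, 1000]:
    points = rng.random((1000, d))  # 1,000 random points in the unit hypercube
    query = rng.random(d)
    distances = np.linalg.norm(points - query, axis=1)
    # Relative gap between the farthest and nearest point shrinks as d grows
    contrast = (distances.max() - distances.min()) / distances.min()
    print(f"d={d:>4}: relative distance contrast = {contrast:.2f}")
```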
The primary remedies are dimension reduction (via feature selection or feature extraction), regularization techniques (L1 and L2 penalties), and collecting more training data.
Features can be characterized by how many of their values are nonzero.
| Property | Sparse features | Dense features |
|---|---|---|
| Definition | Vectors where most elements are zero | Vectors where most or all elements are nonzero |
| Typical representation | One-hot encoding, bag-of-words, TF-IDF | Word embeddings, neural network hidden states |
| Dimensionality | Often very high (thousands to millions) | Typically low to moderate (50 to 1024) |
| Interpretability | High; each dimension usually corresponds to a specific known feature | Lower; dimensions are learned and may not have obvious meanings |
| Storage | Efficient with sparse matrix formats (CSR, CSC) | Requires full matrix storage |
| Semantic capture | Limited; does not encode relationships between features | Strong; similar items have similar vectors |
In natural language processing, sparse representations like bag-of-words have been largely supplanted by dense embeddings produced by models such as Word2Vec, GloVe, and BERT for most downstream tasks, though sparse representations remain useful in information retrieval and certain hybrid search architectures.
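A short sketch contrasting a sparse bag-of-words matrix with a dense embedding-style vector (assumes scikit-learn; the 4-dimensional "embedding" values are made up purely for illustration, not produced by any real model).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]

# Sparse representation: one column per vocabulary word, stored as a CSR matrix
bow = CountVectorizer().fit_transform(docs)
print(type(bow).__name__, bow.shape)
print(bow.toarray())

# Dense representation: a low-dimensional vector per document
dense = np.array([[0.12, -0.48, 0.33, 0.90],
                  [0.10, -0.52, 0.41, 0.87]])
print(dense.shape)
```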
Traditional machine learning relies on handcrafted features: domain experts manually design and extract features from raw data before training a model. This approach requires significant expertise and effort, and the resulting features may not capture all relevant patterns.
Representation learning (also called feature learning) automates this process. Deep learning models, particularly convolutional neural networks (CNNs) and transformers, learn hierarchical feature representations directly from raw data during training. Early layers typically learn low-level features (edges, textures in images; character n-grams in text), while deeper layers learn increasingly abstract, high-level features (object parts, semantic concepts).
| Aspect | Handcrafted features | Learned features |
|---|---|---|
| Creation | Designed manually by domain experts | Learned automatically during model training |
| Domain knowledge required | High | Low (though architecture design still requires expertise) |
| Adaptability | Fixed once designed; must be redesigned for new domains | Adapt to data; can transfer across tasks via transfer learning |
| Performance ceiling | Limited by the engineer's insight | Can discover patterns humans might miss |
| Interpretability | Generally high | Often low ("black box" representations) |
| Data requirements | Works with smaller datasets | Typically requires large datasets to learn effective representations |
The success of deep learning in computer vision, natural language processing, and speech recognition is largely attributed to its ability to learn powerful feature representations without manual engineering. Techniques like transfer learning allow features learned on large datasets (such as ImageNet for vision or large text corpora for language models) to be reused for related tasks with limited data.
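A hedged sketch of reusing learned features via transfer learning, assuming PyTorch and torchvision (version 0.13 or later for the weights API): a pretrained ResNet-18 backbone is used purely as a fixed feature extractor that maps each image to a 512-dimensional learned feature vector.

```python
import torch
from torchvision import models

# Load a CNN pretrained on ImageNet and drop its classification head
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

for p in backbone.parameters():  # freeze the learned feature extractor
    p.requires_grad = False

# A batch of 4 RGB images (224x224) -> 512-dimensional learned feature vectors
images = torch.randn(4, 3, 224, 224)
with torch.no_grad():
    features = backbone(images)
print(features.shape)  # torch.Size([4, 512])
```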
The nature and design of features varies significantly across application areas.
| Domain | Typical features | Notes |
|---|---|---|
| Computer vision | Pixel intensities, edge histograms (HOG), SIFT descriptors, CNN activations | Deep learning has largely replaced hand-designed visual features |
| Natural language processing | Bag-of-words, TF-IDF, n-grams, word embeddings, contextual embeddings | Transformer-based models learn contextualized features |
| Speech recognition | Mel-frequency cepstral coefficients (MFCCs), spectrograms, filter banks | Modern end-to-end models learn features from raw audio |
| Tabular data | Numerical columns, encoded categorical columns, engineered ratios and aggregations | Feature engineering remains highly impactful for tabular data |
| Recommender systems | User demographics, item attributes, interaction history, collaborative signals | Hybrid features combining content and behavior are common |
| Bioinformatics | Gene expression levels, protein sequence motifs, molecular descriptors | High-dimensional and often sparse |
The table below summarizes the key concepts covered in this article.

| Concept | Definition |
|---|---|
| Feature | A measurable input property used by a model |
| Feature vector | A numerical vector representing one observation |
| Feature space | The multidimensional space of all possible feature vectors |
| Feature engineering | Creating and transforming features using domain knowledge |
| Feature selection | Choosing the most relevant subset of features |
| Feature extraction | Deriving new (often lower-dimensional) features from raw data |
| Feature importance | Quantifying each feature's contribution to predictions |
| Feature scaling | Normalizing features to a common scale |
| Feature interaction | Combined effect of multiple features that differs from their individual effects |
| Curse of dimensionality | Problems arising from having too many features relative to data |