Feature Vector
Last reviewed
Sources
17 citations
Review status
Source-backed
Revision
v7 · 4,418 words
A feature vector is an n-dimensional, ordered list of numerical values that represents the measurable properties of an object, data point, or observation in a format suitable for processing by machine learning algorithms.[1][14][15] Each element corresponds to a specific feature, and the complete vector serves as the numerical identity of that data point within a mathematical space called the feature space; as the Wikipedia definition states, "A feature vector is an n-dimensional vector of numerical features that represent some object," and "the vector space associated with these vectors is often called the feature space."[15] Many learning algorithms require this numerical form because, in Wikipedia's words, "such representations facilitate processing and statistical analysis."[15]
Feature vectors are one of the most fundamental data structures in machine learning and pattern recognition. They act as the bridge between raw, unstructured data (images, text, audio) and the mathematical operations that learning algorithms perform. Without a consistent numerical representation, algorithms would have no way to compare, group, or learn from data. In practice the values are rarely raw: Google's Machine Learning Crash Course notes that "feature vectors seldom use the dataset's raw values," and that engineers "must typically process the dataset's values into representations that your model can better learn from."[14]
Imagine you want to teach a robot to sort fruit. You cannot just show the robot an apple and say "this is an apple," because the robot only understands numbers. So instead, you describe the apple using a list of numbers: how red it is, how round it is, how heavy it is, and how sweet it is. That list of numbers might look like [8, 9, 150, 7]. A banana would get a different list, maybe [2, 3, 120, 6], because it is yellow instead of red and long instead of round.
That list of numbers is a feature vector. Every fruit gets its own list. The robot compares these lists to figure out which fruits are similar and which are different. Apples have similar lists to each other, and bananas have similar lists to each other, so the robot learns to tell them apart.
A feature vector is an element of a vector space, typically denoted as x (boldface lowercase), belonging to R^n where n is the number of features. For a single observation, the feature vector is written as:
x = [x_1, x_2, x_3, ..., x_n]^T
where each x_i represents the value of the i-th feature and T denotes the transpose (indicating a column vector by convention). In Google's terminology, the number of elements n is called the dimension of the feature vector, and a model "acts on" data through these floating-point arrays.[14]
When a dataset contains m observations, each with n features, the individual feature vectors are stacked row-by-row into a design matrix (also called a feature matrix) X of dimensions m x n:
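\[
X =
\begin{bmatrix}
\mathbf{x}_1^T \\
\mathbf{x}_2^T \\
\vdots \\
\mathbf{x}_m^T
\end{bmatrix}
=
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1n} \\
x_{21} & x_{22} & \cdots & x_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
x_{m1} & x_{m2} & \cdots & x_{mn}
\end{bmatrix}
\]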
Each row of the design matrix is one feature vector (one data point), and each column contains all observed values for a single feature across the dataset.[1] This row-in-a-design-matrix view is the standard way supervised learning frames its input: a model is fit on the m x n matrix X paired with a target vector y of length m.
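As a minimal sketch of this row-and-column view, the fruit vectors from the earlier analogy can be stacked into a small design matrix. The third observation (`cherry`) and the feature order are assumptions made up for illustration.

```python
# Feature order (assumed): [redness, roundness, weight_g, sweetness]
apple = [8, 9, 150, 7]
banana = [2, 3, 120, 6]
cherry = [9, 9, 8, 8]  # hypothetical third observation, not from the article

# Stack the m feature vectors row by row into an m x n design matrix.
X = [apple, banana, cherry]
m, n = len(X), len(X[0])

# Row i is the feature vector of one observation.
second_row = X[1]

# Column j holds every observed value of feature j (here: weight in grams).
weights = [row[2] for row in X]

# The target vector y has one entry per row of X.
y = ["apple", "banana", "cherry"]
```

Libraries such as NumPy would normally hold `X` as a two-dimensional array; nested lists are used here only to keep the sketch dependency-free.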
Feature vectors can be broadly classified by how their values are distributed across their dimensions.
A sparse representation is a vector where the majority of elements are zero. Sparse vectors arise naturally in text processing and categorical encoding.
| Technique | Description | Sparsity level | Typical dimensionality |
|---|---|---|---|
| One-hot encoding | A single 1 in the position corresponding to the category; all other entries are 0 | Very high | Equal to vocabulary or category count |
| Bag of words | Each dimension is a word count for that term in the document | High | Vocabulary size (often 10,000+) |
| TF-IDF | Weights words by term frequency multiplied by inverse document frequency | High | Vocabulary size |
| Binary feature vectors | Each dimension is 0 or 1 indicating presence or absence of a feature | Moderate to high | Varies |
One-hot encoding is the canonical way categorical attributes enter a feature vector. Google's Machine Learning Crash Course explains that "each category is represented by a vector of N elements, where N is the number of categories," with "exactly one of the elements" set to 1.0 and all others to 0.0; the one-hot vector, not the original string or index, "gets passed to the feature vector, and the model learns a separate weight for each element."[16] So a categorical attribute such as car_color with eight possible values expands into eight feature-vector dimensions.[16]
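The expansion of one categorical attribute into several dimensions can be sketched as follows. The helper `one_hot` and the particular eight color names are illustrative assumptions, not taken from the cited course.

```python
def one_hot(category, categories):
    """Return a list with 1.0 at the category's position and 0.0 elsewhere."""
    if category not in categories:
        raise ValueError(f"unknown category: {category!r}")
    return [1.0 if c == category else 0.0 for c in categories]


# Eight hypothetical values for the car_color attribute.
CAR_COLORS = ["red", "orange", "blue", "yellow", "green", "black", "purple", "brown"]

# One attribute becomes eight feature-vector dimensions, exactly one of them 1.0.
encoded = one_hot("blue", CAR_COLORS)
```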
Sparse vectors are memory-efficient when stored in specialized data structures (such as compressed sparse row format), but they can be computationally expensive when processed with standard dense matrix operations because of their high dimensionality.
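The memory saving can be illustrated with a bag-of-words vector that stores only its non-zero entries. This sketch uses a plain dictionary as a simplified stand-in for a real sparse format such as compressed sparse row; the whitespace tokenizer and the two example documents are assumptions.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words_sparse(document, vocabulary):
    """Return {column_index: count}, keeping only non-zero entries."""
    counts = Counter(tok for tok in document.lower().split() if tok in vocabulary)
    return {vocabulary[tok]: count for tok, count in counts.items()}


def to_dense(sparse, size):
    """Expand the sparse mapping into a full-length list of counts."""
    dense = [0] * size
    for index, value in sparse.items():
        dense[index] = value
    return dense


docs = ["the cat sat on the mat", "the dog chased the cat"]
vocab = build_vocabulary(docs)
sparse_vec = bag_of_words_sparse(docs[0], vocab)
dense_vec = to_dense(sparse_vec, len(vocab))
```

With a realistic vocabulary of tens of thousands of terms, the dense list would be almost entirely zeros while the sparse mapping would still hold only the handful of terms present in the document.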
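Dense feature vectors have most or all elements set to non-zero floating-point values. They are typically lower-dimensional and encode information more compactly than sparse vectors. Examples include learned word vectors such as Word2Vec, sentence embeddings from models such as BERT, and hidden-layer activations extracted from a convolutional neural network.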
Dense vectors capture semantic relationships more effectively than sparse representations.[2] In a well-trained embedding space, vectors for semantically similar items are close together, enabling algorithms to generalize from limited examples.[3]
The terms "feature vector" and "embedding" are sometimes used interchangeably, but they have distinct origins and connotations. An embedding is a learned, dense feature vector: Google defines it as "a vector representation of data in embedding space" and notes that "embeddings make it easier to do machine learning on large feature vectors," because the position of points in that space encodes meaning, so "words that are used in similar contexts will be closer to each other in embedding space."[17]
| Aspect | Feature vector (traditional) | Embedding |
|---|---|---|
| Creation method | Manual feature engineering or rule-based extraction | Learned automatically by a neural network during training |
| Dimensionality | Often high-dimensional and sparse | Typically low-dimensional and dense |
| Interpretability | Individual dimensions often have clear meaning (e.g., word count, pixel intensity) | Individual dimensions usually lack direct interpretation |
| Semantic relationships | May not capture similarity between concepts | Designed to place similar items near each other in vector space |
| Generalization | Tied to specific domain assumptions | Often transfer well across tasks via transfer learning |
| Example | TF-IDF vector for a document | BERT embedding for a sentence |
In modern practice, embeddings are a specific type of feature vector. The broader term "feature vector" covers both hand-crafted and learned representations.[2] A useful rule of thumb: every embedding is a feature vector, but not every feature vector is an embedding. Embeddings are commonly used to turn high-cardinality one-hot inputs into compact dense feature vectors before they enter a downstream network.[17]
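One way to see how an embedding turns a one-hot input into a dense vector: multiplying a one-hot row vector by an embedding matrix selects a single row of that matrix. In the sketch below the matrix values are made up; in a real model they would be learned during training.

```python
def row_times_matrix(row_vec, matrix):
    """Multiply a 1 x N row vector by an N x d matrix, returning d values."""
    d = len(matrix[0])
    return [
        sum(row_vec[i] * matrix[i][j] for i in range(len(matrix)))
        for j in range(d)
    ]


# Hypothetical embedding matrix: 4 categories, 2 dense dimensions each.
EMBEDDINGS = [
    [0.1, 0.9],
    [0.8, 0.2],
    [0.7, 0.3],
    [-0.5, 0.4],
]

category_index = 2
one_hot_vec = [1.0 if i == category_index else 0.0 for i in range(len(EMBEDDINGS))]

# Both routes produce the same dense feature vector.
via_matmul = row_times_matrix(one_hot_vec, EMBEDDINGS)
via_lookup = EMBEDDINGS[category_index]
```

Because the two results agree, implementations generally perform the row lookup directly rather than materializing the one-hot vector.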
The process of converting raw data into feature vectors depends on the data type and the problem domain.
For structured data (spreadsheets, databases), each column is typically a feature. Numerical columns can be used directly, while categorical columns require encoding, such as the one-hot encoding described above.
Text data requires tokenization before numerical representation.
Image feature vectors can be constructed manually or extracted using deep learning.
Audio data is commonly transformed into spectral feature vectors.
Raw feature values often have different scales. A height measured in centimeters might range from 150 to 200, while income in dollars might range from 20,000 to 200,000. Many algorithms (particularly distance-based ones like k-nearest neighbors and support vector machines) are sensitive to scale differences, making normalization an important preprocessing step.[14] Scaling is one example of why "feature vectors seldom use the dataset's raw values": unscaled features let a single large-range dimension dominate the distance calculation.[14]
| Technique | Formula | Output range | Best suited for |
|---|---|---|---|
| Min-max scaling | (x - min) / (max - min) | [0, 1] | When features need a bounded range; neural networks with sigmoid outputs |
| Z-score standardization | (x - mean) / std | Unbounded (mean=0, std=1) | Algorithms assuming normally distributed features; principal component analysis |
| Max-abs scaling | x / max(abs(x)) | [-1, 1] | Sparse data where zero entries should remain zero |
| Robust scaling | (x - median) / IQR | Unbounded | Data with outliers, since median and IQR are less affected by extreme values |
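| L2 normalization | x / sqrt(sum_i x_i^2) | Unit length (L2 norm = 1) | Cases where only vector direction matters, such as cosine-similarity comparisons |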
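The first two rows of the table can be sketched with the height and income ranges mentioned above. The sample values are invented, and `z_score` uses the population standard deviation, which is one of two common conventions.

```python
import statistics


def min_max_scale(values):
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Standardize values using (x - mean) / std (population std assumed)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical raw columns on very different scales.
heights_cm = [150, 160, 170, 180, 200]
incomes_usd = [20_000, 50_000, 80_000, 110_000, 200_000]

heights_scaled = min_max_scale(heights_cm)
incomes_scaled = min_max_scale(incomes_usd)
heights_std = z_score(heights_cm)
```

After min-max scaling both columns span the same interval, so neither dominates a distance calculation purely because of its units.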
Comparing feature vectors is central to many machine learning tasks, including classification, clustering, and information retrieval. The choice of distance or similarity metric depends on the data type and the problem.
Euclidean distance is the straight-line distance between two points in n-dimensional space:
d(x, y) = sqrt( sum_i (x_i - y_i)^2 )
Euclidean distance is intuitive and widely used, but it can become less discriminative in very high-dimensional spaces because distances between all pairs of points tend to converge (a phenomenon related to the curse of dimensionality).
Cosine similarity measures the cosine of the angle between two vectors:
cos(x, y) = (x . y) / (||x|| * ||y||)
Values range from -1 (opposite directions) to 1 (same direction), with 0 indicating orthogonality. Cosine similarity ignores vector magnitude and focuses on orientation, making it well-suited for text and other high-dimensional data where document length should not affect comparison.[7]
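A minimal sketch of the formula, showing that scaling a vector leaves the result unchanged. The example vectors are arbitrary and stand in for term counts of a short and a long document.

```python
import math


def cosine_similarity(x, y):
    """Return (x . y) / (||x|| * ||y||) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


short_doc = [1, 2, 0]
long_doc = [10, 20, 0]  # same direction, ten times the magnitude

same_direction = cosine_similarity(short_doc, long_doc)  # close to 1
orthogonal = cosine_similarity([1, 0], [0, 1])           # 0
opposite = cosine_similarity([1, 2], [-1, -2])           # close to -1
```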
Also called L1 distance or taxicab distance, Manhattan distance sums the absolute differences along each dimension:
d(x, y) = sum_i |x_i - y_i|
Manhattan distance can be more robust than Euclidean distance in high-dimensional spaces and is a natural choice when features represent counts or frequencies.
Minkowski distance is a generalization of both Euclidean and Manhattan distance:
d(x, y) = ( sum_i |x_i - y_i|^p )^(1/p)
When p=1, this reduces to Manhattan distance. When p=2, it becomes Euclidean distance. Adjusting p allows control over how much weight is given to large differences along individual dimensions.
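The two special cases can be checked with a short sketch. The function name and the example points are illustrative choices.

```python
def minkowski_distance(x, y, p):
    """Return (sum_i |x_i - y_i|^p)^(1/p) for p >= 1."""
    if p < 1:
        raise ValueError("p must be at least 1 for a valid distance metric")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


a = [0, 0]
b = [3, 4]

manhattan = minkowski_distance(a, b, p=1)  # |3| + |4|
euclidean = minkowski_distance(a, b, p=2)  # sqrt(3^2 + 4^2)
cubic = minkowski_distance(a, b, p=3)      # the largest difference counts more
```

As p grows, the result moves toward the largest single coordinate difference, which is why the three values decrease from `manhattan` to `cubic`.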
The dot product measures both directional similarity and magnitude:
x . y = sum_i (x_i * y_i)
When vectors are normalized to unit length, the dot product is equivalent to cosine similarity. Many modern vector databases use dot product as their default similarity metric because of its computational efficiency.[10]
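The equivalence for unit-length vectors can be verified directly. This is a small sketch with arbitrary example vectors; `l2_normalize` is an illustrative helper name.

```python
import math


def dot(x, y):
    """Return sum_i (x_i * y_i)."""
    return sum(a * b for a, b in zip(x, y))


def l2_normalize(x):
    """Divide a vector by its Euclidean norm so that its length becomes 1."""
    norm = math.sqrt(dot(x, x))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [a / norm for a in x]


x = [3.0, 4.0]
y = [4.0, 3.0]

raw_dot = dot(x, y)  # depends on magnitude as well as direction
cosine = raw_dot / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))
unit_dot = dot(l2_normalize(x), l2_normalize(y))  # matches cosine
```

Normalizing once at indexing time lets a system use the cheaper dot product at query time while still ranking by cosine similarity.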
| Metric | Considers magnitude | Works well in high dimensions | Common use cases |
|---|---|---|---|
| Euclidean distance | Yes | Limited (distances converge) | Low to moderate dimensional data; spatial data |
| Cosine similarity | No | Yes | Text similarity; document retrieval; embeddings |
| Manhattan distance | Yes | Better than Euclidean | Sparse data; count-based features |
| Dot product | Yes | Yes (when normalized) | Neural network outputs; vector databases |
High-dimensional feature vectors can suffer from the curse of dimensionality, where data becomes sparse and distances become less meaningful. Dimensionality reduction techniques project feature vectors into a lower-dimensional space while preserving as much relevant information as possible.
Principal component analysis is a linear method that finds the directions of maximum variance in the data and projects vectors onto those directions (called principal components). PCA is computationally efficient, preserves global structure, and provides interpretable components. It is commonly used as a preprocessing step before training classifiers or for visualization.
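The idea of finding the direction of maximum variance can be sketched without external libraries. This example estimates only the first principal component and uses power iteration on the covariance matrix, which is a simplification chosen for brevity; production code would normally rely on an eigendecomposition or SVD routine. The data points are invented and lie roughly along the line y = 2x.

```python
import math


def center(X):
    """Subtract each column's mean from that column."""
    n = len(X[0])
    means = [sum(row[j] for row in X) / len(X) for j in range(n)]
    return [[row[j] - means[j] for j in range(n)] for row in X]


def covariance(Xc):
    """Sample covariance matrix of already-centered data."""
    m, n = len(Xc), len(Xc[0])
    return [
        [sum(row[i] * row[j] for row in Xc) / (m - 1) for j in range(n)]
        for i in range(n)
    ]


def first_principal_component(X, iterations=200):
    """Approximate the top eigenvector of the covariance matrix."""
    C = covariance(center(X))
    n = len(C)
    v = [1.0] * n  # arbitrary starting direction
    for _ in range(iterations):
        w = [sum(C[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(a * a for a in w))
        if norm == 0:
            break
        v = [a / norm for a in w]
    return v


def project(X, component):
    """Project centered data onto one component, giving one score per row."""
    return [sum(a * b for a, b in zip(row, component)) for row in center(X)]


points = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
pc1 = first_principal_component(points)
scores = project(points, pc1)  # 2-D points reduced to 1-D
```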
t-distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear technique designed for visualization.[12] It converts high-dimensional pairwise distances into probability distributions and minimizes the Kullback-Leibler divergence between the original and low-dimensional distributions.[12] t-SNE excels at preserving local neighborhood structure, making it popular for visualizing clusters in two or three dimensions. However, it is computationally expensive for large datasets and does not preserve global distances.
Uniform Manifold Approximation and Projection (UMAP) is a newer nonlinear method that is typically faster than t-SNE and tends to preserve more of the global structure while retaining local structure.[11] UMAP constructs a topological representation of the high-dimensional data and optimizes a low-dimensional layout that matches it.[11] It has become a popular choice for exploring feature spaces in natural language processing and computer vision.
An autoencoder is a neural network trained to compress its input into a lower-dimensional bottleneck layer and then reconstruct the original input.[2] The bottleneck activations serve as a reduced feature vector. Variational autoencoders (VAE) extend this approach by learning a probabilistic latent space.
| Method | Type | Preserves global structure | Preserves local structure | Scalability | Typical use |
|---|---|---|---|---|---|
| PCA | Linear | Yes | Partially | High | Preprocessing; feature decorrelation |
| t-SNE | Nonlinear | No | Yes | Low to moderate | 2D/3D visualization of clusters |
| UMAP | Nonlinear | Yes | Yes | High | Visualization and downstream tasks |
| Autoencoder | Nonlinear (learned) | Depends on architecture | Yes | Moderate | Learned compression; generative models |
The phrase "curse of dimensionality," coined by Richard Bellman in 1961, refers to a set of problems that arise when working with high-dimensional feature vectors.[5]
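As the number of dimensions increases, the volume of the feature space grows exponentially. This leads to several practical consequences: a fixed number of data points becomes increasingly sparse in the space, and distances between pairs of points tend to converge, so distance-based comparisons become less discriminative.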
Mitigation strategies include feature selection (choosing a subset of relevant features), feature extraction (creating new lower-dimensional features via PCA or autoencoders), and using algorithms that are naturally robust to high dimensionality, such as random forests.
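The convergence of distances in high dimensions can be observed in a small simulation. This sketch draws uniformly random points, which is an assumption; real feature vectors usually have more structure, so the effect may be weaker in practice. The function name `relative_contrast` is illustrative.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest distance from a random query."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


# In low dimensions the farthest point is many times farther than the nearest;
# in high dimensions all points sit at nearly the same distance.
contrast_low = relative_contrast(dim=2)
contrast_high = relative_contrast(dim=1000)
```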
Feature vectors are used across virtually every domain of machine learning and artificial intelligence.
In computer vision, feature vectors represent images or regions within images. Applications include:
In natural language processing, feature vectors enable text to be processed mathematically.
In recommendation systems, both users and items are represented as feature vectors.
Feature vectors represent biological data such as gene expression profiles, protein sequences, and molecular structures. Researchers use these vectors for drug discovery, protein function prediction, and disease classification.
Feature vectors that fall far from the normal distribution of training data can be flagged as anomalies. Applications include fraud detection in financial transactions, intrusion detection in network security, and quality control in manufacturing.
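One simple way to flag such vectors is to measure each vector's distance from the centroid of the training data and compare it with a threshold. The sketch below is only one of many possible approaches; the transaction features, the sample values, and the "mean plus three standard deviations" threshold are all assumptions, and the features are left unscaled for brevity.

```python
import math
import statistics


def centroid(vectors):
    """Return the per-dimension mean of a list of feature vectors."""
    n = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(n)]


def fit_threshold(train, k=3.0):
    """Return the centroid and a distance threshold of mean + k * std."""
    c = centroid(train)
    dists = [math.dist(v, c) for v in train]
    return c, statistics.fmean(dists) + k * statistics.pstdev(dists)


def is_anomaly(vector, c, threshold):
    """Flag a vector whose distance from the centroid exceeds the threshold."""
    return math.dist(vector, c) > threshold


# Hypothetical transactions: [amount_usd, hour_of_day]
normal = [
    [20.0, 9.0], [25.0, 10.0], [22.0, 11.0],
    [30.0, 12.0], [27.0, 14.0], [24.0, 13.0],
]
c, threshold = fit_threshold(normal)

flag_typical = is_anomaly([26.0, 12.0], c, threshold)  # near the training data
flag_outlier = is_anomaly([950.0, 3.0], c, threshold)  # far from the training data
```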
In production machine learning systems, feature vectors must be served consistently and with low latency. A feature store is a specialized data platform that manages the computation, storage, and retrieval of feature vectors for both training and real-time inference.
Key capabilities of a feature store include:
Popular open-source feature stores include Feast, while managed solutions are offered by Databricks, Hopsworks, and Tecton.
The proliferation of dense feature vectors has driven the development of specialized vector databases designed for efficient similarity search. Given a query vector, these systems find the most similar vectors in a large collection.
Exact nearest neighbor search compares the query against every stored vector and guarantees finding the true closest match, but it scales poorly to millions or billions of vectors. Approximate nearest neighbor (ANN) algorithms trade a small amount of accuracy for large speed improvements.[10]
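Exact search amounts to a linear scan over every stored vector, as in this minimal sketch. It uses Euclidean distance and a tiny invented collection; the cost grows with the number of stored vectors, which is the scaling problem that ANN methods address.

```python
import math


def exact_nearest_neighbors(query, vectors, k=1):
    """Return the k (index, distance) pairs closest to the query."""
    scored = [(i, math.dist(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # every vector is compared and ranked
    return scored[:k]


stored = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
neighbors = exact_nearest_neighbors([1.0, 1.1], stored, k=2)
nearest_indices = [index for index, _ in neighbors]
```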
Common ANN techniques include:
Prominent vector database implementations include FAISS (developed by Meta), Pinecone, Milvus, Weaviate, and Qdrant.[10]
The concept of representing objects as numerical vectors predates machine learning itself. In statistics, multivariate data has been organized into vectors and matrices since the early 20th century. Ronald Fisher's 1936 work on discriminant analysis used feature vectors to classify species, producing one of the most well-known datasets in machine learning: the iris dataset contains 150 samples (50 from each of three species) described by four measurements per flower, sepal length, sepal width, petal length, and petal width.[6] Each flower is therefore a four-dimensional feature vector, and Fisher used these vectors to demonstrate linear discriminant analysis.[6]
During the 1950s and 1960s, the development of the perceptron and early pattern recognition systems formalized the use of feature vectors as input to learning algorithms. The field of information retrieval adopted vector space models in the 1970s, representing documents and queries as TF-IDF vectors and ranking documents by cosine similarity.[7]
The 2000s and 2010s saw a shift from hand-engineered features toward learned representations. Word2Vec (Mikolov et al., 2013) demonstrated that neural networks could learn dense word vectors capturing semantic relationships.[3] The success of deep convolutional networks in image recognition (AlexNet, 2012) showed that CNNs could automatically learn feature vectors far more effective than handcrafted alternatives.[8] The transformer architecture (Vaswani et al., 2017)[9] and models like BERT (Devlin et al., 2018)[4] extended learned feature vectors to full sentences and documents, enabling a new generation of NLP applications.
Consider a dataset of 28x28 grayscale images of handwritten digits (the MNIST dataset). Each image can be converted into a feature vector in several ways:
Approach 1: Raw pixel features
Flatten the 28x28 pixel grid into a 784-dimensional vector (28 x 28 = 784). Each dimension holds a pixel intensity value, conventionally between 0 and 255 (or normalized to the range 0 to 1):
x = [0, 0, 0, ..., 128, 255, 230, ..., 0, 0, 0] (784 values)
Approach 2: Statistical features
Compute summary statistics from the pixel values:
x = [mean_intensity, std_deviation, skewness, kurtosis]
x = [123.4, 10.2, 0.5, 2.0]
This produces a compact 4-dimensional vector but discards spatial information.
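The first two approaches can be sketched on a synthetic image. The vertical stroke below is an invented stand-in for a real MNIST digit, and the skewness and kurtosis are computed as plain standardized moments (non-excess kurtosis, population standard deviation), which is one of several conventions.

```python
import statistics


def flatten(image):
    """Turn a 2-D grid of pixels into a single list, row by row."""
    return [pixel for row in image for pixel in row]


def summary_features(pixels):
    """Return [mean, std, skewness, kurtosis] of the pixel intensities."""
    mean = statistics.fmean(pixels)
    std = statistics.pstdev(pixels)
    if std == 0:
        return [mean, 0.0, 0.0, 0.0]
    standardized = [(p - mean) / std for p in pixels]
    skewness = statistics.fmean([z ** 3 for z in standardized])
    kurtosis = statistics.fmean([z ** 4 for z in standardized])
    return [mean, std, skewness, kurtosis]


# Synthetic 28x28 image: a bright vertical stroke on a black background.
image = [[255 if 12 <= col <= 15 else 0 for col in range(28)] for _ in range(28)]

raw_vector = flatten(image)                          # Approach 1: 784 values
normalized_vector = [p / 255 for p in raw_vector]    # rescaled to 0..1
stats_vector = summary_features(raw_vector)          # Approach 2: 4 values
```

Shifting the stroke sideways would change `raw_vector` but leave `stats_vector` untouched, which illustrates the loss of spatial information.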
Approach 3: CNN-extracted features
Pass the image through a pre-trained convolutional neural network and extract activations from a hidden layer, producing a dense vector (e.g., 128 or 256 dimensions) that captures learned visual patterns like edges, curves, and stroke thickness.[8]
A classification algorithm (such as a softmax classifier or a random forest) receives the feature vector as input and predicts the digit label (0 through 9).[14] The quality of the feature vector directly affects classification accuracy: CNN-extracted features typically outperform raw pixels, which in turn outperform simple statistical summaries.