Feature Vector

Updated Jul 11, 2026

A feature vector is an n-dimensional, ordered list of numerical values that represents the measurable properties of an object, data point, or observation in a format suitable for processing by machine learning algorithms.^[1]^[14]^[15] Each element corresponds to a specific feature, and the complete vector serves as the numerical identity of that data point within a mathematical space called the feature space; as the Wikipedia definition states, "A feature vector is an n-dimensional vector of numerical features that represent some object," and "the vector space associated with these vectors is often called the feature space."^[15] Many learning algorithms require this numerical form because, in Wikipedia's words, "such representations facilitate processing and statistical analysis."^[15]

Feature vectors are one of the most fundamental data structures in machine learning and pattern recognition. They act as the bridge between raw, unstructured data (images, text, audio) and the mathematical operations that learning algorithms perform. Without a consistent numerical representation, algorithms would have no way to compare, group, or learn from data. In practice the values are rarely raw: Google's Machine Learning Crash Course notes that "feature vectors seldom use the dataset's raw values," and that engineers "must typically process the dataset's values into representations that your model can better learn from."^[14]

Explain like I'm 5 (ELI5)

Imagine you want to teach a robot to sort fruit. You cannot just show the robot an apple and say "this is an apple," because the robot only understands numbers. So instead, you describe the apple using a list of numbers: how red it is, how round it is, how heavy it is, and how sweet it is. That list of numbers might look like [8, 9, 150, 7]. A banana would get a different list, maybe [2, 3, 120, 6], because it is yellow instead of red and long instead of round.

That list of numbers is a feature vector. Every fruit gets its own list. The robot compares these lists to figure out which fruits are similar and which are different. Apples have similar lists to each other, and bananas have similar lists to each other, so the robot learns to tell them apart.

What is the formal definition of a feature vector?

A feature vector is an element of a vector space, typically denoted $\mathbf{x}$ (boldface lowercase) and belonging to $\mathbb{R}^n$, where $n$ is the number of features. For a single observation, the feature vector is written as:

$$\mathbf{x} = [x_1, x_2, x_3, \ldots, x_n]^\top$$

where each $x_i$ is the value of the $i$-th feature and $\top$ denotes the transpose (indicating a column vector by convention). In Google's terminology, the number of elements $n$ is called the dimension of the feature vector, and a model "acts on" data through these floating-point arrays.^[14]

When a dataset contains $m$ observations, each with $n$ features, the individual feature vectors are stacked row by row into a design matrix (also called a feature matrix) $X$ of dimensions $m \times n$:

$$X = \begin{bmatrix} \mathbf{x}_1^\top \\ \mathbf{x}_2^\top \\ \vdots \\ \mathbf{x}_m^\top \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{bmatrix}$$

Each row of the design matrix is one feature vector (one data point), and each column contains all observed values for a single feature across the dataset.^[1] This row-in-a-design-matrix view is the standard way supervised learning frames its input: a model is fit on the $m \times n$ matrix $X$ paired with a target vector $\mathbf{y}$ of length $m$.
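
The row and column view can be illustrated with plain Python lists. This is only a sketch: the fruit measurements, the labels, and the helper names `design_matrix_shape` and `column` are made up for illustration and do not come from any library.

```python
# Hypothetical toy data: three observations, each with four features
# (redness, roundness, weight in grams, sweetness), echoing the fruit example.
feature_vectors = [
    [8.0, 9.0, 150.0, 7.0],   # apple
    [2.0, 3.0, 120.0, 6.0],   # banana
    [7.0, 9.0, 140.0, 8.0],   # another apple (made-up values)
]
targets = ["apple", "banana", "apple"]  # target vector y, one label per row


def design_matrix_shape(rows):
    """Return (m, n) and check that every feature vector has the same length."""
    m = len(rows)
    n = len(rows[0])
    if any(len(row) != n for row in rows):
        raise ValueError("all feature vectors must have the same dimension")
    return m, n


def column(rows, j):
    """Return all observed values of feature j (one column of X)."""
    return [row[j] for row in rows]


X = feature_vectors            # stacking row by row gives the m x n matrix
m, n = design_matrix_shape(X)  # m observations, n features
weights = column(X, 2)         # every observed weight across the dataset
```

Reading a row gives one data point; reading a column gives one feature across all data points, which is the view most preprocessing steps operate on.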

Types of feature vectors

Feature vectors can be broadly classified by how their values are distributed across their dimensions.

Sparse feature vectors

A sparse representation is a vector where the majority of elements are zero. Sparse vectors arise naturally in text processing and categorical encoding.

Technique	Description	Sparsity level	Typical dimensionality
One-hot encoding	A single 1 in the position corresponding to the category; all other entries are 0	Very high	Equal to vocabulary or category count
Bag of words	Each dimension is a word count for that term in the document	High	Vocabulary size (often 10,000+)
TF-IDF	Weights words by term frequency multiplied by inverse document frequency	High	Vocabulary size
Binary feature vectors	Each dimension is 0 or 1 indicating presence or absence of a feature	Moderate to high	Varies

One-hot encoding is the canonical way categorical attributes enter a feature vector. Google's Machine Learning Crash Course explains that "each category is represented by a vector of N elements, where N is the number of categories," with "exactly one of the elements" set to 1.0 and all others to 0.0; the one-hot vector, not the original string or index, "gets passed to the feature vector, and the model learns a separate weight for each element."^[16] So a categorical attribute such as car_color with eight possible values expands into eight feature-vector dimensions.^[16]
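
A minimal sketch of one-hot encoding for the car_color example follows. The eight colour names and the `one_hot` helper are assumptions chosen for illustration; only the count of eight categories comes from the text above.

```python
# Hypothetical vocabulary of eight colours for the car_color attribute.
CAR_COLORS = ["red", "orange", "blue", "yellow", "green", "black", "purple", "brown"]


def one_hot(value, vocabulary):
    """Return len(vocabulary) floats: 1.0 at the category's position, 0.0 elsewhere."""
    if value not in vocabulary:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if category == value else 0.0 for category in vocabulary]


blue_vector = one_hot("blue", CAR_COLORS)
```

Each position in `blue_vector` is a separate feature-vector dimension, so a model can learn one weight per colour.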

Sparse vectors are memory-efficient when stored in specialized data structures (such as compressed sparse row format) that record only the non-zero entries. Treated as ordinary dense arrays, they become expensive in standard matrix operations because of their high dimensionality.
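
To make the storage idea concrete, here is a simplified sketch of the compressed sparse row layout using three plain lists. The function names `to_csr` and `csr_row_dot` are hypothetical; production code would normally rely on a sparse-matrix library instead.

```python
def to_csr(rows):
    """Convert a dense list-of-lists matrix to CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for j, value in enumerate(row):
            if value != 0:
                data.append(value)      # the non-zero value itself
                indices.append(j)       # its column position
        indptr.append(len(data))        # where the next row starts in data/indices
    return data, indices, indptr


def csr_row_dot(data, indices, indptr, i, dense_vector):
    """Dot product of row i with a dense vector, touching only stored entries."""
    start, end = indptr[i], indptr[i + 1]
    return sum(data[k] * dense_vector[indices[k]] for k in range(start, end))


dense = [
    [0, 0, 3, 0, 0],
    [1, 0, 0, 0, 2],
    [0, 0, 0, 0, 0],
]
data, indices, indptr = to_csr(dense)
row1_sum = csr_row_dot(data, indices, indptr, 1, [1, 1, 1, 1, 1])
```

Only three values are stored for fifteen matrix cells, and the dot product for a row skips every zero.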

Dense feature vectors

Dense feature vectors have most or all elements set to non-zero floating-point values. They are typically lower-dimensional and encode information more compactly than sparse vectors. Examples include:

Word embeddings produced by Word2Vec^[3] or GloVe^[13] (commonly 100 to 300 dimensions)
Image feature vectors extracted from convolutional neural networks (e.g., 512 or 2048 dimensions from ResNet)
Sentence or document embeddings from BERT^[4] or other transformer models (768 or 1024 dimensions)

Dense vectors capture semantic relationships more effectively than sparse representations.^[2] In a well-trained embedding space, vectors for semantically similar items are close together, enabling algorithms to generalize from limited examples.^[3]

How do feature vectors differ from embeddings?

The terms "feature vector" and "embedding" are sometimes used interchangeably, but they have distinct origins and connotations. An embedding is a learned, dense feature vector: Google defines it as "a vector representation of data in embedding space" and notes that "embeddings make it easier to do machine learning on large feature vectors," because the position of points in that space encodes meaning, so "words that are used in similar contexts will be closer to each other in embedding space."^[17]

Aspect	Feature vector (traditional)	Embedding
Creation method	Manual feature engineering or rule-based extraction	Learned automatically by a neural network during training
Dimensionality	Often high-dimensional and sparse	Typically low-dimensional and dense
Interpretability	Individual dimensions often have clear meaning (e.g., word count, pixel intensity)	Individual dimensions usually lack direct interpretation
Semantic relationships	May not capture similarity between concepts	Designed to place similar items near each other in vector space
Generalization	Tied to specific domain assumptions	Often transfer well across tasks via transfer learning
Example	TF-IDF vector for a document	BERT embedding for a sentence

In modern practice, embeddings are a specific type of feature vector. The broader term "feature vector" covers both hand-crafted and learned representations.^[2] A useful rule of thumb: every embedding is a feature vector, but not every feature vector is an embedding. Embeddings are commonly used to turn high-cardinality one-hot inputs into compact dense feature vectors before they enter a downstream network.^[17]
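
One way to see how an embedding turns a one-hot input into a dense vector: multiplying a one-hot vector by an embedding table selects a single row of that table. The sketch below assumes a made-up four-word vocabulary and hand-written table values; in a real model the table would be learned during training.

```python
# Hypothetical 4-word vocabulary and a made-up 4 x 2 embedding table.
VOCAB = ["cat", "dog", "car", "truck"]
EMBEDDING_TABLE = [
    [0.9, 0.1],   # cat
    [0.8, 0.2],   # dog
    [0.1, 0.9],   # car
    [0.2, 0.8],   # truck
]


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def vector_times_matrix(vector, matrix):
    """Multiply a row vector of length N by an N x d matrix."""
    d = len(matrix[0])
    return [sum(vector[i] * matrix[i][j] for i in range(len(vector))) for j in range(d)]


def embed(word):
    """Direct row lookup: the cheap way to get the same result."""
    return EMBEDDING_TABLE[VOCAB.index(word)]


dog_one_hot = one_hot(VOCAB.index("dog"), len(VOCAB))
dog_via_matmul = vector_times_matrix(dog_one_hot, EMBEDDING_TABLE)
dog_via_lookup = embed("dog")
```

Both paths give the same two-dimensional vector, which is why embedding layers are usually implemented as table lookups rather than matrix products.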

How are feature vectors created?

The process of converting raw data into feature vectors depends on the data type and the problem domain.

From tabular data

For structured data (spreadsheets, databases), each column is typically a feature. Numerical columns can be used directly, while categorical columns require encoding:

Numerical features: Use values as-is or apply scaling (see the section on normalization below).
Categorical features: Apply one-hot encoding, label encoding, or target encoding.
Missing values: Impute with mean, median, mode, or a learned value.
Derived features: Create new features through arithmetic combinations, binning, or domain-specific transformations.
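
The four steps above can be sketched on a tiny table. The column names, city vocabulary, and the derived feature are all hypothetical and chosen only to show imputation, one-hot encoding, and a derived column in one place.

```python
from statistics import median

# Hypothetical rows from a table; None marks a missing value.
rows = [
    {"age": 34, "income": 52000.0, "city": "Lyon"},
    {"age": None, "income": 61000.0, "city": "Oslo"},
    {"age": 51, "income": 48000.0, "city": "Lyon"},
]
CITIES = ["Lyon", "Oslo"]  # assumed category vocabulary

# Impute missing ages with the median of the observed ages.
age_fill = median(r["age"] for r in rows if r["age"] is not None)


def to_feature_vector(row):
    age = float(row["age"] if row["age"] is not None else age_fill)
    city_one_hot = [1.0 if row["city"] == c else 0.0 for c in CITIES]
    income_per_year_of_age = row["income"] / age   # an illustrative derived feature
    return [age, row["income"], *city_one_hot, income_per_year_of_age]


vectors = [to_feature_vector(r) for r in rows]
```

Every row ends up as a five-dimensional vector with the same column order, which is what lets the rows be stacked into a design matrix.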

From text

Text data requires tokenization before numerical representation.

Bag of words: Count the frequency of each word in the document. The resulting vector has one dimension per vocabulary word.
TF-IDF: Weight term frequencies by inverse document frequency to downweight common words and highlight distinctive terms.
Word2Vec / GloVe: Map each word to a dense vector learned from co-occurrence statistics.^[3]^[13] Document-level vectors can be formed by averaging word vectors.
Transformer-based models: Pass text through models like BERT to produce contextualized dense vectors that account for word order and meaning.^[4]
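
The two count-based methods can be sketched with the standard library alone. The three documents are invented, the tokenizer is a naive whitespace split, and the IDF formula used here is the plain log(N / df); libraries commonly apply smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [doc.split() for doc in docs]             # naive whitespace tokenizer
vocabulary = sorted({tok for doc in tokenized for tok in doc})


def bag_of_words(tokens):
    """One count per vocabulary term, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def idf(term):
    """Plain log(N / df), where df is the number of documents containing term."""
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(len(tokenized) / df)


def tf_idf(tokens):
    return [count * idf(term) for count, term in zip(bag_of_words(tokens), vocabulary)]


bow_0 = bag_of_words(tokenized[0])
tfidf_0 = tf_idf(tokenized[0])
```

In the first document, "cat" appears in only one document and so receives a higher weight than "sat" or "the", which are shared with the second document.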

From images

Image feature vectors can be constructed manually or extracted using deep learning.

Pixel-level features: Flatten the image into a vector of raw pixel intensities. A 28x28 grayscale image (like those in the MNIST dataset) becomes a 784-dimensional vector.
Handcrafted descriptors: Compute histograms of oriented gradients (HOG), scale-invariant feature transform (SIFT) descriptors, or local binary patterns (LBP).
CNN features: Pass the image through a pre-trained convolutional neural network and extract activations from an intermediate layer.^[8] For example, running a 224 x 224 image through VGG16 up to the final pooling layer yields a 25,088-dimensional feature vector (7 x 7 x 512 flattened), which can then be used as input to a downstream classifier.
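
The pixel-level case and the dimension arithmetic can be checked in a few lines. The image below is a made-up all-black grid with a single bright pixel, and `flatten` is a hypothetical helper using row-major order.

```python
def flatten(image):
    """Row-major flattening of a 2D grid of pixel intensities."""
    return [pixel for row in image for pixel in row]


# Hypothetical all-black 28 x 28 image with one bright pixel at row 14, column 10.
image = [[0] * 28 for _ in range(28)]
image[14][10] = 255

x = flatten(image)
bright_index = x.index(255)     # row-major position: 14 * 28 + 10

# Size of a flattened 7 x 7 x 512 activation volume, as in the VGG16 example.
vgg16_pool_dim = 7 * 7 * 512
```

Flattening keeps every pixel value but discards the explicit two-dimensional neighbourhood structure, which is one reason learned CNN features often work better than raw pixels.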

From audio

Audio data is commonly transformed into spectral feature vectors.

Mel-frequency cepstral coefficients (MFCCs): A compact representation of the short-term power spectrum of sound, widely used in speech recognition.
Spectrograms: Time-frequency representations that can be treated as images and processed by CNNs.
Learned features: Models like wav2vec extract dense feature vectors directly from raw audio waveforms.

Why does feature scaling and normalization matter?

Raw feature values often have different scales. A height measured in centimeters might range from 150 to 200, while income in dollars might range from 20,000 to 200,000. Many algorithms (particularly distance-based ones like k-nearest neighbors and support vector machines) are sensitive to scale differences, making normalization an important preprocessing step.^[14] Scaling is one example of why "feature vectors seldom use the dataset's raw values": unscaled features let a single large-range dimension dominate the distance calculation.^[14]

Technique	Formula	Output range	Best suited for
Min-max scaling	$\frac{x - \min}{\max - \min}$	[0, 1]	When features need a bounded range; neural networks with sigmoid outputs
Z-score standardization	$\frac{x - \mathrm{mean}}{\mathrm{std}}$	Unbounded (mean=0, std=1)	Algorithms assuming normally distributed features; principal component analysis
Max-abs scaling	$\frac{x}{\max(\lvert x \rvert)}$	[-1, 1]	Sparse data where zero entries should remain zero
Robust scaling	$\frac{x - \mathrm{median}}{\mathrm{IQR}}$	Unbounded	Data with outliers, since median and IQR are less affected by extreme values
L2 normalization	$x / \lVert x \rVert_2$	Unit length (norm = 1)	When direction matters more than magnitude; cosine similarity comparisons
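
Three of the formulas in the table are sketched below. Note that min-max scaling and z-score standardization are applied to one feature across the dataset (a column), while L2 normalization is applied to one feature vector (a row). The sample values are made up, and the helpers assume a non-constant column and a non-zero vector, so they do not guard against division by zero.

```python
import math
from statistics import mean, pstdev


def min_max_scale(values):
    """(x - min) / (max - min); assumes max > min."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """(x - mean) / std, using the population standard deviation; assumes std > 0."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def l2_normalize(vector):
    """x / ||x||_2; assumes the vector is not all zeros."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


heights_cm = [150.0, 160.0, 170.0, 180.0, 200.0]   # made-up column of one feature
scaled = min_max_scale(heights_cm)
standardized = z_score(heights_cm)
unit = l2_normalize([3.0, 4.0])
```

In a real pipeline the minimum, maximum, mean, and standard deviation would be computed on the training set only and then reused unchanged at inference time.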

How is similarity between feature vectors measured?

Comparing feature vectors is central to many machine learning tasks, including classification, clustering, and information retrieval. The choice of distance or similarity metric depends on the data type and the problem.

Euclidean distance

The straight-line distance between two points in n-dimensional space:

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}$$

Euclidean distance is intuitive and widely used, but it can become less discriminative in very high-dimensional spaces because distances between all pairs of points tend to converge (a phenomenon related to the curse of dimensionality).

Cosine similarity

Cosine similarity measures the cosine of the angle between two vectors:

$$\cos(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\lVert \mathbf{x} \rVert \, \lVert \mathbf{y} \rVert}$$

Values range from -1 (opposite directions) to 1 (same direction), with 0 indicating orthogonality. Cosine similarity ignores vector magnitude and focuses on orientation, making it well-suited for text and other high-dimensional data where document length should not affect comparison.^[7]
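
A short sketch shows the three reference cases (same direction, orthogonal, opposite). The vectors are invented, and `cosine_similarity` is a hypothetical helper that assumes neither input is the zero vector.

```python
import math


def cosine_similarity(x, y):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]     # same direction, ten times the magnitude
unrelated = [0.0, 0.0, 5.0]      # shares no non-zero dimension with the others

same_direction = cosine_similarity(short_doc, long_doc)
orthogonal = cosine_similarity(short_doc, unrelated)
opposite = cosine_similarity(short_doc, [-1.0, -2.0, 0.0])
```

Scaling `short_doc` by ten leaves the similarity at 1 (up to floating-point rounding), which is the length-invariance property described above.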

Manhattan distance

Also called L1 distance or taxicab distance, Manhattan distance sums the absolute differences along each dimension:

$$d(\mathbf{x}, \mathbf{y}) = \sum_i \lvert x_i - y_i \rvert$$

Manhattan distance can be more robust than Euclidean distance in high-dimensional spaces and is a natural choice when features represent counts or frequencies.

Minkowski distance

A generalization of both Euclidean and Manhattan distance:

$$d(\mathbf{x}, \mathbf{y}) = \left(\sum_i \lvert x_i - y_i \rvert^p\right)^{1/p}$$

When $p = 1$, this reduces to Manhattan distance. When $p = 2$, it becomes Euclidean distance. Adjusting $p$ allows control over how much weight is given to large differences along individual dimensions.
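
Because Manhattan and Euclidean distance are special cases, one small function covers all three. The sketch below uses a 3-4-5 right triangle so the results are easy to verify by hand; `minkowski` is a hypothetical helper name.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p (p >= 1) between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


x = [0.0, 0.0]
y = [3.0, 4.0]
manhattan = minkowski(x, y, 1)   # 3 + 4
euclidean = minkowski(x, y, 2)   # sqrt(9 + 16)
large_p = minkowski(x, y, 50)    # approaches the largest single difference, 4
```

As p grows, the largest per-dimension difference dominates the sum, which is what "weight given to large differences" means in practice.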

Dot product

The dot product measures both directional similarity and magnitude:

$$\mathbf{x} \cdot \mathbf{y} = \sum_i x_i y_i$$

When vectors are normalized to unit length, the dot product is equivalent to cosine similarity. Vector search libraries such as FAISS support inner-product search, and the dot product is a common similarity metric in vector databases because it is cheap to compute.^[10]

Metric	Considers magnitude	Works well in high dimensions	Common use cases
Euclidean distance	Yes	Limited (distances converge)	Low to moderate dimensional data; spatial data
Cosine similarity	No	Yes	Text similarity; document retrieval; embeddings
Manhattan distance	Yes	Better than Euclidean	Sparse data; count-based features
Dot product	Yes	Yes (when normalized)	Neural network outputs; vector databases

Dimensionality reduction

High-dimensional feature vectors can suffer from the curse of dimensionality, where data becomes sparse and distances become less meaningful. Dimensionality reduction techniques project feature vectors into a lower-dimensional space while preserving as much relevant information as possible.

Principal component analysis (PCA)

Principal component analysis is a linear method that finds the directions of maximum variance in the data and projects vectors onto those directions (called principal components). PCA is computationally efficient, preserves global structure, and provides interpretable components. It is commonly used as a preprocessing step before training classifiers or for visualization.
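
As a rough illustration, the first principal component can be found by centering the data, forming the covariance matrix, and running power iteration on it. This is a sketch under stated assumptions, not how PCA libraries work internally (they typically use a singular value decomposition): the data points are made up, and the starting vector is assumed not to be orthogonal to the top component.

```python
import math


def center(rows):
    """Subtract the per-feature mean from every row."""
    n = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(n)]
    return [[r[j] - means[j] for j in range(n)] for r in rows]


def covariance(centered):
    """Sample covariance matrix (n x n) of already-centered rows."""
    m, n = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (m - 1) for j in range(n)]
            for i in range(n)]


def first_principal_component(cov, iterations=200):
    """Power iteration: repeatedly apply cov and renormalize."""
    n = len(cov)
    v = [1.0] * n   # starting guess
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v


# Made-up 2D points lying close to the line y = x.
points = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1]]
centered = center(points)
pc1 = first_principal_component(covariance(centered))
projected = [sum(a * b for a, b in zip(row, pc1)) for row in centered]  # 1D features
```

Here `projected` is the reduced representation: each two-dimensional point is replaced by a single coordinate along the direction of maximum variance.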

t-SNE

t-distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear technique designed for visualization.^[12] It converts high-dimensional pairwise distances into probability distributions and minimizes the Kullback-Leibler divergence between the original and low-dimensional distributions.^[12] t-SNE excels at preserving local neighborhood structure, making it popular for visualizing clusters in two or three dimensions. However, it is computationally expensive for large datasets and does not preserve global distances.

UMAP

Uniform Manifold Approximation and Projection (UMAP) is a newer nonlinear method that is typically faster than t-SNE and arguably preserves more of the global structure while retaining local structure.^[11] UMAP constructs a topological representation of the high-dimensional data and optimizes a low-dimensional layout that matches it.^[11] It has become a popular choice for exploring feature spaces in natural language processing and computer vision.

Autoencoders

An autoencoder is a neural network trained to compress its input into a lower-dimensional bottleneck layer and then reconstruct the original input.^[2] The bottleneck activations serve as a reduced feature vector. Variational autoencoders (VAE) extend this approach by learning a probabilistic latent space.

Method	Type	Preserves global structure	Preserves local structure	Scalability	Typical use
PCA	Linear	Yes	Partially	High	Preprocessing; feature decorrelation
t-SNE	Nonlinear	No	Yes	Low to moderate	2D/3D visualization of clusters
UMAP	Nonlinear	Yes	Yes	High	Visualization and downstream tasks
Autoencoder	Nonlinear (learned)	Depends on architecture	Yes	Moderate	Learned compression; generative models

What is the curse of dimensionality?

The phrase "curse of dimensionality," coined by Richard Bellman in 1961, refers to a set of problems that arise when working with high-dimensional feature vectors.^[5]

As the number of dimensions increases, the volume of the feature space grows exponentially. This leads to several practical consequences:

Data sparsity: The amount of training data needed to adequately cover the space grows exponentially with dimensionality. A common heuristic suggests at least 5 to 10 training samples per feature dimension.
Distance concentration: In very high dimensions, the ratio of a point's distance to its nearest neighbor to its distance to its farthest neighbor approaches 1. This means distance-based algorithms like k-nearest neighbors lose their ability to distinguish close from distant points.
Overfitting: Models with many features relative to the number of training examples can memorize noise rather than learning true patterns. Regularization techniques (L1, L2) help counteract this.
Computational cost: Storage requirements, distance calculations, and training times all increase with dimensionality.

Mitigation strategies include feature selection (choosing a subset of relevant features), feature extraction (creating new lower-dimensional features via PCA or autoencoders), and using algorithms that are naturally robust to high dimensionality, such as random forests.
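
Distance concentration can be observed with a small simulation. The sketch below assumes points drawn uniformly from the unit hypercube and uses `math.dist`, which requires Python 3.8 or later; the point count, seed, and dimensions are arbitrary choices.

```python
import math
import random


def nearest_to_farthest_ratio(dim, n_points=200, seed=0):
    """Ratio of nearest to farthest distance from one query point to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return min(distances) / max(distances)


ratio_2d = nearest_to_farthest_ratio(2)
ratio_1000d = nearest_to_farthest_ratio(1000)
```

In two dimensions the nearest point is typically many times closer than the farthest one, so the ratio is small; in a thousand dimensions the ratio moves much closer to 1, which is the effect described in the list above.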

What are feature vectors used for?

Feature vectors are used across virtually every domain of machine learning and artificial intelligence.

Computer vision

In computer vision, feature vectors represent images or regions within images. Applications include:

Image recognition: Classifying images into categories based on their feature vectors.
Object detection: Representing candidate regions as feature vectors and classifying them as specific object types.
Image segmentation: Assigning each pixel a feature vector and grouping pixels with similar vectors.
Face recognition: Comparing face feature vectors to identify or verify individuals. Modern systems use CNN-extracted embeddings where the Euclidean distance between two face vectors indicates identity similarity.^[8]

Natural language processing

In natural language processing, feature vectors enable text to be processed mathematically.

Sentiment analysis: Representing documents as feature vectors and classifying them as positive, negative, or neutral.
Named entity recognition: Using word-level feature vectors to identify entities like people, organizations, and locations.
Machine translation: Encoding source sentences into feature vectors that a decoder converts into target language text.
Semantic search: Encoding queries and documents as dense feature vectors, then ranking documents by vector similarity to the query.^[7]

Recommendation systems

In recommendation systems, both users and items are represented as feature vectors.

Collaborative filtering: Users and items are embedded in a shared vector space. Similar users (close vectors) are assumed to enjoy similar items.
Content-based filtering: Items are described by feature vectors derived from their attributes (genre, keywords, price). A user profile vector is compared against item vectors to generate recommendations.
Matrix factorization: Decomposes a user-item interaction matrix into two lower-rank matrices, yielding user and item feature vectors in a latent space.

Bioinformatics

Feature vectors represent biological data such as gene expression profiles, protein sequences, and molecular structures. Researchers use these vectors for drug discovery, protein function prediction, and disease classification.

Anomaly detection

Feature vectors that fall far from the normal distribution of training data can be flagged as anomalies. Applications include fraud detection in financial transactions, intrusion detection in network security, and quality control in manufacturing.
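
One simple way to turn "far from the training distribution" into a number is to standardize each feature and measure the length of the resulting z-score vector. This is a sketch of that one approach only: the transaction data, the threshold of 3.0, and the helper names are assumptions, and the method ignores correlations between features.

```python
import math
from statistics import mean, pstdev

# Made-up "normal" training vectors: (transaction amount, hour of day).
train = [[20.0, 10.0], [25.0, 11.0], [22.0, 9.0], [24.0, 12.0], [21.0, 10.0]]

columns = list(zip(*train))             # one tuple of values per feature
mus = [mean(c) for c in columns]
sigmas = [pstdev(c) for c in columns]


def anomaly_score(x):
    """Euclidean norm of the per-feature z-scores."""
    return math.sqrt(sum(((v - mu) / s) ** 2 for v, mu, s in zip(x, mus, sigmas)))


THRESHOLD = 3.0   # assumed cut-off; in practice tuned on validation data


def is_anomaly(x):
    return anomaly_score(x) > THRESHOLD


typical = is_anomaly([23.0, 10.0])
suspicious = is_anomaly([500.0, 3.0])
```

A vector near the training mean scores well below the threshold, while a very large transaction at an unusual hour scores far above it.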

Feature stores and production serving

In production machine learning systems, feature vectors must be served consistently and with low latency. A feature store is a specialized data platform that manages the computation, storage, and retrieval of feature vectors for both training and real-time inference.

Key capabilities of a feature store include:

Offline store: A columnar database (often built on data lakes or warehouses) that stores historical feature vectors for batch training.
Online store: A low-latency key-value store that serves precomputed feature vectors for real-time predictions, typically with response times under 10 milliseconds.
Feature consistency: Ensures the same transformation logic is applied during training and inference, preventing training-serving skew.
Feature reuse: Allows teams to share feature definitions across projects, reducing duplicated work.

Feast is a popular open-source feature store, while managed solutions are offered by Databricks, Hopsworks, and Tecton.

Vector databases and nearest neighbor search

The proliferation of dense feature vectors has driven the development of specialized vector databases designed for efficient similarity search. Given a query vector, these systems find the most similar vectors in a large collection.

Exact nearest neighbor search compares the query against every stored vector and guarantees finding the true closest match, but it scales poorly to millions or billions of vectors. Approximate nearest neighbor (ANN) algorithms trade a small amount of accuracy for large speed improvements.^[10]
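
Exact search is short to write, which makes its cost easy to see: every query computes one distance per stored vector. The sketch below uses Euclidean distance via `math.dist` (Python 3.8 or later); the stored vectors and the function name are made up for illustration.

```python
import heapq
import math


def exact_nearest_neighbors(query, vectors, k):
    """Return the indices of the k stored vectors closest to query (Euclidean)."""
    scored = ((math.dist(query, v), i) for i, v in enumerate(vectors))
    return [i for _, i in heapq.nsmallest(k, scored)]


stored = [
    [0.0, 0.0],
    [1.0, 1.0],
    [5.0, 5.0],
    [0.9, 1.2],
]
top2 = exact_nearest_neighbors([1.0, 1.0], stored, k=2)
```

The work grows linearly with the number of stored vectors and with their dimensionality, which is the scaling problem that approximate methods are designed to avoid.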

Common ANN techniques include:

Inverted file index (IVF): Clusters vectors using k-means, then searches only the closest clusters at query time.
Hierarchical Navigable Small World (HNSW): Builds a multi-layered graph where each node is a vector. Queries navigate the graph from coarse to fine layers, converging on nearest neighbors.
Product quantization (PQ): Compresses vectors by splitting them into subvectors and quantizing each independently, reducing memory usage and speeding up distance calculations.

Prominent implementations include the FAISS similarity search library (developed by Meta) and vector databases such as Pinecone, Milvus, Weaviate, and Qdrant.^[10]

When did feature vectors originate?

The concept of representing objects as numerical vectors predates machine learning itself. In statistics, multivariate data has been organized into vectors and matrices since the early 20th century. Ronald Fisher's 1936 work on discriminant analysis used feature vectors to classify species and popularized one of the best-known datasets in machine learning. The iris dataset contains 150 samples (50 from each of three species), each described by four measurements per flower: sepal length, sepal width, petal length, and petal width.^[6] Each flower is therefore a four-dimensional feature vector, and Fisher used these vectors to demonstrate linear discriminant analysis.^[6]

During the 1950s and 1960s, the development of the perceptron and early pattern recognition systems formalized the use of feature vectors as input to learning algorithms. The field of information retrieval adopted vector space models in the 1970s, representing documents and queries as TF-IDF vectors and ranking documents by cosine similarity.^[7]

The 2000s and 2010s saw a shift from hand-engineered features toward learned representations. Word2Vec (Mikolov et al., 2013) demonstrated that neural networks could learn dense word vectors capturing semantic relationships.^[3] The success of deep convolutional networks in image recognition (AlexNet, 2012) showed that CNNs could automatically learn feature vectors far more effective than handcrafted alternatives.^[8] The transformer architecture (Vaswani et al., 2017)^[9] and models like BERT (Devlin et al., 2018)^[4] extended learned feature vectors to full sentences and documents, enabling a new generation of NLP applications.

Worked example: MNIST digit classification

Consider a dataset of 28x28 grayscale images of handwritten digits (the MNIST dataset). Each image can be converted into a feature vector in several ways:

Approach 1: Raw pixel features

Flatten the 28x28 pixel grid into a 784-dimensional vector (28 x 28 = 784). Each dimension holds a pixel intensity value, conventionally between 0 and 255 (or normalized to the range 0 to 1):

x = [0, 0, 0, ..., 128, 255, 230, ..., 0, 0, 0]  (784 values)

Approach 2: Statistical features

Compute summary statistics from the pixel values:

x = [mean_intensity, std_deviation, skewness, kurtosis]
x = [123.4, 10.2, 0.5, 2.0]

This produces a compact 4-dimensional vector but discards spatial information.
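
A sketch of Approach 2 follows. The flattened image is synthetic (700 background pixels and 84 stroke pixels, not a real MNIST digit), skewness and kurtosis are computed as the mean third and fourth powers of the z-scores, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def statistical_features(pixels):
    """Return [mean, std, skewness, kurtosis] of a flat list of pixel values."""
    mu = mean(pixels)
    sigma = pstdev(pixels)                    # population standard deviation
    z = [(p - mu) / sigma for p in pixels]
    skewness = mean([v ** 3 for v in z])
    kurtosis = mean([v ** 4 for v in z])      # non-excess kurtosis (3.0 for a normal)
    return [mu, sigma, skewness, kurtosis]


# Synthetic 784-pixel image: mostly black background with some bright stroke pixels.
pixels = [0.0] * 700 + [255.0] * 84
features = statistical_features(pixels)
```

Any rearrangement of the same 784 pixel values yields an identical four-number vector, which is a concrete way to see that this representation discards spatial information.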

Approach 3: CNN-extracted features

Pass the image through a pre-trained convolutional neural network and extract activations from a hidden layer, producing a dense vector (e.g., 128 or 256 dimensions) that captures learned visual patterns like edges, curves, and stroke thickness.^[8]

A classification algorithm (such as a softmax classifier or a random forest) receives the feature vector as input and predicts the digit label (0 through 9).^[14] The quality of the feature vector directly affects classification accuracy: CNN-extracted features typically outperform raw pixels, which in turn outperform simple statistical summaries.

Best practices

Match features to the algorithm: Tree-based models (decision trees, random forests) handle mixed feature types and different scales well. Distance-based models (k-nearest neighbors, SVMs) generally need scaled features to perform well.
Remove redundant features: Highly correlated features add redundancy and computational cost without adding much information. Use correlation analysis or feature importance scores to prune them.
Handle missing values deliberately: Impute with statistical measures (mean, median) or train a model to predict missing values. Dropping rows with missing values can introduce bias.
Use domain knowledge: Features informed by expert understanding of the problem often outperform purely automated approaches. For example, in financial fraud detection, features like "transaction amount relative to user average" are more informative than raw transaction amounts.
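Monitor feature drift: In production systems, the statistical distribution of feature vectors can shift over time (data drift, also called covariate shift; this is distinct from concept drift, where the relationship between features and target changes). Monitoring feature distributions helps detect when a model needs retraining.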
Start simple: Begin with a small set of well-understood features and add complexity only if baseline performance is insufficient.

References

1. Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer. Chapter 1: Introduction to feature vectors and design matrices.
2. Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press. Chapters on representation learning and feature hierarchies.
3. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). "Efficient Estimation of Word Representations in Vector Space." *arXiv:1301.3781*. Introduced Word2Vec for learning dense word feature vectors.
4. Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." *arXiv:1810.04805*.
5. Bellman, R. (1961). *Adaptive Control Processes: A Guided Tour*. Princeton University Press. Origin of the phrase "curse of dimensionality."
6. Fisher, R. A. (1936). "The Use of Multiple Measurements in Taxonomic Problems." *Annals of Eugenics*, 7(2), 179-188.
7. Salton, G., Wong, A., & Yang, C. S. (1975). "A Vector Space Model for Automatic Indexing." *Communications of the ACM*, 18(11), 613-620.
8. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). "ImageNet Classification with Deep Convolutional Neural Networks." *Advances in Neural Information Processing Systems*, 25.
9. Vaswani, A., et al. (2017). "Attention Is All You Need." *Advances in Neural Information Processing Systems*, 30.
10. Johnson, J., Douze, M., & Jegou, H. (2019). "Billion-scale similarity search with GPUs." *IEEE Transactions on Big Data*, 7(3), 535-547. Describes the FAISS library for vector similarity search.
11. McInnes, L., Healy, J., & Melville, J. (2018). "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction." *arXiv:1802.03426*.
12. van der Maaten, L. & Hinton, G. (2008). "Visualizing Data using t-SNE." *Journal of Machine Learning Research*, 9, 2579-2605.
13. Pennington, J., Socher, R., & Manning, C. D. (2014). "GloVe: Global Vectors for Word Representation." *Proceedings of EMNLP*, 1532-1543.
14. Google Developers. "Numerical data: How a model ingests data using feature vectors." *Machine Learning Crash Course*. https://developers.google.com/machine-learning/crash-course/numerical-data/feature-vectors
15. "Feature (machine learning)." *Wikipedia*. https://en.wikipedia.org/wiki/Feature_(machine_learning)
16. Google Developers. "Categorical data: Vocabulary and one-hot encoding." *Machine Learning Crash Course*. https://developers.google.com/machine-learning/crash-course/categorical-data/one-hot-encoding
17. Google Developers. "Embeddings" and "Embedding space and static embeddings." *Machine Learning Crash Course*. https://developers.google.com/machine-learning/crash-course/embeddings
