# Feature Vector

> Source: https://aiwiki.ai/wiki/feature_vector
> Updated: 2026-06-23
> Categories: Data & Datasets, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

A **feature vector** is an n-dimensional, ordered list of numerical values that represents the measurable properties of an object, data point, or observation in a format suitable for processing by [machine learning](/wiki/machine_learning) algorithms.[1][14][15] Each element corresponds to a specific [feature](/wiki/feature), and the complete vector serves as the numerical identity of that data point within a mathematical space called the feature space; as the Wikipedia definition states, "A feature vector is an n-dimensional vector of numerical features that represent some object," and "the vector space associated with these vectors is often called the feature space."[15] Many learning algorithms require this numerical form because, in Wikipedia's words, "such representations facilitate processing and statistical analysis."[15]

Feature vectors are one of the most fundamental data structures in machine learning and [pattern recognition](/wiki/pattern_recognition). They act as the bridge between raw, unstructured data (images, text, audio) and the mathematical operations that learning algorithms perform. Without a consistent numerical representation, algorithms would have no way to compare, group, or learn from data. In practice the values are rarely raw: Google's Machine Learning Crash Course notes that "feature vectors seldom use the dataset's raw values," and that engineers "must typically process the dataset's values into representations that your model can better learn from."[14]

## Explain like I'm 5 (ELI5)

Imagine you want to teach a robot to sort fruit. You cannot just show the robot an apple and say "this is an apple," because the robot only understands numbers. So instead, you describe the apple using a list of numbers: how red it is, how round it is, how heavy it is, and how sweet it is. That list of numbers might look like [8, 9, 150, 7]. A banana would get a different list, maybe [2, 3, 120, 6], because it is yellow instead of red and long instead of round.

That list of numbers is a feature vector. Every fruit gets its own list. The robot compares these lists to figure out which fruits are similar and which are different. Apples have similar lists to each other, and bananas have similar lists to each other, so the robot learns to tell them apart.

## What is the formal definition of a feature vector?

A feature vector is an element of a vector space, typically denoted as **x** (boldface lowercase) and belonging to R^n, where n is the number of features. For a single observation, the feature vector is written as:

**x** = [x_1, x_2, x_3, ..., x_n]^T

where each x_i represents the value of the i-th feature and T denotes the transpose (indicating a column vector by convention). In Google's terminology, the number of elements n is called the dimension of the feature vector, and a model "acts on" data through these floating-point arrays.[14]

When a [dataset](/wiki/dataset) contains m observations, each with n features, the individual feature vectors are stacked row-by-row into a **design matrix** (also called a feature matrix) **X** of dimensions m x n:

```
X = | x_1^T |     | x_11  x_12  ...  x_1n |
    | x_2^T |  =  | x_21  x_22  ...  x_2n |
    |  ...  |     |  ...   ...  ...   ... |
    | x_m^T |     | x_m1  x_m2  ...  x_mn |
```

Each row of the design matrix is one feature vector (one data point), and each column contains all observed values for a single feature across the dataset.[1] This row-in-a-design-matrix view is the standard way supervised learning frames its input: a model is fit on the m x n matrix X paired with a target vector y of length m.
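As a minimal sketch of the row and column view, the snippet below stacks three feature vectors into a design matrix using plain Python lists. The numeric values and the names `feature_vectors` and `targets` are illustrative assumptions, not taken from any cited source.

```python
# Three hypothetical observations, each a 4-dimensional feature vector.
feature_vectors = [
    [5.1, 3.5, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.3, 3.3, 6.0, 2.5],
]
targets = [0, 1, 2]  # target vector y, one label per observation

# Stacking the vectors row by row gives the design matrix X (m x n).
X = [list(v) for v in feature_vectors]
m, n = len(X), len(X[0])

# Row i is one observation; column j holds feature j for every observation.
row_1 = X[1]
column_2 = [row[2] for row in X]

print((m, n))     # (3, 4)
print(row_1)      # [7.0, 3.2, 4.7, 1.4]
print(column_2)   # [1.4, 4.7, 6.0]
```

In practice a numerical array library would usually hold X, but the indexing logic is the same.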

## Types of feature vectors

Feature vectors can be broadly classified by how their values are distributed across their dimensions.

### Sparse feature vectors

A [sparse representation](/wiki/sparse_representation) is a vector in which most elements are zero. Sparse vectors arise naturally in text processing and categorical encoding.

| Technique | Description | Sparsity level | Typical dimensionality |
|---|---|---|---|
| [One-hot encoding](/wiki/one-hot_encoding) | A single 1 in the position corresponding to the category; all other entries are 0 | Very high | Equal to vocabulary or category count |
| [Bag of words](/wiki/bag_of_words) | Each dimension is a word count for that term in the document | High | Vocabulary size (often 10,000+) |
| TF-IDF | Weights words by term frequency multiplied by inverse document frequency | High | Vocabulary size |
| Binary feature vectors | Each dimension is 0 or 1 indicating presence or absence of a feature | Moderate to high | Varies |

One-hot encoding is the canonical way categorical attributes enter a feature vector. Google's Machine Learning Crash Course explains that "each category is represented by a vector of N elements, where N is the number of categories," with "exactly one of the elements" set to 1.0 and all others to 0.0; the one-hot vector, not the original string or index, "gets passed to the feature vector, and the model learns a separate weight for each element."[16] So a categorical attribute such as car_color with eight possible values expands into eight feature-vector dimensions.[16]
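The expansion from one categorical value to N dimensions can be sketched in a few lines. The helper `one_hot` and the eight-color vocabulary below are hypothetical choices made for illustration.

```python
def one_hot(category, vocabulary):
    """Return a list with 1.0 at the category's position and 0.0 elsewhere."""
    if category not in vocabulary:
        raise ValueError(f"unknown category: {category!r}")
    return [1.0 if item == category else 0.0 for item in vocabulary]

# Hypothetical eight-value vocabulary for a car_color attribute.
CAR_COLORS = ["red", "orange", "blue", "yellow",
              "green", "black", "purple", "brown"]

vec = one_hot("blue", CAR_COLORS)
print(vec)  # [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

This sketch rejects unseen categories; real pipelines often reserve an extra out-of-vocabulary slot instead.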

Sparse vectors are memory-efficient when stored in specialized formats (such as compressed sparse row) that keep only the non-zero entries. They become computationally expensive, however, when they are treated as dense arrays in standard matrix operations, because of their high dimensionality.

### Dense feature vectors

Dense feature vectors have most or all elements set to non-zero floating-point values. They are typically lower-dimensional and encode information more compactly than sparse vectors. Examples include:

- [Word embeddings](/wiki/word_embedding) produced by [Word2Vec](/wiki/word2vec)[3] or GloVe[13] (commonly 100 to 300 dimensions)
- Image feature vectors extracted from [convolutional neural networks](/wiki/convolutional_neural_network) (e.g., 512 or 2048 dimensions from ResNet)
- Sentence or document embeddings from [BERT](/wiki/bert)[4] or other [transformer](/wiki/transformer) models (768 or 1024 dimensions)

Learned dense vectors generally capture semantic relationships more effectively than sparse representations.[2] In a well-trained embedding space, vectors for semantically similar items are close together, enabling algorithms to generalize from limited examples.[3]

## How do feature vectors differ from embeddings?

The terms "feature vector" and "embedding" are sometimes used interchangeably, but they have distinct origins and connotations. An embedding is a learned, dense feature vector: Google defines it as "a vector representation of data in embedding space" and notes that "embeddings make it easier to do machine learning on large feature vectors," because the position of points in that space encodes meaning, so "words that are used in similar contexts will be closer to each other in embedding space."[17]

| Aspect | Feature vector (traditional) | Embedding |
|---|---|---|
| Creation method | Manual [feature engineering](/wiki/feature_engineering) or rule-based extraction | Learned automatically by a [neural network](/wiki/neural_network) during [training](/wiki/training) |
| Dimensionality | Often high-dimensional and sparse | Typically low-dimensional and dense |
| Interpretability | Individual dimensions often have clear meaning (e.g., word count, pixel intensity) | Individual dimensions usually lack direct interpretation |
| Semantic relationships | May not capture similarity between concepts | Designed to place similar items near each other in vector space |
| Generalization | Tied to specific domain assumptions | Often transfer well across tasks via [transfer learning](/wiki/transfer_learning) |
| Example | TF-IDF vector for a document | BERT embedding for a sentence |

In modern practice, embeddings are a specific type of feature vector. The broader term "feature vector" covers both hand-crafted and learned representations.[2] A useful rule of thumb: every embedding is a feature vector, but not every feature vector is an embedding. Embeddings are commonly used to turn high-cardinality one-hot inputs into compact dense feature vectors before they enter a downstream network.[17]
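One way to picture the last point: multiplying a one-hot vector by an embedding table selects a single row of that table, which is why embedding layers are usually implemented as lookups. The sketch below assumes a made-up four-word vocabulary and a made-up 4 x 2 table; in a real model the table entries would be learned.

```python
# Hypothetical vocabulary and embedding table (values are invented).
vocabulary = ["apple", "banana", "car", "truck"]
embedding_table = [
    [0.9, 0.1],   # apple
    [0.8, 0.2],   # banana
    [0.1, 0.9],   # car
    [0.2, 0.8],   # truck
]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed_by_matmul(one_hot_vec, table):
    """Multiply a one-hot row vector (1 x V) by the table (V x d)."""
    dim = len(table[0])
    return [sum(one_hot_vec[i] * table[i][j] for i in range(len(table)))
            for j in range(dim)]

def embed_by_lookup(index, table):
    """Equivalent shortcut: select the row directly."""
    return list(table[index])

idx = vocabulary.index("car")
dense_a = embed_by_matmul(one_hot(idx, len(vocabulary)), embedding_table)
dense_b = embed_by_lookup(idx, embedding_table)
print(dense_a, dense_b)  # both are [0.1, 0.9]
```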

## How are feature vectors created?

The process of converting raw data into feature vectors depends on the data type and the problem domain.

### From tabular data

For structured data (spreadsheets, databases), each column is typically a feature. Numerical columns can be used directly, while categorical columns require encoding:

1. **Numerical features**: Use values as-is or apply scaling (see the section on normalization below).
2. **Categorical features**: Apply [one-hot encoding](/wiki/one-hot_encoding), label encoding, or target encoding.
3. **Missing values**: Impute with mean, median, mode, or a learned value.
4. **Derived features**: Create new features through arithmetic combinations, binning, or domain-specific transformations.
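The four steps above can be sketched on a toy table. The column names, the rows, the median imputation, and the derived ratio feature are all hypothetical choices made for illustration.

```python
from statistics import median

# Hypothetical raw rows: (age, income, city); None marks a missing value.
rows = [
    (34, 52000.0, "paris"),
    (None, 61000.0, "tokyo"),
    (45, None, "paris"),
    (29, 48000.0, "lima"),
]
cities = sorted({r[2] for r in rows})  # ['lima', 'paris', 'tokyo']

def column_median(values):
    """Median of the observed (non-missing) values in a column."""
    return median(v for v in values if v is not None)

age_fill = column_median([r[0] for r in rows])
income_fill = column_median([r[1] for r in rows])

def to_feature_vector(row):
    age, income, city = row
    age = age_fill if age is None else age              # impute missing age
    income = income_fill if income is None else income  # impute missing income
    city_one_hot = [1.0 if c == city else 0.0 for c in cities]
    income_per_year = income / age                      # derived feature
    return [float(age), income, income_per_year] + city_one_hot

X = [to_feature_vector(r) for r in rows]
for vector in X:
    print(vector)
```

Scaling is omitted here; the numerical columns would usually be normalized before distance-based models see them.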

### From text

Text data requires [tokenization](/wiki/tokenization) before numerical representation.

- **[Bag of words](/wiki/bag_of_words)**: Count the frequency of each word in the document. The resulting vector has one dimension per vocabulary word.
- **TF-IDF**: Weight term frequencies by inverse document frequency to downweight common words and highlight distinctive terms.
- **[Word2Vec](/wiki/word2vec) / GloVe**: Map each word to a dense vector learned from co-occurrence statistics.[3][13] Document-level vectors can be formed by averaging word vectors.
- **Transformer-based models**: Pass text through models like [BERT](/wiki/bert) to produce contextualized dense vectors that account for word order and meaning.[4]
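The two sparse approaches in the list above can be sketched without any libraries. This is a simplified illustration: it assumes a naive whitespace tokenizer and one common TF-IDF variant (raw count times log(N / df)); production libraries typically add smoothing and normalization.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [d.split() for d in docs]  # naive whitespace tokenizer
vocab = sorted({tok for doc in tokenized for tok in doc})

def bag_of_words(tokens):
    """One dimension per vocabulary word, holding that word's count."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

# Document frequency: number of documents that contain each term.
df = {term: sum(term in doc for doc in tokenized) for term in vocab}
n_docs = len(docs)

def tf_idf(tokens):
    """Raw term count times log(N / df)."""
    counts = Counter(tokens)
    return [counts[term] * math.log(n_docs / df[term]) for term in vocab]

bow = [bag_of_words(t) for t in tokenized]
tfidf = [tf_idf(t) for t in tokenized]
print(vocab)
print(bow[0])
```

In the first document "the" occurs twice and "cat" once, yet "cat" gets the larger TF-IDF weight because it appears in fewer documents.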

### From images

Image feature vectors can be constructed manually or extracted using [deep learning](/wiki/deep_learning).

- **Pixel-level features**: Flatten the image into a vector of raw pixel intensities. A 28x28 grayscale image (like those in the MNIST dataset) becomes a 784-dimensional vector.
- **Handcrafted descriptors**: Compute histograms of oriented gradients (HOG), scale-invariant feature transform (SIFT) descriptors, or local binary patterns (LBP).
- **CNN features**: Pass the image through a pre-trained [convolutional neural network](/wiki/convolutional_neural_network) and extract activations from an intermediate layer.[8] For example, running an image through VGG16 up to the final pooling layer yields a 25,088-dimensional feature vector (7 x 7 x 512 flattened), which can then be used as input to a downstream classifier.

### From audio

Audio data is commonly transformed into spectral feature vectors.

- **Mel-frequency cepstral coefficients (MFCCs)**: A compact representation of the short-term power spectrum of sound, widely used in [speech recognition](/wiki/speech_recognition).
- **Spectrograms**: Time-frequency representations that can be treated as images and processed by CNNs.
- **Learned features**: Models like wav2vec extract dense feature vectors directly from raw audio waveforms.

## Why does feature scaling and normalization matter?

Raw feature values often have different scales. A height measured in centimeters might range from 150 to 200, while income in dollars might range from 20,000 to 200,000. Many algorithms (particularly distance-based ones like k-nearest neighbors and [support vector machines](/wiki/support_vector_machine_svm)) are sensitive to scale differences, making [normalization](/wiki/normalization) an important preprocessing step.[14] Scaling is one example of why "feature vectors seldom use the dataset's raw values": unscaled features let a single large-range dimension dominate the distance calculation.[14]

| Technique | Formula | Output range | Best suited for |
|---|---|---|---|
| Min-max scaling | (x - min) / (max - min) | [0, 1] | When features need a bounded range; neural networks with sigmoid outputs |
| Z-score standardization | (x - mean) / std | Unbounded (mean=0, std=1) | Algorithms assuming normally distributed features; [principal component analysis](/wiki/principal_component_analysis) |
| Max-abs scaling | x / max(abs(x)) | [-1, 1] | Sparse data where zero entries should remain zero |
| Robust scaling | (x - median) / IQR | Unbounded | Data with outliers, since median and IQR are less affected by extreme values |
| L2 normalization | x / ||x||_2 | Unit length (norm = 1) | When direction matters more than magnitude; [cosine similarity](/wiki/cosine_similarity) comparisons |
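Three of the formulas in the table can be sketched as follows. Note the assumption about orientation: min-max and z-score scaling operate on one feature (a column of the design matrix), whereas L2 normalization operates on one feature vector (a row). The sample values are invented, and the sketch assumes non-constant columns and non-zero vectors.

```python
import math
from statistics import mean, pstdev

heights_cm = [150.0, 160.0, 170.0, 180.0, 200.0]  # one feature column

def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    mu, sigma = mean(values), pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

def l2_normalize(vector):
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]

scaled = min_max(heights_cm)        # values now lie in [0, 1]
standardized = z_score(heights_cm)  # mean 0, standard deviation 1
unit = l2_normalize([3.0, 4.0])     # one feature vector, length 1

print(scaled)
print(unit)
```

To avoid leaking information, the minimum, maximum, mean, and standard deviation should be computed on training data and reused unchanged at inference time.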

## How is similarity between feature vectors measured?

Comparing feature vectors is central to many machine learning tasks, including [classification](/wiki/classification), [clustering](/wiki/clustering), and [information retrieval](/wiki/information_retrieval). The choice of distance or similarity metric depends on the data type and the problem.

### Euclidean distance

The straight-line distance between two points in n-dimensional space:

d(x, y) = sqrt( sum_i (x_i - y_i)^2 )

Euclidean distance is intuitive and widely used, but it can become less discriminative in very high-dimensional spaces because distances between all pairs of points tend to converge (a phenomenon related to the curse of dimensionality).

### Cosine similarity

[Cosine similarity](/wiki/cosine_similarity) measures the cosine of the angle between two vectors:

cos(x, y) = (x . y) / (||x|| * ||y||)

Values range from -1 (opposite directions) to 1 (same direction), with 0 indicating orthogonality. Cosine similarity ignores vector magnitude and focuses on orientation, making it well-suited for text and other high-dimensional data where document length should not affect comparison.[7]
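A minimal sketch of the formula, using invented three-term count vectors to show that scaling a vector does not change its cosine similarity:

```python
import math

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)

short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]   # same direction, ten times the magnitude
other_doc = [0.0, 0.0, 5.0]    # shares no terms with the other two

sim_same_direction = cosine_similarity(short_doc, long_doc)
sim_orthogonal = cosine_similarity(short_doc, other_doc)
print(sim_same_direction, sim_orthogonal)
```

The first similarity is 1 up to floating-point rounding, and the second is exactly 0 because the vectors have no non-zero dimension in common.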

### Manhattan distance

Also called L1 distance or taxicab distance, Manhattan distance sums the absolute differences along each dimension:

d(x, y) = sum_i |x_i - y_i|

Manhattan distance can be more robust than Euclidean distance in high-dimensional spaces and is a natural choice when features represent counts or frequencies.

### Minkowski distance

A generalization of both Euclidean and Manhattan distance:

d(x, y) = ( sum_i |x_i - y_i|^p )^(1/p)

When p=1, this reduces to Manhattan distance. When p=2, it becomes Euclidean distance. Adjusting p allows control over how much weight is given to large differences along individual dimensions.
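The special cases can be checked numerically with a short sketch. The function name `minkowski` and the example points are illustrative.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p (p >= 1) between two equal-length vectors."""
    if p < 1:
        raise ValueError("p must be >= 1 for a valid metric")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x = [0.0, 0.0]
y = [3.0, 4.0]

manhattan = minkowski(x, y, 1)   # |3| + |4| = 7
euclidean = minkowski(x, y, 2)   # sqrt(9 + 16) = 5
large_p = minkowski(x, y, 10)    # dominated by the largest difference (4)

print(manhattan, euclidean, large_p)
```

As p grows, the result moves toward the largest single coordinate difference, which is why higher p gives more weight to large differences along individual dimensions.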

### Dot product

The dot product measures both directional similarity and magnitude:

x . y = sum_i (x_i * y_i)

When vectors are normalized to unit length, the dot product is equivalent to cosine similarity. Many modern [vector databases](/wiki/vector_database) support the dot product (inner product) as a similarity metric because it is cheap to compute.[10]
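The relationship between the dot product and cosine similarity can be verified with a small sketch; the vectors are arbitrary illustrative values.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def normalize(x):
    norm = math.sqrt(dot(x, x))
    return [a / norm for a in x]

def cosine(x, y):
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

x = [1.0, 2.0, 3.0]
y = [4.0, 0.0, -1.0]

raw_dot = dot(x, y)                          # 4 + 0 - 3 = 1.0
scaled_dot = dot([10 * a for a in x], y)     # scaling x scales the result
unit_dot = dot(normalize(x), normalize(y))   # magnitude removed
cos_xy = cosine(x, y)

print(raw_dot, scaled_dot)
print(unit_dot, cos_xy)
```

Normalizing once at indexing time lets a system use the cheaper dot product at query time while still ranking by cosine similarity.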

| Metric | Considers magnitude | Works well in high dimensions | Common use cases |
|---|---|---|---|
| Euclidean distance | Yes | Limited (distances converge) | Low to moderate dimensional data; spatial data |
| Cosine similarity | No | Yes | Text similarity; document retrieval; embeddings |
| Manhattan distance | Yes | Better than Euclidean | Sparse data; count-based features |
| Dot product | Yes | Yes (when normalized) | Neural network outputs; vector databases |

## Dimensionality reduction

High-dimensional feature vectors can suffer from the curse of dimensionality, where data becomes sparse and distances become less meaningful. Dimensionality reduction techniques project feature vectors into a lower-dimensional space while preserving as much relevant information as possible.

### Principal component analysis (PCA)

[Principal component analysis](/wiki/principal_component_analysis) is a linear method that finds the directions of maximum variance in the data and projects vectors onto those directions (called principal components). PCA is computationally efficient, preserves global structure, and provides interpretable components. It is commonly used as a preprocessing step before training classifiers or for visualization.
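As a rough sketch of the idea, the code below finds only the first principal component, using power iteration on the sample covariance matrix. This is one simple way to do it and an assumption of this example; library implementations typically use a singular value decomposition. The data points are invented and lie close to the line y = 2x.

```python
import math

def first_principal_component(X, iterations=100):
    """Return (feature means, unit vector along the direction of maximum variance)."""
    m, n = len(X), len(X[0])
    means = [sum(row[j] for row in X) / m for j in range(n)]
    centered = [[row[j] - means[j] for j in range(n)] for row in X]
    # Sample covariance matrix (n x n).
    cov = [[sum(r[i] * r[j] for r in centered) / (m - 1) for j in range(n)]
           for i in range(n)]
    v = [1.0] * n  # fixed starting vector keeps the sketch deterministic
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return means, v

def project(X, means, component):
    """Project each centered row onto the component (2-D to 1-D here)."""
    return [sum((row[j] - means[j]) * component[j] for j in range(len(row)))
            for row in X]

X = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0], [5.0, 10.1]]
means, pc1 = first_principal_component(X)
scores = project(X, means, pc1)

print(pc1)     # roughly proportional to (1, 2)
print(scores)  # one coordinate per observation
```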

### t-SNE

t-distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear technique designed for visualization.[12] It converts high-dimensional pairwise distances into probability distributions and minimizes the Kullback-Leibler divergence between the original and low-dimensional distributions.[12] t-SNE excels at preserving local neighborhood structure, making it popular for visualizing clusters in two or three dimensions. However, it is computationally expensive for large datasets and does not preserve global distances.

### UMAP

Uniform Manifold Approximation and Projection (UMAP) is a newer nonlinear method that is typically faster than t-SNE and is often reported to preserve more of the global structure while retaining local structure.[11] UMAP constructs a topological representation of the high-dimensional data and optimizes a low-dimensional layout that matches it.[11] It has become a popular choice for exploring feature spaces in [natural language processing](/wiki/natural_language_processing) and [computer vision](/wiki/computer_vision).

### Autoencoders

An [autoencoder](/wiki/autoencoder) is a neural network trained to compress its input into a lower-dimensional bottleneck layer and then reconstruct the original input.[2] The bottleneck activations serve as a reduced feature vector. Variational autoencoders ([VAE](/wiki/variational_autoencoder)) extend this approach by learning a probabilistic latent space.

| Method | Type | Preserves global structure | Preserves local structure | Scalability | Typical use |
|---|---|---|---|---|---|
| PCA | Linear | Yes | Partially | High | Preprocessing; feature decorrelation |
| t-SNE | Nonlinear | No | Yes | Low to moderate | 2D/3D visualization of clusters |
| UMAP | Nonlinear | Yes | Yes | High | Visualization and downstream tasks |
| Autoencoder | Nonlinear (learned) | Depends on architecture | Yes | Moderate | Learned compression; generative models |

## What is the curse of dimensionality?

The phrase "curse of dimensionality," coined by Richard Bellman in 1961, refers to a set of problems that arise when working with high-dimensional feature vectors.[5]

As the number of dimensions increases, the volume of the feature space grows exponentially. This leads to several practical consequences:

1. **Data sparsity**: The amount of training data needed to adequately cover the space grows exponentially with dimensionality. A common heuristic suggests at least 5 to 10 training samples per feature dimension.
2. **Distance concentration**: In very high dimensions, the ratio between a point's distance to its nearest neighbor and its distance to its farthest neighbor tends toward 1 for many data distributions. As a result, distance-based algorithms like k-nearest neighbors lose their ability to distinguish close from distant points.
3. **[Overfitting](/wiki/overfitting)**: Models with many features relative to the number of training examples can memorize noise rather than learning true patterns. [Regularization](/wiki/regularization) techniques (L1, L2) help counteract this.
4. **Computational cost**: Storage requirements, distance calculations, and training times all increase with dimensionality.
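Distance concentration can be observed empirically. The sketch below assumes points drawn uniformly at random from the unit hypercube, which is only one possible data distribution, and reports the nearest-to-farthest distance ratio for a single query point. The function name and parameters are illustrative.

```python
import math
import random

def nearest_to_farthest_ratio(dim, n_points=200, seed=0):
    """Ratio of nearest to farthest Euclidean distance from one query point."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query, others = points[0], points[1:]
    dists = [math.dist(query, p) for p in others]
    return min(dists) / max(dists)

ratios = {d: nearest_to_farthest_ratio(d) for d in (2, 10, 100, 1000)}
for d, r in ratios.items():
    print(f"dim={d:>4}  nearest/farthest = {r:.3f}")
```

Under this assumption the ratio is small in two dimensions and moves toward 1 as the dimension grows, which matches the behaviour described in item 2.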

Mitigation strategies include [feature selection](/wiki/feature_selection) (choosing a subset of relevant features), [feature extraction](/wiki/feature_extraction) (creating new lower-dimensional features via PCA or autoencoders), and using algorithms that are naturally robust to high dimensionality, such as [random forests](/wiki/random_forest).

## What are feature vectors used for?

Feature vectors are used across virtually every domain of machine learning and artificial intelligence.

### Computer vision

In [computer vision](/wiki/computer_vision), feature vectors represent images or regions within images. Applications include:

- **[Image recognition](/wiki/image_recognition)**: Classifying images into categories based on their feature vectors.
- **[Object detection](/wiki/object_detection)**: Representing candidate regions as feature vectors and classifying them as specific object types.
- **[Image segmentation](/wiki/image_segmentation)**: Assigning each pixel a feature vector and grouping pixels with similar vectors.
- **Face recognition**: Comparing face feature vectors to identify or verify individuals. Modern systems use CNN-extracted embeddings where the Euclidean distance between two face vectors indicates identity similarity.[8]

### Natural language processing

In [natural language processing](/wiki/natural_language_processing), feature vectors enable text to be processed mathematically.

- **[Sentiment analysis](/wiki/sentiment_analysis)**: Representing documents as feature vectors and classifying them as positive, negative, or neutral.
- **[Named entity recognition](/wiki/named_entity_recognition)**: Using word-level feature vectors to identify entities like people, organizations, and locations.
- **[Machine translation](/wiki/machine_translation)**: Encoding source sentences into feature vectors that a [decoder](/wiki/decoder) converts into target language text.
- **Semantic search**: Encoding queries and documents as dense feature vectors, then ranking documents by vector similarity to the query.[7]

### Recommendation systems

In [recommendation systems](/wiki/recommender_system), both users and items are represented as feature vectors.

- **[Collaborative filtering](/wiki/collaborative_filtering)**: Users and items are embedded in a shared vector space. Similar users (close vectors) are assumed to enjoy similar items.
- **Content-based filtering**: Items are described by feature vectors derived from their attributes (genre, keywords, price). A user profile vector is compared against item vectors to generate recommendations.
- **[Matrix factorization](/wiki/matrix_factorization)**: Decomposes a user-item interaction matrix into two lower-rank matrices, yielding user and item feature vectors in a latent space.

### Bioinformatics

Feature vectors represent biological data such as gene expression profiles, protein sequences, and molecular structures. Researchers use these vectors for drug discovery, protein function prediction, and disease classification.

### Anomaly detection

Feature vectors that fall far from the bulk of the training data, in low-density regions of the feature space, can be flagged as anomalies. Applications include fraud detection in financial transactions, intrusion detection in network security, and quality control in manufacturing.

## Feature stores and production serving

In production machine learning systems, feature vectors must be served consistently and with low latency. A **feature store** is a specialized data platform that manages the computation, storage, and retrieval of feature vectors for both training and real-time [inference](/wiki/inference).

Key capabilities of a feature store include:

- **Offline store**: A columnar database (often built on data lakes or warehouses) that stores historical feature vectors for batch training.
- **Online store**: A low-latency key-value store that serves precomputed feature vectors for real-time predictions, typically with response times under 10 milliseconds.
- **Feature consistency**: Ensures the same transformation logic is applied during training and inference, preventing training-serving skew.
- **Feature reuse**: Allows teams to share feature definitions across projects, reducing duplicated work.

Popular open-source feature stores include Feast and Hopsworks (which is also available as a managed service), while managed solutions are offered by vendors such as Databricks and Tecton.

## Vector databases and nearest neighbor search

The proliferation of dense feature vectors has driven the development of specialized [vector databases](/wiki/vector_database) designed for efficient similarity search. Given a query vector, these systems find the most similar vectors in a large collection.

**Exact nearest neighbor search** compares the query against every stored vector and guarantees finding the true closest match, but it scales poorly to millions or billions of vectors. **Approximate nearest neighbor (ANN)** algorithms trade a small amount of accuracy for large speed improvements.[10]
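Exact search is easy to sketch as a brute-force scan, which also makes its cost visible: every query touches every stored vector. The function name and the stored vectors below are illustrative, and Euclidean distance is an assumed choice of metric.

```python
import heapq
import math

def exact_nearest_neighbors(query, vectors, k=2):
    """Return the k closest (distance, index) pairs by Euclidean distance."""
    scored = ((math.dist(query, v), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)

stored = [
    [0.0, 0.0],
    [1.0, 1.0],
    [5.0, 5.0],
    [0.9, 1.2],
]
neighbors = exact_nearest_neighbors([1.0, 1.0], stored, k=2)
for distance, index in neighbors:
    print(index, round(distance, 4))
```

The scan costs time proportional to the number of stored vectors times their dimension for each query, which is the scaling problem that approximate methods are designed to avoid.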

Common ANN techniques include:

- **Inverted file index (IVF)**: Clusters vectors using k-means, then searches only the closest clusters at query time.
- **Hierarchical Navigable Small World (HNSW)**: Builds a multi-layered graph where each node is a vector. Queries navigate the graph from coarse to fine layers, converging on nearest neighbors.
- **Product quantization (PQ)**: Compresses vectors by splitting them into subvectors and quantizing each independently, reducing memory usage and speeding up distance calculations.

Prominent implementations include the FAISS similarity-search library (developed by Meta) and vector databases such as Pinecone, Milvus, Weaviate, and Qdrant.[10]

## When did feature vectors originate?

The concept of representing objects as numerical vectors predates machine learning itself. In statistics, multivariate data has been organized into vectors and matrices since the early 20th century. Ronald Fisher's 1936 work on discriminant analysis used feature vectors to classify species and popularized one of the most well-known datasets in machine learning. The iris dataset, collected by the botanist Edgar Anderson, contains 150 samples (50 from each of three species) described by four measurements per flower: sepal length, sepal width, petal length, and petal width.[6] Each flower is therefore a four-dimensional feature vector, and Fisher used these vectors to demonstrate linear discriminant analysis.[6]

During the 1950s and 1960s, the development of the perceptron and early pattern recognition systems formalized the use of feature vectors as input to learning algorithms. The field of [information retrieval](/wiki/information_retrieval) adopted vector space models in the 1970s, representing documents and queries as TF-IDF vectors and ranking documents by cosine similarity.[7]

The 2000s and 2010s saw a shift from hand-engineered features toward learned representations. [Word2Vec](/wiki/word2vec) (Mikolov et al., 2013) demonstrated that neural networks could learn dense word vectors capturing semantic relationships.[3] The success of deep convolutional networks in image recognition (AlexNet, 2012) showed that CNNs could automatically learn feature vectors far more effective than handcrafted alternatives.[8] The [transformer](/wiki/transformer) architecture (Vaswani et al., 2017)[9] and models like [BERT](/wiki/bert) (Devlin et al., 2018)[4] extended learned feature vectors to full sentences and documents, enabling a new generation of NLP applications.

## Worked example: MNIST digit classification

Consider a dataset of 28x28 grayscale images of handwritten digits (the MNIST dataset). Each image can be converted into a feature vector in several ways:

**Approach 1: Raw pixel features**

Flatten the 28x28 pixel grid into a 784-dimensional vector (28 x 28 = 784). Each dimension holds a pixel intensity value, conventionally between 0 and 255 (or normalized to the range 0 to 1):

```
x = [0, 0, 0, ..., 128, 255, 230, ..., 0, 0, 0]  (784 values)
```

**Approach 2: Statistical features**

Compute summary statistics from the pixel values:

```
x = [mean_intensity, std_deviation, skewness, kurtosis]
x = [123.4, 10.2, 0.5, 2.0]  (illustrative values)
```

This produces a compact 4-dimensional vector but discards spatial information.
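Approaches 1 and 2 can be sketched side by side. To keep the example readable it uses a made-up 4x4 image rather than a real 28x28 MNIST digit, and it computes skewness and kurtosis from population moments, which is one of several conventions.

```python
import math

# A tiny hypothetical 4x4 grayscale image; real MNIST images are 28x28.
image = [
    [0,   0, 255, 0],
    [0, 128, 255, 0],
    [0,   0, 255, 0],
    [0,   0, 255, 0],
]

def flatten(img):
    """Approach 1: concatenate rows into one vector (16 values here, 784 for MNIST)."""
    return [float(pixel) for row in img for pixel in row]

def summary_features(pixels):
    """Approach 2: [mean, std, skewness, kurtosis] from population moments."""
    n = len(pixels)
    mu = sum(pixels) / n
    sigma = math.sqrt(sum((p - mu) ** 2 for p in pixels) / n)
    skew = sum(((p - mu) / sigma) ** 3 for p in pixels) / n
    kurt = sum(((p - mu) / sigma) ** 4 for p in pixels) / n
    return [mu, sigma, skew, kurt]

raw = flatten(image)
stats = summary_features(raw)

# Transposing the image moves the stroke, but the summary statistics
# are unchanged (up to floating-point rounding).
transposed = [list(col) for col in zip(*image)]
raw_t = flatten(transposed)
stats_t = summary_features(raw_t)

print(len(raw), stats)
print(raw == raw_t)
```

The transposed image has a different raw pixel vector but the same four statistics, which illustrates concretely what discarding spatial information means.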

**Approach 3: CNN-extracted features**

Pass the image through a pre-trained convolutional neural network and extract activations from a hidden layer, producing a dense vector (e.g., 128 or 256 dimensions) that captures learned visual patterns like edges, curves, and stroke thickness.[8]

A [classification](/wiki/classification) algorithm (such as a softmax classifier or a [random forest](/wiki/random_forest)) receives the feature vector as input and predicts the digit label (0 through 9).[14] The quality of the feature vector directly affects classification accuracy: CNN-extracted features typically outperform raw pixels, which in turn outperform simple statistical summaries.

## Best practices

- **Match features to the algorithm**: Tree-based models ([decision trees](/wiki/decision_tree), [random forests](/wiki/random_forest)) handle mixed feature types and different scales well. Distance-based models (k-nearest neighbors, [SVMs](/wiki/support_vector_machine_svm)) are sensitive to feature scale and usually need normalized features.
- **Remove redundant features**: Highly correlated features add dimensions without adding much information and can destabilize some models. Use correlation analysis or feature importance scores to prune them.
- **Handle missing values deliberately**: Impute with statistical measures (mean, median) or train a model to predict missing values. Dropping rows with missing values can introduce bias.
- **Use domain knowledge**: Features informed by expert understanding of the problem often outperform purely automated approaches. For example, in financial fraud detection, features like "transaction amount relative to user average" are more informative than raw transaction amounts.
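- **Monitor feature drift**: In production systems, the statistical distribution of feature vectors can shift over time (often called data drift or covariate shift, and distinct from concept drift, where the relationship between features and the target changes). Monitoring feature distributions helps detect when a model needs retraining.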
- **Start simple**: Begin with a small set of well-understood features and add complexity only if baseline performance is insufficient.

## See also

- [Feature](/wiki/feature)
- [Feature Engineering](/wiki/feature_engineering)
- [Feature Extraction](/wiki/feature_extraction)
- [Feature Selection](/wiki/feature_selection)
- [Word Embedding](/wiki/word_embedding)
- [Cosine Similarity](/wiki/cosine_similarity)
- [Principal Component Analysis](/wiki/principal_component_analysis)
- [Vector Database](/wiki/vector_database)
- [Dimensionality Reduction](/wiki/dimensionality_reduction)
- [Sparse Representation](/wiki/sparse_representation)

## References

1. Bishop, C. M. (2006). *Pattern Recognition and Machine Learning*. Springer. Chapter 1: Introduction to feature vectors and design matrices.
2. Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press. Chapters on representation learning and feature hierarchies.
3. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). "Efficient Estimation of Word Representations in Vector Space." *arXiv:1301.3781*. Introduced Word2Vec for learning dense word feature vectors.
4. Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." *arXiv:1810.04805*.
5. Bellman, R. (1961). *Adaptive Control Processes: A Guided Tour*. Princeton University Press. Origin of the phrase "curse of dimensionality."
6. Fisher, R. A. (1936). "The Use of Multiple Measurements in Taxonomic Problems." *Annals of Eugenics*, 7(2), 179-188.
7. Salton, G., Wong, A., & Yang, C. S. (1975). "A Vector Space Model for Automatic Indexing." *Communications of the ACM*, 18(11), 613-620.
8. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). "ImageNet Classification with Deep Convolutional Neural Networks." *Advances in Neural Information Processing Systems*, 25.
9. Vaswani, A., et al. (2017). "Attention Is All You Need." *Advances in Neural Information Processing Systems*, 30.
10. Johnson, J., Douze, M., & Jegou, H. (2019). "Billion-scale similarity search with GPUs." *IEEE Transactions on Big Data*, 7(3), 535-547. Describes the FAISS library for vector similarity search.
11. McInnes, L., Healy, J., & Melville, J. (2018). "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction." *arXiv:1802.03426*.
12. van der Maaten, L. & Hinton, G. (2008). "Visualizing Data using t-SNE." *Journal of Machine Learning Research*, 9, 2579-2605.
13. Pennington, J., Socher, R., & Manning, C. D. (2014). "GloVe: Global Vectors for Word Representation." *Proceedings of EMNLP*, 1532-1543.
14. Google Developers. "Numerical data: How a model ingests data using feature vectors." *Machine Learning Crash Course*. https://developers.google.com/machine-learning/crash-course/numerical-data/feature-vectors
15. "Feature (machine learning)." *Wikipedia*. https://en.wikipedia.org/wiki/Feature_(machine_learning)
16. Google Developers. "Categorical data: Vocabulary and one-hot encoding." *Machine Learning Crash Course*. https://developers.google.com/machine-learning/crash-course/categorical-data/one-hot-encoding
17. Google Developers. "Embeddings" and "Embedding space and static embeddings." *Machine Learning Crash Course*. https://developers.google.com/machine-learning/crash-course/embeddings

