Vector embeddings

Revision as of 15:15, 8 April 2023 by Daikon Radish (talk | contribs)

Introduction

Vector embeddings are a crucial and fascinating aspect of machine learning, playing a central role in numerous natural language processing (NLP), recommendation, and search algorithms. These embeddings enable systems such as recommendation engines, voice assistants, and language translators to function effectively. Machine learning algorithms, like other software algorithms, require numerical data to operate. Vector embeddings are lists of numbers that represent more abstract data types, such as text documents or other non-numeric objects, facilitating various operations. The use of vector embeddings allows for the translation of human-perceived semantic similarity into proximity within a vector space.

Vector Embeddings and Semantic Similarity

When real-world objects and concepts like images, audio recordings, news articles, user profiles, weather patterns, and political views are represented as vector embeddings, their semantic similarity can be quantified by how close they are to each other as points in vector spaces. This representation is suitable for common machine learning tasks, such as clustering, recommendation, and classification.

In clustering tasks, for example, algorithms assign similar points to the same cluster while keeping points from different clusters as dissimilar as possible. In recommendation tasks, recommender systems look for objects most similar to the target object, as measured by their similarity in vector embeddings. In classification tasks, the label of an unseen object is determined by the majority vote over the labels of the most similar objects.

Creating Vector Embeddings

Feature Engineering

One method for creating vector embeddings involves engineering the vector values using domain knowledge, a process known as feature engineering. For instance, in medical imaging, domain expertise is employed to quantify features such as shape, color, and regions within an image to capture semantics. However, feature engineering requires domain knowledge and is often too costly to scale.

Deep Neural Networks

Rather than engineering vector embeddings, models are frequently trained to translate objects into vectors. Deep neural networks are commonly used for training such models. The resulting embeddings are typically high-dimensional (up to two thousand dimensions) and dense (all values are non-zero). Text data can be transformed into vector embeddings using models such as Word2Vec, GLoVE, and BERT. Images can be embedded using convolutional neural networks (CNNs) like VGG and Inception, while audio recordings can be converted into vectors using image embedding transformations over their visual representations, such as spectrograms.

Example: Image Embedding with a Convolutional Neural Network

In this example, raw images are represented as greyscale pixels, which correspond to a matrix of integer values ranging from 0 to 255, where 0 signifies black and 255 represents white. The matrix values define a vector embedding, with the first coordinate being the matrix's upper-left cell and the last coordinate corresponding to the lower-right matrix cell.

While such embeddings effectively maintain the semantic information of a pixel's neighborhood in an image, they are highly sensitive to transformations like shifts, scaling, cropping, and other image manipulation operations. Consequently, they are often used as raw inputs to learn more robust embeddings.

A Convolutional Neural Network (CNN or ConvNet) is a class of deep learning architectures typically applied to visual data, transforming images into embeddings. CNNs process input through hierarchical small local sub-inputs known as receptive fields. Each neuron in each network layer processes a specific receptive field from the previous layer. Each layer either applies a convolution on the receptive field or reduces the input size through a process called subsampling.

A typical CNN structure includes receptive fields as sub-squares in each layer, serving as input to a single neuron within the preceding layer. Subsampling operations reduce layer size, while convolution operations expand layer size. The resulting vector embedding is obtained through a fully connected layer.

Learning the network weights (i.e., the embedding model) requires a large set of labeled images. The weights are optimized to ensure that images with the same labels have closer embeddings compared to those with different labels. Once the CNN embedding model is learned, images can be transformed into vectors and stored with a K-Nearest-Neighbor index. For a new unseen image, it can be transformed using the CNN model, its k-most similar vectors can be retrieved, and the corresponding similar images can be identified.

Although this example focuses on images and CNNs, vector embeddings can be created for various types of data, and multiple models or methods can be employed to generate them.

Using Vector Embeddings

Vector embeddings' ability to represent objects as dense vectors containing their semantic information makes them highly valuable for a wide array of machine learning applications.

One of the most popular uses of vector embeddings is similarity search. Search algorithms like KNN and ANN necessitate calculating distances between vectors to determine similarity. Vector embeddings can be used to compute these distances. Nearest neighbor search can then be utilized for tasks such as deduplication, recommendations, anomaly detection, and reverse image search.

Even if embeddings are not directly used for an application, many popular machine learning models and methods rely on them internally. For instance, in encoder-decoder architectures, the embeddings generated by the encoder contain the required information for the decoder to produce a result. This architecture is widely employed in applications like machine translation and caption generation.