Vector embeddings: Difference between revisions

Line 1: Line 1:
==Introduction==
Vector embeddings are a crucial and fascinating aspect of machine learning, playing a central role in numerous natural language processing (NLP), recommendation, and search algorithms. These embeddings enable systems such as recommendation engines, voice assistants, and language translators to function effectively. Machine learning algorithms, like other software algorithms, require numerical data to operate. Vector embeddings are lists of numbers that represent more abstract data types, such as text documents or other non-numeric objects, facilitating various operations. The use of vector embeddings allows for the translation of human-perceived semantic similarity into proximity within a vector space.
Vector embeddings are numerical representations of data that encapsulate particular features of the data, enabling the effective execution of semantic search by capturing the semantic similarity between different data objects. They play a central role in numerous natural language processing (NLP), recommendation, and search algorithms. These embeddings enable systems such as recommendation engines, voice assistants, and language translators to function effectively. Machine learning algorithms, like other software algorithms, require numerical data to operate. Vector embeddings are lists of numbers that represent more abstract data types, such as text documents or other non-numeric objects, facilitating various operations. The use of vector embeddings allows for the translation of human-perceived semantic similarity into proximity within a vector space.


==Vector Embeddings and Semantic Similarity==
==Understanding Vector Embeddings==
When real-world objects and concepts like images, audio recordings, news articles, user profiles, weather patterns, and political views are represented as vector embeddings, their semantic similarity can be quantified by how close they are to each other as points in vector spaces. This representation is suitable for common machine learning tasks, such as clustering, recommendation, and classification.
In the context of text data, words with similar meanings, such as "cat" and "kitty", must be represented in a manner that captures their semantic similarity. Vector representations achieve this by transforming data objects into arrays of real numbers with a fixed length, typically ranging from hundreds to thousands of elements. These arrays are generated by machine learning models through a process called vectorization.


In clustering tasks, for example, algorithms assign similar points to the same cluster while keeping points from different clusters as dissimilar as possible. In recommendation tasks, recommender systems look for objects most similar to the target object, as measured by their similarity in vector embeddings. In classification tasks, the label of an unseen object is determined by the majority vote over the labels of the most similar objects.
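The classification case described above (a majority vote over the labels of the most similar objects) can be sketched as follows. This is a minimal illustration, not a production classifier: the two-dimensional vectors, the labels, and the function names `euclidean` and `knn_classify` are hypothetical, and Euclidean distance is assumed as the similarity measure.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(query, labeled_vectors, k=3):
    """Label `query` by majority vote over its k nearest labeled vectors."""
    nearest = sorted(labeled_vectors, key=lambda item: euclidean(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-dimensional embeddings with made-up labels.
labeled = [
    ([0.9, 0.1], "animal"),
    ([0.8, 0.2], "animal"),
    ([0.7, 0.3], "animal"),
    ([0.1, 0.9], "instrument"),
    ([0.2, 0.8], "instrument"),
]

print(knn_classify([0.85, 0.15], labeled, k=3))
```

The same nearest-neighbor lookup, without the vote, is the core of the recommendation case: return the nearest objects themselves rather than their most common label.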
For instance, the words "cat" and "kitty" may be vectorized as follows:


==Creating Vector Embeddings==
<code>
===Feature Engineering===
cat = [1.5, -0.4, 7.2, 19.6, 3.1, ..., 20.2]
One method for creating vector embeddings involves engineering the vector values using domain knowledge, a process known as feature engineering. For instance, in medical imaging, domain expertise is employed to quantify features such as shape, color, and regions within an image to capture semantics. However, feature engineering requires domain knowledge and is often too costly to scale.
kitty = [1.5, -0.4, 7.2, 19.5, 3.2, ..., 20.8]
</code>


===Deep Neural Networks===
These vectors exhibit a high similarity, while vectors for words like "banjo" or "comedy" would not be similar to either of these. In this way, vector embeddings capture the semantic similarity of words. The specific meaning of each number in a vector depends on the machine learning model that generated the vectors, and is not always clear in terms of human understanding of language and meaning.
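One common way to quantify this kind of similarity is cosine similarity, which compares the direction of two vectors. The sketch below uses short, hand-written vectors purely for illustration; they are assumptions and not the output of any real model, and the "banjo" vector in particular is invented to show a dissimilar case.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 4-dimensional vectors (real embeddings have hundreds of dimensions).
cat = [1.5, -0.4, 7.2, 3.1]
kitty = [1.5, -0.4, 7.1, 3.2]
banjo = [-6.0, 2.5, 0.3, -4.1]

print("cat vs kitty:", cosine_similarity(cat, kitty))  # close to 1
print("cat vs banjo:", cosine_similarity(cat, banjo))  # much lower
```

Values near 1 indicate vectors pointing in nearly the same direction; values near 0 or below indicate unrelated or opposed directions. Other distance measures, such as Euclidean distance or dot product, are also widely used.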
Rather than engineering vector embeddings, models are frequently trained to translate objects into vectors. Deep neural networks are commonly used for training such models. The resulting embeddings are typically high-dimensional (up to two thousand dimensions) and dense (most values are non-zero). Text data can be transformed into vector embeddings using models such as Word2Vec, GloVe, and BERT. Images can be embedded using convolutional neural networks (CNNs) like VGG and Inception, while audio recordings can be converted into vectors using image embedding transformations over their visual representations, such as spectrograms.


==Example: Image Embedding with a Convolutional Neural Network==
Vector-based representation of meaning has gained attention due to its ability to perform mathematical operations between words, revealing semantic relationships. A famous example is:
 
<code>
"king − man + woman ≈ queen"
</code>
This result suggests that the difference between "king" and "man" represents some sort of "royalty", which is analogously the difference between "queen" and "woman". Various concepts, such as "woman", "girl", and "boy", can be vectorized into arrays of numbers, whose individual positions are often referred to as dimensions. These arrays can be visualized and correlated to familiar words, giving insight into their meaning.
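The arithmetic can be illustrated with a toy vocabulary. The three-dimensional vectors below are hand-picked assumptions (the axes loosely stand for "royalty", "male", and "female"); real models learn their dimensions from data, and those dimensions rarely have such clean interpretations. The helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical embeddings, chosen by hand for illustration only.
vocab = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "banjo": [0.0, 0.3, 0.2],
}


def analogy(a, b, c, vocab):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c).

    The three input words are excluded from the candidates, as is
    customary when evaluating word analogies.
    """
    target = [x - y + z for x, y, z in zip(vocab[a], vocab[b], vocab[c])]
    candidates = [word for word in vocab if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine_similarity(target, vocab[word]))


print(analogy("king", "man", "woman", vocab))
```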
 
Vector embeddings can represent more than just word meanings. They can effectively be generated from any data object, including text, images, audio, time series data, 3D models, video, and molecules. Embeddings are constructed such that two objects with similar semantics have vectors that are "close" to each other in vector space, with a "small" distance between them.
 
==Generating Vector Embeddings==
The primary aspect of vector search's effectiveness lies in generating embeddings for each entity and query. The secondary aspect is efficiently searching within very large datasets.
 
Vector embeddings can be generated for various media types, such as text, images, audio, and others. For text, vectorization techniques have significantly evolved over the last decade, from word2vec (2013) to the state-of-the-art transformer models era, which began with the release of BERT in 2018.
 
===Word-level Dense Vector Models (word2vec, GloVe, etc.)===
word2vec is a group of model architectures that popularized "dense" vectors in language processing, in which most values are non-zero. It uses a neural network model to learn word associations from a large text corpus. The model first creates a vocabulary from the corpus and then learns vector representations for the words, usually with 300 dimensions. Words found in similar contexts have vector representations that are close in vector space.
 
However, word2vec has limitations: it assigns a single vector to each word, so it cannot distinguish between the different senses of words with multiple meanings (polysemous words) or of otherwise ambiguous words.
 
===Transformer Models (BERT and others)===
The current state-of-the-art models are based on the transformer architecture. Models like BERT and its successors improve search accuracy, precision, and recall by examining the context of each word to create full contextual embeddings. Unlike word2vec embeddings, which are context-agnostic, transformer-generated embeddings consider the entire input text. Each occurrence of a word has its own embedding that is influenced by the surrounding text, better reflecting the polysemous nature of words, which can only be disambiguated when considered in context.
 
Some potential downsides of transformer models include:
 
* Increased compute requirements: Fine-tuning transformer models is much slower (taking hours instead of minutes).
* Increased memory requirements: Context-sensitivity greatly increases memory requirements, often leading to limitations on possible input lengths.

Despite these drawbacks, transformer models have been highly successful, leading to a proliferation of vectorizer models for text and for other data types such as audio, video, and images. Some models, like CLIP, can vectorize multiple data types (e.g., images and text) into a single vector space, enabling content-based image searches using only text.
 
==Creating Vector Embeddings for Other Media Types==
In addition to text, vector embeddings can be created for various types of data, such as images and audio recordings. Images can be embedded using convolutional neural networks (CNNs) like VGG and Inception, while audio recordings can be converted into vectors using image embedding transformations over their visual representations, such as spectrograms.
 
===Example: Image Embedding with a Convolutional Neural Network===
In this example, raw images are represented as greyscale pixels, which correspond to a matrix of integer values ranging from 0 to 255, where 0 signifies black and 255 represents white. The matrix values define a vector embedding, with the first coordinate being the matrix's upper-left cell and the last coordinate corresponding to the lower-right matrix cell.
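The raw-pixel representation described above amounts to flattening the matrix row by row. A minimal sketch, assuming row-major order and a hypothetical helper name `flatten_image`:

```python
def flatten_image(pixels):
    """Flatten a greyscale pixel matrix into a vector in row-major order.

    The first coordinate is the upper-left cell and the last coordinate
    is the lower-right cell. Values must be integers from 0 (black)
    to 255 (white).
    """
    for row in pixels:
        for value in row:
            if not 0 <= value <= 255:
                raise ValueError(f"pixel value out of range: {value}")
    return [value for row in pixels for value in row]


# A hypothetical 2x3 greyscale image.
image = [
    [0, 128, 255],
    [64, 32, 16],
]

vector = flatten_image(image)
print(vector)
```

A vector built this way preserves every pixel but carries no learned semantics; it serves only as the starting representation for the image.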


Line 24: Line 55:


Learning the network weights (i.e., the embedding model) requires a large set of labeled images. The weights are optimized to ensure that images with the same labels have closer embeddings compared to those with different labels. Once the CNN embedding model is learned, images can be transformed into vectors and stored with a K-Nearest-Neighbor index. For a new unseen image, it can be transformed using the CNN model, its k-most similar vectors can be retrieved, and the corresponding similar images can be identified.
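The store-then-retrieve step can be sketched with a brute-force nearest-neighbor index. This is an illustration under stated assumptions: the class `BruteForceKnnIndex`, the image identifiers, and the three-dimensional vectors are all hypothetical, and the hand-written vectors stand in for the output of a trained CNN. Practical systems typically use approximate nearest-neighbor indexes rather than a linear scan.

```python
import heapq
import math


class BruteForceKnnIndex:
    """Stores (id, vector) pairs and retrieves the k nearest by Euclidean distance."""

    def __init__(self):
        self._items = []

    def add(self, item_id, vector):
        self._items.append((item_id, list(vector)))

    def query(self, vector, k=2):
        """Return the ids of the k stored vectors closest to `vector`."""
        nearest = heapq.nsmallest(
            k, self._items, key=lambda item: math.dist(vector, item[1])
        )
        return [item_id for item_id, _ in nearest]


# Hypothetical embeddings standing in for the output of a trained CNN.
index = BruteForceKnnIndex()
index.add("img_cat_1", [0.9, 0.1, 0.0])
index.add("img_cat_2", [0.8, 0.2, 0.1])
index.add("img_car_1", [0.1, 0.9, 0.7])

# Embedding of a new, unseen image (also hypothetical).
print(index.query([0.88, 0.12, 0.0], k=2))
```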
Although this example focuses on images and CNNs, vector embeddings can be created for various types of data, and multiple models or methods can be employed to generate them.


==Using Vector Embeddings==
Line 33: Line 62:


Even if embeddings are not directly used for an application, many popular machine learning models and methods rely on them internally. For instance, in encoder-decoder architectures, the embeddings generated by the encoder contain the required information for the decoder to produce a result. This architecture is widely employed in applications like machine translation and caption generation.
==Vector Embeddings with Weaviate==
Weaviate is designed to support a wide range of vectorizer models and vectorizer service providers. Users can bring their own vectors, for example, if they already have a vectorization pipeline available or if none of the publicly available models are suitable.
Weaviate supports using any Hugging Face models through the text2vec-huggingface module, allowing users to choose from many sentence transformers published on Hugging Face. Other popular vectorization APIs, such as OpenAI or Cohere, can be used through the text2vec-openai or text2vec-cohere modules. Users can also run transformer models locally with text2vec-transformers, and modules like multi2vec-clip can convert images and text to vectors using a CLIP model.
All of these models perform the same core task, which is to represent the "meaning" of the original data as a set of numbers, enabling the effective implementation of semantic search. Vector embeddings can be generated from any data object, including text, images, audio, time series data, 3D models, video, and molecules. Embeddings are constructed such that two objects with similar semantics have vectors that are "close" to each other in vector space, with a "small" distance between them.
In conclusion, vector embeddings are numerical representations of various data types, facilitating machine learning applications by capturing semantic similarity. They play a vital role in natural language processing, recommendation systems, and search algorithms. By representing data as dense vectors, they enable the quantification of semantic similarity and allow for efficient similarity search and other machine learning tasks. With the development of more advanced models like transformer-based architectures and support from platforms like Weaviate, vector embeddings continue to be a cornerstone of modern machine learning applications.