Vector embeddings: Difference between revisions

Line 3: Line 3:


==Understanding Vector Embeddings==
==Understanding Vector Embeddings==
In the context of text data, words with similar meanings, such as "cat" and "kitty", must be represented in a way that captures their semantic similarity. Vector representations achieve this by transforming data objects into fixed-length arrays of real numbers, typically hundreds to thousands of elements long. These arrays are generated by machine learning models through a process called vectorization.
In the context of text data, words with similar meanings, such as "dog" and "puppy", must be represented in a way that captures their [[semantic similarity]]. [[Vector representation]]s achieve this by transforming data objects into fixed-length arrays of real numbers, typically hundreds to thousands of elements long. These arrays are generated by machine learning models through a process called [[vectorization]].


For instance, the words "cat" and "kitty" may be vectorized as follows:
For instance, the words "dog" and "puppy" may be vectorized as follows:


<code>
<poem style="border: 1px solid; padding: 1rem">
cat = [1.5, -0.4, 7.2, 19.6, 3.1, ..., 20.2]
dog = [1.5, -0.4, 7.2, 19.6, 3.1, ..., 20.2]
kitty = [1.5, -0.4, 7.2, 19.5, 3.2, ..., 20.8]
puppy = [1.5, -0.4, 7.2, 19.5, 3.2, ..., 20.8]
</code>
</poem>


These vectors exhibit a high similarity, while vectors for words like "banjo" or "comedy" would not be similar to either of these. In this way, vector embeddings capture the semantic similarity of words. The specific meaning of each number in a vector depends on the machine learning model that generated the vectors, and is not always clear in terms of human understanding of language and meaning.
These vectors exhibit a high similarity, while vectors for words like "banjo" or "comedy" would not be similar to either of these. In this way, vector embeddings capture the semantic similarity of words. The specific meaning of each number in a vector depends on the machine learning model that generated the vectors, and is not always clear in terms of human understanding of language and meaning.
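
One common way to quantify "high similarity" between two vectors is cosine similarity. The sketch below is illustrative only: it uses just the six components printed for the two example vectors above (the elided components are left out, so the score is not that of the full vectors), and <code>vec_unrelated</code> is a made-up vector standing in for an unrelated word.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 1.0 means the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Only the six components shown in the example; elided values are omitted.
vec_a = [1.5, -0.4, 7.2, 19.6, 3.1, 20.2]
vec_b = [1.5, -0.4, 7.2, 19.5, 3.2, 20.8]

# Hypothetical vector for an unrelated word; the values are invented.
vec_unrelated = [-12.0, 8.3, -0.5, 1.1, -9.7, 0.4]

sim_related = cosine_similarity(vec_a, vec_b)
sim_unrelated = cosine_similarity(vec_a, vec_unrelated)

print(f"related:   {sim_related:.4f}")    # close to 1
print(f"unrelated: {sim_unrelated:.4f}")  # much lower
```

Cosine similarity ignores vector length and compares direction only. Other measures, such as Euclidean distance or the dot product, are also used, depending on the model.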
Line 22: Line 22:
This result suggests that the difference between "king" and "man" represents some sort of "royalty", which is analogously applicable to "queen" minus "woman". Various concepts, such as "woman", "girl", "boy", etc., can be vectorized into arrays of numbers, whose individual positions are often referred to as dimensions. These arrays can be visualized and correlated to familiar words, giving insight into their meaning.
This result suggests that the difference between "king" and "man" represents some sort of "royalty", which is analogously applicable to "queen" minus "woman". Various concepts, such as "woman", "girl", "boy", etc., can be vectorized into arrays of numbers, whose individual positions are often referred to as dimensions. These arrays can be visualized and correlated to familiar words, giving insight into their meaning.
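
The analogy can be made concrete with a toy example. The two-dimensional vectors below are hypothetical and hand-picked so that the first axis behaves like "royalty" and the second like "gender". Real embeddings have far more dimensions, and the relationship holds only approximately.

```python
def subtract(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

# Hypothetical 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
toy_embeddings = {
    "king":  [5.0,  2.0],
    "queen": [5.0, -2.0],
    "man":   [1.0,  2.0],
    "woman": [1.0, -2.0],
}

royalty_from_king = subtract(toy_embeddings["king"], toy_embeddings["man"])
royalty_from_queen = subtract(toy_embeddings["queen"], toy_embeddings["woman"])

print(royalty_from_king)   # [4.0, 0.0]
print(royalty_from_queen)  # [4.0, 0.0]
```

Both differences point along the "royalty" axis only, which is the sense in which "king" minus "man" matches "queen" minus "woman".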


Vector embeddings can represent more than just word meanings. They can effectively be generated from any data object, including text, images, audio, time series data, 3D models, video, and molecules. Embeddings are constructed such that two objects with similar semantics have vectors that are "close" to each other in vector space, with a "small" distance between them.
Vector embeddings can represent more than just word meanings. They can effectively be generated from any data object, including [[text]], [[images]], [[audio]], [[time series data]], [[3D models]], [[video]], and [[molecules]]. Embeddings are constructed such that two objects with similar semantics have vectors that are "close" to each other in vector space, with a "small" distance between them.
 


==Creating Vector Embeddings==
==Creating Vector Embeddings==
Line 29: Line 28:
One method for creating vector embeddings involves engineering the vector values using [[domain knowledge]], a process known as [[feature engineering]]. For instance, in medical imaging, domain expertise is employed to quantify features such as shape, color, and regions within an image to capture semantics. However, feature engineering requires domain knowledge and is often too costly to scale.
One method for creating vector embeddings involves engineering the vector values using [[domain knowledge]], a process known as [[feature engineering]]. For instance, in medical imaging, domain expertise is employed to quantify features such as shape, color, and regions within an image to capture semantics. However, feature engineering requires domain knowledge and is often too costly to scale.


===Deep Neural Networks===
===Machine Learning Models===
Rather than engineering vector embeddings, [[models]] are frequently trained to translate objects into vectors. [[Deep neural network]]s are commonly used as such models. The resulting embeddings are typically [[high-dimensional]] (up to two thousand dimensions) and [[dense]] (most or all values are non-zero). Text data can be transformed into vector embeddings using models such as [[Word2Vec]], [[GLoVE|GloVe]], and [[BERT]]. Images can be embedded using [[convolutional neural network]]s ([[CNN]]s) like [[VGG]] and [[Inception]], while audio recordings can be converted into vectors by applying [[image embedding transformation]]s to their visual representations, such as [[spectrogram]]s.
Rather than engineering vector embeddings, [[models]] are frequently trained to translate objects into vectors. [[Deep neural network]]s are commonly used as such models. The resulting embeddings are typically [[high-dimensional]] (up to two thousand dimensions) and [[dense]] (most or all values are non-zero). Text data can be transformed into vector embeddings using models such as [[Word2Vec]], [[GLoVE|GloVe]], and [[BERT]]. Images can be embedded using [[convolutional neural network]]s ([[CNN]]s) like [[VGG]] and [[Inception]], while audio recordings can be converted into vectors by applying [[image embedding transformation]]s to their visual representations, such as [[spectrogram]]s.


==Generating Vector Embeddings==
==Generating Vector Embeddings Using ML Models==
The effectiveness of vector search depends first on generating good embeddings for each entity and query, and second on searching efficiently within very large datasets.
The effectiveness of vector search depends first on generating good embeddings for each [[entity]] and [[query]], and second on searching efficiently within very large [[dataset]]s.
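
Once every entity and the query have embeddings, the search step reduces to finding the stored vectors closest to the query vector. The following is a minimal brute-force sketch, assuming Euclidean distance and a small in-memory dictionary. The entity ids and three-dimensional vectors are hypothetical, and a linear scan like this would not scale to the very large datasets mentioned above.

```python
import math

def euclidean(u, v):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest(query, index, k=2):
    """Return the k entity ids whose embeddings lie closest to the query."""
    ranked = sorted(index, key=lambda entity_id: euclidean(query, index[entity_id]))
    return ranked[:k]

# Hypothetical embeddings; a real system would obtain these from a model.
index = {
    "doc_cat":    [1.0, 0.9, 0.1],
    "doc_kitten": [0.9, 1.0, 0.2],
    "doc_banjo":  [-1.0, 0.2, 0.9],
    "doc_comedy": [0.1, -1.0, 0.8],
}

query = [1.0, 1.0, 0.1]  # hypothetical embedding of the search query
top = nearest(query, index)
print(top)  # ['doc_cat', 'doc_kitten']
```

The scan compares the query with every stored vector, so its cost grows linearly with the dataset. Large deployments typically rely on approximate nearest-neighbor indexes instead.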


Vector embeddings can be generated for various media types, such as text, images, audio, and others. For text, vectorization techniques have evolved significantly over the last decade, from word2vec (2013) to the era of state-of-the-art transformer models, which began with the release of BERT in 2018.
Vector embeddings can be generated for various media types, such as text, images, audio, and others. For text, vectorization techniques have evolved significantly over the last decade, from [[word2vec]] (2013) to the era of state-of-the-art [[transformer]] models, which began with the release of [[BERT]] in 2018.


===Word-level Dense Vector Models (word2vec, GloVe, etc.)===
===Word-level Dense Vector Models (word2vec, GloVe, etc.)===
word2vec is a group of model architectures that popularized "dense" vectors in language processing, in which most or all values are non-zero. It uses a neural network model to learn word associations from a large text corpus. The model first creates a vocabulary from the corpus and then learns vector representations for the words, usually with 300 dimensions. Words found in similar contexts have vector representations that are close in vector space.
[[word2vec]] is a group of model architectures that popularized [[dense]] vectors in language processing, in which most or all values are non-zero. It uses a [[neural network]] model to learn word associations from a large text corpus. The model first creates a vocabulary from the corpus and then learns vector representations for the words, usually with 300 dimensions. Words found in similar contexts have vector representations that are close in vector space.
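
The "similar contexts" idea can be illustrated by the data-preparation step alone. The sketch below assumes the skip-gram variant of word2vec, which trains on (center word, context word) pairs taken from a sliding window. It builds the vocabulary and the pairs only and does not train a network. The function name, the toy sentence, and the window size are illustrative choices.

```python
def skipgram_pairs(tokens, window=2):
    """Collect (center, context) pairs within `window` positions of each token."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

corpus = "the dog chased the puppy".split()  # toy corpus for illustration

# Step 1: build the vocabulary from the corpus.
vocabulary = sorted(set(corpus))

# Step 2: derive the training pairs a skip-gram model would learn from.
pairs = skipgram_pairs(corpus, window=1)

print(vocabulary)
for center, context in pairs:
    print(center, "->", context)
```

Words that keep appearing with the same context words receive similar training signals, which is why their learned vectors end up close together.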


However, word2vec suffers from limitations, including its inability to address words with multiple meanings (polysemantic) and words with ambiguous meanings.
However, word2vec suffers from limitations, including its inability to address words with multiple meanings ([[polysemantic]]) and words with ambiguous meanings.


===Contextual Embedding Models (BERT, ELMo, and others)===
===Contextual Embedding Models (BERT, ELMo, and others)===
The current state-of-the-art models are based on the transformer architecture. Models like BERT and its successors improve search accuracy, precision, and recall by examining the context of each word to create full contextual embeddings. Unlike word2vec embeddings, which are context-agnostic, transformer-generated embeddings consider the entire input text. Each occurrence of a word has its own embedding that is influenced by the surrounding text, better reflecting the polysemantic nature of words, which can only be disambiguated when considered in context.
The current state-of-the-art models are based on the [[transformer architecture]]. Models like [[BERT]] and its successors improve search accuracy, precision, and recall by examining the context of each word to create full contextual embeddings. Unlike [[word2vec embeddings]], which are context-agnostic, [[transformer-generated embeddings]] consider the entire input text. Each occurrence of a word has its own embedding that is influenced by the surrounding text, better reflecting the [[polysemantic]] nature of words, which can only be disambiguated when considered in context.


Some potential downsides of transformer models include:
Some potential downsides of transformer models include:


Increased compute requirements: Fine-tuning transformer models is much slower (taking hours instead of minutes).
*Increased compute requirements: Fine-tuning transformer models is much slower (taking hours instead of minutes).
Increased memory requirements: Context-sensitivity greatly increases memory requirements, often leading to limitations on possible input lengths.
*Increased memory requirements: Context-sensitivity greatly increases memory requirements, often leading to limitations on possible input lengths.
Despite these drawbacks, transformer models have been incredibly successful, leading to a proliferation of vectorizer models for text as well as other data types such as audio, video, and images. Some models, like CLIP, can vectorize multiple data types (e.g., images and text) into a single vector space, enabling content-based image searches using only text.
 
Despite these drawbacks, [[transformer models]] have been incredibly successful, leading to a proliferation of vectorizer models for text as well as other data types such as audio, video, and images. Some models, like [[CLIP]], can vectorize multiple data types (e.g., images and text) into a single vector space, enabling content-based image searches using only text.


==Creating Vector Embeddings for Other Media Types==
==Creating Vector Embeddings for Other Media Types==