==Introduction==
Vector embeddings are numeric representations of data that capture certain features or aspects of the data. In the context of text data, they enable semantic search by representing the meanings of words or phrases. Machine learning models generate these embeddings, which are arrays of real numbers with a fixed length, usually ranging from hundreds to thousands of elements.
==Text Data and Vectorization==
Vectorization is the process of generating a vector for a data object. For example, two similar words like "cat" and "kitty" may have very different character sequences but share a close meaning. Vectorizing these words might result in highly similar vectors, while vectors for unrelated words like "banjo" or "comedy" would be considerably different.
Vector embeddings can represent more than just the meanings of words; they can be generated from various types of data, including text, images, audio, time series data, 3D models, videos, and molecules. The distance between two vectors in vector space can be calculated in multiple ways; one simple method is the sum of the absolute differences between elements at corresponding positions in each vector, known as the Manhattan (L1) distance.
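As a concrete illustration, the sketch below computes that sum-of-absolute-differences (Manhattan) distance over small made-up vectors. The vector values are invented purely for illustration; real embeddings have hundreds to thousands of dimensions.

<syntaxhighlight lang="python">
def manhattan_distance(a: list[float], b: list[float]) -> float:
    """Sum of absolute differences between elements at matching positions."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))

# Toy 4-dimensional "embeddings" (values invented for illustration).
cat = [0.9, 0.1, 0.3, 0.7]
kitty = [0.8, 0.2, 0.3, 0.6]
banjo = [0.1, 0.9, 0.8, 0.2]

print(manhattan_distance(cat, kitty))  # 0.3 -- small distance, similar meaning
print(manhattan_distance(cat, banjo))  # 2.6 -- large distance, unrelated words
</syntaxhighlight>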
==Generation of Vector Embeddings==
The effectiveness of vector embeddings primarily depends on how they are generated for each entity and query. For text data, vectorization techniques have evolved significantly over the last decade, from the introduction of word2vec in 2013 to state-of-the-art transformer models such as BERT, which emerged in 2018.
===Word-level Dense Vector Models (word2vec, GloVe, etc.)===
Word2vec, a family of model architectures, introduced the concept of "dense" vectors in language processing. It uses a neural network model to learn word associations from a large corpus of text, creating a vocabulary and learning a vector representation for each word, typically with 300 dimensions. Words found in similar contexts have vector representations that are close in vector space. However, word2vec has a notable limitation: it assigns a single vector per word, so it cannot distinguish the different senses of polysemous words (words with multiple meanings) or otherwise ambiguous terms.
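A minimal sketch of training word2vec, assuming the gensim library (the article does not prescribe a particular implementation). The toy corpus below is far too small to produce meaningful vectors; it is only meant to show the shape of the API.

<syntaxhighlight lang="python">
from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens. A real model would be
# trained on a large corpus, or loaded from pre-trained vectors.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["a", "kitty", "sat", "on", "the", "mat"],
    ["he", "played", "the", "banjo", "at", "the", "comedy", "show"],
]

# vector_size=300 matches the typical dimensionality mentioned above.
model = Word2Vec(corpus, vector_size=300, window=5, min_count=1, epochs=50)

vec = model.wv["cat"]                       # one fixed 300-dimensional vector
print(vec.shape)                            # (300,)
print(model.wv.similarity("cat", "kitty"))  # cosine similarity of the two vectors
</syntaxhighlight>

Note that <code>model.wv["cat"]</code> always returns the same vector regardless of context, which is exactly the polysemy limitation described above.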
===Transformer Models (BERT, ELMo, and others)===
Transformer models like BERT and its successors improve search accuracy, precision, and recall by generating full contextual embeddings for each word, taking the entire input text into account. This allows them to better represent polysemantic words and disambiguate meanings based on context, as the sketch below illustrates. Some downsides of transformer models include increased compute and memory requirements.
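The following sketch illustrates contextual embeddings, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (both illustrative choices, not prescribed by this article): the same word "bank" receives a different vector in each sentence, unlike with word2vec.

<syntaxhighlight lang="python">
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def token_embedding(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of `word` within `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

a = token_embedding("he deposited cash at the bank", "bank")
b = token_embedding("they sat on the bank of the river", "bank")

# Same surface word, different contexts -> noticeably different vectors.
print(torch.cosine_similarity(a, b, dim=0).item())
</syntaxhighlight>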
==Vector Embeddings in Weaviate==
Weaviate is designed to support various vectorizer models and service providers, allowing users to bring their own vectors or use publicly available models. It supports Hugging Face models through the text2vec-huggingface module, enabling the use of many sentence transformers available on the platform. Other popular vectorization APIs, such as OpenAI and Cohere, are supported through the text2vec-openai and text2vec-cohere modules, respectively. Users can also run transformer models locally with text2vec-transformers or use multi2vec-clip to convert images and text to vectors using a CLIP model.
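As a sketch of how this looks in practice, the example below assumes a locally running Weaviate instance with the text2vec-openai module enabled (including a configured OpenAI API key) and the v3 Python client; the class and property names are invented for illustration.

<syntaxhighlight lang="python">
import weaviate

# Assumes a local Weaviate instance with text2vec-openai enabled.
client = weaviate.Client("http://localhost:8080")

# Delegate embedding generation for this class to the module.
client.schema.create_class({
    "class": "Article",
    "vectorizer": "text2vec-openai",
    "properties": [{"name": "content", "dataType": ["text"]}],
})

# Objects are vectorized automatically on import.
client.data_object.create(
    {"content": "Vector embeddings enable semantic search."},
    class_name="Article",
)

# Semantic query: the query text is vectorized with the same module.
result = (
    client.query.get("Article", ["content"])
    .with_near_text({"concepts": ["semantic search"]})
    .with_limit(1)
    .do()
)
print(result)
</syntaxhighlight>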