Vector embeddings

Introduction

Vector embeddings are numeric representations of data that effectively capture certain features or aspects of the data. In the context of text data, they enable semantic search by representing the meanings of words or phrases. Machine learning models generate these embeddings, which are arrays of real numbers with a fixed length, usually ranging from hundreds to thousands of elements.
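As a minimal illustration (with made-up numbers), an embedding is simply a fixed-length array of floating-point values; every object embedded by the same model receives a vector of the same length:

```python
import numpy as np

# Hypothetical 8-dimensional embedding; real models typically produce
# hundreds to thousands of dimensions (e.g. 384, 768, or 1536).
embedding = np.array([0.12, -0.45, 0.03, 0.88, -0.27, 0.51, -0.09, 0.33])

print(embedding.shape)  # (8,) -- the length is fixed for a given model
```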

Text Data and Vectorization

Vectorization is the process of generating a vector for a data object. For example, two similar words like "cat" and "kitty" may have very different character sequences but share a close meaning. Vectorizing these words might result in highly similar vectors, while vectors for unrelated words like "banjo" or "comedy" would be considerably different.
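A rough sketch of this idea, assuming the sentence-transformers library and the publicly available all-MiniLM-L6-v2 model (any text-embedding model would behave similarly):

```python
from sentence_transformers import SentenceTransformer, util

# Small, publicly available text-embedding model.
model = SentenceTransformer("all-MiniLM-L6-v2")

words = ["cat", "kitty", "banjo"]
embeddings = model.encode(words)  # one fixed-length vector per word

# Cosine similarity: higher for similar meanings, lower for unrelated ones.
print(util.cos_sim(embeddings[0], embeddings[1]))  # cat vs. kitty  -> relatively high
print(util.cos_sim(embeddings[0], embeddings[2]))  # cat vs. banjo  -> noticeably lower
```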

Vectors can represent meaning to a certain extent, although it is not obvious what each individual number in a vector signifies. However, by comparing the vectors of familiar words with one another, we can get a rough sense of the relationships they encode. One notable example is the equation "king − man + woman ≈ queen", which demonstrates that vector embeddings can capture semantic similarities and relationships between words.
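One way to reproduce this classic example is with gensim and a set of pre-trained word vectors; the exact nearest neighbour depends on the model used, but "queen" is the typical result:

```python
import gensim.downloader as api

# Small pre-trained GloVe vectors; other word-vector models work the same way.
word_vectors = api.load("glove-wiki-gigaword-100")

# vector("king") - vector("man") + vector("woman") ~= vector("queen")
result = word_vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically [("queen", <similarity score>)]
```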

Vector embeddings can represent more than just the meanings of words; they can be generated from various types of data, including text, images, audio, time series data, 3D models, videos, and molecules. The distance between two vectors in vector space can be calculated in several ways; one simple method is the sum of the absolute differences between elements at corresponding positions in each vector, known as the Manhattan (L1) distance.
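For example, the Manhattan (L1) distance mentioned above can be computed directly, alongside other common choices such as Euclidean distance and cosine similarity:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 4.0])

# Manhattan (L1) distance: sum of absolute element-wise differences.
manhattan = np.sum(np.abs(a - b))  # |1-2| + |2-0| + |3-4| = 4.0

# Two other common metrics:
euclidean = np.linalg.norm(a - b)  # straight-line (L2) distance
cosine_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # angle-based similarity

print(manhattan, euclidean, cosine_sim)
```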

Generation of Vector Embeddings

The effectiveness of vector embeddings depends primarily on how they are generated for each entity and query. For text data, vectorization techniques have evolved significantly over the last decade, from the introduction of word2vec in 2013 to transformer models such as BERT, which emerged in 2018.

Word-level Dense Vector Models (word2vec, GloVe, etc.)

Word2vec, a family of model architectures, popularized the use of "dense" vectors in language processing. It uses a shallow neural network to learn word associations from a large corpus of text, building a vocabulary and learning a vector representation for each word, typically with 300 dimensions. Words that appear in similar contexts end up with vector representations that are close together in vector space. However, word2vec produces a single static vector per word, so it cannot distinguish the different senses of polysemantic words (words with multiple meanings) or otherwise disambiguate meaning from context.
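A minimal sketch of this workflow, using gensim's Word2Vec implementation on a toy corpus (a real model would be trained on billions of tokens, and results on a corpus this small are noisy):

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "kitty", "sat", "on", "the", "rug"],
    ["he", "played", "the", "banjo", "on", "stage"],
]

# 300 dimensions is the conventional size for word2vec vectors.
model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, epochs=50)

print(model.wv["cat"].shape)                # (300,) -- one static vector per word
print(model.wv.similarity("cat", "kitty"))  # words in similar contexts tend to end up closer
```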

Contextual Models (BERT, ELMo, and others)

Contextual models, most notably transformer models like BERT and its successors, improve search accuracy, precision, and recall by generating a contextual embedding for each word, taking the entire input text into account. This allows them to represent polysemantic words more accurately and to disambiguate meanings based on context. Downsides of these models include increased compute and memory requirements.
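A sketch of what "contextual" means in practice, assuming the Hugging Face transformers library and the bert-base-uncased model: the same word receives a different vector depending on its surrounding text.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def token_embedding(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of `word` within `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

# "bank" gets a different vector in each sentence, because the whole input is used.
a = token_embedding("she sat on the river bank", "bank")
b = token_embedding("he deposited cash at the bank", "bank")
print(torch.cosine_similarity(a, b, dim=0))  # noticeably below 1.0
```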

Vector Embeddings in Weaviate

Weaviate is designed to support various vectorizer models and service providers, allowing users to bring their own vectors or use publicly available models. It supports Hugging Face models through the text2vec-huggingface module, enabling the use of many sentence transformers available on the platform. Other popular vectorization APIs, such as OpenAI and Cohere, are supported through the text2vec-openai and text2vec-cohere modules, respectively. Users can also run transformer models locally with text2vec-transformers or use multi2vec-clip to convert images and text to vectors using a CLIP model.
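As a sketch, using the v3-era Weaviate Python client, a class can be configured to delegate vectorization to one of these modules; the exact moduleConfig options depend on the module and model chosen, and the class definition below is purely illustrative:

```python
import weaviate

# Assumes a local Weaviate instance with the text2vec-huggingface module enabled;
# a Hugging Face inference API key may also need to be passed via additional_headers.
client = weaviate.Client("http://localhost:8080")

# Hypothetical class whose objects are vectorized by the text2vec-huggingface module.
article_class = {
    "class": "Article",
    "vectorizer": "text2vec-huggingface",
    "moduleConfig": {
        "text2vec-huggingface": {
            "model": "sentence-transformers/all-MiniLM-L6-v2",
        }
    },
}

client.schema.create_class(article_class)

# Objects added to this class are vectorized automatically on import;
# alternatively, a precomputed vector can be supplied per object ("bring your own vectors").
client.data_object.create({"title": "Vector embeddings"}, "Article")
```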