Vector database: Difference between revisions

{{stub}}
== Introduction ==
{{see also|AI terms}}
A [[vector database]] is a type of [[database]] designed for storing and querying [[high-dimensional vector data]], which is widely used in [[artificial intelligence applications]] ([[AI]] [[apps]]). These databases are gaining popularity because they can extend [[large language models]] ([[LLMs]]) with [[long-term memory]] and provide efficient [[querying]] for AI-driven applications.


==Vector Embeddings==
Complex data, which includes unstructured forms like documents, images, videos, and plain text, is growing rapidly. Traditional databases designed for structured data struggle to store and analyze complex data effectively, and often require extensive keyword and metadata classification. Machine learning (ML) techniques can address this issue by transforming complex data into vector embeddings, which describe each data object as a point in a space with many dimensions.
[[Vector]]s are arrays of numbers that can represent complex objects such as [[words]], [[sentences]], [[images]], or [[audio]] files in a [[continuous high-dimensional space]], called an [[embedding]]. [[Embeddings]] work by mapping [[semantically similar]] words or features from various data types close together in that space. These embeddings can be used in [[recommendation systems]], [[search engines]], and [[text generation applications]] like [[ChatGPT]].
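
The idea that semantically similar items sit close together can be illustrated with a small sketch. The three-dimensional vectors below are made up for illustration (real embeddings come from a trained model and usually have hundreds or thousands of dimensions), and <code>cosineSimilarity</code> is a helper written for this example rather than part of any library.

```javascript
// Toy 3-dimensional "embeddings"; the values are invented for illustration.
const toyEmbeddings = {
  cat: [0.9, 0.8, 0.1],
  kitten: [0.85, 0.9, 0.15],
  car: [0.1, 0.2, 0.95],
};

// Cosine similarity: dot(a, b) / (|a| * |b|). Values near 1 mean the
// vectors point in nearly the same direction.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const catKitten = cosineSimilarity(toyEmbeddings.cat, toyEmbeddings.kitten);
const catCar = cosineSimilarity(toyEmbeddings.cat, toyEmbeddings.car);

// "cat" is scored as more similar to "kitten" than to "car".
console.log(catKitten > catCar);
```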


==Database Structure==
Vector databases are designed specifically to handle vector embeddings. These databases index vectors, allowing for easy search and retrieval by comparing values and identifying those most similar to one another. Although vector search is challenging to implement, various solutions are available, ranging from plugins and open-source projects to fully managed services.
In a [[relational database]], data is organized in rows and columns, while in a [[document database]], it is organized in documents and collections. In contrast, a vector database stores arrays of numbers that are clustered based on [[similarity]]. These databases can be queried with [[ultra-low latency]], making them ideal for AI-driven applications.
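
As a rough sketch of what "finding the most similar vectors" means, the following compares a query vector against every stored vector and returns the closest ones. This exhaustive scan is only an illustration: production vector databases use specialized indexes to avoid comparing against every record, and the names <code>nearestNeighbors</code> and <code>storedRecords</code> are hypothetical.

```javascript
// Euclidean (straight-line) distance between two vectors of equal length.
function euclideanDistance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sum += diff * diff;
  }
  return Math.sqrt(sum);
}

// Brute-force search: score every record, sort by distance, keep the top k.
function nearestNeighbors(records, queryVector, k) {
  return records
    .map((record) => ({
      id: record.id,
      distance: euclideanDistance(record.vector, queryVector),
    }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k);
}

const storedRecords = [
  { id: "a", vector: [0.0, 0.0] },
  { id: "b", vector: [1.0, 1.0] },
  { id: "c", vector: [5.0, 5.0] },
];

const hits = nearestNeighbors(storedRecords, [0.9, 1.2], 2);
console.log(hits.map((hit) => hit.id)); // closest first
```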


==Vector Database Products==
Several vector databases have emerged to meet the growing demand from AI applications. Popular native vector databases include the open-source [[Weaviate]], written in Go, and [[Milvus]], written largely in Go and C++. [[Pinecone]] is another popular vector database, although it is not open source. [[Chroma]], initially based on Clickhouse, is an open-source project with a growing following. Relational databases like [[Postgres]] can add vector search through extensions like [[pgVector|pgvector]], and [[Redis]] has first-class vector support.


== What is a Vector Database? ==
A vector database is a type of database that indexes and stores vector embeddings for efficient retrieval and similarity search. In addition to traditional CRUD (create, read, update, and delete) operations and metadata filtering, vector databases can compare any stored vector with any other, or with the vector of a search query. This capability allows vector databases to excel at similarity search, or "vector search," returning results based on meaning that traditional keyword-based search technology would miss.


==Using Vector Databases with Large Language Models==
One of the primary reasons for the increasing popularity of vector databases is their ability to extend large language models (LLMs) with long-term memory. Starting from a general-purpose model, such as [[OpenAI]]'s [[GPT-4]], [[Meta]]'s [[LLaMA]], or [[Google]]'s [[LaMDA]], users can store their own data in a vector database. When the model is [[prompt]]ed, relevant documents are retrieved from the database and added to the context, customizing the final response and effectively giving the AI long-term memory.


In addition, vector databases can integrate with tools like [[LangChain]], a framework for chaining LLM calls with external data sources and tools to build more advanced applications.
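
The retrieval step can be sketched without any external service. In the example below, <code>embedText</code> is a hypothetical stand-in for a real embedding model (it merely counts a few vocabulary words), and <code>buildPrompt</code> shows how the best-matching stored text might be placed into the context before the question is sent to an LLM. No actual model is called; this is an illustration under those assumptions, not a specific product's API.

```javascript
// Hypothetical stand-in for an embedding model: counts occurrences of a
// small fixed vocabulary. A real system would call an embedding model.
const VOCABULARY = ["refund", "shipping", "password", "reset", "days"];

function embedText(text) {
  const words = text.toLowerCase().split(/\W+/);
  return VOCABULARY.map((term) => words.filter((w) => w === term).length);
}

function dotProduct(a, b) {
  return a.reduce((sum, value, i) => sum + value * b[i], 0);
}

// The "long-term memory": stored texts together with their vectors.
const memory = [
  "Refund requests are accepted within 30 days.",
  "Shipping takes 5 business days.",
  "To reset a password, use the account page.",
].map((text) => ({ text, vector: embedText(text) }));

// Retrieve the topK most relevant texts and prepend them to the question.
function buildPrompt(question, topK) {
  const queryVector = embedText(question);
  const context = memory
    .map((doc) => ({ text: doc.text, score: dotProduct(doc.vector, queryVector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((doc) => doc.text);
  return `Context:\n${context.join("\n")}\n\nQuestion: ${question}`;
}

const prompt = buildPrompt("How do I reset my password?", 1);
console.log(prompt);
```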
==Example Code==
A typical workflow, for example using [[Chroma]] with [[JavaScript]], looks as follows. First, create the client and define an [[embedding function]]; the [[OpenAI API]] can be used to compute the embeddings whenever a new data point is added. Each data point is a document with an ID and some text. The database can then be queried by passing a string of text, and the result includes the matching data along with an array of distances, where smaller numbers indicate higher degrees of similarity.


[[Category:Terms]] [[Category:Artificial intelligence terms]]
== Why Use a Vector Database? ==
Vector databases offer several use cases, including:
=== 1. Semantic search ===
 
Unlike lexical search, which relies on exact word or string matches, semantic search uses the meaning and context of a search query or question. Vector databases store and index vector embeddings produced by natural language processing models, allowing for more accurate and relevant search results.
 
=== 2. Similarity search for unstructured data ===
 
Vector databases facilitate search and retrieval of unstructured data like images, audio, and video, as well as semi-structured data like JSON, which can be challenging to classify and store in traditional databases.
 
=== 3. Ranking and recommendation engines ===
 
By finding similar items based on nearest matches, vector databases are suitable for powering ranking and recommendation engines for online retailers and streaming media services.
 
=== 4. Deduplication and record matching ===
 
Vector similarity search can be used to find near-duplicate records for applications such as removing duplicate items from a catalog.
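
A minimal sketch of near-duplicate detection, assuming each catalog item already has an embedding: pairs whose cosine similarity meets a chosen threshold are reported as likely duplicates. The item IDs, vectors, and the 0.99 threshold are made up for illustration, and the pairwise loop is only practical for small collections; a vector database would use its index for this.

```javascript
// Cosine similarity of two vectors of equal length.
function cosine(a, b) {
  const dot = a.reduce((sum, value, i) => sum + value * b[i], 0);
  const norm = (v) => Math.sqrt(v.reduce((sum, value) => sum + value * value, 0));
  return dot / (norm(a) * norm(b));
}

// Report every pair of items whose similarity is at or above the threshold.
function findNearDuplicates(items, threshold) {
  const pairs = [];
  for (let i = 0; i < items.length; i++) {
    for (let j = i + 1; j < items.length; j++) {
      if (cosine(items[i].vector, items[j].vector) >= threshold) {
        pairs.push([items[i].id, items[j].id]);
      }
    }
  }
  return pairs;
}

// Illustrative catalog: sku-1 and sku-2 have almost identical embeddings.
const catalog = [
  { id: "sku-1", vector: [1.0, 0.0, 0.2] },
  { id: "sku-2", vector: [0.98, 0.02, 0.21] },
  { id: "sku-3", vector: [0.0, 1.0, 0.0] },
];

const duplicatePairs = findNearDuplicates(catalog, 0.99);
console.log(duplicatePairs);
```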
 
=== 5. Anomaly detection ===
 
Vector databases can identify anomalies in applications used for threat assessment, fraud detection, and IT operations by finding objects that are distant or dissimilar from expected results.
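
One simple way to express "distant from expected results" is to compare each new vector with the centroid (average) of known-normal vectors. The sketch below assumes this centroid-based approach, which is only one of many possible techniques; the baseline values and the distance threshold are invented for illustration.

```javascript
// Average of a list of equal-length vectors.
function centroid(vectors) {
  const dims = vectors[0].length;
  const mean = new Array(dims).fill(0);
  for (const v of vectors) {
    for (let i = 0; i < dims; i++) {
      mean[i] += v[i] / vectors.length;
    }
  }
  return mean;
}

function distance(a, b) {
  return Math.sqrt(a.reduce((sum, value, i) => sum + (value - b[i]) ** 2, 0));
}

// Embeddings of events known to be normal (illustrative values).
const baseline = [
  [1.0, 1.0],
  [1.1, 0.9],
  [0.9, 1.1],
];
const expected = centroid(baseline);

// An event is flagged when it lies farther than maxDistance from the centroid.
function isAnomalous(vector, maxDistance) {
  return distance(vector, expected) > maxDistance;
}

console.log(isAnomalous([1.05, 0.95], 0.5)); // close to the baseline
console.log(isAnomalous([8.0, 9.0], 0.5)); // far from the baseline
```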
 
== Required Capabilities of a Vector Database ==
 
A vector database must possess certain capabilities to be effective:
 
=== 1. Vector Indexes for Search and Retrieval ===
 
Vector databases must use algorithms designed to index and retrieve vectors efficiently. These algorithms can be tuned to the requirements of the use case. Common similarity metrics used in vector indexes include Euclidean distance, cosine similarity, and the dot product. Approximate Nearest Neighbor (ANN) search trades a small amount of accuracy for much faster queries by returning vectors that are approximately, rather than exactly, the most similar.
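
The three metrics are closely related. For vectors normalized to unit length, cosine similarity equals the dot product, and the squared Euclidean distance equals 2 − 2 × (dot product), so all three produce the same ranking. The following sketch checks that identity numerically; the helper names are chosen for this example only.

```javascript
// Scale a vector to unit length.
function normalize(v) {
  const length = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return v.map((x) => x / length);
}

function dot(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

function squaredEuclidean(a, b) {
  return a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0);
}

const unitA = normalize([3, 4]); // [0.6, 0.8]
const unitB = normalize([4, 3]); // [0.8, 0.6]

// For unit vectors: |a - b|^2 = 2 - 2 * dot(a, b).
const leftSide = squaredEuclidean(unitA, unitB);
const rightSide = 2 - 2 * dot(unitA, unitB);

console.log(Math.abs(leftSide - rightSide) < 1e-12);
```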
 
=== 2. Single-Stage Filtering ===
 
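Metadata filters can be applied before the vector search (pre-filtering), which is accurate but can be slow, or after it (post-filtering), which is fast but can return fewer results than requested because matching items may already have been cut from the candidate list. Single-stage filtering combines the accuracy of pre-filtering with the speed of post-filtering by merging vector and metadata indexes into a single index. This feature is essential for an effective vector database.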
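
The difference between the two baseline strategies can be shown on a tiny dataset. The sketch below does not implement single-stage filtering itself (that requires a merged index); it only illustrates, with hypothetical data and function names, why post-filtering can come back with too few results while pre-filtering does not.

```javascript
// Illustrative records: a vector plus one metadata field.
const products = [
  { id: "p1", vector: [1.0, 0.0], category: "shoes" },
  { id: "p2", vector: [0.9, 0.1], category: "hats" },
  { id: "p3", vector: [0.8, 0.2], category: "hats" },
  { id: "p4", vector: [0.0, 1.0], category: "shoes" },
];

function squaredDistance(a, b) {
  return a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0);
}

// Exhaustive top-k search by squared Euclidean distance.
function topK(records, query, k) {
  return [...records]
    .sort((x, y) => squaredDistance(x.vector, query) - squaredDistance(y.vector, query))
    .slice(0, k);
}

// Post-filtering: search first, then drop results that fail the filter.
function postFilter(records, query, k, category) {
  return topK(records, query, k).filter((r) => r.category === category);
}

// Pre-filtering: restrict to matching records first, then search.
function preFilter(records, query, k, category) {
  return topK(records.filter((r) => r.category === category), query, k);
}

const postIds = postFilter(products, [1.0, 0.0], 2, "shoes").map((r) => r.id);
const preIds = preFilter(products, [1.0, 0.0], 2, "shoes").map((r) => r.id);

console.log(postIds); // fewer than the 2 results requested
console.log(preIds); // both matching records
```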
 
=== 3. Data Sharding ===
 
To achieve scalable and cost-effective performance, vector databases must support horizontal scaling through data sharding. By dividing vectors into shards and replicas, vector databases can scale across multiple machines, allowing for the efficient searching of large datasets.
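
A common way to query sharded data is scatter-gather: each shard returns its own top results, and the partial results are merged into a global top-k. The sketch below simulates this in a single process under simplifying assumptions (round-robin shard assignment, exhaustive search within each shard); in a real deployment the shards would live on separate machines and the names used here are hypothetical.

```javascript
// Assign records to shards round-robin (real systems often hash the ID).
function partition(records, shardCount) {
  const shards = Array.from({ length: shardCount }, () => []);
  records.forEach((record, i) => shards[i % shardCount].push(record));
  return shards;
}

function squaredDistance(a, b) {
  return a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0);
}

// Local search: the k closest records within one shard.
function searchShard(shard, query, k) {
  return shard
    .map((r) => ({ id: r.id, distance: squaredDistance(r.vector, query) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k);
}

// Scatter the query to every shard, then merge the partial results.
function searchAllShards(shards, query, k) {
  return shards
    .flatMap((shard) => searchShard(shard, query, k))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k);
}

const points = [0, 1, 2, 3, 4, 5, 6, 7].map((n) => ({ id: `v${n}`, vector: [n, n] }));
const shards = partition(points, 3);
const merged = searchAllShards(shards, [2.2, 2.2], 3);

console.log(merged.map((r) => r.id));
```

Because every shard returns its own k best candidates, the merged result matches what a search over the unsharded data would return.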
 
== Replication ==
Replication is a technique used by vector databases to handle multiple requests efficiently. This approach increases the capacity of the system to process simultaneous or rapidly occurring vector search requests.
 
=== Shards and Replicas ===
Sharding enables a vector database to distribute work across several pods, which can perform vector searches more quickly. Replicas, on the other hand, create copies of the entire set of pods to handle more requests concurrently. In a scenario where multiple search requests are incoming, replicas can help maintain system performance by providing additional processing capacity.
 
=== High Availability ===
Replicas also improve the availability of a vector database. As machines are prone to failure, it is crucial for the system to recover from these events quickly. By distributing replicas across different [[availability zones]], the database can achieve high availability and resilience against simultaneous failures. Users also have a responsibility to ensure sufficient replica capacity, so the remaining replicas can maintain acceptable latency during a failure event.
 
== Hybrid Storage ==
Hybrid storage configurations help address the challenges of memory cost and search latency associated with vector searches. These configurations balance the need for speed and accuracy with the requirement for cost-effective infrastructure.
 
=== In-Memory and On-Disk Storage ===
In a hybrid storage system, a compressed vector index is stored in memory (RAM), while the original, full-resolution vector index is stored on disk. The in-memory index is used to locate a small set of candidates, which are then searched within the complete index on disk. This approach enables rapid and accurate search results while reducing infrastructure costs by up to 10 times.
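
The two-stage lookup can be sketched as follows. Here a <code>Map</code> stands in for the full-resolution vectors on disk, and a very coarse rounding scheme stands in for index compression; both are assumptions made for illustration, since real systems use techniques such as product quantization and actual disk storage.

```javascript
// Stand-in for full-resolution vectors stored on disk (illustrative values).
const fullVectorsOnDisk = new Map([
  ["a", [0.12, 0.88]],
  ["b", [0.45, 0.52]],
  ["c", [0.14, 0.91]],
  ["d", [0.95, 0.05]],
]);

// Coarse compression: map each component in [0, 1] to one of 5 integer levels.
function compress(vector) {
  return vector.map((x) => Math.round(x * 4));
}

// Stand-in for the compressed index kept in memory.
const compressedInMemory = [...fullVectorsOnDisk].map(([id, vector]) => ({
  id,
  code: compress(vector),
}));

function squaredDistance(a, b) {
  return a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0);
}

function hybridSearch(query, candidateCount, k) {
  // Stage 1: cheap approximate ranking on the compressed codes.
  const queryCode = compress(query);
  const candidates = compressedInMemory
    .map((entry) => ({ id: entry.id, approx: squaredDistance(entry.code, queryCode) }))
    .sort((x, y) => x.approx - y.approx)
    .slice(0, candidateCount);

  // Stage 2: exact re-ranking of the few candidates using the full vectors.
  return candidates
    .map((c) => ({ id: c.id, distance: squaredDistance(fullVectorsOnDisk.get(c.id), query) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k);
}

const results = hybridSearch([0.15, 0.9], 3, 2);
console.log(results.map((r) => r.id));
```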
 
=== Improved Storage Capacity ===
Hybrid storage allows for more vectors to be stored across the same data footprint, thus lowering the operational cost of a vector database. This is achieved by enhancing overall storage capacity without negatively affecting database performance.
 
== API ==
APIs play a crucial role in allowing developers to interact with vector databases without the need to build and maintain vector search functionality themselves. By utilizing APIs, developers can focus on optimizing their applications.
 
=== REST APIs and Language Clients ===
Vector databases often offer REST APIs, which add flexibility by allowing the database to be accessed from any environment capable of making HTTPS calls. Developers can also interact with the vector database using clients in various programming languages, such as Python, Java, and Go. These APIs enable actions such as upserting vectors, retrieving query results, or deleting vectors to be performed seamlessly within the context of an application.
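
As an illustration of what such a call might look like, the sketch below builds an upsert request for a hypothetical endpoint. The URL path, header name, and payload shape are assumptions made for this example; each service defines its own, so its API reference should be consulted for the real values.

```javascript
// Build (but do not send) an HTTP request that upserts vectors.
// The "/vectors/upsert" path, "Api-Key" header, and body shape are hypothetical.
function buildUpsertRequest(baseUrl, apiKey, vectors) {
  return {
    url: `${baseUrl}/vectors/upsert`,
    options: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "Api-Key": apiKey,
      },
      body: JSON.stringify({ vectors }),
    },
  };
}

const request = buildUpsertRequest("https://example.invalid", "PLACEHOLDER_KEY", [
  { id: "doc-1", values: [0.1, 0.2, 0.3] },
]);

console.log(request.url);
// In an environment with fetch available, the request could be sent with:
//   await fetch(request.url, request.options);
```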