LanceDB is an open-source, serverless vector database and multimodal lakehouse built on the Lance columnar storage format. Founded in 2021 by Chang She and Lei Xu and incorporated in 2022 through Y Combinator's Winter 2022 batch, the company is headquartered in San Francisco, California. LanceDB is designed to store, query, and manage vector embeddings, images, video, audio, and structured metadata in a single unified table, without requiring a separate vector store alongside a data lake.
The project separates into two related but distinct components: the Lance file format, an open-source columnar container format licensed under Apache 2.0 and maintained independently at the lance-format GitHub organization, and LanceDB itself, the database library and managed service built on top of Lance. The LanceDB library is also released under the Apache 2.0 license and is available in Python, TypeScript, and Rust, as well as through a REST API.
By May 2026, the LanceDB repository on GitHub had accumulated over 10,200 stars and the Lance format had been downloaded more than 20 million times. Production deployments include companies such as Midjourney, Runway, Character.ai, ByteDance, WeRide, Netflix, and Airtable.
Chang She began his career as a quantitative analyst, first at the hedge fund AQR Capital Management and then at Barclays Capital. He left finance to co-found DataPad, a cloud-based business intelligence startup, with Wes McKinney, the creator of the pandas Python library; She himself was one of the original core developers of pandas, the foundational Python data manipulation library. Cloudera acquired DataPad in September 2014.
After the acquisition, She spent roughly four and a half years at Cloudera managing engineering teams working on the Hadoop ecosystem. He then became VP of Engineering at Tubi TV, which Fox acquired in 2020 for $440 million. At Tubi he led machine learning infrastructure covering recommender systems, ML serving, and A/B testing.
Lei Xu holds a Ph.D. in Computer Science from the University of Nebraska-Lincoln and was a core contributor to HDFS (Hadoop Distributed File System), becoming a committer and member of the Apache Hadoop project management committee. He worked at Cloudera from 2014 to 2018 on the HDFS team, where he met Chang She. After Cloudera he led ML infrastructure at Cruise, the autonomous vehicle company.
She and Xu had the initial idea for what became LanceDB in 2022, according to She's own account. The original motivation was not a vector database at all. Both founders had observed the same pattern independently: projects involving multimodal data, particularly computer vision workloads with images and video, consistently took longer to build, were harder to maintain, and were more difficult to deploy to production than equivalent text or tabular data projects. Their diagnosis was that the problem was not at the application or orchestration layer; it was the underlying data infrastructure.
They built the Lance file format first, a new columnar storage format intended to serve computer vision data pipelines. The team initially wrote the format in C++, the language Parquet is implemented in. During a December 2022 holiday project for an early customer, they partially rewrote the Lance read path in Rust. The results were strong enough that the team rewrote the entire codebase in Rust over the following weeks, completing in roughly three weeks what the C++ version had taken six months to build. Beyond speed, Rust's memory safety eliminated the segmentation faults that had prevented the team from releasing confidently in C++.
When ChatGPT launched in late 2022, the open-source community around Lance quickly began using it for vector search in generative AI applications. The team noticed that it was easier to explain a vector database than a new columnar format, so they separated the vector database functionality into a dedicated repository and shifted their public positioning.
LanceDB raised a $3 million pre-seed round in its early period. In May 2024 the company announced an $8 million seed round led by CRV, with additional participation from Y Combinator, Essence VC, and Swift Ventures, bringing total funding to $11 million.
In June 2025 LanceDB closed a $30 million Series A led by Theory Ventures. Participating investors included CRV, Y Combinator, Databricks Ventures, RunwayML, Zero Prime, and Swift. Total funding reached $41 million across three rounds and approximately 10 investors.
At the time of the Series A, Chang She described the company's broader ambition: to establish Lance as the standard data format for multimodal AI, in the way that Parquet became standard for tabular analytics.
Lance was designed to address limitations that Parquet presents for machine learning workloads. Parquet organizes data into row groups, each a self-contained slice of rows compressed and stored together. This structure works well for analytical queries that scan large contiguous ranges of rows, but it performs poorly for random access patterns where a query needs to read a small number of individual rows scattered across the file.
In machine learning training, random access is common: training loops typically shuffle datasets and read mini-batches of non-contiguous samples. Benchmarks from the Lance team showed Lance providing approximately 100 times faster random access than Parquet or Apache Iceberg for these workloads, while preserving competitive full-scan performance for analytical queries.
Internally, a Lance dataset stores data in fragments, which are small independent columnar files. Each fragment has its own statistics and min/max zone maps that allow a query engine to skip fragments entirely when they cannot contain relevant rows. Unlike a single Parquet file with multiple row groups, Lance fragments are separate files on disk or object storage, which enables concurrent writes without a central locking mechanism.
A Lance file conceptually acts as a single row group with no fixed size boundary. The Lance v2 format, introduced in 2024, eliminated row groups as a structural unit and replaced them with variable-length pages within columns. Each column's pages can be sized independently based on what is optimal for the storage backend, such as 8 MiB aligned pages for Amazon S3. This solved the traditional "Goldilocks problem" with Parquet row groups, where groups that are too small create excessive metadata overhead and groups that are too large waste I/O when reading partial data.
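The trade-off can be made concrete with toy arithmetic. All numbers below are invented for illustration; the Parquet footer model is simplified to one metadata entry per column per row group:

```python
# Toy illustration of the row-group "Goldilocks problem" (numbers invented).
# A Parquet footer stores roughly one column-chunk entry per column per row
# group, so small row groups multiply metadata, while large row groups force
# a reader that wants a small slice to fetch a big contiguous chunk.

TOTAL_BYTES = 10 * 2**30   # a 10 GiB table
NUM_COLUMNS = 100

def footer_entries(row_group_bytes: int) -> int:
    """Column-chunk metadata entries for a given row-group size."""
    num_groups = TOTAL_BYTES // row_group_bytes
    return num_groups * NUM_COLUMNS

small = footer_entries(1 * 2**20)   # 1 MiB groups: metadata explodes
large = footer_entries(1 * 2**30)   # 1 GiB groups: few entries, coarse reads
print(small, large)                 # 1,024,000 entries vs 1,000 entries

# With per-column pages (as in Lance v2), each column is paged independently,
# e.g. 8 MiB pages sized for S3, regardless of how other columns are laid out.
pages_per_column = TOTAL_BYTES // NUM_COLUMNS // (8 * 2**20)
```

The point is that a single row-group size must serve every column at once, whereas per-column pages let a wide blob column and a narrow integer column each pick their own granularity.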
Lance v2 treats encodings as pluggable extensions rather than built-in components. The format uses Protocol Buffer "any" messages to describe encodings, so new encodings can be added without changing the core reader or writer. This design allows the format to evolve without breaking backward compatibility.
The format supports two broad categories of structural encoding. The first is suitable for dense fixed-width numeric data, including embedding vectors. The second handles variable-length and nested data such as strings, lists, and binary blobs. Lance selects between them automatically based on the column's data type.
Lance v2.2 achieved storage size reductions exceeding 50% compared to v1 for many workloads, and up to 68 times faster reads for blob columns containing large binary objects such as images and video frames.
Every write to a Lance dataset produces a new version via an append-only transaction log. Each version captures fragment metadata and schema evolution. This gives applications instant rollback to any prior version, exact reproduction of training snapshots, and branching for experimentation without data duplication. The versioning model is conceptually similar to Delta Lake or Apache Iceberg but lighter, because versioning operates over independent fragments rather than a global log over row groups.
Lance datasets are built on Apache Arrow's in-memory format, which means any language with an Arrow binding can read Lance data without a dedicated Lance SDK. The lance-format GitHub organization also maintains integration libraries for Apache Spark, making Lance readable from existing Spark-based data pipelines. Additional compatible tools include DuckDB, Pandas, Polars, PyArrow, and PyTorch dataloaders.
LanceDB's primary deployment model for development and self-hosted production is embedded: the database runs as a library inside the application process, with no separate server process to start or manage. The application reads and writes Lance files directly to local disk or to a cloud object storage bucket such as Amazon S3, Google Cloud Storage, or Azure Blob Storage.
This design makes LanceDB well suited for serverless functions: because there is no persistent server, a Lambda function can open a LanceDB dataset stored in S3, run queries, and terminate without needing to connect to or disconnect from a server. The trade-off is that concurrency control is at the file system level; multiple writers to the same dataset must use external coordination or rely on Lance's optimistic concurrency.
Because the database is file-based, scaling in embedded mode is bounded by what a single host can provide in CPU, memory, and storage I/O. For datasets up to hundreds of millions of vectors on a single host, embedded mode is sufficient for most production RAG and search workloads.
LanceDB Cloud is a managed serverless service. It exposes the same API as the embedded library but moves storage and query execution to LanceDB's infrastructure. Because the service is serverless, users pay only for storage consumed and queries executed, with no minimum monthly fee. The service entered public beta in 2025.
LanceDB Cloud targets teams that want managed hosting without operating infrastructure but do not yet need the scale or compliance requirements of the enterprise tier. It is accessible through the same Python, TypeScript, and REST clients used for the embedded library.
LanceDB Enterprise is a distributed cluster designed for production-scale AI workloads at companies handling tens of billions of vectors and petabytes of training data. The architecture separates work across routers, execution nodes, and background workers. A load balancer distributes queries to the least-loaded execution node, so throughput scales roughly linearly as nodes are added.
Enterprise is available in two configurations. The Bring-Your-Own-Cloud template installs the control plane, routers, and nodes inside the customer's own cloud VPC, so data does not leave the customer's account. The managed SaaS option delegates day-to-day operations to LanceDB, including patching, scaling, and around-the-clock monitoring.
Enterprise is available on the AWS Marketplace. Customers include Midjourney, Character.ai, and Runway, each running tens of billions of vectors.
Because Lance datasets are standard files on object storage, LanceDB does not require a proprietary storage backend. A dataset can be moved between local disk, S3, GCS, and Azure Blob Storage by simply copying the files. This portability distinguishes LanceDB from vector databases that store index structures in a proprietary binary format tied to a specific server.
AWS published a reference architecture for searching 3.5 billion 960-dimensional protein embeddings using LanceDB tables stored in S3, processed via AWS Lambda for individual queries and AWS Batch for large-scale queries. Storage for the indexed dataset was 12.9 TB on S3. Individual queries cost fractions of a cent.
LanceDB supports several approximate nearest neighbor (ANN) index types. All are built on an IVF (Inverted File) partitioning layer that first divides the vector space into clusters, then applies a secondary algorithm within each partition.
IVF_FLAT stores raw, uncompressed vectors within each IVF partition. Search compares the query vector against partition centroids to find the closest partitions, then does exact comparisons against the raw vectors in those partitions. This provides the highest possible recall at the cost of higher memory usage and slower queries than quantized variants. It is appropriate when accuracy is the primary concern and the dataset fits comfortably in memory.
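The two-stage search just described can be sketched in plain Python. Fixed centroids stand in for the k-means training that a real index build performs; this is illustrative only, not Lance's implementation:

```python
import math

def l2(a, b):
    return math.dist(a, b)

# Partition centroids (in practice learned via k-means during index build).
centroids = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]

# Assign each vector to the partition of its nearest centroid.
vectors = [(0.5, 0.2), (9.5, 0.1), (0.1, 9.9), (10.2, 0.4), (0.2, 0.1)]
partitions = {i: [] for i in range(len(centroids))}
for v in vectors:
    best = min(range(len(centroids)), key=lambda i: l2(v, centroids[i]))
    partitions[best].append(v)

def ivf_flat_search(query, nprobe=1, k=1):
    """Probe the nprobe closest partitions, then compare raw vectors exactly."""
    probed = sorted(range(len(centroids)),
                    key=lambda i: l2(query, centroids[i]))[:nprobe]
    candidates = [v for i in probed for v in partitions[i]]
    return sorted(candidates, key=lambda v: l2(query, v))[:k]

print(ivf_flat_search((9.0, 0.0)))  # -> [(9.5, 0.1)]
```

Raising `nprobe` trades query speed for recall: probing more partitions widens the candidate set that receives the exact distance comparison.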
IVF_SQ uses scalar quantization to compress each vector component to an 8-bit integer, producing approximately four times compression compared to raw float32 vectors. Scalar quantization introduces a small accuracy loss but is faster to decode than product quantization and well-suited to datasets where storage is the primary constraint.
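The idea behind scalar quantization can be sketched as a uniform min-max quantizer in plain Python. Lance's actual encoder may differ in details such as per-dimension ranges:

```python
def sq_train(values):
    """Learn the value range; real indexes do this per vector dimension."""
    lo, hi = min(values), max(values)
    return lo, (hi - lo) / 255.0   # map the range onto 256 integer levels

def sq_encode(x, lo, step):
    return round((x - lo) / step)  # one uint8 instead of a float32: 4x smaller

def sq_decode(code, lo, step):
    return lo + code * step

lo, step = sq_train([-1.0, -0.2, 0.3, 1.0])
code = sq_encode(0.3, lo, step)    # an integer in [0, 255]
approx = sq_decode(code, lo, step)
print(abs(approx - 0.3) <= step)   # error bounded by one quantization step
```

Decoding is a single multiply-add per dimension, which is why scalar quantization is cheaper to decompress than product quantization.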
IVF_PQ combines IVF partitioning with product quantization (PQ). PQ divides each vector into sub-vectors and replaces each sub-vector with a code from a learned codebook, achieving compression ratios of roughly 16x to 64x compared to raw float32 vectors. IVF_PQ is a good general-purpose choice for vectors up to roughly 256 dimensions, where PQ retains better accuracy than more aggressive quantization methods.
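A toy product-quantization encoder in plain Python makes the scheme concrete. The codebooks here are hand-picked rather than learned by k-means, and the vectors are tiny; this shows the mechanism, not Lance's implementation:

```python
import math

# Split 4-d vectors into two 2-d sub-vectors; each sub-vector is replaced
# by the index of its nearest codebook entry (codebooks are normally learned).
codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],   # codebook for sub-vector 0
    [(0.0, 1.0), (1.0, 0.0)],   # codebook for sub-vector 1
]

def pq_encode(vec):
    codes = []
    for sub, book in zip((vec[:2], vec[2:]), codebooks):
        codes.append(min(range(len(book)),
                         key=lambda j: math.dist(sub, book[j])))
    return codes            # two small codes instead of 16 bytes of floats

def pq_decode(codes):
    parts = [codebooks[i][c] for i, c in enumerate(codes)]
    return parts[0] + parts[1]   # concatenate the reconstructed sub-vectors

codes = pq_encode((0.9, 1.1, 0.1, 0.9))
print(codes, pq_decode(codes))
```

With realistic sizes (e.g. a 1024-d float32 vector split into sub-vectors, each coded as one byte against a 256-entry codebook), the same mechanism yields the double-digit compression ratios cited above.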
IVF_RQ uses RaBitQ quantization, which reduces each vector dimension to approximately one bit. This produces extreme compression at some accuracy cost. IVF_RQ is recommended for filtered search workloads where vector search is combined with metadata predicates, because HNSW-backed indexes can show higher latency variance under filtering.
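The one-bit-per-dimension idea can be sketched as simple sign quantization. RaBitQ itself is more sophisticated (it involves randomized rotations and error correction terms); this shows only the compression intuition:

```python
def sign_bits(vec):
    """Quantize each dimension to a single bit: 1 if positive, else 0."""
    return [1 if x > 0 else 0 for x in vec]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

q  = sign_bits([ 0.8, -0.1,  0.5, -0.9])
v1 = sign_bits([ 0.7, -0.2,  0.4, -0.8])   # similar direction to the query
v2 = sign_bits([-0.6,  0.3, -0.5,  0.9])   # roughly opposite direction
print(hamming(q, v1), hamming(q, v2))      # 0 vs 4: bit distance tracks angle
```

A float32 dimension becomes one bit, a 32x reduction, which is why the accuracy loss must be compensated elsewhere (e.g. by reranking a candidate set against the original vectors).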
HNSW (Hierarchical Navigable Small World) is a graph-based index algorithm that provides efficient search through a multi-layer proximity graph. In LanceDB, HNSW is not a standalone index; it operates as a sub-index within IVF partitions. This hybrid approach combines IVF's scalability with HNSW's graph-based search within each partition.
Three HNSW-backed variants are available, pairing the HNSW graph sub-index with the storage encodings described above: raw, scalar-quantized, or product-quantized vectors within each IVF partition.
Key HNSW tuning parameters are m (number of neighbors per vector in the graph), ef_construction (candidates evaluated during graph building), and ef (exploration factor at query time, typically 1.5 to 10 times k).
For binary vector data, LanceDB supports IVF_FLAT with Hamming distance. Vectors must be packed as uint8 arrays with dimensions divisible by 8. Binary indexes are used in applications such as image fingerprinting and near-duplicate detection.
Beyond vector indexes, LanceDB includes a native full-text search (FTS) engine using BM25, the term frequency-inverse document frequency ranking function that is also the default in Elasticsearch and OpenSearch. FTS indexes are column-level and must be created explicitly before keyword search is available.
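A compact BM25 scorer in plain Python makes the term weighting concrete. This is the standard formula with the usual k1 and b defaults, not LanceDB's implementation, and the corpus is invented:

```python
import math

docs = [
    "the quick brown fox".split(),
    "the lazy dog".split(),
    "quick quick fox jumps".split(),
]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N  # average document length

def idf(term):
    """Inverse document frequency: rare terms score higher."""
    n = sum(term in d for d in docs)
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def bm25(query, doc, k1=1.2, b=0.75):
    score = 0.0
    for term in query:
        tf = doc.count(term)
        if tf == 0:
            continue
        norm = 1 - b + b * len(doc) / avgdl   # length normalization
        score += idf(term) * tf * (k1 + 1) / (tf + k1 * norm)
    return score

scores = [bm25(["quick", "fox"], d) for d in docs]
best = max(range(N), key=lambda i: scores[i])  # doc 2 repeats "quick"
```

Note the saturation in the tf term: repeating a word helps, but with diminishing returns, unlike raw term-frequency counting.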
Hybrid search combines vector similarity and BM25 keyword results, then re-ranks them. LanceDB ships three built-in rerankers: LinearCombinationReranker (the default, which blends vector and FTS scores with configurable weights, defaulting to 0.7 for vector and 0.3 for FTS), CohereReranker (using Cohere's Rerank API), and ColBERTReranker (running a ColBERT model locally via Hugging Face). Additional rerankers include a cross-encoder reranker and an experimental OpenAI reranker.
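The default linear-combination blending can be sketched as follows. Scores are assumed to be normalized to [0, 1] already, and the document IDs are invented; the 0.7/0.3 weights match the default described above:

```python
def linear_combination(vector_hits, fts_hits, w_vector=0.7, w_fts=0.3):
    """Blend normalized vector and BM25 scores; a missing score counts as 0."""
    ids = set(vector_hits) | set(fts_hits)
    blended = {
        doc: w_vector * vector_hits.get(doc, 0.0)
             + w_fts * fts_hits.get(doc, 0.0)
        for doc in ids
    }
    return sorted(blended, key=blended.get, reverse=True)

# Doc "a" wins on semantic similarity, "b" on exact keywords;
# the 0.7/0.3 weighting lets the vector side dominate.
ranking = linear_combination(
    vector_hits={"a": 0.95, "b": 0.40},
    fts_hits={"b": 0.90, "c": 0.80},
)
print(ranking)  # -> ['a', 'b', 'c']
```

Shifting the weights toward FTS favors exact-match-heavy queries (identifiers, version strings), which is the knob the configurable weights expose.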
Queries can combine vector search, full-text search, SQL predicates, and geographic filters in a single call, without joining results from separate systems.
LanceDB's positioning as a "multimodal lakehouse" refers to its ability to store raw files (images, video frames, audio, point clouds), structured metadata, and embedding vectors in the same Lance table. A typical Parquet-based data lake would store tabular metadata in Parquet files and keep images as separate files in a bucket, requiring a separate vector index to link embeddings back to the source data.
In Lance, each row can contain scalar columns, vector columns, and blob columns in a single record. The Lance v2.2 format introduced Lance Blob v2, a storage layout optimized for large binary objects that achieves up to 68 times faster reads than the previous approach for blobs exceeding 4 KB. This makes it practical to store full image thumbnails or audio waveforms directly in the Lance table rather than as references to external files.
The multimodal design allows a single query to filter by metadata, search by vector similarity, and retrieve the associated raw file bytes in one operation. Teams building autonomous vehicle datasets, generative AI training pipelines, and content moderation systems have used this capability to avoid building and maintaining pipelines that synchronize multiple storage systems.
The Lance format integrates with Hugging Face Datasets, allowing datasets published on the Hugging Face Hub to be downloaded and manipulated with Lance tooling. Integration with Apache Spark, DuckDB, Ray, and Daft allows Lance tables to participate in large-scale distributed data processing without converting formats.
| Feature | LanceDB | Pinecone | Weaviate | Qdrant | Milvus |
|---|---|---|---|---|---|
| Open source | Yes (Apache 2.0) | No | Yes (BSD-3) | Yes (Apache 2.0) | Yes (Apache 2.0) |
| Deployment | Embedded / cloud / enterprise | Managed cloud only | Self-hosted / cloud | Self-hosted / cloud | Self-hosted / cloud |
| Storage backend | Files on disk or object storage | Proprietary | Proprietary | Proprietary | Proprietary |
| Multimodal blobs in same table | Yes | No | Limited | No | No |
| Built-in data versioning | Yes (Lance format) | No | No | No | No |
| Full-text search | Yes (BM25) | No (via metadata only) | Yes | Yes | Yes |
| Hybrid search | Yes | Limited | Yes | Yes | Yes |
| Serverless / embedded | Yes | No | No | No | No |
| Core language | Rust | Undisclosed (closed source) | Go | Rust | C++ / Go |
| Documented scale | 100B+ rows per table | N/A (managed) | ~100M vectors practical per node | Billions | Billions |
| License | Apache 2.0 | Proprietary | BSD-3 | Apache 2.0 | Apache 2.0 |
LanceDB differs from Pinecone primarily in deployment model: Pinecone is a fully managed cloud service with no open-source offering, while LanceDB runs embedded inside the application process and stores data in standard files that the user controls. Teams that cannot send data to a third-party service, or that want to avoid per-query cloud costs at development time, tend to choose LanceDB for that reason.
Compared to Weaviate, LanceDB does not include built-in knowledge graph or object relationship features. Weaviate's GraphQL interface and cross-reference linking between objects have no direct equivalent in LanceDB. Weaviate also has a longer production history and larger community. However, Weaviate's resource usage grows quickly above 100 million vectors, while LanceDB's file-based design scales more linearly with data size.
Compared to Qdrant, LanceDB has a different performance profile for filtered search. Qdrant's HNSW implementation is mature and shows strong throughput in standard ANN benchmarks. LanceDB's IVF_RQ and IVF_PQ indexes are generally preferred over HNSW variants for heavily filtered workloads. For unfiltered ANN search at moderate scale, Qdrant benchmarks are frequently faster, but LanceDB provides capabilities (multimodal storage, data versioning, full lakehouse integration) that Qdrant does not.
Compared to Milvus, LanceDB is lighter to operate. Milvus requires etcd, MinIO or S3, a message queue (Kafka or Pulsar), and multiple microservices. LanceDB embedded requires no infrastructure. Milvus targets organizations running hundreds of millions to billions of vectors in distributed clusters, and it has a longer track record at that scale. LanceDB Enterprise is designed for the same scale but was introduced later.
LangChain includes a LanceDB vector store integration, available as langchain_community.vectorstores.LanceDB. The integration allows any LangChain retrieval chain to use LanceDB for document storage and similarity search. A LanceDB table is created on the first insert; subsequent inserts append to the existing table.
LlamaIndex ships a LanceDBVectorStore integration installed via llama-index-vector-stores-lancedb. The store creates or opens a LanceDB dataset and supports the standard LlamaIndex query interface for RAG pipelines.
Letta, the agent memory framework previously known as MemGPT, uses LanceDB as the default archival storage backend. When an agent's context window fills up, the framework pages content to archival storage and retrieves it via vector search. LanceDB was chosen as the default because it requires no setup: the archival store is a set of files on disk with no server process. This setup-free experience and the ability to scale from gigabytes to terabytes without infrastructure changes made it the practical default for local deployments.
AnythingLLM, an open-source all-in-one local LLM application, uses LanceDB as its default vector database. Because LanceDB runs embedded and ships a Node.js SDK, it could serve as a zero-configuration default that works without requiring users to set up an external vector store. All document embeddings stay local to the AnythingLLM installation.
LanceDB and DuckDB share the Apache Arrow memory layout, so Lance tables can be queried directly from DuckDB using the lance extension. Analytical queries over scalar columns run in DuckDB while vector search runs in LanceDB; the results are joined using Arrow record batches without serialization overhead.
Datasets can also be published to and downloaded from the Hugging Face Hub in Lance format, which is used for sharing large multimodal datasets such as image-caption pairs and video clips.
Additional integrations include Apache Spark (via lance-spark), Ray (via ray-lance), Daft, Polars, Pandas, PyArrow, and PyTorch DataLoaders.
RAG applications store document chunks and their embeddings in LanceDB. At query time, the application encodes the user's question, performs a vector similarity search to retrieve the most relevant chunks, and passes them as context to a language model. LanceDB's hybrid search capability, combining BM25 and vector search in one query, improves retrieval quality for terms where keyword matching outperforms semantic similarity, such as proper nouns, version numbers, and technical identifiers.
Generative AI companies such as Runway and Midjourney use LanceDB's multimodal lakehouse capabilities to manage training datasets. The Lance format's random access speed reduces data loading bottlenecks during training. The built-in versioning allows teams to snapshot a dataset before a training run, roll back if quality degrades, and branch for ablation experiments without duplicating terabytes of data. WeRide used LanceDB to restructure its autonomous driving data pipeline, reducing data mining time from one week to one hour, which the company describes as a roughly 90x improvement in developer productivity.
LanceDB is used in content recommendation pipelines where items (articles, videos, songs, products) are represented as embeddings and recommendations are served by finding nearest neighbors in embedding space. TwelveLabs uses LanceDB to store video embeddings that encode narrative and mood alongside metadata, enabling vector search over video content.
AI agent frameworks that need persistent memory across sessions store past observations, conversation summaries, and facts in LanceDB. The MemGPT/Letta use case is the canonical example: an agent's long-term memory is stored in LanceDB, and the agent retrieves relevant memories by semantic similarity before generating a response.
Image and video search applications use LanceDB to store CLIP embeddings alongside the source images or video frames. A query can be a text string or an image; the application encodes the query, searches LanceDB for nearest neighbors, and returns the original media. Fashion search, medical image retrieval, and satellite imagery analysis have been demonstrated with LanceDB.
AWS documented using LanceDB to index 3.5 billion protein sequence embeddings for biological research. The architecture stored indexed data in S3 and served queries via Lambda functions, with larger batch queries running on i4i.8xlarge instances. Individual queries for 50,000 nearest neighbors were returned in seconds at a cost of fractions of a cent per query.
LanceDB has three tiers.
The OSS tier is free. Users download the library and run it against local storage or their own object storage bucket. There are no usage limits and no fees. The OSS library is released under the Apache 2.0 license.
LanceDB Cloud is a serverless managed service with usage-based pricing and no monthly minimum. The Cloud tier entered public beta in 2025. For moderate workloads at roughly 100,000 queries per day, estimated costs are in the $50 to $200 per month range, though exact pricing depends on index size and query complexity.
LanceDB Enterprise is priced by annual contract. It is available on the AWS Marketplace. Pricing requires contacting the sales team. Typical enterprise vector database contracts in this tier range from roughly $2,000 to $10,000 per month depending on data scale, cluster size, and support tier.
Publicly confirmed LanceDB customers include Midjourney (text-to-image model serving and training data management), Runway (generative video model training pipelines), Character.ai (conversational AI serving), ByteDance, WeRide (autonomous driving data), Netflix (media data lake), Airtable, and CodeRabbit. The company also cites Harvey, WorldLabs, and Uber among its user base.
At the time of the June 2025 Series A announcement, LanceDB reported that enterprise customers were searching over tens of billions of vectors and managing petabytes of training data. Revenue reached approximately $2.3 million in 2024, according to public reports, with a 15-person team at the time.
LanceDB in embedded mode has no built-in multi-tenant access control. All data in a LanceDB dataset is accessible to any process that can read the underlying files. Applications that need row-level or table-level access control must implement it at the application layer or use the enterprise tier, which adds authentication and authorization.
Concurrent writes to the same Lance dataset from multiple processes require care. Lance uses optimistic concurrency: each writer reads the current version, applies changes, and attempts to commit. If two writers commit simultaneously, one will fail and must retry. This is sufficient for most workloads but requires application-level handling for high-concurrency write scenarios.
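The read-commit-retry pattern just described can be sketched generically. The `ToyDataset`, `CommitConflict`, and `write_with_retry` names are hypothetical illustrations of optimistic concurrency, not the actual Lance API:

```python
class CommitConflict(Exception):
    """Raised when another writer committed against the same base version."""

class ToyDataset:
    def __init__(self):
        self.version = 1

    def commit(self, base_version):
        if base_version != self.version:   # someone else won the race
            raise CommitConflict
        self.version += 1                  # append the new version
        return self.version

def write_with_retry(ds, max_retries=5):
    """Optimistic concurrency: read version, try to commit, retry on conflict."""
    for _ in range(max_retries):
        base = ds.version                  # read the current version
        # ... stage changes against `base` here ...
        try:
            return ds.commit(base)
        except CommitConflict:
            continue                       # re-read and retry from scratch
    raise RuntimeError("gave up after repeated conflicts")

ds = ToyDataset()
print(write_with_retry(ds))  # commits version 2
```

Under low write contention retries are rare, which is why this scheme avoids a central lock; under heavy contention the retry loop itself becomes the bottleneck, matching the caveat above.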
Real-time index updates are not instantaneous. Creating or modifying a vector index is an explicit operation; newly inserted rows are searchable via a slow linear scan until the index is rebuilt or the index build completes. This contrasts with some competing databases that build indexes incrementally in the background.
Public performance benchmarks comparing LanceDB directly against Qdrant, Weaviate, Milvus, and Pinecone in controlled conditions were not widely available as of mid-2026. Independent ANN benchmarks such as ann-benchmarks.com did not include LanceDB in their standard suite, making objective recall and QPS comparisons at the same parameters difficult.
The enterprise and cloud tiers are newer than equivalent offerings from Qdrant and Milvus. Organizations that require long production track records at multi-billion vector scale may prefer those alternatives until LanceDB accumulates more publicly documented enterprise deployments.
The LanceDB documentation, while improving, still had gaps in architectural diagrams and detailed configuration references as of early 2025. The embedded library's API has changed between versions, creating migration friction for early adopters.