FAISS (Facebook AI Similarity Search) is an open-source library for efficient similarity search and clustering of dense vectors. Released in March 2017 by Meta AI Research (then Facebook AI Research, or FAIR), FAISS has become the de facto reference implementation for approximate nearest neighbor (ANN) search at large scale. The library was initiated by Hervé Jégou, who wrote the first prototype, with Matthijs Douze implementing most of the CPU code and Jeff Johnson building the GPU implementation that became the project's signature differentiator [1][2]. Written primarily in C++ with optional Python bindings, FAISS scales from millions of vectors on a laptop to billions on a multi-GPU server. It is licensed under the permissive MIT license and has accumulated roughly 30,000 stars on GitHub, making it one of the most widely adopted infrastructure components in modern machine learning.
FAISS sits at the intersection of classical information retrieval, machine learning, and high-performance computing. Its core function is conceptually simple: given a query vector and a corpus of indexed vectors, return the k nearest neighbors under some distance metric. The complications come from scale: corpora can contain hundreds of millions or billions of vectors, and queries must complete in milliseconds. FAISS addresses these complications through a toolkit of indexing methods that trade exactness for speed, including brute-force search, inverted file indexes, product quantization, and graph-based indexes such as HNSW, with combinations of these composable through a small DSL called the index factory. Internally, FAISS exploits SIMD instructions on modern CPUs (AVX2, AVX-512, ARM Neon), uses BLAS for batched matrix multiplication, and offers CUDA implementations for billion-scale workloads on a single GPU.
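A minimal sketch of that core primitive, using the Python bindings on synthetic data (the dimensionality and corpus size below are purely illustrative):

```python
# Index a corpus of dense vectors and retrieve the k nearest neighbors
# of each query. IndexFlatL2 performs exact brute-force L2 search.
import numpy as np
import faiss

d = 128                                           # vector dimensionality
corpus = np.random.rand(50_000, d).astype("float32")
queries = np.random.rand(8, d).astype("float32")

index = faiss.IndexFlatL2(d)                      # exact (brute-force) L2 index
index.add(corpus)                                 # index the corpus
distances, neighbors = index.search(queries, 10)  # top-10 neighbors per query
```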
The practical importance of FAISS expanded sharply with the rise of embedding-based applications. Tasks such as semantic search, recommender systems using two-tower models, image deduplication, face recognition, and retrieval-augmented generation all reduce to finding nearest neighbors in a high-dimensional vector space. FAISS provides the search primitive that powers many of these systems, either directly as a library inside a Python application or indirectly as the indexing backend for a larger system such as Milvus, OpenSearch, or Vespa. Meta has reported using FAISS to index 1.5 trillion 144-dimensional vectors for internal applications spanning search, ads, and content moderation [3].
FAISS grew out of more than a decade of research at INRIA and FAIR on large-scale similarity search. Hervé Jégou and his collaborators spent the late 2000s and early 2010s developing methods for indexing local image descriptors at web scale, particularly SIFT and Fisher vectors. The 2011 paper Product Quantization for Nearest Neighbor Search by Jégou, Douze, and Cordelia Schmid introduced product quantization (PQ), splitting each high-dimensional vector into smaller subvectors and quantizing each independently against a small codebook. PQ enables compressed asymmetric distance computation in which the query stays in floating point while the database is stored as compact codes, reducing memory by an order of magnitude with modest accuracy loss. The Inverted File index with Asymmetric Distance Computation (IVFADC) built on PQ and eventually formed the conceptual core of FAISS.
When FAIR began rebuilding their internal nearest-neighbor codebase around 2015, they had three goals: cover the full range of practical index types, achieve state-of-the-art performance on both CPU and GPU, and release the result to the research community. Matthijs Douze led the CPU port, absorbing many optimizations from earlier INRIA codebases. Jeff Johnson's GPU rewrite pushed the limits of what a single GPU could do; his k-selection algorithm operated at up to 55 percent of theoretical peak GPU performance and his end-to-end pipeline ran 8.5x faster than the previous best published GPU implementation, documented in Billion-scale similarity search with GPUs [1]. That paper, released as an arXiv preprint in February 2017 and later in IEEE Transactions on Big Data, remains the canonical FAISS reference with several thousand citations.
The library was open-sourced on GitHub on March 29, 2017, with an Engineering at Meta blog post explaining the motivation and headline numbers [2]. Adoption was almost immediate. By the end of 2017, FAISS appeared in numerous papers as the default similarity search backbone and was integrated into the search systems of several large internet companies. Its reach grew with the rise of dense neural network embeddings around 2018 to 2020, which made vector search a routine component of production ML pipelines. Major milestones since include HNSW support in version 1.6 (2019), expanded GPU coverage in the 1.7 series, the 1.8 release in March 2024 documented in The FAISS Library [4], and the 1.9 (October 2024) and 1.10 (January 2025) releases that brought NVIDIA cuVS integration with new GPU index types and lower search latencies [5][6].
FAISS is structured as a layered C++ library with a thin Python binding generated by SWIG. The CPU code depends on a BLAS implementation for the matrix multiplications that dominate exact search and many quantization steps. OpenBLAS is the default on Linux and Windows; macOS uses the Accelerate framework; Intel MKL is recommended on Intel hardware for best performance. SIMD intrinsics are used heavily throughout the codebase to accelerate distance computations, k-selection, and quantization. To stay portable, FAISS isolates SIMD-specific code in files suffixed with the instruction set name (such as distances_avx2.cpp or distances_avx512.cpp) and compiles separate variants of the Python bindings. At import time, a loader script detects the host CPU's capabilities and loads the most optimized variant available.
The GPU code path is built on CUDA and runs on NVIDIA hardware. It implements brute-force search, IVF, IVFPQ, and a few other index types, with most of the heavy lifting done by custom CUDA kernels rather than off-the-shelf libraries. The GPU implementation uses a custom k-selection algorithm that holds the running top-k entirely in GPU registers and shared memory, which avoids costly trips to global memory and is one of the main reasons FAISS GPU outperforms naive implementations by a wide margin. Multi-GPU support is built into the public API: users can wrap GPU indexes with IndexShards (splitting the database across devices) or IndexReplicas (the successor to IndexProxy, duplicating the index and fanning queries out). Since FAISS 1.10, an alternative GPU code path based on NVIDIA cuVS is available through the faiss-gpu-cuvs package, which delegates the underlying kernels to the CAGRA, IVF-PQ, and IVF-Flat implementations in cuVS [5].
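A minimal sketch of moving a CPU index onto one or all GPUs, assuming a faiss-gpu build with at least one CUDA device available (data sizes are illustrative):

```python
# Copy a CPU index to the GPU with index_cpu_to_gpu, or shard/replicate it
# across all visible devices with index_cpu_to_all_gpus.
import numpy as np
import faiss

d = 128
xb = np.random.rand(100_000, d).astype("float32")
xq = np.random.rand(10, d).astype("float32")

cpu_index = faiss.IndexFlatL2(d)
cpu_index.add(xb)

res = faiss.StandardGpuResources()                    # GPU memory/stream resources
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index) # device 0

multi_gpu_index = faiss.index_cpu_to_all_gpus(cpu_index)  # all visible GPUs

D, I = gpu_index.search(xq, 5)                        # same search API as on CPU
```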
At the API level, every FAISS index inherits from a common abstract base class exposing methods for adding vectors, searching, training, removing by ID, and serializing to or from disk. This uniform interface lets users swap between index types with almost no code change, which is crucial for the empirical tuning that FAISS expects. The library also provides building blocks beyond search itself: fast k-means clustering on CPU or GPU, PCA and random rotation preprocessing, product and scalar quantizers, and utility routines for reading fvecs/bvecs files.
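A sketch of that shared interface: training, adding, searching, and serialization look the same for every index type (synthetic data; the file name is illustrative):

```python
# The same calls work whether the index is a flat brute-force index or an
# HNSW graph; only the constructor changes.
import numpy as np
import faiss

d = 64
xb = np.random.rand(10_000, d).astype("float32")
xq = np.random.rand(5, d).astype("float32")

for index in (faiss.IndexFlatL2(d), faiss.IndexHNSWFlat(d, 32)):
    if not index.is_trained:          # Flat and HNSW need no training step
        index.train(xb)
    index.add(xb)                     # IDs are assigned sequentially
    D, I = index.search(xq, 10)       # distances and neighbor IDs, same call everywhere

faiss.write_index(index, "example.index")   # serialize the last index to disk
restored = faiss.read_index("example.index")
```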
FAISS supports a small but practical set of distance metrics via the MetricType enumeration. The two most heavily used are squared Euclidean (METRIC_L2) and inner product (METRIC_INNER_PRODUCT). Squared Euclidean is the default for many index types and is what most published benchmarks report. Inner product covers cosine similarity as a special case: if all vectors are L2-normalized to unit length before insertion and querying, then the inner product equals the cosine similarity. FAISS does not expose a separate METRIC_COSINE value, and the standard idiom is to call faiss.normalize_L2 on the data before adding it to the index and on each query before searching. Beyond L2 and inner product, FAISS supports L1, L-infinity, arbitrary p-norm, Canberra, Bray-Curtis, Jensen-Shannon, and Jaccard (defined for non-negative vectors); binary indexes use Hamming distance. These less common metrics are implemented for IndexFlat and some quantized indexes but not all index types. In practice, the overwhelming majority of deployments use L2 or inner product, since most modern embedding models produce vectors designed to be compared with one of these two metrics.
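The cosine idiom described above looks like this in practice (embedding dimensionality and data are illustrative):

```python
# Cosine similarity via inner product: L2-normalize both database and query
# vectors in place, then search an inner-product index.
import numpy as np
import faiss

d = 384
xb = np.random.rand(10_000, d).astype("float32")
xq = np.random.rand(3, d).astype("float32")

faiss.normalize_L2(xb)             # in-place normalization to unit length
faiss.normalize_L2(xq)

index = faiss.IndexFlatIP(d)       # METRIC_INNER_PRODUCT
index.add(xb)
scores, ids = index.search(xq, 5)  # scores are now cosine similarities
```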
FAISS organizes its indexing methods around a small set of building blocks that can be combined to fit a wide range of accuracy, latency, and memory budgets. The major families are exact (flat) indexes, partition-based inverted files, quantization-based compressed indexes, and graph-based navigable indexes.
IndexFlatL2 and IndexFlatIP perform brute-force search by computing the distance from the query to every vector in the database. They store the raw vectors uncompressed, do not require training, and produce exact results that serve as the ground truth for evaluating approximate methods. Brute force is surprisingly competitive when the dataset fits in memory and the query rate is moderate, especially on a GPU, because GEMM-based distance computation can deliver hundreds of thousands of queries per second per device. The main drawbacks are linear-in-N memory and linear-in-N query time, which make Flat impractical beyond a few million vectors per machine for low-latency applications.
The inverted file index partitions the database into nlist clusters via k-means and stores each vector in the inverted list of its assigned centroid. At query time, only the nprobe most promising lists are scanned, reducing the candidate set to roughly nprobe / nlist of the database. The coarse quantizer is itself an arbitrary FAISS index; in practice it is usually a flat index for small nlist or an HNSW index for very large nlist. A common rule of thumb is to set nlist proportional to the square root of the number of vectors. nprobe is set at query time and offers a direct tradeoff between speed and accuracy. IVF indexes are trained by running k-means on a representative sample to produce the cluster centroids. Training is one-time and can be expensive for very large nlist, but it amortizes over all subsequent queries.
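A sketch of this tradeoff on synthetic data, using a flat index as exact ground truth to measure recall (nlist, nprobe, and data sizes are illustrative):

```python
# Train an IVFFlat index, then vary nprobe to trade speed for recall.
import numpy as np
import faiss

d, nb, nq = 64, 200_000, 100
xb = np.random.rand(nb, d).astype("float32")
xq = np.random.rand(nq, d).astype("float32")

nlist = 1024                                 # number of inverted lists
quantizer = faiss.IndexFlatL2(d)             # coarse quantizer
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                              # k-means to learn the centroids
index.add(xb)

flat = faiss.IndexFlatL2(d)                  # exact ground truth
flat.add(xb)
_, gt = flat.search(xq, 1)

for nprobe in (1, 8, 32):
    index.nprobe = nprobe                    # lists scanned per query
    _, I = index.search(xq, 1)
    recall = (I[:, 0] == gt[:, 0]).mean()
    print(f"nprobe={nprobe:3d}  recall@1={recall:.3f}")
```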
Product quantization splits each d-dimensional vector into m subvectors of dimension d/m, then trains a separate codebook of 2^nbits centroids for each subvector position. Each vector is replaced by the m centroid indices that best approximate its subvectors, occupying m bytes when nbits is 8. The compressed representation enables asymmetric distance computation: the query is kept in floating point and distances to all 2^nbits centroids per subvector are precomputed once per query, then the database distance is approximated by a sum of m table lookups. The result is roughly 16x to 64x compression with limited accuracy loss, which is what makes billion-scale search on modest hardware possible. The most widely used composite is IVFPQ, written IVF<nlist>,PQ<m> in the index factory. IVFPQ first partitions vectors into inverted lists with k-means, then encodes each vector's residual (the difference between the vector and its centroid) with PQ. Storing residuals rather than raw vectors yields better quantization quality because residuals have lower variance.
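A sketch of IVFPQ built directly from the classes described above: nlist inverted lists and m sub-quantizers of nbits bits, so each vector is stored as a short code computed on its residual (all sizes below are illustrative):

```python
# IndexIVFPQ: coarse k-means partitioning plus PQ encoding of residuals.
import numpy as np
import faiss

d, nb = 128, 200_000
xb = np.random.rand(nb, d).astype("float32")

nlist, m, nbits = 1024, 16, 8                # 16-byte PQ code per vector
quantizer = faiss.IndexFlatL2(d)             # coarse quantizer
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(xb[:100_000])                    # learns centroids and PQ codebooks
index.add(xb)
index.nprobe = 32                            # lists scanned per query
D, I = index.search(xb[:5], 10)              # approximate top-10 neighbors
```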
OPQ is a learned linear transformation applied before PQ that rotates vectors to flatten variance across subvectors. When variance is unevenly distributed across dimensions (which is common for embeddings produced by neural networks), some PQ subquantizers carry most of the information and others carry almost none, reducing the effective compression ratio. OPQ jointly optimizes the rotation and the PQ codebooks to balance variance, often improving recall by several percentage points at the same memory budget. OPQ is invoked in the index factory by prefixing a quantizer with OPQ<m>_<d>, where d is the output dimensionality after rotation.
HNSW (Hierarchical Navigable Small World) is a graph-based index in which each vector is a node connected to a small number of nearby neighbors. The graph is organized in layers, with the top containing only a few hub nodes and each successive layer adding more. Search starts at the top and greedily descends, refining the candidate set as it goes. HNSW achieves near-state-of-the-art recall and very low query latency for medium-scale corpora that fit in memory. Its main drawbacks are high memory overhead (the graph links add substantial per-vector storage on top of the uncompressed vectors) and lack of compression, which make it less attractive than IVFPQ for billion-scale deployments. FAISS added HNSW support in version 1.6 and exposes it through IndexHNSWFlat and several composite forms.
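A sketch of IndexHNSWFlat usage: M controls graph degree, efConstruction the build-time beam width, and efSearch the query-time beam width (the values below are illustrative starting points to tune empirically):

```python
# Build an HNSW graph over raw vectors; no training phase is required.
import numpy as np
import faiss

d = 96
xb = np.random.rand(100_000, d).astype("float32")
xq = np.random.rand(10, d).astype("float32")

index = faiss.IndexHNSWFlat(d, 32)           # M = 32 links per node
index.hnsw.efConstruction = 200              # higher = better graph, slower build
index.add(xb)

index.hnsw.efSearch = 64                     # higher = better recall, slower queries
D, I = index.search(xq, 10)
```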
FAISS provides a small string-based DSL called the index factory for constructing composite indexes. A factory string is a comma-separated sequence of stages: optional preprocessing (OPQ<m>_<d>), an optional coarse quantizer (IVF<nlist> or IVF<nlist>_HNSW<M>), and a final encoding stage (Flat, PQ<m>, SQ8, Refine). For example, OPQ16_64,IVF65536_HNSW32,PQ16 builds an index that first applies OPQ to rotate vectors and reduce them to 64 dimensions, then partitions them into 65,536 inverted lists with an HNSW coarse quantizer, then encodes each vector as a 16-byte PQ code. The factory makes it easy to experiment with different index recipes without writing C++ glue code.
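A scaled-down sketch of the recipe above, using IVF4096 instead of IVF65536 so it can be trained on a small synthetic sample; the factory syntax itself is exactly as described:

```python
# index_factory builds the OPQ + IVF(HNSW quantizer) + PQ pipeline from a string.
import numpy as np
import faiss

d = 256
xt = np.random.rand(200_000, d).astype("float32")    # training sample

index = faiss.index_factory(d, "OPQ16_64,IVF4096_HNSW32,PQ16")
index.train(xt)            # trains the OPQ rotation, coarse centroids, and PQ codebooks
index.add(xt)              # in practice, add the full corpus here

# Runtime parameters of composite indexes can be set by name.
faiss.ParameterSpace().set_index_parameter(index, "nprobe", 64)
D, I = index.search(xt[:5], 10)
```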
The table below summarizes the major FAISS index types and typical use cases.
| Index type | Factory string | Memory per vector | Training required | Typical use case |
|---|---|---|---|---|
| IndexFlat | Flat | 4d bytes (full precision) | No | Ground truth, small corpora, GPU brute force |
| IndexIVFFlat | IVF<nlist>,Flat | 4d bytes plus list overhead | Yes (k-means) | Up to ~10M vectors with high recall |
| IndexIVFPQ | IVF<nlist>,PQ<m> | m bytes plus list overhead | Yes (k-means + PQ) | Billion-scale with moderate recall |
| IndexIVFScalarQuantizer | IVF<nlist>,SQ8 | d bytes plus list overhead | Yes (k-means) | Balance of speed, recall, memory |
| IndexHNSWFlat | HNSW<M> | 4d + 8M bytes | No | Small to medium, low latency, high recall |
| IndexIVFPQ + HNSW quantizer | IVF<nlist>_HNSW<M>,PQ<m> | m bytes plus list overhead | Yes | Very large nlist (billion-scale partitioning) |
| IndexLSH | LSH | bits/8 bytes | No | Binary embeddings, very low memory |
| IndexBinaryFlat | BFlat | d/8 bytes | No | Hash codes, binary descriptors |
| IndexPQ + Refine | PQ<m>,Refine(Flat) | m + 4d bytes | Yes (PQ) | High-recall reranking on top of compressed search |
The choice of recipe depends on dataset size, latency, and memory budget. For up to roughly one million vectors, HNSW or brute-force flat is typically simplest and most accurate. For tens to hundreds of millions, IVFFlat or IVFScalarQuantizer offers a good balance. For billion-scale corpora, IVFPQ with an HNSW coarse quantizer is the canonical recipe, often combined with OPQ preprocessing for embeddings with uneven variance.
The GPU implementation most clearly distinguishes FAISS from competing libraries and was the centerpiece of the original 2017 paper. FAISS GPU runs roughly three to ten times faster than the equivalent CPU implementation on a single NVIDIA card and can index billions of vectors using only a few high-memory GPUs. Multi-GPU configurations scale almost linearly for embarrassingly parallel queries, since each GPU can hold a shard and answer queries against its own shard before a final merge combines results.
The original GPU implementation supported brute-force search (GpuIndexFlat) and inverted files (GpuIndexIVFFlat, GpuIndexIVFPQ, GpuIndexIVFScalarQuantizer). The cuVS integration in version 1.10 adds the CAGRA graph index (GpuIndexCagra), built specifically for GPU execution and capable of outperforming IVF methods at high recall. The cuVS path also enables PQ codes with bit widths in [4, 8], whereas the classic GPU path supports only 8-bit PQ [5][6]. The table below summarizes typical CPU versus GPU tradeoffs from official documentation and third-party benchmarks. Numbers are approximate and depend on hardware, dataset, and recall target.
| Workload | CPU (16 cores) | GPU (V100/A100) | GPU + cuVS (H100) |
|---|---|---|---|
| Brute-force search, 1M vectors, d=128 | ~2k QPS | ~50k QPS | ~80k QPS |
| IVFFlat search, 100M vectors, nprobe=32 | ~500 QPS | ~5k QPS | ~12k QPS |
| IVFPQ search, 1B vectors, nprobe=64 | ~200 QPS | ~3k QPS | ~10k QPS |
| IVFPQ index build, 100M vectors | hours | minutes | minutes (12x faster than classic GPU at 95% recall) |
| Search latency at 95% recall | 10-50 ms | 2-10 ms | 1-3 ms (8x lower than classic GPU) |
The headline numbers from the 2017 paper are still cited often: building a high-accuracy k-NN graph on 95 million images took about 35 minutes on the GPU implementation, and connecting one billion vectors took less than 12 hours on four GPUs. With cuVS on modern accelerators, similar workloads now complete roughly an order of magnitude faster.
FAISS exposes several primitives that are used inside the indexes but also work standalone. The Clustering and Kmeans classes implement a fast Lloyd's k-means algorithm that scales to tens of millions of points and runs on CPU or GPU. PCA preprocessing is exposed through PCAMatrix, which can reduce vector dimensionality before indexing to save memory and accelerate search. Random rotation matrices, scalar quantizers, and product quantizers are similarly available as standalone classes. The library includes utilities for reading and writing the canonical fvecs and bvecs formats used by many academic benchmarks, and for serializing indexes to portable binary files. For very large indexes that do not fit in RAM, FAISS supports memory-mapped I/O via the OnDiskInvertedLists class.
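A sketch of two of these standalone building blocks, k-means clustering and PCA, on synthetic data (sizes and cluster counts are illustrative):

```python
# Kmeans for clustering and PCAMatrix for dimensionality reduction,
# independent of any search index.
import numpy as np
import faiss

d, n = 128, 100_000
x = np.random.rand(n, d).astype("float32")

# k-means clustering on CPU (a GPU build accepts gpu=True).
km = faiss.Kmeans(d, 256, niter=20, verbose=False)
km.train(x)
centroids = km.centroids                 # (256, d) array of cluster centers
_, assignments = km.index.search(x, 1)   # nearest centroid per point

# PCA from 128 down to 32 dimensions, applied before indexing.
pca = faiss.PCAMatrix(d, 32)
pca.train(x)
x_reduced = pca.apply_py(x)              # (n, 32) transformed vectors
```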
Inside Meta, FAISS powers similarity search for recommendation, ads ranking, content moderation, image deduplication, and trust-and-safety classification. Externally, the library is the backbone of many vector search workloads in industry and academia. Five families of applications dominate.
The first is retrieval-augmented generation, in which a large language model is augmented with documents fetched from an external corpus at inference. The retrieval step embeds the user query with a sentence transformer and uses FAISS to find the top-k nearest passages. RAG frameworks such as LangChain and LlamaIndex offer first-class FAISS integration, and FAISS is often the first vector store recommended in tutorials because it requires no separate service to deploy.
The second is semantic search, matching user queries against documents by meaning rather than lexical overlap. The combination of FAISS for the ANN step and BM25 for the lexical step is a common production pattern. The third is recommender systems, particularly the candidate-generation stage of two-tower models. The user tower produces a query embedding and the item tower produces item embeddings; at serving time the system embeds the user once per request and uses FAISS to find the top-k items closest to the user embedding. This pattern is used by streaming services, e-commerce platforms, social networks, and ad networks.
The fourth is image and video similarity, including deduplication, near-duplicate detection, and visual search, often combined with deep image embeddings from CLIP, DINOv2, or similar models. Face recognition systems also use FAISS to retrieve the closest enrolled face from a database of reference embeddings. The fifth is data mining and clustering at scale, where the k-means and PQ building blocks compress and cluster large vector datasets for downstream analysis.
FAISS is one of several open-source libraries for approximate nearest neighbor search, competing with or complementing libraries from other large technology companies and independent contributors. The closest peers are ScaNN (Google), HNSWlib (Yury Malkov and collaborators), Annoy (Spotify), NMSlib, Vespa, and several full-fledged vector database systems that wrap one or more ANN algorithms with persistence, replication, and a network protocol.
| Library | Origin | Year | Primary algorithms | GPU support | Strengths | Weaknesses |
|---|---|---|---|---|---|---|
| FAISS | Meta FAIR | 2017 | IVF, PQ, OPQ, HNSW, brute force | Yes (CUDA, cuVS) | Largest algorithm coverage, fastest GPU, billion-scale | No native metadata filter, no built-in persistence/server |
| ScaNN | Google | 2020 | Anisotropic vector quantization, partitioning | No | Very high inner-product recall, optimized for embeddings | Less mature, smaller community |
| HNSWlib | Independent | 2017 | HNSW only | No | Simplest to use, in-memory inserts/deletes | HNSW only, high memory cost |
| Annoy | Spotify | 2015 | Random projection forest | No | Very small, easy to deploy, mmap-friendly | Static index, lower recall than HNSW |
| NMSlib | Independent | 2014 | HNSW, SW-graph, VP-tree, others | No | Wide algorithm choice, research roots | Less actively maintained |
| Vespa | Yahoo/Vespa.ai | 2017 | HNSW with full filtering | No | Distributed engine, hybrid lexical + vector | Complex to operate, JVM-based |
| Milvus | Zilliz | 2019 | FAISS, HNSW, DiskANN, others | Yes (delegated) | Distributed vector database, FAISS under the hood | Heavier than a library |
| Pinecone | Pinecone | 2021 | Proprietary | Managed | Fully managed, no operations | Closed source, vendor lock-in |
| Qdrant | Qdrant | 2021 | HNSW with payload filtering | No | Rust core, strong filtering, low p99 latency | Smaller index variety |
| Weaviate | SeMI | 2019 | HNSW | No | Built-in modules for embeddings, filtering, hybrid | Heavier than a library |
| ChromaDB | Chroma | 2022 | HNSW (via hnswlib) | No | Easy local prototyping, popular for RAG | Single-node, modest scale |
FAISS offers the broadest range of indexing algorithms and the most mature GPU implementation. HNSWlib has slightly faster CPU search than FAISS HNSW for some workloads because it is focused on a single algorithm. ScaNN holds the recall-versus-speed crown on several public benchmarks for inner-product search on text embeddings, owing to its anisotropic loss function. Annoy remains popular for its small footprint and memory-mapped index format. The full vector database systems (Milvus, Pinecone, Qdrant, Weaviate, Chroma, Vespa) provide capabilities FAISS does not, including metadata filtering, hybrid lexical-vector queries, replication, persistence, and a network API; several of them use FAISS internally, with Milvus the most prominent. The ANN-Benchmarks project tracks the recall-versus-throughput frontier across dozens of libraries; FAISS, ScaNN, HNSWlib, and a few specialized graph methods consistently sit at the Pareto frontier.
FAISS is a library, not a database, and several limitations follow from that design choice. There is no native metadata or payload attached to a vector. The library accepts a 64-bit integer ID and supports a basic ID-based filter callback (IDSelector), but anything more complex (filtering by timestamp range or tag set) requires a separate metadata table and a custom selector, incurring indirection cost. This makes FAISS less convenient than vector databases that store payloads natively and prune the search space using inverted indexes over the metadata.
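A sketch of the basic IDSelector mechanism, assuming FAISS 1.7.3 or later where search accepts a params argument; the ID range below stands in for whatever allowed-ID set an external metadata table would produce:

```python
# Restrict a search to a precomputed set of eligible IDs via IDSelectorRange.
# Richer predicates require building the ID set externally or a custom selector.
import numpy as np
import faiss

d = 64
xb = np.random.rand(10_000, d).astype("float32")
xq = np.random.rand(2, d).astype("float32")

index = faiss.IndexFlatL2(d)
index.add(xb)

sel = faiss.IDSelectorRange(1000, 5000)      # only IDs in [1000, 5000) are eligible
params = faiss.SearchParameters(sel=sel)
D, I = index.search(xq, 10, params=params)   # neighbors drawn only from the allowed IDs
```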
FAISS has limited built-in support for incremental updates. Most index types accept additions, but deletions require either marking IDs as deleted (still consuming memory) or rebuilding. For high update churn, HNSWlib (which natively supports inserts and deletes) or vector databases with their own update model are easier to operate. Disk-based scaling is another limited area: the OnDiskInvertedLists interface lets IVF indexes keep lists on disk, but the rest of the index must still fit in RAM. For corpora significantly exceeding memory, alternatives such as Microsoft's DiskANN are designed around hybrid RAM and flash storage. Finally, FAISS lacks a built-in network server, authentication, multi-tenancy, and replication. Wrapping FAISS in a service is straightforward, but operators wanting a managed product typically choose a vector database built on top of FAISS or a similar library.
FAISS releases follow a regular cadence with several minor versions per year. Version 1.7 (from mid-2021) added improved IVF support, binary index types, and CPU performance work. Version 1.8.0 (March 2024) consolidated these improvements alongside the comprehensive paper The FAISS Library [4]. Version 1.9 (October 2024) introduced more building blocks and CPU optimizations. Version 1.10 (January 2025) is the most consequential recent release: it integrated NVIDIA's cuVS as an alternative GPU backend, enabling 8x lower search latency at 95 percent recall and 12x faster index build times relative to the classic FAISS GPU implementation [5]. As of 2026, the team continues releasing minor versions roughly twice a year, with ongoing work on additional cuVS integration, filtered search, and CPU SIMD improvements.