DBSCAN
Last reviewed
May 2, 2026
Sources
14 citations
Review status
Source-backed
Revision
v2 · 3,530 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 2, 2026
Sources
14 citations
Review status
Source-backed
Revision
v2 · 3,530 words
Add missing citations, update stale details, or suggest a clearer explanation.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based cluster analysis algorithm that groups together points packed closely in feature space and labels points in low-density regions as noise. It was published by Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu at the second ACM SIGKDD conference in 1996. The original paper, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," received the SIGKDD Test of Time Award in 2014, and DBSCAN remains one of the most cited algorithms in data mining.
Unlike partitioning methods such as k-means, DBSCAN does not require the user to specify the number of clusters beforehand. It can also recover clusters of arbitrary shape and explicitly separates outliers, which makes it useful when the data contains noise or when clusters are not roughly spherical. The trade-off is that DBSCAN's results depend on two parameters, the neighborhood radius eps (often written ε) and the minimum density MinPts, and it tends to struggle when clusters in the same dataset have very different densities.
| Attribute | Value |
|---|---|
| Algorithm class | Density-based clustering, unsupervised learning |
| Year published | 1996 (KDD-96, Portland, Oregon, August 2-4) |
| Authors | Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu |
| Affiliation | University of Munich (Ludwig-Maximilians-Universität München) |
| Original paper | "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," KDD-96 Proceedings, pp. 226-231, AAAI Press |
| Major award | ACM SIGKDD Test of Time Award (2014) |
| Parameters | eps (ε), MinPts |
| Time complexity | $O(n \log n)$ average with spatial index in low dimensions; $O(n^2)$ worst case |
| Space complexity | $O(n)$ without precomputed distance matrix; $O(n^2)$ with one |
| Output | Flat partition with explicit noise label (commonly -1) |
| Reference implementation | sklearn.cluster.DBSCAN (scikit-learn), ELKI, R dbscan package |
By the mid-1990s, the dominant clustering algorithms had recognized weaknesses. k-means needed the user to fix the number of clusters in advance, was sensitive to initialization, and produced spherical Voronoi cells that could not capture curved or elongated structures. Hierarchical clustering handled arbitrary shapes better but had at least quadratic runtime and produced a dendrogram rather than a flat partition. CLARANS, a randomized k-medoids method by Ng and Han (1994), scaled poorly on large spatial datasets where the user did not know k and the data could contain noise.
The authors of DBSCAN came from the spatial database research group at the University of Munich. They wanted clustering routines that ran inside spatial database systems on tens of thousands of points, with cluster shapes determined by geography rather than by Gaussian assumptions. Density-based clustering can be viewed as a form of non-parametric density estimation: a cluster is a connected level set of the underlying density above some threshold, and the algorithm recovers those level sets from a finite sample.
The authors proposed a definition of clusters based on local density, designed so that the algorithm could exploit spatial indexes such as the R-tree for fast neighborhood queries. DBSCAN was presented at KDD-96 in Portland, Oregon, on August 2-4, 1996. In 2014 the SIGKDD program committee gave the paper its Test of Time Award. By that point DBSCAN had been incorporated into virtually every general-purpose machine learning library and had inspired a long line of variants.
In 2017, Erich Schubert, Sander, Ester, Kriegel, and Xu published a follow-up paper titled "DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN" in ACM Transactions on Database Systems. That paper clarified what the algorithm actually computes, corrected common implementation mistakes, and addressed a misconception about its worst-case complexity that had spread through the literature.
DBSCAN's definitions are built around the ε-neighborhood of a point. For a dataset $D$ in some metric space and a parameter $\varepsilon > 0$, the ε-neighborhood of a point $p$ is the set of all points within distance ε:
$$N_\varepsilon(p) = { q \in D : \text{dist}(p, q) \leq \varepsilon }$$
The distance function is usually Euclidean, but DBSCAN works with any metric, including Manhattan, cosine, Haversine for geographic data, or precomputed distance matrices.
From this neighborhood, DBSCAN classifies each point into one of three categories.
| Type | Definition |
|---|---|
| Core point | Has at least MinPts points (including itself) inside its ε-neighborhood. Formally, $ |
| Border point | Falls inside the ε-neighborhood of some core point but has fewer than MinPts neighbors itself. |
| Noise point | Neither a core point nor a border point. Treated as an outlier. |
Two more relations link these categories into clusters. A point $q$ is directly density-reachable from a core point $p$ if $q \in N_\varepsilon(p)$. Direct reachability is symmetric only when both points are core points, since a border point cannot reach anything. A point $q$ is density-reachable from $p$ if there is a chain of points $p_1 = p, p_2, \ldots, p_n = q$ where each $p_{i+1}$ is directly density-reachable from $p_i$; this relation is transitive but not symmetric. Two points $p$ and $q$ are density-connected if there exists a core point $o$ such that both are density-reachable from $o$. Density-connectivity is symmetric and defines DBSCAN clusters: a cluster is a maximal set of mutually density-connected points (Ester et al. 1996, Definitions 1-5). Noise is whatever is left over once the clusters have been formed.
Iterate over the dataset, and whenever an unvisited point turns out to be a core point, expand a new cluster around it by following the density-reachability relation until no more points can be added. The expansion is a breadth-first search over the graph of directly density-reachable point pairs.
DBSCAN(D, eps, MinPts):
label every point as Unvisited
C = 0
for each point P in D:
if P is Visited: continue
mark P as Visited
N = regionQuery(P, eps)
if |N| < MinPts:
label P as Noise
else:
C = C + 1
expandCluster(P, N, C, eps, MinPts)
expandCluster(P, N, C, eps, MinPts):
add P to cluster C
seeds = N \ {P}
while seeds is not empty:
Q = seeds.pop()
if Q is Unvisited:
mark Q as Visited
M = regionQuery(Q, eps)
if |M| >= MinPts:
seeds = seeds union M
if Q is not yet a member of any cluster:
add Q to cluster C
The regionQuery step returns all points within ε of the query point. This is the part that benefits from a spatial index: with a k-d tree, R-tree, or ball tree, each query takes roughly $O(\log n)$ time on average for low-dimensional data, instead of the $O(n)$ time required by a linear scan.
Border points can sit on the edge of more than one cluster. The algorithm assigns each border point to whichever cluster touches it first, so the output can depend on the order in which points are visited. Modern variants such as DBSCAN* treat border points as noise to remove this ambiguity.
DBSCAN exposes two main parameters and one implicit choice of distance metric.
| Parameter | Meaning | Common defaults |
|---|---|---|
eps (ε) | Maximum radius of a neighborhood. Two points are neighbors if their distance is at most ε. | Estimated from data. No universal default. |
MinPts | Minimum number of points in an ε-neighborhood for a point to be a core point. Includes the point itself. | 4 for 2D data; rule of thumb is MinPts = 2 × dim for higher dimensions. |
| Distance metric | Function used to measure distance between points. | Euclidean. |
Choosing eps is the harder problem. The classical heuristic from the 1996 paper is the k-distance plot: for each point, compute the distance to its k-th nearest neighbor (with k set to MinPts), sort the distances in descending order, and plot them. The plot usually has a clear elbow, and the y-value at that elbow is a reasonable choice for eps. Below the elbow, distances grow slowly because most neighborhoods are dense; above the elbow, distances grow fast because the queries hit empty space between clusters and noise.
MinPts controls how strict the density requirement is. Larger values produce sparser, more conservative clusters and label more points as noise. Sander, Ester, Kriegel, and Xu suggested MinPts = 2 × dim as a starting point in their 1998 paper on Generalized DBSCAN, with the related rule of thumb MinPts ≥ dim + 1 to avoid degeneracies. Many practitioners use MinPts = 4 for 2D spatial data, the value the original paper used. Schubert et al. (2017) recommend slightly larger values for noisy data.
Feature scaling matters. Because DBSCAN measures distances directly, features with large numerical ranges dominate the metric. Standardizing or normalizing features before clustering is usually a good idea, unless the relative scales already encode something meaningful such as geographic coordinates.
The complexity of DBSCAN has been a source of confusion since the original paper. The 1996 paper claimed an average runtime of $O(n \log n)$ when neighborhood queries are answered by an R-tree, but this holds only in low dimensions on well-distributed data. The worst case is $O(n^2)$: without a spatial index, every region query touches the whole dataset. The 2017 "DBSCAN Revisited, Revisited" paper showed that even with an index, no exact algorithm can do strictly better than $O(n^2)$ in the general worst case, citing a 2015 hardness result by Gan and Tao for Euclidean DBSCAN in dimensions ≥ 3.
| Setting | Runtime | Memory |
|---|---|---|
| Naive, no index | $O(n^2)$ | $O(n)$ |
| Index, low dimensions, well-distributed data | $O(n \log n)$ average | $O(n)$ |
| Worst case, any exact algorithm | $O(n^2)$ | $O(n)$ |
| With precomputed distance matrix | $O(n^2)$ time and memory | $O(n^2)$ |
High dimensionality is a separate issue. As dimension grows, all pairs of points start to have similar distances because of the curse of dimensionality, and spatial indexes lose their advantage over linear scans. In high-dimensional spaces, DBSCAN often runs at quadratic cost in practice and the very notion of a meaningful ε breaks down.
DBSCAN has lasted for almost three decades because it does several things older clustering algorithms did poorly. It does not require the number of clusters as input; the count emerges from the data and from the choice of ε and MinPts. It finds clusters of arbitrary shape, since density-connectivity can wind through curved or non-convex regions, while k-means and Gaussian mixture models assume blob-shaped or ellipsoidal clusters. It has a built-in notion of noise, so points that do not belong to any dense region are labeled as outliers rather than forced into a cluster. It is mostly insensitive to the order of input points, except for the small ambiguity at cluster borders.
The weaknesses are equally well-known. DBSCAN struggles when the data has clusters of very different densities, because a single global ε cannot fit both a tight cluster and a sparse one. This was the main motivation for OPTICS and HDBSCAN. Results are sensitive to ε: small changes can collapse two clusters into one or split one into many. The k-distance plot helps but is not always conclusive. High-dimensional data is hard because distances become uninformative and the spatial-index speedup disappears, so many practitioners reduce dimensionality with PCA, UMAP, or autoencoders before running DBSCAN. Very large datasets are tricky unless a good index is available. Border-point ambiguity also means that two adjacent clusters can both legitimately claim the same point, and the choice between them is essentially arbitrary.
DBSCAN spawned a family of related algorithms.
OPTICS (Ordering Points To Identify the Clustering Structure) was introduced by Mihael Ankerst, Markus Breunig, Kriegel, and Sander in 1999. Instead of producing a flat partition, OPTICS produces an ordering of the dataset and a reachability plot, from which clusters at any density level can be extracted. It replaces the single ε with an upper bound and lets the user pick density thresholds afterward. Its complexity matches DBSCAN's.
Generalized DBSCAN (Sander, Ester, Kriegel, Xu, 1998) abstracts the algorithm so that the neighborhood definition and the density predicate can be replaced with user-defined functions. This makes the same code apply to non-spatial data such as polygon datasets where neighborhoods are defined by adjacency.
HDBSCAN (Hierarchical DBSCAN) was introduced by Ricardo Campello, Davoud Moulavi, and Sander in 2013. HDBSCAN builds a hierarchy of DBSCAN-like clusterings across all values of ε at once, then extracts a flat clustering by selecting the most stable clusters from that hierarchy. The user sets only min_cluster_size, which is much less brittle than ε. HDBSCAN handles varying-density clusters that DBSCAN cannot, at the cost of higher constant factors. A port is bundled into scikit-learn as sklearn.cluster.HDBSCAN since version 1.3.
DBSCAN* is a small modification proposed in the same 2013 paper. It treats border points as noise, making the output deterministic and aligning DBSCAN more cleanly with the hierarchical view used by HDBSCAN.
DBSCAN++ was proposed by Jennifer Jang and Heinrich Jiang in 2019 (ICML). It computes core distances only on a sub-sample and propagates labels to the rest, giving sublinear sample complexity while preserving DBSCAN's output asymptotically.
Parallel and distributed variants include PDBSCAN (Xu, Jäger, Kriegel, 1999), MR-DBSCAN by He, Tan, Luo, Feng, and Fan (2014) for MapReduce, plus various Spark and GPU implementations. The hard part of parallelizing DBSCAN is merging clusters that span partition boundaries.
Approximate DBSCAN algorithms work around the $O(n^2)$ worst case, including the $\rho$-approximate grid scheme of Gan and Tao (2015) and locality-sensitive hashing for high-dimensional data.
DBSCAN ships in essentially every machine learning library that does clustering.
| Implementation | Language | Notes |
|---|---|---|
sklearn.cluster.DBSCAN | Python | Reference implementation in scikit-learn. Supports k-d tree, ball tree, and brute-force neighbor search; works with any precomputed distance matrix. |
sklearn.cluster.OPTICS and sklearn.cluster.HDBSCAN | Python | Bundled successors. HDBSCAN was added in scikit-learn 1.3; the standalone hdbscan package by Leland McInnes has more features. |
| ELKI | Java | The academic home for DBSCAN, OPTICS, and many variants, maintained by the Heidelberg group around Erich Schubert. Often the fastest exact implementation. |
dbscan (R) | R | The R package by Michael Hahsler also includes OPTICS and HDBSCAN. |
| SPMF | Java | Open-source data mining library by Philippe Fournier-Viger that bundles DBSCAN with many other algorithms for benchmarking. |
| RAPIDS cuML | Python (GPU) | GPU-accelerated DBSCAN on CUDA. |
| Apache Spark MLlib, Mahout, Weka | Scala/Java | Community implementations for distributed and Java environments. |
In scikit-learn, a typical call:
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(raw_data)
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
labels = db.labels_ # noise points get label -1
The min_samples parameter is scikit-learn's name for MinPts. Points labeled -1 are noise.
| Property | DBSCAN | k-means | OPTICS | HDBSCAN | Hierarchical |
|---|---|---|---|---|---|
| Number of clusters | Automatic | Set in advance | Automatic (reachability plot) | Automatic | Cut from dendrogram |
| Cluster shape | Arbitrary | Convex, spherical | Arbitrary | Arbitrary | Arbitrary |
| Handles noise | Yes, explicit label | No | Yes | Yes | Only via post-processing |
| Varying densities | Poorly | Sensitive to scale | Yes | Yes (main advantage) | Depends on linkage |
| Main parameters | eps, MinPts | k, initialization | min_samples, optional max_eps | min_cluster_size | Linkage, cut-off |
| Scales to millions | With index, low dim | Yes | Same as DBSCAN | Heavier than DBSCAN | Quadratic memory |
| Deterministic | Up to border-point ties | Depends on init | Yes | Yes | Yes |
| Typical complexity | $O(n \log n)$ to $O(n^2)$ | $O(n k i)$ | $O(n \log n)$ to $O(n^2)$ | $O(n \log n)$ to $O(n^2)$ | $O(n^2 \log n)$ to $O(n^3)$ |
The practical decision rule most people use: start with k-means if you know roughly how many clusters you expect and the shapes look spherical. Switch to DBSCAN when you do not know k, expect non-convex shapes, or noise is a real concern. Move to HDBSCAN when results depend too sensitively on eps or when clusters have visibly different densities. OPTICS sits between the two, with the same complexity profile but a reachability plot rather than a flat partition.
DBSCAN has been applied across many domains where dense regions are meaningful and outliers need to be flagged.
| Domain | Use |
|---|---|
| Geographic information systems | Hotspot identification from GPS traces, taxi pickup locations, or earthquake epicenters. The original 1996 paper used a spatial database of geographic features. |
| Anomaly detection | Treating noise points as candidates for anomaly detection in fraud, intrusion detection, and quality control. |
| Image segmentation | Clustering pixels in color and position space to find homogeneous regions, especially in noisy images where k-means oversegments. |
| Astronomy | Finding galaxy clusters, stellar streams, and other structures in sky surveys. |
| Bioinformatics | Clustering gene expression data, single-cell RNA sequencing readouts, and protein structures. |
| Document clustering | Grouping documents in semantic embedding space, with the noise label catching off-topic outliers. |
| Trajectory analysis | Finding common paths in vehicle, pedestrian, or animal movement data. |
The SIGKDD Test of Time Award is given each year to a paper from at least ten years earlier that has had lasting impact. DBSCAN won the award in 2014. As of early 2025, Google Scholar lists more than 35,000 citations to the original paper, putting DBSCAN among the most cited papers in computer science and one of the most cited in the history of the KDD conference. Semantic Scholar tracks comparable numbers and lists thousands of citing papers per year well into the 2020s.
The 2017 "Revisited, Revisited" paper from Schubert and colleagues is unusually self-critical for a follow-up to a celebrated result. Three points from that paper are worth highlighting:
For low- to moderate-dimensional data where the cluster count is unknown and noise is part of the picture, DBSCAN is still often the first thing to try.