DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based cluster analysis algorithm that groups together points packed closely in feature space and labels points in low-density regions as noise. It was published by Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu at the second ACM SIGKDD conference in 1996. The original paper, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," received the SIGKDD Test of Time Award in 2014, and DBSCAN remains one of the most cited algorithms in data mining.

Unlike partitioning methods such as k-means, DBSCAN does not require the user to specify the number of clusters beforehand. It can also recover clusters of arbitrary shape and explicitly separates outliers, which makes it useful when the data contains noise or when clusters are not roughly spherical. The trade-off is that DBSCAN's results depend on two parameters, the neighborhood radius eps (often written ε) and the minimum density MinPts, and it tends to struggle when clusters in the same dataset have very different densities.

Facts

Attribute	Value
Algorithm class	Density-based clustering, unsupervised learning
Year published	1996 (KDD-96, Portland, Oregon, August 2-4)
Authors	Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu
Affiliation	University of Munich (Ludwig-Maximilians-Universität München)
Original paper	"A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," KDD-96 Proceedings, pp. 226-231, AAAI Press
Major award	ACM SIGKDD Test of Time Award (2014)
Parameters	`eps` (ε), `MinPts`
Time complexity	$O(n \log n)$ average with spatial index in low dimensions; $O(n^2)$ worst case
Space complexity	$O(n)$ without precomputed distance matrix; $O(n^2)$ with one
Output	Flat partition with explicit noise label (commonly `-1`)
Reference implementation	`sklearn.cluster.DBSCAN` (scikit-learn), ELKI, R `dbscan` package

Background

By the mid-1990s, the dominant clustering algorithms had recognized weaknesses. k-means needed the user to fix the number of clusters in advance, was sensitive to initialization, and produced spherical Voronoi cells that could not capture curved or elongated structures. Hierarchical clustering handled arbitrary shapes better but had at least quadratic runtime and produced a dendrogram rather than a flat partition. CLARANS, a randomized k-medoids method by Ng and Han (1994), scaled poorly on large spatial datasets where the user did not know k and the data could contain noise.

The authors of DBSCAN came from the spatial database research group at the University of Munich. They wanted clustering routines that ran inside spatial database systems on tens of thousands of points, with cluster shapes determined by geography rather than by Gaussian assumptions. Density-based clustering can be viewed as a form of non-parametric density estimation: a cluster is a connected level set of the underlying density above some threshold, and the algorithm recovers those level sets from a finite sample.

History and the 1996 paper

The authors proposed a definition of clusters based on local density, designed so that the algorithm could exploit spatial indexes such as the R-tree for fast neighborhood queries. DBSCAN was presented at KDD-96 in Portland, Oregon, on August 2-4, 1996. In 2014 the SIGKDD program committee gave the paper its Test of Time Award. By that point DBSCAN had been incorporated into virtually every general-purpose machine learning library and had inspired a long line of variants.

In 2017, Erich Schubert, Sander, Ester, Kriegel, and Xu published a follow-up paper titled "DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN" in ACM Transactions on Database Systems. That paper clarified what the algorithm actually computes, corrected common implementation mistakes, and addressed a misconception about its worst-case complexity that had spread through the literature.

Core concepts

DBSCAN's definitions are built around the ε-neighborhood of a point. For a dataset $D$ in some metric space and a parameter $\varepsilon > 0$, the ε-neighborhood of a point $p$ is the set of all points within distance ε:

$$N_\varepsilon(p) = { q \in D : \text{dist}(p, q) \leq \varepsilon }$$

The distance function is usually Euclidean, but DBSCAN works with any metric, including Manhattan, cosine, Haversine for geographic data, or precomputed distance matrices.

From this neighborhood, DBSCAN classifies each point into one of three categories.

Type	Definition
Core point	Has at least `MinPts` points (including itself) inside its ε-neighborhood. Formally, $
Border point	Falls inside the ε-neighborhood of some core point but has fewer than `MinPts` neighbors itself.
Noise point	Neither a core point nor a border point. Treated as an outlier.

Two more relations link these categories into clusters. A point $q$ is directly density-reachable from a core point $p$ if $q \in N_\varepsilon(p)$. Direct reachability is symmetric only when both points are core points, since a border point cannot reach anything. A point $q$ is density-reachable from $p$ if there is a chain of points $p_1 = p, p_2, \ldots, p_n = q$ where each $p_{i+1}$ is directly density-reachable from $p_i$; this relation is transitive but not symmetric. Two points $p$ and $q$ are density-connected if there exists a core point $o$ such that both are density-reachable from $o$. Density-connectivity is symmetric and defines DBSCAN clusters: a cluster is a maximal set of mutually density-connected points (Ester et al. 1996, Definitions 1-5). Noise is whatever is left over once the clusters have been formed.

Algorithm

Iterate over the dataset, and whenever an unvisited point turns out to be a core point, expand a new cluster around it by following the density-reachability relation until no more points can be added. The expansion is a breadth-first search over the graph of directly density-reachable point pairs.

DBSCAN(D, eps, MinPts):
    label every point as Unvisited
    C = 0
    for each point P in D:
        if P is Visited: continue
        mark P as Visited
        N = regionQuery(P, eps)
        if |N| < MinPts:
            label P as Noise
        else:
            C = C + 1
            expandCluster(P, N, C, eps, MinPts)

expandCluster(P, N, C, eps, MinPts):
    add P to cluster C
    seeds = N \ {P}
    while seeds is not empty:
        Q = seeds.pop()
        if Q is Unvisited:
            mark Q as Visited
            M = regionQuery(Q, eps)
            if |M| >= MinPts:
                seeds = seeds union M
        if Q is not yet a member of any cluster:
            add Q to cluster C

The regionQuery step returns all points within ε of the query point. This is the part that benefits from a spatial index: with a k-d tree, R-tree, or ball tree, each query takes roughly $O(\log n)$ time on average for low-dimensional data, instead of the $O(n)$ time required by a linear scan.

Border points can sit on the edge of more than one cluster. The algorithm assigns each border point to whichever cluster touches it first, so the output can depend on the order in which points are visited. Modern variants such as DBSCAN* treat border points as noise to remove this ambiguity.

Parameters

DBSCAN exposes two main parameters and one implicit choice of distance metric.

Parameter	Meaning	Common defaults
`eps` (ε)	Maximum radius of a neighborhood. Two points are neighbors if their distance is at most ε.	Estimated from data. No universal default.
`MinPts`	Minimum number of points in an ε-neighborhood for a point to be a core point. Includes the point itself.	4 for 2D data; rule of thumb is `MinPts = 2 × dim` for higher dimensions.
Distance metric	Function used to measure distance between points.	Euclidean.

Choosing eps is the harder problem. The classical heuristic from the 1996 paper is the k-distance plot: for each point, compute the distance to its k-th nearest neighbor (with k set to MinPts), sort the distances in descending order, and plot them. The plot usually has a clear elbow, and the y-value at that elbow is a reasonable choice for eps. Below the elbow, distances grow slowly because most neighborhoods are dense; above the elbow, distances grow fast because the queries hit empty space between clusters and noise.

MinPts controls how strict the density requirement is. Larger values produce sparser, more conservative clusters and label more points as noise. Sander, Ester, Kriegel, and Xu suggested MinPts = 2 × dim as a starting point in their 1998 paper on Generalized DBSCAN, with the related rule of thumb MinPts ≥ dim + 1 to avoid degeneracies. Many practitioners use MinPts = 4 for 2D spatial data, the value the original paper used. Schubert et al. (2017) recommend slightly larger values for noisy data.

Feature scaling matters. Because DBSCAN measures distances directly, features with large numerical ranges dominate the metric. Standardizing or normalizing features before clustering is usually a good idea, unless the relative scales already encode something meaningful such as geographic coordinates.

Computational complexity

The complexity of DBSCAN has been a source of confusion since the original paper. The 1996 paper claimed an average runtime of $O(n \log n)$ when neighborhood queries are answered by an R-tree, but this holds only in low dimensions on well-distributed data. The worst case is $O(n^2)$: without a spatial index, every region query touches the whole dataset. The 2017 "DBSCAN Revisited, Revisited" paper showed that even with an index, no exact algorithm can do strictly better than $O(n^2)$ in the general worst case, citing a 2015 hardness result by Gan and Tao for Euclidean DBSCAN in dimensions ≥ 3.

Setting	Runtime	Memory
Naive, no index	$O(n^2)$	$O(n)$
Index, low dimensions, well-distributed data	$O(n \log n)$ average	$O(n)$
Worst case, any exact algorithm	$O(n^2)$	$O(n)$
With precomputed distance matrix	$O(n^2)$ time and memory	$O(n^2)$

High dimensionality is a separate issue. As dimension grows, all pairs of points start to have similar distances because of the curse of dimensionality, and spatial indexes lose their advantage over linear scans. In high-dimensional spaces, DBSCAN often runs at quadratic cost in practice and the very notion of a meaningful ε breaks down.

Strengths and weaknesses

DBSCAN has lasted for almost three decades because it does several things older clustering algorithms did poorly. It does not require the number of clusters as input; the count emerges from the data and from the choice of ε and MinPts. It finds clusters of arbitrary shape, since density-connectivity can wind through curved or non-convex regions, while k-means and Gaussian mixture models assume blob-shaped or ellipsoidal clusters. It has a built-in notion of noise, so points that do not belong to any dense region are labeled as outliers rather than forced into a cluster. It is mostly insensitive to the order of input points, except for the small ambiguity at cluster borders.

The weaknesses are equally well-known. DBSCAN struggles when the data has clusters of very different densities, because a single global ε cannot fit both a tight cluster and a sparse one. This was the main motivation for OPTICS and HDBSCAN. Results are sensitive to ε: small changes can collapse two clusters into one or split one into many. The k-distance plot helps but is not always conclusive. High-dimensional data is hard because distances become uninformative and the spatial-index speedup disappears, so many practitioners reduce dimensionality with PCA, UMAP, or autoencoders before running DBSCAN. Very large datasets are tricky unless a good index is available. Border-point ambiguity also means that two adjacent clusters can both legitimately claim the same point, and the choice between them is essentially arbitrary.

Variants and successors

DBSCAN spawned a family of related algorithms.

OPTICS (Ordering Points To Identify the Clustering Structure) was introduced by Mihael Ankerst, Markus Breunig, Kriegel, and Sander in 1999. Instead of producing a flat partition, OPTICS produces an ordering of the dataset and a reachability plot, from which clusters at any density level can be extracted. It replaces the single ε with an upper bound and lets the user pick density thresholds afterward. Its complexity matches DBSCAN's.

Generalized DBSCAN (Sander, Ester, Kriegel, Xu, 1998) abstracts the algorithm so that the neighborhood definition and the density predicate can be replaced with user-defined functions. This makes the same code apply to non-spatial data such as polygon datasets where neighborhoods are defined by adjacency.

HDBSCAN (Hierarchical DBSCAN) was introduced by Ricardo Campello, Davoud Moulavi, and Sander in 2013. HDBSCAN builds a hierarchy of DBSCAN-like clusterings across all values of ε at once, then extracts a flat clustering by selecting the most stable clusters from that hierarchy. The user sets only min_cluster_size, which is much less brittle than ε. HDBSCAN handles varying-density clusters that DBSCAN cannot, at the cost of higher constant factors. A port is bundled into scikit-learn as sklearn.cluster.HDBSCAN since version 1.3.

DBSCAN* is a small modification proposed in the same 2013 paper. It treats border points as noise, making the output deterministic and aligning DBSCAN more cleanly with the hierarchical view used by HDBSCAN.

DBSCAN++ was proposed by Jennifer Jang and Heinrich Jiang in 2019 (ICML). It computes core distances only on a sub-sample and propagates labels to the rest, giving sublinear sample complexity while preserving DBSCAN's output asymptotically.

Parallel and distributed variants include PDBSCAN (Xu, Jäger, Kriegel, 1999), MR-DBSCAN by He, Tan, Luo, Feng, and Fan (2014) for MapReduce, plus various Spark and GPU implementations. The hard part of parallelizing DBSCAN is merging clusters that span partition boundaries.

Approximate DBSCAN algorithms work around the $O(n^2)$ worst case, including the $\rho$-approximate grid scheme of Gan and Tao (2015) and locality-sensitive hashing for high-dimensional data.

Implementations

DBSCAN ships in essentially every machine learning library that does clustering.

Implementation	Language	Notes
`sklearn.cluster.DBSCAN`	Python	Reference implementation in scikit-learn. Supports k-d tree, ball tree, and brute-force neighbor search; works with any precomputed distance matrix.
`sklearn.cluster.OPTICS` and `sklearn.cluster.HDBSCAN`	Python	Bundled successors. HDBSCAN was added in scikit-learn 1.3; the standalone `hdbscan` package by Leland McInnes has more features.
ELKI	Java	The academic home for DBSCAN, OPTICS, and many variants, maintained by the Heidelberg group around Erich Schubert. Often the fastest exact implementation.
`dbscan` (R)	R	The R package by Michael Hahsler also includes OPTICS and HDBSCAN.
SPMF	Java	Open-source data mining library by Philippe Fournier-Viger that bundles DBSCAN with many other algorithms for benchmarking.
RAPIDS cuML	Python (GPU)	GPU-accelerated DBSCAN on CUDA.
Apache Spark MLlib, Mahout, Weka	Scala/Java	Community implementations for distributed and Java environments.

In scikit-learn, a typical call:

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(raw_data)
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
labels = db.labels_  # noise points get label -1

The min_samples parameter is scikit-learn's name for MinPts. Points labeled -1 are noise.

Comparison with other clustering algorithms

Property	DBSCAN	k-means	OPTICS	HDBSCAN	Hierarchical
Number of clusters	Automatic	Set in advance	Automatic (reachability plot)	Automatic	Cut from dendrogram
Cluster shape	Arbitrary	Convex, spherical	Arbitrary	Arbitrary	Arbitrary
Handles noise	Yes, explicit label	No	Yes	Yes	Only via post-processing
Varying densities	Poorly	Sensitive to scale	Yes	Yes (main advantage)	Depends on linkage
Main parameters	`eps`, `MinPts`	`k`, initialization	`min_samples`, optional `max_eps`	`min_cluster_size`	Linkage, cut-off
Scales to millions	With index, low dim	Yes	Same as DBSCAN	Heavier than DBSCAN	Quadratic memory
Deterministic	Up to border-point ties	Depends on init	Yes	Yes	Yes
Typical complexity	$O(n \log n)$ to $O(n^2)$	$O(n k i)$	$O(n \log n)$ to $O(n^2)$	$O(n \log n)$ to $O(n^2)$	$O(n^2 \log n)$ to $O(n^3)$

The practical decision rule most people use: start with k-means if you know roughly how many clusters you expect and the shapes look spherical. Switch to DBSCAN when you do not know k, expect non-convex shapes, or noise is a real concern. Move to HDBSCAN when results depend too sensitively on eps or when clusters have visibly different densities. OPTICS sits between the two, with the same complexity profile but a reachability plot rather than a flat partition.

Applications

DBSCAN has been applied across many domains where dense regions are meaningful and outliers need to be flagged.

Domain	Use
Geographic information systems	Hotspot identification from GPS traces, taxi pickup locations, or earthquake epicenters. The original 1996 paper used a spatial database of geographic features.
Anomaly detection	Treating noise points as candidates for anomaly detection in fraud, intrusion detection, and quality control.
Image segmentation	Clustering pixels in color and position space to find homogeneous regions, especially in noisy images where k-means oversegments.
Astronomy	Finding galaxy clusters, stellar streams, and other structures in sky surveys.
Bioinformatics	Clustering gene expression data, single-cell RNA sequencing readouts, and protein structures.
Document clustering	Grouping documents in semantic embedding space, with the noise label catching off-topic outliers.
Trajectory analysis	Finding common paths in vehicle, pedestrian, or animal movement data.

Test of Time Award and impact

The SIGKDD Test of Time Award is given each year to a paper from at least ten years earlier that has had lasting impact. DBSCAN won the award in 2014. As of early 2025, Google Scholar lists more than 35,000 citations to the original paper, putting DBSCAN among the most cited papers in computer science and one of the most cited in the history of the KDD conference. Semantic Scholar tracks comparable numbers and lists thousands of citing papers per year well into the 2020s.

The 2017 "Revisited, Revisited" paper from Schubert and colleagues is unusually self-critical for a follow-up to a celebrated result. Three points from that paper are worth highlighting:

The $O(n \log n)$ claim was conditional. It applies in the average case in low dimensions with a well-functioning spatial index, not as a worst-case bound. Restating $O(n \log n)$ as the general complexity, as many textbooks and libraries did, is incorrect.
Border-point handling differs across implementations. Some libraries assign border points to the first cluster that reaches them (the original DBSCAN), while others use DBSCAN* semantics and label them as noise.
The cluster definition is set-based, not partition-based. Border points can legitimately belong to multiple clusters, and the algorithm's choice between them is implementation-dependent.

For low- to moderate-dimensional data where the cluster count is unknown and noise is part of the picture, DBSCAN is still often the first thing to try.

References

Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise." In KDD-96 Proceedings, pp. 226-231. AAAI Press.
Sander, J., Ester, M., Kriegel, H.-P., and Xu, X. (1998). "Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications." Data Mining and Knowledge Discovery, 2(2), 169-194.
Ankerst, M., Breunig, M. M., Kriegel, H.-P., and Sander, J. (1999). "OPTICS: Ordering Points To Identify the Clustering Structure." ACM SIGMOD, pp. 49-60.
Xu, X., Jäger, J., and Kriegel, H.-P. (1999). "A Fast Parallel Clustering Algorithm for Large Spatial Databases." Data Mining and Knowledge Discovery, 3(3), 263-290.
Campello, R. J. G. B., Moulavi, D., and Sander, J. (2013). "Density-Based Clustering Based on Hierarchical Density Estimates." PAKDD, LNCS 7819, pp. 160-172.
He, Y., Tan, H., Luo, W., Feng, S., and Fan, J. (2014). "MR-DBSCAN: A Scalable MapReduce-Based DBSCAN Algorithm for Heavily Skewed Data." Frontiers of Computer Science, 8(1), 83-99.
Gan, J., and Tao, Y. (2015). "DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation." ACM SIGMOD, pp. 519-530.
Schubert, E., Sander, J., Ester, M., Kriegel, H.-P., and Xu, X. (2017). "DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN." ACM Transactions on Database Systems, 42(3), Article 19.
Jang, J., and Jiang, H. (2019). "DBSCAN++: Towards Fast and Scalable Density Clustering." ICML, PMLR 97, pp. 3019-3029.
ACM SIGKDD. "SIGKDD 2014 Test of Time Award." SIGKDD awards listing.
Pedregosa, F., et al. (2011). "Scikit-learn: Machine Learning in Python." Journal of Machine Learning Research, 12, 2825-2830. See also the `sklearn.cluster.DBSCAN`, `OPTICS`, and `HDBSCAN` documentation.
McInnes, L., Healy, J., and Astels, S. (2017). "hdbscan: Hierarchical density based clustering." Journal of Open Source Software, 2(11), 205.
Schubert, E., and Zimek, A. ELKI: Environment for Developing KDD-Applications Supported by Index-Structures. Heidelberg University.
Hahsler, M., Piekenbrock, M., and Doran, D. (2019). "dbscan: Fast Density-Based Clustering with R." Journal of Statistical Software, 91(1), 1-30.

DBSCAN

Facts

Background

History and the 1996 paper

Core concepts

Algorithm

Parameters

Computational complexity

Strengths and weaknesses

Variants and successors

Implementations

Comparison with other clustering algorithms

Applications

Test of Time Award and impact

See also

References

Improve this article

Related Articles

Latent Dirichlet allocation

Dimension Reduction

Anomaly Detection

Clustering

Divisive Clustering

Hierarchical Clustering

DBSCAN

Facts

Background

History and the 1996 paper

Core concepts

Algorithm

Parameters

Computational complexity

Strengths and weaknesses

Variants and successors

Implementations

Comparison with other clustering algorithms

Applications

Test of Time Award and impact

See also

References

Related Articles

Latent Dirichlet allocation

Dimension Reduction

Anomaly Detection

Clustering

Divisive Clustering

Hierarchical Clustering