TransWikia.com

A routine to choose eps and minPts for DBSCAN

Cross Validated Asked by Mehraban on December 18, 2021

DBSCAN is most cited clustering algorithm according to some literature and it can find arbitrary shape clusters based on density. It has two parameters eps (as neighborhood radius) and minPts (as minimum neighbors to consider a point as core point) which I believe it highly depends on them.

Is there any routine or commonly used method to choose these parameters?

3 Answers

Maybe a bit late, but I would like to add an answer here for future knowledge. One way to find the best $epsilon$ for DBSCAN is to compute the knn, then sort the distances and see where the "knee" is located.

Example in python, because is the language I manage.:

from sklearn.neighbors import NearestNeighbors
import plotly.express as px

neighbors = 6
# X_embedded is your data
nbrs = NearestNeighbors(n_neighbors=neighbors ).fit(X_embedded)
distances, indices = nbrs.kneighbors(X_embedded)
distance_desc = sorted(distances[:,ns-1], reverse=True)
px.line(x=list(range(1,len(distance_desc )+1)),y= distance_desc )

SIMPLE IMAGE

Then, to find the "knee", you can use another package: (pip install kneed)

from kneed import KneeLocator
kneedle = KneeLocator(range(1,len(distanceDec)+1),  #x values
                      distanceDec, # y values
                      S=1.0, #parameter suggested from paper
                      curve="convex", #parameter from figure
                      direction="decreasing") #parameter from figure

To see where the "knee" is, you can run

kneedle.plot_knee_normalized()

enter image description here

the commands kneedle.elbow or kneedle.knee returns the index of the x array, and the kneedle.knee_y returns the optimum value for $epsilon$.

Answered by Luis Felipe on December 18, 2021

minPts is selected based on the domain knowledge. If you do not have domain understanding, a rule of thumb is to derive minPts from the number of dimensions D in the data set. minPts >= D + 1. For 2D data, take minPts = 4. For larger datasets, with much noise, it suggested to go with minPts = 2 * D.

Once you have the appropriate minPts, in order to determine the optimal eps, follow these steps -

Let's say minPts = 24

  1. For every point in dataset, compute the distance of it's 24th nearest neighbor.(generally we use euclidean distance, but you can experiment with different distance metrics).
  2. Sort the distances in the increasing order.
  3. Plot the chart of distances on Y-axis v/s the index of the datapoints on X-axis.
  4. Observe the sudden increase or what we popularly call as an 'elbow' or 'knee' in the plot. Select the distance value that corresponds to the 'elbow' as optimal eps.

Answered by Pratik Nabriya on December 18, 2021

There are plenty of publications that propose methods to choose these parameters.

The most notable is OPTICS, a DBSCAN variation that does away with the epsilon parameter; it produces a hierarchical result that can roughly be seen as "running DBSCAN with every possible epsilon".

For minPts, I do suggest to not rely on an automatic method, but on your domain knowledge.

A good clustering algorithm has parameters, that allow you to customize it to your needs.

A parameter that you overlooked is the distance function. The first thing to do for DBSCAN is to find a good distance function for your application. Do not rely on Euclidean distance being the best for every application!

Answered by Has QUIT--Anony-Mousse on December 18, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP