Sunday, January 3, 2021

Python Study notes: How HDBSCAN works?

HDBSCAN is a clustering algorithm. It extends DBSCAN(Density-Based Spatial Clustering of Applications with Noise)by converting it into a hierarchical clustering algorithm. The main concept of DBSCAN algorithm is to locate regions of high density that are separated from one another by regions of low density. Here are some good tutorial. the steps of DBSCAN algorithm:
The algorithm starts with an arbitrary point which has not been visited and its neighborhood information is retrieved from the ϵ parameter.

If this point contains MinPts within ϵ neighborhood, cluster formation starts. Otherwise the point is labeled as noise. This point can be later found within the ϵ neighborhood of a different point and, thus can be made a part of the cluster. Concept of density reachable and density connected points are important here.

If a point is found to be a core point then the points within the ϵ neighborhood is also part of the cluster. So all the points found within ϵ neighborhood are added, along with their own ϵ neighborhood, if they are also core points.

The above process continues until the density-connected cluster is completely found.

The process restarts with a new point which can be a part of a new cluster or labeled as noise.

The notes are taken from the notebook.
#import package and setup the display
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.datasets as data
%matplotlib inline
sns.set_context('poster')
sns.set_style('white')
sns.set_color_codes()
plot_kwds = {'alpha' : 0.5, 's' : 80, 'linewidths':0}

#load the data:
moons, _ = data.make_moons(n_samples=50, noise=0.05)
blobs, _ = data.make_blobs(n_samples=50, centers=[(-0.75,2.25), (1.0, 2.0)], cluster_std=0.25)
test_data = np.vstack([moons, blobs])
plt.scatter(test_data.T[0], test_data.T[1], color='b', **plot_kwds)
Time to import the hdbscan package and run the hierarchical clustering algorithm.
import hdbscan
clusterer = hdbscan.HDBSCAN(min_cluster_size=5, gen_min_span_tree=True)
clusterer.fit(test_data)
So now that we have clustered the data -- what actually happened? We can break it out into a series of steps:

Transform the space according to the density/sparsity.
Build the minimum spanning tree of the distance weighted graph.
Construct a cluster hierarchy of connected components.
Condense the cluster hierarchy based on minimum cluster size.
Extract the stable clusters from the condensed tree.

Core distance for a point x is defined for parameter k for a point x and denote as corek(x), via kth nearest neighbor. In other words, draw the circle to cover k nearest data points, the radius of that circle is the core distance.

Mutual reachability distance: now with 2 circles for each data point, find the maximum radius to cover both circles:

clusterer.minimum_spanning_tree_.plot(edge_cmap='viridis', 
                                      edge_alpha=0.6, 
                                      node_size=80, 
                                      edge_linewidth=2)

#Any point not in a selected cluster is simply a noise point(assigned the label -1)
palette = sns.color_palette()
cluster_colors = [sns.desaturate(palette[col], sat) 
                  if col >= 0 else (0.5, 0.5, 0.5) for col, sat in 
                  zip(clusterer.labels_, clusterer.probabilities_)]
plt.scatter(test_data.T[0], test_data.T[1], c=cluster_colors, **plot_kwds)                                      
Parameter Selection for HDBSCAN

1. Selecting min_cluster_size: the smallest size grouping that you wish to consider a cluster.

digits = datasets.load_digits()
data = digits.data
projection = TSNE().fit_transform(data)
plt.scatter(*projection.T, **plot_kwds)

#start with a min_cluster_size of 15
clusterer = hdbscan.HDBSCAN(min_cluster_size=15).fit(data)
color_palette = sns.color_palette('Paired', 12)
cluster_colors = [color_palette[x] if x >= 0
                  else (0.5, 0.5, 0.5)
                  for x in clusterer.labels_]
cluster_member_colors = [sns.desaturate(x, p) for x, p in
                         zip(cluster_colors, clusterer.probabilities_)]
plt.scatter(*projection.T, s=50, linewidth=0, c=cluster_member_colors, alpha=0.25)

#Increasing the min_cluster_size to 30 
#reduces the number of clusters, merging some together.
2. Selecting min_samples. The larger the value of min_samples you provide, the more conservative the clustering – more points will be declared as noise, and clusters will be restricted to progressively more dense areas.
clusterer = hdbscan.HDBSCAN(min_cluster_size=60, min_samples=1).fit(data)
color_palette = sns.color_palette('Paired', 12)
cluster_colors = [color_palette[x] if x >= 0
                  else (0.5, 0.5, 0.5)
                  for x in clusterer.labels_]
cluster_member_colors = [sns.desaturate(x, p) for x, p in
                         zip(cluster_colors, clusterer.probabilities_)]
plt.scatter(*projection.T, s=50, linewidth=0, c=cluster_member_colors, alpha=0.25)

No comments:

Post a Comment

AWS Study notes: step function vs lambda function

Step functions == coordinator(project manager) lambda function == the main worker for each task(project developer). Step functions are ...