The algorithm starts with an arbitrary point that has not been visited, and its neighborhood is retrieved using the ϵ parameter.
If the ϵ-neighborhood of this point contains at least MinPts points, cluster formation starts. Otherwise the point is labeled as noise. Such a point may later be found within the ϵ-neighborhood of a different core point and thus become part of a cluster. The concepts of density-reachable and density-connected points are important here.
If a point is found to be a core point, then the points within its ϵ-neighborhood are also part of the cluster. So all points found within the ϵ-neighborhood are added, along with their own ϵ-neighborhoods if they are also core points.
The above process continues until the density-connected cluster is completely found.
The process then restarts from a new unvisited point, which either seeds a new cluster or is labeled as noise.
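The procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the scikit-learn or hdbscan implementation: the names `region_query`, `dbscan`, `eps`, and `min_pts` are chosen here for the example, and the neighborhood search is a brute-force scan that assumes the point itself counts toward MinPts.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        # points[i] is a core point: start a new cluster and expand it
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # former noise becomes a border point
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point, keep expanding
    return labels


# Two tight groups of four points and one isolated point (made-up data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point never gathers enough neighbors, so it keeps the noise label -1.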
The notes are taken from the notebook.
    #import package and setup the display
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    import sklearn.datasets as data
    %matplotlib inline
    sns.set_context('poster')
    sns.set_style('white')
    sns.set_color_codes()
    plot_kwds = {'alpha' : 0.5, 's' : 80, 'linewidths':0}

    #load the data:
    moons, _ = data.make_moons(n_samples=50, noise=0.05)
    blobs, _ = data.make_blobs(n_samples=50, centers=[(-0.75,2.25), (1.0, 2.0)], cluster_std=0.25)
    test_data = np.vstack([moons, blobs])
    plt.scatter(test_data.T[0], test_data.T[1], color='b', **plot_kwds)

Time to import the hdbscan package and run the hierarchical clustering algorithm.

    import hdbscan
    clusterer = hdbscan.HDBSCAN(min_cluster_size=5, gen_min_span_tree=True)
    clusterer.fit(test_data)

So now that we have clustered the data, what actually happened? We can break it out into a series of steps:
Transform the space according to the density/sparsity.
Build the minimum spanning tree of the distance weighted graph.
Construct a cluster hierarchy of connected components.
Condense the cluster hierarchy based on minimum cluster size.
Extract the stable clusters from the condensed tree.
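Steps 2 and 3 can be illustrated with a small, library-free sketch. It assumes a dense symmetric distance matrix as input and uses Prim's algorithm, which is a choice made for this example and not necessarily what the hdbscan package does internally. The names `minimum_spanning_tree` and `components_at` are hypothetical helpers. HDBSCAN would use mutual reachability distances as edge weights; plain distances between points on a line are used here to keep the example short.

```python
def minimum_spanning_tree(dist):
    """Prim's algorithm on a dense, symmetric distance matrix.

    Returns a list of (i, j, weight) edges, n - 1 of them for n points.
    """
    n = len(dist)
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known edge linking each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # pick the node outside the tree that is cheapest to attach
        u = min((k for k in range(n) if not in_tree[k]), key=lambda k: best[k])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, dist[parent[u]][u]))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges


def components_at(edges, n, threshold):
    """Count connected components using only tree edges with weight <= threshold."""
    root = list(range(n))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]  # path halving
            x = root[x]
        return x

    count = n
    for i, j, w in edges:
        if w <= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                root[rj] = ri
                count -= 1
    return count


# Four points on a line: three close together and one far away (made-up data).
xs = [0, 1, 2, 10]
dist = [[abs(a - b) for b in xs] for a in xs]
mst = minimum_spanning_tree(dist)
print(mst)                          # [(0, 1, 1), (1, 2, 1), (2, 3, 8)]
print(components_at(mst, 4, 0.5))   # 4
print(components_at(mst, 4, 1))     # 2
print(components_at(mst, 4, 8))     # 1
```

Raising the threshold merges components one tree edge at a time, which is the hierarchy of connected components that step 3 refers to.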
The core distance of a point x, for a parameter k, is the distance from x to its k-th nearest neighbor, denoted core_k(x). In other words, draw the smallest circle around x that covers its k nearest data points; the radius of that circle is the core distance.
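Mutual reachability distance: given two points a and b, each with its own core-distance circle, take the largest of the two core distances and the distance between the points, that is, d_mreach-k(a, b) = max{core_k(a), core_k(b), d(a, b)}.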
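A small worked example may help make these two definitions concrete. The sketch below is plain Python with brute-force distance computations; the function names `core_distances` and `mutual_reachability` are made up for this illustration, and it assumes the point itself is not counted among its k nearest neighbors (libraries differ on this convention).

```python
import math


def core_distances(points, k):
    """Distance from each point to its k-th nearest neighbor (the point itself excluded)."""
    result = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points)
        result.append(dists[k])  # dists[0] == 0.0 is the point itself
    return result


def mutual_reachability(points, k):
    """Matrix of max(core_k(a), core_k(b), d(a, b)) for every pair of points."""
    core = core_distances(points, k)
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]  # diagonal is left at 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                d = math.dist(points[i], points[j])
                matrix[i][j] = max(core[i], core[j], d)
    return matrix


# Three nearby points and one distant point (made-up data).
points = [(0, 0), (1, 0), (2, 0), (10, 0)]
core = core_distances(points, k=2)
mreach = mutual_reachability(points, k=2)
print(core)          # [2.0, 1.0, 2.0, 9.0]
print(mreach[0][1])  # 2.0
print(mreach[2][3])  # 9.0
```

The points at x = 0 and x = 1 are only 1.0 apart, but their mutual reachability distance is 2.0 because the core distance of the first point is larger. The distant point has a large core distance, so it is pushed away from everything else. This is the sense in which the space is transformed according to density.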
    clusterer.minimum_spanning_tree_.plot(edge_cmap='viridis', edge_alpha=0.6, node_size=80, edge_linewidth=2)

    #Any point not in a selected cluster is simply a noise point (assigned the label -1)
    palette = sns.color_palette()
    cluster_colors = [sns.desaturate(palette[col], sat) if col >= 0 else (0.5, 0.5, 0.5)
                      for col, sat in zip(clusterer.labels_, clusterer.probabilities_)]
    plt.scatter(test_data.T[0], test_data.T[1], c=cluster_colors, **plot_kwds)

Parameter Selection for HDBSCAN
1. Selecting min_cluster_size: the smallest size grouping that you wish to consider a cluster.
    #imports needed by this example
    from sklearn import datasets
    from sklearn.manifold import TSNE

    digits = datasets.load_digits()
    data = digits.data  #note: rebinds `data`, which earlier aliased sklearn.datasets
    projection = TSNE().fit_transform(data)
    plt.scatter(*projection.T, **plot_kwds)

    #start with a min_cluster_size of 15
    clusterer = hdbscan.HDBSCAN(min_cluster_size=15).fit(data)
    color_palette = sns.color_palette('Paired', 12)
    cluster_colors = [color_palette[x] if x >= 0 else (0.5, 0.5, 0.5)
                      for x in clusterer.labels_]
    cluster_member_colors = [sns.desaturate(x, p)
                             for x, p in zip(cluster_colors, clusterer.probabilities_)]
    plt.scatter(*projection.T, s=50, linewidth=0, c=cluster_member_colors, alpha=0.25)

    #Increasing the min_cluster_size to 30
    #reduces the number of clusters, merging some together.

2. Selecting min_samples: the larger the value of min_samples you provide, the more conservative the clustering. More points will be declared as noise, and clusters will be restricted to progressively more dense areas.
    clusterer = hdbscan.HDBSCAN(min_cluster_size=60, min_samples=1).fit(data)
    color_palette = sns.color_palette('Paired', 12)
    cluster_colors = [color_palette[x] if x >= 0 else (0.5, 0.5, 0.5)
                      for x in clusterer.labels_]
    cluster_member_colors = [sns.desaturate(x, p)
                             for x, p in zip(cluster_colors, clusterer.probabilities_)]
    plt.scatter(*projection.T, s=50, linewidth=0, c=cluster_member_colors, alpha=0.25)