- Python Machine Learning Cookbook (Second Edition)
- Giuseppe Ciaburro, Prateek Joshi
- 2021-06-24 15:41:11
How to do it...
Let's see how to automatically estimate the number of clusters using the DBSCAN algorithm:
- The full code for this recipe is given in the estimate_clusters.py file that has already been provided to you. Now let's look at how it's built. Create a new Python file, and import the necessary packages:
from itertools import cycle
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
import matplotlib.pyplot as plt
- Load the input data from the data_perf.txt file. This is the same file that we used in the previous recipe, which will help us to compare the methods on the same dataset:
# Load data
input_file = 'data_perf.txt'
x = []
with open(input_file, 'r') as f:
    for line in f.readlines():
        data = [float(i) for i in line.split(',')]
        x.append(data)
X = np.array(x)
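The same load can be sketched more compactly with `np.loadtxt`, assuming every line of the file holds comma-separated numeric values and there is no header row. The temporary file and the `X_demo` name below are hypothetical stand-ins, not the book's `data_perf.txt`:

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for data_perf.txt: three rows of comma-separated floats
sample = "1.0,2.0\n3.5,4.5\n5.0,6.0\n"

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(sample)
    path = f.name

try:
    # delimiter=',' splits each line on commas, like line.split(',') in the recipe
    X_demo = np.loadtxt(path, delimiter=",")
finally:
    os.remove(path)

print(X_demo.shape)  # (3, 2)
```

Either approach yields a two-dimensional NumPy array with one row per datapoint.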
- We need to find the best parameter, so let's initialize a few variables:
# Find the best epsilon
eps_grid = np.linspace(0.3, 1.2, num=10)
silhouette_scores = []
eps_best = eps_grid[0]
silhouette_score_max = -1
model_best = None
labels_best = None
- Let's sweep the parameter space:
for eps in eps_grid:
    # Train DBSCAN clustering model
    model = DBSCAN(eps=eps, min_samples=5).fit(X)
    # Extract labels
    labels = model.labels_
- For each iteration, we need to extract the performance metric:
    # Extract performance metric
    silhouette_score = round(metrics.silhouette_score(X, labels), 4)
    silhouette_scores.append(silhouette_score)
    print("Epsilon:", eps, " --> silhouette score:", silhouette_score)
- We need to store the best score and its associated epsilon value:
    if silhouette_score > silhouette_score_max:
        silhouette_score_max = silhouette_score
        eps_best = eps
        model_best = model
        labels_best = labels
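Note that `metrics.silhouette_score` raises a `ValueError` when the labels contain fewer than two distinct values, which can happen if an epsilon value puts every point into a single cluster or marks every point as noise. The following is a minimal sketch of the same sweep with a guard for that case. The `sweep_eps` helper and the synthetic `make_blobs` data are illustrative assumptions, not part of the recipe's `estimate_clusters.py` file:

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs


def sweep_eps(X, eps_grid, min_samples=5):
    """Hypothetical helper: return (best_eps, best_score, best_labels)."""
    best_eps, best_score, best_labels = None, -1.0, None
    for eps in eps_grid:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit(X).labels_
        n_labels = len(set(labels))
        # silhouette_score needs between 2 and n_samples - 1 distinct labels
        if n_labels < 2 or n_labels > len(X) - 1:
            continue
        score = metrics.silhouette_score(X, labels)
        if score > best_score:
            best_eps, best_score, best_labels = eps, score, labels
    return best_eps, best_score, best_labels


# Synthetic stand-in data: three well-separated blobs (not data_perf.txt)
X_demo, _ = make_blobs(n_samples=150, centers=[[0, 0], [5, 5], [0, 5]],
                       cluster_std=0.4, random_state=0)
eps_best, score_best, labels_best = sweep_eps(X_demo, np.linspace(0.3, 1.2, num=10))
print("Best epsilon =", eps_best, "score =", round(score_best, 4))
```

As in the recipe, the noise label `-1` is scored as if it were a cluster of its own, so scores for epsilon values that leave many points unassigned should be read with some caution.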
- Let's now plot the bar graph, as follows:
# Plot silhouette scores vs epsilon
plt.figure()
plt.bar(eps_grid, silhouette_scores, width=0.05, color='k', align='center')
plt.title('Silhouette score vs epsilon')
# Best params
print("Best epsilon =", eps_best)
- Let's store the best models and labels:
# Associated model and labels for best epsilon
model = model_best
labels = labels_best
- Some datapoints may remain unassigned. We need to identify them, as follows:
# Check for unassigned datapoints in the labels
offset = 0
if -1 in labels:
    offset = 1
- Extract the number of clusters, as follows:
# Number of clusters in the data
num_clusters = len(set(labels)) - offset
print("Estimated number of clusters =", num_clusters)
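To see why the offset is needed, here is a small standalone sketch on a hand-written label array. The values in `labels_demo` are made up for illustration; DBSCAN uses `-1` for points it leaves unassigned, so that label must not be counted as a cluster:

```python
import numpy as np

# Toy label array in the style of DBSCAN's labels_ attribute (-1 marks noise)
labels_demo = np.array([0, 0, 1, -1, 2, 1, -1])

# Drop the noise label before counting distinct clusters
num_clusters_demo = len(set(labels_demo.tolist()) - {-1})
num_noise_demo = int(np.sum(labels_demo == -1))

print("Clusters:", num_clusters_demo)    # Clusters: 3
print("Noise points:", num_noise_demo)   # Noise points: 2
```

Removing `-1` from the set gives the same count as subtracting the offset, and it also works unchanged when there are no unassigned datapoints.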
- We need to extract all the core samples, as follows:
# Extract the core samples from the trained model
mask_core = np.zeros(labels.shape, dtype=bool)
mask_core[model.core_sample_indices_] = True
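The core-sample mask can be checked on a dataset small enough to reason about by hand. The sketch below uses seven made-up points and hypothetical `_demo` names; the `eps` and `min_samples` values are chosen only so that each tight group of three points qualifies as core and the isolated point does not:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: two tight groups of three points plus one isolated point
X_demo = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
                   [5.0, 5.0], [5.0, 5.1], [5.1, 5.0],
                   [10.0, 10.0]])

model_demo = DBSCAN(eps=0.5, min_samples=3).fit(X_demo)
labels_demo = model_demo.labels_

# Boolean mask that is True only at the indices of core samples
mask_core_demo = np.zeros(labels_demo.shape, dtype=bool)
mask_core_demo[model_demo.core_sample_indices_] = True

print(labels_demo.tolist())     # [0, 0, 0, 1, 1, 1, -1]
print(mask_core_demo.tolist())  # [True, True, True, True, True, True, False]
```

The isolated point receives the label `-1` and is absent from `core_sample_indices_`, so its mask entry stays `False`.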
- Let's visualize the resultant clusters. We will start by extracting the set of unique labels and specifying different markers:
# Plot resultant clusters
plt.figure()
labels_uniq = set(labels)
markers = cycle('vo^s<>')
- Now let's iterate through the clusters and plot the datapoints using different markers:
for cur_label, marker in zip(labels_uniq, markers):
    # Use black dots for unassigned datapoints
    if cur_label == -1:
        marker = '.'
    # Create mask for the current label
    cur_mask = (labels == cur_label)
    cur_data = X[cur_mask & mask_core]
    plt.scatter(cur_data[:, 0], cur_data[:, 1], marker=marker,
                edgecolors='black', s=96, facecolors='none')
    cur_data = X[cur_mask & ~mask_core]
    plt.scatter(cur_data[:, 0], cur_data[:, 1], marker=marker,
                edgecolors='black', s=32)
plt.title('Data separated into clusters')
plt.show()
- If you run this code, you will get the following output on your Terminal:
Epsilon: 0.3 --> silhouette score: 0.1287
Epsilon: 0.39999999999999997 --> silhouette score: 0.3594
Epsilon: 0.5 --> silhouette score: 0.5134
Epsilon: 0.6 --> silhouette score: 0.6165
Epsilon: 0.7 --> silhouette score: 0.6322
Epsilon: 0.7999999999999999 --> silhouette score: 0.6366
Epsilon: 0.8999999999999999 --> silhouette score: 0.5142
Epsilon: 1.0 --> silhouette score: 0.5629
Epsilon: 1.0999999999999999 --> silhouette score: 0.5629
Epsilon: 1.2 --> silhouette score: 0.5629
Best epsilon = 0.7999999999999999
Estimated number of clusters = 5
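The long decimals such as 0.39999999999999997 are floating-point artifacts of `np.linspace`, not distinct parameter choices. As a hedged sketch, the best-epsilon selection can be reproduced from the scores printed above with `np.argmax`, and the value formatted to one decimal place for readability. The `scores` list is copied from that output, and `best_idx` is a name introduced here for illustration:

```python
import numpy as np

eps_grid = np.linspace(0.3, 1.2, num=10)
# Scores copied from the printed output above
scores = [0.1287, 0.3594, 0.5134, 0.6165, 0.6322,
          0.6366, 0.5142, 0.5629, 0.5629, 0.5629]

best_idx = int(np.argmax(scores))
# Format to one decimal place to hide floating-point rounding noise
print(f"Best epsilon = {eps_grid[best_idx]:.1f}")  # Best epsilon = 0.8
```

`np.argmax` returns the first index of the maximum, which matches the strict `>` comparison used in the sweep loop.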
This will produce the following bar graph:
Let's take a look at the labeled datapoints, along with unassigned datapoints marked by solid points in the following output: