Book: Python Machine Learning Cookbook (Second Edition)
Authors: Giuseppe Ciaburro, Prateek Joshi
Updated: 2025-04-04 14:38:11

How to do it...

Let's see how to automatically estimate the number of clusters using the DBSCAN algorithm:

The full code for this recipe is given in the estimate_clusters.py file that has already been provided to you. Now let's look at how it's built. Create a new Python file, and import the necessary packages:

from itertools import cycle 
import numpy as np 
from sklearn.cluster import DBSCAN 
from sklearn import metrics 
import matplotlib.pyplot as plt

Load the input data from the data_perf.txt file. This is the same file that we used in the previous recipe, which will help us to compare the methods on the same dataset:

# Load data
input_file = 'data_perf.txt'

x = []
with open(input_file, 'r') as f:
    for line in f:
        # Skip blank lines so float() never sees an empty string
        if not line.strip():
            continue
        data = [float(i) for i in line.split(',')]
        x.append(data)

X = np.array(x)
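The manual parsing loop can also be replaced by a single NumPy call. The following is a minimal sketch, assuming the file holds one comma-separated point per line; the in-memory `sample` text is a hypothetical stand-in for data_perf.txt, used so that the snippet runs on its own:

```python
import io
import numpy as np

# Hypothetical stand-in for data_perf.txt: one comma-separated point per line.
sample = io.StringIO("1.0,2.0\n3.5,4.5\n5.0,6.0\n")

# np.loadtxt parses every row as floats and returns a 2-D array directly.
# It accepts either a file name or an open file-like object.
X = np.loadtxt(sample, delimiter=',')

print(X.shape)
```

With the real file, passing the file name instead of `sample` should produce the same kind of array as the loop in the recipe.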

We need to find the best value of epsilon (the eps parameter, which sets the neighborhood radius that DBSCAN uses), so let's initialize a few variables:

# Find the best epsilon 
eps_grid = np.linspace(0.3, 1.2, num=10) 
silhouette_scores = [] 
eps_best = eps_grid[0] 
silhouette_score_max = -1 
model_best = None 
labels_best = None
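It can help to inspect the grid before sweeping it. The sketch below shows that np.linspace returns 10 evenly spaced values with both endpoints included, and that some of them carry binary floating-point error, which is why values such as 0.39999999999999997 show up when epsilon is printed. The rounded copy is an assumption added for display only and is not part of the recipe:

```python
import numpy as np

# Same grid as in the recipe: 10 values from 0.3 to 1.2, endpoints included.
eps_grid = np.linspace(0.3, 1.2, num=10)

# Spacing between consecutive candidates (about 0.1).
step = eps_grid[1] - eps_grid[0]

# Rounded copy for readable printing; the unrounded grid is still used for fitting.
eps_rounded = np.round(eps_grid, 2)

print(eps_grid.tolist())
print(eps_rounded.tolist())
```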

Let's sweep the parameter space:

for eps in eps_grid: 
    # Train DBSCAN clustering model 
    model = DBSCAN(eps=eps, min_samples=5).fit(X) 
 
    # Extract labels 
    labels = model.labels_

In each iteration, we compute the silhouette score of the resulting labels as the performance metric:

    # Extract performance metric  
    silhouette_score = round(metrics.silhouette_score(X, labels), 4) 
    silhouette_scores.append(silhouette_score) 
 
    print("Epsilon:", eps, " --> silhouette score:", silhouette_score)

We need to store the best score and its associated epsilon value:

    if silhouette_score > silhouette_score_max: 
        silhouette_score_max = silhouette_score 
        eps_best = eps 
        model_best = model 
        labels_best = labels
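One caveat about the sweep: metrics.silhouette_score raises a ValueError when the labels contain fewer than two distinct values, which can happen if an epsilon value marks every point as noise or merges everything into one cluster. Also note that the score treats the noise label -1 as just another cluster. The following is a minimal sketch of a guarded sweep; the make_blobs data is a synthetic stand-in for data_perf.txt and `score_or_none` is a hypothetical helper, neither of which comes from the recipe:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs

# Synthetic stand-in for data_perf.txt (assumption): three well-separated blobs.
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [5, 5], [10, 0]],
                  cluster_std=0.5, random_state=0)

def score_or_none(X, labels):
    """Return the silhouette score, or None when it is undefined."""
    n_labels = len(set(labels))
    # silhouette_score needs between 2 and n_samples - 1 distinct labels.
    if n_labels < 2 or n_labels > len(X) - 1:
        return None
    return metrics.silhouette_score(X, labels)

best_eps, best_score = None, -1.0
for eps in np.linspace(0.3, 1.2, num=10):
    labels = DBSCAN(eps=eps, min_samples=5).fit(X).labels_
    score = score_or_none(X, labels)
    # Skip epsilon values for which the score cannot be computed.
    if score is not None and score > best_score:
        best_eps, best_score = eps, score

print("Best epsilon:", best_eps, "score:", round(best_score, 4))
```

Returning None instead of a sentinel score keeps undefined results out of the comparison, so they can never be selected as the best epsilon.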

Let's now plot the silhouette scores against epsilon as a bar graph and print the best epsilon, as follows:

# Plot silhouette scores vs epsilon 
plt.figure() 
plt.bar(eps_grid, silhouette_scores, width=0.05, color='k', align='center') 
plt.title('Silhouette score vs epsilon') 
 
# Best params 
print("Best epsilon =", eps_best)

Let's retrieve the model and labels associated with the best epsilon:

# Associated model and labels for best epsilon 
model = model_best  
labels = labels_best

Some datapoints may remain unassigned; DBSCAN marks these noise points with the label -1. We need to check whether any are present, as follows:

# Check for unassigned datapoints in the labels 
offset = 0 
if -1 in labels: 
    offset = 1

Extract the number of clusters, as follows:

# Number of clusters in the data  
num_clusters = len(set(labels)) - offset  
 
print("Estimated number of clusters =", num_clusters)
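To make the offset logic concrete, here is a small worked example on a hand-written label array (hypothetical values, not taken from data_perf.txt). The label -1 is counted by set() like any other value, so it has to be subtracted to get the number of real clusters:

```python
import numpy as np

# Hypothetical DBSCAN output: three clusters (0, 1, 2) and two noise points (-1).
labels = np.array([0, 0, 1, -1, 1, 2, -1, 2])

# The noise label is not a cluster, so remove it from the count if present.
offset = 1 if -1 in labels else 0
num_clusters = len(set(labels)) - offset

# Number of points left unassigned.
num_noise = int(np.sum(labels == -1))

print(num_clusters, num_noise)  # 3 2
```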

We need to extract all the core samples, as follows:

# Extracts the core samples from the trained model 
mask_core = np.zeros(labels.shape, dtype=bool) 
mask_core[model.core_sample_indices_] = True
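The boolean mask makes it easy to split each cluster into core and non-core (border) points later on. The following is a minimal sketch on hand-written values; `core_sample_indices` is a hypothetical stand-in for model.core_sample_indices_ on a fitted model:

```python
import numpy as np

# Hypothetical labels for six points: two clusters and one noise point.
labels = np.array([0, 0, 0, 1, 1, -1])

# Hypothetical stand-in for model.core_sample_indices_.
core_sample_indices = np.array([0, 1, 3])

# True at every position that holds a core sample, False elsewhere.
mask_core = np.zeros(labels.shape, dtype=bool)
mask_core[core_sample_indices] = True

# Combine with a per-cluster mask to separate core and border points.
cluster0 = (labels == 0)
core_in_0 = np.flatnonzero(cluster0 & mask_core)
border_in_0 = np.flatnonzero(cluster0 & ~mask_core)

print(core_in_0.tolist(), border_in_0.tolist())  # [0, 1] [2]
```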

Let's visualize the resultant clusters. We will start by extracting the set of unique labels and specifying different markers:

# Plot resultant clusters  
plt.figure() 
labels_uniq = set(labels) 
markers = cycle('vo^s<>')

Now let's iterate through the clusters and plot the datapoints using different markers:

for cur_label, marker in zip(labels_uniq, markers): 
    # Use black dots for unassigned datapoints 
    if cur_label == -1: 
        marker = '.' 
 
    # Create mask for the current label 
    cur_mask = (labels == cur_label) 
 
    cur_data = X[cur_mask & mask_core] 
    plt.scatter(cur_data[:, 0], cur_data[:, 1], marker=marker,
                edgecolors='black', s=96, facecolors='none')
    cur_data = X[cur_mask & ~mask_core]
    plt.scatter(cur_data[:, 0], cur_data[:, 1], marker=marker,
                edgecolors='black', s=32)
plt.title('Data separated into clusters') 
plt.show()
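The pairing of labels and markers in the loop above can be checked without plotting. The sketch below uses a hypothetical set of eight labels to show that cycle() repeats the six markers once they run out, and that the noise label still consumes a marker from the cycle before it is overridden with '.'. Sorting the labels is an assumption added here to make the pairing deterministic:

```python
from itertools import cycle

# Hypothetical unique labels: noise (-1) plus seven clusters.
labels_uniq = sorted({-1, 0, 1, 2, 3, 4, 5, 6})
markers = cycle('vo^s<>')

# Same pairing rule as the plotting loop: noise always gets '.'.
pairs = [(label, '.' if label == -1 else marker)
         for label, marker in zip(labels_uniq, markers)]
marker_for = dict(pairs)

print(marker_for)
```

With more than six labels, some clusters share a marker (here clusters 0 and 6 both get 'o'), so a longer marker string may be preferable for datasets with many clusters.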

If you run this code, you will get the following output in your terminal:

Epsilon: 0.3 --> silhouette score: 0.1287
Epsilon: 0.39999999999999997 --> silhouette score: 0.3594
Epsilon: 0.5 --> silhouette score: 0.5134
Epsilon: 0.6 --> silhouette score: 0.6165
Epsilon: 0.7 --> silhouette score: 0.6322
Epsilon: 0.7999999999999999 --> silhouette score: 0.6366
Epsilon: 0.8999999999999999 --> silhouette score: 0.5142
Epsilon: 1.0 --> silhouette score: 0.5629
Epsilon: 1.0999999999999999 --> silhouette score: 0.5629
Epsilon: 1.2 --> silhouette score: 0.5629
Best epsilon = 0.7999999999999999
Estimated number of clusters = 5

This will produce the following bar graph:

Let's take a look at the labeled datapoints in the following output, where unassigned datapoints are marked by solid points: