How to do it...

Let's see how to evaluate the performance of clustering algorithms:

  1. The full code for this recipe is given in the performance.py file that has already been provided to you. Now let's look at how it's built. Create a new Python file, and import the following packages:
import numpy as np 
import matplotlib.pyplot as plt 
from sklearn import metrics 
from sklearn.cluster import KMeans 
  2. Let's load the input data from the data_perf.txt file that has already been provided to you:
input_file = 'data_perf.txt'

x = []
with open(input_file, 'r') as f:
    for line in f.readlines():
        data = [float(i) for i in line.split(',')]
        x.append(data)

data = np.array(x)
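As a side note, the read loop above can be collapsed into a single NumPy call. Here is a minimal sketch, run on an in-memory sample so it works without the data file; with the real file you would pass 'data_perf.txt' directly:

```python
import io
import numpy as np

# np.loadtxt parses comma-separated numeric rows straight into a
# 2D array, replacing the manual read loop. An in-memory sample
# stands in for data_perf.txt here:
sample = io.StringIO("1.0,2.0\n3.0,4.0\n5.0,6.0\n")
data = np.loadtxt(sample, delimiter=',')

# With the real file:
# data = np.loadtxt('data_perf.txt', delimiter=',')
print(data.shape)
```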
  3. In order to determine the optimal number of clusters, let's iterate through a range of values and see where it peaks:
scores = [] 
range_values = np.arange(2, 10) 
 
for i in range_values: 
    # Train the model 
    kmeans = KMeans(init='k-means++', n_clusters=i, n_init=10) 
    kmeans.fit(data) 
    score = metrics.silhouette_score(data, kmeans.labels_,  
                metric='euclidean', sample_size=len(data)) 
 
    print("Number of clusters =", i)
    print("Silhouette score =", score)

    scores.append(score)
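Once the loop has filled the scores list, the best number of clusters can also be picked programmatically rather than read off the chart. A minimal sketch, using illustrative values (rounded from the output shown later in this recipe) in place of the computed scores:

```python
import numpy as np

# Illustrative silhouette scores for k = 2..9, standing in for
# the values computed in the loop above
range_values = np.arange(2, 10)
scores = [0.529, 0.557, 0.583, 0.658, 0.599, 0.519, 0.449, 0.400]

# The highest silhouette score marks the best configuration
best_k = range_values[np.argmax(scores)]
print("Optimal number of clusters =", best_k)
```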
  4. Now let's plot the graph to see where it peaked:
# Plot scores 
plt.figure() 
plt.bar(range_values, scores, width=0.6, color='k', align='center') 
plt.title('Silhouette score vs number of clusters') 
 
# Plot data 
plt.figure() 
plt.scatter(data[:,0], data[:,1], s=30, marker='o', facecolors='none', edgecolors='k') 
x_min, x_max = min(data[:, 0]) - 1, max(data[:, 0]) + 1 
y_min, y_max = min(data[:, 1]) - 1, max(data[:, 1]) + 1 
plt.title('Input data') 
plt.xlim(x_min, x_max) 
plt.ylim(y_min, y_max) 
plt.xticks(()) 
plt.yticks(()) 
 
plt.show()
  5. If you run this code, you will get the following output on the Terminal:
Number of clusters = 2
Silhouette score = 0.5290397175472954
Number of clusters = 3
Silhouette score = 0.5572466391184153
Number of clusters = 4
Silhouette score = 0.5832757517829593
Number of clusters = 5
Silhouette score = 0.6582796909760834
Number of clusters = 6
Silhouette score = 0.5991736976396735
Number of clusters = 7
Silhouette score = 0.5194660249299737
Number of clusters = 8
Silhouette score = 0.44937089046511863
Number of clusters = 9
Silhouette score = 0.3998899991555578

The bar graph looks like the following:

Based on these scores, the best configuration is five clusters. Let's see what the data actually looks like:

We can visually confirm that the data does, in fact, have five clusters. In this example, we used a small dataset containing five distinct clusters. This method becomes very useful when you are dealing with a huge dataset of high-dimensional data that cannot be visualized easily.
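When visual confirmation is not possible, a second internal validity metric can corroborate the silhouette-based choice. One option available in scikit-learn is the Davies-Bouldin index (lower is better). A minimal sketch on synthetic stand-in data (the cluster centers below are illustrative, not taken from data_perf.txt):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic stand-in data: three well-separated clusters. With the
# recipe's data, replace this with the array loaded from data_perf.txt.
rng = np.random.RandomState(0)
data = np.vstack([rng.randn(40, 3) + c
                  for c in ([0, 0, 0], [10, 0, 0], [0, 10, 0])])

results = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    # Silhouette: higher is better; Davies-Bouldin: lower is better
    results[k] = (silhouette_score(data, labels),
                  davies_bouldin_score(data, labels))

for k, (sil, db) in results.items():
    print(k, round(sil, 3), round(db, 3))
```

Both metrics should agree on the correct number of clusters here: the silhouette score peaks at k=3 while the Davies-Bouldin index bottoms out at the same value.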