Looking at the decision boundaries

We will now examine the decision boundaries. In order to plot these on paper, we will simplify and look at only two dimensions:

knn.fit(features[:, [0,2]], target)  

We will call predict on a grid of feature values (1000 by 1000 points):

y0, y1 = features[:, 2].min() * .9, features[:, 2].max() * 1.1 
x0, x1 = features[:, 0].min() * .9, features[:, 0].max() * 1.1 
X = np.linspace(x0, x1, 1000) 
Y = np.linspace(y0, y1, 1000) 
X, Y = np.meshgrid(X, Y) 
C = knn.predict(np.vstack([X.ravel(), Y.ravel()]).T).reshape(X.shape)  
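To see what the stacking and reshaping in the last line do, here is a minimal sketch on a tiny 3-by-4 grid. The `toy_predict` function is a hypothetical stand-in for `knn.predict`, used only so that the example runs without a fitted model.

```python
import numpy as np

# Tiny grid: 4 x-values and 3 y-values
xs = np.linspace(0.0, 3.0, 4)
ys = np.linspace(0.0, 2.0, 3)
X, Y = np.meshgrid(xs, ys)            # both arrays have shape (3, 4)

# One row per grid point, one column per feature
grid_points = np.vstack([X.ravel(), Y.ravel()]).T   # shape (12, 2)

def toy_predict(points):
    # Hypothetical stand-in for knn.predict: class 1 where x > y, else class 0
    return (points[:, 0] > points[:, 1]).astype(int)

# Predictions come back flat; reshape so C[i, j] is the label at (X[i, j], Y[i, j])
C = toy_predict(grid_points).reshape(X.shape)
print(C)
```

The same pattern scales to the 1000 by 1000 grid: the classifier sees one million two-feature rows, and the reshaped result lines up with the grid for plotting.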

Now, we plot the decision boundaries:

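from matplotlib import pyplot as plt
from matplotlib.colors import ListedColormap

# One shade per class: white, near-black, and grey
cmap = ListedColormap([(1., 1., 1.), (.2, .2, .2), (.6, .6, .6)])

fig, ax = plt.subplots()
ax.scatter(features[:, 0], features[:, 2], c=target, cmap=cmap)
# Overlay each class with its own marker: diamond, circle, triangle
for lab, ma in zip(range(3), "Do^"):
    ax.plot(features[target == lab, 0], features[target == lab, 2],
            ma, c=(1., 1., 1.), ms=6)

ax.set_xlim(x0, x1)
ax.set_ylim(y0, y1)
ax.set_xlabel(feature_names[0])
ax.set_ylabel(feature_names[2])
# Shade the grid by predicted class to reveal the decision regions
ax.pcolormesh(X, Y, C, cmap=cmap)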

This is what the results look like:

Canadian examples are shown as diamonds, Kama seeds as circles, and Rosa seeds as triangles. Their respective areas are shown as white, black, and grey. You might be wondering why the regions are so horizontal, almost weirdly so. The problem is that the x axis (area) ranges from 10 to 22, while the y axis (compactness) ranges from 0.75 to 1.0. This means that a typical difference in x is numerically much larger than a typical difference in y. So, when we compute the distance between points, we are, for the most part, only taking the x axis into account. This is also a good example of why it is a good idea to visualize our data and look for red flags or surprises.
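A quick calculation makes the imbalance concrete. The two points below are made-up values chosen to fall inside the ranges just quoted; they are not rows from the dataset.

```python
import numpy as np

# Hypothetical (area, compactness) pairs within the ranges quoted above
a = np.array([14.0, 0.80])
b = np.array([16.0, 0.95])

diff = a - b
squared = diff ** 2                 # per-feature contribution to the squared distance
distance = np.sqrt(squared.sum())   # Euclidean distance between the two points
share = squared / squared.sum()     # fraction of the squared distance due to each feature

print(distance)
print(share)
```

The compactness difference covers more than half of that feature's range, while the area difference covers only about a sixth of its range. Even so, area accounts for over 99 percent of the squared distance.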

If you studied physics (and you remember your lessons), you might have already noticed that we have been summing up lengths, areas, and dimensionless quantities, mixing up our units (which is something you never want to do in a physical system). We need to normalize all of the features to a common scale. There are many solutions to this problem; a simple one is to normalize to z-scores. The z-score of a value is how far away from the mean it is, in units of standard deviation. It comes down to this operation:

f' = (f - μ) / σ

In this formula, f is the old feature value, f' is the normalized feature value, μ is the mean of the feature, and σ is the standard deviation. Both μ and σ are estimated from training data. Independent of what the original values were, after z-scoring, a value of zero corresponds to the training mean, positive values are above the mean, and negative values are below it.
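Here is a minimal sketch of the formula in NumPy, with the mean and standard deviation estimated on training data only and then reused for new data. The arrays are made-up values, and the helper names `fit_zscore` and `apply_zscore` are illustrative rather than part of any library.

```python
import numpy as np

def fit_zscore(train):
    # Estimate the per-feature mean and standard deviation from training data
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)   # population standard deviation (ddof=0)
    return mu, sigma

def apply_zscore(data, mu, sigma):
    # f' = (f - mu) / sigma, applied column by column
    return (data - mu) / sigma

# Made-up training data: column 0 is on a large scale, column 1 on a small one
train = np.array([[10.0, 0.80],
                  [14.0, 0.85],
                  [18.0, 0.90],
                  [22.0, 0.95]])
new = np.array([[16.0, 0.875]])   # sits at the training mean of both features

mu, sigma = fit_zscore(train)
train_z = apply_zscore(train, mu, sigma)
new_z = apply_zscore(new, mu, sigma)

print(train_z.mean(axis=0))   # approximately zero for every feature
print(train_z.std(axis=0))    # approximately one for every feature
print(new_z)                  # approximately zero: the point is at the mean
```

After the transformation the two columns are directly comparable, even though their original scales differed by more than an order of magnitude.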

The scikit-learn module makes it very easy to use this normalization as a preprocessing step. We are going to use a pipeline of transformations: the first element will do the transformation and the second element will do the classification. We start by importing both the pipeline and the feature scaling classes as follows:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler  

Now, we can combine them:

classifier = KNeighborsClassifier(n_neighbors=1)
clf = Pipeline([('norm', StandardScaler()), ('knn', classifier)])

The Pipeline constructor takes a list of pairs, (str, clf). Each pair corresponds to a step in the pipeline: the first element is a string naming the step, while the second element is the object that performs the transformation. More advanced uses of the pipeline refer to individual steps by these names.
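As a hedged sketch of how those names get used: scikit-learn exposes each step through the pipeline's `named_steps` attribute and lets a step's parameters be addressed as `<step name>__<parameter>`. The tiny dataset below is made up purely so that the example runs.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: two well-separated classes, features on very different scales
X_toy = np.array([[10.0, 0.80], [11.0, 0.82], [12.0, 0.81],
                  [20.0, 0.95], [21.0, 0.97], [22.0, 0.96]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = Pipeline([('norm', StandardScaler()),
                ('knn', KNeighborsClassifier(n_neighbors=1))])
clf.fit(X_toy, y_toy)

# Look up a fitted step by its name
scaler = clf.named_steps['norm']
print(scaler.mean_)       # per-feature means learned from X_toy

# Change a parameter of one step using the <name>__<parameter> convention
clf.set_params(knn__n_neighbors=3)
clf.fit(X_toy, y_toy)
print(clf.predict(np.array([[11.5, 0.81]])))
```

The double-underscore convention is what makes it possible to tune a parameter of an inner step without rebuilding the pipeline by hand.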

After normalization, every feature is in the same units (technically, every feature is now dimensionless; it has no units) and we can more confidently mix dimensions. In fact, if we now run our nearest neighbor classifier, we obtain 86 percent accuracy, estimated with the same five-fold cross-validation code shown previously!
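The effect can be reproduced on synthetic data. The sketch below does not use the seeds dataset, so its accuracies are unrelated to the figure quoted above. It builds a made-up two-feature problem in which the informative feature has a tiny scale and a pure-noise feature has a large one, then compares five-fold accuracy with and without the scaling step. It relies on scikit-learn's `cross_val_score` helper, which may differ from the cross-validation code shown earlier.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)

# Informative feature on a tiny scale; uninformative feature on a huge scale
informative = 0.01 * y + rng.normal(0.0, 0.002, size=n)
noise = rng.normal(0.0, 100.0, size=n)
X = np.column_stack([informative, noise])

cv = KFold(n_splits=5, shuffle=True, random_state=0)

raw = KNeighborsClassifier(n_neighbors=1)
scaled = Pipeline([('norm', StandardScaler()),
                   ('knn', KNeighborsClassifier(n_neighbors=1))])

# Mean accuracy over the five folds, without and with normalization
raw_acc = cross_val_score(raw, X, y, cv=cv).mean()
scaled_acc = cross_val_score(scaled, X, y, cv=cv).mean()

print("without scaling:", raw_acc)
print("with scaling:   ", scaled_acc)
```

Because the scaler sits inside the pipeline, it is refit on the training portion of each fold, so the held-out fold never influences the estimates of μ and σ.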

Look at the decision space again in two dimensions:

The boundaries are now different and you can see that both dimensions make a difference to the outcome. In the full dataset, everything happens in a seven-dimensional space, which is very hard to visualize, but the same principle applies: while a few dimensions are dominant in the original data, after normalization, they are all given the same importance.