Nearest neighbor classification

For use with this dataset, we will introduce a new classifier: the nearest neighbor classifier. The nearest neighbor classifier is very simple. When classifying a new element, it searches the training data for the object that is closest to it, its nearest neighbor, and returns that object's label as the answer. Notice that this model performs perfectly on its training data! For each point, its closest neighbor is itself, and so its label matches perfectly (unless two examples with different labels have exactly the same feature values, which would indicate that the features you are using are not very descriptive). Therefore, it is essential to test the classifier using a cross-validation protocol.
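
To make the procedure concrete, here is a minimal pure-Python sketch of a one-nearest-neighbor classifier. The function name `nearest_neighbor_predict`, the choice of Euclidean distance, and the toy data are all illustrative assumptions; this is not scikit-learn's implementation and not the dataset used in this chapter.

```python
import math


def nearest_neighbor_predict(train_features, train_labels, new_point):
    """Return the label of the training example closest to new_point."""
    best_label = None
    best_dist = math.inf
    for example, label in zip(train_features, train_labels):
        # Euclidean distance (an assumption; other metrics are possible)
        d = math.dist(example, new_point)
        if d < best_dist:
            best_dist = d
            best_label = label
    return best_label


# Toy data, made up for illustration only
train_features = [(1.0, 1.0), (1.2, 0.8), (4.0, 4.2), (3.8, 4.0)]
train_labels = ['a', 'a', 'b', 'b']

# Classify the training points themselves: each point is its own
# nearest neighbor (distance 0), so every prediction is correct.
train_predictions = [
    nearest_neighbor_predict(train_features, train_labels, p)
    for p in train_features
]
train_accuracy = sum(
    p == t for p, t in zip(train_predictions, train_labels)
) / len(train_labels)
print('Training accuracy: {:.1%}'.format(train_accuracy))
```

Because the toy data contains no duplicated points with conflicting labels, the training accuracy here is 100 percent, which is exactly why training accuracy tells us nothing about how this classifier generalizes.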

The nearest neighbor method can be generalized to look not at a single neighbor but at several, taking a vote among them. This makes the method more robust to outliers and mislabeled data.
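
The voting idea can be sketched in a few lines of pure Python. The helper `knn_predict`, the Euclidean distance, and the toy data (which deliberately includes one mislabeled point) are hypothetical and only meant to illustrate why a vote can help.

```python
import math
from collections import Counter


def knn_predict(train_features, train_labels, new_point, k=3):
    """Return the majority label among the k closest training examples."""
    # Indices of the training examples, sorted by distance to new_point
    order = sorted(
        range(len(train_features)),
        key=lambda i: math.dist(train_features[i], new_point),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    # most_common(1) returns the label with the most votes; ties are
    # broken by which label was counted first (the closer neighbor).
    return votes.most_common(1)[0][0]


# Toy data: the point (1.1, 0.9) lies in the 'a' cluster
# but is (deliberately) mislabeled as 'b'.
train_features = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (1.1, 0.9),
                  (4.0, 4.2), (3.8, 4.0)]
train_labels = ['a', 'a', 'a', 'b', 'b', 'b']

query = (1.1, 0.95)
label_k1 = knn_predict(train_features, train_labels, query, k=1)
label_k3 = knn_predict(train_features, train_labels, query, k=3)
print('k=1:', label_k1)
print('k=3:', label_k3)
```

With a single neighbor, the query inherits the label of the mislabeled point; with three neighbors, the two correctly labeled points nearby outvote it.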

To use scikit-learn's implementation of nearest neighbor classification, we start by importing the KNeighborsClassifier class from the sklearn.neighbors submodule:

from sklearn.neighbors import KNeighborsClassifier  

We can now instantiate a classifier object. In the constructor, we specify the number of neighbors to consider, as follows:

knn = KNeighborsClassifier(n_neighbors=1)  

If we do not specify the number of neighbors, it defaults to 5, which is often a good choice for classification, but we stick with 1 as it's very easy to think about (in the online repository, you can play around with parameters such as these).
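
As a quick check, the value a classifier will actually use can be read back from its `n_neighbors` attribute. The variable name `default_knn` below is just an illustrative choice.

```python
from sklearn.neighbors import KNeighborsClassifier

# Constructed without arguments, the classifier uses the default
default_knn = KNeighborsClassifier()
print(default_knn.n_neighbors)

# Constructed as in the text, it uses a single neighbor
knn = KNeighborsClassifier(n_neighbors=1)
print(knn.n_neighbors)
```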

We will use cross-validation (of course) to evaluate the classifier on our data. The scikit-learn module also makes this easy:

import numpy as np
from sklearn import model_selection

# `features` and `target` are the NumPy arrays loaded earlier
kf = model_selection.KFold(n_splits=5, shuffle=False)
means = []
for training,testing in kf.split(features):
    # We learn a model for this fold with `fit` and then apply it to the
    # testing data with `predict`:
    knn.fit(features[training], target[training])
    prediction = knn.predict(features[testing])

    # np.mean on an array of booleans returns fraction
    # of correct decisions for this fold:
    curmean = np.mean(prediction == target[testing])
    means.append(curmean)
print('Mean accuracy: {:.1%}'.format(np.mean(means)))

Using five-fold cross-validation, this algorithm obtains 83.8 percent accuracy on this dataset. As we discussed in the earlier section, the cross-validation accuracy is lower than the training accuracy, but it is a more credible estimate of the performance of the model.
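
The explicit loop can also be written more compactly with scikit-learn's `cross_val_score` helper. The sketch below rests on assumptions: the chapter's `features` and `target` arrays are not reproduced here, so it uses the Iris dataset bundled with scikit-learn as a stand-in, and it shuffles the folds because that dataset is stored ordered by class. The accuracy it prints is therefore not the figure quoted above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): Iris, not the dataset used in the text
features, target = load_iris(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=1)
# Shuffling with a fixed seed keeps the folds reproducible
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Returns one accuracy value per fold; for classifiers the default
# score is the fraction of correct predictions on the held-out fold.
scores = cross_val_score(knn, features, target, cv=kf)
print('Mean accuracy: {:.1%}'.format(scores.mean()))
```

The explicit loop remains useful when you want to inspect each fold's predictions; the helper is convenient when only the per-fold scores are needed.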