Summary

Classification means generalizing from examples to build a model that assigns objects to one of a set of predefined classes (that is, a rule that can automatically be applied to new, unclassified objects). It is one of the fundamental tools in machine learning, and we will look at many more examples of classification in the forthcoming chapters.

In a way, this was a very abstract and theoretical chapter, as we introduced generic concepts with simple examples. We went over a few operations with the Iris dataset. This is a small dataset, but it has the advantage that we were able to plot all the data and see what we were doing in detail. That ability will be lost when we move on to problems with many dimensions and many thousands of examples, but the insight we gained here will still be valid.

You also learned that the training error is a misleading, over-optimistic estimate of how well the model will do on new data. We must instead evaluate it on testing data that has not been used for training. To avoid wasting too many examples on testing, a cross-validation schedule can get us the best of both worlds (at the cost of more computation).
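As a reminder of what this looks like in practice, here is a minimal sketch that compares training accuracy with cross-validated accuracy on the Iris dataset. It assumes scikit-learn is installed; the decision tree is only a stand-in classifier chosen for illustration, and the choice of five folds is likewise an assumption rather than a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Load the Iris measurements and their species labels
data = load_iris()
features, labels = data.data, data.target

# Stand-in classifier (an assumption made for this sketch)
model = DecisionTreeClassifier(random_state=0)

# Training accuracy: fit and score on the very same examples
model.fit(features, labels)
train_accuracy = model.score(features, labels)

# 5-fold cross-validation: each example is used for testing exactly once,
# and never by a model that saw it during training
cv_scores = cross_val_score(model, features, labels, cv=5)
cv_accuracy = cv_scores.mean()

print(f"Training accuracy:         {train_accuracy:.3f}")
print(f"Cross-validation accuracy: {cv_accuracy:.3f}")
```

The training figure is typically the higher of the two, which is exactly why it should not be reported as the model's expected performance. The price of cross-validation is that the model is fit once per fold.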

Finally, we discussed what is often the best off-the-shelf classifier: random forests. A random forest is a simple-to-use classification system that is very flexible (requiring little preprocessing of the data) and achieves very high performance in a wide range of problems.
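To recall how little setup this needs, the following sketch evaluates a random forest on the raw, unscaled Iris features using scikit-learn's `RandomForestClassifier`. The number of trees, the fixed random seed, and the five-fold split are assumptions made for the sake of a reproducible illustration, not tuned values.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Raw features, used as-is: no scaling or other preprocessing
data = load_iris()
features, labels = data.data, data.target

# n_estimators and random_state are illustrative choices
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Evaluate on held-out folds rather than on the training data
scores = cross_val_score(forest, features, labels, cv=5)
mean_accuracy = scores.mean()

print(f"Mean cross-validated accuracy: {mean_accuracy:.3f}")
```

Swapping in a different dataset usually means changing only the two lines that load the features and labels, which is a large part of the method's appeal.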

In Chapter 3, Regression, we will dive deep into scikit-learn (the marvelous machine learning toolkit), give an overview of different types of learning, and show you the beauty of feature engineering.