Cross-validation for regression

When we introduced classification, we stressed the importance of cross-validation for checking the quality of our predictions. When performing regression, this is not always done. In fact, we have discussed only the training errors in this chapter so far.

This is a mistake if you want to confidently infer the generalization ability. However, since ordinary least squares is a very simple model, this is often not a very serious mistake. In other words, the amount of overfitting is slight. We should still test this empirically, which we can easily do with scikit-learn.

We will use the Kfold class to build a five-fold cross-validation loop and test the generalization ability of linear regression:

from sklearn.model_selection import KFold, cross_val_predict 
kf = KFold(n_splits=5)
p = cross_val_predict(lr, x, y, cv=kf)
rmse_cv = np.sqrt(mean_squared_error(p, y))
print('RMSE on 5-fold CV: {:.2}'.format(rmse_cv))
RMSE on 5-fold CV: 6.1

With cross-validation, we obtain a more conservative estimate (that is, the error is larger): 6.1. As in the case of classification, the cross-validation estimate is a better estimate of how well we could generalize to predict on unseen data.

Ordinary least squares is fast at learning time and returns a simple model, which is fast at prediction time. However, we are now going to see more advanced methods and why they are sometimes preferable.