An example based on text documents

We will now turn to an example that comes from a study performed at Carnegie Mellon University by Prof. Noah Smith's research group. The study was based on mining the so-called 10-K reports that companies file with the Securities and Exchange Commission (SEC) in the United States. This filing is mandated by law for all publicly traded companies. The goal of their study was to predict, based on this piece of public information, what the future volatility of the company's stock would be. In the training data, we are actually using historical data for which we already know the outcome.

There are 16,087 examples available. The features, which have already been preprocessed for us, correspond to different words, 150,360 in total. Thus, we have many more features than examples, almost ten times as many. In the introduction, it was stated that ordinary least regression is useful in these cases, and we will now see why by attempting to blindly apply it.

The dataset is available in SVMLight format from multiple sources, including the book's companion website. This is a format that scikit-learn can read. SVMLight is, as the name says, a support vector machine implementation, which is also available through scikit-learn; right now, we are only interested in the file format:

from sklearn.datasets import load_svmlight_file 
data,target = load_svmlight_file('E2006.train')

In the preceding code, data is a sparse matrix (that is, most of its entries are zeros and, therefore, only the nonzero entries are saved in memory), while the target is a simple one-dimensional vector. We can start by looking at some attributes of the target:

print('Min target value: {}'.format(target.min())) 
Min target value: -7.89957807347

print('Max target value: {}'.format(target.max()))
Max target value: -0.51940952694

print('Mean target value: {}'.format(target.mean()))
Mean target value: -3.51405313669

print('Std. dev. target: {}'.format(target.std()))
Std. dev. target: 0.632278353911

So, we can see that the data lies between -7.9 and -0.5. Now that we have a feel for the data, we can check what happens when we use OLS to predict. Note that we can use exactly the same classes and methods as we did earlier:

from sklearn.linear_model import LinearRegression 
lr = LinearRegression()
lr.fit(data,target)
pred = lr.predict(data)
rmse_train = np.sqrt(mean_squared_error(target, pred))
print('RMSE on training: {:.2}'.format(rmse_train))
RMSE on training: 0.0024

print('R2 on training: {:.2}'.format(r2_score(target, pred)))
R2 on training: 1.0

The root mean squared error is not exactly zero because of rounding errors, but it is very close. The coefficient of determination is 1.0. That is, the linear model is reporting a perfect prediction on its training data. This is what we expected.

However, when we use cross-validation (the code is very similar to what we used earlier in the Boston example), we get something very different: RMSE of 0.75, which corresponds to a negative coefficient of determination of -0.42! This means that if we always predict the mean value of -3.5, we do better than when using the regression model! This is the typical P-greater-than-N situation.

Training and generalization error:
When the number of features is greater than the number of examples, you always get zero training errors with OLS, except perhaps for issues due to rounding off. However, this is rarely a sign that your model will do well in terms of generalization. In fact, you may get zero training errors and have a completely useless model.

The natural solution is to use regularization to counteract the overfitting. We can try the same cross-validation with an ElasticNet learner, having set the penalty parameter to 0.1:

from sklearn.linear_model import ElasticNet 
met = ElasticNet(alpha=0.1)
met.fit(data, target)
pred = met.predict(data)

print('[EN 0.1] RMSE on training: {:.2}'.format(np.sqrt(mean_squared_error(target, pred))))
[EN 0.1] RMSE on training: 0.4

print('[EN 0.1] R2 on training: {:.2}'.format(r2_score(target, pred)))
[EN 0.1] R2 on training: 0.61

Thus, we get worse results on the training data. However, we have a better hope that these results will generalize well:

kf = Kfold(n_splits=5) 
pred = cross_val_predict(met, data, target, cv=kf)

rmse = np.sqrt(mean_squared_error(target, pred))
print('[EN 0.1] RMSE on testing (5 fold): {:.2}'.format(rmse))
[EN 0.1] RMSE on testing (5 fold): 0.4

print('[EN 0.1] R2 on testing (5 fold): {:.2}'.format(r2_score(target, pred)))
[EN 0.1] R2 on testing (5 fold): 0.61

Indeed, they do! Unlike in the case for OLS, with ElasticNet, the result of cross-validation is the same as for the training data.

There is one problem with this solution, though, which is the choice of alpha. When using the default value (1.0), the result is very different (and worse).

In this case, we cheated as the author had previously tried a few values to see which ones would give a good result. This is not effective and can lead to overestimates of confidence (we are looking at the test data to decide which parameter values to use and which we should never use). The next section explains how to do it properly and how this is supported by scikit-learn.