Training and testing

If we had some data from the future to measure our models against, we could judge our model choice solely on the resulting approximation error.

Although we cannot look into the future, we can and should simulate a similar effect by holding out part of our data. Let's remove, for instance, a certain percentage of the data and train on the rest. Then we use the held-out data to calculate the error. Because the model has been trained without seeing the held-out data, we should get a more realistic picture of how it will behave in the future.
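The hold-out procedure described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the data is synthetic (a noisy quadratic trend, not the web traffic data discussed here), the 30 percent hold-out fraction is an arbitrary choice, and the helper name error is our own. It fits polynomials of a few degrees on the training part and measures the squared error on the held-out part only.

```python
import numpy as np

def error(f, x, y):
    """Sum of squared differences between the model f and the data."""
    return np.sum((f(x) - y) ** 2)

# Synthetic stand-in data (assumption): a noisy quadratic trend.
rng = np.random.default_rng(0)
x = np.arange(200, dtype=float)
y = 0.05 * x**2 + 3.0 * x + 500 + rng.normal(scale=20.0, size=x.size)

# Hold out 30 percent of the points (an arbitrary fraction) for testing.
frac = 0.3
split_idx = int(frac * len(x))
shuffled = rng.permutation(len(x))
test = np.sort(shuffled[:split_idx])
train = np.sort(shuffled[split_idx:])

# Fit on the training indices only, then score on the held-out indices.
test_errors = {}
for d in (1, 2, 3):
    model = np.poly1d(np.polyfit(x[train], y[train], d))
    test_errors[d] = error(model, x[test], y[test])

for d, err in test_errors.items():
    print("d=%i: %f" % (d, err))
```

Shuffling before splitting means the held-out points are spread over the whole time range. Holding out the most recent stretch instead would be a stricter test of extrapolation, which is a different design choice from the one sketched here.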

The test errors for the models trained only on the time after the inflection point now show a completely different picture:

  • d=1: 6492812.705336
  • d=2: 5008335.504620
  • d=3: 5006519.831510
  • d=10: 5440767.696731
  • d=53: 5369417.148129

Have a look at the following plot:

It seems the models with degrees 2 and 3 have the lowest test errors. The test error is the error measured on data that the model did not see during training. This gives us hope that we won't get bad surprises when future data arrives. However, we are not fully done yet.

We will see in the next plot why we cannot simply pick the model with the lowest error:

The model with degree 3 does not foresee a future in which we will ever get 100,000 hits per hour. So we stick with degree 2.
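One way to check whether a fitted polynomial ever predicts a given traffic level is to evaluate it hour by hour over a bounded horizon. The following is a sketch only: the coefficients, the starting hour, and the horizon are hypothetical values chosen for illustration, not the fitted models from this section, and the function names are our own.

```python
def evaluate(coeffs, x):
    """Evaluate a polynomial given highest-degree-first coefficients
    using Horner's rule."""
    result = 0.0
    for c in coeffs:
        result = result * x + c
    return result

def first_hour_reaching(coeffs, target, start_hour, horizon_hours):
    """Return the first whole hour in the scanned range at which the
    polynomial reaches the target, or None if it never does."""
    for hour in range(start_hour, start_hour + horizon_hours + 1):
        if evaluate(coeffs, hour) >= target:
            return hour
    return None

# Hypothetical coefficients (assumptions, not values from the text).
rising = [0.02, 1.0, 1000.5]             # degree 2, keeps growing
falling = [-0.0001, 0.02, 1.0, 1000.0]   # degree 3, negative leading term

target = 100000  # hits per hour
hit_rising = first_hour_reaching(rising, target, start_hour=744, horizon_hours=5000)
hit_falling = first_hour_reaching(falling, target, start_hour=744, horizon_hours=5000)

print("degree 2 reaches the target at hour:", hit_rising)
print("degree 3 reaches the target at hour:", hit_falling)
```

A result of None only says that the target is not reached within the scanned horizon. For a polynomial whose leading coefficient is negative, as in the hypothetical degree 3 example, the prediction eventually keeps falling, so extending the horizon would not change the answer.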