Toward more complex models

Let's now fit a more complex model, a polynomial of degree 2, to see whether it better understands our data:

>>> f2p = np.polyfit(x, y, 2)
>>> print(f2p)
[ 1.05605675e-02 -5.29774287e+00 1.98466917e+03]
>>> f2 = np.poly1d(f2p)
>>> print(error(f2, x, y))
181347660.764

With plot_web_traffic(x, y, [f1, f2]) we can see how a function of degree 2 manages to model our web traffic data:

The error is 181,347,660.764, which is almost half the error of the straight-line model. This is good, but unfortunately this comes at a price: we now have a more complex function, meaning that we have one more parameter to tune inside polyfit(). The fitted polynomial is as follows:

f(x) = 0.0105605675 * x**2 - 5.29774287 * x + 1984.66917

So, if more complexity gives better results, why not increase the complexity even more? Let's try it for degrees 3, 10, and 100:

Interestingly, we do not see d = 100 for the polynomial that had been fitted with 100 degrees, but instead d = 53. This has to do with the warning we get when fitting 100 degrees:

RankWarning: Polyfit may be poorly conditioned

This means that, because of numerical errors, polyfit cannot determine a good fit with 100 degrees. Instead, it figured that 53 would be good enough.

It seems like the curves capture the fitted data better the more complex they get. The errors seem to tell the same story:

>>> print("Errors for the complete data set:")
>>> for f in [f1, f2, f3, f10, f100]:
... print("td=%i: %f" % (f.order, error(f, x, y)))
...

The errors for the complete dataset are as follows:

  • d=1: 319,531,507.008126
  • d=2: 181,347,660.764236
  • d=3: 140,576,460.879141
  • d=10: 123,426,935.754101
  • d=53: 110,768,263.808878

However, taking a closer look at the fitted curves, we start to wonder whether they also capture the true process that generated that data. Framed differently, do our models correctly represent the underlying mass behavior of customers visiting our website? Looking at the polynomials of degree 10 and 53, we see wildly oscillating behavior. It seems that the models are fitted too much to the data. So much so that the graph is now capturing not only the underlying process, but also the noise. This is called overfitting.

At this point, we have the following choices:

  • Choose one of the fitted polynomial models
  • Switch to another more complex model class
  • Think differently about the data and start again

Out of the five fitted models, the first-order model is clearly too simple, and the models of order 10 and 53 are clearly overfitting. Only the second- and third-order models seem to somehow match the data. However, if we extrapolate them at both borders, we see them going berserk.

Switching to a more complex class also doesn't seem to be the right way to go. Which arguments would back which class? At this point, we realize that we have probably not fully understood our data.