P-greater-than-N scenarios

The title of this section is a bit of jargon, which you will learn now. In the 1990s, first in the biomedical domain, and then on the web, problems started to appear where P was greater than N. What this meant was that the number of features, P, was greater than the number of examples, N (these letters were the conventional statistical shorthand for these concepts).

For example, if your input is a set of written documents, a simple way to approach it is to consider each possible word in the dictionary as a feature and regress on those (we will later work on one such problem ourselves). In the English language, you have over 20,000 words (this is if you perform some stemming and only consider common words; it is more than ten times that if you skip this preprocessing step). If you only have a few hundred or a few thousand examples, you will have more features than examples.

In this case, as the number of features is greater than the number of examples, it is possible to have a perfect fit on the training data. This is a mathematical fact, which is independent of your data. You are, in effect, solving a system of linear equations with fewer equations than variables. You can find a set of regression coefficients with zero training errors (in fact, you can find more than one perfect solution).

However, and this is a major problem, zero training error does not mean that your solution will generalize well. In fact, it may generalize very poorly. In earlier examples, regularization could give you a little extra boost, but it is now absolutely required for a meaningful result.