Regression and Classification Problems

We see classification and regression problems all around us in daily life: the chance of rain reported on https://weather.com, emails being sorted into the inbox or the spam folder, personal and home loans being accepted or rejected, picking our next holiday destination, exploring options for buying a new house, making investment decisions for short- and long-term gains, and choosing the next book to buy from Amazon. The list goes on and on. The world around us is increasingly run by algorithms that help us with our choices (which is not always a good thing).

As discussed in Chapter 2, Exploratory Analysis of Data, we will use the Situation–Complication–Question (SCQ) framework from the Minto Pyramid Principle to define our problem statement. The following table shows the SCQ approach for Beijing's PM2.5 problem:

Figure 3.3: Applying SCQ on Beijing's PM2.5 problem.

Given the SCQ construct described in the previous table, we can either run a simple correlation analysis to establish which factors affect the PM2.5 levels, or frame a predictive problem that estimates the PM2.5 levels from all the factors. (Prediction here means finding an approximate function that maps input variables to an output.) For clarity of terminology, we will refer to the factors as input variables; PM2.5 then becomes the dependent variable (often called the output variable). The dependent variable can be either categorical or continuous.
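To make the idea of "an approximate function that maps input variables to an output" concrete, here is a minimal sketch in plain Python. It assumes a single input variable and a straight-line relationship, and it uses ordinary least squares as one possible way to find the function. The readings, the variable names (wind_speed, pm25), and the helper functions are made up for illustration and are not taken from the Beijing dataset.

```python
# Hypothetical, made-up readings for illustration only;
# these are NOT values from the Beijing PM2.5 dataset.
wind_speed = [1.0, 2.0, 3.0, 4.0, 5.0]    # input variable
pm25 = [110.0, 90.0, 70.0, 50.0, 30.0]    # dependent (output) variable


def fit_line(xs, ys):
    """Fit y ~ intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(wind_speed, pm25)


def predict(x):
    """The approximate function: maps an input value to an estimated output."""
    return intercept + slope * x


print(f"slope={slope:.1f}, intercept={intercept:.1f}")
print(f"estimated PM2.5 at wind speed 6.0: {predict(6.0):.1f}")
```

With several input variables, the same idea applies; only the form of the fitted function changes.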

For example, in the problem of classifying emails as SPAM or NOT SPAM, the dependent variable is categorical, whereas in estimating the PM2.5 level, a numeric measurement, it is continuous. The following table highlights some critical differences between regression and classification problems:

Figure 3.4: Difference between regression and classification problems.
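One way to see the contrast in practice is to look at the same predictions from both angles. The sketch below is an illustration under stated assumptions, not a summary of the figure. The PM2.5 values are invented, the HIGH/LOW cut-off of 75.0 is an arbitrary threshold chosen for the example (not an official air-quality standard), and mean absolute error and accuracy are just two commonly used measures for the continuous and categorical cases respectively.

```python
# Hypothetical values for illustration only.
actual_pm25 = [35.0, 80.0, 150.0, 60.0]
predicted_pm25 = [40.0, 70.0, 140.0, 90.0]

THRESHOLD = 75.0  # assumed cut-off for the example, not an official standard


def to_label(value, threshold=THRESHOLD):
    """Turn a continuous PM2.5 value into a categorical label."""
    return "HIGH" if value > threshold else "LOW"


# Regression view: the dependent variable is continuous,
# so we measure how far off the estimates are on average.
mae = sum(abs(a - p) for a, p in zip(actual_pm25, predicted_pm25)) / len(actual_pm25)

# Classification view: the dependent variable is categorical,
# so we measure how often the predicted label matches the actual one.
actual_labels = [to_label(v) for v in actual_pm25]
predicted_labels = [to_label(v) for v in predicted_pm25]
matches = sum(a == p for a, p in zip(actual_labels, predicted_labels))
accuracy = matches / len(actual_labels)

print(f"mean absolute error: {mae}")
print(f"actual labels:    {actual_labels}")
print(f"predicted labels: {predicted_labels}")
print(f"accuracy: {accuracy}")
```

Note that a small numeric error can still flip a label when the value lies close to the threshold, which is one reason the two problem types are evaluated differently.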