Problem Understanding and Data Preparation

In the last chapter, we learned about the predictive analytics process, along with some fundamental definitions and the main libraries in the Python data ecosystem. In this chapter, we get our hands on a couple of datasets and delve deeper into the first two phases of the predictive analytics process: Problem understanding and definition, and Data collection and preparation.

In the first part of this chapter, we discuss the most important considerations when defining and understanding the problem: having enough context and domain knowledge, defining what is being predicted, and identifying the data that we have to work with. This phase also includes proposing a solution, and we cover the main topics to consider when doing so.

We put these ideas into practice in the second part of the chapter, where we introduce two datasets (which will be used in the rest of the book), along with some hypothetical business problems. Using these datasets, we discuss not only Problem understanding and definition but also Data collection and preparation. We introduce some practical topics concerning these stages: dealing with missing values, encoding categorical features, the problem of collinearity, and low variance features. Finally, we give a brief introduction to feature engineering.
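To give a first flavor of what two of these preparation topics look like in code, here is a minimal sketch using pandas. The toy DataFrame and its column names (size, grade) are hypothetical and are not taken from the datasets introduced in this chapter. Filling with the median and one-hot encoding are just one possible choice for each task, not necessarily the approach used later.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only.
df = pd.DataFrame({
    "size": [0.3, None, 0.7, 1.1],                    # numeric, one missing value
    "grade": ["Ideal", "Good", "Ideal", "Premium"],  # categorical feature
})

# Dealing with missing values: one simple option is to fill the gap
# with the median of the observed values in that column.
df["size"] = df["size"].fillna(df["size"].median())

# Encoding categorical features: one-hot encoding creates one
# indicator column per category.
encoded = pd.get_dummies(df, columns=["grade"], dtype=int)

print(encoded)
```

Whether a median fill or one-hot encoding is appropriate depends on the problem and on the model being used, which is why these topics are discussed alongside the business problem rather than in isolation.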

These are the main points of this chapter:

Understanding the business problem and proposing a solution
Introducing the diamond prices dataset and the practical project associated with it
Introducing the credit card default dataset and the practical project associated with it