- Python: Advanced Predictive Analytics
- Ashish Kumar, Joseph Babcock
- 2021-07-02 20:09:23
Summary
The main learning outcomes of this chapter are summarized as follows:
- Various methods of importing a dataset using pandas: read_csv and its variations, reading a dataset using Python's built-in open function, reading a file in chunks using open, reading directly from a URL, specifying the column names from a list, changing the delimiter of a dataset, and so on.
- Basic exploratory analysis of data: observing a thumbnail of the data, its shape, column names, column types, and summary statistics for numerical variables.
- Handling missing values: why missing values arise, why it is important to treat them properly, how to treat them by deletion and imputation, and various methods of imputing data.
- Creating dummy variables: creating dummy variables for categorical variables to be used in the predictive models.
- Basic plotting: scatter plots, histograms, and boxplots; their meaning and relevance; and how they are plotted.
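As a quick recap, the first four items above can be sketched in a few lines of pandas. This is a minimal illustration rather than the chapter's own code: the semicolon-delimited sample data, the column names, and the choice of mean imputation and an "Unknown" category are all assumptions made for the example.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a file on disk or a URL.
# Note the ';' delimiter and the two empty fields (missing values).
text = (
    "name;age;city\n"
    "Asha;34;Delhi\n"
    "Ben;;London\n"
    "Chen;29;\n"
    "Dana;41;Delhi\n"
)

# Importing: change the delimiter with `sep` and supply column names from a
# list; header=0 tells pandas to replace the existing header row with them.
columns = ["name", "age", "city"]
df = pd.read_csv(io.StringIO(text), sep=";", names=columns, header=0)

# Importing in chunks: `chunksize` makes read_csv return an iterator of
# smaller DataFrames instead of one large one.
chunk_sizes = [
    len(chunk)
    for chunk in pd.read_csv(io.StringIO(text), sep=";", chunksize=2)
]

# Basic exploration: thumbnail, shape, column names, types, summary stats.
print(df.head())
print(df.shape)
print(df.columns.tolist())
print(df.dtypes)
print(df.describe())

# Missing values: count them per column, then treat them either by
# deletion or by imputation.
print(df.isnull().sum())
dropped = df.dropna()  # deletion: keep only complete rows
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())  # mean imputation
imputed["city"] = imputed["city"].fillna("Unknown")  # placeholder category

# Dummy variables: one indicator column per category of `city`.
dummies = pd.get_dummies(imputed["city"], prefix="city")
model_ready = pd.concat([imputed.drop(columns="city"), dummies], axis=1)
print(model_ready)
```

With a real dataset, the `io.StringIO(text)` argument would be replaced by a file path or a URL; which imputation method is appropriate depends on the variable and on why the values are missing.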
This chapter is a first step in exploring our data and wrangling it into a form fit for modelling. The next chapter goes deeper: we will learn to aggregate values for categorical variables, subset a dataset, merge two datasets, generate random numbers, and sample a dataset.
Cleaning, as we saw in the last chapter, takes about 80% of the modelling time, so it is critically important, and the methods we are learning will come in handy in that pursuit.