Subsetting a dataset_Python：Advanced Predictive Analytics-QQ阅读男生中文玄幻网

书名：Python：Advanced Predictive Analytics
作者名：Ashish Kumar Joseph Babcock
本章字数：1636字
更新时间：2025-02-24 08:29:04

Subsetting a dataset

As discussed in the introductory section, the task of subsetting a dataset can entail a lot of things. Let us look at them one by one. In order to demonstrate it, let us first import the Customer Churn Model dataset, which we used in the last chapter:

import pandas as pd
data=pd.read_csv('E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Customer Churn Model.txt')

Selecting columns

Very frequently, an analyst might come across situations wherein only a handful of columns among a vast number of columns are useful and are required in the model. It then becomes important, to select particular columns. Let us see how to do that.

If one wishes to select the Account Length variable of the data frame we just imported, one can simply write:

account_length=data['Account Length']
account_length.head()

The square bracket ([ ]) syntax is used to subset a column of a data frame. One just needs to type the appropriate column name in the square brackets. Selecting one column returns a Series object (an object similar to a data frame) consisting of the values of the selected column. The output of the preceding snippet is as follows:

Fig. 3.1: First few entries of the Account Length column

The fact that this process returns a series can be confirmed by typing type(account_length), this will return something similar to the following output, as a result:

Selecting multiple columns can be accomplished in a similar fashion. One just needs to add an extra square bracket to indicate that it is a list of column names that they are selecting and not just one column.

If one wants to select Account Length, VMail Message, and Day Calls, one can write the code, as follows:

Subdata = data[['Account Length','VMail Message','Day Calls']]
subdata.head()

The output of the preceding snippet should be similar to the following screenshot:

Fig. 3.2: First few entries of the Account Length and VMail Message columns

Unlike in the case of selecting a single column, selecting multiple columns throws up a data frame, as the result:

type(subdata)

One can also create a list of required columns and pass the list name as the parameter inside the square bracket to subset a data frame. The following code snippet will give the same result, as shown in Fig. 3.3, in the next section:

wanted_columns=['Account Length','VMail Message','Day Calls']
subdata=data[wanted]
subdata.head()

In some cases, one might want to delete or remove certain columns from the dataset before they proceed to modelling. The same approach, as taken in the preceding section, can be taken in such cases.

This approach of subsetting columns from data frames works fine when the list of columns is relatively small (3-5 columns). After this, the time consumed in typing column names warrants some more efficient methods to do this. The trick is to manually create a list to complement (a list not containing the elements that are present in the other set) the bigger list and create the bigger list using looping. The complement list of a big table will always be small; hence, we need to make the method a tad bit efficient.

Let us have a look at the following code snippet to observe how to implement this:

wanted=['Account Length','VMail Message','Day Calls']
column_list=data.columns.values.tolist()
sublist=[x for x in column_list if x not in wanted]
subdata=data[sublist]
subdata.head()

The sublist as expected contains all the column names except the ones listed in the wanted list, as shown in the following screenshot:

Fig. 3.3: Column names of the subdata data frame

In the third line of the preceding code snippet, a list comprehension has been used. It is a convenient method to run for loops over lists and get lists as output. Many of you, who have experience with Python, will know of this. For others, it is not rocket science; just a better way to run for loops.

Selecting rows

Selecting rows is similar to selecting columns, in the sense that the same square bracket is used, but instead of column names the row number or indices are used. Let us see some examples to know how to select a particular number of rows from a data frame:

If one wants to select the first 50 rows of the data frame, one can just write:
```
data[1:50]
```
It is important to note that one needs to pass a range of numbers to subset a data frame over rows. To select 50 rows starting from 25th column, we will write:
```
data[25:75]
```
If the lower limit is not mentioned, it denotes that the upper limit is the starting row of the data, which is row 1 in most cases. Thus, data[:50] is similar to data[1:50].

In the same way, if the upper limit is not mentioned, it is assumed to be the last row of the dataset. To select all the rows except the first 50 rows, we will write data[51:].

A variety of permutations and combinations can be performed on these rules to fetch the row that one needs.

Another important way to subset a data frame by rows is conditional or Boolean subsetting. In this method, one filters the rows that satisfy certain conditions. The condition can be either an inequality or a comparison written inside the square bracket. Let us see a few examples of how one can go about implementing them:

Suppose, one wants to filter the rows that have clocked Total Mins to be greater than 500. This can be done as follows:
```
data1=data[data['Total Mins']>500]
data1.shape
```
The newly created data frame, after filtering, has 2720 rows compared to 3333 in the unfiltered data frame. Clearly, the balance rows have been filtered by the condition.
Let us have a look at another example, where we provide equality as a condition. Let us filter the rows for which the state is VA:
```
data1=data[data['State']=='VA']
data1.shape
```
This data frame contains only 77 rows, while the rest get filtered.
One can combine multiple conditions, as well, using AND (&) and OR (|) operators. To filter all the rows in the state VA that have Total Mins greater than 500, we can write:
```
data1=data[(data['Total Mins']>500) & (data['State']=='VA')]
data1.shape
```
This data frame contains only 64 rows; it's lesser than the previous data frame. It also has two conditions, both of which must be satisfied to get filtered. The AND operator has a subtractive effect.
To filter all the rows that are either in state VA or have Total Mins greater than 500, we can write the following code:
```
data1=data[(data['Total Mins']>500) | (data['State']=='VA')]
data1.shape
```
This data frame has 2733 rows, which is greater than 2720 rows obtained with just one filter of Total Mins being greater than 500. The OR operator has an additive affect.

Selecting a combination of rows and columns

This is the most used form of subsetting a dataset. Earlier in this chapter we selected three columns of this dataset and called the sub-setted data frame a subdata. What if we wish to look at specific rows of that sub-setted data frame? How can we do that? We just need another square bracket adjacent to the one already existing.

Let's say, we need to look at the first 50 rows of that sub-setted data frame. We can write a snippet, as shown:

subdata_first_50=data[['Account Length','VMail Message','Day Calls']][1:50]
subdata_first_50

We can use the already created subdata data frame and subset it for the first 50 rows by typing:

subdata[1:50] or subdata[:50]

Alternatively, one can subset the columns using the list name as explained earlier and then subset for rows.

Another effective (but a little unstable, as its behavior changes based on the version of Python installed) method to select both rows and columns together is the .ix method. Let's see how to use this method.

Basically, in the .ix method, we can provide row and column indices (in a lay man's term, row and column numbers) inside the square bracket. The syntax can be summarized, as follows:

The data frame name is appended with ix
Inside the square bracket, specify the row number (range) and column number (range) in that order

Now, let's have a look at a few examples:

Selecting the first 100 rows of the first 5 columns:
```
data.ix[1:100,1:6]
```

The output looks similar to the following screenshot:

Fig. 3.4: First 100 rows of the first 5 columns

Selecting all rows from the first five columns:
```
data.ix[:,1:6] 
```
Selecting first 100 rows from all the columns:
```
data.ix[1:100,:]
```

The row and column numbers/name can be passed off as a list, as well. Let's have a look at how it can be done:

Selecting the first 100 rows from the 2^nd, 5^th, and 7^th columns:
```
data.ix[1:100,[2,5,7]]
```

The output looks similar to the following screenshot:

Fig. 3.5: First 100 rows of the 2^nd, 5^th and 7^th columns

Selecting the 1^st, 2^nd and 5^th rows from the 2^nd, 5^th and 7^th columns:
```
data.ix[[1,2,5],[2,5,7]]
```

The output looks similar to the following screenshot:

Fig. 3.6: 1^st, 2^nd and 5^th rows of the 2^nd, 5^th and 7^th columns

Instead of row and column indices or numbers, we can also write corresponding column names, as shown in the following example:

data.ix[[1,2,5],['Area Code','VMail Plan','Day Mins']]

Creating new columns

Many times during the analysis, we are required to create a new column based on some calculation or modification of the existing columns containing a constant value to be used in the modelling. Hence, the knowledge of creating new columns becomes an indispensable tool to learn. Let's see how to do that.

Suppose, in the Customer Churn Model dataset, we want to calculate the total minutes spent during the day, evening, and night. This requires summing up the 3 columns, which are Day Mins, Eve Mins, and Night Mins. It can be done, as shown in the following snippet:

data['Total Mins']=data['Day Mins']+data['Eve Mins']+data['Night Mins']
data['Total Mins'].head()

The output of the snippet is, as follows:

Fig. 3.7: First few entries of the new Total Mins column