- Python:Advanced Predictive Analytics
- Ashish Kumar Joseph Babcock
- 1636字
- 2021-07-02 20:09:23
Subsetting a dataset
As discussed in the introductory section, the task of subsetting a dataset can entail a lot of things. Let us look at them one by one. In order to demonstrate it, let us first import the Customer Churn Model
dataset, which we used in the last chapter:
import pandas as pd data=pd.read_csv('E:/Personal/Learning/Predictive Modeling Book/Book Datasets/Customer Churn Model.txt')
Selecting columns
Very frequently, an analyst might come across situations wherein only a handful of columns among a vast number of columns are useful and are required in the model. It then becomes important, to select particular columns. Let us see how to do that.
If one wishes to select the Account Length
variable of the data frame we just imported, one can simply write:
account_length=data['Account Length'] account_length.head()
The square bracket ([ ]
) syntax is used to subset a column of a data frame. One just needs to type the appropriate column name in the square brackets. Selecting one column returns a Series
object (an object similar to a data frame) consisting of the values of the selected column. The output of the preceding snippet is as follows:
Fig. 3.1: First few entries of the Account Length column
The fact that this process returns a series can be confirmed by typing type(account_length)
, this will return something similar to the following output, as a result:
Selecting multiple columns can be accomplished in a similar fashion. One just needs to add an extra square bracket to indicate that it is a list of column names that they are selecting and not just one column.
If one wants to select Account Length
, VMail Message
, and Day Calls
, one can write the code, as follows:
Subdata = data[['Account Length','VMail Message','Day Calls']] subdata.head()
The output of the preceding snippet should be similar to the following screenshot:
Fig. 3.2: First few entries of the Account Length and VMail Message columns
Unlike in the case of selecting a single column, selecting multiple columns throws up a data frame, as the result:
type(subdata)
One can also create a list of required columns and pass the list name as the parameter inside the square bracket to subset a data frame. The following code snippet will give the same result, as shown in Fig. 3.3, in the next section:
wanted_columns=['Account Length','VMail Message','Day Calls'] subdata=data[wanted] subdata.head()
In some cases, one might want to delete or remove certain columns from the dataset before they proceed to modelling. The same approach, as taken in the preceding section, can be taken in such cases.
This approach of subsetting columns from data frames works fine when the list of columns is relatively small (3-5 columns). After this, the time consumed in typing column names warrants some more efficient methods to do this. The trick is to manually create a list to complement (a list not containing the elements that are present in the other set) the bigger list and create the bigger list using looping. The complement list of a big table will always be small; hence, we need to make the method a tad bit efficient.
Let us have a look at the following code snippet to observe how to implement this:
wanted=['Account Length','VMail Message','Day Calls'] column_list=data.columns.values.tolist() sublist=[x for x in column_list if x not in wanted] subdata=data[sublist] subdata.head()
The sublist
as expected contains all the column names except the ones listed in the wanted
list, as shown in the following screenshot:
Fig. 3.3: Column names of the subdata data frame
In the third line of the preceding code snippet, a list comprehension has been used. It is a convenient method to run for loops over lists and get lists as output. Many of you, who have experience with Python, will know of this. For others, it is not rocket science; just a better way to run for
loops.
Selecting rows
Selecting rows is similar to selecting columns, in the sense that the same square bracket is used, but instead of column names the row number or indices are used. Let us see some examples to know how to select a particular number of rows from a data frame:
- If one wants to select the first
50
rows of the data frame, one can just write:data[1:50]
- It is important to note that one needs to pass a range of numbers to subset a data frame over rows. To select 50 rows starting from 25th column, we will write:
data[25:75]
- If the lower limit is not mentioned, it denotes that the upper limit is the starting row of the data, which is row 1 in most cases. Thus,
data[:50]
is similar todata[1:50]
.
In the same way, if the upper limit is not mentioned, it is assumed to be the last row of the dataset. To select all the rows except the first 50 rows, we will write data[51:]
.
A variety of permutations and combinations can be performed on these rules to fetch the row that one needs.
Another important way to subset a data frame by rows is conditional or Boolean subsetting. In this method, one filters the rows that satisfy certain conditions. The condition can be either an inequality or a comparison written inside the square bracket. Let us see a few examples of how one can go about implementing them:
- Suppose, one wants to filter the rows that have clocked
Total Mins
to be greater than 500. This can be done as follows:data1=data[data['Total Mins']>500] data1.shape
- The newly created data frame, after filtering, has 2720 rows compared to 3333 in the unfiltered data frame. Clearly, the balance rows have been filtered by the condition.
- Let us have a look at another example, where we provide equality as a condition. Let us filter the rows for which the state is
VA
:data1=data[data['State']=='VA'] data1.shape
- This data frame contains only 77 rows, while the rest get filtered.
- One can combine multiple conditions, as well, using
AND
(&
) andOR
(|
) operators. To filter all the rows in the stateVA
that haveTotal Mins
greater than 500, we can write:data1=data[(data['Total Mins']>500) & (data['State']=='VA')] data1.shape
- This data frame contains only 64 rows; it's lesser than the previous data frame. It also has two conditions, both of which must be satisfied to get filtered. The
AND
operator has a subtractive effect. - To filter all the rows that are either in state
VA
or haveTotal Mins
greater than 500, we can write the following code:data1=data[(data['Total Mins']>500) | (data['State']=='VA')] data1.shape
- This data frame has 2733 rows, which is greater than 2720 rows obtained with just one filter of
Total Mins
being greater than 500. TheOR
operator has an additive affect.
Selecting a combination of rows and columns
This is the most used form of subsetting a dataset. Earlier in this chapter we selected three columns of this dataset and called the sub-setted data frame a subdata
. What if we wish to look at specific rows of that sub-setted data frame? How can we do that? We just need another square bracket adjacent to the one already existing.
Let's say, we need to look at the first 50 rows of that sub-setted data frame. We can write a snippet, as shown:
subdata_first_50=data[['Account Length','VMail Message','Day Calls']][1:50] subdata_first_50
We can use the already created subdata
data frame and subset it for the first 50 rows by typing:
subdata[1:50] or subdata[:50]
Alternatively, one can subset the columns using the list name as explained earlier and then subset for rows.
Another effective (but a little unstable, as its behavior changes based on the version of Python installed) method to select both rows and columns together is the .ix
method. Let's see how to use this method.
Basically, in the .ix
method, we can provide row and column indices (in a lay man's term, row and column numbers) inside the square bracket. The syntax can be summarized, as follows:
- The data frame name is appended with
ix
- Inside the square bracket, specify the row number (range) and column number (range) in that order
Now, let's have a look at a few examples:
- Selecting the first 100 rows of the first 5 columns:
data.ix[1:100,1:6]
The output looks similar to the following screenshot:
Fig. 3.4: First 100 rows of the first 5 columns
- Selecting all rows from the first five columns:
data.ix[:,1:6]
- Selecting first 100 rows from all the columns:
data.ix[1:100,:]
The row and column numbers/name can be passed off as a list, as well. Let's have a look at how it can be done:
- Selecting the first 100 rows from the 2nd, 5th, and 7th columns:
data.ix[1:100,[2,5,7]]
The output looks similar to the following screenshot:
Fig. 3.5: First 100 rows of the 2nd, 5th and 7th columns
- Selecting the 1st, 2nd and 5th rows from the 2nd, 5th and 7th columns:
data.ix[[1,2,5],[2,5,7]]
The output looks similar to the following screenshot:
Fig. 3.6: 1st, 2nd and 5th rows of the 2nd, 5th and 7th columns
Instead of row and column indices or numbers, we can also write corresponding column names, as shown in the following example:
data.ix[[1,2,5],['Area Code','VMail Plan','Day Mins']]
Creating new columns
Many times during the analysis, we are required to create a new column based on some calculation or modification of the existing columns containing a constant value to be used in the modelling. Hence, the knowledge of creating new columns becomes an indispensable tool to learn. Let's see how to do that.
Suppose, in the Customer Churn Model
dataset, we want to calculate the total minutes spent during the day, evening, and night. This requires summing up the 3 columns, which are Day Mins
, Eve Mins
, and Night Mins
. It can be done, as shown in the following snippet:
data['Total Mins']=data['Day Mins']+data['Eve Mins']+data['Night Mins'] data['Total Mins'].head()
The output of the snippet is, as follows:
Fig. 3.7: First few entries of the new Total Mins column