- Feature Engineering Made Easy
- Sinan Ozdemir Divya Susarla
- 1015字
- 2021-06-25 22:45:54
Salary ranges by job classification
Let's first do some import statements:
# import packages we need for exploratory data analysis (EDA)
# to store tabular data
import pandas as pd
# to do some math
import numpy as np
# a popular data visualization tool
import matplotlib.pyplot as plt
# another popular data visualization tool
import seaborn as sns
# allows the notebook to render graphics
%matplotlib inline
# a popular data visualization theme
plt.style.use('fivethirtyeight')
And then, let's import our first dataset, which will explore salaries of different job titles in San Francisco. This dataset is available publicly and so you are encouraged to play around with it as much as you want:
# load in the data set
# https://data.sfgov.org/City-Management-and-Ethics/Salary-Ranges-by-Job-Classification/7h4w-reyq
salary_ranges = pd.read_csv('../data/Salary_Ranges_by_Job_Classification.csv')
# view the first few rows and the headers
salary_ranges.head()
Let us have a look at the following table to understand better:
We can see that we have a bunch of columns, and some already start to jump out at us as being quantitative or qualitative. Let's get a sense of how many rows of data there are using the .info() command:
# get a sense of how many rows of data there are, if there are any missing values, and what data type each column has
salary_ranges.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1356 entries, 0 to 1355 Data columns (total 13 columns): SetID 1356 non-null object Job Code 1356 non-null object Eff Date 1356 non-null object Sal End Date 1356 non-null object Salary SetID 1356 non-null object Sal Plan 1356 non-null object Grade 1356 non-null object Step 1356 non-null int64 Biweekly High Rate 1356 non-null object Biweekly Low Rate 1356 non-null object Union Code 1356 non-null int64 Extended Step 1356 non-null int64 Pay Type 1356 non-null object dtypes: int64(3), object(10) memory usage: 137.8+ KB
So, we have 1356 entries (rows) and 13 columns. The .info() command also tells us the number of non-null items in each column. This is important because missing data is by far one of the most common issues in feature engineering. Sometimes, we are working with datasets that are just incomplete. In pandas, we have many ways of figuring out if we are working with missing data, and many ways of dealing with them. A very quick and common way to count the number of missing values is to run:
# another method to check for missing values
salary_ranges.isnull().sum()
SetID 0 Job Code 0 Eff Date 0 Sal End Date 0 Salary SetID 0 Sal Plan 0 Grade 0 Step 0 Biweekly High Rate 0 Biweekly Low Rate 0 Union Code 0 Extended Step 0 Pay Type 0 dtype: int64
So, we see we are not missing any pieces of data in this one, phew (for now). Moving on, let's run the describe method to check out some descriptive statistics of our quantitative columns (which we should have). Note that the describe method will default to describing quantitative columns, but will describe qualitative columns if there are no quantitative columns:
# show descriptive stats:
salary_ranges.describe()
Let us have a look at the following table for a better understanding here:
According to pandas, we only have three quantitative columns: Step, Union Code, and Extended Step. Let's ignore Step and Extended Step for now, and also notice that Union Code isn't really quantitative. While it is a number, it doesn't really represent a quantity of something, it's merely describing the union through a unique coding. So, we have some work to do here to even understand the features that we are more interested in. Most notably, let's say we wish to pull out a single quantitative column, the Biweekly High Rate, and a single qualitative column, Grade (the type of job):
salary_ranges = salary_ranges[['Biweekly High Rate', 'Grade']]
salary_ranges.head()
The following is the result of the preceding code:
To do some cleaning up on these columns, let's remove those dollar signs ($) from the salary rate and ensure that the columns are of the correct type. When working with quantitative columns, we generally want them to be integer or floats (floats are preferred), while qualitative columns are usually strings or Unicode objects:
# Rate has dollar signs in a few of them, we need to clean that up..
salary_ranges['Biweekly High Rate'].describe()
count 1356 unique 593 top $3460.00 freq 12 Name: Biweekly High Rate, dtype: object
To clean up this column, let's use the map feature in pandas to efficiently map a function to an entire series of data:
# need to clean our Biweekly High columns to remove the dollar sign in order to visualize
salary_ranges['Biweekly High Rate'] = salary_ranges['Biweekly High Rate'].map(lambda value: value.replace('$',''))
# Check to see the '$' has been removed
salary_ranges.head()
The following table gives us a better understanding here:
To finish our transformation of the Biweekly High Rate column, we will cast the whole thing as a float:
# Convert the Biweeky columns to float
salary_ranges['Biweekly High Rate'] = salary_ranges['Biweekly High Rate'].astype(float)
While we are casting, let's also cast the Grade column as a string:
# Convert the Grade columns to str
salary_ranges['Grade'] = salary_ranges['Grade'].astype(str)
# check to see if converting the data types worked
salary_ranges.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1356 entries, 0 to 1355 Data columns (total 2 columns): Biweekly High Rate 1356 non-null float64 Grade 1356 non-null object dtypes: float64(1), object(1) memory usage: 21.3+ KB
We see that we now have a total of:
- 1,356 rows (like we started with)
- Two columns (that we selected):
- Biweekly High Rate: A quantitative column that refers to the average high weekly salary for a specified department:
- This column is quantitative because the the values are numerical in nature and describe the quantity of money that the person earns weekly
- It is of type float, which we cast it to
- Grade: The department that the salary is in reference to:
- This column is definitely qualitative because the codes refer to a department and not a quantity of any kind
- It is of type object, which is the type pandas will stipulate if it is a string
- Biweekly High Rate: A quantitative column that refers to the average high weekly salary for a specified department:
To break quantitative and qualitative data even further, let's pe into the four levels of data.