- Applied Supervised Learning with R
- Karthik Ramasubramanian Jojo Moolayil
- 2021-06-11 13:22:32
Studying the Relationship between a Categorical and a Numeric Variable
Let's first recall the methods discussed for studying the relationship between a numeric and a categorical variable, and then discuss how to apply them.
In this section, we will discuss the different aggregation metrics that we can use to summarize the data. So far, we have used the average (avg), but a better approach is to combine the average with the minimum, maximum, and other metrics.
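As a minimal sketch of combining several aggregation metrics, the snippet below summarizes a small made-up data frame with the mean, median, minimum, maximum, and standard deviation per outcome. The toy object and its values are hypothetical stand-ins for the bank data, and the snippet assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data standing in for the real bank dataset
toy <- data.frame(
  y   = c("no", "no", "no", "yes", "yes"),
  age = c(30, 40, 50, 25, 65)
)

# One row per outcome, with several summary metrics side by side
summary_by_outcome <- toy %>%
  group_by(y) %>%
  summarize(
    Avg.Age     = mean(age),
    Median.Age  = median(age),
    Min.Age     = min(age),
    Max.Age     = max(age),
    SD.Age      = sd(age),
    Num.Records = n()
  )

print(summary_by_outcome)
```

In this toy data, the two groups have similar averages (40 and 45), yet the minimum, maximum, and standard deviation show that the yes group is far more spread out, which the average alone would hide.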
Exercise 31: Studying the Relationship between the y and age Variables
We have a categorical dependent variable and nine numeric variables to explore. To start small, we will first explore the relationship between our target, y, and age. To study the relationship between a categorical and a numeric variable, we can choose a simple analytical technique: calculate the average age for each target outcome. If we see stark differences, we can draw insights from the observations.
In this exercise, we will calculate the average age across each target outcome and also count the number of records in each bucket, followed by a visual representation.
Perform the following steps:
- First, import the ggplot2 package, along with dplyr, which provides the %>% operator and the group_by and summarize functions used below:
library(ggplot2)
library(dplyr)
- Create a DataFrame object, df, from the bank-additional-full.csv file using the following command:
df <- read.csv("/Chapter 2/Data/bank-additional/bank-additional-full.csv",sep=';')
- Create a temp object that stores the average age and the number of records for each target outcome:
temp <- df %>% group_by(y) %>%
  summarize(Avg.Age = round(mean(age),2),
            Num.Records = n())
- Print the value stored in the temp object:
print(temp)
The output is as follows:
# A tibble: 2 x 3
  y     Avg.Age Num.Records
  <fct>   <dbl>       <int>
1 no       39.9       36548
2 yes      40.9        4640
- Now, create a plot using the ggplot command:
ggplot(data = temp, aes(x = y, y = Avg.Age)) +
  geom_bar(stat = "identity", fill = "blue", alpha = 0.5) +   #Creates the bar plot
  geom_text(label = temp$Avg.Age, vjust = -0.3) +             #Adds the label
  ggtitle(paste("Average Age across target outcome"))         #Creates the title
The output is as follows:
Figure 2.16: Bar plot of the average age across target outcome
The group_by and summarize step creates the temporary aggregated dataset, which holds the average age and the number of records in each category. The plotting code follows the same lines as our previous visuals: we extend the ggplot function with geom_bar to render the bar plot and with geom_text to label each bar.
We can see that there is barely any difference between the two outcomes (an average age of 39.9 for no versus 40.9 for yes), so no interesting pattern emerges.
Note
In bivariate analysis, we need to be careful before treating any interesting-looking pattern as an insight. In many cases, a skewed distribution of the data makes a pattern look more striking than it really is.
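To illustrate how skew can manufacture a pattern, the following sketch compares the mean and the median per outcome on made-up data in which a single extreme value sits in the yes group. The toy data frame and its values are purely hypothetical; only base R functions are used.

```r
# Hypothetical toy data: one extreme value (100) in the "yes" group
toy <- data.frame(
  y     = c("no", "no", "no", "no", "no", "yes", "yes", "yes", "yes", "yes"),
  value = c(2, 3, 4, 5, 6, 1, 2, 3, 4, 100)
)

# Mean and median of value for each outcome
group_means   <- tapply(toy$value, toy$y, mean)
group_medians <- tapply(toy$value, toy$y, median)

print(rbind(Mean = group_means, Median = group_medians))
```

The means (4 for no, 22 for yes) suggest a large difference, while the medians (4 and 3) show that the typical values are nearly the same. Checking a robust metric alongside the average is one inexpensive way to spot this situation.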
Let's move on to the next set of variables.
Exercise 32: Studying the Relationship between the Average Value and the y Variable
In this exercise, we will study the relationship between y and the average value of the next set of numeric variables: campaign, pdays, previous, and emp.var.rate.
Perform the following steps to complete the exercise:
- Import the required libraries (ggplot2, dplyr, and cowplot, which provides the plot_grid function) and create the DataFrame object.
- Next, define the plot_bivariate_numeric_and_categorical function using the following command:
plot_bivariate_numeric_and_categorical <- function(df, target, list_of_variables, ncols = 2){
  target <- sym(target) #Converts the target name (a string) to a column symbol
  options(repr.plot.width = 12, repr.plot.height = 8) #Defines plot size
  plt_matrix <- list()
  i <- 1
  for(column in list_of_variables){
    col <- sym(column) #Converts the variable name (a string) to a column symbol
    #Average value of the variable for each target outcome
    temp <- df %>% group_by(!!target) %>%
      summarize(Avg.Val = round(mean(!!col), 2))
    plt_matrix[[i]] <- ggplot(data = temp, aes(x = !!target, y = Avg.Val)) +
      geom_bar(stat = "identity", fill = "blue", alpha = 0.5) +
      geom_text(label = temp$Avg.Val, vjust = -0.3) + #Adds the labels
      ggtitle(paste("Average", column, "across target outcomes")) #Creates the title
    i <- i + 1
  }
  plot_grid(plotlist = plt_matrix, ncol = ncols) #Arranges the plots in a grid
}
- Now, print the distribution of records across target outcomes:
print("Distribution of records across target outcomes-")
print(table(df$y))
The output is as follows:
[1] "Distribution of records across target outcomes-"
   no   yes
36548  4640
- Now, plot the bar charts for the chosen variables using the following command:
plot_bivariate_numeric_and_categorical(df,"y",c("campaign","pdays","previous","emp.var.rate"),2)
The output is as follows:
Figure 2.17: Bar plots of average value versus the y variable
To automate the data exploration task for bivariate analysis between a categorical and a numeric variable, we have defined a function similar to the ones we defined earlier. We have additionally used the sym function, which lets us use dynamic column names inside the function. sym converts a string into a symbol, and the !! operator unquotes that symbol so that dplyr and ggplot2 treat it as a real column name, just as if we had typed the column name directly. The function first computes the average value of each variable of interest for every target outcome. The plotting code then uses this information to draw a bar chart of the average values across the target outcomes.
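The following minimal sketch isolates the dynamic column name mechanism on a small made-up data frame. The toy data and the variable names via_sym and via_pronoun are hypothetical; rlang::sym and the .data pronoun are the tidy evaluation tools that dplyr builds on, and the .data variant is shown only as an alternative, not as what the exercise uses.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  y        = c("no", "yes", "no", "yes"),
  campaign = c(1, 2, 3, 4)
)

column <- "campaign"        # column name held as a string
col <- rlang::sym(column)   # string -> symbol

# !!col unquotes the symbol, so this behaves like mean(campaign)
via_sym <- toy %>%
  group_by(y) %>%
  summarize(Avg.Val = mean(!!col))

# Alternative: the .data pronoun looks the column up by its string name
via_pronoun <- toy %>%
  group_by(y) %>%
  summarize(Avg.Val = mean(.data[[column]]))

print(via_sym)
```

Both approaches give the same averages here (2 for no and 3 for yes). The sym approach matches the function in this exercise, while .data[[column]] avoids the separate conversion step.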
In bivariate analysis, it is important to carefully validate the patterns observed before concluding a specific insight. In some cases, outliers might skew the results and therefore deliver incorrect findings. Additionally, a pattern supported by only a small number of records is risky to draw conclusions from. It is always recommended to collect all the insights observed and validate them further with more extensive EDA or with statistical tests for significance.
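As one possible way to check a difference in averages for significance, the sketch below runs a Welch two-sample t-test and a Wilcoxon rank-sum test on simulated data. The data are generated purely for illustration (they are not the bank data), and the choice of these two tests is an assumption rather than the method prescribed by this chapter.

```r
set.seed(42)  # makes the simulated data reproducible

# Hypothetical toy data: a numeric variable for an imbalanced binary outcome
toy <- data.frame(
  y     = rep(c("no", "yes"), times = c(200, 30)),
  value = c(rnorm(200, mean = 40, sd = 10), rnorm(30, mean = 41, sd = 10))
)

# Welch two-sample t-test: is the difference in group means larger
# than what we would expect from chance alone?
t_result <- t.test(value ~ y, data = toy)

# Wilcoxon rank-sum test: a rank-based alternative that is less
# sensitive to outliers and skew
w_result <- wilcox.test(value ~ y, data = toy)

print(t_result$estimate)  # the two group means
print(t_result$p.value)
print(w_result$p.value)
```

A small p-value would suggest that the difference is unlikely to be due to chance alone, whereas a large one is a warning against reading too much into a gap between two bars.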
Here, we don't see any prominent results. For the campaign variable, the average number of contacts made during the campaign is slightly lower for successful campaigns, but the difference is too small to support any conclusion. pdays, which indicates the number of days since the client was last contacted in a previous campaign, shows a big difference between the target outcomes.
However, this difference arises purely because most clients were not contacted in the previous campaign; all of those records have pdays set to 999. The same holds true for previous: though there is a decent difference between the two outcomes, most clients were contacted for the first time in the current campaign. The employment variation rate (emp.var.rate), however, shows counter-intuitive results. We would expect the rate to be higher when the outcome is yes, but we see it the other way around. This is interesting, so we will make a note of this insight for now and come back later for more validation before drawing any conclusions.
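To see how a placeholder value such as 999 can dominate an average, the sketch below computes the share of placeholder records and the average with and without them. The toy data frame and the column names of the summary are hypothetical, and the snippet assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data; 999 marks clients who were not contacted previously
toy <- data.frame(
  y     = c("no", "no", "no", "yes", "yes", "yes"),
  pdays = c(999, 999, 5, 999, 3, 6)
)

pdays_summary <- toy %>%
  group_by(y) %>%
  summarize(
    Share.Not.Contacted = mean(pdays == 999),       # fraction of placeholder records
    Avg.Incl.999        = mean(pdays),              # average distorted by 999
    Avg.Excl.999        = mean(pdays[pdays != 999]) # average over real values only
  )

print(pdays_summary)
```

In this toy data, the averages that include 999 differ widely between the outcomes (about 668 versus 336), but that gap only reflects how many placeholder records each group contains; the averages over real values are 5 and 4.5.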
Let's move on to the next set of numeric independent variables to be studied with the categorical dependent variable.
Exercise 33: Studying the Relationship between the cons.price.idx, cons.conf.idx, euribor3m, and nr.employed Variables
For this exercise, we will explore the relationship between cons.price.idx, cons.conf.idx, euribor3m, and nr.employed and the target variable, y, using bar charts of the average values.
- Import the required libraries and create the DataFrame object.
- Next, call the plot_bivariate_numeric_and_categorical function defined in Exercise 32 to plot the bar charts:
plot_bivariate_numeric_and_categorical(df,"y",
c("cons.price.idx","cons.conf.idx", "euribor3m", "nr.employed"),2)
The output is as follows:
Figure 2.18: Bar plots of the cons.price.idx, cons.conf.idx, euribor3m, and nr.employed variables
Again, in most cases, we don't see any prominent patterns. However, the euribor3m variable shows a clear difference between the average values for the yes and no outcomes of the campaign and, again, seems counter-intuitive: we would ideally expect more bank deposits when interest rates are higher. Therefore, let's make a note of the insight and validate it later.
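One way to follow up on such an observation is to look at the whole distribution per outcome instead of only the average. The sketch below draws a box plot with ggplot2 on randomly generated values; the toy data are made up and say nothing about the real euribor3m values, and the plot type is a suggestion rather than part of this exercise.

```r
library(ggplot2)

set.seed(1)  # makes the simulated data reproducible

# Hypothetical toy data standing in for the real euribor3m column
toy <- data.frame(
  y         = rep(c("no", "yes"), each = 50),
  euribor3m = c(runif(50, min = 1, max = 5), runif(50, min = 0.6, max = 4))
)

# Box plot: median, quartiles, and outliers for each target outcome
p <- ggplot(data = toy, aes(x = y, y = euribor3m)) +
  geom_boxplot(fill = "blue", alpha = 0.5) +
  ggtitle("Distribution of euribor3m across target outcomes")

print(p)
```

If the boxes overlap heavily, a difference in averages deserves less weight; if they are clearly separated, the pattern is worth validating further.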
Moving on, let's now explore the relationship between two categorical variables.