- Applied Supervised Learning with R
- Karthik Ramasubramanian, Jojo Moolayil
Categorical Dependent and Categorical Independent Variables
Moving on, let's take a look at the third hypothesis. To test the relationship between a categorical dependent variable and a categorical independent variable, we can use the chi-squared test.
For hypothesis 3, we define the following:
- Null hypothesis: The campaign outcome has no relationship with clients who never married.
- Alternate hypothesis: The campaign outcome has a relationship with clients who never married.
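Before applying the test to the campaign data, it may help to see the mechanics of R's chisq.test() on a toy contingency table. The counts below are invented purely for illustration and have nothing to do with the campaign dataset:

```r
# Toy 2x2 contingency table (invented counts, not the campaign data)
toy <- matrix(c(30, 70, 60, 40), nrow = 2,
              dimnames = list(outcome = c("yes", "no"),
                              group   = c("A", "B")))

# Pearson's chi-squared test of independence (Yates' continuity
# correction is applied automatically for 2x2 tables)
result <- chisq.test(toy)

# A small p-value suggests the two variables are not independent
print(result$p.value)
```

The exercise that follows does exactly this, except that the table is built from the real campaign data with table().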
In the following exercise, we will leverage R's chi-squared test function to validate the hypothesis.
Exercise 38: Hypothesis 3 Testing for Categorical Dependent Variables and Categorical Independent Variables
In this exercise, we will perform a statistical test using the chi-squared test. We use the chi-squared test because both variables are categorical: the dependent variable, y, and the independent variable, marital status.
Perform the following steps:
- Import the required libraries and create the DataFrame objects.
- First, convert the dependent variable into a factor type:
df$y <- factor(df$y)
- Create a flag for single clients:
df$single_flag <- as.factor(ifelse(df$marital == "single","single","other"))
- Create a sample object and print the value:
sample <- table(df$y, df$single_flag)
print(sample)
The output is as follows:
       other single
  no   26600   9948
  yes   3020   1620
- Perform the chi-squared test:
h.test3 <- chisq.test(sample)
- Print the test summary:
print(h.test3)
The output is as follows:
Pearson's Chi-squared test with Yates' continuity correction
data: sample
X-squared = 120.32, df = 1, p-value < 2.2e-16
We first create a new flag variable for this test that marks whether a client is single. Since we are exclusively defining our relationship between the target and the client's single marital status, we collapse all other classes within marital status into "other".
The table() command creates a contingency table holding the frequency of each combination of classes. Finally, we pass this table to the chi-squared test.
As we can see, the p-value, that is, the probability of observing a relationship at least this strong if the null hypothesis were true, is far below 5%. We therefore reject the null hypothesis in favor of the alternate: the campaign outcome is related to whether a client is single. The frequency table shows the direction of this relationship: single clients respond positively at a higher rate (roughly 14%) than other clients (roughly 10%).
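The direction of the relationship comes from the proportions in the contingency table, not from the test itself. A small sketch, rebuilding the table from the counts printed above and using prop.table() to get the response rate within each group:

```r
# Rebuild the contingency table from the counts printed above
sample <- as.table(matrix(c(26600, 3020, 9948, 1620), nrow = 2,
                          dimnames = list(y = c("no", "yes"),
                                          single_flag = c("other", "single"))))

# Column-wise proportions: the response rate within each group
rates <- prop.table(sample, margin = 2)

# Single clients convert at ~14%, versus ~10% for other clients
print(round(rates["yes", ], 3))
```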
Moving on, let's take a quick look at the validity of our 4th and 5th hypotheses.
For the 4th and 5th hypotheses, we define the following:
- Null hypothesis: The campaign outcome has no relationship with clients who are students or retired. The campaign outcome has no relationship with the contact mode used.
- Alternate hypothesis: The campaign outcome has a relationship with clients who are students or retired. The campaign outcome has a relationship with the contact mode used.
Exercise 39: Hypothesis 4 and 5 Testing for a Categorical Dependent Variable and a Categorical Independent Variable
Once again, let's use the chi-squared test to statistically check whether there is a relationship between the target variable, y, and each of the categorical independent variables, job_flag and contact.
Perform the following steps:
- Import the required libraries and create the DataFrame objects.
- First, convert the dependent variable into a factor type:
df$y <- factor(df$y)
- Prepare the independent variable:
df$job_flag <- as.factor(ifelse(df$job %in% c("student","retired"),as.character(df$job),"other"))
df$contact <- as.factor(df$contact)
- Create an object named sample4 and print the value:
sample4 <- table(df$y, df$job_flag)
print("Frequency table for Job")
print(sample4)
The output is as follows:
[1] "Frequency table for Job"
        other retired student
  no    34662    1286     600
  yes    3931     434     275
- Perform the test for the 4th hypothesis:
h.test4 <- chisq.test(sample4)
- Print the test summary for the 4th hypothesis:
print("Hypothesis #4 results")
print(h.test4)
The output is as follows:
[1] "Hypothesis #4 results"
Pearson's Chi-squared test
data: sample4
X-squared = 736.53, df = 2, p-value < 2.2e-16
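Notice that, unlike the output for hypothesis 3, this output does not mention Yates' continuity correction: R applies that correction only to 2x2 tables, and sample4 is 2x3, hence df = (2 - 1) * (3 - 1) = 2. A quick sketch confirming this, rebuilding the table from the counts printed above:

```r
# Rebuild the 2x3 table from the counts printed above
sample4 <- as.table(matrix(c(34662, 3931, 1286, 434, 600, 275), nrow = 2,
                           dimnames = list(y = c("no", "yes"),
                                           job_flag = c("other", "retired", "student"))))

# For tables larger than 2x2, the `correct` argument has no effect
with_corr    <- chisq.test(sample4, correct = TRUE)
without_corr <- chisq.test(sample4, correct = FALSE)
print(with_corr$statistic == without_corr$statistic)
print(with_corr$parameter)  # df = 2
```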
- Now, create a new sample5 object and print the value:
print("Frequency table for Contact")
sample5 <- table(df$y, df$contact)
print(sample5)
The output is as follows:
[1] "Frequency table for Contact"
       cellular telephone
  no      22291     14257
  yes      3853       787
- Perform the test on the sample5 object:
h.test5 <- chisq.test(sample5)
- Print the test summary for the 5th hypothesis:
print("Hypothesis #5 results")
print(h.test5)
The output is as follows:
[1] "Hypothesis #5 results"
Pearson's Chi-squared test with Yates' continuity correction
data: sample5
X-squared = 862.32, df = 1, p-value < 2.2e-16
Both tests reject the null hypothesis: the campaign outcome has a relationship with whether a client is a student or retired, and with the contact mode used. The frequency tables also show the direction of these relationships: students (roughly 31% positive responses), retired clients (roughly 25%), and clients contacted via cellular (roughly 15%, versus 5% for telephone) are all associated with a positive campaign outcome at notably higher rates.
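To see which cells of a contingency table drive the chi-squared statistic, we can inspect the Pearson residuals that chisq.test() returns: positive values mark cells that occur more often than independence would predict. A sketch using the contact table, rebuilt from the counts printed above:

```r
# Rebuild the contact frequency table from the counts printed above
sample5 <- as.table(matrix(c(22291, 3853, 14257, 787), nrow = 2,
                           dimnames = list(y = c("no", "yes"),
                                           contact = c("cellular", "telephone"))))

# Pearson residuals: (observed - expected) / sqrt(expected); a positive
# value means the cell is over-represented relative to independence
h5 <- chisq.test(sample5)
print(round(h5$residuals, 1))
```

The large positive residual in the (yes, cellular) cell and the negative residual in (yes, telephone) show that positive outcomes cluster in the cellular contact mode.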
Collating Insights – Refine the Solution to the Problem
We have now traversed the length and breadth of EDA. In the different sections, we studied the data at varying levels of depth. Now that we have valid answers from the data exploration, we can return to the problem we defined initially. If you recall the complication and question section in the problem statement, we asked, "What are the factors that lead to poor performance of the campaign?" We now have an answer, based on the patterns we discovered during the bivariate analysis and validated with statistical tests.
Collating all the validated hypotheses into the right story brings about the solution to our problem. Spend some time with the outcome of each hypothesis test to knit the story together; each hypothesis tells us whether an independent variable has a relationship with the dependent variable.