- Python: Advanced Predictive Analytics
- Ashish Kumar, Joseph Babcock
Chi-square tests
The chi-square test is a statistical test commonly used to compare observed data with the data expected under an assumed hypothesis. In a sense, this is also a hypothesis test: you assume a hypothesis that your data should follow, calculate the expected data according to that hypothesis, and compare it against the data you have actually observed. You calculate the deviation between the observed and expected data using the statistic defined in the following formula:

χ² = Σ (O − E)² / E

Where O is the observed value and E is the expected value, and the summation runs over all the data points.
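As an illustration of the formula, the statistic can be computed directly in Python; the observed and expected counts below are purely hypothetical:

```python
# A minimal sketch: chi-square statistic = sum((O - E)^2 / E) over all data points.
# The observed and expected counts here are hypothetical, just to show the formula.
observed = [48, 35, 17]
expected = [50, 30, 20]

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi_square)  # 0.08 + 0.833... + 0.45 = 1.363...
```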
The chi-square test can be used to do the following things:
- Test whether an input variable and an output variable are independent or associated. We assume that they are independent (the null hypothesis), calculate the expected values under that assumption, and then calculate the chi-square value. If the null hypothesis is rejected, it suggests a relationship between the two variables that is statistically significant rather than just due to chance.
- Check whether the observed data is coming from a fair/unbiased source. If the observed data is heavily skewed towards one extreme compared to the expected data, then it is not coming from a fair source; but if it is very close to the expected values, then it is.
- Check whether the data is too good to be true. Since it is a random experiment, we don't expect the observed values to match the assumed hypothesis exactly. If they match it perfectly, the data has probably been tampered with to make it look good, and is too good to be true.
Let us create a hypothetical experiment where a coin is tossed 10 times. How many times do you expect it to turn up heads or tails? Five, right? Now, what if we toss the coin 1,000 times and record the scores (the number of heads and tails)? Suppose we observe heads 553 times and tails in the remaining 447 trials, while the expected counts for a fair coin are 500 heads and 500 tails.
Let us calculate the chi-square value:

χ² = (553 − 500)²/500 + (447 − 500)²/500 = 5.618 + 5.618 ≈ 11.24
This chi-square value is compared to the value on a chi-square distribution for a given number of degrees of freedom and a given significance level. For a goodness-of-fit test like this one, the number of degrees of freedom is the number of categories minus 1; in this case, it is 2 − 1 = 1. Let us assume a significance level of 0.05.
The chi-square distribution looks a little different from the normal distribution. It also has a peak, but it has a much longer tail than the normal distribution, and that tail is only on one side (the distribution is defined only for non-negative values). As the degrees of freedom increase, it starts to look more like a normal distribution:
Fig. 4.6: Chi-square distribution with different degrees of freedom
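The shape in the figure can be reproduced with a short sketch using scipy and matplotlib; the degrees of freedom chosen below are arbitrary:

```python
# Sketch: plot the chi-square density for a few degrees of freedom to see the
# long right-hand tail and how the shape approaches a normal curve as df grows.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2

x = np.linspace(0, 20, 500)
for df in (1, 3, 5, 10):
    plt.plot(x, chi2.pdf(x, df), label='df = %d' % df)
plt.legend()
plt.xlabel('chi-square value')
plt.ylabel('probability density')
plt.show()
```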
When we look at the chi-square distribution table for 1 degree of freedom and a significance level of 0.05, we get a value of 3.841; at a significance level of 0.01, we get 6.635. In both cases, the calculated chi-square statistic (about 11.24) is greater than the value from the chi-square distribution table, meaning that the chi-square statistic lies to the right of the critical value.
Hence, the null hypothesis is rejected. That means that the coin is not fair.
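The same result can be checked in Python; a minimal sketch using scipy with the 553/447 counts from the experiment above:

```python
# Goodness-of-fit test for the coin-toss counts (553 heads, 447 tails)
# against the 500/500 split expected from a fair coin.
from scipy.stats import chisquare, chi2

stat, p_value = chisquare(f_obs=[553, 447], f_exp=[500, 500])
print(stat)     # ~11.24
print(p_value)  # ~0.0008, well below 0.05, so we reject the null hypothesis

# Critical values for 1 degree of freedom
print(chi2.ppf(0.95, df=1))  # ~3.841 (significance level 0.05)
print(chi2.ppf(0.99, df=1))  # ~6.635 (significance level 0.01)
```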
Fig. 4.7: The null hypothesis is rejected because the critical value at the significance level is less than the calculated chi-square statistic
Let us look at another example where we want to test whether the gender of a student and the subject they choose are independent.
Suppose, in a group of students, the following table represents the number of boys and girls who have taken Maths, Arts, and Commerce as their main subjects.
The expected number of boys and girls in each subject, assuming independence, is as shown in the following table:
On calculating and summing up all the values, the chi-square value comes out to be 5.05. For a contingency table such as this one, the number of degrees of freedom is (number of rows − 1) × (number of columns − 1), which amounts to (3 − 1) × (2 − 1) = 2. Let us assume a significance level of 0.05.
Looking at the chi-square distribution table, one can find that for a chi-square distribution with 2 degrees of freedom, the critical value at a significance level of 0.05 is 5.991.
The calculated chi-square statistic (5.05) < the critical value at significance level 0.05 (5.991).
Since the chi-square statistic lies to the left of the value at the significance level, the null hypothesis can't be rejected. Hence, the choice of subject is independent of gender.
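This kind of independence test can also be run directly in Python with scipy's chi2_contingency; the counts below are a hypothetical stand-in, since the original tables are not reproduced here:

```python
# Sketch of a chi-square test of independence on a hypothetical 3x2 table:
# rows = subjects (Maths, Arts, Commerce), columns = (boys, girls).
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[40, 35],
                     [30, 45],
                     [25, 25]])

stat, p_value, dof, expected = chi2_contingency(observed)
print(stat, p_value, dof)  # dof = (3 - 1) * (2 - 1) = 2
# If p_value > 0.05, we fail to reject the null hypothesis that the
# choice of subject is independent of gender.
```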