Having recently renewed my effort learning statistics, I came across the Chi-Squared Test for Independence (Chi-Squared Test).
The Chi-Squared Test was positioned as a natural follow-up to Spearman’s Rank Correlation Coefficient (Spearman’s Rank Test) when comparing categorical variables. This ignited a fire in me, driven by confusion, contradictions and questions. (Being detail oriented, I pick up on small irregularities, which is often useful but occasionally sends me on a wild goose chase. This article is a bit of both.)
What confused me was the use of the word ‘Independence’. If we have a test for independence (and therefore, by extension, for dependence / causation), then why are we using Spearman’s Rank Correlation Coefficient, a test for mere correlation? Why is this test positioned after the Spearman’s Rank Test instead of before it? Surely dependence is more important than correlation? And how can Spearman’s Rank report zero correlation while a Chi-Squared Test shows dependence between the same variables?
It all goes back to basics — correlation vs. causation
Correlation — When one variable changes, the other tends to change with it. This can be because they are dependent, OR for some other reason (e.g. they are both dependent on a third variable).
Independence — The value of one variable has no effect whatsoever on the value of the other variable.
Dependence — The opposite of independence and a synonym for causation. Most articles talk about independence vs. correlation, but I find it more intuitive to contrast correlation with dependence, since those are the two concepts that are so easily conflated.
It’s important to note that most of the time, statistical tests are limited to showing correlation / association. Proving causation / dependence requires a randomised controlled experiment, e.g. an A/B test or the kind of trial seen in medicine.
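The “third variable” case from the correlation definition above is easy to simulate. In this toy sketch (the setup is entirely made up), a hidden variable `z` drives both `x` and `y`, so they correlate strongly even though neither has any effect on the other:

```python
import numpy as np

rng = np.random.default_rng(0)

# z is the hidden common cause; x and y each follow z plus their own noise.
z = rng.normal(size=10_000)
x = z + rng.normal(scale=0.5, size=10_000)
y = z + rng.normal(scale=0.5, size=10_000)

# x and y are strongly correlated, yet changing x would do nothing to y.
r = np.corrcoef(x, y)[0, 1]
print(f"correlation(x, y) = {r:.2f}")
```

No test of x against y alone can distinguish this situation from genuine dependence, which is why an experiment (intervening on one variable) is needed.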
So, how can the Chi-Squared Test show independence / dependence? It can’t. It simply has a misleading name. The Chi-Squared Test can only tell us whether two variables are associated with one another. I.e. when one variable changes, it looks like the other one changes too (for some reason currently unknown). It does not necessarily imply that one variable has any causal effect on the other. Establishing causality would require a more detailed analysis.
Let’s take this one step further — the Chi-Squared Test doesn’t even show correlation, only association. To understand this, it helps to refine the definition of correlation: correlation measures only linear relationships (or, in Spearman’s case, monotonic ones). Association is broader — any systematic relationship between the variables counts.
Chi-Squared is a test used on categorical variables that shows whether an association exists or not. The reason it follows the Spearman’s Rank Test is that it makes fewer assumptions. Spearman’s Rank also applies to categorical variables, but only ordinal ones (the categories must be rankable), and it assumes the relationship is monotonic (as one variable increases, the other consistently increases or decreases), whereas the Chi-Squared Test makes neither assumption.
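The puzzle from earlier — zero Spearman correlation alongside a significant Chi-Squared result — can be sketched with made-up data. Below, an ordinal variable `x` and a binary outcome `y` have a V-shaped (non-monotonic) relationship: `y` is “high” mostly when `x` is in the middle category. Spearman’s Rank sees no monotonic trend, while the Chi-Squared Test flags a strong association:

```python
import numpy as np
from scipy.stats import spearmanr, chi2_contingency

# x: ordinal category 0, 1, 2; y: 0 = "low", 1 = "high". Invented counts:
# y is mostly "high" when x == 1 and mostly "low" otherwise.
x = [0] * 40 + [1] * 40 + [2] * 40
y = [0] * 35 + [1] * 5 + [0] * 5 + [1] * 35 + [0] * 35 + [1] * 5

rho, _ = spearmanr(x, y)

# The same data as a 3x2 contingency table of counts.
table = np.array([[35, 5], [5, 35], [35, 5]])
chi2, p, dof, _ = chi2_contingency(table)

print(f"Spearman rho = {rho:.3f}")  # near 0: no monotonic trend
print(f"Chi-squared p = {p:.2e}")   # tiny: strong association
```

This is exactly the scenario that confused me: both results are correct, they just answer different questions.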
So there you have it. For me, these findings reinforced the message that in statistics (and science in general for that matter) we are always making assumptions and approximations. It is rare that we are actually dealing with reality.
For further reading and a more detailed look at the math: