
© 2023 Khan Academy

# Introduction to the chi-square test for homogeneity


## Want to join the conversation?

- Hi, could someone explain how the chi-square test for homogeneity is conceptually DIFFERENT from the two-sample inference for the difference between groups? I am having a hard time conceptually wrapping my mind around how this test is different from the one that we learned before. Thank you! (9 votes)
- The first difference is that chi-square tests are used for CATEGORICAL variables, whereas two-sample t procedures for means use QUANTITATIVE variables. Another difference is that the chi-square test for homogeneity compares the observed counts to the counts EXPECTED under the null hypothesis, and the statistic, the sum of (observed-expected)^2/expected, is based on CELL COUNTS, not means. On the other hand, a 2-sample t or z test is used to see whether the means (or proportions) of 2 separate groups are equal, or whether one is greater or smaller than the other. (9 votes)

- Does someone have a resource that talks in detail about degrees of freedom? I understand what it is, but I don't exactly get why it's applicable in many situations. (4 votes)
- The explanation at 6:08 here is fairly intuitive (while brief):

https://www.khanacademy.org/math/ap-statistics/chi-square-tests/chi-square-goodness-fit/v/chi-square-statistic (2 votes)

- Why do we calculate just one `χ²` that includes both the data for the left-handed and right-handed people? Coming from the previous videos, I would think we would have to compute two `χ²`'s, one for the right-handed data and one for the left-handed, and then compare those by taking the difference between them. (4 votes)
- Is there a video/playlist explaining at length the reason/s for the large expected counts and 10% sample requirements? (1 vote)
- Is this 'special' Chi-square test considered a non-parametric test? Why or why not? (2 votes)
- How would I go about calculating the expected value if the null hypothesis was "There IS a difference" instead of "no difference"? (1 vote)
- Hi Ramon,

That is because there is only one null hypothesis, but so many alternative hypotheses that it's difficult to integrate them into one and calculate. Also, you need to know what the probability of the null hypothesis being false is in order to calculate the positive predictive value, for example. Have a look here if you haven't already - https://www.khanacademy.org/math/statistics-probability/significance-tests-one-sample/idea-of-significance-tests/v/simple-hypothesis-testing.

Hope that helps! :)

Evelyn (2 votes)

- Wait, chi is the 𝛘, not kai? (1 vote)
- In case of finding the expected counts, why do we consider the column total and not the row total?

I mean, instead of saying that from a sample of 100 people 40 prefer STEM, so from 60 people who are right-handed 40% of them are expected to prefer STEM, we can say from 100 people 60 are right-handed, so from 40 people who prefer STEM 60% of them are expected to be right-handed. (1 vote)
- If I'm understanding this correctly, the "expected" result essentially creates a baseline (or a baseline for each population). The test then calculates the distance each population has from that baseline.

If the distance is small enough to not be significant (i.e. less than p-value), then the samples can be deemed to be homogeneous, i.e. the populations being tested don't have an impact on the variables being looked at.

Is this correct? (1 vote)

## Video transcript

We've already been introduced to the chi-squared statistic in other videos. Now we're going to use it for a test for homogeneity. And homogeneity, in everyday language, means how similar things are, and that's what we're essentially going to test here. We're gonna look at two different groups and see whether the distributions of those groups for a certain variable are similar or not.

And so the question we're going to think about together in this video is: let's say we were thinking about left-handed versus right-handed people, and we're wondering, do they have the same preferences for subject domains? Are they equally inclined toward STEM (science, technology, engineering, math), humanities, or neither?

And so we can set up our null and alternative hypotheses. Our null hypothesis is that there is no difference in the distribution between left-handed and right-handed people in terms of their preference for subject domains. So, no difference in subject preference for left and right-handed folks. And then the alternative hypothesis: well, no, there is a difference.

So how would we go about testing this? Well, we've done hypothesis
testing many times in many videos already. But here, we're going to sample from two different groups. So let's say that this is the population of right-handed folks and this is the population of left-handed folks. Let's say from that population of right-handed folks, I take a sample of 60, and then I do the same thing for the left-handed folks. And these don't even have to be the same sample sizes, so for the left-handed folks, let's say I sample 40 folks.

And here is the data that I actually collect. For those 60 right-handed folks, 30 of them preferred the STEM subjects (science, technology, engineering, math), 15 preferred humanities, and 15 were indifferent; they liked them equally. And then, for the 40 left-handed folks, I got 10 preferring STEM, 25 preferring humanities, and 5 who viewed them equally. And then you see the total number of right-handed folks, the total number of left-handed folks, and then the total number from both groups that preferred STEM, the total number from both groups that preferred humanities, and the total from both groups that had no preference.

So let's just start thinking
about what the expected data would be if we're assuming that the null hypothesis is true, that there's no difference in preference between right and left-handed folks. This is the right-handed column. This is the left-handed column.

Well, assuming that the null hypothesis is true, that there's no difference between right and left-handed people in terms of their preference, our best estimate of what the distribution of preference would be in the population generally would come from this total column. Since we're assuming no difference, we would assume that in either group, 40 out of every 100, or 40 percent, would prefer STEM, 40 percent would prefer humanities, and 20 percent would have no preference.

And so our expected would be that 40 percent of the 60 right-handed folks would prefer STEM. So what's 40 percent of 60? 0.4 times 60 is 24. And similarly, we would expect 40 percent preferring humanities: 40 percent times 60 is 24 again. And then we would expect 20 percent of the right-handed group to have no preference, so 20 percent of 60 is 12. And these, once again, add up to 60.

And then for the left-handed folks, we would go through the same process. We would expect that 40 percent of them prefer STEM: 40 percent of 40 is 16. On the humanities, again, 40 percent of 40 is 16. And for equal, 20 percent of 40 is 8. And then, all of these add up to 40.

Once you calculate these expected values, it's a good time to make sure
you're meeting your conditions for conducting a chi-squared test. The first is the random condition: these need to be truly random samples, so hopefully we met that condition. The second is that the expected value for each of these data points has to be at least five. And we have met that condition; these are all at least five. And then the last condition is the independence condition: either we are sampling with replacement or, if we're not sampling with replacement, we have to feel good that our samples are no more than 10 percent of the population. So let's assume that that is the case as well.

And now, we're ready to calculate our chi-squared statistic. Our chi-squared statistic is going to be equal to the difference between what we got and the expected, squared, divided by the expected. So 30 minus 24, squared, divided by 24. And we'll do it for all six of these data points. So then, I will go to the next one: plus, if I look at this and this here, 10 minus 16, squared, over expected, 16. And then I'll look at that data point and that expected, and I will get plus 15 minus 24, squared, over expected, over 24. I'm running out of colors. And then we would look at those two numbers and we would say, plus 25 minus 16, squared, divided by expected, 16. And then we would look at these two: plus 15 minus 12, squared, over expected, over 12. And then, last but not least, lemme find a color I haven't used. We would look at that and that and we would say, plus five minus eight, squared, over expected, over eight.

Now, once you get that value
for the chi-squared statistic, the next question is, what are the degrees of freedom? Now a simple rule of thumb is to just look at your data and think about the number of rows and the number of columns, and we have three rows and two columns. And so your degrees of freedom are going to be the number of rows minus one, three minus one, times the number of columns minus one, two minus one. And so this is going to be equal to two times one, which is equal to two.

Now the reason why that makes intuitive sense is, think about it: if you knew two of these data points, and if you knew all of the totals, then you could figure out the other data points. If you knew these two data points, you could figure out that. If you knew this data point and you knew the total, you could figure out that. If you knew this data point and you knew the total, you could figure out that. And if you figured out that and that, then you could figure out this right over here. And so that's why this rule of thumb works. The number of rows minus one times the number of columns minus one gives you your degrees of freedom.

Now, given this chi-squared statistic, which I haven't calculated but you could type this into a calculator and figure it out, and this degrees of freedom, we could then figure out the P-value. We could figure out the probability of getting a chi-squared statistic this extreme or more extreme. And if this is less than our significance level, which we should have set ahead of time, then we would reject the null hypothesis, and it would suggest the alternative. If this is not less than our significance level, then it does not allow us to reject the null hypothesis.
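
The statistic is left uncalculated in the video, so here is a minimal Python sketch of the steps the transcript describes: expected counts, the large-counts check, the chi-squared statistic, the degrees of freedom, and the P-value. It is an illustration only, not part of the video. The variable names are made up for this example, and the P-value line relies on an assumption worth flagging: the closed form exp(-x/2) is the upper-tail probability of a chi-squared distribution only when the degrees of freedom equal 2, which happens to be the case for this table.

```python
import math

# Observed counts from the transcript.
# Rows are subject preferences; columns are (right-handed, left-handed).
observed = {
    "STEM":       (30, 10),
    "Humanities": (15, 25),
    "Equal":      (15, 5),
}

rows = list(observed.values())
row_totals = [sum(row) for row in rows]           # totals per preference
col_totals = [sum(col) for col in zip(*rows)]     # sample sizes per group
grand_total = sum(row_totals)

# Expected count under the null hypothesis:
# row total * column total / grand total, which is the same as applying
# the pooled percentage for a preference to each group's sample size.
expected = [
    [r_tot * c_tot / grand_total for c_tot in col_totals]
    for r_tot in row_totals
]

# Large-counts condition: every expected count should be at least 5.
large_counts_ok = all(e >= 5 for exp_row in expected for e in exp_row)

# Chi-squared statistic: sum of (observed - expected)^2 / expected
# over all six cells.
chi_sq = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(rows, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# ASSUMPTION: this shortcut is valid only for df == 2.
if df != 2:
    raise ValueError("closed-form P-value below only holds for df == 2")
p_value = math.exp(-chi_sq / 2)

print("expected counts:", expected)
print("large counts condition met:", large_counts_ok)
print("chi-squared:", chi_sq, "df:", df, "P-value:", p_value)
```

With the data from the video, the statistic comes out to 14.0625 on 2 degrees of freedom, and the P-value is a little under 0.001, so at common significance levels such as 0.05 the null hypothesis would be rejected. For tables of other sizes the shortcut does not apply; a library routine such as `scipy.stats.chi2_contingency` computes the same quantities in general.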