# Filling out frequency table for independent events

Given the row and column totals, Sal fills in the cells of a frequency table so that the events are independent. Created by Sal Khan.

## Want to join the conversation?

• I'm kind of confused why I'm completing this question in the very first section when we actually learn about it much later on :/
(38 votes)
• In other words if the pattern/frequency remains the same in every event/condition, then there is nothing in either of those event/conditions that is affecting the pattern or frequency, meaning the events/conditions are independent?
(16 votes)
• I don't buy it. Why exactly 20%? It could be an approximate value that would be near 20%.
(1 vote)
• In this video, Sal is basically setting up the first half of a chi-square test for independence. To do the test, you find the expected frequency for each cell based on what would happen if there were no relationship between the two events (thus, 20%), and you compare these to the observed (actual) frequencies. The null hypothesis is that the events are independent, so 20% is the expected proportion under the null hypothesis, and the chi-square test statistic is based on the differences between the observed and expected frequencies. But you are also correct that, in real data, the 20% would only hold approximately. In other words, the observed proportions don't have to equal 20% exactly: if they are close enough to 20%, we fail to reject the hypothesis that the two events are independent.
(12 votes)
• What IS "categorical data" and where do I find a discussion/explanation of the term?
(4 votes)
• Here "data" just means counting.
And "category" just means something you can observe or ask about.
There are a crazy number of possible categories; if you think about anything for a minute, you can easily imagine lots of them.
Here is a toy example: bicycles.

Let's say you were studying bicycles so you wait at school in the morning to watch who arrives by bicycle. You can count things (categories) about the bicycles: how many "gears" they have (single speed, 3 speed, 10 speed, 16 speed etc), what color they are (red, blue, green etc.), handlebar style (straight, under, over), if they have a water bottle (yes vs no).
Also you can count things about the riders: female vs male, long hair vs short hair, age, teacher vs student. You could also count how many people arrive by bicycle vs. on foot vs. by car and so on.
You could ask the bikers how many days they bike to school (every school day? once per week? twice? etc).

So that is how you get "categorical data" - you just count stuff.
There are a bunch of categories here you could add (did the rider wear glasses? a helmet? a jacket? what kind of tires did the bicycles have? did the bicycle have stickers / decorations? did the riders have a tattoo? It just goes on and on :-) ).

Now... why does anybody care?
Well, maybe you want to persuade more people to ride bicycles.
Knowing who likes bicycles - and who doesn't - may help you decide who to target with your "advertising".
Since there are so many things you could count, researchers have to decide which things are worth counting (where "researcher" is a scientist or statistician or marketer... lots of people get interested in this stuff because they're interested in something else).
Anyway, this was just a toy example but I hope it gives you an idea of "categorical data".
(5 votes)
• By the way, I noticed that if Mom is grouchy exactly 1/5 of the time, rain or shine, then the 3 entries of all 3 columns are in the same ratios (perhaps obvious, but maybe worth noting).
(1 vote)
• Why does

P(mom grouchy | raining) = P(mom grouchy)

imply that

P(mom grouchy | not raining) = P(mom grouchy)?

It seems like it might be obvious but I can't tell what.

Edit: nevermind guys some algebra proves it.
(1 vote)
• why is it so fricking hard and you dont help very much only for one type of problem
(1 vote)
• Why do we calculate the 20% for raining + grouchy and not for not raining + grouchy? If the probability is the same, it wouldn't matter.
(1 vote)
• The 20% chance only applies to whether their mother is grouchy or not (the columns). It doesn't apply to whether or not it's raining that day (the rows). I hope that answers your question.
(1 vote)
• Can anyone tell me what a frequency polygon is?
(1 vote)
• How to find the frequency distribution
(0 votes)

## Video transcript

Voiceover: One rainy Saturday morning, Adam woke up to hear his mom complaining about the house being dirty. "Mom is always grouchy when it rains," Adam's brother said to him. So Adam decided to figure out if this statement was actually true. For the next year, he charted every time it rained and every time his mom was grouchy. What he found was very interesting. Rainy days and his mom being grouchy were entirely independent events. Some of his data are shown in the table below. Fill in the missing values from the frequency table.

Let's see, we have raining days and not raining days and the total days that he kept the data for. And then he tabulated, on the raining days, whether his mom was grouchy or not grouchy, and on the not raining days, whether his mom was grouchy or not grouchy. There's a total of 35 days that it rained and 330 days that it didn't rain. And then 73 times his mom was grouchy and 292 times his mom was not grouchy.

So the first thing is, how do we figure this out? We have these 4 boxes here. It's not clear that we have enough information to fill them out just with this table. But we have to remember what they told us. They told us that his mom being grouchy and it raining were entirely independent events. Another way of saying that... Let me do this in a color that you're more likely to see. Independent events means that the probability that mom is grouchy given it is raining... my pen is acting up a little bit... it shouldn't really matter whether it's raining. It should just be the same thing as the probability of mom being grouchy in general.

So what does that tell us? Well, we can figure out the probability that mom is grouchy in general. She's grouchy 73 out of 365 days. So the probability that mom is grouchy in general is going to be 73 divided by 365.
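
The independence condition described here can be written out in symbols. This is only a sketch: the event labels $G$ (mom is grouchy) and $R$ (it is raining) are introduced for illustration and do not appear in the video. It also shows why the same probability must then hold on days when it is not raining, assuming $P(R) < 1$.

$$
\begin{aligned}
P(G \mid R) = P(G) \;&\Longrightarrow\; P(G \cap R) = P(G)\,P(R), \\
P(G \mid R^{c}) &= \frac{P(G) - P(G \cap R)}{1 - P(R)}
                 = \frac{P(G)\,\bigl(1 - P(R)\bigr)}{1 - P(R)}
                 = P(G), \\
\hat{P}(G) &= \frac{73}{365}.
\end{aligned}
$$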
And so this is just based on the data we have. That's the best estimate of the probability that mom is grouchy: it's the percentage of days that she's been grouchy. So that is 0.2. So based on the data, the best estimate of the probability of mom being grouchy is 0.2, or 20%. And so the probability of mom being grouchy given that it's raining should be 20% as well. So given that it's raining, we should also have mom grouchy 20% of the time, because these are independent events. It shouldn't matter whether it's raining or not. She should be grouchy 20% of the time that it's raining and she should be grouchy 20% of the time that it's not raining. That's what would be consistent with the data saying that these were entirely independent events.

So what is 20% of 35? Well, 20% is 1/5th. 1/5th of 35 is 7. And once again, all I did is I said 20% of 35 is 7. And if that's 7, then 35 minus 7, that's gonna be 28 right over there. And then if this is 7, then 73 minus 7 is going to be 66. And 330... I guess there's a couple of ways we could do it. Actually we could just take 292 minus 28, which is going to be... Let's see, 292 minus 8 would be 284. Minus another 20, 264. And do the numbers all add up? Yes. 66 plus 264 is 330.

So the key realization here is what he's saying he found was very interesting: rainy days and his mom being grouchy were entirely independent events. That means that the probability of his mom being grouchy shouldn't depend on whether it's raining or not. It should be the same probability whether it's raining or not. And our best estimate of the probability of his mom being grouchy, based on the total days, is 20%. And so if the data's backing up that these are independent events, then the best way to fill this out would be for the probability of his mom being grouchy on a rainy day or a not rainy day to be the same. And that's what we filled out right over here.
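
The fill-in rule from the transcript can be sketched in Python. This is a minimal sketch, not something shown in the video: the function name `fill_independent_table` and the dictionary labels are hypothetical, and it assumes exact independence, so every cell equals (weather total × mood total) / grand total. Exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction


def fill_independent_table(weather_totals, mood_totals):
    """Return the cell counts implied by exact independence.

    Hypothetical helper for illustration: each cell is
    weather_total * mood_total / grand_total.
    """
    grand_total = sum(weather_totals.values())
    if grand_total != sum(mood_totals.values()):
        raise ValueError("both sets of totals must add up to the same grand total")
    return {
        (weather, mood): Fraction(w * m, grand_total)
        for weather, w in weather_totals.items()
        for mood, m in mood_totals.items()
    }


# Totals taken from the transcript.
weather_totals = {"raining": 35, "not raining": 330}
mood_totals = {"grouchy": 73, "not grouchy": 292}

table = fill_independent_table(weather_totals, mood_totals)
for (weather, mood), count in table.items():
    print(f"{weather:>12} | {mood:<12} | {count}")

# Under independence, the grouchy share is the same in every weather condition.
p_grouchy = Fraction(mood_totals["grouchy"], sum(mood_totals.values()))
p_grouchy_rain = table[("raining", "grouchy")] / weather_totals["raining"]
p_grouchy_dry = table[("not raining", "grouchy")] / weather_totals["not raining"]
```

With the transcript's totals this gives the same four cells Sal fills in (7, 28, 66, and 264), and all three of the grouchy proportions at the end equal 1/5.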