Statistics and probability
Large sample proportion hypothesis testing
Sal uses a large sample to test if more than 30% of US households have internet access. Created by Sal Khan.
- Can a Bernoulli distribution be approximated by a Normal distribution (if np > 5 and n(1-p) > 5)? Can someone point out the logic here? It seems that I have missed some formula or something.(11 votes)
- As far as I am aware, these conditions are subjective. They are used because someone looked at the exact vs. approximate probabilities, and decided that when these conditions were satisfied, the results were close enough. I have never once seen a mathematical justification of those conditions. I would be delighted to be proven wrong, but my (admittedly sparse) searching has turned up nothing.
Also relevant: I have seen the conditions require np ≥ 10 or even np ≥ 15 (and similarly for n(1-p)), which again makes me think these are subjective rules. Probably it should depend on the context: if your application requires greater precision, then you should probably insist on a larger number (like 10 or 15) before using the Normal Approximation to the Binomial Distribution.
Actually, if you need higher precision, you should probably just use exact probabilities. Using the Normal distribution to approximate the Binomial distribution was more important before there was computing power to evaluate large factorials. These days, we have powerful computers that can give us exact Binomial probabilities even for large sample sizes.(12 votes)
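As a rough illustration of the point about exact probabilities, here is a minimal Python sketch (not from the video) that computes the exact binomial tail for the video's numbers, 57 or more households out of 150 when p = 0.3, next to the normal approximation. The function names are made up for this example, and the normal version deliberately uses no continuity correction, to mirror the video's approach.

```python
from math import comb, sqrt
from statistics import NormalDist


def binom_upper_tail(k: int, n: int, p: float) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def normal_upper_tail(k: int, n: int, p: float) -> float:
    """Normal approximation to P(X >= k), without continuity correction."""
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    return 1 - NormalDist(mean, sd).cdf(k)


n, k, p0 = 150, 57, 0.30
exact = binom_upper_tail(k, n, p0)
approx = normal_upper_tail(k, n, p0)
print(f"exact binomial tail:  {exact:.4f}")
print(f"normal approximation: {approx:.4f}")
```

Both values fall below 0.05 here, so the two approaches lead to the same decision for this sample.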
- Why shouldn't we calculate it in terms of our sample and take the confidence interval approach? So we calculate the sample mean (.38) and sample variance (.3835), and then estimate the standard deviation of our sampling distribution of the sample mean (.0313) using our sample variance. Then we'd see that a mean of .3 is about 2.55 standard deviations away from a mean of .38, which leads us to reject the null hypothesis. Can we do that?(8 votes)
- I also solved the problem from the sample point of view. It is perfectly fine to solve it in that manner. Sal's method is preferable to the sample approach because the test is more accurate when we calculate the population SD under the null hypothesis than when we estimate the population SD from the sample.(2 votes)
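To make the comparison in this thread concrete, here is a small Python sketch (not from the video, with illustrative variable names) that computes the standard error both ways: from the null-hypothesis proportion 0.3, as in the video, and from the sample proportion 0.38. It assumes the sample variance of a 0/1 variable is taken as p_hat * (1 - p_hat), so the figures may not match the ones quoted in the question.

```python
from math import sqrt

n, successes, p0 = 150, 57, 0.30
p_hat = successes / n  # sample proportion

# Standard error under the null hypothesis (the video's approach).
se_null = sqrt(p0 * (1 - p0) / n)
# Standard error estimated from the sample (the approach in the question).
se_sample = sqrt(p_hat * (1 - p_hat) / n)

z_null = (p_hat - p0) / se_null
z_sample = (p_hat - p0) / se_sample

print(f"SE from null p0 = {se_null:.4f}, z = {z_null:.2f}")
print(f"SE from sample  = {se_sample:.4f}, z = {z_sample:.2f}")
```

Under these assumptions both statistics exceed the one-tailed 5% cutoff of about 1.645, so either route rejects the null hypothesis for this sample.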
- Do these types of problems require a random or representative sample of the population? If so, are there some kind of rules for collecting this sample to make sure it is actually representative of the entire population?(4 votes)
- The rule for it to be most representative of the population is in fact that it is random. In a Data Analysis course at university we spent most of our time understanding how easy it is to compromise the randomness of a sample in "real life". There is tons of material on data-generating processes; however, it is outside the scope of introductory statistics. For example, think of an online survey a supermarket gathers from its customers. This collects data only from people who have a computer: can it still produce an unbiased set of information? It could, for example, exclude a higher percentage of old people than of young people.(4 votes)
- How is that a null hypothesis? Aren't both of those directional hypotheses?(5 votes)
- Null hypotheses can be directional. The null hypothesis essentially states that whatever you are looking to find, you haven't found it. So, if your question is such that it is making a directional claim, the null hypothesis will also be directional.(2 votes)
- At about half-way through, Sal mentions a test to indicate if the Bernoulli distribution can be approximated by a Normal distribution (if np >5 & n(1-p) >5). Could someone elaborate on this test?(3 votes)
- Bernoulli distributions are situations where there are 2 options.
Example 1 - You have Internet access, OR, you don't have Internet access.
Example 2 - A person is taller than 5' 8", OR a person is not taller than 5'8"
Example 3 - A car is black, white or grey, OR a car is NOT black white or grey.
In each case, the probabilities of the two options ALWAYS add to 1, and they can be written as "p" and "1-p".
Examples with MADE UP DATA
Example 1 p=.38 ; 1-p = 1-.38 = .62.
Example 2 p=.82; 1-p = 1-.82 = .18
Example 3 p=.75; 1-p = 1-.75 = .25.
Bernoulli distributions are excellent for computer programming and electrical circuits, where something is either TRUE or NOT-TRUE, or ON or NOT-ON, which is just another way of saying TRUE or FALSE, or ON or OFF, modeled in binary as 0 or 1.
At 4:11 he is testing whether .30 HAVE and (1-.3) don't have, that is, .3 OR .7.
Hope that makes a little more sense.(6 votes)
- When do we use the sqrt((n*p*(1-p))/(n-1)) formula to get the standard deviation of the sampling distribution of the sample mean, and when do we use the sqrt(p*(1-p)) formula as used in this video?(5 votes)
- At 6:02, when you talk about the success-failure rule, shouldn't it be n(p)>10 and n(1-p)>10, not 5?(2 votes)
- I've seen both recommended, and also both using ≥ instead of >. The idea is basically just something to try and "help out" the Central Limit Theorem.
With numeric data, if some random variable X is normally distributed, then xbar will be normally distributed no matter the sample size. If X isn't quite normal, but close to normal, then xbar will be approximately normal even if n is fairly small. The more non-normal X is, the larger the sample size must be in order to "force" the sampling distribution of xbar to be normal.
This is the same sort of idea with the Normal Approximation to the Binomial distribution. If the probability is p=0.5, the distribution will be very symmetric, and we won't need so many observations before phat is roughly normal. If p is more extreme (closer to 0 or 1), then the distribution will be more skewed, and the sample size will need to be larger to overcome this.
The exact value, 5 or 10, etc, I'm not sure where they come from, but I've never seen any sort of "derivation" of them. My guess is that they are just based on some simulation study. And at the end of the day, regardless of which rule we use, we are only getting an approximation anyway. Back in the day, when computers were less advanced and hence it was more difficult to compute a lot of Binomial probabilities (or large combinations, etc), these approximations were more important. With the development of more advanced technology, we're able to bypass the need for them altogether. For instance, we could perform this same test using the Binomial distribution directly, instead of approximating it with the Normal distribution.(4 votes)
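The point about skew can be checked with a few lines of Python. This is only a sketch under stated assumptions: n = 20 is an arbitrary choice, the normal approximation uses a 0.5 continuity correction, and `max_cdf_error` is a made-up helper name. It reports the largest gap between the exact Binomial CDF and the approximating Normal CDF for a symmetric p and for more extreme ones.

```python
from math import comb, sqrt
from statistics import NormalDist


def max_cdf_error(n: int, p: float) -> float:
    """Largest gap between the exact Binomial(n, p) CDF and its normal
    approximation (with a 0.5 continuity correction) over k = 0..n."""
    normal = NormalDist(n * p, sqrt(n * p * (1 - p)))
    exact_cdf = 0.0
    worst = 0.0
    for k in range(n + 1):
        exact_cdf += comb(n, k) * p**k * (1 - p) ** (n - k)
        worst = max(worst, abs(exact_cdf - normal.cdf(k + 0.5)))
    return worst


for p in (0.5, 0.2, 0.05):
    print(f"n=20, p={p}: np={20 * p:.0f}, max CDF error={max_cdf_error(20, p):.4f}")
```

With p = 0.05 (so np = 1) the gap is much larger than with p = 0.5 (np = 10), which is the behaviour the np and n(1-p) rules of thumb are trying to guard against.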
- So what happens if np or n(1-p) is smaller than 5? We can't assume it's normal. So what can we do to solve the problem? Use t-statistics?(3 votes)
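- Switching to the t-table does not help here. The t-distribution is meant for the case where the population standard deviation is estimated from a roughly normal sample; it does not repair a skewed binomial. When np or n(1-p) is below 5, the normal approximation is unreliable, so the safer route is to compute the probability directly from the binomial distribution.(2 votes)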
- why are we only doing a one tailed test?(2 votes)
- It's because of how the starting question is phrased. Since we're trying to see if "more than 30% of U.S. households" are online, we're only interested in the "upper" tail of the curve - if far less than 30% are connected, we still want to retain the null hypothesis.(1 vote)
- Why did Sal use the Z-table instead of the T-table with such a small standard deviation? Is it because we do not have a mean?(1 vote)
- We use t-distribution when we approximate the standard deviation of the population with the standard deviation of the sample.
In the case of proportions we can use z-scores, since we calculate the population standard deviation directly from the formula, without referring to the sample standard deviation.(2 votes)
We want to test the hypothesis that more than 30% of U.S. households have internet access with a significance level of 5%. We collect a sample of 150 households, and find that 57 have access. So to do our hypothesis test, let's just establish our null hypothesis and our alternative hypothesis. Our null hypothesis is that the hypothesis is not correct: the proportion of U.S. households that have internet access is less than or equal to 30%. And our alternative hypothesis, which is what our hypothesis actually is, is that the proportion is greater than 30%. We see it over here. We want to test the hypothesis that more than 30% of U.S. households have internet access. That's that right here. This is what we're testing. We're testing the alternative hypothesis.
And the way we're going to do it is we're going to assume a value of p, a population proportion, based on the null hypothesis. And then, given that assumption, what is the probability of getting a result like ours, where 57 out of the 150 households in our sample have internet access? And if that probability is less than 5%, if it's less than our significance level, then we're going to reject the null hypothesis in favor of the alternative one.
So let's think about this a little bit. We're going to start off assuming the null hypothesis is true. And under that assumption we're going to have to pick a population proportion or a population mean; we know that for Bernoulli distributions they're the same thing. And what I'm going to do is pick the highest proportion the null hypothesis allows, so that it maximizes the probability of getting this over here. And we actually don't even know what that number is yet. So that we can think about it a little more intelligently, let's just find out what our sample proportion even is. We had 57 households out of 150 having internet access. So our sample proportion is 0.38, so let me write that over here.
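For reference, the setup described so far can be written compactly as follows. The symbols are assumed notation, not something shown in the transcript: p for the population proportion, p-hat for the sample proportion, and alpha for the significance level.

```latex
\begin{aligned}
H_0 &: p \le 0.30 \\
H_1 &: p > 0.30 \\
\alpha &= 0.05 \\
\hat{p} &= \frac{57}{150} = 0.38
\end{aligned}
```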
Our sample proportion is equal to 0.38. So when we assume our null hypothesis to be true, we're going to assume a population proportion that maximizes the probability that we get this over here. And the highest population proportion that's within our null hypothesis, the one that will maximize the probability of getting this, is right at 30%. So we're going to assume our population proportion is 0.3, or 30%. This is our null hypothesis. And I want you to understand that 29% would also have satisfied the null hypothesis. 28% would have too. But for 29% or 28%, the probability of getting this would have been even lower. So it wouldn't have been as strong of a test. If we take the maximum proportion that still satisfies our null hypothesis, we're maximizing the probability that we get this. So if that number is still low, if it's still less than 5%, we can feel pretty good about the alternative hypothesis.
So just to refresh ourselves, we're going to assume a population proportion of 0.3. Sometimes it's helpful to draw these things, so I will draw it. This is what the population distribution looks like based on this assumption right over here. 30% have internet access, and I'll signify that with a 1. And then the rest don't have internet access. 70% do not have internet access. This is just a Bernoulli distribution. We know that the mean over here is going to be the same thing as the proportion that has internet access. So the mean over here is going to be 0.3, same thing as 30%. This is the population mean. And maybe I should write it this way: the population mean assuming our null hypothesis is 0.3. And then the population standard deviation. Let me write this over here in yellow.
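The claim that 0.3 gives the null hypothesis its best chance can be checked numerically. The sketch below is not part of the video: it uses exact binomial probabilities rather than the normal approximation, and `upper_tail` is a hypothetical helper name. It computes the chance of seeing 57 or more households out of 150 for a few proportions allowed by the null hypothesis.

```python
from math import comb


def upper_tail(k: int, n: int, p: float) -> float:
    """Exact P(X >= k) when X counts successes in n trials with success rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Probability of at least 57 successes in 150 trials for several null proportions.
tails = {p: upper_tail(57, 150, p) for p in (0.28, 0.29, 0.30)}
for p, prob in tails.items():
    print(f"p = {p:.2f}: P(at least 57 of 150) = {prob:.4f}")
```

The probability grows as p moves up toward 0.3, so if the result is unlikely at 0.3 it is even less likely at every smaller proportion.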
The population standard deviation assuming our null hypothesis. And we've seen this when we first learned about Bernoulli distributions. It is going to be the square root of the proportion of the population that has internet access, 0.3, times the proportion of the population that does not have internet access, 0.7 right over here. So this is the square root of 0.21. And we could deal with this later using our calculator.
Now, with that out of the way, we want to figure out the probability of getting a sample proportion of 0.38. So let's look at the distribution of sample proportions. You could literally look at every combination of getting 150 households from this population, and you would actually get a binomial distribution. And we've also seen this before. You'd get a bunch of bars like that. But if your n is suitably large, and this is kind of the test for it, if n times p is greater than 5 (and in this case we're saying p is 30%), and n times 1 minus p is greater than 5, you can assume that the distribution of the sample proportion is going to be approximately normal. So if you looked at all of the different ways you could sample 150 households from this population, you'd get all of these bars. But our n is pretty big, it's 150, and 150 times 0.3 is obviously greater than 5. 150 times 0.7 is also greater than 5. So you can approximate that with a normal distribution. So let me do that. This is a normal distribution right over there.
Now the mean of the distribution of sample proportions, which we're assuming is a normal distribution, and remember, we're working under the context that the null hypothesis is true, so this mean right here, the mean of our sample proportions, is going to be the same thing as our population mean.
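The check described above is easy to script. This is a minimal sketch, not something from the video: `normal_approx_ok` is a made-up helper, and its default threshold of 5 follows the rule as stated here, although some texts use 10 instead.

```python
from math import sqrt


def normal_approx_ok(n: int, p: float, threshold: float = 5.0) -> bool:
    """Rule-of-thumb check: both n*p and n*(1-p) must exceed the threshold."""
    return n * p > threshold and n * (1 - p) > threshold


n, p0 = 150, 0.30
print(f"n*p = {n * p0:.1f}, n*(1-p) = {n * (1 - p0):.1f}")
print("normal approximation acceptable:", normal_approx_ok(n, p0))

# Standard deviation of the assumed Bernoulli population: sqrt(0.3 * 0.7).
bernoulli_sd = sqrt(p0 * (1 - p0))
print(f"population standard deviation under the null: {bernoulli_sd:.4f}")
```

Passing `threshold=10` applies the stricter version of the rule; the video's numbers satisfy that one as well.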
So this is going to be 0.3, same value as that. And the standard deviation, this comes straight from the central limit theorem. The standard deviation of our sample proportions is going to be our population standard deviation, the standard deviation we're assuming with our null hypothesis, divided by the square root of the sample size. And in this case our sample has 150 households, and we can calculate this. This value on top we just figured out is the square root of 0.21. So this is the square root of 0.21 over the square root of 150. And I can get the calculator out to calculate this. So I'll just do it the way I wrote it: the square root of 0.21, and whatever that answer is, I'm going to divide it by the square root of 150. So it's 0.037. So the standard deviation of the distribution of our sample proportions is going to be, let me write this down, I'll scroll over to the right a little bit, 0.037. I think I'm falling off the screen a little bit. So we'll just say 0.037.
Now to figure out the probability of having a sample proportion of 0.38, we just have to figure out how many standard deviations that is away from our mean, or essentially calculate a Z-statistic for our sample, because a Z-statistic or a Z-score is really just how many standard deviations you are away from the mean. And then figure out whether the probability of getting that Z-statistic is more or less than 5%. So let's figure out how many standard deviations we are away from the mean. Just to remind ourselves, this sample proportion we got we can view as just one sample from this distribution of all of the possible sample proportions. So how many standard deviations away from the mean is this?
So if we take our sample proportion, subtract from that the mean of the distribution of sample proportions, and divide it by the standard deviation of the distribution of the sample proportions, we get 0.38 minus 0.3, all of that over this value which we just figured out was 0.037. So what does that give us? The numerator over here is 0.08. The denominator is 0.037. So let's figure this out. We take 0.08 and divide it by that last answer on the calculator, the unrounded 0.037, and we get, I'll just round it, 2.14 standard deviations. Or we could say that our Z-statistic, we could call this our Z-score or our Z-statistic, the number of standard deviations we are away from our mean, is 2.14. And to be exact, we're 2.14 standard deviations above the mean. We're going to care about a one-tailed test.
Now is the probability of getting this more or less than 5%? If it's less than 5% we're going to reject the null hypothesis in favor of our alternative. So how do we think about that? Well let's think about just a normalized normal distribution. Or maybe you could call it the Z-distribution if you want. If you look at a completely normalized normal distribution, its mean is at 0. And each of these values is essentially a Z-score, because a value of 1 literally means you are 1 standard deviation away from this mean over here. So we need to find a critical Z-value right over here, we could even say critical Z-score, so that the probability of getting a Z-value higher than that is 5%. So that this whole area right here is 5%. And that's because that's what our significance level is. Anything that has a lower than 5% chance of occurring will, for us, be justification to reject our null hypothesis.
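Written out, the Z-statistic computed earlier in this passage looks like this. The notation is assumed rather than taken from the video: p_0 for the proportion under the null hypothesis, p-hat for the sample proportion, and sigma with a p-hat subscript for the standard deviation of the sample proportions.

```latex
\begin{aligned}
\sigma_{\hat{p}} &= \sqrt{\frac{p_0(1-p_0)}{n}} = \frac{\sqrt{0.21}}{\sqrt{150}} \approx 0.0374 \\
z &= \frac{\hat{p} - p_0}{\sigma_{\hat{p}}} = \frac{0.38 - 0.30}{0.0374} \approx 2.14
\end{aligned}
```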
Or another way of thinking about it: if that area is 5%, this whole area right over here is 95%. And once again, this is a one-tailed test, because we only care about values greater than this. Z-values greater than that will make us reject the null hypothesis. And to figure out this critical Z-value you can literally just go to a Z-table. We say OK, the probability of getting a Z-value less than that is 95%. And that's exactly the number that this table gives us: the cumulative probability of getting a value less than that. So if we just scan this, we're looking for 95%. We have 0.9495, we have 0.9505. So I'll go with 0.9505. The row for this Z-value is 1.6, and the next digit is 5, so 1.65. So this critical Z-value is equal to 1.65. So in a completely normalized normal distribution, the probability of getting a value less than 1.65 is 95%. Or in any normal distribution, the probability of being less than 1.65 standard deviations above the mean is going to be 95%. So that's our critical Z-value.
Now the Z-value, or the Z-statistic, for our actual sample is 2.14. It's sitting all the way out here some place. So the probability of getting that was definitely less than 5%. And actually we could even say what's the probability of getting that or a more extreme result. If you figured out this area, and you could actually figure it out by looking at a Z-table, you would have the P-value of this result. But anyway, the whole exercise here is just to figure out if we can reject the null hypothesis with a significance level of 5%. We can. This is a more extreme result than our critical Z-value, so we can reject the null hypothesis in favor of our alternative.
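The whole test can be reproduced in a few lines of Python. This is a sketch under stated assumptions, not the method shown on screen: it uses `statistics.NormalDist` from the standard library in place of a printed Z-table, and the function name and dictionary keys are invented for this example.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_proportion_z_test(successes: int, n: int, p0: float, alpha: float = 0.05) -> dict:
    """Upper-tailed z-test of H0: p <= p0 against H1: p > p0."""
    p_hat = successes / n
    se = sqrt(p0 * (1 - p0) / n)  # standard deviation of sample proportions under H0
    z = (p_hat - p0) / se
    std_normal = NormalDist()  # mean 0, standard deviation 1
    critical_z = std_normal.inv_cdf(1 - alpha)  # replaces the Z-table lookup
    p_value = 1 - std_normal.cdf(z)  # area to the right of z
    return {"z": z, "critical_z": critical_z, "p_value": p_value, "reject_null": z > critical_z}


result = one_sided_proportion_z_test(57, 150, 0.30)
for name, value in result.items():
    print(f"{name}: {value}")
```

The computed critical value is slightly below the table reading of 1.65 because the table only lists two decimal places; the decision to reject the null hypothesis is the same either way.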