
# Large sample proportion hypothesis testing

Sal uses a large sample to test if more than 30% of US households have internet access. Created by Sal Khan.

## Want to join the conversation?

- Bernoulli distribution can be approximated by a Normal distribution (if np > 5 & n(1-p) > 5)?? Can someone point out the logic here? It seems that I have missed some formula or something? (13 votes)
- As far as I am aware, these conditions are subjective. They are used because someone looked at the exact vs. approximate probabilities, and decided that when these conditions were satisfied, the results were close enough. I have never once seen a mathematical justification of those conditions. I would be delighted to be proven wrong, but my (admittedly sparse) searching has turned up nothing.

Also relevant, I've seen conditions requiring `np ≥ 10` or even `np ≥ 15` (and similarly for `n(1-p)`), which again makes me think these are subjective rules. Probably it should depend on the context: if your application requires greater precision, then you should probably insist on a larger number (like 10 or 15) before using the Normal Approximation to the Binomial Distribution.

Actually, if you need higher precision, you should probably just use exact probabilities. Using the Normal distribution to approximate the Binomial distribution was more important before there was computing power to evaluate large factorials. These days, we have powerful computers that can give us exact Binomial probabilities even for large sample sizes.(15 votes)
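The answer's point about just computing exact Binomial probabilities can be sketched in a few lines of Python, using the video's numbers (n = 150, p = 0.3 under the null, 57 successes). This is only an illustration of the comparison, not part of the video:

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, k = 150, 0.30, 57  # values from the video

# Exact tail probability P(X >= 57) under Binomial(150, 0.3)
exact = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Normal approximation: X is roughly Normal(np, sqrt(np(1-p)))
approx = 1 - NormalDist(n * p, sqrt(n * p * (1 - p))).cdf(k)

print(f"exact P(X >= 57) = {exact:.4f}")
print(f"normal approx    = {approx:.4f}")
```

Both values come out well under the 5% significance level, so the conclusion of the test is the same either way; the gap between them is the approximation error the np > 5 rule is trying to keep small.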

- Why shouldn't we calculate it in terms of our sample and take the confidence interval approach? So we calculate the sample mean (.38) and sample variance (.3835), and then estimate the standard deviation of our sampling distribution of the sample mean (.0313) using our sample variance. Then we'd see that a mean of .3 is about 2.55 standard deviations away from a mean of .38, which leads us to reject the null hypothesis. Can we do that?(8 votes)
- I also solved the problem from the sample point of view. It is perfectly fine to solve it in that manner. Sal's method is better than the sample approach because the test becomes more accurate if we calculate the SD of the population under the null hypothesis than if we estimate the population SD from the sample. (2 votes)
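The sample-based approach the question describes can be sketched as follows. The commenter's intermediate numbers differ slightly from the standard formula, so this sketch uses the usual sample standard error `sqrt(p_hat * (1 - p_hat) / n)`; the conclusion (reject the null) is the same:

```python
from math import sqrt
from statistics import NormalDist

n = 150
p_hat = 57 / n          # sample proportion, 0.38
p0 = 0.30               # null-hypothesis proportion

# Standard error estimated from the sample,
# instead of from the null proportion as in the video
se_sample = sqrt(p_hat * (1 - p_hat) / n)
z = (p_hat - p0) / se_sample

critical = NormalDist().inv_cdf(0.95)  # one-tailed, 5% significance
print(f"z = {z:.2f}, critical = {critical:.2f}, reject: {z > critical}")
```

The z-statistic lands around 2.0 instead of the video's 2.14, because the sample proportion 0.38 gives a slightly larger standard error than the null proportion 0.30; either way it clears the 1.65 critical value.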

- How is that a null hypothesis? Aren't both of those directional hypotheses? (5 votes)
- Null hypotheses can be directional. The null hypothesis essentially states that whatever you are looking to find, you haven't found it. So, if your question is such that it is making a directional claim, the null hypothesis will also be directional.(3 votes)

- Do these type of problems require a random or representative sample of the population? If so, are there some kind of rules to collecting this sample to make sure it is actually representative of the entire population?(4 votes)
- The rule for it to be most representative of the population is in fact that it is random. In a Data Analysis course at university we spent most of our time understanding how easy it is to alter the randomness of a sample in "real life". There is tons of material on DATA GENERATING PROCESSES; however, it is out of the scope of introductory statistics. For example, think of an online survey a supermarket gathers from its customers. This collects data only from people having a computer: can it still produce an unbiased set of information? It could, for example, exclude a higher percentage of old people w.r.t. young people. (4 votes)

- At about half-way through, Sal mentions a test to indicate if the Bernoulli distribution can be approximated by a Normal distribution (if np >5 & n(1-p) >5). Could someone elaborate on this test?(3 votes)
- Bernoulli distributions are situations where there are 2 options.

Example 1 - You have Internet access, OR, you don't have Internet access.

Example 2 - A person is taller than 5' 8", OR a person is not taller than 5'8"

Example 3 - A car is black, white or grey, OR a car is NOT black white or grey.

In each case, the probabilities of the two options ALWAYS add to 1, and they can be written as "p" and "1-p".

Examples with MADE UP DATA

Example 1 p=.38 ; 1-p = 1-.38 = .62.

Example 2 p=.82; 1-p = 1-.82 = .18

Example 3 p=.75; 1-p = 1-.75 = .25.

Bernoulli distributions are excellent for computer programming and electrical circuits when something is either TRUE or NOT-TRUE; or ON or NOT-ON, which is the same way of saying TRUE or FALSE; or ON or OFF, modeled by binary as 0 or 1.

At 4:11 he is testing if .30 HAVE and (1-.3) don't have, or .3 OR .7.

Hope that makes a little more sense(6 votes)

- When do we use the sqrt((n*p*(1-p))/(n-1)) formula to get the standard deviation of the sampling distribution of the sample mean, and when do we use the sqrt(p*(1-p)) formula as used in this video? (5 votes)
- At 6:02, when you talk about the success-failure rule, shouldn't n(p) > 10 and n(1-p) > 10? Not 5? (2 votes)
- I've seen both recommended, and also both using ≥ instead of >. The idea is basically just something to try and "help out" the Central Limit Theorem.

With numeric data, if some random variable X is normally distributed, then xbar will be normally distributed no matter the sample size. If X isn't quite normal, but close to normal, then xbar will be approximately normal even if n is fairly small. The more non-normal X is, the larger the sample size must be in order to "force" the sampling distribution of xbar to be normal.

This is the same sort of idea with the Normal Approximation to the Binomial distribution. If the probability is p=0.5, the distribution will be very symmetric, and we won't need so many observations before phat is roughly normal. If p is more extreme (closer to 0 or 1), then the distribution will be more skewed, and the sample size will need to be larger to overcome this.

The exact value, 5 or 10, etc, I'm not sure where they come from, but I've never seen any sort of "derivation" of them. My guess is that they are just based on some simulation study. And at the end of the day, regardless of which rule we use, we are only getting an approximation anyway. Back in the day, when computers were less advanced and hence it was more difficult to compute a lot of Binomial probabilities (or large combinations, etc), these approximations were more important. With the development of more advanced technology, we're able to bypass the need for them altogether. For instance, we could perform this same test using the Binomial distribution directly, instead of approximating it with the Normal distribution.(4 votes)
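The answer's intuition — that the approximation degrades as p moves away from 0.5 — can be checked directly rather than simulated, by measuring the worst gap between the exact Binomial CDF and its normal approximation. This is only an illustration of that point, with an arbitrarily chosen n = 50:

```python
from math import comb, sqrt
from statistics import NormalDist

def max_cdf_error(n, p):
    """Largest gap between the Binomial(n, p) CDF and its normal approximation."""
    norm = NormalDist(n * p, sqrt(n * p * (1 - p)))
    cdf = 0.0
    worst = 0.0
    for k in range(n + 1):
        cdf += comb(n, k) * p**k * (1 - p)**(n - k)
        worst = max(worst, abs(cdf - norm.cdf(k)))
    return worst

n = 50
print(max_cdf_error(n, 0.5))   # symmetric case: small error
print(max_cdf_error(n, 0.05))  # skewed case (np = 2.5): larger error
```

With the same sample size, the skewed p = 0.05 case (which fails the np > 5 rule) shows a noticeably larger worst-case error than the symmetric p = 0.5 case, which is exactly what the rule of thumb is guarding against.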

- so what happens if np or n(1-p) is smaller than 5? we can't assume its normal. so what can we do to solve the problem? use t-statistics?(3 votes)
- Yes.

The procedure to solve the problem will remain the same. The only difference will be that we refer to a t-table instead of a z-table to calculate the critical value. (2 votes)

- In all previous videos, we said we have to have at least 10 expected successes and 10 failure to assume normality, why is it 5 now ?(3 votes)
- why are we only doing a one tailed test?(2 votes)
- It's because of how the question is phrased. Since we're trying to see if "more than 30% of U.S. households" are online, we're only interested in the "upper" tail of the curve - if far *less* than 30% are connected, we still want to retain the null hypothesis. (1 vote)
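The practical consequence of one-tailed vs. two-tailed is just which critical z-value you compare against. A quick sketch at the video's 5% significance level:

```python
from statistics import NormalDist

alpha = 0.05
one_tailed = NormalDist().inv_cdf(1 - alpha)      # all 5% in the upper tail
two_tailed = NormalDist().inv_cdf(1 - alpha / 2)  # 2.5% in each tail

print(f"one-tailed critical z: {one_tailed:.3f}")  # about 1.645, as in the video
print(f"two-tailed critical z: {two_tailed:.3f}")  # about 1.960
```

A one-tailed test at the same significance level has a lower bar to clear (1.65 vs. 1.96), which is why the direction of the claim matters when setting up the hypotheses.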

## Video transcript

We want to test the hypothesis
that more than 30% of U.S. households have internet access
with a significance level of 5%. We collect a sample of 150
households, and find that 57 have access. So to do our hypothesis test,
let's just establish our null hypothesis and our alternative
hypothesis. So our null hypothesis is that
the hypothesis is not correct. Our null hypothesis is that
the proportion of U.S. households that have internet
access is less than or equal to 30%. And our alternative hypothesis,
is what our hypothesis actually is, is
that the proportion is greater than 30%. We see it over here. We want to test the hypothesis
that more than 30% of U.S. households have internet
access. That's that right here. This is what we're testing. We're testing the alternative
hypothesis. And the way we're going to do it
is we're going to assume a P-value based on the
null hypothesis. We're going to assume a
proportion based on the null hypothesis for the population. And given that assumption,
what is the probability that 57 out of 150 of our samples
actually have internet access. And if that probability is less
than 5%, if it's less than our significance level,
then we're going to reject the null hypothesis in favor
of the alternative one. So let's think about
this a little bit. So we're going to start off
assuming-- we're going to assume the null hypothesis
is true. And in that assumption we're
going to have to pick a population proportion or a
population mean-- we know that for Bernoulli distributions
these are the same thing. And what I'm going to do is I'm
going to pick a proportion so high so that it maximizes
the probability of getting this over here. And we actually don't even
know what that number is. And actually so that we can
think about it a little more intelligently, let's just find
out what our sample proportion even is. We had 57 people out of 150
having internet access. So 57 households out of 150. So our sample proportion
is 0.38, so let me write that over here. Our sample proportion
is equal to 0.38. So when we assume our null
hypothesis to be true, we're going to assume a population
proportion that maximizes the probability that we get
this over here. So the highest population
proportion that's within our null hypothesis that will
maximize the probability of getting this is actually
if we are right at 30%. So if we say our population
proportion, we're going to assume this is true. This is our null hypothesis. We're going to assume that
it is 0.3 or 30%. And I want you to understand that--
29% would have been a null hypothesis. 28% that would have been
a null hypothesis. But for 29% or 28%, the
probability of getting this would have been even lower. So it wouldn't have been
as strong of a test. If we take the maximum
proportion that still satisfies our null hypothesis,
we're maximizing the probability that we get this. So if that number is still low,
if it's still less than 5%, we can feel pretty good
about the alternative hypothesis. So just to refresh ourselves
we're going to assume a population proportion of 0.3,
and if we just think about the distribution-- sometimes it's
helpful to draw these things, so I will draw it. So this is what the population
distribution looks like based on our assumption,
based on this assumption right over here. Our population distribution
has-- or maybe I should write 30% have internet access. And I'll signify
that with a 1. And then the rest don't
have internet access. 70% do not have internet
access. This is just a Bernoulli
distribution. We know that the mean over here
is going to be the same thing as the proportion that
has internet access. So the mean over here is going
to be 0.3, same thing as 30%. This is the population mean. And maybe I should
write this way. The mean assuming our null
hypothesis, the population mean assuming our null
hypothesis is 0.3. And then the population
standard deviation. Let me write this over
here in yellow. The population standard
deviation assuming our null hypothesis. And we've seen this when we
first learned about Bernoulli distributions. It is going to be the square
root of the percentage of the population that has internet
access, so 0.3 times the proportion of the population
that does not have internet access, times 0.7
right over here. So this is the square
root of 0.21. And we could deal with this
later using our calculator. Now, with that out of the way,
we want to figure out the probability of getting a sample proportion that has a 0.38. So let's look at the
distribution of sample proportions. So you could literally look at
every combination of getting 150 households from this, and
you would actually get a binomial distribution. And we've also seen
this before. You would actually get a
binomial distribution where you'd get a bunch of
bars like that. But if your n is suitably large,
and in particular-- and this is kind of the test for
it-- the test if n times p-- and in this case we're saying
p is 30%-- if n times p is greater than 5, and n times 1
minus p is greater than 5, you can assume that the distribution
of the sample proportion or the sample
proportion distribution is going to be normal. So if you looked at all of the
different ways you could sample 150 households from this
population, you get all of these bars. But since our n is pretty big,
it's 150, and 150 times 0.3 is obviously greater than 5. 150 times 0.7 is also
greater than 5. You can approximate that with
a normal distribution. So let me do that. So you can approximate it with
a normal distribution. So this is a normal distribution
right over there. Now the mean of the distribution
of the proportion data that we're assuming is a
normal distribution is going to be-- and remember, working
under the context that the null hypothesis is true. So this mean is going to be--
this mean right here-- so the mean of our sample proportions
is going to be the same thing as our population mean. So this is going to be 0.3,
same value as that. And the standard deviation---
this comes straight from the central limit theorem. So the standard deviation of
our sample proportions, the standard deviation is going to
be the square root-- let me put it this way-- it's going to
be our population standard deviation, the standard
deviation we're assuming with our null hypothesis divided by
the square root of the number of samples we have. And in this
case we have 150 samples. It's going to be 150 samples
and we can calculate this. This value on top we just
figured out is the square root of 0.21. So this is the square root
of 0.21 over the square root of 150. And I can get the calculator
out to calculate this. So I'll just do it the
way I wrote it. The square root of 0.21-- and
I'm going to divide that, so whatever answer is I'm going to
divide that by the square root of 150. So it's 0.037. So we figured out the standard
deviation here of our-- or the distribution of our sample
proportions is going to be-- let me write this down, I'll
scroll over to the right a little bit-- it is 0.037. I think I'm falling off the
screen a little bit. So we'll just say 0.037. Now to figure out the
probability of having a sample proportion of 0.38, we just have
to figure out how many standard deviations that is
away from our mean, or essentially calculate a
Z-statistic for our sample, because a Z-statistic or a
Z-score is really just how many standard deviations you
are away from the mean. And then figure out whether
the probability of getting that Z-statistic is more
or less than 5%. So let's figure out how many
standard deviations we are away from the mean. So just remind ourselves, this
sample proportion we got we can view as just a sample from
this distribution of all of the possible sample
proportions. So how many standard
deviations away from the mean is this? So if we take our sample
proportion, subtract from that the mean of the distribution
of sample proportions and divide it by the standard
deviation of the distribution of the sample proportions, we
get 0.38, 0.38 minus 0.3. All of that over this
value which we just figured out was 0.037. So what does that give us? The numerator over
here is a 0.08. The denominator is 0.037. So let's figure this out. So our numerator is 0.08 divided
by this last number right here, which
is the 0.037. So second answer and we get
2.1-- I'll just round it to 2.14 standard deviations. So this is equal to-- this right
here is equal to 2.14 standard deviations. Or we could say that our
Z-statistic, right, we could call this our Z-score or our
Z-statistic, the number of standard deviations we are away
from our mean is 2.14. We're at 2.14, and to be exact,
we're 2.14 standard deviations above the mean. We're going to care about a
one-tailed distribution. Now is the probability
of getting this more or less than 5%? If it's less than 5% we're
going to reject the null hypothesis in favor of
our alternative. So how do we think about that? Well let's think about just
a normalized normal distribution. Or maybe you could call the
Z-distribution if you want. If you look at a normal
distribution, a completely normalized normal distribution, its mean is at 0. And essentially each
of these values are essentially Z-scores. Because a value of 1 literally
means you are 1 standard deviation away from this
mean over here. So we need to find a critical
Z-value right over here. Let me call that a critical Z--
we could even say critical Z-score or critical Z-value--
so that the probability of getting a Z-value higher
than that is 5%. So that this whole area
right here is 5%. And that's because that's what
our significance level is. Anything that has a lower than
5% a chance of occurring, for us will be validation to reject
our null hypothesis. Or another way of thinking about
it is that area's 5%, this whole area right
over here is 95%. And once again, this is a
one-tailed test, because we only care about values
greater than this. Z-values greater than that will
make us reject the null hypothesis. And to figure out this critical
Z-value you can literally just go
to a Z-table. And we say OK, the probability
of being a Z-value less than that is 95%. And that's exactly the number
that this gives us. The cumulative probability
of getting a value less than that. So if we just scan this,
we're looking for 95%. We have 0.9495, we
have 0.9505. So I'll go with this just
to make sure we're a little bit closer. So this Z-value, and the z-value
here is 1.6, and the next digit is 5, so 1.65. So this critical Z-value
is equal to 1.65. So the probability of getting
a Z-value less than 1.65, or even in a completely
normalized normal distribution, the probability
of getting a value less than 1.65. Or in any normal distribution,
the probability of being less than 1.65 standard deviations
away from the mean is going to be 95%. So that's our critical
Z-value. Now the Z-value, or the
Z-statistic, for our actual sample is 2.14. Our actual Z-value
we got is 2.14. It's sitting all the way
out here some place. So the probability of
getting that was definitely less than 5%. And actually we could even say
what's the probability of getting that or a more
extreme result. And if you figured out this
area, and you could actually figure it out by looking at a
Z-table, you could figure out the P-value of this result. But anyway, the whole exercise
here is just to figure out if we can reject the null hypothesis
with a significance level 5%. We can. This is a more extreme result
than our critical Z-value, so we can reject the null
hypothesis in favor of our alternative.
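The whole calculation in the transcript can be condensed into a short script, using only the numbers stated in the video (57 of 150 households, null proportion 0.30, 5% significance):

```python
from math import sqrt
from statistics import NormalDist

# Numbers from the video: 57 of 150 sampled households have internet access.
# H0: p <= 0.30, Ha: p > 0.30, significance level 5%.
n = 150
p_hat = 57 / n            # sample proportion, 0.38
p0 = 0.30

# SD of the sampling distribution of the sample proportion under H0
sigma = sqrt(p0 * (1 - p0))        # sqrt(0.21), population SD
se = sigma / sqrt(n)               # about 0.037

z = (p_hat - p0) / se              # about 2.14, as in the video
critical_z = NormalDist().inv_cdf(0.95)   # about 1.65, from the z-table
p_value = 1 - NormalDist().cdf(z)

print(f"z = {z:.2f}, critical z = {critical_z:.2f}")
print(f"p-value = {p_value:.4f}, reject H0: {z > critical_z}")
```

The z-statistic of about 2.14 exceeds the one-tailed critical value of about 1.65, so the null hypothesis is rejected, matching the video's conclusion; the p-value line also computes the "probability of this or a more extreme result" that Sal mentions at the end.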