
## Statistics and probability > Unit 12 > Lesson 5: More significance testing videos

# Large sample proportion hypothesis testing

Sal uses a large sample to test if more than 30% of US households have internet access. Created by Sal Khan.

## Want to join the conversation?

• Bernoulli distribution can be approximated by a Normal distribution (if np > 5 and n(1 − p) > 5)?? Can someone point out the logic here? It seems that I have missed some formula or something.

• As far as I am aware, these conditions are subjective. They are used because someone looked at the exact vs. approximate probabilities and decided that, when these conditions were satisfied, the results were close enough. I have never once seen a mathematical justification of those conditions. I would be delighted to be proven wrong, but my (admittedly sparse) searching has turned up nothing.

Also relevant: I have seen the conditions stated as `np ≥ 10` or even `np ≥ 15` (and similarly for `n(1-p)`), which again makes me think these are subjective rules. It probably depends on the context: if your application requires greater precision, then you should probably insist on a larger threshold (like 10 or 15) before using the Normal Approximation to the Binomial Distribution.

Actually, if you need higher precision, you should probably just use exact probabilities. Using the Normal distribution to approximate the Binomial distribution was more important before there was computing power to evaluate large factorials. These days, we have powerful computers that can give us exact Binomial probabilities even for large sample sizes.
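To illustrate that point, here is a minimal sketch comparing an exact Binomial tail probability against its continuity-corrected Normal approximation, using only the standard library. The numbers (n = 250 trials, null proportion 0.30, 95 observed successes) are hypothetical and do not come from the video:

```python
from math import comb, sqrt, erf

def binom_sf(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p), summed from the pmf."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def normal_sf(x, mu, sigma):
    """P(X >= x) under Normal(mu, sigma), via the error function."""
    return 0.5 * (1 - erf((x - mu) / (sigma * sqrt(2))))

# Hypothetical numbers: 250 households, null p = 0.30, 95 successes observed
n, p, k = 250, 0.30, 95
exact = binom_sf(k, n, p)
# Continuity correction: the Binomial bar at k starts at k - 0.5
approx = normal_sf(k - 0.5, n * p, sqrt(n * p * (1 - p)))
print(exact, approx)
```

With modern computing, the "exact" line is no harder to run than the approximation, which is the point made above.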
• Why shouldn't we calculate it in terms of our sample and take the confidence-interval approach? We calculate the sample mean (.38) and sample variance (.3835), then estimate the standard deviation of our sampling distribution of the sample mean (.0313) using our sample variance. We would then see that a mean of .3 is about 2.55 standard deviations away from a mean of .38, which leads us to reject the null hypothesis. Can we do that?

• Do these types of problems require a random or representative sample of the population? If so, are there rules for collecting this sample to make sure it is actually representative of the entire population?

• The rule for the sample to be most representative of the population is, in fact, that it is random. In a Data Analysis course at university we spent most of our time understanding how easy it is to alter the randomness of a sample in "real life". There is tons of material on DATA GENERATING PROCESSES, but it is out of the scope of introductory statistics. For example, think of an online survey a supermarket gathers from its customers. This collects data only from people who have a computer: can it still produce an unbiased set of information? It could, for example, exclude a higher percentage of old people relative to young people.
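The z-score arithmetic in the first question above can be checked in a couple of lines (the figures .38, .30, and .0313 are taken from that question):

```python
p_hat, p0 = 0.38, 0.30   # sample proportion and null value from the question
se = 0.0313              # estimated standard error quoted in the question
z = (p_hat - p0) / se    # how many standard errors p_hat lies from the null
print(round(z, 2))
```

This lands at roughly 2.56 standard errors, consistent with the "about 2.55" quoted above, and well past the usual 1.96 cutoff for a 95% test.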
• How is that a null hypothesis? Aren't both of those directional hypotheses?

• At about halfway through, Sal mentions a test to indicate whether the Bernoulli distribution can be approximated by a Normal distribution (if np > 5 and n(1 − p) > 5). Could someone elaborate on this test?

• Bernoulli distributions are situations where there are 2 options.

Example 1 - You have Internet access, OR, you don't have Internet access.

Example 2 - A person is taller than 5' 8", OR a person is not taller than 5'8"

Example 3 - A car is black, white, or grey, OR a car is NOT black, white, or grey.

In each case, the probabilities of the two options ALWAYS add to 1, and they can be written as "p" and "1 − p".

Example 1 p=.38 ; 1-p = 1-.38 = .62.

Example 2 p=.82; 1-p = 1-.82 = .18

Example 3 p=.75; 1-p = 1-.75 = .25.

Bernoulli distributions are excellent for computer programming and electrical circuits, where something is either TRUE or NOT-TRUE, or ON or NOT-ON; which is another way of saying TRUE or FALSE, or ON or OFF, modeled in binary as 0 or 1.

In the video, he tests whether .30 HAVE and (1 − .30) do NOT have Internet access, i.e. .3 vs. .7.

Hope that makes a little more sense
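The two-outcome nature of a Bernoulli variable is easy to see in a quick simulation. This sketch uses the p = .38 figure from Example 1 above; the sample size of 100,000 is an arbitrary choice:

```python
import random

random.seed(0)                 # fixed seed so the run is reproducible
p = 0.38                       # probability of "HAVE Internet access"
# Each trial is 1 (success) with probability p, else 0 (failure)
trials = [1 if random.random() < p else 0 for _ in range(100_000)]
p_hat = sum(trials) / len(trials)
print(p_hat)                   # close to 0.38; failures make up 1 - p_hat
```

The success and failure proportions always add to 1, exactly as the p and 1 − p pairs in the three examples do.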
• When do we use the sqrt((n*p*(1-p))/(n-1)) formula to get the standard deviation of the sampling distribution of the sample mean, and when do we use the sqrt(p*(1-p)) formula as used in this video?

• At 6:02, when you talk about the success-failure rule, shouldn't it be n(p) > 10 and n(1-p) > 10, not 5?

• I've seen both recommended, and also both using ≥ instead of >. The idea is basically just something to try and "help out" the Central Limit Theorem.

With numeric data, if some random variable X is normally distributed, then xbar will be normally distributed no matter the sample size. If X isn't quite normal, but close to normal, then xbar will be approximately normal even if n is fairly small. The more non-normal X is, the larger the sample size must be in order to "force" the sampling distribution of xbar to be normal.

This is the same sort of idea with the Normal Approximation to the Binomial distribution. If the probability is p=0.5, the distribution will be very symmetric, and we won't need so many observations before phat is roughly normal. If p is more extreme (closer to 0 or 1), then the distribution will be more skewed, and the sample size will need to be larger to overcome this.

As for the exact value (5 or 10, etc.), I'm not sure where it comes from, and I've never seen any sort of "derivation" of it. My guess is that it is just based on some simulation study. And at the end of the day, regardless of which rule we use, we are only getting an approximation anyway. Back in the day, when computers were less advanced and it was more difficult to compute a lot of Binomial probabilities (or large combinations, etc.), these approximations were more important. With more advanced technology we can bypass the need for them altogether: for instance, we could perform this same test using the Binomial distribution directly, instead of approximating it with the Normal distribution.
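The success-failure rule discussed above reduces to a one-line check. This sketch uses the stricter threshold of 10, though as noted, 5 and 15 also appear as cutoffs in various texts:

```python
def normal_approx_ok(n, p, threshold=10):
    """Success-failure check: require both the expected successes (n*p)
    and the expected failures (n*(1-p)) to reach the threshold."""
    return n * p >= threshold and n * (1 - p) >= threshold

# A symmetric case (p = 0.5) passes with a modest sample...
print(normal_approx_ok(40, 0.5))    # np = n(1-p) = 20 → True
# ...while an extreme p needs a much larger n, as the skewness
# argument above suggests:
print(normal_approx_ok(40, 0.05))   # np = 2 → False
print(normal_approx_ok(400, 0.05))  # np = 20, n(1-p) = 380 → True
```

This matches the intuition above: the closer p is to 0 or 1, the more observations are needed before the Normal approximation is reasonable.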