# Conditions for valid confidence intervals for a proportion

There are three conditions we need to satisfy before we make a one-sample z-interval to estimate a population proportion. We need to satisfy the random, normal, and independence conditions for these confidence intervals to be valid.

## Want to join the conversation?

• As in the first simulation, as I understand it, the population is 250 and the sample size is 200, drawn without replacement (meaning we do not put the gumball back into the machine). How can we have many samples? With a population of 250 and a sample of 200, I think we can only have one sample?

• Hi guys, here Sal mentioned that to get a normally distributed sampling distribution of the sample proportion, we need at least 10 successes and 10 failures in each sample. However, in one of the previous exercises, the minimum sample size was said to be 30 if we want a normal distribution. These two seem to contradict each other. Any advice on this?

• They actually aren't contradictory. The sample size needs to be at least 30, so n ≥ 30, and there need to be at least 10 successes and 10 failures in the sample. So remember that np ≥ 10 and n(1 − p) ≥ 10, which means the sample size times the proportion of successes needs to be at least 10, and the sample size times the proportion of failures needs to be at least 10. Multiplying the proportion of successes or failures by the sample size gives you the number of successes or failures in a sample. Here's an example to see how they relate:

Sample size: 50
Proportion of successes: 0.4
Proportion of failures: 0.6 (or 1 − 0.4)

n(p)=?
50(0.4)=20

n(1-p)=?
50(0.6)=30

Now look, we can take the number of successes/ failures to find the proportion of successes/failures in the sample:

20/50= 0.4
0.4=p

30/50=0.6
0.6= 1-p

So essentially, we first check that the sample size is at least 30. If that is met, we then check whether the numbers of successes and failures in the sample are each at least 10. If not, the sampling distribution would probably not be approximately normal.
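The two checks in the answer above (plus the 10% independence condition mentioned elsewhere on this page) can be sketched as a small helper. This is an illustrative function, not code from the video; the name and signature are made up for the example:

```python
def check_conditions(n, p_hat, population_size):
    """Sketch of the checks for a one-sample z-interval for a proportion.

    Normal (large counts) condition: at least 10 expected successes and
    10 expected failures, i.e. n*p_hat >= 10 and n*(1 - p_hat) >= 10.
    Independence (10%) condition: sample is at most 10% of the population.
    (The random condition is about how the data were collected, so it
    can't be checked from the numbers alone.)
    """
    successes = n * p_hat
    failures = n * (1 - p_hat)
    normal_ok = successes >= 10 and failures >= 10
    independence_ok = n <= 0.10 * population_size
    return normal_ok, independence_ok

# Using the numbers from the example above: n = 50, p_hat = 0.4
# gives 20 successes and 30 failures, so the large counts check passes.
print(check_conditions(50, 0.4, 1000))  # → (True, True)
```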
• What's the normal condition for a non-Bernoulli distribution?

• I can't understand why, for the normal condition, we should expect at least 10 successes and 10 failures each. If the precondition for a normal sampling distribution of the sample proportion is np ≥ 5 and n(1 − p) ≥ 5 in a sample, why do the numbers of successes and failures have to be at least 10? Does that mean we have to conduct at least 2 samples?

• How do you access the gumball simulation?

• The independence condition is unintuitive to me. Shouldn't the sample parameters approach the population parameters as the sampled fraction approaches 100%? Wouldn't that mean that the only consequence of not meeting the independence condition is that our estimates of the population parameters become more accurate than expected? How is getting "too accurate" estimates ever a problem in real life?

(Intuitively, if polling ten people produces more accurate results than polling one person ten times, then replacement when sampling can only ever decrease the accuracy of a poll.)

• You are comparing samples of different sizes (1 and 10). Indeed, the bigger the sample size, the closer the sample mean is expected to be to the population mean.

The problem lies elsewhere. Since we calculate our confidence intervals in the number of stddevs from the mean, it is important for the stddev of our sample to be an unbiased estimate of the stddev of the population.

The stddev of the sample with replacement is such an estimate. But the stddev of the sample without replacement is not; it is actually smaller. So, when we claim with 95% confidence that the population mean is no farther than 2 stddevs from the sample mean, and we calculate that distance using the stddev of the sample without replacement, we fall short: the interval is smaller than it's supposed to be.

Intuitively, the bigger the sample, the closer we are to the mean, but the less confident we are about how close :)
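The effect described in this answer can be sketched numerically with the finite population correction (FPC), the standard adjustment for sampling without replacement. The function below is illustrative (the name is made up), but the formulas are the usual ones:

```python
import math

def se_proportion(p, n, N=None):
    """Standard error of a sample proportion.

    Without N: the usual with-replacement formula sqrt(p*(1-p)/n).
    With N: multiply by the finite population correction
    sqrt((N - n) / (N - 1)), which shrinks the SE when sampling
    without replacement from a finite population.
    """
    se = math.sqrt(p * (1 - p) / n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se

# Sampling 200 gumballs out of 250 (the simulation's numbers, p = 0.6):
print(se_proportion(0.6, 200))       # with replacement: larger SE
print(se_proportion(0.6, 200, 250))  # without replacement: much smaller SE
```

With n at 80% of the population, the FPC cuts the standard error by more than half, which is why intervals built from the uncorrected formula behave oddly when the 10% condition is badly violated.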
• I want to find more information about the normal condition. Does anyone know a search term or keyword for it?

• In the 10% rule, when Khan says _n_ < 10% of the population, isn't it supposed to include 10% itself?

• I'm trying to recreate the simulation in Python.

```python
import numpy as np
from statsmodels.stats.proportion import proportion_confint

for s in [.1, .2, .3, .4, .5, .6, .7, .8, .9, 1.]:
    c = 0
    for i in range(1500):
        p = .6
        N = 250
        # create list of Bernoulli trials
        population = np.random.binomial(1, p, N)
        ix = list(range(N))
        # random sample of s percent of population
        test_ix = np.random.choice(ix, int(N * s), replace=False)
        test_x = population[test_ix].sum()
        lower, upper = proportion_confint(test_x, int(N * s))
        # is true p in CI?
        if lower <= p <= upper:
            c += 1
    # "hit rate"
    print("Prc: %s, Hit Rate: %s " % (s, (c / 1500)))
```

Output

```
Prc: 0.1, Hit Rate: 0.9486666666666667
Prc: 0.2, Hit Rate: 0.9506666666666667
Prc: 0.3, Hit Rate: 0.9486666666666667
Prc: 0.4, Hit Rate: 0.9553333333333334
Prc: 0.5, Hit Rate: 0.95
Prc: 0.6, Hit Rate: 0.9466666666666667
Prc: 0.7, Hit Rate: 0.942
Prc: 0.8, Hit Rate: 0.9386666666666666
Prc: 0.9, Hit Rate: 0.962
Prc: 1.0, Hit Rate: 0.9533333333333334
```

I'm not getting similar results. Any ideas what's going on? I think my code is correct. Am I getting something wrong with the theory?
• Why does the margin of error CHANGE? For example, if we want 95% confidence intervals and take samples of size n = 10, wouldn't they all be the same length for that study? The margin of error = (critical value)(stdev), say (2)(4.5) if we want to cover 2 stdevs on either side of p̂, where stdev = 4.5. Wouldn't this margin of error (2 × 4.5 = 9), the "stems" on either side of p̂, be the SAME for ALL confidence intervals in that study? Thanks!
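One way to see why the margin of error varies even with a fixed n: the standard error is computed from the *sample* proportion p̂, which differs from sample to sample. A minimal sketch (the function name is made up; z = 1.96 is the usual 95% critical value):

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Margin of error for a one-sample z-interval for a proportion.

    Uses the estimated standard error sqrt(p_hat*(1-p_hat)/n), so two
    samples of the same size n with different p_hat values produce
    confidence intervals of different widths.
    """
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Same n = 100, three different sample proportions, three different widths:
for p_hat in (0.3, 0.5, 0.8):
    print(p_hat, round(margin_of_error(p_hat, 100), 4))
```

The margin is widest at p̂ = 0.5, since p̂(1 − p̂) is maximized there, and narrows as p̂ moves toward 0 or 1.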