Lesson 1: Comparing two proportions

# Comparing population proportions 1

Sal uses an election example to compare population proportions. Created by Sal Khan.

## Want to join the conversation?

- I have a question about how Sal derived the variance for the men and women groups, starting at 5:31 or so of this video. It seems that he takes the formula for variance in a Bernoulli distribution for populations and uses that to calculate the sampling distribution variance. Since you have data from the sample itself, why wouldn't you calculate the sample's variance first (i.e., s^2), then use s^2? It seems like this was done in an earlier video for Bernoulli distributions? I thought it would work something like this for the men's sample: s^2=(358x(0-.358)^2)+(642x(.642-0)^2)/1000-1. ... I think that the video showing this was entitled "Estimating Population Proportion." Any thoughts on this greatly appreciated!(7 votes)
- At about 7:15 Sal put an n in the sampling distribution for the women, but we already know that it is 1000, correct? We put that number in on the men's side. Help, I'm trying not to get confused!!!!! lol...(5 votes)
  - Yes, n=1,000 for the women as well; it's corrected in the next video when it comes time to calculate it.(5 votes)

- A bit confused as to why variance of p1 - p2 is the sum of variance(p1) and variance(p2)?(3 votes)
- Why are you using p-bar and not p-hat? Please help, I am very confused by this. Also why are we not using x-bar?(3 votes)
- Are the male and female voters populations or samples of populations? Note, we do not use the unbiased formula for the standard deviation. And in later videos in this series, we do not use the standard error of the mean formula.(2 votes)
- When we take 2.5% of alpha, if our z value is more than 1.96 (but less than 5%) then we reject the null hypothesis.(1 vote)
  - Using 1.96, you put the 95% in the middle section of the normal distribution, and you have 5% left over, 2.5% in each tail. For 95% confidence, alpha is 0.05, but alpha/2 = 0.025. To use some of the typical normal distribution tables, you have to look up the probability from a z-value down to negative infinity, so you are looking up the probability 0.95 + 0.025 (the lower tail probability), and that will give you z=1.96. It is still the 95% confidence level, but you have to work with how the z-table is constructed.(2 votes)

- Do the sample sizes have to be the same? Could N.Men=2000 and N.Women=1000? Can we still compare them?(1 vote)
- If the mean of a Bernoulli distribution is p, then why is the mean that Sal takes at 4:16 equal to 642/1000?(1 vote)
- In maths, what we say we write in formulae. In this video you put sigma of the two proportions with a minus sign in the suffix. This difference can be shown only in the paired condition, not in independent samples. Many books have this problem, but as per my understanding this is wrong. You can put only a + sign there. If I am wrong, kindly email me. I may get more insight into variance behaviour. Thanks(1 vote)
- Did Sal ever prove why the standard deviations for large samples vary by the factor of 1/n?(1 vote)
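
Several questions above ask why the variance of the difference is the sum of the two variances, even though the statistic itself has a minus sign. One way to see it is by simulation. The sketch below is only an illustration and is not from the video: the "true" proportions 0.642 and 0.591, the sample size of 200, the number of trials, and the function names are all assumptions chosen for the example, and it relies on the two samples being independent.

```python
import random
import statistics

def simulated_proportion(p, n, rng):
    """Draw n Bernoulli(p) values and return their mean (a sample proportion)."""
    return sum(1 for _ in range(n) if rng.random() < p) / n

def difference_variance(p1, p2, n, trials, seed=0):
    """Empirical variance of (p1_bar - p2_bar) over many independent samples."""
    rng = random.Random(seed)
    diffs = [
        simulated_proportion(p1, n, rng) - simulated_proportion(p2, n, rng)
        for _ in range(trials)
    ]
    return statistics.pvariance(diffs)

# Assumed values, for illustration only.
p1, p2, n, trials = 0.642, 0.591, 200, 3000

empirical = difference_variance(p1, p2, n, trials)
theoretical = p1 * (1 - p1) / n + p2 * (1 - p2) / n  # sum of the two variances

print(f"empirical variance:   {empirical:.6f}")
print(f"sum of the variances: {theoretical:.6f}")
```

The minus sign changes the mean of the difference, but for independent samples the variances still add, so the two printed numbers should be close to each other.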

## Video transcript

Let's say there's an election coming up and I want to figure out if there's a meaningful difference between the proportion of men and the proportion of women that are going to vote for a candidate. So let's look at the population distributions here.

So we have the men, some proportion are going to vote for the candidate. We'll call that P1. So this is the proportion that will vote for the candidate. And the rest of the men will not vote for the candidate. So 1 minus P1 will not vote for the candidate.

And then for the women, you're going to see something similar. So this is the women right over here. And some proportion will vote for the candidate. We don't know if it's the same as P1, we don't know if it's the same as the men, so we'll call it P2. And then the rest of the women will not vote for the candidate. 1 minus P2.

So the not voting are zeros, the ones that are voting are ones. And these are both Bernoulli distributions and we know, just because this'll be useful later on, that the means of these distributions are the same as the proportion that will vote for it. So the mean of the men, or the proportion of the men that will vote, so we'll call that mean one, is equal to P1. I should do everything in yellow. So the mean of this distribution is P1. The variance of this distribution, we'll call that variance one, is just these two proportions multiplied by each other. So it's P1 times 1 minus P1. And we saw this many many videos ago when we learned about Bernoulli distributions.

And we're going to see the exact same thing with the women. The mean of this Bernoulli distribution is going to be P2. And then the variance of this Bernoulli distribution is going to be these two proportions multiplied. So P2 times 1 minus P2.

Now, what I want to do, and
I think I said this at the beginning of the video, is I want to figure out if there's a meaningful difference between the way that the men will vote and the women will vote. I want to figure out, let me write this, is this meaningful? So is there a meaningful difference here? And what we're going to do in this video is try to come up with a 95% confidence interval for this parameter, P1 minus P2. This difference of parameters is still a parameter. We don't know what the true difference of these two population parameters is. Or these two population proportions. But we're going to try to come up with a 95% confidence interval for that difference.

And the way we do that, we go
out and we find 1,000 men likely to vote. And 1,000 women likely to vote. So let's write this down. So we get 1,000 men. When we survey the 1,000 men, let's say 642 say that they will vote for the candidate. So they are ones. And then the remainder, 358, I'll just say the remainder. So the rest are zeros.

Then we do the same thing with women. We survey 1,000 women who are likely to vote. But we survey them randomly. And let's say 591 say that they will vote for the candidate. And the rest say that they will not vote for the candidate. So just here based on our sample proportions, or our sample means, it looks like there is a difference. But we still have to come up with our confidence interval.

And let's just make sure we understand what we just did. So we could figure out a sample proportion over here for the men. Which is really just the sample mean of this sample right over here. We have 642 ones, the rest are zero. So we have 642 in the numerator. We have 1,000 samples. 642 divided by 1,000 is 0.642. So you could view this as a sample mean or as a sample proportion. If you do the same thing for the women, the sample proportion is going to be 0.591. Or you could even just view this as the sample mean of the sample of 1,000 women. Where the ones voting for it are one, the rest are zero.

And just to visualize it
properly, let me draw the sampling distribution for the sample proportions. We have a large sample size. And especially because the proportions that we're dealing with aren't close to one or zero, and we have a large sample size, the sampling distribution will be approximately normal. Let me write this. So it's going to have some mean over here. So the mean of the sampling distribution of the sample proportion. And we've seen it multiple times. It's going to be the same thing as the mean of the population. And the mean of the population is actually the true population proportion. So this is going to be equal to P1. This is something that we don't know.

And then the variance of this, and we've seen this several times already, the variance of this distribution, I have to put a one here, we're dealing with the men. The variance of this distribution by the central limit theorem is going to be the variance of this distribution up here, which is P1 times 1 minus P1 over our sample size, over 1,000.

And we can do the exact same thing for the women. So this is the sampling distribution. This is for P2 bar, or this sample mean over here. Let me put a one over here. Remember, this is all for the men. And then this over here is all for the women. Can't forget those twos over there. And so this distribution is going to have some mean. Let me draw it right over here. So mu sub P2 with a bar over it. So the mean of the sampling distribution for this sample proportion, for the women, which is going to be the same thing as the mean of the population, which we already saw is going to be equal to P2. And then the variance for this distribution, for this sampling distribution over here, is going to be this variance over here divided by our sample size. So P2 times 1 minus P2. All of that over n, where n is 1,000 here as well.

Now, our whole goal is to
get a 95% confidence interval for that. And so what we're going to do is we're going to think about the sampling distribution, not for this, and not the sampling distribution for this. But we're going to think about the sampling distribution for the difference of this sample proportion and this sample proportion. We've seen it already. We're talking about proportions, but it's really the same exact ideas that we did when we just compared sample means generally. So let's look at that. Let's look at this distribution.

And just to be clear, when we got this sample mean here, this sample proportion, we just sampled it. You could view it as taking a sample from this distribution over here. When we got this sample proportion, it was like taking a sample from this over here. We took 1,000 samples from this, then we took their mean. That's equivalent to taking a sample from the sampling distribution.

Now, this distribution over
here is going to be the distribution of all of the differences of the sample proportions. So it will look like this. It will have some mean value. I should do this in a different color. I'll do it in green. Yellow and blue make green. So I'll call this the sampling distribution of this statistic, of P1 bar minus P2 bar. And so it has some mean over here. The mean of the sample proportion P1 bar minus the sample proportion P2 bar. And we know, from things that we've done in the last several videos, that this is going to be the exact same thing as this mean minus this mean. Which is the exact same thing as P1 minus P2. So this is going to be equal to P1 minus P2.

And the variance of this distribution, of P1 bar minus P2 bar, just like this, is going to be the sum of the variances of these two distributions. So it's going to be this thing over here, I'll just copy and paste it, plus this variance over here. There's no radical sign, because we're not taking the standard deviation. We're focused on the variance right now. So plus this thing right over here. So let me copy and let me paste it. So that's going to be the variance. And if you want the standard deviation, you can literally just get rid of this. You're taking the square root of both sides. So you take the square root of the variance, you get the standard deviation, that's why I got rid of that to the second power. And you want to take a square root of the right-hand side just like that.

Now, all I did right now was
just to kind of conceptually set things up in our brain. What we now need to do is actually tackle the confidence interval. We actually need to come up with a 95% confidence interval for P1 minus P2. Or a 95% confidence interval for this mean right over here. And because I'm trying to make my best effort not to make videos too long, I'll do part two in the next video, where we actually solve the confidence interval.
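
The setup in this video can be checked numerically. The sketch below is a minimal Python illustration, not something shown in the video. It plugs the sample proportions 0.642 and 0.591 in place of the unknown P1 and P2, which is an assumption, since the true proportions are not known. The function names are made up for this example.

```python
import math

def sample_proportion(ones, n):
    """Sample proportion: the mean of a sample of n zeros and ones."""
    return ones / n

def proportion_variance(p, n):
    """Variance of the sampling distribution of a sample proportion: p(1 - p) / n."""
    return p * (1 - p) / n

n_men, n_women = 1000, 1000
p1_bar = sample_proportion(642, n_men)    # 0.642
p2_bar = sample_proportion(591, n_women)  # 0.591

# Assumption: the sample proportions stand in for the unknown P1 and P2.
var_men = proportion_variance(p1_bar, n_men)
var_women = proportion_variance(p2_bar, n_women)

# The variance of the difference is the sum of the two variances.
var_diff = var_men + var_women
sd_diff = math.sqrt(var_diff)

print(f"difference of sample proportions: {p1_bar - p2_bar:.3f}")
print(f"variance of the difference:       {var_diff:.6f}")
print(f"standard deviation:               {sd_diff:.4f}")
```

Under that plug-in assumption, the difference of the sample proportions is 0.051 and the standard deviation of the difference comes out to roughly 0.0217.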