Sampling distribution of the difference in sample proportions

We can calculate the mean and standard deviation for the sampling distribution of the difference in sample proportions. Also, we can tell if the shape of that sampling distribution is approximately normal. Created by Sal Khan.

Want to join the conversation?

  • Lambert Yan
    Great video. For people who are confused by the formula at around 4 minutes: I googled it :), and here is the answer:
    "The variance of X/n is equal to the variance of X divided by n², or (np(1-p))/n² = (p(1-p))/n . This formula indicates that as the size of the sample increases, the variance decreases."
    (2 votes)
  • Lambert Yan
    When we are trying to find the standard deviation of the difference in sample proportions, why don't we just calculate the difference between the two samples' standard deviations?
    (1 vote)
    • daniella
      The standard deviation of the difference between two independent sample proportions isn't found by simply taking the difference between their individual standard deviations. This is because standard deviation measures the spread or variability of a distribution, and when you're looking at the difference (or sum) of two variables, you're essentially combining their variabilities. Mathematically, when two variables are independent, the variance of their sum (or difference) is the sum of their variances. This is why we add the variances of the sample proportions from plants A and B to find the variance of the difference in sample proportions. The formula reflects how the variability of the difference between two proportions encompasses the variability of each proportion. Subtracting standard deviations wouldn't accurately capture the combined variability of the difference between the two sample proportions.
      (1 vote)
  • hollieH
    Did Sal forget to square the standard deviation of A and B?
    (1 vote)

Video transcript

- [Instructor] We're told, suppose that 8% of all cars produced at plant A have a certain defect, and 6% of all cars produced at plant B have this defect. Each month, a quality control manager takes separate random samples of 200 of the over 3000 cars produced from each plant. The manager looks at the difference between the proportions of cars with the defect in each sample. So they're looking at the difference of sample proportions every month. Describe the distribution of the difference of sample proportions in terms of its mean, standard deviation, and shape. So let's take these step-by-step. So first, let's think about the mean of the difference of our sample proportions. Pause this video and try to figure out what that's going to be. Well, we have seen this in previous videos, that the mean of the difference of two random variables is the same as the difference of their means. Another way to think about it: if we wanna figure out the mean of the sample proportion from plant A minus the sample proportion from plant B, this is just going to be equal to the mean of the sample proportion from plant A, minus the mean of the sample proportion from plant B. Now, what are these going to be equal to? Well, the mean of the sample proportion from plant A is just going to be the true population proportion for plant A. And they tell us that. They tell us that 8% of all cars produced at plant A have a certain defect. So this could be 8% or we could write it as 0.08. And then from that, we are going to subtract the mean of the sample proportion from plant B. And we know what that mean's going to be. The mean of a sample proportion is going to be the population proportion, the parameter of the population, which we know for plant B is 6%, or 0.06. And then that gets us a mean of the difference of 0.02 or 2%, so a 2 percentage point difference in defect rate would be the mean. Now let's think about the standard deviation.
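Before moving on, the mean result above can be checked with a quick simulation. This is only a sketch under stated assumptions: the problem says each plant produces "over 3000" cars, so the code assumes exactly 3000 per plant, and the function name `simulate_differences`, the number of simulated months, and the random seed are all hypothetical choices made for illustration.

```python
import random
from statistics import mean


def simulate_differences(p_a=0.08, p_b=0.06, n=200, population=3000,
                         months=2000, seed=0):
    """Simulate monthly differences in sample proportions (plant A minus plant B).

    Assumption: each plant produces exactly `population` cars per month;
    the problem only says "over 3000".
    """
    rng = random.Random(seed)

    # 1 marks a car with the defect, 0 a car without it.
    defects_a = round(p_a * population)
    defects_b = round(p_b * population)
    plant_a = [1] * defects_a + [0] * (population - defects_a)
    plant_b = [1] * defects_b + [0] * (population - defects_b)

    diffs = []
    for _ in range(months):
        # rng.sample draws without replacement, like the manager's samples.
        phat_a = sum(rng.sample(plant_a, n)) / n
        phat_b = sum(rng.sample(plant_b, n)) / n
        diffs.append(phat_a - phat_b)
    return diffs


diffs = simulate_differences()
print(f"simulated mean of the difference: {mean(diffs):.4f}")
print("theoretical mean: 0.08 - 0.06 = 0.02")
```

The simulated mean should land close to 0.02, though not exactly on it, because it is itself an average of random samples.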
So instead of thinking in terms of standard deviation, let's think about the square of the standard deviation, which is variance. And from there, we can go back to standard deviation by taking a square root. So if we're looking at the variance, lemme write it this way, if we're looking at the variance of the difference of the sample proportions, so the sample proportion from plant A minus the sample proportion from plant B, then just as a review, if you assume that we're sampling independently from each of the plants, so what we're sampling from plant A does not affect what we're sampling from plant B or vice versa, then we can add the variances. So this is going to be equal to the variance of the sample proportion from plant A plus the variance of the sample proportion from plant B. Some of you might be saying, "Wait, aren't we taking the difference of sample proportions here? Why are we adding?" And the reminder is, remember, variance is a measure of spread. And whether you're taking the difference of random variables or you're taking the sum of them, when you have more variables, you're going to have more spread. So regardless of whether the sign over here is a negative or a positive, the sign between the variances is going to be a positive. So what is this going to be equal to? We can take each of these terms. What's going to be the variance of the sample proportion from plant A? Well, if every time we looked at one of the cars, we looked at it and then we put it back into the mix, so if we were sampling with replacement, which means that each of our observations is independent of the other ones, we have a formula. We know that this variance would be the population proportion of plant A times one minus the population proportion of plant A, divided by the number that we sampled from plant A. Now, in the scenario that we are talking about, we didn't sample with replacement, we just took 200 at a time and looked at them. We didn't take one at a time and replace it and do that 200 times.
But we also know that this is a pretty good approximation even when you are not sampling with replacement, if your sample is less than 10% of the population, and 200 is less than 10% of 3000. So this is a pretty good approximation, what you would use in a first year statistics class. And of course, we can use the same logic for plant B. Its variance is going to be equal to the population proportion in plant B times one minus the population proportion in plant B, all of that over your sample size from plant B. And we know all of these things. We know that your population proportion in plant A is 8% or 0.08. One minus that is 0.92. We're taking samples of 200 at a time from plant A. And then in plant B, we know the population proportion, they told us, is 6% or 0.06. One minus that is 0.94. And then the sample size from plant B is also going to be 200. We get 0.08 times 0.92 divided by 200 and then plus, let's open parentheses here, we get 0.06 times 0.94 divided by 200, and then actually let me close the parentheses, and that equals this business. So 0.00065. And then from this, we can figure out what the standard deviation is going to be. The standard deviation of the difference between our sample proportions is going to be just the square root of this. It's going to be the square root of 0.00065. And that is approximately equal to, let's just take the square root, and we get this, 0.025. And there you have it, we have thought about the standard deviation. And then last but not least, let's think about the shape. So just as a review, we just have to remind ourselves that the distribution of each sample proportion is going to be approximately normal as long as we expect at least 10 successes and 10 failures. Well, let's look at each of these. How many successes do you expect, where a success would actually be a defect? Let's think about this. For plant A, 8% of a sample of 200, that's going to be 16.
So you would expect 16 defects, and then you would expect 200 minus 16, or 184, with no defects, which is a lot larger than 10. So both of those are greater than or equal to 10. And then if you did the same thing for plant B, you get the same idea. 6% of 200 is 12. And then if you say the ones that have no defects, that's 200 minus 12, which is way more than 10. So in every situation, we expect to have at least 10 successes and 10 failures. And so we can assume that the distributions of each of these are going to be approximately normal. And we also know that the difference of two independent, normally distributed variables is also normal, so since both samples pass that large count condition that we just talked about, the difference is going to be approximately normal too. And so let's draw what this distribution might look like. It might look something like this. It's going to be a normal distribution where you have a mean right over here. I'll do that in that same color. A mean of 0.02. You can definitely take on negative values because there are some situations in which your sample proportion from plant B actually could be larger, just by random chance, than it is from plant A. So you can definitely take on negative values. But if I wanted to show where zero is, maybe zero is right over here, so we could draw an axis right over here. And then we know what the standard deviation is. It's 0.025 or it's approximately that. So if we were to go one standard deviation down, we would go right about there, and if we were to go one standard deviation up, we would go right about there. And obviously, we could go more than one standard deviation above or below that mean.
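The whole walkthrough can be collected into a short calculation. The sketch below is illustrative only: the function name `describe_difference` and its dictionary keys are hypothetical, and the population size is assumed to be exactly 3000 because the problem only says "over 3000" (a larger population would make the 10% condition easier to satisfy, not harder).

```python
import math


def describe_difference(p_a, n_a, p_b, n_b, population):
    """Summarize the sampling distribution of p_hat_A - p_hat_B.

    Assumes independent random samples from the two plants.
    """
    mean_diff = p_a - p_b
    # Variances add for independent samples, even though we take a difference.
    variance = p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b
    sd = math.sqrt(variance)

    # 10% condition: each sample is at most 10% of its population, so the
    # with-replacement variance formula is a reasonable approximation.
    ten_percent_ok = n_a <= 0.10 * population and n_b <= 0.10 * population

    # Large counts condition: at least 10 expected successes and failures.
    expected_counts = [n_a * p_a, n_a * (1 - p_a), n_b * p_b, n_b * (1 - p_b)]
    large_counts_ok = all(count >= 10 for count in expected_counts)

    return {
        "mean": mean_diff,
        "variance": variance,
        "sd": sd,
        "ten_percent_ok": ten_percent_ok,
        "large_counts_ok": large_counts_ok,
        "one_sd_below": mean_diff - sd,
        "one_sd_above": mean_diff + sd,
    }


# Values from the problem; population=3000 is an assumed lower bound.
summary = describe_difference(p_a=0.08, n_a=200, p_b=0.06, n_b=200,
                              population=3000)
for name, value in summary.items():
    print(f"{name}: {value}")
```

With these inputs the variance works out to 0.00065 and the standard deviation to roughly 0.025, so one standard deviation below the mean is slightly negative (about -0.005) and one standard deviation above is about 0.045, which matches the sketch of the curve with zero sitting just inside one standard deviation of the mean.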