Simulation showing bias in sample variance

AP.STATS: UNC‑1 (EU), UNC‑1.J (LO), UNC‑1.J.3 (EK), UNC‑3 (EU), UNC‑3.I (LO), UNC‑3.I.1 (EK)
Simulation by Peter Collingridge giving us a better understanding of why we divide by (n-1) when calculating the unbiased sample variance. Simulation available at: http://www.khanacademy.org/cs/challenge-unbiased-estimate-of-population-variance/1169428428. Created by Sal Khan.
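The bias the simulation demonstrates can also be derived analytically. A standard algebra sketch (not part of the video), using the identity sum(Xi − X̄)² = sum(Xi²) − nX̄² and the facts E[Xi²] = σ² + μ² and E[X̄²] = σ²/n + μ²:

```latex
\begin{aligned}
\mathbb{E}\left[\sum_{i=1}^{n}(X_i-\bar{X})^2\right]
  &= \mathbb{E}\left[\sum_{i=1}^{n}X_i^2\right] - n\,\mathbb{E}\left[\bar{X}^2\right] \\
  &= n\left(\sigma^2+\mu^2\right) - n\left(\frac{\sigma^2}{n}+\mu^2\right) \\
  &= (n-1)\,\sigma^2 .
\end{aligned}
```

So dividing the sum of squared deviations by n has expected value (n−1)/n · σ² (biased), while dividing by n−1 has expected value exactly σ² (unbiased), which is what the simulation converges to.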

Want to join the conversation?

  • santi:
    I would love to see another simulation comparing the biased estimator to the unbiased estimator (two of the right-hand corner graphs, one for each type) to appreciate the real difference. Has anyone done this?
    Thanks!
    (40 votes)
  • kyle shapiro:
    I understand how the n-1 can be derived through simulation, but from a logical standpoint, why could it not be n-2 or n-3?
    (29 votes)
  • jonathan hay:
    Is n-1 only unbiased if the underlying population is normally distributed, or is it always true?
    (7 votes)
  • Nadav Lapidot:
    Is the fact that the unbiased sample variance should be divided by (n-1) totally empirical (i.e. derived from observations), or is there a mathematical way of showing why it should be (n-1) and not just n?
    (7 votes)
    • August Sonne:
      By definition, an unbiased estimator is an estimator such that E[Û]=U, where Û denotes the estimator and U denotes the true value of the quantity you wish to estimate. By considering E[S^2] = E[(1/n) sum(Yi-Ym)^2], where the sum runs from sample 1 to n and Ym denotes the mean of your n samples, it can be proved that this equals (n-1)V/n, where V is the true variance. By multiplying the sample variance S^2 by n/(n-1), you then get an unbiased estimator.
      (0 votes)
  • ThamarBoaz:
    Maybe I am just misreading the graph, but if the lower-left graph is showing biased sample variance, why are there so many points occurring above the population variance of 36.8? Since the estimator is biased toward a lower value, shouldn't the biased sample variance always be below the actual population variance (as it was in the previous video's example)? Also, if it is showing unbiased values, can you tell me where in the code the variance is divided by n-1? (I looked and didn't find it anywhere.)
    (5 votes)
  • Gibb3d:
    How do we know when the sample size is small enough to warrant using n-1 as opposed to n?
    (3 votes)
    • Tejas Veeramani:
      We need to use "n-1" instead of "n" as long as the sample is smaller than the population (which is almost always the case). It doesn't matter whether the sample is really large or really small; either way it's smaller than the population, so we need to compensate by using "n-1". And since the value of "n-1" changes when the size of the sample changes, the compensation is in proportion to the size of the sample. Hope this was helpful.
      (4 votes)
  • Beau Hansen:
    I understand how Sal shows the estimate of the population variance to equal sum(xi-x[bar])^2/(n-1). What I do not quite understand is the relationship between the unbiased estimated population variance [sigma^2] and the sample variance [s^2]. Are they the same? Do we use [sigma^2] when calculating p-values?
    Thanks
    (2 votes)
    • Dr C:
      I think you may have had a typo in there. If we have population data (all the data possible), we could calculate the population variance exactly, because we have the population: σ^2 = sum(xi-μ)^2/n.

      If we only have a sample, then we need to estimate the population variance. It's a foregone conclusion that we can't calculate it, since we don't have all the possible data, but we can estimate it. And this is where the "unbiased" version comes into play: not all estimates are created equal. Going back to the population version (above), we could simply replace the μ with an xbar and call it a day:
      s^2 = sum(xi-xbar)^2/n
      However, this is what Sal is showing is a biased estimate. Since we're simulating data, we know the true variance, but we pretend as if we don't and so calculate the sample variance. If this formula gave us "good" results, the ratio of these two values, s^2 / σ^2, should be about 1. Sometimes it would be larger, other times smaller, but that's what it would average out to be. Unfortunately, this doesn't work. The formula above gives biased results - it tends to miss the mark by underestimating a little bit. Thankfully, we can figure out how to correct for this, and it's just a scaling factor. So we replace the formula above with
      s^2 = sum(xi-xbar)^2/(n-1)
      When we use this formula, the ratio of the sample variance to the population variance will tend to be 1, so we have an "unbiased" estimate of the population variance.

      The estimated sample variance is not "the same" as the population variance, it's our best guess at what the population variance is. This is the same way that the sample mean is not the same as the population mean, but it's our best guess.

      In terms of what we use: if we have σ^2, we should use it. If we don't, then we use s^2. In the case of just variances, since the formulas are so similar, this basically means the following: If we have population data, we divide by n. If we only have a sample, then we divide by n-1. There are some additional consequences of this further down the line (e.g., if we have σ^2, then we'd use a Z-test instead of a t-test later on).
      (5 votes)
  • Sleuth:
    So I understand how and why the correction is necessary. What I don't understand is: what is the reason for the bias to be (n-1)/n?

    It makes sense that the variance will be different in a sample. But why this pattern of (n-1)/n?
    If the points I take for the sample are close to each other, the variance will be smaller than that of the total population. If they are at the extreme ends, or just a few points spread out over maximum distances, the variance could be even greater than that of the total population. Why don't the samples cancel each other out? Why (n-1)/n?
    (1 vote)
  • shakti:
    What do "biased" and "unbiased" actually tell us? What do they actually mean?
    (1 vote)
  • InnocentRealist:
    This video is helpful and intriguing, but I don't yet get why the data behave this way. I also still don't get why we square the distance from the sample mean, and I imagine that's somehow involved in this bias thing. Oh well, "ours is not to reason why..."
    (2 votes)
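Several of the questions above ask to see the biased and unbiased estimators side by side. A minimal sketch of that comparison (plain Python, not Peter Collingridge's original scratch-pad program; the population values are made up for illustration):

```python
import random

random.seed(1)

# Hypothetical population (the real simulation generates a random one each visit).
population = [random.gauss(10, 6) for _ in range(500)]
mu = sum(population) / len(population)
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)

n, trials = 4, 50_000
biased_total = unbiased_total = 0.0
for _ in range(trials):
    sample = [random.choice(population) for _ in range(n)]  # sample with replacement
    xbar = sum(sample) / n
    ss = sum((x - xbar) ** 2 for x in sample)
    biased_total += ss / n          # biased: divide by n
    unbiased_total += ss / (n - 1)  # unbiased: divide by n - 1

print(biased_total / trials / sigma2)    # tends toward (n - 1)/n = 0.75
print(unbiased_total / trials / sigma2)  # tends toward 1.0
```

Averaged over many trials, the biased ratio settles well below 1 while the unbiased ratio settles at 1, which is exactly the difference the simulation's lower-right chart visualizes.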

Video transcript

Voiceover: This right here is a simulation that was created by Peter Collingridge using the Khan Academy computer science scratch pad to better understand why we divide by n minus one when we calculate an unbiased sample variance, when we are, in an unbiased way, trying to estimate the true population variance. So what this simulation does is first it constructs a population distribution, a random one, and every time you go to it, it will be a different population distribution. This one has a population of 383, and then it calculates the parameters for that population directly from it. The mean is 10 point nine; the variance is 25 point five. Then it uses that population and samples from it, and it does samples of size two, three, four, five, all the way up to 10, and it keeps sampling from it and calculates the statistics for those samples, so the sample mean and the sample variance, in particular the biased sample variance. It starts telling us some things that give us some intuition. You can actually click on each of these and zoom in to really be able to study these graphs in detail. I have already taken a screen shot of this and put it on my little doodle pad, so you can really delve into some of the math and the intuition of what this is actually showing us. So here I took a screen shot, and you see for this case right over here, the population was 529. The population mean was 10 point six, and down here in this chart, he plots the population mean right here at 10 point six, right over there, and you see that the population variance is at 36 point eight, and right here he plots that, right over here, 36 point eight. This first chart on the bottom left tells us a couple of interesting things. Just to be clear, this is the biased sample variance that he is calculating. That is being calculated for each of our samples.
So starting with our first data point in each of our samples and going to our nth data point in the sample, you're taking that data point, subtracting out the sample mean, squaring it, and then dividing the whole thing, not by n minus one, but by lowercase n. This tells us several interesting things. The first thing it shows us is that the cases where we are significantly underestimating the sample variance, and we are getting sample variances close to zero, are also the cases, or disproportionately the cases, where the means for those samples are way far off from the true population mean. Or we could say it the other way around: in the cases where the sample mean is way far off from the population mean, it seems like you're much more likely to underestimate the sample variance. The other thing that might pop out at you is the realization that the pinker dots are the ones for smaller sample sizes, while the bluer dots are the ones for larger sample sizes. You see here that these two little tails, so to speak, of this hump, these ends, are more of a reddish color, and that most of the blueish or the purplish dots are focused right in the middle, right over here; they are giving us better estimates. There are some red ones here, and that's why it gives us that purplish color, but out here on these tails it's almost purely red. Every now and then by happenstance you get a little blue one, but it's disproportionately far more red, which really makes sense: when you have a smaller sample size, you are more likely to get a sample mean that is a bad estimate of the population mean, that's far from the population mean, and you're more likely to significantly underestimate the sample variance.
Now this next chart really gets to the meat of the issue, because what this is telling us is that for each of these sample sizes, so this right over here for sample size two, if we keep taking samples of size two, and we keep calculating the biased sample variances, dividing each by the population variance, and finding the mean over all of those, you see that over many, many, many trials, and many, many samples of size two, that ratio of the biased sample variance to the population variance approaches one half. When the sample size is three, it's approaching 2/3, 66 point six percent, of the true population variance. When the sample size is four, it's approaching 3/4 of the true population variance. So we can come up with the general theme of what's happening. When we use the biased estimate, we're not approaching the population variance. We're approaching n minus one over n times the population variance. When n was two, this approached 1/2. When n is three, this is 2/3. When n is four, this is 3/4. So this is giving us a biased estimate. So how would we unbias this? Well, if we really want to get our best estimate of the true population variance, not n minus one over n times the population variance, we would want to multiply, I'll do this in a color I haven't used yet, we would want to multiply times n over n minus one to get an unbiased estimate. Here, these cancel out and you are just left with your population variance. That's what we want to estimate. Over here you are left with our unbiased estimate of population variance, our unbiased sample variance, which is equal to, and this is what we saw in the last several videos, what you see in statistics books, and sometimes it's confusing why, hopefully Peter's simulation gives you a good idea of why, or at least convinces you that it is the case. So you would want to divide by n minus one.
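The pattern Sal reads off the chart, that the mean biased sample variance settles at (n minus one over n) times the population variance for each sample size, can be reproduced with a short sketch (plain Python, not the original scratch-pad program; the population values are made up for illustration):

```python
import random

random.seed(2)

# Hypothetical population; the real simulation generates a random one each visit.
population = [random.uniform(0, 25) for _ in range(400)]
mu = sum(population) / len(population)
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)

trials = 20_000
for n in range(2, 11):  # sample sizes 2 through 10, as in the simulation
    total = 0.0
    for _ in range(trials):
        sample = [random.choice(population) for _ in range(n)]
        xbar = sum(sample) / n
        total += sum((x - xbar) ** 2 for x in sample) / n  # biased: divide by n
    ratio = total / trials / sigma2
    print(n, round(ratio, 3), round((n - 1) / n, 3))  # observed ratio vs. (n-1)/n
```

For each n, the average ratio of the biased sample variance to the population variance lands near (n-1)/n: about 1/2 for n = 2, 2/3 for n = 3, 3/4 for n = 4, and so on, which is the table of values the video walks through.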