
Simulation providing evidence that (n-1) gives us an unbiased estimate

Simulation by KA user tetef showing that dividing by (n-1) gives us an unbiased estimate of population variance. Simulation at: http://www.khanacademy.org/cs/will-it-converge-towards-1/1167579097. Created by Sal Khan.

Want to join the conversation?

  • vilgot:
    Just curious: Was it by simulations like this that statisticians originally figured out the n-1 thing? Or is that conclusion actually really obvious if you just understand the "pure math" underlying it?
    (60 votes)
    • yoplait:
No, they did it analytically. They probably started from some intuition that the variance needs adjusting, but intuition alone cannot tell you why you have to divide by exactly n-1.

      There is also a geometric reason for dividing by n-1: it is the number of degrees of freedom. You can see this for the sample variance by counting the independent data points. To compute the sample variance, you first compute the sample mean. Given that sample mean, if someone hands you all the data points except one, you can work out the last data point yourself (as the sketch after this answer illustrates). So you effectively have n-1 independent data points for computing the sample variance, not n.
      (63 votes)
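
      To see that dependence concretely, here is a minimal sketch in Python (the sample values are arbitrary):

        # The sample mean fixes the last data point once the others are known:
        # mean = (x1 + ... + xn) / n  =>  xn = n * mean - (x1 + ... + x(n-1))
        sample = [2.0, 5.0, 7.0, 10.0]
        n = len(sample)
        mean = sum(sample) / n

        known = sample[:-1]                 # every data point except the last
        recovered = n * mean - sum(known)   # the mean pins the last one down
        print(recovered)                    # 10.0, exactly sample[-1]
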
  • Ian:
    I'm sorry, but what do "biased" and "unbiased" mean?
    (10 votes)
    • Ade:
A biased estimator is one that consistently underestimates or overestimates.

      For example, sample variance estimates that divide by n tend to consistently underestimate the population variance, so we say that estimator has a BIAS toward underestimation.

      Estimates that divide by (n-1), however, tend neither to underestimate nor to overestimate, so we consider that estimator UNBIASED.

      Note that unbiased is not the same thing as accurate. Suppose I use another method that sometimes wildly underestimates and at other times wildly overestimates. This method is not very accurate, but it can still be unbiased -- the mean of its errors would be close to zero, since the overestimates would "cancel out" the underestimates. The simulation sketch just below makes the n-versus-(n-1) bias visible.
      (2 votes)
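
      Here is a minimal simulation of that bias, using only Python's standard library (the population, sample size, and trial count are arbitrary choices):

        import random

        random.seed(0)
        # Build an arbitrary population and compute its true variance (divide by N).
        population = [random.gauss(0, 10) for _ in range(1000)]
        N = len(population)
        pop_mean = sum(population) / N
        pop_var = sum((x - pop_mean) ** 2 for x in population) / N

        n, trials = 5, 100_000
        by_n = by_n_minus_1 = 0.0
        for _ in range(trials):
            s = random.choices(population, k=n)   # sample with replacement
            m = sum(s) / n
            ss = sum((x - m) ** 2 for x in s)
            by_n += ss / n
            by_n_minus_1 += ss / (n - 1)

        print("population variance:  ", round(pop_var, 2))
        print("mean estimate / n:    ", round(by_n / trials, 2))           # biased low
        print("mean estimate / (n-1):", round(by_n_minus_1 / trials, 2))   # ~unbiased

      Averaged over many samples, the divide-by-n estimate should settle near (n-1)/n = 80% of the true variance, while the divide-by-(n-1) estimate should settle at the true value.
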
  • Chris:
    When do you use (n-1) for variance, and when is it just n? Thank you, I would really appreciate a clear answer...
    (6 votes)
  • Santiago Scanlan:
    These explanations are based on empirical evidence. Is there a theoretical explanation for dividing by n-1?
    (8 votes)
  • Eric Antolović:
    I understand that n-1 provides a more accurate estimate. However, if we know the population size N, couldn't we just subtract the ratio n/N from n instead? For example, if N=20 and n=10, the ratio is 0.5, so we could get an even better estimate by dividing by n-0.5.
    (1 vote)
    • Dr C:
The number that we subtract has nothing to do with the size of the population. And it's not just that it makes the estimate "more accurate"; it makes it what statisticians call "unbiased."

      Think back to the sampling distribution of the sample mean: if we repeated an experiment over and over and recorded the sample mean from each repetition, the mean of that sampling distribution -- what Sal sometimes refers to as the "mean of means" -- happens to equal the mean of the original distribution. Because of this, we say the sample mean is "unbiased": it doesn't systematically overestimate or underestimate the population mean.

      This is not the case with the variance. If we calculate the variance over and over again, using n in the denominator, the "mean of variances" (a strange concept, but the proper one to think about) will not equal σ^2; it will be σ^2 * (n-1)/n. Dividing by n-1 instead of n fixes this: with n, the sample variance is biased, because it tends to underestimate the population variance; with n-1, it is unbiased. (A quick numeric check of that (n-1)/n factor follows this answer.)

      So in this sense, it's not possible to get a better estimate of the variance. Subtracting 1, and specifically 1, is the best we can do; changing what we divide by can only make things worse. There are other criteria under which a different estimator of the variance might look "better," but if we're only talking about the denominator, n-1 can't be beat.
      (12 votes)
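
      Here is a minimal numeric check of that (n-1)/n factor, as a sketch assuming i.i.d. draws from a normal distribution (all parameters are arbitrary):

        import random

        random.seed(1)
        sigma2 = 9.0                 # true population variance (sigma = 3)
        n, trials = 4, 200_000

        mean_of_variances = 0.0
        for _ in range(trials):
            s = [random.gauss(0, 3) for _ in range(n)]
            m = sum(s) / n
            mean_of_variances += sum((x - m) ** 2 for x in s) / n   # divide by n

        mean_of_variances /= trials
        print(round(mean_of_variances, 2))   # close to 6.75
        print(sigma2 * (n - 1) / n)          # 9 * 3/4 = 6.75, the predicted value
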
  • Jan:
    Hi all,

I have also heard people say that we divide by the degrees of freedom, which, as I understand it, is the number of values I need to fix to get the information on all values. In this case, this would mean that if I am provided with the sample mean, I only have n-1 degrees of freedom, since I can calculate the last value in my sample from the information I've got.
    Question 1: Did I understand this correctly so far?
    Question 2: Where is the logical link between "I can estimate the last value based on the information I am given" and "I had better divide by n-1 to estimate the variance"?
    Question 3: The same idea would seem to hold for the population variance: there, too, I can calculate the missing value given n-1 values and the mean. So why, from the degrees-of-freedom standpoint, do I still divide by n there?
    Question 4: How is the concept of degrees of freedom related to the explanation for using n-1 given in the video?

    Thank you very much for your help!
    (6 votes)
  • lobotomaniac:
    Isn't the relative size of the sample compared to the population relevant when calculating the sample variance? I mean, if we calculate the variance of 99 elements out of a population of 100 elements, won't the variance of this sample be more accurately computed by dividing by n rather than (n-1)? Is there a threshold below which a sample should be described by (n-1)?
    (4 votes)
    • Tanner P:
      That’s an excellent question, and I’m not sure about the answer.

      But if our sample size is only one or two less than our population size, we might as well look at every element in the population instead. Sampling is used when it is not practical to take information from the whole population, so there is usually a good portion of the population left over. So, this situation isn’t practical, but it is interesting to think about theoretically.
      (2 votes)
  • shakti:
    Why don't we divide our sample mean by n-1? Is it not a biased estimator?
    (1 vote)
  • Ismael Cherif:
    When your sample size approaches the population size, at what point would it be best to stop using (n-1) and use (n)?
    (3 votes)
  • Heitor Murilo Gomes:
    If my sample size is greater than, let's say, half of the population size, i.e. n > N/2, should I use the biased sample variance to get a better estimate? More generally, if I know the value of N, should I use this information to decide which formula to use? And at what value of n/N should I consider using the biased sample variance?
    (3 votes)

Video transcript

Here's a simulation created by Khan Academy user TETF. I'll assume that's pronounced "tet-f". What it allows us to do is get an intuition for why we divide by n minus 1 when we calculate our sample variance, and why that gives us an unbiased estimate of population variance.

The way this starts off, and I encourage you to try this out yourself, is that you can construct a distribution. It says "build a population by clicking in the blue area." So here we are actually creating a population: every time I click, it increases the population size. I'm just doing this randomly, and I encourage you to go to this scratchpad -- it's on Khan Academy Computer Science -- and try it yourself. So I could stop at some point; I've constructed a population. I can throw out some random points up here. This is our population, and as you saw while I was doing that, it was calculating parameters for the population: the population mean, at 204.09, and the population standard deviation, which is derived from the population variance. This is the square root of the population variance, and it's at 63.8. It was also plotting the population variance down here. You see it's 63.8, which is the standard deviation, and it's a little harder to see, but it says it's squared: these numbers are squared. So essentially, 63.8 squared is the population variance.

That's interesting by itself, but it really doesn't tell us a lot so far about why we divide by n minus 1. This is the interesting part: we can now start to take samples, and we can decide what sample size we want. I'll start with really small samples, the smallest possible sample that makes any sense. What the simulation is going to do is, every time I take a sample, calculate the variance. The numerator is going to be the sum, over each data point in my sample, of the point minus my sample mean, squared. And then it's going to divide that by n plus a, and it's going to vary a: it's going to divide by anything from n plus negative 3 (that is, n minus 3) up to n plus 3. And we're going to do this many, many, many times, essentially taking the mean of those variances for each a and figuring out which one gives us the best estimate.

So if I just generate one sample right over there, we see this kind of curve: when we have high values of a, we are underestimating the population variance, and when we have lower values of a, we are overestimating it. But that was just one sample of size two, not really that meaningful. Let's generate a bunch of samples and then average over many of them. And you see, when you look at many, many, many examples, something interesting is happening. When you average together those curves from all of those samples, you see that our best estimate is when a is pretty close to negative 1, that is, when this is n plus negative 1, or n minus 1. Anything less than negative 1 -- if we did n minus 1.05 or n minus 1.5 -- and we start overestimating the variance. Anything greater than negative 1 -- if we divide by n, or by n plus 0.05, or whatever it might be -- and we start underestimating the population variance. And you can do this for samples of different sizes. Let me try a sample size of 6.
And here you go: once again, as I keep Generate Sample pressed down and we generate more and more samples -- and, for each a, essentially take the average variance across those samples, depending on how we calculate it -- you'll see that, once again, our best estimate is pretty darn close to negative 1. And if you were to generate millions of samples, you'd see that your best estimate is when a is negative 1, that is, when you're dividing by n minus 1. So once again, thanks, TETF, for this. I think it's a really interesting way to think about why we divide by n minus 1.
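
For anyone who wants to recreate the experiment outside the scratchpad, here is a minimal sketch in Python rather than the original Khan Academy program (the population, sample size, trial count, and grid of a values are arbitrary choices):

    import random

    random.seed(42)

    # Build a population (a stand-in for clicking in the blue area).
    population = [random.uniform(100, 300) for _ in range(500)]
    N = len(population)
    pop_mean = sum(population) / N
    pop_var = sum((x - pop_mean) ** 2 for x in population) / N

    n, trials = 2, 200_000
    a_values = [k / 20 for k in range(-60, 61)]   # a from -3.0 to 3.0 in 0.05 steps
    totals = {a: 0.0 for a in a_values}

    for _ in range(trials):
        s = random.choices(population, k=n)       # sample with replacement
        m = sum(s) / n
        ss = sum((x - m) ** 2 for x in s)
        for a in a_values:
            if n + a > 0:                         # skip non-positive denominators
                totals[a] += ss / (n + a)

    # Which a makes the mean of the variances best match the population variance?
    best_a = min(a_values, key=lambda a: abs(totals[a] / trials - pop_var))
    print("population variance:", round(pop_var, 2))
    print("best a:", best_a)

With a couple hundred thousand samples, best_a should land at or right next to -1.0, the same convergence toward n minus 1 that the video demonstrates.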