Simulation providing evidence that (n-1) gives us unbiased estimate
Simulation by KA user tetef showing that dividing by (n-1) gives us an unbiased estimate of population variance. Simulation at: http://www.khanacademy.org/cs/will-it-converge-towards-1/1167579097. Created by Sal Khan.
Want to join the conversation?
- Just curious: Was it by simulations like this that statisticians originally figured out the n-1 thing? Or is that conclusion actually really obvious if you just understand the "pure math" underlying it?(58 votes)
- No, they did it analytically. They probably came up with some intuition of the need to adjust the variance, but intuition cannot tell you why you have to divide exactly by n-1.
There is a geometric reason for dividing by n-1: it is the number of degrees of freedom. You can see this for the sample variance by considering the number of independent data points. To compute the sample variance, you first compute the sample mean. Given this sample mean, if someone gives you all the data points except one, you can figure out the last data point yourself. So you don't actually have n independent data points with which to compute the sample variance, but only n-1.(61 votes)
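A minimal sketch of that last point in Python, with a made-up sample:

```python
# A made-up sample of n = 5 points.
data = [12.0, 15.0, 9.0, 20.0, 14.0]
n = len(data)
sample_mean = sum(data) / n

# Since sum(data) = n * sample_mean, the sample mean plus any
# n - 1 of the points pins down the remaining point:
recovered_last = n * sample_mean - sum(data[:-1])
print(recovered_last)  # 14.0, matching data[-1]
```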
- I'm sorry, but what do biased and unbiased mean?(10 votes)
- A biased estimate is one that consistently underestimates or overestimates.
For example, sample estimates using (n) tend to consistently underestimate the population variance. So we say it has a BIAS for underestimation.
Sample estimates using (n-1) however do not tend to underestimate or overestimate, so we consider it UNBIASED.
Note that unbiased is not the same thing as accurate. Suppose I use another method that sometimes way underestimates, but at other times way overestimates. This method is not very accurate, but it is also unbiased -- the mean of its errors would be close to zero since the overestimates would "cancel out" the underestimates.(1 vote)
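A minimal simulation sketch of that idea in Python (the population, sample size, and trial count here are made up):

```python
import random

# Made-up population; its variance uses the n denominator,
# since we have every value.
population = [random.gauss(100, 15) for _ in range(10_000)]
pop_mean = sum(population) / len(population)
pop_var = sum((x - pop_mean) ** 2 for x in population) / len(population)

n = 5            # small sample size, where the bias is most visible
trials = 100_000
total_n, total_n_minus_1 = 0.0, 0.0

for _ in range(trials):
    sample = random.sample(population, n)
    m = sum(sample) / n
    ss = sum((x - m) ** 2 for x in sample)
    total_n += ss / n                # divides by n (biased)
    total_n_minus_1 += ss / (n - 1)  # divides by n - 1 (unbiased)

print("population variance:           ", pop_var)
print("mean of n estimates (too low): ", total_n / trials)
print("mean of n-1 estimates (close): ", total_n_minus_1 / trials)
```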
- When do you use (n-1) in a variance question, and when is it just n? Thank you, I would really appreciate a clear answer...(6 votes)
- n-1 when you choose a sample from the population,
n when you've counted the entire population.(18 votes)
- These explanations are based on empirical evidence. Is there a theoretical explanation for dividing by n-1?(8 votes)
- For me this Wikipedia link is more detailed and more understandable, though more or less the same as the "Sample variance" page. Still, it's a bit different and might be worth checking for those who need more info on the "n-1" stuff after the Sample variance article: https://en.wikipedia.org/wiki/Bias_of_an_estimator(3 votes)
- Hi all,
I have also heard people say that we divide by the degrees of freedom, which, as I understand it, is the number of values I need to fix in order to get information on all the values. In this case, this would mean that, if I am provided with the sample mean, I only have n-1 degrees of freedom, as I can calculate the last value in my sample from the information I have.
Question 1: Did I understand this correctly so far?
Question 2: Where is the logical link between 'I can estimate the last value based on the information I am given' and 'I better divide by n-1 to estimate the variance'?
Question 3: The same idea would be true for the population variance. Here, too, I can calculate the missing value given n-1 values and the mean? So why, under the aspect of degrees of freedom, would I still divide by n here?
Question 4: How is the concept of degrees of freedom related to the explanation for using n-1 provided in the video?
Thank you very much for your help!(6 votes)
- I understand that n-1 provides a more accurate estimation. However, if we know our population N value, couldn't we just subtract the n/N ratio from n instead? For example, if N=20 and n=10, we would know the ratio is 0.5. Therefore, we could find an even better estimate from n-0.5.(1 vote)
- The number that we subtract has nothing to do with the size of the population. It's not just that it makes the estimate "more accurate"; it's that it makes it what statisticians call "unbiased."
Think back to the sampling distribution of the sample mean: suppose we repeated an experiment over and over again and recorded the sample mean from each repetition. The mean of the sampling distribution of the sample mean -- what Sal sometimes refers to as the "mean of means" -- happens to be equal to the mean of the original distribution. Because of this, we say that the sample mean is "unbiased": it doesn't systematically overestimate or underestimate the population mean.
This is not the case with the variance. If we calculate the variance over and over again, using n in the denominator, the "mean of variances" (a strange concept, but it's the proper one to think about) will not be equal to σ^2; it will be σ^2 * (n-1)/n. By dividing by n-1 instead of n, we fix this problem. Using n, the sample variance is biased, because it tends to underestimate the population variance. Using n-1, the sample variance is unbiased.
So in this sense, it's not possible to get a better estimate for the variance: subtracting 1, and specifically 1, is the best we can do, and changing what we divide by can only make it worse. Now, there are other criteria we might look at which may make a different estimate of the sample variance seem "better," but if we're just talking about the denominator, n-1 can't be beat.(11 votes)
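A quick numerical check of that (n-1)/n factor in Python, with made-up parameters:

```python
import random

mu, sigma = 50.0, 10.0   # made-up population mean and standard deviation
n = 4                    # made-up sample size
trials = 200_000

total = 0.0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    m = sum(sample) / n
    total += sum((x - m) ** 2 for x in sample) / n  # n in the denominator

print(total / trials)            # "mean of variances": close to...
print(sigma ** 2 * (n - 1) / n)  # ...sigma^2 * (n-1)/n = 75.0
```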
- Why don't we divide our sample mean by n-1? Is it not a biased estimator?(1 vote)
- Different sample means oscillate around the population mean (they can be both higher and lower), but sample variances computed with n in the denominator tend to be lower than the population variance.(5 votes)
- When your sample size approaches the population size, at what point would it be best to stop using (n-1) and use (n)?(3 votes)
- You should use n-1 unless your sample is the entire population N. However, for large n, it does not matter much whether you use n instead of the preferred n-1, since the ratio (n-1)/n is close to 1.(1 vote)
- If my sample size is greater than, let's say, half of the population size, i.e. n > N/2, should I use the biased sample variance to get a better estimate? More generally, if I am aware of the value of N, should I use this information to decide which formula to use? And at which value of n/N should I consider using the biased sample variance?(3 votes)
- You should use n-1 unless your sample is the entire population N. However, for large n, it does not matter much whether you use n instead of the preferred n-1, since the ratio (n-1)/n is close to 1.(1 vote)
- I'm still confused about how to find the variance. Can someone explain, please?(2 votes)
- The variance formula for an entire population is the sum of the squares of the difference between the values and the mean of the data, divided by the number of data points.
The variance formula for a sample of a population is the sum of the squares of the difference between the values and the mean of the data, divided by the number of data points minus 1!(1 vote)
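A minimal sketch of both formulas in Python (the data points are made up):

```python
# Made-up data points.
data = [4.0, 7.0, 13.0, 16.0]
n = len(data)
mean = sum(data) / n
sum_sq = sum((x - mean) ** 2 for x in data)

population_variance = sum_sq / n    # if data is the entire population
sample_variance = sum_sq / (n - 1)  # if data is a sample from a population

print(population_variance)  # 22.5
print(sample_variance)      # 30.0
```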
Video transcript
Here's a simulation created by Khan Academy user TETF (I can assume that's pronounced "tet eff"). What it allows us to do is get an intuition as to why we divide by n minus 1 when we calculate our sample variance, and why that gives us an unbiased estimate of the population variance.

The way this starts off, and I encourage you to go try this out yourself, is that you can construct a distribution. It says "build a population by clicking in the blue area." So here, we are actually creating a population, and every time I click, it increases the population size. I'm just randomly doing this, and I encourage you to go onto this scratch pad -- it's on the Khan Academy Computer Science section -- and try it yourself.

So here I could stop at some point. I've constructed a population, and I can throw out some random points up here. This is our population, and as you saw while I was doing that, it was calculating parameters for the population. It was calculating the population mean, at 204.09, and also the population standard deviation, which is derived from the population variance; this is the square root of the population variance, and it's at 63.8. It was also plotting the population variance down here. You see it's 63.8, which is the standard deviation, and it's a little harder to see, but it says it's squared. These are these numbers squared, so essentially, 63.8 squared is the population variance.

That's interesting by itself, but it really doesn't tell us a lot so far about why we divide by n minus 1. This is the interesting part: we can now start to take samples, and we can decide what sample size we want. I'll start with really small samples, the smallest possible sample that makes any sense, a sample of size two.

What the simulation is going to do is, every time I take a sample, it's going to calculate the variance. The numerator is going to be the sum of the squared differences between each data point in my sample and my sample mean. Then it's going to divide that by n plus a, and it's going to vary a: it will divide by anything from n plus negative 3, so n minus 3, all the way up to n plus 3. And we're going to do this many, many, many times, essentially taking the mean of those variances for each a and figuring out which one gives us the best estimate.

If I just generate one sample right over there, we see this kind of curve: when we have high values of a, we are underestimating the population variance, and when we have lower values of a, we are overestimating it. But that was just one sample of size two, not really that meaningful. Let's generate a bunch of samples and then average over many of them. And you see, when you look at many, many, many examples, something interesting is happening. When you look at the mean of those samples, when you average together those curves from all of those samples, you see that our best estimate is when a is pretty close to negative 1, that is, when we divide by n plus negative 1, or n minus 1. For anything less than negative 1, say n minus 1.05 or n minus 1.5, we start overestimating the variance. For anything greater than negative 1, so if we have n plus 0 (dividing by n), or n plus 0.05, or whatever it might be, we start underestimating the population variance.

You can do this for samples of different sizes. Let me try a sample size of 6. Here you go once again -- I'm just keeping Generate Sample pressed down -- as we generate more and more samples, and for all the a's we essentially take the average across those samples of the variance, depending on how we calculate it, you'll see that once again our best estimate is pretty darn close to negative 1. And if you were to get this to millions of samples generated, you'd see that your best estimate is when a is negative 1, or when you're dividing by n minus 1.

So once again, thanks, TETF, for this. I think it's a really interesting way to think about why we divide by n minus 1.
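A minimal re-creation of the simulation's core idea, sketched in Python (the population, sample size, and trial count below are made up; the real scratch pad is at the URL above):

```python
import random

# Build a made-up population (in the scratch pad you click one in).
population = [random.gauss(200, 60) for _ in range(2_000)]
pop_mean = sum(population) / len(population)
pop_var = sum((x - pop_mean) ** 2 for x in population) / len(population)

n = 6                                        # sample size, as in the video's second run
a_values = [i / 20 for i in range(-60, 61)]  # a from -3.0 to 3.0
trials = 50_000

# For each a, average the estimate sum((x - m)^2) / (n + a) over many samples.
totals = {a: 0.0 for a in a_values}
for _ in range(trials):
    sample = random.sample(population, n)
    m = sum(sample) / n
    ss = sum((x - m) ** 2 for x in sample)
    for a in a_values:
        totals[a] += ss / (n + a)

# The a whose average estimate lands closest to the population
# variance should come out near -1, i.e. dividing by n - 1.
best_a = min(a_values, key=lambda a: abs(totals[a] / trials - pop_var))
print("population variance:", pop_var)
print("best a:", best_a)
```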