# Simulation showing bias in sample variance

AP.STATS: UNC‑1 (EU), UNC‑1.J (LO), UNC‑1.J.3 (EK), UNC‑3 (EU), UNC‑3.I (LO), UNC‑3.I.1 (EK)

Simulation by Peter Collingridge giving us a better understanding of why we divide by (n-1) when calculating the unbiased sample variance. Simulation available at: http://www.khanacademy.org/cs/challenge-unbiased-estimate-of-population-variance/1169428428. Created by Sal Khan.

## Want to join the conversation?

• I would love to see another simulation comparing the biased estimator to the unbiased estimator (two copies of the right-hand corner graph, one for each type) to appreciate the real difference. Has anyone done this? Thanks!

• I understand how the n-1 can be derived through simulation, but from a logical standpoint, why could it not be n-2 or n-3?

• Is dividing by n-1 only unbiased if the underlying population is normally distributed, or is it always true?

• Is the fact that the unbiased sample variance should be divided by (n-1) totally empirical (i.e. derived from observations), or is there a mathematical way of showing why it should be (n-1) and not just n?

• By definition, an unbiased estimator is an estimator such that E[Û]=U, where Û denotes the estimator and U denotes the true value of the quantity you wish to estimate. Consider the sample variance computed with a divisor of n, `S^2 = sum(Yi-Ym)^2/n`, where the sum runs over the n sampled values and Ym denotes their mean. It can be proved that E[S^2] = (n-1)V/n, where V is the true variance. Multiplying S^2 by n/(n-1) then gives an unbiased estimator.
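
For anyone who wants to see the proof mentioned above, here is a short sketch. It assumes the Yi are independent draws from a population with mean μ and variance V; the symbols μ and \bar{Y} (for Ym) are my own notation, not from the video.

```latex
\begin{align*}
\sum_{i=1}^{n} (Y_i - \bar{Y})^2
  &= \sum_{i=1}^{n} (Y_i - \mu)^2 - n(\bar{Y} - \mu)^2, \\
E\!\left[\sum_{i=1}^{n} (Y_i - \bar{Y})^2\right]
  &= nV - n \cdot \frac{V}{n} = (n-1)V, \\
E\!\left[\frac{1}{n}\sum_{i=1}^{n} (Y_i - \bar{Y})^2\right]
  &= \frac{n-1}{n}\,V .
\end{align*}
```

The second line uses E[(Yi-μ)^2] = V for each draw and E[(\bar{Y}-μ)^2] = V/n for the mean of n independent draws. Nothing here assumes a normal population, only a finite variance.
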
• Maybe I am just misreading the graph, but if the lower-left graph is showing the biased sample variance, why are there so many points above the population variance of 36.8? Since it is biased toward a lower value, shouldn't the biased sample variance always be below the actual population variance (as in the previous video's example)? Also, if it is showing unbiased values, can you tell me where in the code the variance is divided by n-1? I looked and didn't find it anywhere.

• How do we know when the sample size is small enough to warrant using n-1 as opposed to n?

• We need to use "n-1" instead of "n" whenever we are working from a sample rather than the whole population (which is almost all the time). It doesn't matter whether the sample is really large or really small: either way it is less than the population, so we need to compensate by using "n-1". The value of "n-1" changes with the sample size, so the compensation is in proportion to the size of the sample. Hope this was helpful.
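
On the question of points landing above the population variance: "biased" describes the average of many estimates, not each individual one. Here is a rough sketch of that idea. The population is made up (1000 uniform values, not the one from the video's simulation), and `biased_variance` is just a helper name I chose.

```python
import random

def biased_variance(sample):
    """Sample variance with a divisor of n (the biased version)."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / n

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population, treated as "all the data".
population = [rng.uniform(0, 20) for _ in range(1000)]
mu = sum(population) / len(population)
pop_var = sum((x - mu) ** 2 for x in population) / len(population)

n = 5           # sample size
trials = 20000  # number of samples drawn (with replacement)
estimates = [biased_variance(rng.choices(population, k=n)) for _ in range(trials)]

mean_estimate = sum(estimates) / trials
share_above = sum(e > pop_var for e in estimates) / trials

print(f"population variance:        {pop_var:.2f}")
print(f"average biased estimate:    {mean_estimate:.2f}")
print(f"share of estimates above it: {share_above:.1%}")
```

A good share of the individual estimates should come out above the population variance, while their average sits below it, at roughly (n-1)/n of the true value.
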
• I understand how Sal shows the population variance to equal sum(xi-x[bar])^2/(n-1). What I do not quite understand is the relationship between the unbiased calculated population variance [sigma^2] and the sample variance [s^2]. Are they the same? Do we use [sigma^2] when calculating p-values? Thanks

• I think you may have had a typo in there. If we have population data (all the data possible), we can calculate the population variance exactly, because we have the whole population: `σ^2 = sum(xi-μ)^2/n`.

If we only have a sample, then we need to estimate the population variance. It's a foregone conclusion that we can't calculate it, since we don't have all the possible data, but we can estimate it. And this is where the "unbiased" version comes into play: not all estimates are created equal. Going back to the population version (above), we could simply replace the μ with an xbar and call it a day:
`s^2 = sum(xi-xbar)^2/n`
However, this is the estimate that Sal shows to be biased. Since we're simulating data, we know the true variance, but we pretend that we don't and calculate the sample variance anyway. If this formula gave us "good" results, the ratio of these two values, `s^2 / σ^2`, should be about 1. Sometimes it would be larger, other times smaller, but that's what it would average out to be. Unfortunately, this doesn't work. The formula above gives biased results: on average it misses the mark by underestimating a little bit. Thankfully, we can figure out how to correct for this, and it's just a scaling factor. So we replace the formula above with
`s^2 = sum(xi-xbar)^2/(n-1)`
When we use this formula, the ratio of the sample variance to the population variance will average out to 1, so we have an "unbiased" estimate of the population variance.
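
To make the ratio idea concrete, here is a minimal sketch that averages `s^2 / σ^2` for both divisors over many simulated samples. The normal population with mean 50 and standard deviation 6 is an assumption of mine, and `average_ratios` is a hypothetical helper name.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

SIGMA = 6.0             # assumed population standard deviation
true_var = SIGMA ** 2   # so the true variance is 36

def average_ratios(n, trials=20000):
    """Average s^2 / sigma^2 over many samples, for divisors n and n-1."""
    total_n = 0.0
    total_n_minus_1 = 0.0
    for _ in range(trials):
        sample = [rng.gauss(50.0, SIGMA) for _ in range(n)]
        xbar = sum(sample) / n
        ss = sum((x - xbar) ** 2 for x in sample)  # sum of squared deviations
        total_n += ss / n
        total_n_minus_1 += ss / (n - 1)
    return total_n / trials / true_var, total_n_minus_1 / trials / true_var

results = {n: average_ratios(n) for n in (2, 5, 10)}
for n, (ratio_n, ratio_n_minus_1) in results.items():
    print(f"n={n:2d}  divide by n: {ratio_n:.3f}  "
          f"divide by n-1: {ratio_n_minus_1:.3f}  (n-1)/n = {(n - 1) / n:.3f}")
```

The "divide by n" column should track (n-1)/n, and the "divide by n-1" column should hover near 1 for every sample size.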

The sample variance is not "the same" as the population variance; it's our best guess at what the population variance is. In the same way, the sample mean is not the same as the population mean, but it's our best guess.

In terms of what we use: if we have σ^2, we should use it. If we don't, then we use s^2. In the case of just variances, since the formulas are so similar, this basically means the following: If we have population data, we divide by n. If we only have a sample, then we divide by n-1. There are some additional consequences of this further down the line (e.g., if we have σ^2, then we'd use a Z-test instead of a t-test later on).
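
If you compute these in code, Python's standard `statistics` module makes the same split: `pvariance` divides by n and `variance` divides by n-1. A small sketch, with made-up data values:

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up values for illustration

# Treat the data as the whole population: divide by n.
population_variance = statistics.pvariance(data)

# Treat the data as a sample from a larger population: divide by n-1.
sample_variance = statistics.variance(data)

print(population_variance)  # sum of squared deviations (32) divided by 8
print(sample_variance)      # the same sum (32) divided by 7
```

The two results differ only by the scaling factor n/(n-1), which is 8/7 here.
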
• So I understand how and why the correction is necessary. What I don't understand is: why is the bias exactly (n-1)/n?

It makes sense that the variance will be different in a sample. But why this pattern of (n-1)/n?
If the points I take for the sample are close to each other, the variance will be smaller than that of the total population. If they are at the extreme ends, or just a few points spread out over maximum distances, the variance could be even greater than that of the total population. Why don't the samples cancel each other out? Why (n-1)/n?
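
One way to see that the samples don't cancel out: deviations are measured from the sample mean, which always sits at least as close (in total squared distance) to the sampled points as the true mean does. The sketch below checks the (n-1)/n factor exactly, with no randomness, by listing every possible sample drawn with replacement. The tiny population (the faces of a die) and the helper name `expected_biased_variance` are my own choices.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5, 6]  # hypothetical tiny population
N = len(population)
mu = Fraction(sum(population), N)
pop_var = sum((x - mu) ** 2 for x in population) / N  # exact population variance

def expected_biased_variance(n):
    """Average the divide-by-n variance over every possible sample of size n."""
    total = Fraction(0)
    count = 0
    for sample in product(population, repeat=n):  # all samples, with replacement
        xbar = Fraction(sum(sample), n)
        total += sum((x - xbar) ** 2 for x in sample) / n
        count += 1
    return total / count

for n in (2, 3, 4):
    ratio = expected_biased_variance(n) / pop_var
    print(n, ratio)
# prints:
# 2 1/2
# 3 2/3
# 4 3/4
```

Because every sample is counted once with exact fractions, the ratios come out as exactly (n-1)/n rather than approximately.
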

• What do "biased" and "unbiased" actually tell us? What do they actually mean?

• This video is helpful and intriguing, but I don't yet get at all why the data behave this way. I also still don't get why we square the distance from the sample mean, and I imagine that's somehow involved in this bias thing. Oh well, "ours is not to reason why..."