# Why we divide by n - 1 in variance

AP.STATS: UNC‑1 (EU), UNC‑1.J (LO), UNC‑1.J.3 (EK), UNC‑3 (EU), UNC‑3.I (LO), UNC‑3.I.1 (EK)

Another visualization providing evidence that dividing by n-1 truly gives an unbiased estimate of population variance. Simulation at: http://www.khanacademy.org/cs/unbiased-variance-visualization/1167453164. Created by Sal Khan.

## Want to join the conversation?

• Is there a concrete logical or mathematical proof using simple maths behind this?
• There is a concrete mathematical proof. Whether or not it uses "simple" math depends on what you think is simple vs. not-simple math. The proof requires you to understand the following:

1. Expected values of probability distributions.
2. Expected values of sums of independent random variables.

If you are comfortable with these things, the proof is easily accessible. If you are not comfortable with them, the proof may seem like picking things out of thin air. In the math, I'm going to use a dot (•) to represent multiplication. The asterisk sometimes causes issues with formatting, and the × is too easily confused with an x.

Our goal with the sample variance is to provide an estimate of the population variance that will be correct on average. Taking different samples will result in different values of s², but if we take a lot of samples, and record s² each time, we want that distribution to be centered on σ². Since s² is a random variable (different samples will result in different values), we write this mathematically as saying that the expected value of s² should be equal to σ²:

`E[ s² ] = σ²`

One thing we need to assume is that all observations are independent, and identically distributed - meaning that they all come from a population with the same mean µ and the same variance σ².

First, we're going to need a little side-derivation. For a random variable X, the variance, `σ² = E[ (X - µ)² ]`, where E[X]=µ. Expanding the square, we get:
`σ² = E[ X² - 2•X•µ + µ² ]`
`σ² = E[X²] - 2•µ•E[X] + µ²`
`σ² = E[X²] - µ²`
`E[X²] = µ² + σ²`
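
As a quick sanity check of this side-derivation, here is a minimal sketch that computes both sides exactly for one concrete distribution. The fair six-sided die and the helper name `expectation` are assumptions made for illustration only; exact fractions are used so no rounding is involved.

```python
from fractions import Fraction

# Hypothetical example distribution: a fair six-sided die.
faces = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)  # probability of each face


def expectation(f):
    """E[f(X)] for the fair die, computed exactly."""
    return sum(p * f(x) for x in faces)


mu = expectation(lambda x: x)                   # E[X]
sigma2 = expectation(lambda x: (x - mu) ** 2)   # E[(X - mu)^2]
e_x2 = expectation(lambda x: x ** 2)            # E[X^2]

print(mu, sigma2, e_x2)
print(e_x2 == mu ** 2 + sigma2)  # the back-pocket identity
```

For this die, µ = 7/2, σ² = 35/12 and E[X²] = 91/6, which is exactly µ² + σ².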

We will get to a point where we need E[X²], so just keep this in your back pocket for the moment. Now let's get back to E[ s² ]. To start, just substitute in the definition for the sample variance:

`E[ s² ] = E[ Σ (xi - xbar)² / (n-1) ]`

Now, since (n-1) is a constant, it can be pulled out of the expected value:
`E[ s² ] = (1/(n-1)) E[ Σ (xi - xbar)² ]`

Next, expand the square inside the sum:
`E[ s² ] = (1/(n-1)) E[ Σ (xi² - 2•xi•xbar + xbar²) ]`

Summations can be distributed across addition and subtraction, so we get three separate sums:
`E[ s² ] = (1/(n-1)) E[ Σ xi² - Σ2•xi•xbar + Σxbar² ]`

Now, xbar is constant with respect to the summation over i, so the factor 2•xbar can be pulled out of the second sum, and the third sum just adds xbar² a total of n times:
`E[ s² ] = (1/(n-1)) E[ Σxi² - 2•xbar•Σxi + n•xbar² ]`

Also, note that since `xbar = (1/n) Σ xi`, we can multiply each side by n to get `Σ xi = n•xbar`. This is a useful little trick.
`E[ s² ] = (1/(n-1)) E[ Σxi² - 2•xbar•n•xbar + n•xbar² ]`

Combine the second and third terms:
`E[ s² ] = (1/(n-1)) • E[ Σxi² - n•xbar² ]`
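
The algebra up to this point says that `Σ (xi - xbar)²` and `Σxi² - n•xbar²` are the same quantity for any data set. A small sketch can confirm that on one sample; the five values below are made up purely for illustration.

```python
from fractions import Fraction

# Hypothetical sample, chosen only for illustration.
xs = [Fraction(v) for v in (2, 4, 4, 5, 10)]
n = len(xs)
xbar = sum(xs) / n

lhs = sum((x - xbar) ** 2 for x in xs)           # Σ (xi - xbar)²
rhs = sum(x ** 2 for x in xs) - n * xbar ** 2    # Σ xi² - n•xbar²

print(lhs, rhs)  # both are 36 for this sample
```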

Now, the expected value can distribute over addition and subtraction to get us:
`E[ s² ] = (1/(n-1)) • [ ΣE[xi²] - n•E[xbar²] ]`

Remember that little thing we derived earlier and put in our back pocket? We need it now. We have two squared random variables, xi and xbar, whose expectations we need. For the first, the result applies directly: E[xi²] = µ² + σ². The second one is a little different, because we need the mean and variance of the sampling distribution of the sample mean. These are µ and σ²/n, respectively. So for the second term we have:
E[xbar²] = µ² + σ²/n.
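
To see this value of E[xbar²] come out of an actual calculation, the sketch below lists every possible sample of size 3 from a fair die and averages xbar² exactly. The die, the sample size, and the variable names are illustrative assumptions; drawing with replacement from the six faces stands in for independent, identically distributed observations.

```python
from fractions import Fraction
from itertools import product

faces = [1, 2, 3, 4, 5, 6]   # hypothetical population: a fair die
n = 3                        # sample size, chosen for illustration

mu = Fraction(sum(faces), len(faces))
sigma2 = sum((x - mu) ** 2 for x in faces) / len(faces)

# Every ordered sample of size n is equally likely when draws are i.i.d.
samples = list(product(faces, repeat=n))
e_xbar2 = sum(Fraction(sum(s), n) ** 2 for s in samples) / len(samples)

print(e_xbar2, mu ** 2 + sigma2 / n)
print(e_xbar2 == mu ** 2 + sigma2 / n)
```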

Substituting these values in above, we have:
`E[ s² ] = (1/(n-1)) • [ Σ (µ² + σ²) - n•(µ² + σ²/n) ]`

We can do this because E[xi²] is the same for every xi (we assumed earlier that the x's are independent and identically distributed, so E[xi²] doesn't depend on the i part). Now nothing inside the summation depends on i anymore; we are just adding a constant n times, so we can multiply by n:
`E[ s² ] = (1/(n-1)) • [n•(µ² + σ²) - n•(µ² + σ²/n) ]`

Then distribute the n multiplication over the parentheses:
`E[ s² ] = (1/(n-1)) • [n•µ² + n•σ² - n•µ² - n•σ²/n ]`

And simplify:
`E[ s² ] = (1/(n-1)) • [ n•σ² - σ² ]`
`E[ s² ] = (1/(n-1)) • [ (n-1)•σ² ]`
`E[ s² ] = σ²`

Voila! We are done, and we have proven that `E[ s² ] = σ²`. If, going back to the beginning, we had divided by n in the denominator instead of by n-1, that would have carried through to the end, and the result would have been:

`E[ s² ] = [(n-1)/n] • σ²`

This is not exactly equal to σ²; it is slightly smaller, because the ratio (n-1)/n is less than 1.
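
The two results can also be checked without any algebra. This is a minimal sketch, assuming a fair die as the population and a sample size of 4; it averages the variance estimate over every possible sample, once with divisor n-1 and once with divisor n. The function name `expected_variance_estimate` is made up for this example.

```python
from fractions import Fraction
from itertools import product


def expected_variance_estimate(population, n, divisor):
    """Exact average of Σ(xi - xbar)² / divisor over all equally likely
    i.i.d. samples of size n (drawn with replacement)."""
    total = Fraction(0)
    count = 0
    for sample in product(population, repeat=n):
        xbar = Fraction(sum(sample), n)
        total += sum((x - xbar) ** 2 for x in sample) / divisor
        count += 1
    return total / count


population = [1, 2, 3, 4, 5, 6]   # hypothetical population: a fair die
n = 4
mu = Fraction(sum(population), len(population))
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)

unbiased = expected_variance_estimate(population, n, n - 1)
biased = expected_variance_estimate(population, n, n)

print(sigma2, unbiased, biased)
```

Here σ² = 35/12. Dividing by n-1 averages to exactly 35/12, while dividing by n averages to 35/16, which is (n-1)/n = 3/4 of σ².
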
• I get these various "intuitions" about why n-1 is better, but I have two questions:
1.) Who figured this out in the first place, and how did they do so? Presumably they didn't run computer simulations; did they tediously do a lot of simulations by hand?
2.) Is there a mathematical proof that n-1 is better, or is it all based on intuition and empirically experimenting with different ways of getting the least biased sample variance?
• I kind of feel some of these videos for this section are not ordered correctly.
• I think this process of n-1 'unbiases' the estimation of variation of data because of the nature of collecting data. In real life, there is going to be much more variation of things than you will ever see in a sample group. Take height, for instance.

There are so many possible heights, from very short to amazingly tall. However, if we wanted to find out what the average height is, we would be unlikely to meet the extremely tall and extremely short people, because they are what people call 'statistical outliers'. Because these people are rare, you probably won't be able to include them in your list of people's heights, so naturally you're going to meet a less diverse group of people. This will mean your data has less variation than real life does. Therefore your data underestimates the variation of real life, because some types of people are more likely to be found than others.
• A reasonable thought, but it's not really the reason. The reason dividing by `n-1` corrects the bias is because we are using the sample mean, instead of the population mean, to calculate the variance. Since the sample mean is based on the data, it will get drawn toward the center of mass for the data. In other words, using the sample mean to calculate the variance is too specific to the dataset. If we were able to use the population mean instead of the sample mean, there would be no bias.
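
The claim that the bias comes from using the sample mean can be checked directly. In this sketch (a fair die as the population and n = 3, both assumptions for illustration), the divisor is n in both calculations and only the center changes: the population mean µ in one, the sample mean xbar in the other.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5, 6]   # hypothetical population: a fair die
n = 3
mu = Fraction(sum(population), len(population))
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)

samples = list(product(population, repeat=n))

# Divide by n in both cases; only the center of the deviations changes.
around_mu = sum(
    sum((x - mu) ** 2 for x in s) / n for s in samples
) / len(samples)
around_xbar = sum(
    sum((x - Fraction(sum(s), n)) ** 2 for x in s) / n for s in samples
) / len(samples)

print(around_mu == sigma2)                          # no bias around mu
print(around_xbar == Fraction(n - 1, n) * sigma2)   # biased around xbar
```
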
• Initially I found this confusing, but here's a restatement:

Makes perfect sense (once I read Justin Help’s explanation, which is available at https://www.khanacademy.org/computer-programming/unbiased-variance-visualization/1167453164). Basically, when the sample mean is the same as the population mean (center of the charts), the sample variance is also the same as the “sample variance calculated using the true population mean” (this is a weird statistic, but it allows you to see why n-1 works). However, when the sample variance is calculated using the sample mean (the normal way) and the sample mean is far from the population mean, it differs increasingly (by a negative amount, as variance is being underestimated) from the “sample variance calculated using the true population mean” (the weird statistic which Sal refers to as “pseudo sample variance”). Thus, the charts on the left show sample variance against true population variance, and the charts on the right show sample variance against “pseudo sample variance”, a statistic that is a hybrid of true population variance and sample variance.
• When he is subtracting the population mean, shouldn't the denominator be N instead of n, since this part of the formula is about the population and not just the sample?
• He says he is actually subtracting from the sample variance a "pseudo-sample variance," using the true mean but changing nothing else. Therefore, the denominator is n for both expressions for the red graph (and would be n-1 for the blue's and n-2 for the green's).

If he was finding the difference between the sample and population variances, you would be correct. But for this simulation, he uses the "pseudo-sample variance" to best demonstrate the unbiased estimate.
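
For the case where both expressions divide by n, the gap between the two can be written down exactly. The short derivation below is a sketch in standard notation (x̄ for xbar); it assumes the "pseudo-sample variance" is (1/n)Σ(xi - µ)², as described in this reply, and uses the fact that Σ(xi - x̄) = 0.

```latex
\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2
 &= \frac{1}{n}\sum_{i=1}^{n}\bigl[(x_i-\bar{x})+(\bar{x}-\mu)\bigr]^2 \\
 &= \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2
   + \frac{2(\bar{x}-\mu)}{n}\sum_{i=1}^{n}(x_i-\bar{x})
   + (\bar{x}-\mu)^2 \\
 &= \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2 + (\bar{x}-\mu)^2
\end{aligned}
```

So the divide-by-n sample variance falls short of the pseudo-sample variance by exactly (x̄ - µ)², which is zero when the sample mean equals the true mean and grows as the sample mean moves away from it.
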
• So, what is the significance of the number 1 with respect to unbiased sample variance? What I mean is, it just seems a bit odd that it would conveniently converge to a whole number such as 1. Is there a simple answer or is it a mystical property akin to Euler's identity?
• If you look at the sample variance for lots and lots of random samples, and take the average of all those different variances, that average will tend to agree with the true population variance. It will tend to agree more as you consider more samples. Formally, we say that the "expectation value" of the sample variance is equal to the population variance. This Wikipedia article shows a proof of why this is true: https://en.wikipedia.org/wiki/Variance#Sample_variance. That proof also shows where the factor of n-1 comes from.
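
The "average over lots of samples" idea can be tried out with Python's standard library. This is only a rough sketch under assumed settings (standard normal data, samples of size 5, a fixed seed, 10,000 samples); `statistics.variance` divides by n-1 and `statistics.pvariance` divides by n.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

SIGMA2 = 1.0          # true variance of the assumed standard normal population
n = 5                 # size of each sample
num_samples = 10_000  # how many samples to average over

s2_values = []
biased_values = []
for _ in range(num_samples):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    s2_values.append(statistics.variance(sample))        # divides by n - 1
    biased_values.append(statistics.pvariance(sample))   # divides by n

avg_s2 = statistics.fmean(s2_values)
avg_biased = statistics.fmean(biased_values)
print(avg_s2, avg_biased)
```

The first average should land close to 1, and the second close to (n-1)/n = 0.8; neither will match exactly, because this is a random simulation rather than a proof.
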
• I still don't understand how the computer program calculates the "pseudo-sample variance" if we don't know mu's value. Can someone please explain?
• In real life we generally don't know the value of μ. However, in a simulation, we are making up the data, and we do in fact know μ. What we're doing is:

1. Set μ and create some data from a distribution with that mean.
2. Pretend that we don't know μ, and calculate the mean and standard deviation.
3. Remember that we know μ, and perform the calculations shown in the video.
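
The three steps above can be sketched in a few lines. This is not the code behind the linked simulation; the normal distribution, the values of `MU` and `SIGMA`, the sample size, and the seed are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Step 1: we choose the true mean (and spread) ourselves and generate data.
MU, SIGMA = 10.0, 2.0
n = 5
sample = [random.gauss(MU, SIGMA) for _ in range(n)]

# Step 2: pretend MU is unknown and use only the data.
xbar = statistics.fmean(sample)
biased_var = sum((x - xbar) ** 2 for x in sample) / n
unbiased_var = sum((x - xbar) ** 2 for x in sample) / (n - 1)

# Step 3: remember MU and compute the "pseudo-sample variance".
pseudo_var = sum((x - MU) ** 2 for x in sample) / n

# The divide-by-n sample variance never exceeds the pseudo-sample variance.
print(biased_var - pseudo_var)
```

Step 3 is only possible because the simulation set `MU` itself; with real data that line could not be written.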