# ANOVA 2: Calculating SSW and SSB (total sum of squares within and between)

Analysis of Variance 2 - Calculating SSW and SSB (Total Sum of Squares Within and Between). Created by Sal Khan.

## Want to join the conversation?

• Thanks Sal for the great video.

Would you please help in elaborating the degrees of freedom part? I find it really interesting but I think that unfortunately I did not grasp their meaning in its entirety.

- "In the future we might do a more detailed discussion of what degrees of freedom mean and how to ..."

Thank you

• Well the way I explained it to myself:

If we have a defined population we can easily calculate the mean.

Take 5, 4, 3, 2, 5: the mean is (5+4+3+2+5)/5 = 3.8.
We can always find the one "missing" population member if we know that the mean is 3.8, that there are 5 members, and what the other 4 members are.

(5+4+3+2+x)/5 = 3.8 (multiply both sides by 5)
5+4+3+2+x = 19
x = 19-5-4-3-2
x = 5 (the fifth member of our population was indeed 5)

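If we tried the same thing with two missing members, it would not work out the same way.

(5+4+3+x+y)/5 = 3.8 (multiply both sides by 5)
5+4+3+x+y = 19
x+y = 7

One equation cannot pin down two unknowns: (2, 5), (3, 4) and (1, 6) all fit. So knowing the mean removes exactly one degree of freedom, not two.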
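
Here is the same idea as a minimal Python sketch. The helper name `recover_missing` is made up for illustration. With the mean and all but one value known, the last value is forced; with two values missing, only their sum is fixed.

```python
def recover_missing(known_values, mean, n):
    """Return the single missing value, given the mean of all n values."""
    if len(known_values) != n - 1:
        raise ValueError("exactly one value must be missing")
    # mean * n is the total of all n values; subtract what we already know.
    return mean * n - sum(known_values)

# One missing member: it is fully determined by the mean.
x = recover_missing([5, 4, 3, 2], mean=3.8, n=5)
print(round(x, 10))  # 5.0

# Two missing members: only their sum is determined, not the values.
required_sum = 3.8 * 5 - sum([5, 4, 3])
print(round(required_sum, 10))  # 7.0
```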
• These videos on ANOVA follow the videos on regression. I was wondering if someone can post a statement that clarifies the distinction between the respective goals of regression and ANOVA analysis, i.e., you conduct regression analysis when your objective is __, but you need ANOVA when your goal is ____. thanks in advance.
• Regression: You have two quantitative (numerical) variables, and you want to know the relationship and/or predict values of one of them. For example, you know a person's height, you want to predict his/her weight.

ANOVA: You are interested in a numerical variable (the response), and you want to see if there are differences in this variable across several groups. For example, you want to see if gas mileage differs between sedans, minivans, and SUVs.
• For the SSB, I thought it was the mean of each group minus the grand mean, all squared, so it would be (2-4)**2 + (4-4)**2 + (6-4)**2. Why is each term, such as (2-4)**2, repeated three times?
• The repetition is easier to understand if you consider ANOVA with groups of different sample sizes. Counting the deviation between the group mean and the grand mean once for every observation in the group weights that deviation by the group's sample size. This matters because larger samples give better estimates, with less variance around the true population value (mean, proportion, or variance). Taking the number of observations into account reflects how much we rely on each group's data when estimating the variation in the whole data set. So each group's contribution to SSB must include its sample size.
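
A small Python sketch of that weighting follows. The nine data values are an assumption, chosen only so that the group means are 2, 4 and 6 and the grand mean is 4, as in the discussion; the function names are also made up. It computes SSB two ways: one term per observation, and one term per group multiplied by the group size.

```python
def ssb_pointwise(groups):
    """One squared deviation (group mean - grand mean) per observation."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    return sum(
        (sum(g) / len(g) - grand_mean) ** 2
        for g in groups
        for _ in g  # repeat the group's term once per data point
    )

def ssb_weighted(groups):
    """Same quantity, written as group size times squared deviation."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    return sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)

# Assumed example data: three groups of three, means 2, 4, 6.
groups = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]
print(ssb_pointwise(groups))  # 24.0
print(ssb_weighted(groups))   # 24.0
```

The "repeated three times" in the question is just the `len(g)` factor written out term by term.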
• When calculating SSB and SSW, do the groups have to have the same degrees of freedom? In your example all the groups have 3 data points and 2 degrees of freedom. Is it OK to have group A with 6 data points, group B with 3, and group C with 5?
• Yes, it is possible to have groups with different sample sizes. This would be called an "unbalanced" design (versus a "balanced" design when all groups have the same sample size).

For the unbalanced design, the notation gets a little bit more complicated, but it all still works out. Some things that Sal mentions won't be strictly as they sound. For example, the "mean of means" that he references cannot literally be calculated as the mean of the group means. We would need to calculate a weighted mean of the group means, or directly average all the data points from all the groups.
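
To illustrate that last point, here is a minimal Python sketch with invented data for groups of size 6, 3 and 5 (the sizes from the question; the values themselves are an assumption). It compares the plain mean of the group means with the size-weighted mean and with the direct average of all data points.

```python
import math
from statistics import fmean

# Hypothetical unbalanced data: sizes 6, 3 and 5.
groups = {
    "A": [1, 2, 3, 4, 5, 6],
    "B": [4, 5, 6],
    "C": [6, 7, 8, 9, 10],
}

group_means = {name: fmean(values) for name, values in groups.items()}

# Naive "mean of means": every group counts equally, regardless of size.
mean_of_means = fmean(list(group_means.values()))

# Weighted mean of the group means: each group counts by its sample size.
total_n = sum(len(values) for values in groups.values())
weighted_mean = sum(len(groups[k]) * group_means[k] for k in groups) / total_n

# Direct average of every data point from every group.
grand_mean = fmean([x for values in groups.values() for x in values])

print(mean_of_means)                            # 5.5
print(math.isclose(weighted_mean, grand_mean))  # True
print(math.isclose(mean_of_means, grand_mean))  # False
```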
• Whenever I hear the term "n-1 degrees of freedom" when there are really "n" independent samples, it sounds strange to me. In the video, SK talks about "n-1 degrees of freedom" because we "know" the mean. But in calculating the mean itself, "n" independent samples were used. If any of those "n" random variables were different, the mean itself would be different. Any one of those "n" variables can affect all of the values on SK's blackboard above.**

Now, that being said, when SK talks about "n-1 degrees of freedom", he appears to be talking about the ERROR only FROM the mean. As in, we calculated the variance (a measure of error), and in calculating that "error", we used the mean as an intermediate value. But, GIVEN that value, there are only "n-1" degrees of freedom. However, how can we lose sight of the fact that calculating the mean ITSELF required "n" independent variables?

To summarize, there are really "n" independent variables / degrees of freedom in the WHOLE calculation. Why do we restrict the degrees of freedom in "knowing" the mean, when the mean itself was calculated with a set of values with "n" degrees of freedom - and NOT "n-1" degrees of freedom?

** I challenge anyone to tell me how they can keep all the final values (e.g. SST = 30) by holding only "n-1" of the values the same in the 3x3 matrix without any other constraints (i.e. the mean). If I take the last variable (which you didn't hold) and change it, the mean itself and all the other following stats would change.

Thanks.
• I think the (n-1)d.o.f. comes from this:

Hope you can follow my line of thought here:
- Say the mean HAS to be, e.g., 25.
- We have 3 variables (X1, X2 and X3), and we're allowed to change them as we wish as long as the mean remains 25.
- Say we change X1 and X2 as we wish. We can move them around freely, but here's the caveat: for the mean to remain at 25, we have to use X3 to compensate for the movement in X1 and X2. With this in mind, we can't really say that X3 is a free variable, since it's tied to the mean. Thus we get d.o.f. = n-1.

Note also that this closely resembles "solving for x"-type equations, so it works for constraints other than the mean as well.

At least that's how I see it. I would love to be corrected if I'm wrong, since then my understanding needs to be updated :)
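
A tiny Python sketch of this argument, using made-up values for X1 and X2: once the mean is pinned at 25, X3 is no longer a choice, and the deviations from the mean are forced to sum to zero. That is the sense in which only n-1 of the deviations used in a variance are free.

```python
target_mean = 25
n = 3

# Choose X1 and X2 freely (these particular values are arbitrary).
x1, x2 = 20, 22

# X3 is then forced by the constraint (x1 + x2 + x3) / n == target_mean.
x3 = n * target_mean - (x1 + x2)
values = [x1, x2, x3]
print(values)  # [20, 22, 33]

# Deviations from the mean always sum to zero, so the last deviation
# is determined by the first n - 1.
deviations = [v - target_mean for v in values]
print(deviations)       # [-5, -3, 8]
print(sum(deviations))  # 0
print(deviations[-1] == -sum(deviations[:-1]))  # True
```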
• Is the Sum of Squares Within also known as the Sum of Squares Estimated?
Is the Sum of Squares Between also known as the Sum of Squares for Residuals?
• Sum of Squares Within = Sum of Squared Residuals
• We are calculating SSB as defined in the video, but it is not intuitive to take 3*(2-4)^2 + 3*(4-4)^2 + 3*(6-4)^2 (from the example in the video) as the sum of squares between groups.

(2-4)^2 + (4-4)^2 + (6-4)^2 is more intuitive but incorrect. Why?
When measuring variation between groups, why do we go down to the element level with 3*(2-4)^2 instead of using (2-4)^2 once per group?
The group mean represents the group. So if we want the variation among groups, we could just take the sum of squares of (group_mean - mean_of_group_means).

Alternatively, if we calculate (2-4)^2+(4-6)^2+(2-6)^2, that is (x1bar-x2bar)^2+(x2bar-x3bar)^2+(x3bar-x1bar)^2, it also measures variation between the groups and is more intuitive. I am not sure about this method, but in this case it worked: it gives 24, the same as calculated in the video.
I know that this method does not measure variation from the centre of the means; it just measures the distances between the means of the different groups.
• What if one group had a mean that was a long way from the grand mean but only had a few observations, while another group's mean was very close to the grand mean and had many observations? Groups with more observations should have higher weight. That is one intuitive reason why the "point-wise" calculations are the way they are.

There are a number of ways to understand this through the mathematics, but that will get into some more complicated formulas.
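
As a quick check of the pairwise idea raised in the question, here is a hedged Python sketch (function names are made up). For k group means, the sum of squared pairwise differences equals k times the sum of squared deviations from their mean. In a balanced design SSB is n times that same sum, so the two agree only when the group size n happens to equal the number of groups k, as it does in the 3-by-3 example.

```python
from itertools import combinations

def pairwise_sum(means):
    """Sum of squared differences over every pair of group means."""
    return sum((a - b) ** 2 for a, b in combinations(means, 2))

def ssb_balanced(means, n):
    """SSB for a balanced design with n observations in every group."""
    grand_mean = sum(means) / len(means)
    return n * sum((m - grand_mean) ** 2 for m in means)

means = [2, 4, 6]
print(pairwise_sum(means))     # 24
print(ssb_balanced(means, 3))  # 24.0  (n == k == 3, so they match)
print(ssb_balanced(means, 5))  # 40.0  (n != k, so they no longer match)
```

The pairwise sum ignores how many observations sit behind each mean, which is why it cannot stand in for SSB in general.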
• When do we use the chi-square statistical analysis?
• How would we calculate SSwithin if only variances are given for each group?
• SSwithin can be calculated as: (n-1)*[ s1^2 + s2^2 + ... ]

To get this, look at what we do for SSwithin. For every point, we subtract the group mean from each value, square it, and add them all up. For notation, I'm going to use Mi as the group means.

SUM( SUM( (Xij - Mi)^2 ) )

The outside sum is just going across the groups, so let's look at each group separately: SUM( (Xij - Mi)^2 )

This looks very close to being a variance, doesn't it? In fact, all we need to do is divide by (n-1) and we'd have the variance for that group! But we already have the variance, and we want the sum. So we can just "undo" the division by (n-1) that the variances have. For the ith group, we'd take (n-1)*Si^2 to get this sum. From there, we just have to do the same thing for every group and add up the results.

If we have equal sample sizes for each group, they are all "n", so the denominator in the variance is (n-1) for each, meaning we can factor that out and just multiply (n-1) by the sum of all the variances. Hence, the formula at the top.
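
A minimal Python sketch of this, under the assumption that the given variances are sample variances (n-1 denominator). The data and function names are invented for illustration. It also uses each group's own size, so the same code covers unequal group sizes.

```python
import math
from statistics import variance  # sample variance, divides by (n - 1)

def ssw_direct(groups):
    """Sum of squared deviations of each point from its own group mean."""
    total = 0.0
    for g in groups:
        group_mean = sum(g) / len(g)
        total += sum((x - group_mean) ** 2 for x in g)
    return total

def ssw_from_variances(sizes, variances):
    """Undo the (n_i - 1) division in each variance, then add up."""
    return sum((n - 1) * s2 for n, s2 in zip(sizes, variances))

# Assumed example data: three groups of three.
groups = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]
sizes = [len(g) for g in groups]
variances = [variance(g) for g in groups]

direct = ssw_direct(groups)
from_vars = ssw_from_variances(sizes, variances)
print(math.isclose(direct, from_vars))  # True
```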
• Hi. Could this be expanded into three dimensions, i.e. done with a data cube?

When I watched the video I thought of some hourly data I've worked on before, which has a variation within the week, between different weeks and between different years. It would be interesting to have a good measure of how much of the total variation each part contributes.

I did a quick test in Excel to see if I could figure it out, but I couldn't quite get the sums to work out so that the sum of the parts was equal to the total. Does anyone know if it's possible to do this in a similar manner to what the video describes?
• This video is about Analysis of Variance (ANOVA), or more specifically, One-Factor ANOVA. In this setting, we are thinking about two variables:

1. Something that we are measuring, like height, weight, MPG of an automobile, etc. These result in the 9 numbers that Sal was working with in this video.
2. A "factor", which is a variable comprising two or more groups. For example, say we want to compare the average MPG of an automobile, and we are looking at several groups: Sedans, Minivans, and SUVs/Trucks. This would be the three groups / columns that Sal had, group 1, 2, and 3 (or Green, Purple, Pink, if you prefer).

It is absolutely possible to add a second factor (i.e. another set of groups) to the analysis. For instance, say that in addition to differences in MPG between car types, we think there may also be differences in MPG between Asian, American, and European cars. We could build that into the model as well, and it would be called "two-factor ANOVA". We can extend this as much as we need, though the more factors we add, the more complicated it becomes to actually understand the results.

That being said, the situation you briefly describe could not be addressed simply by adding several factors. Since your factors are overlapping time periods (within week, between weeks, between years), the groups are not what we call independent. One problem is this: a collection of weeks belongs to a certain year. Say year A was during a recession, and year B was not. Then year B will look better, but so will all of the weeks associated with year B. The weeks are what we call "nested" within the years. Different weeks "belong" to specific years.