Standard deviation of residuals or root mean square deviation (RMSD)
The standard deviation of the residuals is a measure of how well a regression line fits the data. It is also known as the root mean square deviation (RMSD) or root mean square error.
Want to join the conversation?
- At the end of the video, Sal mentions about the significance of RMSD; we can treat it like "average prediction error between [the] points."
I am confused. We call it Standard Deviation of residuals. The name sounds like it's going to tell us about how spread the residuals are. However, the formula quite looks like root square mean of residuals which tells us about the average prediction error between the points.(5 votes)
- In the video on the same topic of the Statistics and Probability course (https://www.khanacademy.org/math/statistics-probability/describing-relationships-quantitative-data/assessing-the-fit-in-least-squares-regression/v/standard-deviation-of-residuals-or-root-mean-square-error-rmsd), Sal asks us to divide by n-1 (instead of n-2). Could anyone explain what is the difference between the two examples? I suppose in both cases, we are talking about a sample (statistics) instead of the population (parameter) and hence n should not be used? Many thanks.(4 votes)
- Did Sal square the sum of the residuals?(3 votes)
- First, he squared each residual individually and then he summed them up.(1 vote)
- For the answer Sal got at 6:09, shouldn't it be (sqrt(3)/sqrt(2)) instead of (sqrt(3)/2)?(1 vote)
- No. Sal originally had the equation sqrt(1.5/2). You can then multiply the part inside the parentheses by 1 represented as 2/2, giving us sqrt((1.5*2)/(2*2)), which can be simplified to sqrt(3/4). We can then separate this into sqrt(3)/sqrt(4), where we can finally simplify it to sqrt(3)/2.(2 votes)
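The simplification described in the reply above can be written out step by step. This is only a restatement of that arithmetic in LaTeX notation, with the decimal value added for reference:

```latex
\sqrt{\frac{1.5}{2}}
  = \sqrt{\frac{1.5 \cdot 2}{2 \cdot 2}}
  = \sqrt{\frac{3}{4}}
  = \frac{\sqrt{3}}{\sqrt{4}}
  = \frac{\sqrt{3}}{2} \approx 0.866
```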
- How do I change my avatar?(1 vote)
- You go to your assignments and click on your character and sat all avatar(1 vote)
- How can you solve this on a TI84 Plus?(1 vote)
- I understand the concept and how to do the problems, but what is the point of squaring the residual in general? To go off of that idea, I also don't understand why you can't just do the "regression line of best fit" instead of calling it the "least squares" regression.
Basically, why is anything being squared? Thanks!(0 votes)
- We square everything to essentially get rid of negative values. Otherwise we wouldn't get a "TRUE" idea of how far spread our data is. Think about it, if we have negative differences we will end up with a smaller difference from the regression line. Reducing the difference from our line of best fit.(3 votes)
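The point about negative values in the reply above can be checked with a minimal sketch. It assumes the four data points and the line y-hat = 2.5x - 2 from the video; the names `points`, `predict`, and `residuals` are illustrative only. For a least-squares line with an intercept, the raw residuals sum to zero, so an unsquared sum would say nothing about fit:

```python
# The four (hours studied, score) points and the line y-hat = 2.5x - 2 from the video.
points = [(1, 1), (2, 2), (2, 3), (3, 6)]

def predict(x):
    """Predicted score for x hours of study (line taken from the video)."""
    return 2.5 * x - 2

# Residual = actual score minus predicted score, one per data point.
residuals = [y - predict(x) for x, y in points]

raw_sum = sum(residuals)                      # positive and negative residuals cancel
squared_sum = sum(r ** 2 for r in residuals)  # every squared term is non-negative

print(residuals)    # [0.5, -1.0, 0.0, 0.5]
print(raw_sum)      # 0.0
print(squared_sum)  # 1.5
```

The raw sum is zero even though three of the four points miss the line, while the squared sum of 1.5 reflects those misses.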
- Why in the other stats course on this site is RMSD calculated using n-1 in the denominator but here it is calculated using n-2?(1 vote)
- Would we be provided this formula on the AP test, or would we not even need it?(1 vote)
- At 1:45, Sal says "divide by two" but he means "subtract by two".(0 votes)
Video transcript
- [Tutor] So we are interested in studying the relationship between the amount that folks study for a test and their score on a test, where the score is between zero and six. What we're going to do is look at the people who took the test and plot, for each person, the amount that they studied and their score. So for example, this data point is someone who studied an hour and they got a one on the test. Then we're going to fit a regression line, and this blue regression line is the actual regression line for these four data points, and here is the equation for that regression line.

Now there's a couple of things to keep in mind. Normally when you're doing this type of analysis, you would do it with far more than four data points. The reason why I kept this to four is because we are actually going to calculate how good a fit this regression line is by hand, and typically you would not do it by hand, we have computers for that.

Now the way that we're going to measure how good a fit this regression line is to the data has several names. One name is the standard deviation of the residuals, another name is the root mean square deviation, sometimes abbreviated RMSD, and sometimes it's called root mean square error. What we're going to do is, for every point, calculate the residual, then square it, and then add up those squared residuals, so we're gonna take the sum of the residuals squared. Then we're going to divide that by the number of data points we have minus two. We can talk in future videos or a more advanced statistics class about why you divide by [n minus] two, but it's related to the idea that what we're calculating here is a statistic and we're trying to estimate a true parameter as best as possible, and n minus two actually does the trick for us. But to calculate the root mean square deviation, we would then take a square root of this. Some of you might recognize strong parallels between this and how we calculated sample standard deviation early in our statistics career, and I encourage you to think about it.

But let's actually calculate it by hand, as I mentioned earlier in this video, to see how things actually play out. So to do that, I'm going to give ourselves a little table here. Let's say that is our x value in that column, let's make this our y value, let's make this y hat, which is going to be equal to 2.5x minus two, and then let's make this the residual squared, which is going to be our y value minus our y hat value, our actual minus our estimate for that given x, squared. Then we're going to sum them all up, divide by n minus two, and take the square root.

So first let's do this data point, the point (1, 1). Now what is the estimate from our regression line? Well, when x is equal to one, it's gonna be 2.5 times one minus two, which is equal to 0.5, and so our residual squared is going to be one minus 0.5, squared, which is 0.5 squared, which is 0.25.

Alright, let's do the next data point. We have this one right over here, it is (2, 2). Our estimate from the regression line when x equals two is going to be 2.5 times our x value, times two, minus two, which is equal to three, and so our residual squared is going to be two minus three, squared, which is negative one squared, which is equal to one.

Then we can go to this point, the point (2, 3). Our estimate from our regression line is going to be 2.5 times our x value, times two, minus two, which is equal to three, and so our residual here is going to be zero, and you can see that that point sits on the regression line. So it's going to be three minus three, squared, which is equal to zero.

And then last but not least, we have this point right over here. When x is three, this person studied three hours and they got a six on the test, so y is equal to six. Our estimate from the regression line, what you would have expected to get based on the regression line, is 2.5 times our x value, times three, minus two, which is equal to 5.5, and so our residual squared is six minus 5.5, squared, so it's 0.5 squared, which is 0.25.

So now the next step, let me take the sum of all of these squared residuals. The sum of the residuals squared is equal to, if I just sum all of this up, 1.5. Then I divide that by n minus two. I have four data points, so I'm gonna divide by four minus two, so I'm gonna divide by two, and then I'm gonna take the square root of that. 1.5 over two is the same thing as three-fourths, so it's the square root of three-fourths, or the square root of three, over two. You could use a calculator to figure out what that is as a decimal, but this gives us a sense of how good a fit this regression line is: the closer this is to zero, the better the fit of the regression line, and the further away from zero, the worse the fit.

And what would be the units for the root mean square deviation? Well, it would be in terms of whatever your units are for your y axis. In this case, it would be the score on the test, and that's one of the other values of this calculation, of taking the square root of the sum of the squares of the residuals divided by n minus two. So big picture, this square root of three, over two, can be viewed as the approximate size of a typical or average prediction error between these points and what the regression line would have predicted, or you could view it as the approximate size of a typical or average residual.
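The hand calculation in the transcript can be reproduced with a short sketch. It assumes the four data points and the line y-hat = 2.5x - 2 given in the video, and the n minus two divisor used there; the names `points`, `predict`, and `residual_std` are illustrative, not from the video.

```python
import math

# The four (hours studied, test score) points from the video.
points = [(1, 1), (2, 2), (2, 3), (3, 6)]

def predict(x):
    """Regression line given in the video: y-hat = 2.5x - 2."""
    return 2.5 * x - 2

def residual_std(points, predict):
    """Square root of (sum of squared residuals) / (n - 2), as in the video."""
    n = len(points)
    sse = sum((y - predict(x)) ** 2 for x, y in points)
    return math.sqrt(sse / (n - 2))

# The "residual squared" column of the table built in the video.
squared_residuals = [(y - predict(x)) ** 2 for x, y in points]
sse = sum(squared_residuals)
s = residual_std(points, predict)

print(squared_residuals)  # [0.25, 1.0, 0.0, 0.25]
print(sse)                # 1.5
print(round(s, 3))        # 0.866
```

The result, about 0.866, is the decimal value of the square root of three, over two, and it is in the same units as the test score.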