
# Calculating R-squared

Calculating R-Squared to see how well a regression line fits data. Created by Sal Khan.

## Want to join the conversation?

- Great explanation! I understand everything, but I have a question: what is the practical value of looking at a graph with the R-squared value?

For example, you have a business and your statistician gives you two reports about two unrelated projects / ideas with a large R-squared value on one report, and a small R-squared value on the second report.

You, the business owner and the decision maker, will ask, "So what? What does this mean? What value is this graph to the decisions I will make?"

There has to be an economic purpose behind the R-squared values. What is the answer to "so what?"(37 votes)
- Knowing the extent to which the model "fits" reality (represented by the points we have actually observed, the plotted points) will help the business owner in your example assess the likelihood that a value *predicted* by the model (all the other points on the line) is actually true.

For example:

The business owner distributes grains which he buys from farmers and sells to a breakfast cereal company. He suspects that the amount of rainfall has something to do with the price the farmer charges him for the grains (rain affects the crops and therefore the supply). He asks the statistician to build a model and report its R^2 along with it. The statistician has access to 10 years of rainfall and grain price data, so he plots the prices against the rainfall and builds a model. The R^2 for this model is 88%.

The next year, the business owner measures the rainfall and uses the model to predict the price of grain and gets a rather accurate result. He's quite pleased with the statistician's work so he asks him to build a model relating the number of births of ten years past to the number of cereal boxes sold that year (he assumes more kids means more parents buying cereals). The statistician builds a model and comes back with an R^2 of 42%.

The business owner decides it is probably best not to try to predict the demand for his cereals from the births 10 years past. He ignores the model.

Finally:

High R^2 = good model, probably profitable

Low R^2 = bad model, probably dangerous

Hope this helps.(108 votes)

- How is the variation in x the same as the regression line, when calculating R^2?(12 votes)
- I think you're just wondering why he's using the term "variation in X" at all. It helps to think of it as though the x-axis is time and the y-axis shows results taken at different times. Each result at each time has some error associated with it relative to the line you're measuring against (the vertical segments drawn from each point to the line are also known as residuals).

The regression line changes where you draw your residuals to, so a y value of 10 might have lots of error at one value of x (at one time), but if you were to get that same value of y=10 at a different value of x, it would have a different amount of error (due to the slope of the regression line).

So the regression line changes the requirement for "error" as X varies (aka the variation of X).

I hope that helped a little. I'm no Sal.(13 votes)

- My statistics textbook suggests that the total error would be the sum of the explained and the unexplained error, which in this case would be 2.74 + 22.75. The book then calculates r squared as the explained error divided by the total error, which in this case would be 22.75/(2.74+22.75) = 0.89. Are the two methods equivalent (i.e. this method and the one described in the lecture)?(3 votes)
- You're missing something that the video didn't fully explain. There are three ways to categorize the error here:

1. Total error

2. Explained error

3. Unexplained error

R^2 is then (Explained Error) / (Total Error) = 1 - (Unexplained Error) / (Total Error)

The total error is the sum of (Y-Ybar)^2, so in the video this is the 22.75.

The unexplained error is the sum of (Y-Y*)^2, so in the video this value is 2.74.

He never actually calculated the explained error, but it would be the difference, 22.75 - 2.74 = 20.01. You could get this by taking the sum of (Y*-Ybar)^2. He hints at this when he says that the 12% is the "percent of error NOT explained by variation in X," and subtracts that value from 1 to get the percent of error that IS explained by variation in X. So, that hints that the Unexplained and Explained errors must add to the Total error.

Anyway, after this, you appear to have followed the formulas in your statistics textbook correctly. Using the new values, we'd have:

R^2 = (Explained Error / Total Error) = 20.01/22.75 = 0.879

or

R^2 = 1 - (Unexplained Error / Total Error) = 1 - 2.74/22.75 = 0.879(15 votes)
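The equivalence of the two formulas can be checked with a short Python sketch, using the (rounded) error sums from the video:

```python
# Sketch: the two equivalent ways of computing R^2 described above,
# using the (rounded) error sums from the video.
total_error = 22.75        # sum of (Y - Ybar)^2
unexplained_error = 2.74   # sum of (Y - Y*)^2, squared error around the line
explained_error = total_error - unexplained_error  # 20.01, sum of (Y* - Ybar)^2

r_squared_a = explained_error / total_error        # Explained / Total
r_squared_b = 1 - unexplained_error / total_error  # 1 - Unexplained / Total

print(round(r_squared_a, 2), round(r_squared_b, 2))  # 0.88 0.88
```

Both forms give the same number because the explained and unexplained errors add up to the total error.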

- At 7:14, Sal's explanation seems to imply that the standard error from the line can be seen as a fraction of the standard error from the mean. The SE of the line is the 'unexplained variation' and the SE of the mean is the 'explained variation'. But as far as I can tell, the SE of the line is measuring a completely different type of variation, right? How does the variation from the regression line in any way contribute or relate to the variation from the mean?(5 votes)
- Be sure to watch the next video
**all the way to the end**. Sal pulls it all together, and I think his explanation will answer your question.(3 votes)

- Why do we square the error on the line?(3 votes)
- It makes positive and negative differences both positive, so they don't cancel each other out in the sum. Also, it makes large deviations from the line have a disproportionately large effect on the total error. There are other reasons why this is the convention, but I don't know them well enough to comment more.(6 votes)
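The cancellation problem is easy to see numerically; here is a tiny illustrative sketch (the residual values are made up for the example):

```python
# Sketch: why raw errors can't just be summed. Two residuals of equal
# size but opposite sign cancel out, while their squares do not.
residuals = [2.0, -2.0]  # hypothetical vertical distances to a line

raw_total = sum(residuals)                      # cancels to 0 -- looks perfect!
squared_total = sum(r ** 2 for r in residuals)  # 8 -- reveals the real error

print(raw_total, squared_total)  # 0.0 8.0
```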

- How does the squared error from the mean explain the "total" variation? I understand the squared error of the line, but do not understand the squared error of the mean.(4 votes)
- Consider fitting the model y=b instead of y=mx+b, basically limiting ourselves to a horizontal line. The best-fitting horizontal line would sit at the mean of all the y values, because that choice minimizes the sum of squared vertical distances to the points. That's why the squared error from y_mean serves as the denominator in R-squared. Adding a slope can only match or improve that fit, and R-squared is a measure of how much better.(2 votes)
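That claim (the mean is the best horizontal line) can be checked with a quick brute-force sketch, using the four y values from the video's example:

```python
# Sketch: among horizontal lines y = b, the one through the mean of the
# y values minimizes the sum of squared vertical distances to the points.
ys = [-3.0, -1.0, 2.0, 3.0]  # the four y values from the video's example
mean_y = sum(ys) / len(ys)   # 0.25

def sse(b):
    """Sum of squared errors of the horizontal line y = b."""
    return sum((y - b) ** 2 for y in ys)

# Try a grid of candidate intercepts from -3.00 to 3.00; the mean wins.
candidates = [b / 100 for b in range(-300, 301)]
best = min(candidates, key=sse)
print(best, mean_y)  # 0.25 0.25
```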

- If you draw a line which is extremely far away from all the points, e.g.

y=4000+x

The Squared Error from the line will be much higher than the Squared Error from the mean.

SEl >> SEy

so SEl/SEy is greater than 1

and R^2 = 1 - SEl/SEy will be negative

Is this a real thing?

I realise that this never happens when you try to actually draw a good regression line, but

it means that "R^2" can be a negative number?(2 votes)
- Yes, that's true, but it also violates the basic premise of the model. The reason R^2 = 1 - SEl/SEy works is that we assume the total sum of squares, SSy, is the total variation of the data, so we can't get any more variability than that. When we intentionally make the regression line bad like that, one of the other sum-of-squares terms becomes larger than the total variation.(4 votes)
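The negative-R^2 scenario is easy to reproduce; here is a sketch using the four points from the video and the deliberately terrible line y = 4000 + x:

```python
# Sketch: R^2 = 1 - SE_line / SE_mean goes negative when the chosen line
# fits worse than the horizontal line through the mean.
points = [(-2, -3), (-1, -1), (1, 2), (4, 3)]  # data from the video
mean_y = sum(y for _, y in points) / len(points)

se_mean = sum((y - mean_y) ** 2 for _, y in points)      # 22.75
se_line = sum((y - (4000 + x)) ** 2 for x, y in points)  # enormous

r_squared = 1 - se_line / se_mean
print(r_squared)  # a large negative number
```

So a negative R^2 is a real possibility for an arbitrary line, but a least-squares regression line can never do worse than the mean, so its R^2 stays between 0 and 1.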

- Perhaps go into different statistic tests for ANOVA and explain what passing/failing a test means, briefly?(2 votes)
- There are other videos specifically discussing ANOVA that can also be found under the Probability + Statistics topic.(2 votes)

- As R^2 gets closer to 1, that indicates that the variation in the data points is explained by the variation in x, meaning that the regression line is an increasingly better fit for the data, as explained at 9:04. Is this correct? For research purposes and reporting on studies, what value is considered "good enough" to make the statement that x and y are correlated?(2 votes)
- That's a great question. Unfortunately, like so many great questions, the answer is "it depends" :)

In something like a physics or chemistry experiment, where you are able to tightly control all the variables and using high-quality sensors, you can get R-squared values like 0.999 or even higher. If you are expecting a value like this and get something like R-squared = 0.9, you might start rethinking your hypothesis or the design of your experiment.

However, if the data is less precise or a bit noisier - perhaps you're plotting self-reported happiness versus self-reported height - then an R-squared value of less than 0.9 might still be enough to demonstrate a correlation. Ultimately, it all comes down to how much random variation you can expect in your data.(2 votes)

- Is r^2 the same as r? The correlation coefficient intuition module leads me to believe this.

Also, what does it mean to say that "x% of the total variation of the y values is explained by the variation in x"? Are we talking about the variation of the x values in each of the ordered pairs?(1 vote)

## Video transcript

In the last video, we were able
to find the equation for the regression line for these
four data points. What I want to do in this video
is figure out the r squared for these data points. Figure out how well this
line fits the data. Or even better, figure out the
percentage-- which is really the same thing-- of the
variation of these data points, especially the variation
in y, that is due to, or that can be explained
by variation in x. And to do that, I'm actually
going to get a spreadsheet out. I've actually tried to do this
with a calculator and it's much harder. So hopefully this doesn't
confuse you too much to use a spreadsheet. And I'm going to make a couple
of columns here. And spreadsheets actually have
functions that'll do all of this automatically, but I really
want to do it so that you could do it by hand
if you had to. So I'm going to make a couple
of columns here. This is going to
be my x column. This is going to
be my y column. This is going to be the column--
I'll call this y star-- this'll be the y value
that our line predicts based on our x value. This is going to be the
error with the line. Let me call it the squared
error with the line. I don't want us to take
up too much space. And then the next one, I'm
going to have the squared variation for that y value
from the mean y. And I think these columns by
themselves will be enough for us to do everything. So let's first put all
the data points in. So we had negative 2
comma negative 3. That was one data point. Negative 1 comma negative 1. And we had 1 comma 2. Then we have 4 comma 3. Now, what does our
line predict? Well our line says, you give
me an x value, I'm going to tell you what y value
I'll predict. So when x is equal to negative
2, the y value on the line is going to be the slope. So this is going to be equal
to 41 divided by 42 times our x value. And I just selected that cell. And just a little bit of a
primer on spreadsheets, I'm selecting the cell D2. I was able to just move my
cursor over and select that. But that tells me the x value. Minus 5/21. Minus 5 divided by 21. Just like that. So just to be clear of what
we're even doing. This y star here, I
got negative 2.19. That tells us at this
point right over here is negative 2.19. So when we figure out the error,
we're going to figure out the distance between
negative 3, that's our y value, and negative 2.19. So let's do that. So the error is just going to
be equal to our y value. That's cell E2. Minus the value that our
line would predict. So just that value is
the actual error. But we want to square it. And then, the next thing
we want to do is the squared distance. So this is equal to the squared
distance of our y value from the y's mean. So what's the mean of the y's? Mean of the y's is 1/4. So minus 0.25, which is the
same thing as 1/4. And we also want
to square that. Now, this is what's fun
about spreadsheets. I can apply those formulas
to every row now. And notice, what it did
when I did that. Now all of a sudden, this is the
y value that my line would predict, it's now using
this x value and sticking it over here. It's now figuring out the square
distance from the line using what the line would
predict and using the y value, this one. And then does the same
thing over here. It figures out the squared
distance of this y value from the mean. So what is the total squared
error with the line? So let me just sum this up. The total squared error
with the line is 2.73. And then the total variation
from the mean, squared distances from the mean
of the y, are 22.75. So let me be very clear
what this is. So let me write these
numbers down. I'll write it up here so we
can keep looking at this actual graph. So our squared error versus our
line, our total squared error, we just computed
to be 2.74. I rounded a little bit. And what that is, is you take
each of these data points' vertical distance to the line. So this distance squared, plus
this distance squared, plus this distance squared, plus
this distance squared. That's all we just calculated
on Excel. And that total squared variation
to the line is 2.74. Or total squared error
with the line. And then the other number we
figured out was the total distance from the mean. So the mean here is
y is equal to 1/4. So that's going to be
right over here. This is 1/2. So right over here. So this is our mean y value. Or the central tendency
for our y values. And so what we calculated next
was the total error, the squared error, from the
means of our y values. That's what we calculated over
here in the spreadsheet. You see in the formula. It is this number, E2, minus
0.25, which is the mean of our y's squared. That's exactly what
we calculated. We calculated for each
of the y values. And then we summed
them all up. It's 22.75. It is equal to 22.75. So this is essentially
the error that the line does not explain. This is the total error,
this is the total variation of the numbers. So if you wanted to know the
percentage of the total variation that is not explained
by the line, you could take this number divided
by this number. So 2.74 over 22.75. This tells us the percentage
of total variation not explained by the line or
by the variation in x. And so what is this number
going to be? I can just use Excel for this. So I'm just going to divide this
number divided by this number right over there. I get 0.12. So this is equal to 0.12. Or another way to think about
it is 12% of the total variation is not explained
by the variation in x. The total squared distance
between each of the points or their kind of spread, their
variation, is not explained by the variation in x. So if you want the amount that
is explained by the variance in x, you just subtract
that from 1. So let me write it
right over here. So we have our r squared, which
is the percent of the total variation that is
explained by x, is going to be 1 minus that 0.12 that
we just calculated. Which is going to be 0.88. So our r squared here is 0.88. It's very, very close to 1. The highest number
it can be is 1. So what this tells us, or a way
to interpret this, is that 88% of the total variation of
these y values is explained by the line or by the
variation in x. And you can see that it looks
like a pretty good fit. Each of these aren't too far. Each of these points are
definitely much closer to the line than they are
to the mean line. In fact, all of them are closer
to our actual line than to the mean.
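The whole spreadsheet computation in this transcript fits in a few lines of Python; here is a sketch using the same four data points and the regression line y* = (41/42)x - 5/21 from the previous video:

```python
# Sketch reproducing the spreadsheet calculation from the transcript:
# four data points, the regression line from the previous video, and
# r^2 = 1 - SE_line / SE_mean.
points = [(-2, -3), (-1, -1), (1, 2), (4, 3)]
m, b = 41 / 42, -5 / 21  # slope and intercept of the regression line

mean_y = sum(y for _, y in points) / len(points)  # 0.25

se_line = sum((y - (m * x + b)) ** 2 for x, y in points)  # squared error vs. line
se_mean = sum((y - mean_y) ** 2 for _, y in points)       # squared error vs. mean

r_squared = 1 - se_line / se_mean
print(round(se_line, 2), se_mean, round(r_squared, 2))  # 2.74 22.75 0.88
```

The printed values match the video: 2.74 of squared error left unexplained by the line, 22.75 of total squared variation, and r squared of 0.88.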