© 2023 Khan Academy

# Calculating the equation of a regression line

Calculating the equation of a least-squares regression line. Intuition for why this equation makes sense.

## Want to join the conversation?

- What video is he referring to in the beginning?(30 votes)
- He's referring to the video in the "Correlation coefficients" section called "Calculating correlation coefficient r": https://www.khanacademy.org/math/statistics-probability/describing-relationships-quantitative-data/scatterplots-and-correlation/v/calculating-correlation-coefficient-r

(The Correlation coefficients section is the second section in this category. We skip it if we pass the quiz, but it's a pretty useful video!)(33 votes)

- Why, for a least-squares regression line, am I definitely going to have the point (sample mean of x, sample mean of y) on the line?(25 votes)
- Hmm. Interesting, right? The proof involves hairy algebra and some partial derivatives, but here it is, a series of videos.

https://www.khanacademy.org/math/statistics-probability/describing-relationships-quantitative-data/more-on-regression/v/proof-part-3-minimizing-squared-error-to-regression-line (start from the top video of that section)

But it kind of makes intuitive sense, no? The point (mean of x, mean of y) is built from the means of both variables, so it is a point that really does describe the data as a whole well.(15 votes)
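One step of that proof can be sketched directly. Assuming the line is written as ŷ = mx + b and the squared error is summed over the n data points (the symbols SE, x_i, y_i, and n are notation introduced here, not taken from the video), setting the partial derivative with respect to b to zero forces the line through the point of means:

$$
\begin{aligned}
SE(m, b) &= \sum_{i=1}^{n} \bigl(y_i - (m x_i + b)\bigr)^2 \\
\frac{\partial SE}{\partial b} &= -2 \sum_{i=1}^{n} \bigl(y_i - m x_i - b\bigr) = 0 \\
\sum_{i=1}^{n} y_i &= m \sum_{i=1}^{n} x_i + n b \\
\bar{y} &= m \bar{x} + b
\end{aligned}
$$

So for whatever slope m minimizes the squared error, the matching b places (mean of x, mean of y) on the line.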

- At 3:10, why must the regression line go through the point (mean of x, mean of y)?(21 votes)
- Why do we not use x hat in the equation of the least-squares regression line?

y hat = m (x) + b?(2 votes)
- A hat over a variable in statistics means that it is a predicted value. In general, the explanatory variable is on the x-axis and the response variable is on the y-axis. The response variable can be predicted based on the explanatory variable. The predicted response is **not** exact, while the explanatory variable is exact. This is why the response variable (y) is written with a hat.(7 votes)

- Why is r always between -1 and 1?

I know that this question has been asked before, but the answers are either too technical or too naive. Could someone please provide an answer that is mathematical in nature but can be understood by someone who has an OK but not strong mathematical foundation?

Thanks for your help.(5 votes)
- The number and the sign are talking about two different things. If the scatterplot dots fit the line exactly, they will have a correlation of 100% and therefore an r value of **1.00** in magnitude. However, r may be positive or negative depending on the slope of the "line of best fit". So, a scatterplot with points that are roughly halfway between random and a perfect line (with positive slope) would have an r of about 0.50: a moderate correlation, and positive because the slope is positive.(8 votes)
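For a more mathematical version of the same answer, here is a short sketch. It assumes the definition used in the video, r as the sum of the products of the z-scores divided by n - 1, with sample standard deviations; the symbols z_{x,i} and z_{y,i} are notation introduced here. The key tool is the Cauchy-Schwarz inequality, which says a sum of products can never exceed the product of the root sums of squares:

$$
\begin{aligned}
r &= \frac{1}{n-1} \sum_{i=1}^{n} z_{x,i}\, z_{y,i},
\qquad z_{x,i} = \frac{x_i - \bar{x}}{s_x},
\quad z_{y,i} = \frac{y_i - \bar{y}}{s_y} \\
\sum_{i=1}^{n} z_{x,i}^2 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{s_x^2} = n - 1,
\qquad \sum_{i=1}^{n} z_{y,i}^2 = n - 1 \\
\Bigl|\sum_{i=1}^{n} z_{x,i}\, z_{y,i}\Bigr|
&\le \sqrt{\sum_{i=1}^{n} z_{x,i}^2}\,\sqrt{\sum_{i=1}^{n} z_{y,i}^2} = n - 1 \\
|r| &\le 1
\end{aligned}
$$

Equality holds only when every z_{y,i} is the same multiple of z_{x,i}, which is exactly the case where the points fall on a straight line.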

- In later videos we see another formula for calculating m, which is m = (X_bar*Y_bar - XY_bar) / (X_bar^2 - X^2_bar), derived by taking the partial derivatives of the squared-error function with respect to m and b. Here we see another formula, m = r*Sy/Sx. Can someone please say if there is any relationship between these two?(7 votes)
- If r = 0 then the slope is 0, so how can the line pass through the Y mean and not the y-intercept?(5 votes)
- For those who don't get it.

The goal is to find the regression line that best fits the data points. He shows the formula for the correlation coefficient, but the calculation of the correlation coefficient has already been done. The means and standard deviations of x and y are also provided.

Now, the way they derive y = mx + b:

First they use Xmean and Ymean as a reference. Ymean is NOT the y-intercept. Then he draws lines one stddev away from the mean along the x and y axes. Then he shows that, when r = 1, rise over run, which is the slope, is equal to Sy/Sx. But r also factors into this calculation. Therefore m = r*Sy/Sx. But we still have to find the y-intercept.

We know for a fact that the point (Xmean, Ymean) lies on the regression line. So we substitute m, Xmean, and Ymean into y = mx + b and solve for the y-intercept.

Honestly it's pretty smart. Wouldn't have thought about it and was going to skip this video. But glad I spent time to understand it.

This has applications in machine learning and AI - FYI.(5 votes)
- I am still quite confused. Why is m=r(Sy/Sx)? I think r is just to measure the strength of the correlation, no? What is r doing in this formula? Thanks for your help in advance!(4 votes)
- Given the spread of x values and the spread of y values, the correlation coefficient still influences the slope of the line of best fit. If the correlation is very weak (r is near 0), then the slope of the line of best fit should be near 0. The more strongly positive the correlation (the more positive r is), the more positive the slope of the line of best fit should be. The more strongly negative the correlation (the more negative r is), the more negative the slope of the line of best fit should be.(3 votes)

- Why is this the least-squares regression line? It seems we do not use least squares anywhere.(3 votes)
- All the examples and practice problems have shown simple applications of least squares; check them.(4 votes)
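One of the questions above asks how m = r*Sy/Sx relates to the formula built from the means of x, y, xy, and x^2. Both reduce to the same ratio, the sum of (x - Xmean)(y - Ymean) divided by the sum of (x - Xmean)^2, so they should give the same slope. The sketch below checks this numerically. The x values are the ones read aloud in the transcript; the y values are an assumption, chosen only to match the reported y mean of 3 and sample standard deviation of about 2.160, since the page does not list them.

```python
from statistics import mean, stdev

# x values come from the transcript; the y values are an ASSUMPTION chosen to
# match the reported mean (3) and sample standard deviation (about 2.160).
xs = [1, 2, 2, 3]
ys = [1, 2, 3, 6]
n = len(xs)

x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)

# Correlation coefficient: sum of the z-score products divided by n - 1.
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(xs, ys)) / (n - 1)

# Formula from this video: m = r * Sy / Sx.
m_from_r = r * s_y / s_x

# Formula from the later videos, written with means of x, y, x*y, and x^2.
xy_bar = mean([x * y for x, y in zip(xs, ys)])
x_sq_bar = mean([x * x for x in xs])
m_from_means = (x_bar * y_bar - xy_bar) / (x_bar ** 2 - x_sq_bar)

print(m_from_r, m_from_means)
```

Up to floating-point rounding, the two printed slopes agree, which is what the algebra predicts.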

## Video transcript

- [Instructor] In previous videos, we took this bivariate data and we calculated the correlation coefficient, and just as a bit of a review, we have the formula here, and it looks a bit intimidating, but in that video we saw all it is is an average of the product of the z scores for each of those pairs. And as we said, if r is equal to one, you have a perfect positive correlation. If r is equal to negative one, you have a perfect negative correlation, and if r is equal to zero, you don't have a correlation, but for this particular bivariate dataset, we got an r of 0.946, which means we have a fairly strong positive correlation.

What we're going to do on this video is build on this notion and actually come up with the equation for the least squares line that tries to fit these points. So before I do that, let's just visualize some of the statistics that we have here for these data points. We clearly have the four data points plotted, but let's plot the statistics for x. So the sample mean and the sample standard deviation for x are here in red, and actually let me box these off in red so that you know that's what is going on here, so the sample mean for x, it's easy to calculate one plus two plus two plus three divided by four, is eight divided by four, which is two, so we have x equals two right over here. And then this is one sample standard deviation above the mean, this is one sample standard deviation below the mean, and then we could do the same thing for the y variables. So the mean is three, and this is one sample standard deviation for y above the mean and this is one standard deviation for y below the mean. And visualizing these means, especially their intersection and also their standard deviations, will help us build an intuition for the equation of the least squares line.

So generally speaking, the equation for any line is going to be y is equal to mx plus b, where this is the slope and this is the y intercept. For the regression line, we'll put a little hat over it. So this, you would literally say y hat, this tells you that this is a regression line that we're trying to fit to these points.

First, what is going to be the slope. Well the slope is going to be r times the ratio between the sample standard deviation in the y direction over the sample standard deviation in the x direction. This might not seem intuitive at first, but we'll talk about it in a few seconds and hopefully it'll make a lot more sense, but the next thing we need to know is alright, if we can calculate our slope, how do we calculate our y intercept? Well like you first learned in Algebra one, you can calculate the y intercept if you already know the slope by saying well what point is definitely going to be on my line? And for a least squares regression line, you're definitely going to have the point sample mean of x comma sample mean of y. So you're definitely going to go through that point.

So before I even calculate for this particular example where in previous videos we calculated the r to be 0.946 or roughly equal to that, let's just think about what's going on. So our least squares line is definitely going to go through that point. Now if r were one, if we had a perfect positive correlation, then our slope would be the standard deviation of y over the standard deviation of x. So if you were to start at this point and if you were to run your standard deviation of x and rise your standard deviation of y, well with a perfect positive correlation, your line would look like this. And that makes a lot of sense. Because you're looking at your spread of y over your spread of x, if r were equal to one, this would be your slope, standard deviation of y over standard deviation of x. That has parallels to when you first learn about slope. Change in y over change in x, you're seeing you could say the average spread in y over the average spread in x. And this would be the case when r is one, so let me write that down. This would be the case if r is equal to one.

What if r were equal to negative one? It would look like this. That would be our line if we had a perfect negative correlation. Now what if r were zero? Then your slope would be zero and then your line would just be this line, y is equal to the mean of y, so you would just go through that right over there.

But now let's think about this scenario. In this scenario, our r is 0.946, so we have a fairly strong correlation, this is pretty close to one, and so if you were to take 0.946 and multiply it by this ratio, if you were to move forward in x by the standard deviation in x, for this case, how much would you move up in y? Well you would move up r times the standard deviation of y. And as we said if r is one, you would get all the way up to this perfect correlation line, but here it's a 0.946, so you would get up about 95% of the way to that. And so our line without even looking at the equation is going to look something like this, which we can see is a pretty good fit for those points. I'm not proving it here in this video.

But now that we have an intuition for these things, hopefully you'll appreciate this isn't just coming out of nowhere into some strange formula, it actually makes intuitive sense, let's calculate it for this particular set of data. m is going to be equal to r, 0.946, times the sample standard deviation of y, 2.160, over the sample standard deviation of x, 0.816. We can get our calculator out to calculate that, so we have 0.946 times 2.160, divided by 0.816, it gets us to 2.50, let's just round to the nearest hundredth for simplicity here, so this is approximately equal to 2.50.

And so how do we figure out the y intercept? Well remember, we go through this point, so we're going to have 2.50 times our x mean, so our x mean is two, times two, remember this right over here is our x mean, plus b, plus b is going to be equal to our y mean, our y mean we see right over here is three, and so what do we get? We get three is equal to five plus b. And so what is b, well if you subtract five from both sides, you get b is equal to negative two.

And so there you have it. The equation for our regression line, we deserve a little bit of a drum roll here, we would say y hat, the hat tells us that this is the equation for a regression line, is equal to 2.50 times x minus two, minus two, and we are done.
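The calculation in the transcript can be retraced in a few lines of Python. This is a minimal sketch that uses only the summary statistics quoted in the video (r, the two sample standard deviations, and the two means) and follows the same rounding step. The variable names and the `predict` helper are hypothetical, introduced here for illustration.

```python
# Summary statistics quoted in the transcript (not recomputed from raw data).
r = 0.946      # correlation coefficient
s_x = 0.816    # sample standard deviation of x
s_y = 2.160    # sample standard deviation of y
x_bar = 2.0    # sample mean of x
y_bar = 3.0    # sample mean of y

# Slope: m = r * (s_y / s_x), rounded to the nearest hundredth as in the video.
m = round(r * s_y / s_x, 2)

# The line passes through (x_bar, y_bar), so y_bar = m * x_bar + b.
b = y_bar - m * x_bar


def predict(x: float) -> float:
    """Return y-hat for a given x (hypothetical helper, not from the video)."""
    return m * x + b


print(f"y_hat = {m:.2f} * x + ({b:.2f})")
print(predict(x_bar))  # the prediction at the mean of x is the mean of y
```

With these inputs the sketch reproduces the line from the video, y hat = 2.50x - 2, and the prediction at x = 2 comes back as the y mean, 3, which is the "goes through the point of means" property in action.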