
© 2024 Khan Academy

# Proof (part 4) minimizing squared error to regression line

Proof (Part 4) Minimizing Squared Error to Regression Line. Created by Sal Khan.

## Want to join the conversation?

- At 1:00, why does Sal decide to subtract one equation from the other? And how is this okay? (6 votes)
- For any equation, there are two sides with " = " in between so let's call that

Left Hand Side (L.H.S) = Right Hand Side (R.H.S)

Consider two equations

L.H.S.1 = R.H.S.1 ........................... say equation1

L.H.S.2 = R.H.S.2 ........................... say equation2

We know that we can add or subtract the same quantity on both sides of any equation and the equation remains valid, i.e.

L.H.S.1 + x = R.H.S.1 + x or L.H.S.1 - x = R.H.S.1 - x

Now let's subtract L.H.S.2 (say x = L.H.S.2) from both sides of equation1 above

L.H.S.1 - L.H.S.2 = R.H.S.1 - L.H.S.2 ......................... say equation3

But from equation2, L.H.S.2 = R.H.S.2

So we can replace L.H.S.2 on the right-hand side of equation3 with R.H.S.2 to get

L.H.S.1 - L.H.S.2 = R.H.S.1 - R.H.S.2 ........................ equation4

Now if we look at equation1, equation2 and equation4, we find that

equation4 is exactly (equation1 - equation2). So subtraction of equations is quite okay :) (9 votes)
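Applied to the two equations the transcript describes (written here with bars for means, a notation assumed for this sketch), the subtraction looks roughly like this. The b terms cancel because b appears identically in both equations:

```latex
\begin{aligned}
m\,\bar{x} + b &= \bar{y} \\
m\,\frac{\overline{x^2}}{\bar{x}} + b &= \frac{\overline{xy}}{\bar{x}} \\[4pt]
\Rightarrow\quad m\left(\bar{x} - \frac{\overline{x^2}}{\bar{x}}\right) &= \bar{y} - \frac{\overline{xy}}{\bar{x}}
\end{aligned}
```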

- I calculated m by subtracting the first formula [m(x^2) + bx = xy] from the second (y = mx + b), which is the opposite of what Sal does at 0:50. I get a different formula for m. Is this OK? I can't equate the two formulas. Here is the formula I get (all the x and y should have the mean sign): m = [(xy/x) - y]/[(x^2/x) - x] (7 votes)
- Multiply both top and bottom by -1, and the result is

[(-xy/x)+y]/[(-x^2/x)+x]

where all the x, x^2, and y should have the mean sign. This is equivalent to

[y-(xy/x)]/[x-(x^2/x)]

simply by switching the order of the terms in the numerator and denominator, respectively. This is Sal's answer. (7 votes)
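Written with bars for means (a notation assumed here for readability), the same sign flip can be sketched in one line:

```latex
m
= \frac{\dfrac{\overline{xy}}{\bar{x}} - \bar{y}}{\dfrac{\overline{x^2}}{\bar{x}} - \bar{x}}
= \frac{(-1)\left(\dfrac{\overline{xy}}{\bar{x}} - \bar{y}\right)}{(-1)\left(\dfrac{\overline{x^2}}{\bar{x}} - \bar{x}\right)}
= \frac{\bar{y} - \dfrac{\overline{xy}}{\bar{x}}}{\bar{x} - \dfrac{\overline{x^2}}{\bar{x}}}
```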

- Two questions please.

1. I understand that we 'square' the distances to the best fitting line because that will eliminate the negatives. I'm wondering however whether squaring skews the results somehow, so that the points that are furthest from the best fitting line exert more of a force in their direction?

2. The formula sought the minimum vertical distances between points and the best fitting line. Would the same result be achieved if, instead of minimizing the vertical distances, we minimized the absolute distance between the points and the line?

Thank you! (7 votes)
- Those are some astute questions.

[1.] Yes and no. The more extreme points will exert a larger influence on the line, but there are some caveats. We have two variables, X and Y, so points can be out of whack in either the x-direction or the y-direction. Points that are further out in the x-direction will exert a strong pull on the line. There is actually a statistic to measure this called "leverage." Outliers in the y-direction don't impact the regression nearly so much.

[2.] I'm not sure that you stated your question properly. The formula we used (called "Simple Linear Regression") minimizes the *squared* vertical distances between the points and the line. We could use the absolute value instead, though that would still be looking at the vertical distance.

There is also a type of regression that does not measure vertical distance, called Deming regression. In one special case of this type of regression, instead of vertical distances, we look at distances orthogonal / perpendicular to the regression line. (8 votes)

- A few videos back Sal presented a formula for the least squares regression line where the slope m = r(Sy/Sx), that is, the correlation coefficient times the sample standard deviation in Y divided by the sample standard deviation in X. Is this formula equivalent to the one presented in this video, and if so how does one establish their equivalence? (7 votes)
- Yes, the formula m = r * (STD of y / STD of x) is equivalent to the formula derived in the video for linear regression using the method of least squares. The correlation coefficient r captures the linear relationship between x and y, and multiplying it by the ratio of the standard deviations standardizes the relationship in terms of variability in both x and y. Thus, both formulas aim to find the slope of the regression line that minimizes the squared errors. (1 vote)
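One way to establish the equivalence is sketched below. It assumes the usual definitions of the sample covariance $s_{xy}$, the sample standard deviations $s_x$ and $s_y$, and $r = s_{xy}/(s_x s_y)$; these symbols are introduced for this sketch and do not appear in the video. The key observation is that the numerator and denominator of the video's formula are the covariance and variance computed with $1/n$, and switching both to $1/(n-1)$ leaves the ratio unchanged:

```latex
\begin{aligned}
m &= \frac{\bar{x}\,\bar{y} - \overline{xy}}{(\bar{x})^2 - \overline{x^2}}
   = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - (\bar{x})^2} \\[4pt]
  &= \frac{\frac{1}{n}\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n}\sum_i (x_i - \bar{x})^2}
   = \frac{\frac{1}{n-1}\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n-1}\sum_i (x_i - \bar{x})^2}
   = \frac{s_{xy}}{s_x^2} \\[4pt]
  &= \frac{r\, s_x\, s_y}{s_x^2}
   = r\,\frac{s_y}{s_x}
\end{aligned}
```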

- We have worked out $m = \frac{\bar{x}\,\bar{y} - \overline{xy}}{(\bar{x})^2 - \overline{x^2}}$. But what if the denominator somehow comes out to zero? Does that mean this 'analytical solution' has a limitation? Thanks. (1 vote)
- Think about what the denominator represents: variation in the X-variable.

If there is no variation in the X-variable (i.e. all the X's are the same value), then there is absolutely no point in doing regression in the first place. (7 votes)
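A small check of that claim, as a sketch on made-up numbers: the denominator of the slope formula is the negative of the population variance of the x values, so it is zero exactly when every x is the same. The helper name `slope_denominator` is hypothetical.

```python
from statistics import pvariance


def slope_denominator(xs):
    """Denominator of the slope formula: (mean of x)^2 - (mean of x^2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_x_sq = sum(x * x for x in xs) / n
    return mean_x ** 2 - mean_x_sq


# Made-up data for illustration only.
varied = [1.0, 2.0, 4.0]
constant = [3.0, 3.0, 3.0]

# With varying x, the denominator matches minus the population variance.
print(slope_denominator(varied), -pvariance(varied))

# With identical x values, the denominator is zero and m is undefined.
print(slope_denominator(constant))
```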

- @2:05:

I'm having trouble seeing that P1 = (x bar, y bar) and P2 = ( (x^2 bar / x bar), (xy bar / x bar) ) are both on the best fit line.

I imagine that any (u, v) such that v = mu + b is on the line y = mx + b; and therefore so are P1 and P2.

The 2 derivatives tell us that (1) y bar = m * x bar + b, and (2) xy bar = m * x^2 bar + b * x bar (or, dividing through by x bar, (3) xy bar / x bar = m * (x^2 bar / x bar) + b). Solving (1) and (2) for m and b gives m1 and b1 (in terms of x and y). So (1) and (3) are both true using m1 and b1. Since (1) and (3) are both of the form v = m1 * u + b1, P1 and P2 are both on the best fit line. (2 votes)
- What if the mean is 0? The second point will cease to exist. (3 votes)
- In a previous video on the equation of the regression line, m is derived from r*Sy/Sx, r being the correlation coefficient. Is this a different way of calculating m? (2 votes)
- The formula presented here for m is derived from the method of least squares, which minimizes the sum of the squared errors between the observed and predicted values of y. The formula m = r * (Sy / Sx) does not describe a different line: it is the same least-squares slope written in terms of the correlation coefficient and the standard deviations, so the two give the same m whenever Sx and Sy are nonzero. (1 vote)

- Why use this formula instead of m = r*(Sy/Sx), and then find b by substituting the means of x and y into y = mx + b? I mean, that would take less time, wouldn't it? (2 votes)
- Both formulas are valid and give the same regression line; they differ only in which summary statistics they start from. The formula in the video comes directly from minimizing the sum of the squared errors and needs the means of x, y, xy, and x^2. The formula m = r * (Sy / Sx) needs the correlation coefficient and the two standard deviations, so it is quicker when those are already known, and it also shows the strength and direction of the linear relationship. (1 vote)
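A quick numeric comparison of the two slope formulas, as a sketch on made-up data. The function names are hypothetical, and r and the sample standard deviations are computed by hand from their usual definitions rather than taken from the video.

```python
from math import sqrt


def slope_from_means(xs, ys):
    """Slope from the video's formula, using the four means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    mean_x_sq = sum(x * x for x in xs) / n
    return (mean_x * mean_y - mean_xy) / (mean_x ** 2 - mean_x_sq)


def slope_from_correlation(xs, ys):
    """Slope as r * (Sy / Sx), with sample standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_x = sqrt(sxx / (n - 1))
    s_y = sqrt(syy / (n - 1))
    r = sxy / sqrt(sxx * syy)
    return r * s_y / s_x


# Made-up data for illustration only.
xs = [1.0, 2.0, 4.0, 7.0]
ys = [2.0, 1.0, 5.0, 6.0]

# The two values should agree up to floating-point rounding.
print(slope_from_means(xs, ys), slope_from_correlation(xs, ys))
```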

- While teaching about the regression line a few videos ago, Sal showed how to use the value of r to get the best-fitting line. Here we have another equation for the best-fitting least-squares regression line. Are the two the same? If yes, why do we have different formulas for the line? If not, what is the basic difference? (2 votes)
- Both methods give the same best-fitting line for any data set in which x and y both vary. The formula involving the correlation coefficient r expresses the slope through r and the standard deviations of x and y, while the formula in this video expresses it through the means of x, y, xy, and x^2. They are two algebraic forms of the same least-squares solution, not two different lines. (1 vote)

## Video transcript

So if you've gotten this far,
you've been waiting for several videos to get to the
optimal line that minimizes the squared distance to
all of those points. So let's just get to
the punch line. Let's solve for the
optimal m and b. And just based on what we did
in the last videos, there's two ways to do that. We actually now know two points
that lie on that line. So we can literally find the
slope of that line and then the y-intercept,
the b there. Or, we could just say
it's the solution to this system of equations. And they're actually
mathematically equivalent. So let's solve for m first. And
if we want to solve for m, we want to cancel out the b's. So let me rewrite this top
equation just the way it's written over here. We have m times the mean of the
x squareds plus b times the mean of-- Actually,
we could even do it better than that. One step better than that is to,
based on the work we did in the last video, we can just
subtract this bottom equation from this top equation. So let me subtract it. Or let's add the negatives. So if I make this negative,
this is negative. This is negative. What do we get? We get m times the mean of the
x's minus the mean of the x squareds over the mean of x. The plus b and the negative
b cancel out. Is equal to the mean of the y's
minus the mean of the xy's over the mean of the x's. And then, we can divide both
sides of the equation by this. And so we get m is equal to the
mean of the y's minus the mean of the xy's over the mean
of the x's over this. The mean of the x's minus the
mean of the x squareds over the mean of the x's. Now notice, this is the exact
same thing that you would get if you found the slope between
these two points over here. Change in y, so the difference
between that y and that y, is that right over there. Over the change in x's. The change in that
x minus that x is exactly this over here. Now, to simplify it, we can
multiply both the numerator and the denominator by
the mean of the x's. And I do that just so we don't
have this in the denominator both places. So if we multiply the numerator
by the mean of the x's, we get the mean of the x's
times the mean of the y's minus, this and this will
cancel out, minus the mean of the xy's. All of that over, mean of the
x's times the mean of the x's is just going to be the mean of
the x's squared, minus over here you have the mean
of the x squareds. And that's what we get for m. And if we want to solve for
b, we literally can just substitute back into either
equation, but this equation right here is simpler. And so if we wanted to solve for
b there, we can solve for b in terms of m. We just subtract m times the
mean of x's from both sides. We get b is equal to the mean
of the y's minus m times the mean of the x's. So what you do is you take
your data points. You find the mean of the x's,
the mean of the y's, the mean of the xy's, the mean
of the x squareds. You find your m. Once you find your m, then you
can substitute back in here and you find your b. And then you have your
actual optimal line. And we're done. So these are the two big formula
takeaways for our optimal line. What I'm going to do in the next
video, and this is where if anyone was skipping up to
this point, the next video is where they should re-engage,
because we're actually going to use these
formulas for the best fitting line. At least, when you measure the
error by the squared distances from the points. We're going to use these
formulas to actually find the best line for some data.
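The recipe from the transcript (compute the four means, then m, then b) can be sketched in a few lines of Python. The function name `regression_line` and the sample points are made up for illustration; the points were chosen to lie exactly on y = 2x + 1 so the result is easy to check by hand.

```python
def regression_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b.

    Follows the transcript: m comes from the four means, and
    b = mean(y) - m * mean(x).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    mean_x_sq = sum(x * x for x in xs) / n

    denominator = mean_x ** 2 - mean_x_sq
    if denominator == 0:
        # All x values are identical, so the slope is undefined.
        raise ValueError("x values have no variation")

    m = (mean_x * mean_y - mean_xy) / denominator
    b = mean_y - m * mean_x
    return m, b


# Made-up points lying exactly on y = 2x + 1.
print(regression_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]))  # (2.0, 1.0)
```

The explicit check on the denominator is a design choice of this sketch, so that a data set with no variation in x fails with a clear message rather than a division error.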