
# Proof (part 2) minimizing squared error to regression line

Proof Part 2 Minimizing Squared Error to Line. Created by Sal Khan.

## Want to join the conversation?

- Why are the axes m and b instead of x and y? Aren't m and b constants and parts of the equation for the surface, with x, y, and SE being the coordinates?(7 votes)
- In line fitting, we are trying to find the equation of the line: the slope (m) and the y-intercept (b) of the best-fit line y=mx+b, based on a known set of (x,y) coordinates. So the mean of y, for example, is a constant, since it is the arithmetic mean of all the y's in our data set. We don't know what m or b are, and we're trying out different ones, so they are variables. "If I try this slope and this intercept, how big is my SE?" That is what this graph would be showing. If the question was "which of these data points are closest to the ideal, given by this line" we would use x and y as the variables.

We're in the situation of "I know the answer, now what was the question?" We know y given our x values, now we need to find the line that would get us as close as possible to those y values if all we had was the x values.(17 votes)

- Why would you divide the y1^2 + y2^2 + ... + yn^2 by n to simplify the equation? Just a little unclear on why you divide by n.(6 votes)
- If you add up all n y ^2 terms (each one is in general different), then divide by n, you get a mean value for y^2. So you no longer need all the different values, because you have one that represents them all.(7 votes)
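
In symbols, writing $\overline{y^2}$ for the mean of the squared y values (the bar notation is an assumption here; the video only describes it in words), the substitution can be sketched as:

$$
\overline{y^2} = \frac{y_1^2 + y_2^2 + \cdots + y_n^2}{n}
\quad\Longrightarrow\quad
y_1^2 + y_2^2 + \cdots + y_n^2 = n\,\overline{y^2}
$$

Nothing is divided out of the squared error itself. The long sum is simply replaced by the shorter, equal expression $n\,\overline{y^2}$.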

- How exactly did Sal figure out that the surface was 3D parabolic ?(5 votes)
- I'm assuming this is because you are dealing with the slope (m), y intercept (b), and your SE line (yellow line) and you are estimating the partial derivative of the squared error. Anything that involves minute changes in the measuring of something takes it away from algebra (which deals in straight lines and x and y coordinates only) into calculus and derivatives and 3 variable i.e. 3 dimensional graphs.(6 votes)

- At 9:20, if the partial derivative of SE with respect to m or b is 0, then it could be a minimum but could also be a maximum, since the derivative is 0. How can we say for sure that what we get is the minimizing m and b and not the maximizing m and b? Thank you.(3 votes)
- Hi Yuya!

The terms containing `m` and `b` can be looked at as equations for parabolas. Since the "squared" term is positive in both cases, these parabolas open upwards. Consequently these parabolas can only have minima. You can confirm this using either the first or second derivative tests.(9 votes)
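
As a sketch of that second derivative check, assuming the simplified expression from the video and using bars for means (the bar notation is an assumption of this note, not something shown in the video):

$$
SE(m, b) = n\,\overline{y^2} - 2mn\,\overline{xy} - 2bn\,\overline{y} + m^2 n\,\overline{x^2} + 2mbn\,\overline{x} + nb^2
$$

$$
\frac{\partial^2 SE}{\partial m^2} = 2n\,\overline{x^2}, \qquad
\frac{\partial^2 SE}{\partial b^2} = 2n, \qquad
\frac{\partial^2 SE}{\partial m\,\partial b} = 2n\,\overline{x}
$$

$$
\frac{\partial^2 SE}{\partial m^2}\cdot\frac{\partial^2 SE}{\partial b^2} - \left(\frac{\partial^2 SE}{\partial m\,\partial b}\right)^2
= 4n^2\left(\overline{x^2} - \overline{x}^{\,2}\right) \ge 0
$$

Both pure second derivatives are positive, and the last quantity is strictly positive whenever the x values are not all equal, so the point where both partial derivatives are 0 is a minimum rather than a maximum or a saddle.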

- how can you say that it will be a parabola?(4 votes)
- Whenever you deal with the square of an independent variable (x value or the values on the x-axis) it will be a parabola. What you could do yourself is plot x and y values, making the y values the square of the x values. So x = 2 then y = 4, x = 3 then y = 9 and so on. You will see it is a parabola.(5 votes)

- what is partial derivative by the way? I learned in the past but forgot xD. Can anyone explain briefly for me what its usage is??

Could you tell me where to find Sal's videos about it too?(3 votes)
- Let's say we have a function with 2 variables. The partial derivative of such a function is the derivative of the function with respect to one variable, where the second variable is treated as a constant.

Here is Sal's video about it

https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/partial-derivative-and-gradient-articles/a/introduction-to-partial-derivatives

When we plot such a function, it depicts a surface in 3D space. At any point, an infinite number of lines are tangent to this surface, and our job is to select one of these lines and find its slope.

You can read more on this here: https://en.wikipedia.org/wiki/Partial_derivative (3 votes)
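
As a small worked example on a made-up function (chosen only for illustration; it is not the squared error from the video), take $f(m, b) = m^2 + 3mb$:

$$
\frac{\partial f}{\partial m} = 2m + 3b \quad (b \text{ held constant}), \qquad
\frac{\partial f}{\partial b} = 3m \quad (m \text{ held constant})
$$

Each partial derivative is the slope of the surface along one axis direction, which is why the video sets both to 0 to look for the bottom of the bowl.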

- In the 3D graph that Sal draws, can the SE values be negative? Aren't they positive for all values for m and b?(3 votes)
- You're correct that the squared error (SE) values should be non-negative, as they represent the sum of squared distances which cannot be negative. In the 3D graph, the surface representing the SE should indeed be non-negative for all valid combinations of m and b. Any negative values in the SE would indicate a computational error or a conceptual mistake.(1 vote)

- Proof Part 2 Minimizing Squared Error To Line: How is it that you are taking the mean of the parenthetical values (@4:22/9:53) in the video? It appears that this operation to simplify is being performed on only one side of the equation. How is it algebraically correct not to perform this on both sides of the equation?(2 votes)
- I believe he is using substitution. Therefore he is not changing any values, but simply rewriting the equation.(3 votes)

- I don't understand why we are trying to use the partial derivatives to find the solution to this 3D parabola. Wouldn't it make more sense to find the point at which SE is minimized?(3 votes)
- Using partial derivatives to find the critical points (where the derivatives are zero) of the SE function is a common method to locate potential minima or maxima. By setting the derivatives to zero and solving for the variables (m and b in this case), you find the critical points where the SE may be minimized. This approach is mathematically rigorous and widely used in optimization problems.(1 vote)

- But if we are finding the minimum values of m and b separately, won't the coordinate actually lie off the equation?(3 votes)

## Video transcript

Our goal is to simplify this expression for the squared error between those n points. Just to remind ourselves what we're doing, we have these n points. And we're taking the sum of the squared error between each of those n points and our actual line, y equals mx plus b. And we get this expression over here, which we've been simplifying over the last couple videos. We're going to try to simplify this expression as much as possible. And then, we're going to try to minimize this expression. Or find the m and b values that minimize it. Or I guess you could call it the best fitting line.

Now to do that, it looks like we were just making the algebra even hairier and hairier. But this next step is going to simplify things a good bit. So just to show you that, if I want to take the mean of all of the squared values of the y's, that would be this. That would be y1 squared plus y2 squared plus all the way to yn squared. So I've summed n values, n squared values. And then I want to divide it by n, since there are n values here. And this is the mean of the y's squared. That's how we can denote it, just like that. Or, if you multiply both sides of this equation by n, you get y1 squared plus y2 squared plus all the way to yn squared is equal to n times the mean of the squared values of y. And notice, this is exactly what we have over here. That is n times the mean of the squared values of y. Or the mean of the y squareds.

And we can do that with each of these terms. What is x1y1 plus x2y2 plus all the way to xnyn? Well, if we take this whole sum and we divide it by n terms, this is going to be the mean value for x times y. For each of those points, you multiply x times y. And you find the mean of all of those products. That's exactly what this is. Well, once again, you multiply both sides of this equation by n, and you get x1y1 plus x2y2 plus all the way to xnyn is equal to n times the mean of the xy's.

I think you see where this is going. This term right here is going to be equal to n times the mean of the products of xy. This term right here is n times the mean of the y values. And then, this term right here is n times the mean of the x squared values. This term right here is the mean of the x's times n. If you divided this by n, you'd get the mean. Since we're not dividing it by n, this is the mean times n. And then this one, obviously, we don't have to simplify at all.

So let's rewrite everything using our new notation, knowing that these are the means of y squared, of xy, and all that. So our squared error to the line, the sum of the squared error to the line from the n points, is going to be equal to the following. This term right here is n times the mean of the y squared values. This term right here is equal to negative 2m, that's just that right there, times n times the mean of the xy values, the arithmetic mean. And then we have this term over here. I think you can appreciate this is simplifying the algebraic expression a good bit. This term right over here is going to be minus 2bn times the mean of the y values. And then we have plus m squared times n times the mean of the x squared values. And then we have, almost there, home stretch, this over here, which is plus 2mb times n times the mean of the x values. And then, finally, we have plus nb squared.

So really, in the last two to three videos, all we've done is simplify the expression for the sum of the squared differences from those n points to this line, y equals mx plus b. So we're finished with the hardcore algebra stage. The next stage, we actually want to optimize this. Maybe a better way to talk about it: we want to minimize this expression right over here. We want to find the m and the b values that minimize it.

And to help visualize it, we're going to start breaking into a little bit of three-dimensional calculus here. But hopefully it won't be too daunting. If you've done any partial derivatives, it won't be difficult. This is a surface. If you take the view that you have the x and y data points, everything here is a constant except for the m's and the b's. We're assuming that we have the x's and y's. So we can figure out the mean of the squared values of y, the mean of the xy product, the mean of the y's, the mean of the x squareds. We assume that those are all actual numbers.

So this expression right here, it's actually going to be a surface in three dimensions. So you can imagine, this right here, that is the m-axis. This right here is the b-axis. And then, you could imagine the vertical axis to be the squared error. This is the squared error of the line axis. So for any combination of m and b, if you're in the mb plane, you pick some combination of m and b. You put it into this expression for the squared error of the line. It'll give you a point. If you do that for all of the combinations of m's and b's, you're going to get a surface.

And the surface is going to look something like this. I'm going to try my best to draw it. It's going to look like this. You could almost imagine it as a kind of a bowl. Or you could even think of it as a three-dimensional parabola, if you want to think of it that way. Instead of a parabola that just goes like this, if you were to kind of rotate it around and distort it a little bit, you would get this thing that looks kind of like a cup, or a thimble, or whatever. And so what we want to do is to find the m and b values that minimize it.

Notice, this is a three-dimensional surface. I don't know if I'm doing justice to it. So you can imagine a three-dimensional surface that looks something like this. This is the back part that you're not seeing. So that's the inside of our three-dimensional surface. We want to find the m and b values that minimize the value on the surface. So there's some m and b value right over here that minimizes it. And I'll actually do the calculation in the next video.

But to do that, we're going to find the partial derivative of this with respect to m. And we're going to find the partial derivative of this with respect to b and set both of them equal to 0. Because at this minimum point, I guess you could say in three dimensions, this minimum point on the surface is going to occur when the slope with respect to m and the slope with respect to b are both 0. So at that point, the partial derivative of our squared error with respect to m is going to be equal to 0. And the partial derivative of our squared error with respect to b is going to be equal to 0.

So all we're going to do, in the next video, is take the partial derivative of this expression with respect to m, set that equal to 0. And the partial derivative of this with respect to b, set that equal to 0. And then we're ready to solve for the m and the b. Or the particular m and b.
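
The simplified expression described in the transcript can be checked numerically. The following is a minimal sketch, not something shown in the video: the data points, the function names, and the grid ranges are all made up for illustration. It compares the direct sum of squared errors with the version written in terms of means, then scans a coarse grid in the mb-plane for the lowest point on the surface instead of using the partial derivatives that the next video works out.

```python
def mean(values):
    """Arithmetic mean of a sequence of numbers."""
    values = list(values)
    return sum(values) / len(values)


def squared_error_direct(xs, ys, m, b):
    """Sum of squared vertical distances from each point to y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


def squared_error_from_means(xs, ys, m, b):
    """The same quantity, written with the means as in the transcript."""
    n = len(xs)
    mean_y2 = mean(y * y for y in ys)
    mean_xy = mean(x * y for x, y in zip(xs, ys))
    mean_y = mean(ys)
    mean_x2 = mean(x * x for x in xs)
    mean_x = mean(xs)
    return (n * mean_y2
            - 2 * m * n * mean_xy
            - 2 * b * n * mean_y
            + m ** 2 * n * mean_x2
            + 2 * m * b * n * mean_x
            + n * b ** 2)


# Hypothetical data points, made up for this illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 4.0]

# Both forms should agree (up to rounding) for any choice of m and b.
print(squared_error_direct(xs, ys, 1.0, 0.5))
print(squared_error_from_means(xs, ys, 1.0, 0.5))

# Walk over a coarse grid in the mb-plane and keep the lowest point found.
grid = [(i / 10, j / 10) for i in range(21) for j in range(31)]
best_m, best_b = min(grid, key=lambda mb: squared_error_from_means(xs, ys, *mb))
print(best_m, best_b)
```

A grid scan only finds the best point among the values it tries, so it illustrates the bowl shape rather than replacing the calculus: setting both partial derivatives to 0 gives the exact minimizing m and b.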