## Statistics and probability

© 2023 Khan Academy

# Proof (part 1) minimizing squared error to regression line

Proof (Part 1) Minimizing Squared Error to Regression Line. Created by Sal Khan.

## Want to join the conversation?

- What is the point or the purpose of squaring the error line? Why not cubed, square root or even dot or cross product? I do not mean in a mathematical sense, but in a practical sense. What information does the square of the error line give us?(35 votes)
- There are a couple reasons to square the errors. Squaring the value turns everything positive, effectively putting negative and positive errors on equal footing. In other words, it treats any deviation away from the line of the same absolute size (in the positive or negative direction) as the same.

You can achieve the same result (turning negative numbers into positive ones) by taking the absolute value or by raising the values to any positive even exponent (like 4, 6, etc.). So why was squaring chosen over taking the absolute value? The simplest answer is that working with exponents is mathematically and computationally easier than working with absolute values; this was particularly true back in the day when people did this work solely by hand. Because of the power of computers nowadays, that computational "problem" is much less of a problem, and some people argue for (and use) the sum of absolute errors instead of the sum of squared errors. However, those people are in the minority. I will warn that the general expectation is the sum of squared errors as the measure: people have seen it, they understand it, and they know the various tests and statistics around it. So a person who wanted to use absolute errors instead would possibly have to derive those results and educate their audience.

You could also argue that using the squared error instead of the absolute error places greater emphasis on values that are relatively farther from the line. In other words, you are punished more for producing a line that is relatively far from some points, because those errors are squared. A potential problem, however, is that outliers can more easily skew the regression line under this method. That is most likely why the exponent is 2, the smallest even exponent, rather than something like the "sum of errors raised to the 4th power": a higher power would emphasize the outliers (or near outliers) even more.(88 votes)
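To make the point about outliers concrete, here is a minimal sketch that measures how much of the total penalty comes from a single large error under different exponents. The residual values are made up for illustration; they are not from the video.

```python
# Hypothetical errors between four points and some line; -10 plays the outlier.
residuals = [1, -1, 2, -10]


def outlier_share(power):
    """Fraction of the total penalty contributed by the largest error."""
    penalties = [abs(r) ** power for r in residuals]
    return max(penalties) / sum(penalties)


# power 1 = absolute error, power 2 = squared error, power 4 = fourth power
shares = {power: outlier_share(power) for power in (1, 2, 4)}

for power, share in shares.items():
    print(f"power {power}: outlier carries {share:.1%} of the total penalty")
```

The share grows with the exponent, which is the sense in which higher powers let outliers dominate the fit.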

- I don't understand: why y1-(mx1+b)?

Shouldn't it be (mx1+b)-y1?(3 votes)
- It can be! That is the advantage of using squared error instead of simply the 'linear error'.

Notice that some points end up above the line (where y1-(mx1+b) is positive) and some below (where (mx1+b)-y1 is positive). To handle this, statisticians square the values so that every term is non-negative.

Overall, you can use either version, they both work.(9 votes)
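
The reason both versions work can be written in one line. The two differences have opposite signs, and squaring removes the sign:

$$\bigl((m x_1 + b) - y_1\bigr)^2 = \bigl(-(y_1 - (m x_1 + b))\bigr)^2 = \bigl(y_1 - (m x_1 + b)\bigr)^2.$$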

- What video should I go to when I don't understand why there he starts putting 2's in front of things and having extra brackets worth of stuff...he calls it algebraic equations...it's sooo much fun doing inferential statistics with only a grade 6 education.(1 vote)
- I assume you mean what he's writing at 0:52? That's algebra (probably 2-3 years beyond your level). He's expanding the quadratic (the thing in parentheses that is squared). I'm not sure where that is explained on Khan Academy, but: (a+b)^2 = a^2 + 2ab + b^2. Then we could compare what Sal wrote against this to see what plays the role of "a" and what plays the role of "b".

However, if your math is at 6th grade, then you should probably skip any of the videos that say "Proof." Generally the proofs in Statistics will be using math that's 5 or more years beyond that level. Once you learn Calculus (mainly, finding a minimum or maximum via derivatives), I imagine the proof will make perfect sense.

At your level, I would assume that the focus would be on applying Statistical methods (e.g. estimate the mean, compute a confidence interval, etc) instead of deriving anything.

If you're just doing the Stats section of Khan Academy on your own and want to understand the proofs better, I'd suggest going over to the Calculus and Algebra sections, as Statistics makes heavy use of both of them (Calculus mainly for the proofs).(7 votes)

- Okay, so squaring is done in order to have positive values, but what's the problem actually in having both positive and negative errors? I mean, if we need a line which fits the data, the one which has a 0 or close to 0 error is right between the data set right?

The only case where this method doesn't work seems to me to be the one of aligned data points, but for other cases the "true" error doesn't seem that bad to me.(2 votes)
- The sum of errors from the mean without squaring is always zero. Check it for yourself. If you have Excel, put some numbers in a column of cells, calculate the mean, and in the next column subtract the mean from each value. Then add up those errors.(3 votes)
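
The spreadsheet check described above can also be sketched in a few lines of Python. The numbers are made up for illustration, and `Fraction` is used only so the arithmetic is exact rather than subject to floating-point rounding.

```python
from fractions import Fraction

# Hypothetical data values, stored as exact fractions.
values = [Fraction(v) for v in (3, 7, 8, 12, 15)]

mean = sum(values) / len(values)

# Signed deviations from the mean: positives and negatives cancel.
deviations = [v - mean for v in values]
total_raw = sum(deviations)

# Squared deviations cannot cancel, so the total reflects the spread.
total_squared = sum(d * d for d in deviations)

print("sum of raw deviations:    ", total_raw)
print("sum of squared deviations:", total_squared)
```

Because the raw deviations always cancel around the mean, their sum says nothing about how spread out the data is, while the squared sum does.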

- Why would I need to do this? (Real life example)(1 vote)
- Regression is a very common technique in economics to predict the behaviour of the market. So if you ever decide to sell something to a big group of people, you will probably end up using regressions to find the best price for what you want to sell.(5 votes)

- Why are all the terms (y1, y2, yn...etc) being added? How will that help us find the minimized squared error to the line?(2 votes)
- Eventually we will find the derivative of the whole thing (find the function that finds the slope if you aren't familiar with calculus) and set that to zero, allowing us to solve the constants for the minimum possible error.(4 votes)

- At 2:05, Sal sounded a bit cute when he said "keep going" two times.(2 votes)
- You call it SE, but really you should probably be calling it SSE: the sum of squared errors.(2 votes)
- Where did he get nb^2 from? Where is the n coming from?(1 vote)
- He's summing the b^2 terms. You're probably used to seeing Σ(x) meaning the sum of all the x's (e.g., x1+x2+...+xn), but that is for variables. When you apply the summation rules to a constant, which is what b^2 is in this expression, the sum is written as nb^2, where n is the number of times the constant occurs.(2 votes)
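
Written out, the rule for summing a constant is:

$$\sum_{i=1}^{n} b^2 = \underbrace{b^2 + b^2 + \cdots + b^2}_{n \text{ terms}} = n b^2.$$

Each of the n data points contributes one copy of b^2, which is where the n comes from.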

- What is the coefficient of non-determination?(1 vote)

## Video transcript

In the last video, we showed
that the squared error between some line, y equals mx plus
b and each of these n data points is this expression
right over here. In this video, I'm really just
going to algebraically manipulate this expression so
that it's ready for the calculus stage. So we can actually optimize, we
can actually find the m and b values that minimize this
value right over here. So this is just going to be a
ton of algebraic manipulation. But I'll try to color code
it well so we don't get lost in the math. So let me just rewrite this
expression over here. So this whole video is just
going to be rewriting this over and over again. Just simplifying it a
bit with algebra. So this first term right over
here, y1 minus mx1 plus b squared, this is all going
to be the squared error of the line. So this first term over here,
I'll keep it in blue, is going to be if we just expand it, y1
squared minus 2 times y1 times mx1 plus b, plus mx1
plus b squared. All I did is I just squared
this binomial right here. You can imagine if this was a
minus b, it would be a squared minus 2ab plus b squared. That's all I did. Now I'll just have to do that
for each of the terms. And each term is only different by
the x and the y coordinates right over here. And I'll go down so that we
can kind of combine like terms. So this term over here
squared is going to be y2 squared minus 2 times
y2 times mx2 plus b plus mx2 plus b squared. Same exact thing up here. Except now it was with x2 and
y2, as opposed to x1 and y1. And then we're just going to
keep doing that n times. We're going to do it for the
third, x3, y3, keep going, keep going. All the way until we get to
this nth term over here. And this nth term over here when
we square it is going to be yn squared minus 2yn
times mxn plus b, plus mxn plus b squared. Now, the next thing I want to
do is actually expand these out a little bit more. So let's actually scroll down. So this whole expression, I'm
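
In symbols, the expansion described so far uses the identity $(u - v)^2 = u^2 - 2uv + v^2$ with $u = y_i$ and $v = m x_i + b$. The names $u$ and $v$ are introduced here only to avoid clashing with the line's $b$; they do not appear in the video.

$$\bigl(y_i - (m x_i + b)\bigr)^2 = y_i^2 - 2 y_i (m x_i + b) + (m x_i + b)^2, \qquad i = 1, \dots, n.$$
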
just going to rewrite it, is the same thing as-- and remember
this is just the squared error of the line. So let me rewrite this
top line over here. This top line over here
is y1 squared. And then I'm going to
distribute this 2y1. So this is going to be
minus 2y1mx1, that's just that times that. Minus 2y1b. And then plus, and now let's
expand mx1 plus b squared. So that's going to be m squared
x1 squared, plus 2 times mx1 times b
plus b squared. All I did, if this was a plus b
squared, this is a squared plus 2ab plus b squared. And we're going to do that for
each of these terms. Or for each of these colors, I
guess you could say. So now let's move to
the second term. It's going to be
the same thing. But instead of y1's and
x1's, it's going to be y2's and x2's. So it is y2 squared minus
2y2mx2 minus 2y2b plus m squared x2 squared, plus 2 times
mx2b plus b squared. And we're going to keep
doing this all the way to get the nth term. I guess color we should say. So this is going to be yn
squared minus 2ynmxn. And you don't even
have to think. You just have to kind of
substitute these with n's now. We could actually
look at this. But it's going to be the
exact same thing. Minus 2ynb plus m squared
xn squared, plus 2mxnb plus b squared. So once again, this is just the
squared error of that line with n points. Between those n points and the
line y equals mx plus b. So let's see if we can simplify
this somehow. And to do that what I'm going to
do is I'm going to kind of try to add up a bunch
of these terms here. So if I were to add up all of
these terms right here, if I were to add up this
column right over there, what do I get? It's going to be y1 squared plus
y2 squared all the way to yn squared. That's those terms
right over there. So I'm going to have that. And then have this common
2m amongst all of these terms over here. So let me write that down. So then you have this 2m
here, 2m here, 2m here. Let me put parentheses
around here. So you have these terms
all added up. Then you have minus 2m times all
of these terms. Actually, let me color code it so you
see what we're doing. I want to be very careful
with this math so nothing seems too confusing. Although this is really just
algebraic manipulation. If I add all of these up, I get
y1 squared plus y2 squared all the way to yn squared. I'll put some parentheses
around that. And then to that, we have this
common term, we have this minus 2m, minus 2m, minus 2m. And so we can distribute
those out. And so I should actually
write it like this. So we have a minus 2m, once we
distribute it out up here, we're just going to be
left with a y1x1. Or maybe I can call
it an x1y1. That's that over there with
the 2m factored out. Let me do that in
another color. I want to make this
easy to read. Plus x2y2. Plus xnyn. Well we're going to keep adding
up-- we're going to do this n times. All the way to plus xnyn. This last term over here,
ynxn, same thing. So that's the sum. So this stuff over here, the sum
of all of this stuff right over here, is the same thing as
this term right over here. And then we have to sum
this right over here. And you see again, we can factor
out here a minus 2b out of all of these terms. So we
have minus 2b times y1 plus y2 plus all the way to yn. So this business. These terms right over here,
when you add them up, give you these terms, or this term,
right over there. And let's just keep going. And in the next video, we're
probably going to run out of time in this one, I'll simplify
this more and clean up the algebra a good bit. So then the next term, what
is this going to be? Same drill. We can factor out
an m squared. So we have m squared times
x1 squared plus x2 squared-- actually, I want to
color code them, I forgot to color code these over here. Plus all the way
to xn squared. Let me color code these. This was a yn squared. And this over here
was a y2 squared. So this is exactly this. So in this last step we just
did, this thing over here is this thing right over here. And of course we
have to add it. So I'll put a plus out front. We're almost done with this
stage of the simplification. So over here, we have a common
2mb, so let's put a plus 2mb times, once again, x1 plus x2
plus all the way to xn. So this term right over here
this is the exact same thing as this term over here. And then finally, we have a b
squared in each of these. And how many of these b
squared do we have? Well we have n of these
lines, right? This is the first line, second
line, then bunch, bunch, bunch all the way to the nth line. So we have b squared added
to itself n times. So this right over here is
just b squared n times. So we'll just write that as
plus n times b squared. Let me remind ourselves what
this is all about. This is all just algebraic
manipulation of the squared error between those n points
and the line y equals mx plus b. It doesn't look like I've
simplified it much. And I'm going to stop in
the video right now. In the next video, we're just
going to take off right here and try to simplify
this thing.
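
As a quick sanity check on the algebra above, the grouped expression can be compared numerically against the direct sum of squared errors. This is a minimal sketch: the data points and the values of m and b are made up for illustration and do not come from the video.

```python
# Hypothetical data points and a hypothetical candidate line y = m*x + b.
xs = [1, 2, 4, 7]
ys = [3, 3, 8, 12]
m, b = 2, -1
n = len(xs)

# Direct definition: add up (y_i - (m*x_i + b))^2 over all n points.
direct = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Grouped form reached at the end of the video, one line per group of terms.
grouped = (
    sum(y * y for y in ys)                          # y1^2 + ... + yn^2
    - 2 * m * sum(x * y for x, y in zip(xs, ys))    # -2m(x1y1 + ... + xnyn)
    - 2 * b * sum(ys)                               # -2b(y1 + ... + yn)
    + m * m * sum(x * x for x in xs)                # m^2(x1^2 + ... + xn^2)
    + 2 * m * b * sum(xs)                           # 2mb(x1 + ... + xn)
    + n * b * b                                     # n * b^2
)

print("direct sum of squared errors:", direct)
print("grouped expression:          ", grouped)
```

Both values agree for any choice of points and any m and b, since the grouped form is only a rearrangement of the same terms.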