# Proof (part 1) minimizing squared error to regression line

Created by Sal Khan.

## Want to join the conversation?

• What is the point or purpose of squaring the error? Why not cube it, take the square root, or even use a dot or cross product? I do not mean in a mathematical sense, but in a practical sense. What information does the squared error give us?

• There are a couple of reasons to square the errors. Squaring turns every value positive, effectively putting negative and positive errors on equal footing. In other words, any deviation from the line of the same absolute size, whether in the positive or negative direction, is treated the same.

You can achieve the same result (turning negative numbers into positive ones) by taking the absolute value or by raising the values to any even exponent (4, 6, etc.). So why was squaring chosen over the absolute value? The simplest answer is that exponents are mathematically and computationally easier to work with than absolute values, which was particularly true back in the day when people did this work solely by hand. With the power of computers nowadays, that computational "problem" is much less of a problem, and some people argue for (and use) the sum of absolute errors instead of the sum of squared errors. Those people are in the minority, however. I will warn that the general expectation is to use the sum of squared errors as the measure: people have seen it, they understand it, and they know the various tests and statistics built around it. Anyone who wanted to use absolute errors instead would possibly have to derive the corresponding results and educate their audience.

You could also argue that using the squared error instead of the absolute error places greater emphasis on values that are relatively far from the line. In other words, you are punished more for producing a line that is relatively far from some points, because those errors are squared. A potential problem, however, is that outliers can more easily skew the regression line under this method. That is most likely why the smallest even exponent, 2, is used rather than something like the "sum of errors raised to the 4th power": a higher power would emphasize the outliers (or near outliers) even more.
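To make the point about outliers concrete, here is a minimal sketch in Python. The residuals are made up for illustration (four small errors and one outlier), and `outlier_share` is a hypothetical helper, not something from the video. It reports what fraction of the total penalty comes from the single largest error when errors are raised to the 1st, 2nd, or 4th power.

```python
# Hypothetical residuals: four small errors and one outlier (made-up numbers).
residuals = [1.0, -1.0, 2.0, -2.0, 10.0]


def outlier_share(errors, power):
    """Fraction of the total penalty contributed by the largest error
    when each error is penalized as |error| ** power."""
    penalties = [abs(e) ** power for e in errors]
    return max(penalties) / sum(penalties)


# power=1 is the absolute error, power=2 the squared error, power=4 a quartic penalty.
for power in (1, 2, 4):
    print(power, round(outlier_share(residuals, power), 3))
```

With these numbers the outlier accounts for 62.5% of the absolute-error total, about 91% of the squared-error total, and over 99% of the 4th-power total, which is the sense in which higher exponents let outliers dominate.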
• I don't understand, why y1-(mx1+b)?
Shouldn't it be (mx1+b)-y1?

• It can be! That is the advantage of using squared error instead of just simply 'linear error'.

Notice that some points end up above the line (where y1-(mx1+b) is positive) and some below (where y1-(mx1+b) is negative, so (mx1+b)-y1 is positive). To resolve this, statisticians square the values so that no term is negative.

Overall, you can use either version; they both work.
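A quick way to convince yourself that the order of subtraction does not matter once you square is to compute both versions. This is only a sketch: the line (m = 2, b = 1) and the three points are invented for illustration, and the function names are hypothetical.

```python
def error(y, x, m, b):
    """Signed error measured as point minus line: y - (m*x + b)."""
    return y - (m * x + b)


def flipped_error(y, x, m, b):
    """Signed error measured as line minus point: (m*x + b) - y."""
    return (m * x + b) - y


# Hypothetical line and points: one above the line, one below, one on it.
m, b = 2, 1
points = [(0, 3), (1, 2), (2, 5)]

for x, y in points:
    e = error(y, x, m, b)
    f = flipped_error(y, x, m, b)
    # e and f differ only in sign, so their squares are always equal.
    print(e, f, e ** 2, f ** 2)
```

The two signed errors are always negatives of each other, so squaring either one gives the same term in the sum.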
• What video should I go to when I don't understand why he starts putting 2's in front of things and having extra brackets' worth of stuff? He calls it algebraic equations... it's sooo much fun doing inferential statistics with only a grade 6 education.

• I assume you mean what he's writing at ? That's algebra (probably 2-3 years beyond your level). He's expanding the quadratic (the thing in parentheses that is squared). I'm not sure where that is explained on Khan Academy, but: (a+b)^2 = a^2 + 2ab + b^2. Then we can match up what "a" is and what "b" is in what Sal wrote.
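As a sketch of how that pattern applies to a single squared error term from this thread, let y_1 play the role of "a" and -(m x_1 + b) play the role of the pattern's "b" (which is not the same thing as the line's intercept b). The extra 2's come from the 2ab cross term:

$$
\begin{aligned}
\bigl(y_1 - (m x_1 + b)\bigr)^2
  &= y_1^2 - 2\,y_1 (m x_1 + b) + (m x_1 + b)^2 \\
  &= y_1^2 - 2 m x_1 y_1 - 2 b y_1 + m^2 x_1^2 + 2 m b x_1 + b^2
\end{aligned}
$$

The second line uses the same pattern once more, this time on (m x_1 + b)^2.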

However, if your math is at 6th grade, then you should probably skip any of the videos that say "Proof." Generally the proofs in Statistics will be using math that's 5 or more years beyond that level. Once you learn Calculus (mainly, finding a minimum or maximum via derivatives), I imagine the proof will make perfect sense.

At your level, I would assume that the focus would be on applying Statistical methods (e.g. estimate the mean, compute a confidence interval, etc) instead of deriving anything.

If you're just doing the Stats section of Khan Academy on your own and you want to understand the proofs better, I'd suggest going over to the Calculus and Algebra sections, as Statistics makes heavy use of both of them (Calculus mainly for the proofs).
• Okay, so squaring is done in order to have positive values, but what is actually the problem with having both positive and negative errors? I mean, if we need a line that fits the data, the one with an error of 0 or close to 0 sits right in the middle of the data set, right?
The only case where this method doesn't work seems to me to be that of aligned data points, but for other cases the "true" error doesn't seem that bad to me.

• Why would I need to do this? (Real life example)

• Why are all the terms (y1, y2, ..., yn) being added? How will that help us find the minimized squared error to the line?

• at , sal sounded a bit cute when he said keep going 2 times

• You call it SE, but really you should probably be calling it SSE: the sum of squared errors.
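On the questions above about keeping signed errors and about why the terms for every point are added together, a small numerical sketch may help. The data points and the three candidate lines below are made up for illustration, and the function names are hypothetical. Each candidate line passes through the point of means (1.5, 1.5), so its positive and negative errors cancel exactly, even though the lines fit the data very differently.

```python
# Hypothetical data points (x, y); they are not aligned.
points = [(0, 0), (1, 2), (2, 1), (3, 3)]


def signed_sum(points, m, b):
    """Sum of the raw errors y - (m*x + b) over all points."""
    return sum(y - (m * x + b) for x, y in points)


def sum_squared_errors(points, m, b):
    """Sum of squared errors: one squared term per point, all added up."""
    return sum((y - (m * x + b)) ** 2 for x, y in points)


# Three candidate lines as (slope, intercept), all through the mean point (1.5, 1.5).
candidates = [(1, 0), (0, 1.5), (-1, 3)]

for m, b in candidates:
    print(m, b, signed_sum(points, m, b), sum_squared_errors(points, m, b))
```

All three lines have a signed error total of exactly 0, so that measure cannot tell them apart, while the sums of squared errors are 2, 5, and 18. Adding one squared term per point gives a single number for how far the line is from the whole data set, and that number is the quantity being minimized over m and b.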