# Proof (part 1) minimizing squared error to regression line

Proof (Part 1) Minimizing Squared Error to Regression Line. Created by Sal Khan.

## Want to join the conversation?

• What is the point or the purpose of squaring the error line? Why not cubed, square root or even dot or cross product? I do not mean in a mathematical sense, but in a practical sense. What information does the square of the error line give us?
• There are a couple reasons to square the errors. Squaring the value turns everything positive, effectively putting negative and positive errors on equal footing. In other words, it treats any deviation away from the line of the same absolute size (in the positive or negative direction) as the same.

You can achieve the same result (turning negative numbers into positive ones) by taking the absolute value of the number or by raising the values to any even exponent (like 4 or 6). So why was squaring chosen over taking the absolute value? The simplest answer is that squares are mathematically and computationally easier to deal with than absolute values (for one thing, the squared error has a derivative everywhere, while the absolute value does not have one at zero). This was particularly true back in the day, when people did this work solely by hand. With the power of computers nowadays, that computational "problem" is much less of a problem, and some people argue for (and use) the sum of absolute errors instead of the sum of squared errors. Those people are in the minority, however. I will warn that the general expectation is that you use the sum of squared errors as the measure: people have seen it, they understand it, and they know the various tests and statistics around it. So a person who wanted to use absolute errors instead would possibly have to derive those results and educate their audience.

You could also argue that using the squared error instead of the absolute error places a greater emphasis on values that are relatively farther from the line. In other words, you are punished more for producing a line that is far from some of the points, because those errors are squared. A potential problem, however, is that outliers can more easily skew the regression line under this method. That is most likely why you use 2, the smallest even exponent, instead of something like the "sum of errors raised to the 4th power": a higher power would emphasize the outliers (or near outliers) even more.
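To make the outlier point concrete, here is a minimal Python sketch. The data points and the candidate line y = 2x + 1 are made up for illustration; the helper name `outlier_share` is hypothetical. It measures how much of the total error comes from the single outlier when the errors are raised to the 1st, 2nd, and 4th power.

```python
# Hypothetical data: five points on y = 2x, plus one outlier at the end.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 30]

# Candidate line (an assumption for this example): y = 2x + 1.
m, b = 2, 1

# Vertical error of each point from the line.
errors = [y - (m * x + b) for x, y in zip(xs, ys)]

def outlier_share(power):
    """Fraction of the total |error|**power contributed by the last point."""
    terms = [abs(e) ** power for e in errors]
    return terms[-1] / sum(terms)

for power in (1, 2, 4):
    print(power, round(outlier_share(power), 5))
```

The share grows with the exponent, which is the sense in which higher powers "highlight" outliers more.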
• Can someone explain to me how he got y^2-2y(mx+b)+(mx+b)^2 at ?
• I don't understand: why y1-(mx1+b)?
Shouldn't it be (mx1+b)-y1?
• It can be! That is the advantage of using squared error instead of just simply 'linear error'.

Notice that some points end up above the line (where y1-(mx1+b) is positive) and some below (where y1-(mx1+b) is negative, so (mx1+b)-y1 is positive). To resolve this problem, statisticians square the errors, so that no value is negative.

Overall, you can use either version; they both work.
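A quick way to convince yourself, as a small Python sketch (the point and the line here are made-up values):

```python
# Hypothetical point and line, chosen only for illustration.
x1, y1 = 3.0, 7.0
m, b = 2.0, 0.5

error_a = y1 - (m * x1 + b)   # point minus line
error_b = (m * x1 + b) - y1   # line minus point

# The two errors differ only in sign, so their squares are equal.
print(error_a, error_b)
print(error_a ** 2 == error_b ** 2)
```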
• What video should I go to when I don't understand why there he starts putting 2's in front of things and having extra brackets worth of stuff...he calls it algebraic equations...it's sooo much fun doing inferential statistics with only a grade 6 education.
• I assume you mean what he's writing at ? That's algebra (probably 2-3 years beyond your level). He's expanding the quadratic (the thing in parentheses that is squared). I'm not sure where that is explained on KhanAcademy, but: (a+b)^2 = a^2 + 2ab + b^2. Then we can match up what "a" is and what "b" is in what Sal wrote.
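As a worked version of that identity for the squared error term, take a = y and let the second term be -(mx+b). The first line is the form quoted in this thread; the fully expanded second line is an extra step added here, not necessarily what is on screen at that point in the video:

$$
\begin{aligned}
(y - (mx + b))^2 &= y^2 - 2y(mx + b) + (mx + b)^2 \\
&= y^2 - 2mxy - 2by + m^2x^2 + 2mbx + b^2
\end{aligned}
$$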

However, if your math is at 6th grade, then you should probably skip any of the videos that say "Proof." Generally the proofs in Statistics will be using math that's 5 or more years beyond that level. Once you learn Calculus (mainly, finding a minimum or maximum via derivatives), I imagine the proof will make perfect sense.

At your level, I would assume that the focus would be on applying Statistical methods (e.g. estimate the mean, compute a confidence interval, etc) instead of deriving anything.

If you're just doing the Stats of KhanAcademy on your own and you want to understand the proofs better, I'd suggest going over to the Calculus and Algebra sections, as Statistics makes heavy use of both of them (Calculus mainly for the proofs).
• Okay, so squaring is done in order to have positive values, but what's the problem actually in having both positive and negative errors? I mean, if we need a line which fits the data, the one which has a 0 or close to 0 error is right between the data set right?
The only case where this method doesn't work seems to me to be that of aligned data points, but for other cases the "true" error doesn't seem that bad to me.
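• The sum of the errors from the mean, without squaring, is always zero. Check it for yourself: if you have Excel, put some numbers in a column of cells, calculate the mean, and in the next column subtract the mean from each value. Then add up those errors. The same cancellation happens with a line: any line that passes through the point (mean of x, mean of y) has raw errors that sum to zero, no matter what its slope is, so a zero sum does not tell you the line fits well.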
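If you don't have a spreadsheet handy, the same check can be sketched in a few lines of Python (the numbers are made up):

```python
# Hypothetical values, chosen only for illustration.
values = [3, 7, 8, 12, 20]

mean = sum(values) / len(values)

# Error (deviation) of each value from the mean.
deviations = [v - mean for v in values]

# Positive and negative deviations cancel out.
print(sum(deviations))

# Squaring first keeps them from cancelling.
print(sum(d ** 2 for d in deviations))
```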
• Why are all the terms (y1, y2, yn...etc) being added? How will that help us find the minimized squared error to the line?
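• The terms are added because we want a single number, the total squared error over all the points, that we can make as small as possible. Eventually we will take the derivative of the whole thing with respect to m and with respect to b (the derivative is the function that gives the slope, if you aren't familiar with calculus) and set each one to zero, which lets us solve for the values of m and b that give the minimum possible error.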
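Here is a minimal Python sketch of where that ends up, assuming the standard closed-form result of setting both derivatives to zero (this is the usual textbook solution, not a transcription of the video). The data and the function name `squared_error` are made up for the example.

```python
# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 6.0]
n = len(xs)

def squared_error(m, b):
    """Total squared error: sum over all points of (y - (m*x + b))**2."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Setting both partial derivatives to zero gives two linear equations:
#   m * sum(x^2) + b * sum(x) = sum(x*y)
#   m * sum(x)   + b * n      = sum(y)
sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_xx = sum(x * x for x in xs)

# Solve the two equations for m and b.
m = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
b = (sum_y - m * sum_x) / n

print(m, b)
print(squared_error(m, b))

# Nudging m or b away from the solution makes the total error larger.
print(squared_error(m + 0.1, b), squared_error(m, b + 0.1))
```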
• Why would I need to do this? (Real life example)
• Regression is a very common technique in economics to predict the behaviour of the market. So if you ever decide to sell something to a big group of people, you will probably end up using regressions to find the best price for what you want to sell.
• Anybody know how he squares the answer so fast at . What method did he use? Is there a video on it on Khan Academy?
• What is the coefficient of non-determination?
• "Coefficient of non-determination" is a much less common term than "coefficient of determination," which is usually denoted R^2 (R-squared). The coefficient of determination represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s) in a regression model. Where "coefficient of non-determination" is used, it generally means 1 - R^2: the proportion of the variance that the model does not explain.
• The correlation coefficient, means, standard deviations, and sample size of the variables can be used to construct the regression equation, so why use the Minimizing Squared Error to Regression Line approach?
• For a regression line with one independent variable, the correlation coefficient, means, and standard deviations do give you the equation: the slope is m = r * (s_y / s_x) and the intercept is b = (mean of y) - m * (mean of x). But those formulas are not a separate method. They are what you get when you minimize the squared error to the regression line, and the minimization is what justifies calling the result the line that best fits the data: it minimizes the discrepancy between the observed data points and the predicted values from the regression line.

The regression equation obtained through this approach provides explicit estimates of the slope (m) and y-intercept (b) of the line, which allows for predicting the value of the dependent variable based on the value of the independent variable(s).

Additionally, the regression equation obtained through minimizing the squared error is based on a systematic mathematical optimization process that ensures the line provides the best possible fit to the data in terms of minimizing the overall error.
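For a line with one independent variable, the least-squares slope and intercept can also be written in terms of the summary statistics from the question: m = r * (s_y / s_x) and b = (mean of y) - m * (mean of x). The sketch below checks numerically that both routes give the same line. The data is made up, and the variable names (`m_stats`, `m_ls`, and so on) are just labels for this example.

```python
import statistics

# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 6.0]
n = len(xs)

mean_x = statistics.mean(xs)
mean_y = statistics.mean(ys)
s_x = statistics.stdev(xs)  # sample standard deviation
s_y = statistics.stdev(ys)

# Sample correlation coefficient r, computed from its definition.
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
r = cov_xy / (s_x * s_y)

# Route 1: slope and intercept from the summary statistics.
m_stats = r * s_y / s_x
b_stats = mean_y - m_stats * mean_x

# Route 2: slope and intercept from minimizing the squared error
# (the standard closed-form solution).
sum_x, sum_y = sum(xs), sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_xx = sum(x * x for x in xs)
m_ls = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
b_ls = (sum_y - m_ls * sum_x) / n

print(m_stats, b_stats)
print(m_ls, b_ls)
```

Up to floating-point rounding, the two printed lines should agree.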