
# Squared error of regression line

Introduction to the idea that one can find a line that minimizes the squared distances to the points. Created by Sal Khan.

## Want to join the conversation?

• Why is the error measured VERTICALLY? The shortest distance from a point to a line lies along a perpendicular to that line. If we want to find the "best"-fitting line for a set of points, then surely this is the distance that should be minimized. • Don't think about it as a problem of geometry, but as one of minimizing error.

In this kind of line fitting, the x-values are assumed to have no error - if your independent variable has been measured with errors then you need a different fitting method.

Since the error for each point is only in its y-value (the dependent measurements), we only examine the vertical distance from each point to the line.
• Why do we square the error? I know that it exaggerates/differentiates between large and small errors, but if this is so, why not cube the errors? Or better yet, raise the error to its thirteenth power? Is it because of the Gauss-Markov theorem that Wikipedia mentions? • The reason we choose squared error instead of the 3rd, 4th, or 26th power of the error is the nice shape the squared error makes when we graph it against m and b. The graph is a 3-D paraboloid (a bowl) whose lowest point sits at our optimally chosen m and b. Since this surface has only one minimum, we can always find it, and the minimum is unique. If we used higher exponents it would be harder to find the minimum value(s), and we could get non-unique minima or only local minima (values that look good compared to neighbouring values but are not the absolute best). So, in summary, we use squared error because it gives us a minimum that is easy to find and is guaranteed to be the only minimum (which guarantees it is the best!).
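The "single bowl-shaped minimum" claim above can be checked numerically. This is a minimal sketch with made-up toy data: it computes the sum of squared errors SSE(m, b) and the standard closed-form least-squares slope and intercept, then verifies that nudging (m, b) in any direction only increases the SSE.

```python
# Toy data (hypothetical values, for illustration only).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
n = len(x)

def sse(m, b):
    """Sum of squared VERTICAL errors for the line y = m*x + b."""
    return sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))

# Closed-form least-squares solution, obtained by setting the partial
# derivatives of SSE with respect to m and b to zero.
x_mean = sum(x) / n
y_mean = sum(y) / n
m_best = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
          / sum((xi - x_mean) ** 2 for xi in x))
b_best = y_mean - m_best * x_mean

# Moving away from (m_best, b_best) in any direction strictly
# increases the SSE -- the bowl has exactly one bottom.
for dm, db in [(0.5, 0.0), (-0.5, 0.0), (0.0, 1.0), (0.3, -0.7)]:
    assert sse(m_best + dm, b_best + db) > sse(m_best, b_best)
```

With a cube instead of a square, `sse` could be made arbitrarily negative, so there would be no minimum at all to find.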
• Is error the same thing as residual? • Not EXACTLY. The difference is that there is (in theory) a TRUE regression line which we will never know, and then there is the one that we estimate to be the regression line. The difference between the point and the TRUE regression line is your error. The difference between your point and the ESTIMATED regression line is your residual. When we fit a regression line, we make the sum of our residuals equal to 0, but that does not necessarily mean that the sum of our errors is 0 (there will always be some error in statistics, by its nature).

Here is an example that hopefully won't confuse. You can see this if you simulate some data in a spreadsheet. Suppose Y = 5 + 10X is your TRUE regression, and say we had 5 different possible inputs for x (1, 2, 3, 4, 5). Then for x = 1, y = 15; for x = 2, y = 25; and so on. Now, if each observation had some normally distributed error, and we fitted a line to that data, we might get a line that is really close to Y = 5 + 10X, but we probably would not get that exactly. We might actually get something like Y = 5.3 + 9.5X. Let's say one of our randomly scattered observations was (3, 30). The predicted value for Y, according to our fitted regression line, is 5.3 + 9.5(3) = 33.8, but the TRUE regression line (which we won't actually have) would have predicted 35 for Y. So our ERROR is 30 - 35 = -5, but our RESIDUAL is 30 - 33.8 = -3.8.
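The spreadsheet simulation described above can be sketched in a few lines of Python. The true line Y = 5 + 10X and the inputs x = 1..5 come from the example; the noise level and random seed are arbitrary choices. The sketch fits a least-squares line to the noisy observations and then computes both quantities: errors (against the true line) and residuals (against the fitted line).

```python
import random

random.seed(1)
xs = [1, 2, 3, 4, 5]
true_y = [5 + 10 * xi for xi in xs]                  # points on the TRUE line
obs_y = [yi + random.gauss(0, 2) for yi in true_y]   # noisy observations

# Ordinary least-squares fit to the observed data.
n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(obs_y) / n
m = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(xs, obs_y))
     / sum((xi - x_mean) ** 2 for xi in xs))
b = y_mean - m * x_mean

errors = [yo - yt for yo, yt in zip(obs_y, true_y)]           # vs TRUE line
residuals = [yo - (m * xi + b) for xi, yo in zip(xs, obs_y)]  # vs FITTED line

# Residuals from a least-squares fit (with an intercept) always sum to
# zero up to rounding; the errors generally do not.
assert abs(sum(residuals)) < 1e-9
```

Rerunning with different seeds gives a different fitted (m, b) each time, always near (10, 5) but essentially never equal to it, which is the point of the error/residual distinction.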
• Is it just me or does this process look a lot like taking the "variance" of the errors, with the line of best fit being where the errors vary the least? • Why is the error at the different points the vertical distance, and not the horizontal distance between the point and the line? Or, for that matter, why not the vertical AND horizontal distances summed up? • A linear regression model assumes that the relationship between the variables y and x is linear (the measured variable y depends linearly on the input variable x).
Basically, y = mx + b.
A disturbance term (noise) is added (error variable "e").
So we have y = mx + b + e, and the error is e = y - (mx + b).
We then try to find the m and b (for the line of best fit) that minimize the sum of the squared vertical distances: Sum(e^2) = Sum((y - (mx + b))^2).

There are different ways of trying to find a line of best fit. It depends on what x and y represent.
For example, if both x and y are observations with errors, and x and y have equal error variances, then what is called "Deming regression" measures the deviations perpendicular to the line of best fit.
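To make the contrast concrete, here is a hedged sketch (toy, made-up data) comparing the ordinary least-squares slope, which minimizes squared vertical distances, with the Deming slope for equal error variances, which minimizes squared perpendicular distances. Both have closed forms in terms of the usual sums of squares.

```python
# Hypothetical data with scatter in both coordinates.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.5, 5.5, 8.5, 9.5]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n
s_xx = sum((xi - x_mean) ** 2 for xi in xs)
s_yy = sum((yi - y_mean) ** 2 for yi in ys)
s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(xs, ys))

# OLS slope: minimizes squared VERTICAL distances.
m_ols = s_xy / s_xx

# Deming slope with error-variance ratio 1: minimizes squared
# PERPENDICULAR distances.
m_deming = (s_yy - s_xx + ((s_yy - s_xx) ** 2 + 4 * s_xy ** 2) ** 0.5) / (2 * s_xy)

# Both lines pass through the mean point (x_mean, y_mean).
b_ols = y_mean - m_ols * x_mean
b_deming = y_mean - m_deming * x_mean

# With noise in both coordinates, the OLS slope is attenuated
# (pulled toward zero) relative to the Deming slope.
assert m_deming > m_ols > 0
```

This is why the choice of fitting method matters: OLS treats x as exact, so when x is itself noisy, the vertical-distance fit systematically flattens the line.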
• Is there a lesson on nonlinear regressions? • Why are you interested in the squared error of the line as opposed to just the error of the line? • If you just summed the raw errors, positive and negative errors would cancel out, so even a badly fitting line could have a total error near zero. Squaring makes every term non-negative, so the errors accumulate instead of cancelling, and larger misses are penalized more heavily. It also gives the total error a single, easy-to-find minimum, as discussed in the answer above about why we square the error.
• why is this called linear "regression"? What does regression mean? • The term "regression" was used by Francis Galton in his 1886 paper "Regression towards mediocrity in hereditary stature". To my knowledge he only used the term in the context of regression toward the mean. The term was then adopted by others to get more or less the meaning it has today as a general statistical method.  