Proof (part 3) minimizing squared error to regression line
Created by Sal Khan.
- Why can't you divide "mean of x^2" with the "mean of x" like normal "x^2/x" and end up with just "mean of x"?(4 votes)
- In notation, the mean of x is:
xbar = Σ(xi) / n
That is: we add up all the numbers xi, and divide by how many there are.
But the "mean of x^2" is not the square of the mean of x. We square each value, then add them up, and then divide by how many there are. Let's call it x2bar:
x2bar = Σ(xi^2) / n
Now, x2bar is not the same as xbar^2. The reason is that we're squaring and then adding up. Those two operations are not interchangeable: the "sum of the squares" is not equal to the "square of the sum". You can try it out with a really small exercise in algebra. Take two numbers a and b, and check whether (a + b)^2 is equal to (a^2 + b^2). The second expression is pretty simple: we just have the square of a and the square of b. The first expression needs to be expanded first:
(a + b)^2 = (a + b)*(a + b)
(a + b)^2 = a^2 + 2*a*b + b^2
Now compare that to (a^2 + b^2). We have the two squared terms, but we also have that 2*a*b term, or the "cross product". That cross product is what makes the "sum of the squares" and the "square of the sum" unequal, and hence why "mean of x^2" divided by "mean of x" doesn't give us "mean of x".(6 votes)
- Intuitively it makes sense that there would only be one best fit line. But isn't it true that setting the partial derivatives with respect to m and b equal to zero would only locate a LOCAL minimum in the 3D "bowl"? There could be other minima present with partial derivatives both equal to zero. Correct? And who's to say which minimum is the global minimum, if it exists?
Also, intuitively, there is no maximum for the best fit line, but the partial derivatives would equal zero at a maximum point on the 3D surface as well, right?(3 votes)
- Under the assumptions of linear regression, that won't happen. The "loss function" (that is, how we measure the closeness of the predictions, in this case the sum of squared residuals) is convex, so the surface won't be bumpy like you're envisioning. It is a smooth bowl, and for a convex function any point where both partial derivatives are zero is a global minimum.
And yes, at any local maximum or local minimum of a smooth surface, the partial derivatives will be zero.(5 votes)
- The 3D surface is explained to be parabolic. How do we know that it's going to be parabolic? Why not some other three-dimensional surface, like a cylinder or a cone?(4 votes)
- Thanks for the extremely helpful video series. At 9:08 Sal is dividing the equation by the mean of x. What happens when the mean of the x's is zero? Is the derivation/formula valid in such cases?(4 votes)
- Likewise, at 4:07, how did Sal take the partial derivative of nb^2 and get 2nb (or 2bn)? Also, I thought the object here was to factor out the b?(3 votes)
- It's just a basic rule for derivatives. Deriv. of x^2 is 2*x.
He's finding the derivative w.r.t. b when everything else in the term is held constant.(3 votes)
- I don't understand the part at 7:14. xy-bar does not equal x-bar * y-bar. How is he dividing out an x-bar to obtain this y-bar?(2 votes)
- Oh, I got it. He didn't derive it like that. He used the other partial derivative as the starting point.(4 votes)
- At 4:15, how did he get 2bn? Shouldn't it be just bn by taking out one b?(3 votes)
- I think I am missing some basic knowledge here. Can you please direct me to more basic videos on partial derivatives? Thanks a ton for your answer.(2 votes)
- Where can I find more information regarding the surface that Sal drew at the beginning?(2 votes)
- Why is the derivative used? What does that mean?(2 votes)
- It's a rate of change. Velocity is the rate of change of distance covered. That is, the greater your velocity, the faster the distance covered is changing.
Sal showed that as you change the slope (m) of the best fit line, the error changes; and as you change the y-intercept (b) the error also changes. He found algebraic expressions for these rates of change.(2 votes)
- How did the 3-D surfaces come into play?(2 votes)
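
As a quick numeric check of the first reply above (the mean of x^2 is not the square of the mean of x), here is a minimal Python sketch. The sample values and the helper name `mean` are illustrative assumptions, not data from the video.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Hypothetical sample data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 6.0]

x_bar = mean(xs)                     # mean of x
x2_bar = mean([x * x for x in xs])   # mean of x^2: square first, then average

print("mean of x:              ", x_bar)
print("mean of x^2:            ", x2_bar)
print("(mean of x)^2:          ", x_bar ** 2)
print("mean of x^2 / mean of x:", x2_bar / x_bar)
```

For this sample the mean of x is 3.0, the mean of x^2 is 12.5, and the square of the mean is 9.0, so dividing the mean of x^2 by the mean of x does not give back the mean of x. The two quantities agree only when every value in the sample is the same.
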
All right, so where we left off, we had simplified our algebraic expression for the squared error to the line from the n data points. We kind of visualized it. This expression right here would be a surface, I guess you could view it as a surface in three dimensions, where any m and b gives a point on that surface that represents the squared error for that line. Our goal is to find the m and the b, which would define an actual line, that minimize the squared error.
The way that we do that is we find a point where the partial derivative of the squared error with respect to m is 0, and the partial derivative with respect to b is also equal to 0. So it's flat with respect to m. That means that the slope in this direction is going to be 0. Let me do it in the same color. So the slope in this direction, that's the partial derivative with respect to m, is going to be flat. It's not going to change in that direction. The partial derivative with respect to b is going to be flat as well. So it will be a flat point right over there. The slope at that point in that direction will also be 0, and that is our minimum point. So let's figure out the m and b that give us this.
So let's take the partial derivative of this expression with respect to m. Well, this first term has no m terms in it. So it's a constant from the point of view of m. Just as a reminder, taking a partial derivative is just like taking a regular derivative. You're just assuming that everything but the variable that you're taking the partial derivative with respect to is a constant. So in this expression, all the x's, the y's, the b's, the n's, those are all constant. The only variable that matters when we take the partial derivative with respect to m is the m. So this is a constant. There's no m here. This term right over here, we're taking with respect to m. So the derivative of this with respect to m is just the coefficient on the m.
So negative 2 times n times the mean of the xy's, that's the partial of this with respect to m. Then this term right here has no m's in it. So it's constant with respect to m, and its partial derivative with respect to m is 0. Then this term here, you have n times the mean of the x squareds times m squared. So this is going to be, since we're taking a partial derivative with respect to m, 2 times n times the mean of the x squareds times m. The derivative of m squared is 2m, and then you just have this coefficient there as well. Now this term, you also have an m over there. Everything else is just a coefficient on this m. So the derivative with respect to m is 2bn times the mean of the x's. If I took the derivative of 3m, the derivative is just 3. It's just the coefficient on it. Then finally, this is a constant with respect to m. So we don't see it. So this is the partial derivative with respect to m. That's that right over there. We want to set this equal to 0.
Now let's do the same thing with respect to b. This term, once again, is a constant from the perspective of b. There's no b here. There's no b over here. So the partial derivative of either of these with respect to b is 0. Then over here you have negative 2n times the mean of the y's as a coefficient on a b. So the partial derivative with respect to b is going to be negative 2n times the mean of the y's. Then there's no b over here. Then we do have a b over here. So it's plus 2mn times the mean of the x's. This is essentially the coefficient on the b over here. It was written in a mixed order, but all of these are constants from the point of view of b. They are the coefficient in front of the b. The partial derivative of that with respect to b is just going to be the coefficient. Then finally, the partial derivative of this with respect to b is going to be 2nb, or 2nb to the first, you could even say. We want to set this equal to 0.
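
For readers following along without the video, the two partial derivatives described above can be written out as follows. This is a reconstruction from the spoken description, with overlines denoting means over the n data points. The constant first term, written here as $n\,\overline{y^2}$, is an assumption, since the transcript only says that it contains no m and no b.

```latex
\begin{align*}
SE(m, b) &= n\,\overline{y^2} - 2mn\,\overline{xy} - 2bn\,\overline{y}
            + m^2 n\,\overline{x^2} + 2mbn\,\overline{x} + n b^2 \\[4pt]
\frac{\partial SE}{\partial m}
         &= -2n\,\overline{xy} + 2mn\,\overline{x^2} + 2bn\,\overline{x} = 0 \\[4pt]
\frac{\partial SE}{\partial b}
         &= -2n\,\overline{y} + 2mn\,\overline{x} + 2nb = 0
\end{align*}
```
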
So it looks very complicated. But remember, we're just trying to solve for the m and the b. We have two equations with two unknowns here, the m and the b. To simplify this, both sides of both equations, the top one and the bottom one, are divisible by 2n. I mean, 0 is divisible by anything. It'll just be 0. So let's divide the top equation by 2n and see what we get. If we divide the top equation by 2n, this'll become just 1. That goes away, and then those go away. You would just be left with the negative mean of the xy's plus m times the mean of the x squareds, plus b times the mean of the x's, is equal to 0. That's this first expression when you divide both sides by 2n. In the second expression, when you divide it by 2n, that'll go away, that will go away, and then those will go away. You're just left with the negative mean of the y's plus m times the mean of the x's plus b is equal to 0. So if we find the m and the b values that satisfy this system of equations, we have minimized the squared error.
We could just solve it in a traditional way. But I want to rewrite this, because I think it's kind of interesting to see what these really represent. So let's add the mean of the xy's to both sides of this top equation. What do we get? We get m times the mean of the x squareds plus b times the mean of the x's is equal to, these are going to cancel out, the mean of the xy's. That's that top equation. For this bottom equation, right here, let's add the mean of y to both sides. I do that so that that cancels out. And then we're left with, I'll do that in the blue color to show you it's the same equation, m times the mean of the x's plus b is equal to the mean of the y's. Now, I actually want to get both of these into mx plus b form.
This is actually already there. Our best-fitting line is going to be y is equal to mx plus b. We still have to find the m and the b, but the m and the b that satisfy both of these equations are going to be the m and the b on that best-fitting line. So that best-fitting line actually contains the point, and we get this from the second equation right here, I should write it this way: the coordinate mean of x, mean of y lies on the line. And you could see it right over here. If you put the mean of x in for x, with the optimal m and b, you are going to get the mean of the y's. So that's interesting. Let's never forget what we're even trying to do. This optimal line is going to contain some point on it, let me do that in a new color, that is the mean of all of the x values and the mean of all the y values. That's just interesting. It kind of makes intuitive sense.
Now this other equation, just to get it in the same form. Then it will actually become an easier way to solve the system. You could solve this a million different ways. But just to give us an intuition of what is even going on here, what's another point that's on the line? Because if you have two points on the line, you know what the equation of the line is going to be. Well, for the other point, we want this to be in mx plus b form. So let's divide both sides of this equation by this term right here, by the mean of the x's. If we do that, we get m times the mean of the x squareds divided by the mean of the x's plus b is equal to the mean of the xy's divided by the mean of the x's.
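
Written out, the two equations in mx plus b form look like this. It is a sketch using overlines for means over the data points (a notation assumed here, not shown in the transcript), and the second line is only defined when the mean of the x's is not 0.

```latex
\begin{align*}
m\,\overline{x} + b &= \overline{y} \\[4pt]
m\,\frac{\overline{x^2}}{\overline{x}} + b &= \frac{\overline{xy}}{\overline{x}}
  \qquad (\overline{x} \neq 0)
\end{align*}
```
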
So when you write it in this form, and this is the exact same equation as that, I just divided both sides by the mean of the x's, you get another interesting point that will lie on this optimal fitting line, at least from the point of view of the squared distances. So for another point that will lie on this optimal line, the x value is going to be the mean of the x squareds divided by the mean of the x's. Then the y value is going to be the mean of the xy's divided by the mean of the x's. I'll let you think about that a little bit more. But already, these are two points that lie on the line, so both of these are on the best-fitting line based on how we're measuring a good fit, which is the squared distance. These are on the line that minimizes that squared distance.
This is turning into like a six or seven video saga on trying to find the formula for the best-fitting line. But it's interesting. There are all sorts of neat little mathematical things to ponder over here. In the next video, we can actually use this information. We could have just solved the system straight up. But we can actually use this information right here to solve for our m and b. Maybe we'll do it both ways depending on my mood.
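
To make the two-point observation concrete, here is a minimal Python sketch that solves the two equations for m and b by eliminating b, then checks that both points lie on the resulting line. The data and the function names `summary_means` and `fit_line` are hypothetical, and solving the system directly is only a stand-in for whichever method the next video uses.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


def summary_means(xs, ys):
    """Return (mean x, mean y, mean xy, mean x^2) for paired data."""
    x_bar = mean(xs)
    y_bar = mean(ys)
    xy_bar = mean([x * y for x, y in zip(xs, ys)])
    x2_bar = mean([x * x for x in xs])
    return x_bar, y_bar, xy_bar, x2_bar


def fit_line(xs, ys):
    """Solve the system
         m * mean(x^2) + b * mean(x) = mean(xy)
         m * mean(x)   + b           = mean(y)
    for (m, b) by eliminating b."""
    x_bar, y_bar, xy_bar, x2_bar = summary_means(xs, ys)
    denom = x2_bar - x_bar ** 2
    if denom == 0:
        raise ValueError("all x values are equal, so the slope is undefined")
    m = (xy_bar - x_bar * y_bar) / denom
    b = y_bar - m * x_bar
    return m, b


# Hypothetical data points, chosen only for illustration.
xs = [1.0, 2.0, 4.0]
ys = [2.0, 1.0, 3.0]

m, b = fit_line(xs, ys)
x_bar, y_bar, xy_bar, x2_bar = summary_means(xs, ys)

# First point: (mean of x, mean of y).
on_line_1 = abs((m * x_bar + b) - y_bar) < 1e-9

# Second point: (mean of x^2 / mean of x, mean of xy / mean of x).
# This point only exists when the mean of x is nonzero.
p2_x = x2_bar / x_bar
p2_y = xy_bar / x_bar
on_line_2 = abs((m * p2_x + b) - p2_y) < 1e-9

print("m =", m, "b =", b)
print("first point on line: ", on_line_1)
print("second point on line:", on_line_2)
```

When the mean of the x's is 0, the second point is undefined because of the division, but the system itself is still solvable: the second equation gives b equal to the mean of the y's, and the first gives m as the mean of the xy's divided by the mean of the x squareds.
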