Gradient descent is a general-purpose algorithm that numerically finds minima of multivariable functions.
So what is it?
Gradient descent is an algorithm that numerically estimates where a function outputs its lowest values. That means it finds local minima, but not by setting like we've seen before. Instead of finding minima by manipulating symbols, gradient descent approximates the solution with numbers. Furthermore, all it needs in order to run is a function's numerical output, no formula required.
This distinction is worth emphasizing because it's what makes gradient descent useful. If we had a simple formula like , then we could easily solve to find that minimizes . Or we could use gradient descent to get a numerical approximation, something like . Both strategies arrive at the same answer.
But if we don't have a simple formula for our function, then it becomes hard or impossible to solve . If our function has a hundred or a million variables, then manipulating symbols isn't feasible. That's when an approximate solution is valuable, and gradient descent can give us these estimates no matter how elaborate our function is.
How gradient descent works
The way gradient descent manages to find the minima of functions is easiest to imagine in three dimensions.
Think of a function that defines some hilly terrain when graphed as a height map. We learned that the gradient evaluated at any point represents the direction of steepest ascent up this hilly terrain. That might spark an idea for how we could maximize the function: start at a random input, and as many times as we can, take a small step in the direction of the gradient to move uphill. In other words, walk up the hill.
To minimize the function, we can instead follow the negative of the gradient, and thus go in the direction of steepest descent. This is gradient descent. Formally, if we start at a point and move a positive distance in the direction of the negative gradient, then our new and improved will look like this:
More generally, we can write a formula for turning into :
Starting from an initial guess , we keep improving little by little until we find a local minimum. This process may take thousands of iterations, so we typically implement gradient descent with a computer.
Consider the function .
As we can see from the graph, this function has many local minima. Gradient descent will find different ones depending on our initial guess and our step size.
If we choose and , for example, gradient descent moves as shown in the graph below. The first point is , and lines connect each point to the next in the sequence. After only steps, we have converged to the minimum near .
If we use the same but , it seems as if the step size is too large for gradient descent to converge on the minimum. We'll return to this when we discuss the limitations of the algorithm.
If we start at with , we descend into a completely different local minimum.
Let's use gradient descent to solve the following problem: how can we best approximate with a degree polynomial within the range ?
In order to use gradient descent, we need to phrase the problem in terms of minimizing a function . Intuitively, our goal is to find the coefficients that make as close to as possible while is between and . There are many ways we can turn this idea into a function to minimize. One is called least squares:
In short, we define our as the sum of how incorrect is at each point. For example, if near , then will increase by a lot because not . Gradient descent will try to get as low as possible, which is the same as making closer to . We square the difference so the integral is always positive.
Here's what happens if we start with as random numbers and then move along the negative gradient. The number in the top left shows how many steps we've taken so far, where we use a step size of .
We get a pretty good approximation of !
Technically, the animation above uses something called momentum gradient descent, which is a variant that helps it avoid getting stuck in local minima.
Gradient descent has applications whenever we have a function we want to minimize, which is common in machine learning, for example. But it's important to know its shortcomings so that we can plan around them.
One of its limitations is that it only finds local minima (rather than the global minimum). As soon as the algorithm finds some point that's at a local minimum, it will never escape as long as the step size doesn't exceed the size of the ditch.
In the graph above, each local minimum has its own valley that would trap a gradient descent algorithm. After all, the algorithm only ever tries to go down, so once it finds a point where every direction leads up, it will stop. Looking at the graph from a different perspective, we also see that one local minimum is lower than the other.
When we minimize a function, we want to find the global minimum, but there is no way that gradient descent can distinguish global and local minima.
Another limitation of gradient descent concerns the step size . A good step size moves toward the minimum rapidly, each step making substantial progress.
Good step size converges quickly.
If the step size is too large, however, we may never converge to a local minimum because we overshoot it every time.
Large step size diverges.
If we are lucky and the algorithm converges anyway, it still might take more steps than it needed.
Large step size converges slowly.
If the step size is too small, then we'll be more likely to converge, but we'll take far more steps than were necessary. This is a problem when the function we're minimizing has thousands or millions of variables, and evaluating it is cumbersome.
Tiny step size converges slowly.
A final limitation is that gradient descent only works when our function is differentiable everywhere. Otherwise we might come to a point where the gradient isn't defined, and then we can't use our update formula.
Gradient descent fails for non-differentiable functions.
- Gradient descent minimizes differentiable functions that output a number and have any amount of input variables.
- It does this by taking a guess and successively applying the formula . In words, the formula says to take a small step in the direction of the negative gradient.
- Gradient descent can't tell whether a minimum it has found is local or global.
- The step size controls whether the algorithm converges to a minimum quickly or slowly, or whether it diverges.
- Many real world problems come down to minimizing a function.
Want to join the conversation?
- In the section of momentum gradient descent:
x1 = x0 + v1
The first equation, it will be gradient steepness,
So, the correction is:
v1=λv0 - α∇f(x0)
to be gradient descent(5 votes)
- You are correct! That sneaky minus sign makes all the difference. Thanks for pointing it out -- the typo should be fixed pretty soon.(4 votes)
- Could the gradient descent be trapped by a saddle point. Assume we are trying to minimize f = x^3, in the interval [-1,1]. let x_0 = 1/2 then the gradient descent will stop at x = 0 which is not even a local minimum.(4 votes)
- its unlikely you would land on such an exact even number such as x=0. You would typically initialise from random values as well(1 vote)
- Why all the focus on minimizing functions? Can we also use a sort of "gradient ascent" to maximize (differentiable real-valued) functions? I don't see why not...(of course though, any problem about maximizing a function f can be reframed in terms of minimizing its negation -f)(3 votes)
- Hi, tau is indeed better than pi, though I did celebrate pi day today! You are absolutely right. By considering -f(x), we turn any maximization problem into a minimization problem. Thus, the ideas are all the same really, and we pick either to minimize or maximize mostly arbitrarily. In the context of machine learning, the custom is to define a loss function because it is often easier to count how many mistakes an algorithm makes than to count how well it is doing. Thus we minimize the loss.(3 votes)
- Any python code written for approximating the sinx function?(2 votes)
It should be fairly straightforward to translate the ideas into Python. Note: if you want the best approximation for sin(x), there are methods that will produce better results than gradient descent. Taylor series or least-squares approximations come to mind. But it's just really cool to see gradient descent in action! Let me know if you have other questions :)(4 votes)
- does gradient descent work on multi-output functions?(2 votes)
- No it will not on its own; however most applications of the algorithm can be turned into single variable with a couple of equations. In fact the gradient descent algorithm is extremely effective for deep-learning because of this fact.(1 vote)
- Does anyone have a link to a coding exercise for understanding the algorithm practically? It would be really helpful(2 votes)
- Could you algebraically show the first few iterations for example 2 so it becomes a little clearer?
Edit: OK, part of my confusion is that I am thinking of x as the variable, when actually its a 6 variable function: f(a0,a1,a2,a3,a4,a5). So I guess the gradient has 6 elements, and so on? ie, for a given x, we are operating in a 6-D input space?(1 vote)
- Hi bkmurthy99, you are absolutely correct. The function f(a0, a1, a2, a3, a4, a5) has six inputs, so the analogy is we're rolling a ball down a hill in six dimensions to find where f is minimized. Like you said, the variable x is not related to how good of an approximation we have -- it's just a dummy variable we use to evaluate any given approximation p. It's a lot to keep in your head. That's one reason gradient descent is so cool. It doesn't care whether we need to optimize one variable, two, six, or a million!(3 votes)
- This might be asking a lot, but you could add another article, or extend this one, to include stochastic gradient descent and mini-batch gradient descent?(1 vote)
- Dear Shanie, I don't know of any plans to add another article on SGD or mini-batch methods. These would go a little beyond the scope of a multivariable calculus course, though it would be amazing of course if Khan Academy ever developed a machine learning course. In the meantime, there are many good resources online to learn about the methods you mentioned.(2 votes)
- It's not clear to me that the negative of the gradient is the direction of steepest descent from the fact that stepping in the direction of the gradient is the steepest ascent... I can imagine being at a point where orthogonally to the gradient is a steep drop off.
I am actually sort of picturing a saddle point in my mind... does anyone have any thoughts?(1 vote)
- At a saddle point, the gradient is zero. Orthogonally to the directions in which the function is increasing equally, the function is decreasing equally. Just like when the function increases the most equally in several directions, when the function decreases the most in two directions the negative gradient should be zero. At a saddle point, both the gradient and the negative gradient are zero, as there are two directions of "steepest" ascent or descent. The same logic applies to other critical points.(1 vote)
- Hello. Can you explain briefly why we are using integral in the least-squares method?(1 vote)