Gradient descent

Gradient descent is a general-purpose algorithm that numerically finds minima of multivariable functions.

So what is it?

Gradient descent is an algorithm that numerically estimates where a function outputs its lowest values. That means it finds local minima, but not by setting ∇f = 0 like we've seen before. Instead of finding minima by manipulating symbols, gradient descent approximates the solution with numbers. Furthermore, all it needs in order to run is a function's numerical output, no formula required.
This distinction is worth emphasizing because it's what makes gradient descent useful. If we had a simple formula like f(x) = x^2 − 4x, then we could easily solve ∇f = 0 to find that x = 2 minimizes f(x). Or we could use gradient descent to get a numerical approximation, something like x ≈ 1.99999967. Both strategies arrive at the same answer.
But if we don't have a simple formula for our function, then it becomes hard or impossible to solve ∇f = 0. If our function has a hundred or a million variables, then manipulating symbols isn't feasible. That's when an approximate solution is valuable, and gradient descent can give us these estimates no matter how elaborate our function is.
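To make the comparison concrete, here is a minimal sketch in Python of that numerical strategy applied to f(x) = x^2 − 4x, using the update rule developed in the next section. The starting point, step size, and iteration count are illustrative choices, not values from the article.

    # Gradient descent on f(x) = x^2 - 4x, whose exact minimum is at x = 2.
    def grad(x):
        return 2 * x - 4              # derivative of x^2 - 4x

    x = 10.0                          # initial guess
    alpha = 0.1                       # step size
    for _ in range(100):
        x = x - alpha * grad(x)       # step against the derivative

    print(x)                          # prints a value very close to 2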

How gradient descent works

The way gradient descent manages to find the minima of functions is easiest to imagine in three dimensions.
Think of a function f(x,y) that defines some hilly terrain when graphed as a height map. We learned that the gradient evaluated at any point represents the direction of steepest ascent up this hilly terrain. That might spark an idea for how we could maximize the function: start at a random input, and as many times as we can, take a small step in the direction of the gradient to move uphill. In other words, walk up the hill.
To minimize the function, we can instead follow the negative of the gradient, and thus go in the direction of steepest descent. This is gradient descent. Formally, if we start at a point x0 and move a positive distance α in the direction of the negative gradient, then our new and improved x1 will look like this:
x_1 = x_0 − α∇f(x_0)
More generally, we can write a formula for turning x_n into x_{n+1}:
x_{n+1} = x_n − α∇f(x_n)
Starting from an initial guess x0, we keep improving little by little until we find a local minimum. This process may take thousands of iterations, so we typically implement gradient descent with a computer.
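Here is one way that loop might look in code. This is a rough sketch rather than an implementation from this article: it assumes we can supply a function grad_f that returns the gradient as a NumPy array, and the demo function at the bottom is just an illustration.

    import numpy as np

    def gradient_descent(grad_f, x0, alpha=0.1, steps=1000):
        """Repeatedly apply x_{n+1} = x_n - alpha * grad_f(x_n)."""
        x = np.asarray(x0, dtype=float)
        for _ in range(steps):
            x = x - alpha * grad_f(x)
        return x

    # Demo: minimize f(x, y) = (x - 1)^2 + (y + 2)^2, whose minimum is at (1, -2).
    grad = lambda v: np.array([2 * (v[0] - 1), 2 * (v[1] + 2)])
    print(gradient_descent(grad, x0=[5.0, 5.0]))   # approximately [ 1. -2.]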

Example 1

Consider the function f(x) = (x^2 cos(x) − x) / 10.
As we can see from the graph, this function has many local minima. Gradient descent will find different ones depending on our initial guess and our step size.
If we choose x_0 = 6 and α = 0.2, for example, gradient descent moves as shown in the graph below. The first point is x_0, and lines connect each point to the next in the sequence. After only 10 steps, we have converged to the minimum near x = 4.
If we use the same x_0 but α = 1.5, it seems as if the step size is too large for gradient descent to converge on the minimum. We'll return to this when we discuss the limitations of the algorithm.
If we start at x_0 = 7 with α = 0.2, we descend into a completely different local minimum.
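For readers who want to reproduce the first run above, here is a short sketch. The derivative is worked out by hand for the f given above, and the exact trajectory should be treated as illustrative.

    import math

    def df(x):
        # derivative of f(x) = (x^2 cos(x) - x) / 10
        return (2 * x * math.cos(x) - x**2 * math.sin(x) - 1) / 10

    x = 6.0       # initial guess x_0
    alpha = 0.2   # step size
    for n in range(10):
        x = x - alpha * df(x)
        print(n + 1, round(x, 4))
    # x settles near the local minimum around x = 4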

Example 2

Let's use gradient descent to solve the following problem: how can we best approximate sin(x) with a degree 5 polynomial within the range −3 < x < 3?
p(x) = a_0 + a_1 x + ⋯ + a_5 x^5
In order to use gradient descent, we need to phrase the problem in terms of minimizing a function f. Intuitively, our goal is to find the coefficients a_0, …, a_5 that make p(x) as close to sin(x) as possible while x is between −3 and 3. There are many ways we can turn this idea into a function to minimize. One is called least squares:
f(a_0, …, a_5) = ∫_{-3}^{3} (p(x) − sin(x))^2 dx
In short, we define our f as the sum (here, an integral) of how incorrect p(x) is at each point. For example, if p(x) = 2 near x = 0, then f will increase by a lot, because sin(0) = 0, not 2. Gradient descent will try to get f as low as possible, which is the same as making p(x) closer to sin(x). We square the difference so the integral is never negative.
Here's what happens if we start with a0,,a5 as random numbers and then move along the negative gradient. The number in the top left shows how many steps we've taken so far, where we use a step size of α=0.001.
We get a pretty good approximation of sin(x)!
p(x) = 0.00177 + 0.88974x + 0.00287x^2 − 0.11613x^3 − 0.00048x^4 + 0.00224x^5
Technically, the animation above uses something called momentum gradient descent, which is a variant that helps it avoid getting stuck in local minima.
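For the curious, here is a rough Python sketch of how such a fit could be coded. It is not the implementation behind the animation: the integral is replaced by a Riemann sum over sample points, and the grid, momentum coefficient, step size, and iteration count are all choices made here for illustration.

    import numpy as np

    # Sample points in (-3, 3); the integral is approximated by a Riemann sum.
    xs = np.linspace(-3, 3, 601)
    dx = xs[1] - xs[0]
    target = np.sin(xs)
    powers = np.vstack([xs**k for k in range(6)])        # rows are x^0, x^1, ..., x^5

    a = np.random.default_rng(0).normal(size=6) * 0.1    # random starting coefficients a_0..a_5
    v = np.zeros(6)                                       # momentum "velocity"
    alpha, beta = 3e-5, 0.9                               # step size and momentum coefficient

    for _ in range(100_000):
        residual = a @ powers - target                    # p(x) - sin(x) at every sample point
        grad = 2 * (powers @ residual) * dx               # df/da_k ~ integral of 2 (p(x) - sin(x)) x^k dx
        v = beta * v - alpha * grad                       # momentum update
        a = a + v                                         # move the coefficients

    print(np.round(a, 5))    # fitted coefficients a_0, ..., a_5

The exact numbers you get will differ a bit from the polynomial above, since they depend on how the integral is discretized and how long the descent runs.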

Limitations

Gradient descent has applications whenever we have a function we want to minimize, which is common in machine learning, for example. But it's important to know its shortcomings so that we can plan around them.
One of its limitations is that it only finds local minima (rather than the global minimum). As soon as the algorithm finds some point that's at a local minimum, it will never escape as long as the step size doesn't exceed the size of the ditch.
In the graph above, each local minimum has its own valley that would trap a gradient descent algorithm. After all, the algorithm only ever tries to go down, so once it finds a point where every direction leads up, it will stop. Looking at the graph from a different perspective, we also see that one local minimum is lower than the other.
When we minimize a function, we want to find the global minimum, but there is no way that gradient descent can distinguish global and local minima.
Another limitation of gradient descent concerns the step size α. A good step size moves toward the minimum rapidly, each step making substantial progress.

Good step size converges quickly.
If the step size is too large, however, we may never converge to a local minimum because we overshoot it every time.

Large step size diverges.
If we are lucky and the algorithm converges anyway, it still might take more steps than it needed.

Large step size converges slowly.
If the step size is too small, then we'll be more likely to converge, but we'll take far more steps than were necessary. This is a problem when the function we're minimizing has thousands or millions of variables, and evaluating it is cumbersome.

Tiny step size converges slowly.
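The three behaviors above are easy to see on a simple function. Here is a small sketch; the function f(x) = x^2 and the particular step sizes are choices made for illustration, not the examples pictured above.

    def run(alpha, steps=30, x0=1.0):
        """Gradient descent on f(x) = x^2 (so f'(x) = 2x); returns the final x."""
        x = x0
        for _ in range(steps):
            x = x - alpha * 2 * x
        return x

    print(run(0.4))     # good step size: essentially 0 after 30 steps
    print(run(1.1))     # too large: each step overshoots further, and x blows up
    print(run(0.9))     # large but stable: hops back and forth across 0, converging slowly
    print(run(0.001))   # tiny: still far from the minimum after 30 steps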
A final limitation is that gradient descent only works when our function is differentiable everywhere. Otherwise we might come to a point where the gradient isn't defined, and then we can't use our update formula.

Gradient descent fails for non-differentiable functions.
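One simple way to see this is with f(x) = |x| (an example chosen here for illustration), whose derivative is +1 or −1 away from 0 and undefined at 0. With a fixed step size, the iterates just hop back and forth across the kink:

    def abs_grad(x):
        # derivative of |x| is +1 or -1, but it is not defined at x = 0
        if x == 0:
            raise ValueError("gradient of |x| is undefined at 0")
        return 1.0 if x > 0 else -1.0

    x, alpha = 0.3, 0.2
    for _ in range(6):
        x = x - alpha * abs_grad(x)
        print(round(x, 2))     # prints 0.1, -0.1, 0.1, -0.1, ...: never settles at the minimum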

Summary

  • Gradient descent minimizes differentiable functions that output a number and have any number of input variables.
  • It does this by taking a guess x_0 and successively applying the formula x_{n+1} = x_n − α∇f(x_n). In words, the formula says to take a small step in the direction of the negative gradient.
  • Gradient descent can't tell whether a minimum it has found is local or global.
  • The step size α controls whether the algorithm converges to a minimum quickly or slowly, or whether it diverges.
  • Many real world problems come down to minimizing a function.

Want to join the conversation?

  • Muhammed El-Yamani:
    In the section on momentum gradient descent, the equations are:
    v_1 = λv_0 + α∇f(x_0)
    x_1 = x_0 + v_1
    The first equation, as written, would climb the gradient, so the correction is:
    v_1 = λv_0 − α∇f(x_0)
    to be gradient descent.
  • 𝜏 Is Better Than 𝝅:
    Why all the focus on minimizing functions? Can we also use a sort of "gradient ascent" to maximize (differentiable real-valued) functions? I don't see why not...(of course though, any problem about maximizing a function f can be reframed in terms of minimizing its negation -f)
    • lakern:
      Hi, tau is indeed better than pi, though I did celebrate pi day today! You are absolutely right. By considering -f(x), we turn any maximization problem into a minimization problem. Thus, the ideas are all the same really, and we pick either to minimize or maximize mostly arbitrarily. In the context of machine learning, the custom is to define a loss function because it is often easier to count how many mistakes an algorithm makes than to count how well it is doing. Thus we minimize the loss.
  • Amy:
    Could gradient descent be trapped by a saddle point? Assume we are trying to minimize f = x^3 on the interval [-1, 1]. Let x_0 = 1/2; then gradient descent will stop at x = 0, which is not even a local minimum.
  • siddiqr67:
    Any Python code written for approximating the sin(x) function?
  • Jonathan MFRC:
    Does gradient descent work on multi-output functions?
  • ShanieGlean:
    This might be asking a lot, but could you add another article, or extend this one, to include stochastic gradient descent and mini-batch gradient descent?
    • lakern:
      Dear Shanie, I don't know of any plans to add another article on SGD or mini-batch methods. These would go a little beyond the scope of a multivariable calculus course, though it would be amazing of course if Khan Academy ever developed a machine learning course. In the meantime, there are many good resources online to learn about the methods you mentioned.
  • Ronald:
    Does anyone have a link to a coding exercise for understanding the algorithm practically? It would be really helpful
  • bkmurthy99:
    Could you algebraically show the first few iterations for example 2 so it becomes a little clearer?
    Edit: OK, part of my confusion is that I am thinking of x as the variable, when actually it's a 6-variable function: f(a0, a1, a2, a3, a4, a5). So I guess the gradient has 6 elements, and so on? I.e., for a given x, we are operating in a 6-D input space?
    • lakern:
      Hi bkmurthy99, you are absolutely correct. The function f(a0, a1, a2, a3, a4, a5) has six inputs, so the analogy is we're rolling a ball down a hill in six dimensions to find where f is minimized. Like you said, the variable x is not related to how good of an approximation we have -- it's just a dummy variable we use to evaluate any given approximation p. It's a lot to keep in your head. That's one reason gradient descent is so cool. It doesn't care whether we need to optimize one variable, two, six, or a million!
  • nick.soberon:
    It's not clear to me that the negative of the gradient is the direction of steepest descent from the fact that stepping in the direction of the gradient is the steepest ascent... I can imagine being at a point where orthogonally to the gradient is a steep drop off.

    I am actually sort of picturing a saddle point in my mind... does anyone have any thoughts?
    • Charles Morelli:
      Picture a tangent plane at a point on your function's 'surface'...
      On that plane (in general), some directions are steeper than others, 2 directions are level (these are opposite to each other); one direction will be the steepest ascent and one (the opposite) direction will be the steepest descent (these are perpendicular to the level directions)...
      The gradient is the direction of steepest ascent on that plane (it is perpendicular (orthogonal, if you prefer) to 'contour' lines of the function's surface)...
      Given that the gradient gives the vector (direction) of steepest ascent on the tangent plane at the point of interest, it seems logical (to me, at least) that the negative gradient vector must give the direction of steepest descent.
      If there is a sudden change of 'steepness' (I can only picture this as a 'corner' or 'edge' on the surface) at the point in question, the function is not differentiable at that point, so its gradient won't be defined.
      It might be that you're visualising the gradient vector as a small (finite) step in the direction of steepest ascent - I can see how that might lead to the idea you described... however, the gradient is an infinitesimal step... if this is the issue, try to visualise what happens as the step you take gets ever smaller, approaching zero length... then I think you're more likely to see that the opposite direction (i.e., the negative) to the gradient vector (remembering that the gradient is only valid at the exact point it's calculated for) must be the direction of steepest descent for an infinitesimal step.
      If your point is the special (i.e., non-general) case of a saddle point, then as Quinn says: the gradient is zero (and all directions on the tangent plane at that point are level).
  • Dzmitry Dauhalevich:
    Hello. Can you explain briefly why we are using an integral in the least-squares method?