# Gradient descent

Gradient descent is a general-purpose algorithm that numerically finds minima of multivariable functions.

## So what is it?

Gradient descent is an algorithm that numerically estimates where a function outputs its lowest values. That means it finds local minima, but not by setting ∇f = 0 like we've seen before. Instead of finding minima by manipulating symbols, gradient descent approximates the solution with numbers. Furthermore, all it needs in order to run is a function's numerical output; no formula is required.

This distinction is worth emphasizing because it's what makes gradient descent useful. If we had a simple formula like f(x) = x² − 4x, then we could easily solve ∇f = 0 to find that x = 2 minimizes f(x). Or we could use gradient descent to get a numerical approximation, something like x ≈ 1.99999967. Both strategies arrive at the same answer.

But if we don't have a simple formula for our function, then it becomes hard or impossible to solve ∇f = 0. If our function has a hundred or a million variables, then manipulating symbols isn't feasible. That's when an approximate solution is valuable, and gradient descent can give us these estimates no matter how elaborate our function is.

## How gradient descent works

The way gradient descent manages to find the minima of functions is easiest to imagine in three dimensions.

Think of a function f(x, y) that defines some hilly terrain when graphed as a height map. We learned that the gradient evaluated at any point represents the direction of steepest ascent up this hilly terrain. That might spark an idea for how we could *maximize* the function: start at a random input, and as many times as we can, take a small step in the direction of the gradient to move uphill. In other words, walk up the hill.

To *minimize* the function, we can instead follow the negative of the gradient, and thus go in the direction of steepest descent. This is gradient descent. Formally, if we start at a point x₀ and move a positive distance α in the direction of the negative gradient, then our new and improved x₁ will look like this:

x₁ = x₀ − α∇f(x₀)

More generally, we can write a formula for turning xₙ into xₙ₊₁:

xₙ₊₁ = xₙ − α∇f(xₙ)

Starting from an initial guess x₀, we keep improving little by little until we find a local minimum. This process may take thousands of iterations, so we typically implement gradient descent with a computer.
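The update rule above is only a few lines of code. The sketch below is an illustration, not a production implementation; the example function, starting point, and step count are made up for demonstration.

```python
def gradient_descent(grad, x0, alpha=0.1, steps=1000):
    """Repeatedly apply the update x_{n+1} = x_n - alpha * grad(x_n)."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x

# Example: f(x, y) = (x - 1)^2 + (y + 2)^2 has gradient (2(x - 1), 2(y + 2))
# and a single minimum at (1, -2).
minimum = gradient_descent(lambda v: [2 * (v[0] - 1), 2 * (v[1] + 2)], [0.0, 0.0])
print(minimum)  # approximately [1.0, -2.0]
```

Note that the algorithm only ever calls `grad`; it never needs a symbolic formula for the function itself, which is what makes it so broadly applicable.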

## Example 1

Consider the function f(x) = (x² cos(x) − x) / 10.

As we can see from the graph, this function has many local minima. Gradient descent will find different ones depending on our initial guess and our step size.

If we choose x₀ = 6 and α = 0.2, for example, gradient descent moves as shown in the graph below. The first point is x₀, and lines connect each point to the next in the sequence. After only 10 steps, we have converged to the minimum near x = 4.

If we use the same x₀ but α = 1.5, it seems as if the step size is too large for gradient descent to converge on the minimum. We'll return to this when we discuss the limitations of the algorithm.

If we start at x₀ = 7 with α = 0.2, we descend into a completely different local minimum.
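This example is easy to reproduce in code. The sketch below estimates the derivative numerically with a central difference, so no formula for f′ is needed; the step count of 100 is an assumption for illustration (the article's graph shows convergence in about 10 steps).

```python
import math

def f(x):
    return (x**2 * math.cos(x) - x) / 10

def numerical_derivative(f, x, h=1e-5):
    # Central difference: only requires numerical outputs of f.
    return (f(x + h) - f(x - h)) / (2 * h)

x = 6.0       # initial guess x_0 = 6
alpha = 0.2   # step size from the article
for _ in range(100):
    x -= alpha * numerical_derivative(f, x)

print(x)  # lands at the local minimum near x = 4 (around 3.7)
```

Changing the starting point to `x = 7.0` sends the same loop into a different valley, matching the behavior described above.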

## Example 2

Let's use gradient descent to solve the following problem:

*How can we best approximate sin(x) with a degree-5 polynomial within the range −3 < x < 3?*

In order to use gradient descent, we need to phrase the problem in terms of minimizing a function f. Intuitively, our goal is to find the coefficients a₀, …, a₅ that make p(x) = a₀ + a₁x + ⋯ + a₅x⁵ as close to sin(x) as possible while x is between −3 and 3. There are many ways we can turn this idea into a function to minimize. One is called least squares:

f(a₀, …, a₅) = ∫₋₃³ (p(x) − sin(x))² dx

In short, we define our f as the sum of how incorrect p(x) is at each point. For example, if p(x) = 2 near x = 0, then f will increase by a lot, because sin(0) = 0, not 2. Gradient descent will try to get f as low as possible, which is the same as making p(x) closer to sin(x). We square the difference so the integral is always positive.

Here's what happens if we start with a₀, …, a₅ as random numbers and then move along the negative gradient. The number in the top left shows how many steps we've taken so far, where we use a step size of α = 0.001.

We get a pretty good approximation of sin(x)!
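Here is one way to sketch this setup in Python. It is an assumption-laden illustration: the integral is replaced by a Riemann sum, the polynomial is p(x) = a₀ + a₁x + … + a₅x⁵, and the step size is much smaller than the article's α = 0.001 because this plain version has no momentum. With these choices the loss decreases steadily, though matching the animation's fit would take far more iterations.

```python
import math

def p(a, x):
    # p(x) = a0 + a1*x + ... + a5*x^5
    return sum(c * x**k for k, c in enumerate(a))

# Riemann-sum approximation of the least-squares integral over [-3, 3].
N = 100
xs = [-3 + 6 * i / N for i in range(N + 1)]
dx = 6 / N

def loss(a):
    return sum((p(a, x) - math.sin(x))**2 for x in xs) * dx

def gradient(a):
    # d(loss)/d(a_k) is the integral of 2 * (p(x) - sin(x)) * x^k dx
    g = [0.0] * 6
    for x in xs:
        err = 2 * (p(a, x) - math.sin(x)) * dx
        for k in range(6):
            g[k] += err * x**k
    return g

a = [0.0] * 6     # neutral starting guess (the article starts from random numbers)
alpha = 1e-5      # assumption: small enough to keep plain gradient descent stable
before = loss(a)
for _ in range(1000):
    g = gradient(a)
    a = [ai - alpha * gi for ai, gi in zip(a, g)]
print(before, loss(a))  # the loss shrinks as the coefficients improve
```

The tiny step size is forced on us because the x⁵ term makes some directions of the loss surface extremely steep; this is exactly the step-size sensitivity discussed in the limitations section below.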

Technically, the animation above uses something called momentum gradient descent, which is a variant that helps it avoid getting stuck in local minima.
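Momentum keeps a running velocity so that steps accumulate in consistently downhill directions. A minimal sketch on a one-variable function, with the momentum coefficient λ and step size α chosen arbitrarily for illustration:

```python
# Momentum gradient descent on f(x) = x^2, whose gradient is 2x.
grad = lambda x: 2 * x

x, v = 5.0, 0.0
lam, alpha = 0.9, 0.05   # momentum coefficient and step size (assumptions)
for _ in range(500):
    v = lam * v + alpha * grad(x)   # velocity accumulates past gradients
    x = x - v                        # step by the velocity, not the raw gradient
print(x)  # very close to the minimum at 0
```

Setting `lam = 0` recovers plain gradient descent; larger values let the "ball" coast through shallow bumps instead of stopping at the first flat spot.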

## Limitations

Gradient descent has applications whenever we have a function we want to minimize, which is common in machine learning, for example. But it's important to know its shortcomings so that we can plan around them.

One of its limitations is that it only finds local minima (rather than the global minimum). As soon as the algorithm finds a point that's at a local minimum, it will never escape, as long as the step size isn't large enough to jump out of the valley.

In the graph above, each local minimum has its own valley that would trap a gradient descent algorithm. After all, the algorithm only ever tries to go down, so once it finds a point where every direction leads up, it will stop. Looking at the graph from a different perspective, we also see that one local minimum is lower than the other.

When we minimize a function, we want to find the global minimum, but there is no way that gradient descent can distinguish global and local minima.

Another limitation of gradient descent concerns the step size alpha. A good step size moves toward the minimum rapidly, each step making substantial progress.

*Good step size converges quickly.*

If the step size is too large, however, we may never converge to a local minimum because we overshoot it every time.

*Large step size diverges.*

If we are lucky and the algorithm converges anyway, it still might take more steps than it needs.

*Large step size converges slowly.*

If the step size is too small, then we'll be more likely to converge, but we'll take far more steps than were necessary. This is a problem when the function we're minimizing has thousands or millions of variables, and evaluating it is cumbersome.

*Tiny step size converges slowly.*
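The effect of the step size is easy to see numerically. Here we minimize f(x) = x² (gradient 2x) from the same starting point with three step sizes; the exact values are chosen only for illustration.

```python
def run(alpha, steps=20, x=1.0):
    for _ in range(steps):
        x -= alpha * 2 * x   # gradient of x^2 is 2x
    return x

good = run(0.4)    # x shrinks by a factor of 0.2 each step: converges fast
huge = run(1.5)    # x flips sign and doubles each step: diverges
tiny = run(0.01)   # x shrinks by a factor of 0.98 each step: converges very slowly
print(good, huge, tiny)
```

After 20 steps the first run is essentially at the minimum, the second has blown up past a million, and the third has barely covered a third of the distance.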

A final limitation is that gradient descent only works when our function is differentiable everywhere. Otherwise we might come to a point where the gradient isn't defined, and then we can't use our update formula.

*Gradient descent fails for non-differentiable functions.*

## Summary

- Gradient descent minimizes differentiable functions that output a number and have any number of input variables.
- It does this by taking a guess x₀ and successively applying the formula xₙ₊₁ = xₙ − α∇f(xₙ). In words, the formula says to take a small step in the direction of the negative gradient.
- Gradient descent can't tell whether a minimum it has found is local or global.
- The step size alpha controls whether the algorithm converges to a minimum quickly or slowly, or whether it diverges.
- Many real-world problems come down to minimizing a function.

## Want to join the conversation?

- In the section on momentum gradient descent, the equations read:

  v₁ = λv₀ + α∇f(x₀)

  x₁ = x₀ + v₁

  As written, the first equation climbs the gradient, so the correction is:

  v₁ = λv₀ − α∇f(x₀)

  to make it gradient descent. (5 votes)
  - You are correct! That sneaky minus sign makes all the difference. Thanks for pointing it out -- the typo should be fixed pretty soon. (4 votes)

- Could gradient descent be trapped by a saddle point? Assume we are trying to minimize f = x³ on the interval [−1, 1]. If x₀ = 1/2, then gradient descent will stop at x = 0, which is not even a local minimum. (4 votes)
  - It's unlikely you would land on such an exact number as x = 0. You would typically initialize from random values as well. (1 vote)

- Why all the focus on minimizing functions? Can we also use a sort of "gradient ascent" to maximize (differentiable real-valued) functions? I don't see why not... (of course, any problem about maximizing a function f can be reframed in terms of minimizing its negation −f) (3 votes)
- Hi, tau is indeed better than pi, though I did celebrate pi day today! You are absolutely right. By considering -f(x), we turn any maximization problem into a minimization problem. Thus, the ideas are all the same really, and we pick either to minimize or maximize mostly arbitrarily. In the context of machine learning, the custom is to define a loss function because it is often easier to count how many mistakes an algorithm makes than to count how well it is doing. Thus we minimize the loss.(3 votes)

- Any python code written for approximating the sinx function?(2 votes)
- Hello siddiqr67, here's the original JavaScript code used to create the animation in the article.

https://www.khanacademy.org/computer-programming/gradient-descent-animator-polynomial-approximation/4973855889768448

It should be fairly straightforward to translate the ideas into Python. Note: if you want the best approximation for sin(x), there are methods that will produce better results than gradient descent. Taylor series or least-squares approximations come to mind. But it's just really cool to see gradient descent in action! Let me know if you have other questions :)(4 votes)

- does gradient descent work on multi-output functions?(2 votes)
- No, not on its own; however, most applications of the algorithm can be rephrased as minimizing a single-output function with a couple of extra equations. In fact, the gradient descent algorithm is extremely effective for deep learning because of this fact. (1 vote)

- Does anyone have a link to a coding exercise for understanding the algorithm practically? It would be really helpful(2 votes)
- Could you algebraically show the first few iterations for example 2 so it becomes a little clearer?

  Edit: OK, part of my confusion is that I was thinking of x as the variable, when actually it's a six-variable function: f(a0, a1, a2, a3, a4, a5). So I guess the gradient has 6 elements, and so on? i.e., for a given x, we are operating in a 6-D input space? (1 vote)
  - Hi bkmurthy99, you are absolutely correct. The function f(a0, a1, a2, a3, a4, a5) has six inputs, so the analogy is we're rolling a ball down a hill in six dimensions to find where f is minimized. Like you said, the variable x is not related to how good of an approximation we have -- it's just a dummy variable we use to evaluate any given approximation p. It's a lot to keep in your head. That's one reason gradient descent is so cool: it doesn't care whether we need to optimize one variable, two, six, or a million! (3 votes)

- This might be asking a lot, but you could add another article, or extend this one, to include stochastic gradient descent and mini-batch gradient descent?(1 vote)
- Dear Shanie, I don't know of any plans to add another article on SGD or mini-batch methods. These would go a little beyond the scope of a multivariable calculus course, though it would be amazing of course if Khan Academy ever developed a machine learning course. In the meantime, there are many good resources online to learn about the methods you mentioned.(2 votes)

- It's not clear to me that the negative of the gradient is the direction of steepest descent from the fact that stepping in the direction of the gradient is the steepest ascent... I can imagine being at a point where there is a steep drop-off orthogonal to the gradient.

  I am actually sort of picturing a saddle point in my mind... does anyone have any thoughts? (1 vote)
  - At a saddle point, the gradient is zero, so there is no single direction of steepest ascent or descent there: the function increases in some directions and decreases in others, and both the gradient and its negative vanish. The same applies at other critical points. Away from critical points, the rate of change in a unit direction u is the directional derivative ∇f · u, which is largest when u points along ∇f and most negative when u points along −∇f -- so a steep drop-off can't hide orthogonal to the gradient; to first order, the function doesn't change at all in those directions. (1 vote)

- Hello. Can you explain briefly why we are using an integral in the least-squares method? (1 vote)