Why Gradient Descent Works
Gradient descent is a well-known optimization tool for estimating a model's parameters by minimizing a loss function. Often we don't fully know the shape and complexity of the loss function, or where its minimum lies. That's where gradient descent comes to the rescue: if we step in the direction opposite to the gradient, the value of the loss function decreases.
This idea is illustrated in Figure 1. We start from some initial parameter vector, w0 (usually randomly initialized), and iteratively walk downhill along the loss function until we reach its minimum.
Figure 1. Gradient descent intuition
The update rule for w at each step is:
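$$w \leftarrow w - \alpha \, g(w) \tag{1}$$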
Where w is the parameter vector, alpha is a small positive number called the learning rate, and g(w) is the gradient of the loss function with respect to w.
Intuitively, gradient descent makes total sense: starting from w0 in Figure 1, to decrease the value of the loss function we need to increase w (move to the right). The gradient at w0 is negative (the slope points downwards), alpha is positive, and thanks to the minus sign, applying (1) increases the value of w and decreases L(w).
Grasping the intuition behind an algorithm is a good start (and very often enough), but let's close the circle and add the mathematical demonstration.
Let's look at the figure below:
Figure 2. Gradient descent setting
w0 is the initial parameter vector; w1 is the parameter vector after one step; s is the step vector (w1 - w0); L'(w0) is the derivative of L(w) at w0. We need to prove that L(w1) < L(w0) for a sufficiently small gradient descent step s.
Assuming L(w) is continuous and differentiable, we can approximate L(w0+s) with its 1st order Taylor approximation:
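$$L(w_0 + s) \approx L(w_0) + L'(w_0)^\top s$$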
Let's use the gradient descent intuition and define s as:
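$$s = -\alpha \, g(w_0) = -\alpha \, L'(w_0)$$

(here g(w_0) and L'(w_0) both denote the gradient of the loss at w_0, as defined above).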
We want to prove that:
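$$L(w_1) = L(w_0 + s) < L(w_0)$$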
Let's start by plugging in the definition of s:
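$$L(w_0 + s) = L(w_0 - \alpha \, L'(w_0))$$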
Now, let's use the first order Taylor expansion:
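$$L(w_0 - \alpha \, L'(w_0)) \approx L(w_0) - \alpha \, L'(w_0)^\top L'(w_0) = L(w_0) - \alpha \, \|L'(w_0)\|^2$$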
The squared norm of a vector is a positive number (as long as the gradient is not zero, i.e., as long as w0 is not already a stationary point):
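$$\|L'(w_0)\|^2 > 0$$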
Since alpha is small but positive, we are subtracting a positive value from L(w0), hence:
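$$L(w_1) \approx L(w_0) - \alpha \, \|L'(w_0)\|^2 < L(w_0)$$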
And there it is: the proof that, no matter where you start, subtracting the gradient multiplied by the learning rate from the model's parameters is equivalent to taking a small step downhill, and it reduces the value of the loss function. When this process is repeated a sufficient number of times, we eventually reach a minimum of L(w).
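To see update rule (1) in action, here is a minimal sketch in Python on a toy quadratic loss L(w) = ||w - w_star||^2 (the loss, its gradient, and all variable names below are illustrative choices for this example, not part of the derivation above). The printed loss value shrinks at every step, exactly as the proof predicts.

```python
import numpy as np

# Toy quadratic loss L(w) = ||w - w_star||^2 with gradient g(w) = 2 * (w - w_star).
# w_star plays the role of the (unknown) minimizer of the loss.

def loss(w, w_star):
    return float(np.sum((w - w_star) ** 2))

def grad(w, w_star):
    return 2.0 * (w - w_star)

w_star = np.array([3.0, -1.0])   # the minimizer we are trying to find
w = np.array([0.0, 0.0])         # w0: initial parameter vector
alpha = 0.1                      # learning rate

for step in range(25):
    w = w - alpha * grad(w, w_star)   # update rule (1): step against the gradient
    print(f"step {step:2d}   L(w) = {loss(w, w_star):.6f}")

# Each printed value is smaller than the previous one, and w converges to w_star.
```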