Concept of Gradient Descent in Machine Learning

VARSHITHA GUDIMALLA
7 min read · Aug 25, 2021


Gradient Descent is a very generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of Gradient Descent is to tweak parameters iteratively to minimize a cost function.
Suppose we are lost in the mountains in a dense fog; we can only feel the slope of the ground below our feet. A good strategy to get to the bottom of the valley quickly is to go downhill in the direction of the steepest slope. This is exactly what Gradient Descent does: it measures the local gradient of the error function with regard to the parameter vector θ, and it goes in the direction of the descending gradient. Once the gradient is zero, we have reached a minimum!
Concretely, we start by filling θ with random values (this is called random initialization), and then we improve it gradually, taking small steps at a time, each step attempting to decrease the cost function, until the algorithm converges to a minimum.
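To make this concrete, here is a minimal sketch of the idea on a one-variable cost function (the function, starting point, and learning rate are illustrative choices, not part of the algorithm itself):

```python
# Gradient Descent on f(theta) = (theta - 3)**2, whose gradient is
# 2 * (theta - 3) and whose minimum sits at theta = 3.
theta = 0.0  # starting point (imagine it was picked at random)
eta = 0.1    # learning rate: the size of each downhill step

for step in range(50):
    gradient = 2 * (theta - 3)       # slope of the cost at the current theta
    theta = theta - eta * gradient   # step in the direction of descending gradient

print(theta)  # ~3.0: the algorithm has converged to the minimum
```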

An important parameter in Gradient Descent is the size of the steps, determined by the learning rate hyperparameter. If the learning rate is too small, then the algorithm will have to go through many iterations to converge, which will take a long time.

On the other hand, if the learning rate is too high, we might jump across the valley and end up on the other side, possibly even higher up than we were before. This might make the algorithm diverge, with larger and larger values, failing to find a good solution.

Finally, not all cost functions look like nice regular bowls. There may be holes, ridges, plateaus, and all sorts of irregular terrain, making convergence to the minimum very difficult. Imagine a cost function with a local minimum on the left, a long plateau on the right, and the global minimum in between: this illustrates the two main challenges with Gradient Descent. If the random initialization starts the algorithm on the left, it will converge to a local minimum, which is not as good as the global minimum. If it starts on the right, it will take a very long time to cross the plateau, and if we stop too early we will never reach the global minimum.

Fortunately, the MSE cost function for a Linear Regression model happens to be convex, which means there are no local minima, just one global minimum; it is also a continuous function whose slope never changes abruptly. These two facts have a great consequence: Gradient Descent is guaranteed to approach arbitrarily close to the global minimum (if we wait long enough and the learning rate is not too high).

In short, training a model means searching for a combination of model parameters that minimizes a cost function. It is a search in the model’s parameter space: the more parameters a model has, the more dimensions this space has, and the harder the search is.

Batch Gradient Descent

To implement Gradient Descent, we need to calculate how much the cost function will change if we change θj just a little bit. This is called a partial derivative. It is like asking “what is the slope of the mountain under my feet if I face east?” and then asking the same question facing north. The equation below computes the partial derivative of the cost function with regard to parameter θj:
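For a Linear Regression model trained with the MSE cost function, this partial derivative is:

∂/∂θj MSE(θ) = (2/m) Σ(i=1..m) (θᵀx(i) − y(i)) xj(i)

where m is the number of training instances, x(i) is the feature vector of the i-th instance, and y(i) is its target value.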

Instead of computing these partial derivatives individually, we can use the equation below to compute them all at once. The gradient vector, noted ∇θMSE(θ), contains all the partial derivatives of the cost function.
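∇θMSE(θ) = (2/m) Xᵀ(Xθ − y)

where X is the matrix containing all the training instances (with the bias feature included) and y is the vector of target values. Notice that this formula involves calculations over the full training set X at every step: this is why the algorithm is called Batch Gradient Descent.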

Once we have the gradient vector, which points uphill, we simply go in the opposite direction. This means subtracting ∇θMSE(θ) from θ. This is where the learning rate η comes into play: multiply the gradient vector by η to determine the size of the downhill step.
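In equation form, the update step is:

θ(next step) = θ − η ∇θMSE(θ)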

The figure below shows the first 10 steps of Gradient Descent using three different learning rates:

On the left, the learning rate is too low: the algorithm will eventually reach the solution, but it will take a long time. In the middle, the learning rate looks pretty good: in just a few iterations, it has already converged to the solution. On the right, the learning rate is too high: the algorithm diverges, jumping all over the place and getting further and further away from the solution at every step. To find a good learning rate, we can use grid search. However, we may want to limit the number of iterations so that grid search can eliminate models that take too long to converge.
You may wonder how to set the number of iterations. If it is too low, we will still be far from the optimal solution when the algorithm stops; if it is too high, we will waste time while the model parameters no longer change. A simple solution is to set a very large number of iterations but to interrupt the algorithm when the gradient vector becomes tiny, that is, when its norm drops below a tiny number ϵ (called the tolerance), because this happens when Gradient Descent has (almost) reached the minimum.
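Putting this together, here is a sketch of Batch Gradient Descent for a simple Linear Regression problem, including the tolerance-based stopping rule (the synthetic data, learning rate, and tolerance below are illustrative assumptions):

```python
import numpy as np

# Illustrative linear data: y = 4 + 3x + Gaussian noise
np.random.seed(42)
m = 100                          # number of training instances
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]  # add x0 = 1 (the bias feature) to each instance

eta = 0.1               # learning rate
n_iterations = 100_000  # deliberately large; the tolerance check stops us earlier
epsilon = 1e-6          # tolerance on the gradient norm

theta = np.random.randn(2, 1)  # random initialization

for iteration in range(n_iterations):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)  # full-batch MSE gradient
    if np.linalg.norm(gradients) < epsilon:        # gradient vector is tiny: stop
        break
    theta = theta - eta * gradients                # downhill step

print(theta)  # close to [[4.], [3.]]
```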

Stochastic Gradient Descent

The main problem with Batch Gradient Descent is that it uses the complete training set to compute the gradients at every step, which makes it very slow when the training set is large. Stochastic Gradient Descent (SGD), at the opposite extreme, just picks a random instance in the training set at each step and computes the gradients based only on that single instance. This makes the algorithm much faster, since it has very little data to manipulate at every iteration. It also makes it possible to train on huge training sets, since only one instance needs to be in memory at each iteration. On the other hand, due to its stochastic (i.e., random) nature, this algorithm is much less regular than Batch GD: instead of gently decreasing until it reaches the minimum, the cost function will bounce up and down, decreasing only on average. Over time it will end up very close to the minimum, but once it gets there it will continue to bounce around, never settling down (as shown below). So once the algorithm stops, the final parameter values are good, but not optimal.

When the cost function is very irregular, this randomness can actually help the algorithm jump out of local minima, so SGD has a better chance of finding the global minimum than BGD does.
Randomness is therefore good to escape from local optima, but bad because it means the algorithm can never settle at the minimum. One solution to this dilemma is to gradually reduce the learning rate. The steps start out large, then get smaller and smaller, allowing the algorithm to settle at the global minimum. The function that determines the learning rate at each iteration is called the learning schedule. If the learning rate is reduced too quickly, we may get stuck in a local minimum, or even end up frozen halfway to the minimum. If the learning rate is reduced too slowly, we may jump around the minimum for a long time and end up with a suboptimal solution if we stop training too early.
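To see the schedule in action, here is a sketch of Stochastic Gradient Descent that reuses X_b, y, and m from the Batch GD sketch above (the number of epochs and the schedule constants t0 and t1 are illustrative):

```python
n_epochs = 50
t0, t1 = 5, 50  # learning-schedule hyperparameters

def learning_schedule(t):
    return t0 / (t + t1)  # learning rate shrinks as training progresses

theta = np.random.randn(2, 1)  # random initialization

for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)       # pick one instance at random
        xi = X_b[random_index:random_index + 1]
        yi = y[random_index:random_index + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)  # gradient from a single instance
        eta = learning_schedule(epoch * m + i)    # decaying step size
        theta = theta - eta * gradients

print(theta)  # close to [[4.], [3.]], though it never stops bouncing around
```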

Mini-batch Gradient Descent

It is quite simple to understand once we know Batch and Stochastic Gradient Descent: at each step, instead of computing the gradients based on the full training set or on just one instance, Mini-batch GD computes the gradients on small random sets of instances called mini-batches. The main advantage of Mini-batch GD over Stochastic GD is that we can get a performance boost from hardware optimization of matrix operations, especially when using GPUs.
The algorithm’s progress in parameter space is less erratic than with SGD, especially with fairly large mini-batches. As a result, Mini-batch Gradient Descent will end up walking around a bit closer to the minimum than SGD, but it may be harder for it to escape from local minima. The figure below shows the paths taken by the three Gradient Descent algorithms in parameter space during training. They all end up near the minimum, but BGD’s path actually stops at the minimum, while both SGD and Mini-batch GD continue to move around. However, don’t forget that Batch GD takes a lot of time to take each step, and SGD and Mini-batch GD would also reach the minimum if we used a good learning schedule.
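Finally, a sketch of Mini-batch Gradient Descent under the same assumptions (the batch size and the fixed learning rate are illustrative; a learning schedule could be used here as well):

```python
n_epochs = 50
batch_size = 20
eta = 0.1  # fixed learning rate, for simplicity

theta = np.random.randn(2, 1)  # random initialization

for epoch in range(n_epochs):
    shuffled = np.random.permutation(m)  # reshuffle the instances each epoch
    X_shuffled, y_shuffled = X_b[shuffled], y[shuffled]
    for start in range(0, m, batch_size):
        xi = X_shuffled[start:start + batch_size]           # one mini-batch
        yi = y_shuffled[start:start + batch_size]
        gradients = 2 / len(xi) * xi.T @ (xi @ theta - yi)  # mini-batch gradient
        theta = theta - eta * gradients

print(theta)  # close to [[4.], [3.]]
```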
