This blog post introduces some of the optimizers commonly used in deep learning.

Gradient Descent
We have been using gradient descent as a learning algorithm or optimizer without much scrutiny, but there is a significant flaw in the gradient descent optimizer that contributes to poor learning curves in FNNs, as discussed in the previous article. The problem is that it cannot escape from local minima to explore the global minimum.

The above contour plot shows the loss function with respect to two learnable parameters. Depending on whether you start from the top-right half or the bottom-left half, gradient descent will land in either a local minimum or the global minimum and will get stuck there, as the gradient becomes zero in both cases. The previous example using the Iris dataset seemed to have a similar problem, where the model took a while to escape from the point where the loss stayed around 1.1. In fact, if you try different sets of hyperparameters, the models often stop learning at that point.
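To make the problem concrete, here is a minimal sketch (not from the original article) of plain gradient descent on a one-dimensional toy loss with two minima; depending on the starting point, the optimizer settles into a different basin and stays there:

def loss(w):
    # Toy loss with two minima of different depths (a double well)
    return w ** 4 - 3 * w ** 2 + w

def grad(w):
    # Analytical derivative of the toy loss
    return 4 * w ** 3 - 6 * w + 1

def gradient_descent(w0, lr=0.01, steps=500):
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# Starting on different sides of the hump leads to different minima
for w0 in (-2.0, 2.0):
    w_final = gradient_descent(w0)
    print(f"start={w0:+.1f} -> w={w_final:+.3f}, loss={loss(w_final):.3f}")

Once the gradient reaches zero in either basin, the updates stop, which is exactly the behavior shown in the contour plot above.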
Stochastic Gradient Descent
One simple way to address this problem is to introduce randomness into the algorithm by using only one randomly chosen data point to compute the gradient, an approach called stochastic gradient descent (SGD). It sometimes works quite well and is easy to implement in code. Let's see the implementation in our FNNClassifier.fit function:
def fit(self, X, y, epochs=100, verbose=True):
    for epoch in range(epochs):
        if self.optimizer == "SGD":
            # Pick one random data point to compute the gradient
            randId = np.random.choice(X.shape[0])
            X_batch = X[randId].reshape(1, X.shape[1])
            y_batch = y[randId]
            if y.ndim > 1:
                # Keep one-hot targets two-dimensional
                y_batch = y_batch.reshape(1, y.shape[1])
        else:
            # Plain gradient descent: use the full dataset
            X_batch = X
            y_batch = y
        pred, activations = self.predict(X_batch, train=True)
        # Track the cross-entropy loss for the learning curve
        self.history.append(log_loss(y_batch, pred))
        if verbose:
            print(f"Epoch: {epoch}, Loss: {self.history[-1]}")
        delta = pred - y_batch
        ...
However, SGD has a critical drawback: gradients need to be computed one by one, so we cannot parallelize the process like we can with normal gradient descent. The gradients can also be unstable, making the learning process erratic and significantly longer. Below is the result of using SGD on the Iris dataset.

The learning curve is quite volatile, although it is generally converging to the global minimum.
Mini-Batch Gradient Descent
To get the best of both worlds, we can use mini-batch gradient descent, where we create a random mini-batch of, say, 16 or 32 data points and use it to compute the gradients. This allows us to compute the gradients of multiple data points in parallel and obtain relatively stable gradients while still exploring the loss surface for the global minimum. Below is the implementation of mini-batch GD in the context of the FNNClassifier.fit function:
def fit(self, X, y, epochs=100, verbose=True):
    for epoch in range(epochs):
        if self.optimizer == "SGD":
            # One random data point
            randId = np.random.choice(X.shape[0])
            X_batch = X[randId].reshape(1, X.shape[1])
            y_batch = y[randId]
            if y.ndim > 1:
                y_batch = y_batch.reshape(1, y.shape[1])
        elif self.optimizer == "MiniBatchGD":
            # Random mini-batch of batch_size data points, sampled without replacement
            randId = np.random.choice(X.shape[0], self.batch_size, replace=False)
            X_batch = X[randId]
            y_batch = y[randId]
        else:
            # Plain gradient descent: use the full dataset
            X_batch = X
            y_batch = y
        ...
Due to the above changes, you will need to add an additional hyperparameter, batch_size, to the FNNClassifier class. The learning curve for mini-batch GD with a batch size of 16 looks like the following:

The curve is less volatile than with SGD and converges more quickly to the global minimum, but some volatility remains. This may be due to overshooting, where the optimizer steps too far and crosses to the opposite side of the trough. Let's look at an example.

The above figure shows a loss function with a narrow trough and the path of mini-batch gradient descent. The optimizer initially oscillates between the sides of the trough due to the high gradient and learning rate. Ideally, we want the optimizer to minimize this oscillation and converge to the minimum more quickly.
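This overshooting is easy to reproduce on a toy quadratic loss that is steep in one direction and shallow in the other (an illustrative sketch of my own, not the article's figure):

import numpy as np

def grad(w):
    # Gradient of L(w) = 0.5 * (20 * w[0]**2 + w[1]**2): steep along w[0], shallow along w[1]
    return np.array([20 * w[0], w[1]])

w = np.array([1.0, 1.0])
lr = 0.09  # large enough to overshoot in the steep direction
for step in range(5):
    w = w - lr * grad(w)
    print(step, w)  # w[0] flips sign every step while w[1] barely moves

The first coordinate bounces between the two walls of the trough while the second creeps toward the minimum, producing exactly the zig-zag path described above.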
Momentum
One approach to address the issue is to keep track of the momentum of the optimizer's previous movements and take that into account when computing the gradient. This can help prevent the optimizer from oscillating too much. The concept is implemented using the following equations:

$$v_t = \beta\, v_{t-1} + (1 - \beta)\, \nabla_w L$$

$$w \leftarrow w - \alpha\, v_t$$

where $\beta$ is the hyperparameter controlling how much we consider the previous gradient, $\nabla_w L$ is the current gradient, and $\alpha$ is the learning rate. Not only can this approach help reduce the effective step when the optimizer is changing direction, but it can also increase it when the optimizer keeps moving in the same direction, thereby accelerating learning. By choosing the right values for $\alpha$ and $\beta$, we aim to achieve smooth and efficient learning.
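As a rough sketch of how this update could look in code for a single weight matrix (the function and variable names here are mine, not part of the FNNClassifier):

import numpy as np

def momentum_update(W, grad_W, v_W, lr=0.001, beta=0.9):
    # Exponentially weighted average of past gradients
    v_W = beta * v_W + (1 - beta) * grad_W
    # Step along the smoothed direction instead of the raw gradient
    W = W - lr * v_W
    return W, v_W

# Toy usage: a gradient that keeps flipping sign is damped by the running average
W, v_W = np.ones((2, 2)), np.zeros((2, 2))
for t in range(4):
    grad_W = ((-1) ** t) * np.ones((2, 2))
    W, v_W = momentum_update(W, grad_W, v_W)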
RMSProp
Another approach is to adjust the learning rate using momentum, a technique known as RMSProp. The equations for RMSProp are:

$$s_t = \beta\, s_{t-1} + (1 - \beta)\, (\nabla_w L)^2$$

$$w \leftarrow w - \frac{\alpha}{\sqrt{s_t} + \epsilon}\, \nabla_w L$$

RMSProp is similar to the momentum approach, but with one key difference: the gradient is squared to compute $s_t$, and this result is applied to the learning rate. Squaring the gradient and then taking the square root ensures the value is positive, adjusting the scale of the learning rate without altering the direction of the gradient. The small value $\epsilon$ prevents the denominator from becoming zero.
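A corresponding sketch for RMSProp, again with illustrative names rather than the article's own code:

import numpy as np

def rmsprop_update(W, grad_W, s_W, lr=0.001, beta=0.9, epsilon=1e-8):
    # Running average of squared gradients
    s_W = beta * s_W + (1 - beta) * grad_W ** 2
    # Scale the learning rate per parameter; the direction of grad_W is unchanged
    W = W - lr * grad_W / (np.sqrt(s_W) + epsilon)
    return W, s_W

# Toy usage: a huge gradient and a tiny one end up taking steps of similar magnitude
W, s_W = np.ones(2), np.zeros(2)
W, s_W = rmsprop_update(W, np.array([100.0, 0.01]), s_W)
print(W)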
Adam
Both momentum and RMSProp are effective techniques, so why not combine them to adjust both the gradient and the learning rate? This combination is achieved with the following equations:

$$m_t = \beta_1\, m_{t-1} + (1 - \beta_1)\, \nabla_w L$$

$$s_t = \beta_2\, s_{t-1} + (1 - \beta_2)\, (\nabla_w L)^2$$

$$w \leftarrow w - \frac{\alpha}{\sqrt{s_t} + \epsilon}\, m_t$$

Although the above formulation seems to work fine, $m_t$ and $s_t$ are initialized at zero and remain biased toward zero during the first steps, making the learning process slower at the start. Therefore, we apply bias correction:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^{\,t}}, \qquad \hat{s}_t = \frac{s_t}{1 - \beta_2^{\,t}}$$

$$w \leftarrow w - \frac{\alpha}{\sqrt{\hat{s}_t} + \epsilon}\, \hat{m}_t$$
This approach, which combines momentum and RMSProp with bias correction, is called Adaptive Moment Estimation, or Adam. It is one of the most widely used optimizers in deep learning. Let's implement this in our FNNClassifier.
def __init__(self, input_dim, hidden_dims, output_dim, lr=0.001, batch_size=16, beta_1=0.9, beta_2=0.99, optimizer="SGD"):
    self.lr = lr
    self.history = []
    self.optimizer = optimizer
    self.batch_size = batch_size
    self.beta_1 = beta_1
    self.beta_2 = beta_2
    self.W = []
    self.b = []
    ...
    # Initialize Adam variables
    if self.optimizer == "Adam":
        self.epsilon = 1e-8
        # First (m) and second (s) moment estimates for every weight matrix and bias vector
        self.m_W = [np.zeros_like(w) for w in self.W]
        self.m_b = [np.zeros_like(b) for b in self.b]
        self.s_W = [np.zeros_like(w) for w in self.W]
        self.s_b = [np.zeros_like(b) for b in self.b]
First, we need to add additional hyperparameters for the Adam optimizer and initialize the first and second moment estimates for the weights and biases. Then, we can update the fit function using the equations above.
def fit(self, X, y, epochs=100, verbose=True):
    for epoch in range(epochs):
        if self.optimizer == "SGD":
            # One random data point
            randId = np.random.choice(X.shape[0])
            X_batch = X[randId].reshape(1, X.shape[1])
            y_batch = y[randId]
            if y.ndim > 1:
                y_batch = y_batch.reshape(1, y.shape[1])
        elif self.optimizer == "MiniBatchGD" or self.optimizer == "Adam":
            # Adam also works on a random mini-batch
            randId = np.random.choice(X.shape[0], self.batch_size, replace=False)
            X_batch = X[randId]
            y_batch = y[randId]
        else:
            X_batch = X
            y_batch = y
        ...
        # (inside the backward pass, for each layer i with gradients grad_W and grad_b)
        # Adam-specific parameter updates
        if self.optimizer == "Adam":
            t = epoch + 1
            # Update first (m) and second (s) moment estimates
            self.m_W[i] = self.beta_1 * self.m_W[i] + (1 - self.beta_1) * grad_W
            self.s_W[i] = self.beta_2 * self.s_W[i] + (1 - self.beta_2) * (grad_W ** 2)
            self.m_b[i] = self.beta_1 * self.m_b[i] + (1 - self.beta_1) * grad_b
            self.s_b[i] = self.beta_2 * self.s_b[i] + (1 - self.beta_2) * (grad_b ** 2)
            # Bias correction
            m_W_hat = self.m_W[i] / (1 - self.beta_1 ** t)
            s_W_hat = self.s_W[i] / (1 - self.beta_2 ** t)
            m_b_hat = self.m_b[i] / (1 - self.beta_1 ** t)
            s_b_hat = self.s_b[i] / (1 - self.beta_2 ** t)
            # Parameter updates with the adaptive learning rate
            self.W[i] -= self.lr * m_W_hat / (np.sqrt(s_W_hat) + self.epsilon)
            self.b[i] -= self.lr * m_b_hat / (np.sqrt(s_b_hat) + self.epsilon)
        else:
            self.W[i] -= self.lr * grad_W
            self.b[i] -= self.lr * grad_b
    return self.history
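For reference, a training run on the Iris dataset might be wired up as follows; the data preparation and hidden layer sizes here are my assumptions (including that hidden_dims takes a list of layer widths and that the targets are one-hot encoded), since the article's own setup code is not shown:

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data                 # 4 features
y = np.eye(3)[iris.target]    # 3 classes, one-hot encoded

model = FNNClassifier(input_dim=4, hidden_dims=[16, 16], output_dim=3,
                      lr=0.001, batch_size=16, optimizer="Adam")
history = model.fit(X, y, epochs=100, verbose=False)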
The Adam optimizer works with mini-batches, similar to mini-batch gradient descent. When fitting the model to the Iris dataset, the learning curve looks like this:

Unfortunately, in this particular case, we may not observe much difference compared to mini-batch GD, but Adam should make the model more robust in general. We will continue making improvements to the FNNClassifier to achieve better results in the next article.