This blog post introduces some of the optimizers commonly used in deep learning.

Gradient Descent
We have been using gradient descent as a learning algorithm or optimizer without much scrutiny, but there is a significant flaw in the gradient descent optimizer that contributes to poor learning curves in FNNs, as discussed in the previous article. The problem is that it cannot escape from local minima to explore the global minimum.

The above contour plot shows the loss function with respect to two learnable parameters. Depending on whether you start from the top-right half or the bottom-left half, gradient descent will land in either a local minimum or the global minimum and will get stuck there, as the gradient becomes zero in both cases. The previous example using the Iris dataset seemed to have a similar problem, where the model took a while to escape from the point where the loss stayed around 1.1. In fact, if you try different sets of hyperparameters, the models often stop learning at that point.
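To make the problem concrete, here is a minimal sketch (not from the original article) of plain gradient descent on a one-dimensional toy loss with two minima; depending on the starting point, the optimizer settles into a different basin and stays there:

def loss(w):
    # Toy loss with two minima of different depths (a double well)
    return w ** 4 - 3 * w ** 2 + w

def grad(w):
    # Analytical derivative of the toy loss
    return 4 * w ** 3 - 6 * w + 1

def gradient_descent(w0, lr=0.01, steps=500):
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# Starting on different sides of the hump leads to different minima
for w0 in (-2.0, 2.0):
    w_final = gradient_descent(w0)
    print(f"start={w0:+.1f} -> w={w_final:+.3f}, loss={loss(w_final):.3f}")

Once the gradient reaches zero in either basin, the updates stop, which is exactly the behavior shown in the contour plot above.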
Stochastic Gradient Descent
One simple way to address this problem is to introduce randomness into the algorithm by using only one randomly chosen data point to compute the gradient, an approach called stochastic gradient descent (SGD). It sometimes works quite well and is easy to implement in code. Let's see the implementation in our FNNClassifier.fit function:
def fit(self, X, y, epochs=100, verbose=True):
    for epoch in range(epochs):
        if self.optimizer == "SGD":
            # Pick one random data point to compute the gradient
            randId = np.random.choice(X.shape[0])
            X_batch = X[randId].reshape(1, X.shape[1])
            y_batch = y[randId]
            if y.ndim > 1:
                # Keep one-hot targets two-dimensional
                y_batch = y_batch.reshape(1, y.shape[1])
        else:
            # Plain gradient descent: use the full dataset
            X_batch = X
            y_batch = y
        pred, activations = self.predict(X_batch, train=True)
        # Track the cross-entropy loss for the learning curve
        self.history.append(log_loss(y_batch, pred))
        if verbose:
            print(f"Epoch: {epoch}, Loss: {self.history[-1]}")
        delta = pred - y_batch
        ...
However, SGD has a critical drawback: gradients need to be computed one by one, so we cannot parallelize the process like we can with normal gradient descent. The gradients can also be unstable, making the learning process erratic and significantly longer. Below is the result of using SGD on the Iris dataset.

The learning curve is quite volatile, although it is generally converging to the global minimum.
Mini-Batch Gradient Descent
To get the best of both worlds, we can use mini-batch gradient descent, where we create a random mini-batch of, say, 16 or 32 data points and use it to compute the gradients. This allows us to compute the gradients of multiple data points in parallel and obtain relatively stable gradients while still exploring the loss surface for the global minimum. Below is the implementation of mini-batch GD in the context of the FNNClassifier.fit function:
def fit(self, X, y, epochs=100, verbose=True):
    for epoch in range(epochs):
        if self.optimizer == "SGD":
            # One random data point
            randId = np.random.choice(X.shape[0])
            X_batch = X[randId].reshape(1, X.shape[1])
            y_batch = y[randId]
            if y.ndim > 1:
                y_batch = y_batch.reshape(1, y.shape[1])
        elif self.optimizer == "MiniBatchGD":
            # Random mini-batch of batch_size data points, sampled without replacement
            randId = np.random.choice(X.shape[0], self.batch_size, replace=False)
            X_batch = X[randId]
            y_batch = y[randId]
        else:
            # Plain gradient descent: use the full dataset
            X_batch = X
            y_batch = y
        ...
Due to the above changes, you will need to add an additional hyperparameter, batch_size, to the FNNClassifier class. The learning curve for mini-batch GD with a batch size of 16 looks like the following:

The curve is less volatile than with SGD and converges more quickly to the global minimum, but some volatility remains. This may be due to overshooting, where the optimizer steps too far and crosses to the opposite side of the trough. Let's look at an example.

The above figure shows a loss function with a narrow trough and the path of mini-batch gradient descent. The optimizer initially oscillates between the sides of the trough due to the high gradient and learning rate. Ideally, we want the optimizer to minimize this oscillation and converge to the minimum more quickly.
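This overshooting is easy to reproduce on a toy quadratic loss that is steep in one direction and shallow in the other (an illustrative sketch of my own, not the article's figure):

import numpy as np

def grad(w):
    # Gradient of L(w) = 0.5 * (20 * w[0]**2 + w[1]**2): steep along w[0], shallow along w[1]
    return np.array([20 * w[0], w[1]])

w = np.array([1.0, 1.0])
lr = 0.09  # large enough to overshoot in the steep direction
for step in range(5):
    w = w - lr * grad(w)
    print(step, w)  # w[0] flips sign every step while w[1] barely moves

The first coordinate bounces between the two walls of the trough while the second creeps toward the minimum, producing exactly the zig-zag path described above.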
Momentum
One approach to address the issue is to keep track of the momentum of the optimizer's previous movements and take that into account when computing the gradient. This can help prevent the optimizer from oscillating too much. The concept is implemented using the following equations:

$$v_t = \beta\, v_{t-1} + (1 - \beta)\, \nabla_w L$$

$$w \leftarrow w - \alpha\, v_t$$

where $\beta$ is the hyperparameter controlling how much we consider the previous gradient, $\nabla_w L$ is the current gradient, and $\alpha$ is the learning rate. Not only can this approach help reduce the effective step when the optimizer is changing direction, but it can also increase it when the optimizer keeps moving in the same direction, thereby accelerating learning. By choosing the right values for $\alpha$ and $\beta$, we aim to achieve smooth and efficient learning.
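As a rough sketch of how this update could look in code for a single weight matrix (the function and variable names here are mine, not part of the FNNClassifier):

import numpy as np

def momentum_update(W, grad_W, v_W, lr=0.001, beta=0.9):
    # Exponentially weighted average of past gradients
    v_W = beta * v_W + (1 - beta) * grad_W
    # Step along the smoothed direction instead of the raw gradient
    W = W - lr * v_W
    return W, v_W

# Toy usage: a gradient that keeps flipping sign is damped by the running average
W, v_W = np.ones((2, 2)), np.zeros((2, 2))
for t in range(4):
    grad_W = ((-1) ** t) * np.ones((2, 2))
    W, v_W = momentum_update(W, grad_W, v_W)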
RMSProp
Another approach is to adjust the learning rate using momentum, a technique known as RMSProp. The equations for RMSProp are:

$$s_t = \beta\, s_{t-1} + (1 - \beta)\, (\nabla_w L)^2$$

$$w \leftarrow w - \frac{\alpha}{\sqrt{s_t} + \epsilon}\, \nabla_w L$$

RMSProp is similar to the momentum approach, but with one key difference: the gradient is squared to compute $s_t$, and this result is applied to the learning rate. Squaring the gradient and then taking the square root ensures the value is positive, adjusting the scale of the learning rate without altering the direction of the gradient. The small value $\epsilon$ prevents the denominator from becoming zero.
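A corresponding sketch for RMSProp, again with illustrative names rather than the article's own code:

import numpy as np

def rmsprop_update(W, grad_W, s_W, lr=0.001, beta=0.9, epsilon=1e-8):
    # Running average of squared gradients
    s_W = beta * s_W + (1 - beta) * grad_W ** 2
    # Scale the learning rate per parameter; the direction of grad_W is unchanged
    W = W - lr * grad_W / (np.sqrt(s_W) + epsilon)
    return W, s_W

# Toy usage: a huge gradient and a tiny one end up taking steps of similar magnitude
W, s_W = np.ones(2), np.zeros(2)
W, s_W = rmsprop_update(W, np.array([100.0, 0.01]), s_W)
print(W)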
Adam
Both momentum and RMSProp are effective techniques, so why not combine them to adjust both the gradient and the learning rate? This combination is achieved with the following equations:

$$m_t = \beta_1\, m_{t-1} + (1 - \beta_1)\, \nabla_w L$$

$$s_t = \beta_2\, s_{t-1} + (1 - \beta_2)\, (\nabla_w L)^2$$

$$w \leftarrow w - \frac{\alpha}{\sqrt{s_t} + \epsilon}\, m_t$$

Although the above formulation seems to work fine, $m_t$ and $s_t$ are initialized at zero and remain biased toward zero during the first steps, making the learning process slower at the start. Therefore, we apply bias correction:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^{\,t}}, \qquad \hat{s}_t = \frac{s_t}{1 - \beta_2^{\,t}}$$

$$w \leftarrow w - \frac{\alpha}{\sqrt{\hat{s}_t} + \epsilon}\, \hat{m}_t$$
This approach, which combines momentum and RMSProp with bias correction, is called Adaptive Moment Estimation, or Adam. It is one of the most widely used optimizers in deep learning. Let's implement this in our FNNClassifier.
def __init__(self, input_dim, hidden_dims, output_dim, lr=0.001, batch_size=16, beta_1=0.9, beta_2=0.99, optimizer="SGD"):
    self.lr = lr
    self.history = []
    self.optimizer = optimizer
    self.batch_size = batch_size
    self.beta_1 = beta_1
    self.beta_2 = beta_2
    self.W = []
    self.b = []
    ...
    # Initialize Adam variables
    if self.optimizer == "Adam":
        self.epsilon = 1e-8
        # First (m) and second (s) moment estimates for every weight matrix and bias vector
        self.m_W = [np.zeros_like(w) for w in self.W]
        self.m_b = [np.zeros_like(b) for b in self.b]
        self.s_W = [np.zeros_like(w) for w in self.W]
        self.s_b = [np.zeros_like(b) for b in self.b]
First, we need to add additional hyperparameters for the Adam optimizer and initialize the first and second moment estimates for the weights and biases. Then, we can update the fit function using the equations above.
def fit(self, X, y, epochs=100, verbose=True):
    for epoch in range(epochs):
        if self.optimizer == "SGD":
            # One random data point
            randId = np.random.choice(X.shape[0])
            X_batch = X[randId].reshape(1, X.shape[1])
            y_batch = y[randId]
            if y.ndim > 1:
                y_batch = y_batch.reshape(1, y.shape[1])
        elif self.optimizer == "MiniBatchGD" or self.optimizer == "Adam":
            # Adam also works on a random mini-batch
            randId = np.random.choice(X.shape[0], self.batch_size, replace=False)
            X_batch = X[randId]
            y_batch = y[randId]
        else:
            X_batch = X
            y_batch = y
        ...
        # (inside the backward pass, for each layer i with gradients grad_W and grad_b)
        # Adam-specific parameter updates
        if self.optimizer == "Adam":
            t = epoch + 1
            # Update first (m) and second (s) moment estimates
            self.m_W[i] = self.beta_1 * self.m_W[i] + (1 - self.beta_1) * grad_W
            self.s_W[i] = self.beta_2 * self.s_W[i] + (1 - self.beta_2) * (grad_W ** 2)
            self.m_b[i] = self.beta_1 * self.m_b[i] + (1 - self.beta_1) * grad_b
            self.s_b[i] = self.beta_2 * self.s_b[i] + (1 - self.beta_2) * (grad_b ** 2)
            # Bias correction
            m_W_hat = self.m_W[i] / (1 - self.beta_1 ** t)
            s_W_hat = self.s_W[i] / (1 - self.beta_2 ** t)
            m_b_hat = self.m_b[i] / (1 - self.beta_1 ** t)
            s_b_hat = self.s_b[i] / (1 - self.beta_2 ** t)
            # Parameter updates with the adaptive learning rate
            self.W[i] -= self.lr * m_W_hat / (np.sqrt(s_W_hat) + self.epsilon)
            self.b[i] -= self.lr * m_b_hat / (np.sqrt(s_b_hat) + self.epsilon)
        else:
            self.W[i] -= self.lr * grad_W
            self.b[i] -= self.lr * grad_b
    return self.history
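For reference, a training run on the Iris dataset might be wired up as follows; the data preparation and hidden layer sizes here are my assumptions (including that hidden_dims takes a list of layer widths and that the targets are one-hot encoded), since the article's own setup code is not shown:

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data                 # 4 features
y = np.eye(3)[iris.target]    # 3 classes, one-hot encoded

model = FNNClassifier(input_dim=4, hidden_dims=[16, 16], output_dim=3,
                      lr=0.001, batch_size=16, optimizer="Adam")
history = model.fit(X, y, epochs=100, verbose=False)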
The Adam optimizer works with mini-batches, similar to mini-batch gradient descent. When fitting the model to the Iris dataset, the learning curve looks like this:

Unfortunately, in this particular case, we may not observe much difference compared to mini-batch GD, but Adam should make the model more robust in general. We will continue making improvements to the FNNClassifier to achieve better results in the next article.