Road to ML Engineer #13 - Artificial Neural Networks

Last Edited: 8/20/2024

This blog post introduces the basics of feedforward neural networks in machine learning.


We want machine learning models to be as intelligent as possible so that they can capture complex relationships between variables. How can we accomplish this? We can take inspiration from the most intelligent mechanism we know: our brains.

Much about how the brain works remains unknown, but we do know that it contains a vast number of interconnected neurons. These neurons receive activations from other neurons, fire depending on those activations, and pass their own activations on to subsequent neurons. To formulate an artificial neural network, we can mimic this architecture by replacing the biological neurons with functions.

Deep learning is the subset of machine learning that studies artificial neural networks, and it's no exaggeration to say that the rapid improvement in this field, combined with impressive performance made possible by hardware innovations, has sparked the recent AI spring.

Feedforward Neural Networks

The most basic artificial neural network is the feedforward neural network (FNN), where information flows in a single direction. An FNN consists of an input layer, hidden layers, and an output layer. The input layer contains one neuron for each feature of the data. Every input neuron is connected to each neuron in the first hidden layer, and each hidden neuron applies a linear function to the activations it receives.

The hidden neurons then apply a non-linear activation function, typically a sigmoid, which mimics the firing of biological neurons by mapping values from $(-\infty, \infty)$ into $(0, 1)$. We hope that the neurons in the hidden layer capture underlying features of the data that help with prediction. An arbitrary number of hidden layers, each with an arbitrary number of neurons, can be stacked to perform the same computation, which can be expressed mathematically as:

$$
z_t = \phi_1 h_{t-1} + \phi_2 \\
h_t = \sigma(z_t)
$$

where $h_t$ represents the activations of the neurons in layer $t$, $\sigma$ is the non-linear activation function, and $z_t$ is the result of applying the linear function, with the layer's weights and biases $\phi$, to the activations of the previous layer $h_{t-1}$. We hope that as we stack hidden layers, the model picks up higher-level features built on top of the lower-level ones.

Then, the output layer can take the activations of the last hidden layer and apply the same computations with appropriate activation functions depending on the task. We might avoid using an activation function for regression to keep the output value unbounded, use a sigmoid function to output probabilities for binary classification, and use a softmax function to output probabilities of multiple categories for multiclass classification.
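To make this concrete, here is a minimal NumPy sketch of the forward computation for one sigmoid hidden layer followed by a softmax output layer. The layer sizes and variable names are chosen purely for illustration and are not from the original implementation (which appears later in this post):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    z = z - np.max(z, axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / np.sum(e, axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 4))                      # 2 samples, 4 features

W1, b1 = rng.normal(size=(4, 5)), np.zeros(5)    # hidden layer: 4 -> 5
W2, b2 = rng.normal(size=(5, 3)), np.zeros(3)    # output layer: 5 -> 3

h1 = sigmoid(X @ W1 + b1)                        # hidden activations
out = softmax(h1 @ W2 + b2)                      # class probabilities
print(out)                                       # each row sums to 1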

Example

Let's look at an example of a feedforward neural network for Iris classification to further clarify our understanding.

ANN Example

The input layer has 4 neurons corresponding to sepal length, sepal width, petal length, and petal width. Next is a hidden layer of 5 neurons, each connected to all 4 input neurons and performing the computation shown above with the sigmoid function to obtain its activation. This is followed by a second, identical hidden layer with the same number of neurons.

Lastly, the output layer, with 3 neurons corresponding to the class probabilities of an iris being Setosa, Versicolor, or Virginica, is connected to all 5 neurons of the second hidden layer. The output layer performs the same computation with the softmax function to produce the final prediction. The process of making a prediction by propagating values forward through the network is called forward propagation, and it can be expressed as a nested function as follows:

$$
h_o(x) = \sigma_o(z_o(x)), \quad z_o(x) = w_o h_{h2}(x) + b_o \\
h_{h2}(x) = \sigma_{h2}(z_{h2}(x)), \quad z_{h2}(x) = w_{h2} h_{h1}(x) + b_{h2} \\
h_{h1}(x) = \sigma_{h1}(z_{h1}(x)), \quad z_{h1}(x) = w_{h1} x + b_{h1} \\
h_o(x) = \sigma_o(w_o (\sigma_{h2}(w_{h2} (\sigma_{h1}(w_{h1} x + b_{h1})) + b_{h2})) + b_o)
$$

where $h_o(x)$ is the activation of the output layer, i.e., the prediction, and $h_{h1}(x)$ and $h_{h2}(x)$ are the activations of the hidden layers given the input $x$. The above equation also hints at why a non-linear activation function is needed in FNNs, not only to mimic neurons but also for practical reasons. Let's set every activation function $\sigma(x)$ to a linear function $ax$ and see what happens:

$$
h_o(x) = a_o(w_o (a_{h2}(w_{h2} (a_{h1}(w_{h1}x + b_{h1})) + b_{h2})) + b_o) \\
= a_o(w_o (a_{h2}(w_{h2} (a_{h1}w_{h1}x + a_{h1}b_{h1}) + b_{h2})) + b_o) \\
= a_o(w_o (a_{h2}(w_{h2}a_{h1}w_{h1}x + w_{h2}a_{h1}b_{h1} + b_{h2})) + b_o) \\
= a_o(w_o (a_{h2}w_{h2}a_{h1}w_{h1}x + a_{h2}w_{h2}a_{h1}b_{h1} + a_{h2}b_{h2}) + b_o) \\
= a_o(w_o a_{h2}w_{h2}a_{h1}w_{h1}x + w_o a_{h2}w_{h2}a_{h1}b_{h1} + w_o a_{h2}b_{h2} + b_o) \\
= a_o w_o a_{h2}w_{h2}a_{h1}w_{h1}x + a_o w_o a_{h2}w_{h2}a_{h1}b_{h1} + a_o w_o a_{h2}b_{h2} + a_o b_o \\
= (a_o w_o a_{h2}w_{h2}a_{h1}w_{h1})x + (a_o w_o a_{h2}w_{h2}a_{h1}b_{h1} + a_o w_o a_{h2}b_{h2} + a_o b_o)
$$

As you can see, using a linear function for the activation function turns the model into a linear regression model, which cannot fit complex non-linear data. To ensure the model remains capable of fitting complex data, using a non-linear activation function is crucial.
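As a quick numerical sanity check of this collapse, the short NumPy snippet below (an illustration, not from the original post) runs a two-hidden-layer network with the linear "activation" $\sigma(z) = az$ and verifies that its output matches a single affine map built from the combined weights:

import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=(1, 4))

# random weights and biases for two hidden layers and an output layer
W1, b1 = rng.normal(size=(4, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 5)), rng.normal(size=5)
Wo, bo = rng.normal(size=(5, 3)), rng.normal(size=3)

a = 2.0                                   # linear "activation" sigma(z) = a * z
linear = lambda z: a * z

# forward pass with linear activations
h1 = linear(x @ W1 + b1)
h2 = linear(h1 @ W2 + b2)
out = linear(h2 @ Wo + bo)

# the same output from a single affine map: out = x @ W_eq + b_eq
W_eq = a * W1 @ (a * W2) @ (a * Wo)
b_eq = a * b1 @ (a * W2) @ (a * Wo) + a * b2 @ (a * Wo) + a * bo

print(np.allclose(out, x @ W_eq + b_eq))  # True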

Backpropagation

We need to train the model described above to minimize the loss by adjusting the weights and biases, the $w$s and $b$s. This can be accomplished using gradient descent by taking the derivatives of the loss function with respect to the $w$s and $b$s and gradually adjusting the parameters. Hence, we start by computing $\frac{\partial}{\partial w_o} L(y, h_o(x))$ and $\frac{\partial}{\partial b_o} L(y, h_o(x))$, which can be rewritten using the chain rule as:

$$
\frac{\partial}{\partial w_o} L(y, h_o(x)) = \frac{dL}{dh_o} \frac{\partial}{\partial w_o} h_o(x) \\
\frac{\partial}{\partial b_o} L(y, h_o(x)) = \frac{dL}{dh_o} \frac{\partial}{\partial b_o} h_o(x)
$$

The first term, $\frac{dL}{dh_o}$, can be computed using only the model's prediction and the true label. Using the chain rule, we can further decompose the remaining derivative as follows:

$$
\frac{\partial}{\partial w_o} L(y, h_o(x)) = \frac{dL}{dh_o} \frac{dh_o}{dz_o} \frac{\partial}{\partial w_o} z_o(x) \\
\frac{\partial}{\partial b_o} L(y, h_o(x)) = \frac{dL}{dh_o} \frac{dh_o}{dz_o} \frac{\partial}{\partial b_o} z_o(x)
$$

The second term, $\frac{dh_o}{dz_o}$, can also be computed easily by evaluating the derivative of the activation function at the pre-activation output $z_o$. Hence, the gradient of the loss function with respect to $w_o$ and $b_o$ is:

$$
\frac{\partial}{\partial w_o} L(y, h_o(x)) = \frac{dL}{dh_o} \frac{dh_o}{dz_o} h_{h2}(x) \\
\frac{\partial}{\partial b_o} L(y, h_o(x)) = \frac{dL}{dh_o} \frac{dh_o}{dz_o}
$$

All of the components in the above equation are easily obtainable by storing the output and the activations of the previous hidden layer. Now that we have computed the gradients with respect to $w_o$ and $b_o$, we need to find those for $w_{h2}$ and $b_{h2}$:

$$
\frac{\partial}{\partial w_{h2}} L(y, h_o(x)) = \frac{dL}{dh_o} \frac{\partial}{\partial w_{h2}} h_o(x) \\
= \frac{dL}{dh_o} \frac{dh_o}{dz_o} \frac{\partial}{\partial w_{h2}} z_o(x) \\
= \frac{dL}{dh_o} \frac{dh_o}{dz_o} \frac{dz_o}{dh_{h2}} \frac{\partial}{\partial w_{h2}} h_{h2}(x) \\
\frac{\partial}{\partial b_{h2}} L(y, h_o(x)) = \frac{dL}{dh_o} \frac{dh_o}{dz_o} \frac{dz_o}{dh_{h2}} \frac{\partial}{\partial b_{h2}} h_{h2}(x)
$$

Notice that the first two terms have already been calculated while computing the gradient for $b_o$, and the third term $\frac{dz_o}{dh_{h2}}$ is just $w_o$. Hence, we only need the additional last term to get the partial gradients for $w_{h2}$ and $b_{h2}$, and it can be further broken down using the chain rule as follows:

$$
\frac{\partial}{\partial w_{h2}} L(y, h_o(x)) = \frac{dL}{dh_o} \frac{dh_o}{dz_o} \frac{dz_o}{dh_{h2}} \frac{dh_{h2}}{dz_{h2}} \frac{\partial}{\partial w_{h2}} z_{h2}(x) \\
\frac{\partial}{\partial b_{h2}} L(y, h_o(x)) = \frac{dL}{dh_o} \frac{dh_o}{dz_o} \frac{dz_o}{dh_{h2}} \frac{dh_{h2}}{dz_{h2}} \frac{\partial}{\partial b_{h2}} z_{h2}(x)
$$

Given the above, the partial derivatives of the loss function with respect to $w_{h2}$ and $b_{h2}$ are:

$$
\frac{\partial}{\partial w_{h2}} L(y, h_o(x)) = \frac{dL}{dh_o} \frac{dh_o}{dz_o} \frac{dz_o}{dh_{h2}} \frac{dh_{h2}}{dz_{h2}} h_{h1}(x) \\
\frac{\partial}{\partial b_{h2}} L(y, h_o(x)) = \frac{dL}{dh_o} \frac{dh_o}{dz_o} \frac{dz_o}{dh_{h2}} \frac{dh_{h2}}{dz_{h2}}
$$

We can keep applying the chain rule for $w_{h1}$ and $b_{h1}$ in the same way:

$$
\frac{\partial}{\partial w_{h1}} L(y, h_o(x)) = \frac{dL}{dh_o} \frac{dh_o}{dz_o} \frac{dz_o}{dh_{h2}} \frac{dh_{h2}}{dz_{h2}} \frac{dz_{h2}}{dh_{h1}} \frac{dh_{h1}}{dz_{h1}} x \\
\frac{\partial}{\partial b_{h1}} L(y, h_o(x)) = \frac{dL}{dh_o} \frac{dh_o}{dz_o} \frac{dz_o}{dh_{h2}} \frac{dh_{h2}}{dz_{h2}} \frac{dz_{h2}}{dh_{h1}} \frac{dh_{h1}}{dz_{h1}}
$$

Can you see the pattern here? For each additional hidden layer, we apply the chain rule and multiply by $\frac{dz_h}{dh_{h-1}} \frac{dh_{h-1}}{dz_{h-1}}$. Regardless of how many hidden layers there are, we can keep using the chain rule to compute the gradients and update the weights accordingly. Because we start computing gradients at the output and propagate them backward through the network while applying the chain rule, this mechanism is called backpropagation. It is essential for gradient descent and other learning mechanisms that require gradients to update the weights of an artificial neural network.
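As a worked example of the first two factors, which the implementation below relies on, consider a softmax output $h_o = \mathrm{softmax}(z_o)$ trained with the cross-entropy loss $L = -\sum_k y_k \log h_{o,k}$ for a one-hot label $y$ (this pairing is assumed here, though it matches the classification setup used later). Combining the two factors, written component-wise for class $k$ (with $\delta_{jk}$ the Kronecker delta), gives a remarkably simple result:

$$
\frac{\partial L}{\partial z_{o,k}} = \sum_j \frac{\partial L}{\partial h_{o,j}} \frac{\partial h_{o,j}}{\partial z_{o,k}} = \sum_j \left(-\frac{y_j}{h_{o,j}}\right) h_{o,j} (\delta_{jk} - h_{o,k}) = h_{o,k} \sum_j y_j - y_k = h_{o,k} - y_k
$$

This is why the code below can initialize the error term with pred - y; the same simplification holds for a sigmoid output paired with binary cross-entropy.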

Code Implementation

Let's try implementing a simple feedforward neural network from scratch using the concepts discussed above. The task is to classify the species of Iris in the Iris dataset (Steps 1 and 2 are omitted as we have covered them in past articles).

Step 3. Model

First, we initialize the necessary parameters and define the activation functions as shown below:

import numpy as np

class FNNClassifier():
  def __init__(self, input_dim, hidden_dims, output_dim, lr=0.001):
    self.lr = lr
    self.history = []
    self.W = []
    self.b = []
    # weights and biases for the hidden layers
    for i in range(len(hidden_dims)):
      if i == 0:
        self.W.append(np.random.rand(input_dim, hidden_dims[i]))
      else:
        self.W.append(np.random.rand(hidden_dims[i-1], hidden_dims[i]))
      self.b.append(np.random.rand(hidden_dims[i]))

    # weights and biases for the output layer
    self.W.append(np.random.rand(hidden_dims[-1], output_dim))
    self.b.append(np.random.rand(output_dim))

  def sigmoid(self, X):
    return 1 / (1 + np.exp(-X))

  def softmax(self, X):
    # log-sum-exp trick for numerical stability
    max_logits = np.max(X, axis=1, keepdims=True)
    stabilized_logits = X - max_logits

    odds = np.exp(stabilized_logits)
    total_odds = np.sum(odds, axis=1, keepdims=True)
    return odds / total_odds

The number of hidden layers and the number of neurons in each layer are specified through the hidden_dims parameter, while input_dim and output_dim set the sizes of the input and output layers. If you are not familiar with the activation functions, refer to the previous articles on logistic regression and softmax regression. Next, we define the predict function, which performs forward propagation using matrix multiplications.

def predict(self, X, train=False):
    # keep track of activations for calculating the gradients
    activations = [X]
    for i in range(len(self.W)):
        X = np.matmul(X, self.W[i]) + self.b[i]
        if i == len(self.W) - 1:
            # either softmax or sigmoid depending on the number of classes
            if self.b[i].shape[0] > 1:
                X = self.softmax(X)
            else:
                X = self.sigmoid(X)
                X = X.flatten()
        else:
            # use sigmoid activation functions for hidden layers
            X = self.sigmoid(X)
        if train:
            activations.append(X)
    if not train:
        return X
    return X, activations

FNNClassifier.predict = predict

Depending on whether the task is multiclass or binary classification, we need to change the activation function of the output layer. For the hidden layers, we will stick to the sigmoid function. Then, we define the fit function for training the model using backpropagation and gradient descent.

# log_loss is the cross-entropy loss; assuming it comes from scikit-learn
from sklearn.metrics import log_loss

def fit(self, X, y, epochs=100, verbose=True):
    for epoch in range(epochs):
        pred, activations = self.predict(X, train=True)
        self.history.append(log_loss(y, pred))
        if verbose:
            print(f"Epoch: {epoch}, Loss: {self.history[-1]}")

        # gradient of the loss with respect to the pre-activation output z_o
        # (works for both softmax and sigmoid outputs with cross-entropy)
        delta = pred - y

        for i in range(len(self.W)-1, -1, -1):
            grad_W = np.matmul(activations[i].T, delta)
            grad_b = np.sum(delta, axis=0)

            # backpropagate the error to the previous layer
            if i > 0:
                # dz_t/dh_{t-1} = self.W[i].T
                # dh_{t-1}/dz_{t-1} = activations[i] * (1 - activations[i]) for sigmoid
                delta = np.matmul(delta, self.W[i].T) * activations[i] * (1 - activations[i])

            # gradient descent update
            self.W[i] -= self.lr * grad_W
            self.b[i] -= self.lr * grad_b

    return self.history

FNNClassifier.fit = fit

Using the above definition, we can initialize and fit a feedforward neural network to the training data as follows:

fnn = FNNClassifier(4, [32], 3, lr=0.001)
 
r = fnn.fit(X_train, y_train, epochs=1000, verbose=False)

The loss over epochs during the training can be plotted as shown below:

import matplotlib.pyplot as plt
plt.plot(r)
plt.title("Loss vs Epoch")
plt.ylabel("Loss")
plt.xlabel("Epoch")
plt.show()
FNN Loss over Epoch

You can see from the above plot that the model has successfully reduced the loss, although it struggled a bit along the way.

Step 4. Model Evaluation

Let's make predictions on the test dataset using the neural network we just trained.

pred = fnn.predict(X_test, train=False)
 
pred = np.argmax(pred, axis=1)
y_test = np.argmax(y_test, axis=1)

Then, we can use a confusion matrix to visualize the results of the predictions.
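The plotting code for the confusion matrix is not shown here; a typical sketch using scikit-learn's confusion_matrix and ConfusionMatrixDisplay (assuming pred and y_test are the integer class labels computed above, in the standard Iris class order) could look like this:

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=["Setosa", "Versicolor", "Virginica"])
disp.plot()
plt.show()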

FNN Confusion Matrix

As you can see from the above, the model achieved perfect predictions, demonstrating the potential of artificial neural networks.

However, if you are following along and running the code in your notebook, you might observe that the results are very sensitive to the model's hyperparameters and can vary each time you train the model. (In fact, the above implementation often performs worse than a simple softmax regression model.) This is due to several factors, which I will cover and address in future articles.