The blog post introduces the basics of feedforward neural networks in machine learning.

Deep Learning
We want machine learning models to be as intelligent as possible so that they can capture complex relationships between variables. How can we accomplish this? We can take inspiration from the most intelligent mechanism we know: our brains.
Much about how the brain works is still largely unknown, but we do know that the brain contains a vast number of interconnected neurons. These neurons receive activations from other neurons, activate depending on the activations of others, and send activations to subsequent neurons. In formulating an artificial neural network, we can replace neurons in this neural network architecture with functions.
Deep learning is the subset of machine learning that studies artificial neural networks, and it's no exaggeration to say that the rapid improvement in this field, combined with impressive performance made possible by hardware innovations, has sparked the recent AI spring.
Feedforward Neural Networks
The most basic artificial neural network is the feedforward neural network (FNN), where the flow of information is unidirectional. An FNN consists of an input layer, hidden layers, and an output layer. The input layer contains neurons associated with the features of the data. All input neurons are connected to each neuron in the next hidden layer of the FNN, which applies a linear function to the input neurons.
These hidden neurons then apply a non-linear activation function, typically a sigmoid function, to mimic the activation of neurons by converting values from the range $(-\infty, \infty)$ to $(0, 1)$. We hope that the neurons in the hidden layer capture the underlying features of the data that can help perform predictions. An arbitrary number of hidden layers with an arbitrary number of neurons can be connected to perform the same computation, which can be expressed mathematically as:

$$h_t = f(z_t), \qquad z_t = W_t h_{t-1} + b_t$$

where $h_t$ represents the activations of the neurons in layer $t$, $f$ is the non-linear activation function, and $z_t$ is the result of applying the linear function to the activations of the previous layer $h_{t-1}$ using the weights $W_t$ and biases $b_t$ of layer $t$. We hope that as we stack the hidden layers, the model picks up higher-level features from the lower-level features.
Then, the output layer can take the activations of the last hidden layer and apply the same computations with appropriate activation functions depending on the task. We might avoid using an activation function for regression to keep the output value unbounded, use a sigmoid function to output probabilities for binary classification, and use a softmax function to output probabilities of multiple categories for multiclass classification.
Example
Let's look at an example of a feedforward neural network for Iris classification to further clarify our understanding.

The input layer has 4 neurons corresponding to sepal length, sepal width, petal length, and petal width. Then, there is a hidden layer containing 5 neurons, each connected to all 4 input neurons, which performs the computation shown above with the sigmoid function to get the activations. It is followed by a second, identical hidden layer with 5 neurons.
Lastly, the output layer, with 3 neurons corresponding to the class probabilities of an iris being Setosa, Versicolor, or Virginica, is connected to all 5 neurons in the second hidden layer. This output layer performs the same computation with the softmax function to get the final prediction. The process of making a prediction by propagating the input forward through the network is called forward propagation, and it can be expressed as a nested function as follows:

$$\hat{y} = f_3(W_3 h_2 + b_3) = f_3(W_3 f_2(W_2 f_1(W_1 x + b_1) + b_2) + b_3)$$

where $\hat{y}$ is the activation of the output layer, or the prediction, and $h_t = f_t(W_t h_{t-1} + b_t)$ is the activation of hidden layer $t$ given the input $x$ (with $h_0 = x$). The above equation implies why a non-linear activation function needs to be used in FNNs, not only to mimic neurons but also for practical reasons. Let's set all of the activation functions $f_1$, $f_2$, and $f_3$ to be the identity function $f(z) = z$ and see what happens:

$$\hat{y} = W_3(W_2(W_1 x + b_1) + b_2) + b_3 = W_3 W_2 W_1 x + W_3 W_2 b_1 + W_3 b_2 + b_3 = W'x + b'$$

As you can see, using a linear function as the activation function turns the model into a linear regression model ($W' = W_3 W_2 W_1$ and $b' = W_3 W_2 b_1 + W_3 b_2 + b_3$ are just another weight matrix and bias vector), which cannot fit complex non-linear data. To ensure the model remains capable of fitting complex data, using a non-linear activation function is crucial.
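As a quick numerical check of this collapse (a throwaway NumPy sketch, not part of the classifier built later), we can verify that two stacked linear layers produce exactly the same outputs as a single equivalent linear layer:

import numpy as np

rng = np.random.default_rng(0)
x = rng.random((2, 4))                      # toy batch: 2 samples, 4 features
W1, b1 = rng.random((4, 5)), rng.random(5)
W2, b2 = rng.random((5, 3)), rng.random(3)

# two linear layers with no activation in between
two_layers = (x @ W1 + b1) @ W2 + b2

# a single equivalent linear layer (samples are rows, so W' = W2 W1 becomes W1 @ W2 here)
W_prime = W1 @ W2
b_prime = b1 @ W2 + b2
one_layer = x @ W_prime + b_prime

print(np.allclose(two_layers, one_layer))   # True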
Backpropagation
We need to train the model described above to minimize the loss $L$ by adjusting the weights and biases, the $W$s and $b$s. This can be accomplished using gradient descent, by taking the derivatives of the loss function with respect to the $W$s and $b$s and gradually adjusting the parameters. Hence, we start by computing $\frac{\partial L}{\partial W_3}$ and $\frac{\partial L}{\partial b_3}$ for the output layer, which can be rewritten using the chain rule as:

$$\frac{\partial L}{\partial W_3} = \frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial W_3}, \qquad \frac{\partial L}{\partial b_3} = \frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial b_3}$$

The first term, $\frac{\partial L}{\partial \hat{y}}$, can be computed using only the predicted value and the target value. Using the chain rule, you can further decompose the derivative as follows:

$$\frac{\partial L}{\partial W_3} = \frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z_3}\frac{\partial z_3}{\partial W_3}, \qquad \frac{\partial L}{\partial b_3} = \frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z_3}\frac{\partial z_3}{\partial b_3}$$

The second term, $\frac{\partial \hat{y}}{\partial z_3}$, can also be easily computed by taking the gradient of the activation function using the output before applying the activation function, $z_3$. Since $z_3 = W_3 h_2 + b_3$, the last terms are $\frac{\partial z_3}{\partial W_3} = h_2$ and $\frac{\partial z_3}{\partial b_3} = 1$. Hence, the gradient of the loss function with respect to $W_3$ and $b_3$ is:

$$\frac{\partial L}{\partial W_3} = \frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z_3}h_2, \qquad \frac{\partial L}{\partial b_3} = \frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z_3}$$

All of the components in the above equation are easily obtainable by storing the values of the output and the activations of the previous hidden layer. Now that we have computed the gradients with respect to $W_3$ and $b_3$, we need to find those for $W_2$ and $b_2$:

$$\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z_3}\frac{\partial z_3}{\partial h_2}\frac{\partial h_2}{\partial W_2}, \qquad \frac{\partial L}{\partial b_2} = \frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z_3}\frac{\partial z_3}{\partial h_2}\frac{\partial h_2}{\partial b_2}$$

Notice here that the first two terms have already been calculated as the gradient for $b_3$, and the third term is just $\frac{\partial z_3}{\partial h_2} = W_3$. Hence, we only need to compute the last term additionally to get the partial gradients for $W_2$ and $b_2$, which can be further broken down using the chain rule as follows:

$$\frac{\partial h_2}{\partial W_2} = \frac{\partial h_2}{\partial z_2}\frac{\partial z_2}{\partial W_2} = \frac{\partial h_2}{\partial z_2}h_1, \qquad \frac{\partial h_2}{\partial b_2} = \frac{\partial h_2}{\partial z_2}\frac{\partial z_2}{\partial b_2} = \frac{\partial h_2}{\partial z_2}$$

Given the above, the partial derivatives of the loss function with respect to $W_2$ and $b_2$ are:

$$\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z_3}W_3\frac{\partial h_2}{\partial z_2}h_1, \qquad \frac{\partial L}{\partial b_2} = \frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z_3}W_3\frac{\partial h_2}{\partial z_2}$$

We can keep applying the chain rule for $W_1$ and $b_1$ in the same way:

$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z_3}W_3\frac{\partial h_2}{\partial z_2}W_2\frac{\partial h_1}{\partial z_1}x, \qquad \frac{\partial L}{\partial b_1} = \frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial z_3}W_3\frac{\partial h_2}{\partial z_2}W_2\frac{\partial h_1}{\partial z_1}$$
Can you see the pattern here? For each additional hidden layer $t$, we use the chain rule and multiply by $W_{t+1}\frac{\partial h_t}{\partial z_t}$. Regardless of how many hidden layers there are, we can keep utilizing the chain rule to compute the gradients and update the weights accordingly. Since we start computing gradients from the output and propagate backward through the network to apply the chain rule, this mechanism is called backpropagation, and it is essential for applying gradient descent and other similar learning mechanisms that require gradients to update the weights of an artificial neural network.
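Before moving on to the full implementation, here is a minimal sketch of these formulas for a single training example and a single hidden layer, assuming a sigmoid hidden activation, a softmax output, and the cross-entropy loss (so that the output-layer error simplifies to $\hat{y} - y$); all variable names are illustrative.

import numpy as np

rng = np.random.default_rng(1)
x = rng.random((1, 4))                      # one sample with 4 features
y = np.array([[0.0, 1.0, 0.0]])             # one-hot target over 3 classes
W1, b1 = rng.random((4, 5)), rng.random((1, 5))
W2, b2 = rng.random((5, 3)), rng.random((1, 3))

# forward propagation
z1 = x @ W1 + b1
h1 = 1 / (1 + np.exp(-z1))                  # sigmoid hidden activation
z2 = h1 @ W2 + b2
y_hat = np.exp(z2) / np.exp(z2).sum()       # softmax output

# backpropagation following the chain rule above
delta2 = y_hat - y                          # dL/dz2 for softmax + cross-entropy
grad_W2 = h1.T @ delta2                     # dL/dW2
grad_b2 = delta2                            # dL/db2
delta1 = (delta2 @ W2.T) * h1 * (1 - h1)    # multiply by W2 and the sigmoid derivative
grad_W1 = x.T @ delta1                      # dL/dW1
grad_b1 = delta1                            # dL/db1
print(grad_W1.shape, grad_W2.shape)         # (4, 5) (5, 3)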
Code Implementation
Let's try implementing a simple feedforward neural network from scratch using the concepts discussed above. The task is to classify the species of Iris in the Iris dataset (Steps 1 and 2 are omitted as we have covered them in past articles).
Step 3. Model
First, you need to initialize the necessary parameters and activation functions as shown below:
import numpy as np

class FNNClassifier():
    def __init__(self, input_dim, hidden_dims, output_dim, lr=0.001):
        self.lr = lr
        self.history = []
        # one weight matrix and bias vector per layer, randomly initialized
        self.W = []
        self.b = []
        for i in range(len(hidden_dims)):
            if i == 0:
                self.W.append(np.random.rand(input_dim, hidden_dims[i]))
            else:
                self.W.append(np.random.rand(hidden_dims[i-1], hidden_dims[i]))
            self.b.append(np.random.rand(hidden_dims[i]))
        self.W.append(np.random.rand(hidden_dims[len(hidden_dims)-1], output_dim))
        self.b.append(np.random.rand(output_dim))

    def sigmoid(self, X):
        return 1 / (1 + np.exp(-X))

    def softmax(self, X):
        # log-sum-exp trick for numerical stability
        max_logits = np.max(X, axis=1, keepdims=True)
        stabilized_logits = X - max_logits
        odds = np.exp(stabilized_logits)
        total_odds = np.sum(odds, axis=1, keepdims=True)
        return odds / total_odds
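As a brief aside, the log-sum-exp trick in softmax above matters in practice: subtracting the row-wise maximum leaves the probabilities mathematically unchanged but prevents overflow for large logits. The following throwaway check (assuming the class defined above) illustrates the difference:

import numpy as np

def naive_softmax(X):
    e = np.exp(X)
    return e / e.sum(axis=1, keepdims=True)

logits = np.array([[1.0, 2.0, 3.0]])
big_logits = np.array([[1000.0, 1001.0, 1002.0]])

model = FNNClassifier(4, [5], 3)
print(np.allclose(naive_softmax(logits), model.softmax(logits)))  # True for small logits
print(naive_softmax(big_logits))   # overflows to nan (with a RuntimeWarning)
print(model.softmax(big_logits))   # still well-defined probabilities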
The sizes of the input and output layers are specified by input_dim and output_dim, while hidden_dims specifies the number of hidden layers and the number of neurons in each of them. If you are not familiar with the activation functions, refer to the previous articles on logistic regression and softmax regression. Next, we define the predict function, which performs forward propagation using matrix multiplications.
def predict(self, X, train=False):
    # keep track of activations for calculating the gradients
    activations = [X]
    for i in range(len(self.W)):
        X = np.matmul(X, self.W[i]) + self.b[i]
        if i == len(self.W) - 1:
            # output layer: either softmax or sigmoid depending on the number of classes
            if self.b[i].shape[0] > 1:
                X = self.softmax(X)
            else:
                X = self.sigmoid(X)
                X = X.flatten()
        else:
            # use sigmoid activation functions for the hidden layers
            X = self.sigmoid(X)
        if train:
            activations.append(X)
    if not train:
        return X
    else:
        return X, activations

FNNClassifier.predict = predict
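Before training, it can be reassuring to run a quick smoke test: forward a random batch through an untrained model and confirm that the output has the expected shape and that each row of predicted probabilities sums to one. This is a hypothetical check, not part of the original workflow:

import numpy as np

fnn_check = FNNClassifier(4, [5, 5], 3)
X_dummy = np.random.rand(10, 4)            # 10 fake samples with 4 features
probs = fnn_check.predict(X_dummy)
print(probs.shape)                         # (10, 3)
print(np.allclose(probs.sum(axis=1), 1))   # True: softmax rows sum to one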
Depending on whether the task is multiclass or binary classification, we need to change the activation function of the output layer. For the hidden layers, we will stick to the sigmoid function. Then, we define the fit function for training the model using backpropagation and gradient descent.
from sklearn.metrics import log_loss  # cross-entropy loss

def fit(self, X, y, epochs=100, verbose=True):
    for epoch in range(epochs):
        # forward propagation, keeping the activations for backpropagation
        pred, activations = self.predict(X, train=True)
        self.history.append(log_loss(y, pred))
        if verbose:
            print(f"Epoch: {epoch}, Loss: {self.history[-1]}")
        # gradient of the loss with respect to the output layer's pre-activation;
        # works both for softmax and sigmoid outputs with cross-entropy loss
        delta = pred - y
        for i in range(len(self.W)-1, -1, -1):
            grad_W = np.matmul(activations[i].T, delta)
            grad_b = np.sum(delta, axis=0)
            # backpropagate the error to the previous layer
            if i > 0:
                # dz_t/dh_t-1 = self.W[i].T
                # dh_t-1/dz_t-1 = activations[i] * (1 - activations[i]) for h = sigmoid
                delta = np.matmul(delta, self.W[i].T) * activations[i] * (1 - activations[i])
            # gradient descent update
            self.W[i] -= self.lr * grad_W
            self.b[i] -= self.lr * grad_b
    return self.history

FNNClassifier.fit = fit
Using the above definition, we can initialize and fit a feedforward neural network to the training data as follows:
fnn = FNNClassifier(4, [32], 3, lr=0.001)
r = fnn.fit(X_train, y_train, epochs=1000, verbose=False)
The loss over epochs during the training can be plotted as shown below:
import matplotlib.pyplot as plt
plt.plot(r)
plt.title("Loss vs Epoch")
plt.ylabel("Loss")
plt.xlabel("Epoch")
plt.show()

You can see from the above plot that the model has successfully reduced the loss, although it struggled a bit along the way.
Step 4. Model Evaluation
Let's make predictions on the test dataset using the neural network we just trained.
pred = fnn.predict(X_test, train=False)
pred = np.argmax(pred, axis=1)
y_test = np.argmax(y_test, axis=1)
Then, we can use a confusion matrix to visualize the results of the predictions.
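One way to produce such a plot is with scikit-learn's confusion_matrix and ConfusionMatrixDisplay; the display labels below are an assumption about how the classes were encoded:

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, pred)
ConfusionMatrixDisplay(cm, display_labels=["Setosa", "Versicolor", "Virginica"]).plot()
plt.show()
print("Accuracy:", accuracy_score(y_test, pred))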

As you can see from the above, the model achieved perfect predictions, demonstrating the potential of artificial neural networks.
However, if you are following along and running the code in your notebook, you might observe that the results are very sensitive to the model's hyperparameters and can vary each time you train the model. (In fact, the above implementation often performs worse than a simple softmax regression model.) This is due to several factors, which I will cover and address in future articles.