This blog post introduces the linear regression model and gradient descent.

In the last article, "What Machine Learning Truly Looks Like," we touched on linear regression. This article will get you into the coding and cover additional important concepts in machine learning.
Recap
In the last article, we discussed the basic machine learning pipeline. This article will go through the same pipeline to demonstrate how a linear regression model can be used. As you read, open a Google Colab notebook and follow along. At the top of the notebook, copy and paste the code below:
import sklearn
import numpy as np
import pandas as pd
import seaborn as sns
The code above gives you access to useful machine learning and data analysis packages in Python.
Linear Regression
Step 1. Data Exploration
There are some datasets famously used in the field of machine learning, and the Iris dataset is one of them. The dataset contains 50 rows of data for each of the 3 species in the Iris genus (150 rows in total), namely Iris setosa, Iris versicolor, and Iris virginica. To load the dataset, use the following code:
# Import Dataset from sklearn
from sklearn.datasets import load_iris
# Load Iris Data
iris = load_iris()
# Creating pd DataFrames
iris_df = pd.DataFrame(data= iris.data, columns= iris.feature_names)
target_df = pd.DataFrame(data= iris.target, columns= ['species'])

# Convert the species column from numerical codes to names
def converter(specie):
    if specie == 0:
        return 'setosa'
    elif specie == 1:
        return 'versicolor'
    else:
        return 'virginica'

target_df['species'] = target_df['species'].apply(converter)
# Concatenate the DataFrames
iris_df = pd.concat([iris_df, target_df], axis= 1)
# Display the data
iris_df
If you run the code above in Google Colab, you should see something like the following:
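The output is a table with 150 rows: the four measurement columns plus the species name. If you would like to double-check the data programmatically rather than by eye, a quick optional sanity check looks like this:
# Optional sanity check: dataset dimensions and rows per species
print(iris_df.shape)                      # expected: (150, 5)
print(iris_df['species'].value_counts())  # expected: 50 rows per species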

Let's say our task is to predict the sepal length from the other variables. As we did last time, let's visualize the data to gain some insights. To do so, you can use the following code:
sns.pairplot(iris_df, hue= 'species')

You can observe that sepal length appears to have different distributions across species, and it also appears related to the other variables. We can use this information to choose which variables to include in the model.
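If you want to quantify these relationships rather than judge them from the pairplot alone, one quick optional check is the correlation of each numeric variable with sepal length:
# Correlation of each numeric variable with sepal length
corr = iris_df.drop(columns='species').corr()
print(corr['sepal length (cm)'].sort_values(ascending=False))
You will likely find that the petal measurements correlate with sepal length much more strongly than sepal width does.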
Step 2. Data Preprocessing
Let's move on to data preprocessing based on the information we got from the previous step. The pairplot suggests that each of the other variables carries some information about sepal length, so we will keep all of them rather than dropping any column from the data. However, we still have some work to do.
First, we need to convert the species variable back to a numerical variable. Also, we need to isolate the variable that we want to predict, in this case, sepal length.
# Converting Species to Numerical Values
iris_df.drop('species', axis= 1, inplace= True)
target_df = pd.DataFrame(columns= ['species'], data= iris.target)
iris_df = pd.concat([iris_df, target_df], axis= 1)
# Separate Sepal Length
X= iris_df.drop(labels= 'sepal length (cm)', axis= 1)
y= iris_df['sepal length (cm)']
We tend to use X for explanatory variables and y for the outcome variable we want to predict. Additionally, we need to split the dataset into training data and testing data.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.33, random_state= 101)
In the code above, we are splitting the dataset randomly so that the training data and testing data will have 67% and 33% of the original dataset, respectively.
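If you want to confirm the split, you can check the shapes; with 150 rows you should get roughly 100 training rows and 50 testing rows:
# Check the sizes of the resulting splits
print(X_train.shape, X_test.shape)  # roughly (100, 4) and (50, 4)
print(y_train.shape, y_test.shape)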
Step 3. Model
Conveniently, sklearn already has a predefined object, LinearRegression, that allows us to abstract away a lot of the math behind the scenes. However, the math is basically the same as what we covered in the previous article. This time, we have multiple features:

$$\hat{y} = w_1 x_1 + w_2 x_2 + \dots + w_m x_m + b$$

and we have to optimize for more coefficients by taking partial derivatives and solving the resulting system of equations. The following is how we achieve this in code.
from sklearn.linear_model import LinearRegression
# Instantiating LinearRegression() Model
lr = LinearRegression()
# Training/Fitting the Model
lr.fit(X_train, y_train)
As we covered in the last article, we only use the training dataset for model training.
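Once the model is fitted, you can peek at what it learned. LinearRegression stores the slope coefficients in coef_ and the intercept in intercept_, so a quick inspection looks like this:
# Inspect the fitted parameters (one coefficient per explanatory variable)
print(pd.Series(lr.coef_, index=X.columns))
print('Intercept:', lr.intercept_)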
Step 4. Evaluation
Now that we have finished training the model, we are ready to test it. First, we have to make predictions of sepal length using the model so that they can be compared against the real values.
# Making Predictions
pred = lr.predict(X_test)
Let's use both MSE (Mean Squared Error) and MAE (Mean Absolute Error) as metrics for the task.
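For reference, both metrics compare the predictions $\hat{y}_i$ against the true values $y_i$ over the $n$ test samples:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert, \qquad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

MAE stays in the original units (centimeters here), while MSE penalizes large errors more heavily.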
from sklearn.metrics import mean_absolute_error, mean_squared_error
print('Mean Absolute Error:', mean_absolute_error(y_test, pred))
print('Mean Squared Error:', mean_squared_error(y_test, pred))
# Result
# Mean Absolute Error: 0.26498350887555133
# Mean Squared Error: 0.10652500975036944
This is it for the linear regression model. It's pretty simple, isn't it?
Linear Regression with Gradient Descent
We used a predefined model from sklearn, but we can also build one ourselves. Here, I will demonstrate how to make your own model using a different learning algorithm called gradient descent.
As you may have noticed, solving a system of equations to find the best set of parameters becomes increasingly difficult as the number of parameters increases. Hence, we want an easier way of getting to the best set of parameters, and one such way is gradient descent.
What is Gradient Descent?
Let's say we have a cost function shaped like a U-curve with a single minimum point.
What we want to find are the parameters that lead to that minimum point. Suppose we choose a random parameter and the cost comes out to 8.5. If we compute the gradient, we can find the line that touches the curve at that point, that is, the tangent line.
Here, we can notice that by going left, and thus downward along the slope, we get closer to the local minimum. If the slope were negative, we could go right to move downward along the slope and get closer to the local minimum. This means that we can always approach the local minimum by going in the opposite direction of the slope. (If the slope is positive, go in the negative direction, and vice versa.) By repeating these steps, we can get quite close to the local minimum. This is the basic idea behind gradient descent. We can express it with the following update rule:

$$\theta_{t+1} = \theta_t - \alpha \frac{\partial J(\theta_t)}{\partial \theta_t}$$

where $t$ is the time step, $J$ is the cost function, $\theta$ is a parameter, and $\alpha$ is the learning rate. However, if we go too far in the opposite direction, we will cross to the other side of the curve and never reach the local minimum. Ideally, we want to take small steps in the opposite direction so that we gradually descend to the local minimum. This is achieved by adjusting the learning rate, $\alpha$: we multiply the gradient by the learning rate so that the steps become smaller.
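To make the update rule concrete, here is a tiny standalone sketch that applies it to a simple one-dimensional cost function, $J(\theta) = (\theta - 3)^2$ (a made-up example, not our regression problem):
# Minimal gradient descent on J(theta) = (theta - 3)^2, whose gradient is 2*(theta - 3)
theta = 8.5   # arbitrary starting point
alpha = 0.1   # learning rate
for t in range(100):
    theta = theta - alpha * 2 * (theta - 3)
print(theta)  # converges toward 3, the minimizer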
Computing the Gradient
Then, how do we determine the gradient of the cost function? We can do so by taking the partial derivatives with respect to the parameters. Let's use MSE as the cost function and determine the partial derivatives. The equation for MSE is:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2$$

where $\hat{y}_i = w_1 x_{i1} + \dots + w_m x_{im} + b$ is the model's prediction for the $i$-th sample. Let's use the chain rule to take the partial derivatives:

$$\frac{\partial\,\mathrm{MSE}}{\partial w_j} = \frac{2}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)\frac{\partial \hat{y}_i}{\partial w_j}, \qquad \frac{\partial\,\mathrm{MSE}}{\partial b} = \frac{2}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)\frac{\partial \hat{y}_i}{\partial b}$$

As $\hat{y}_i$ is a linear function, its partial derivatives are just $x_{ij}$ for the slopes and 1 for the intercept. Thus, the partial derivatives are

$$\frac{\partial\,\mathrm{MSE}}{\partial w_j} = \frac{2}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)x_{ij}$$

for the slopes and

$$\frac{\partial\,\mathrm{MSE}}{\partial b} = \frac{2}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)$$

for the intercept. We can multiply these values by the learning rate and subtract them from the parameters. (In the implementation below, the constant factor of 2 is dropped; it only rescales the step size and is effectively absorbed into the learning rate.)
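If you want to convince yourself that the formulas are right, one optional way is to compare them against a numerical (finite-difference) gradient on a tiny made-up dataset. The sketch below does this for the intercept gradient; the two values should agree closely:
# Verify the analytic MSE gradient against finite differences on toy data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 2))   # 5 samples, 2 features (made-up numbers)
y_demo = rng.normal(size=5)
W_demo, b_demo = np.array([0.5, -0.2]), 0.1

def mse_demo(W, b):
    return np.mean((X_demo @ W + b - y_demo) ** 2)

diff = X_demo @ W_demo + b_demo - y_demo
grad_b = (2 / len(y_demo)) * diff.sum()   # analytic intercept gradient
eps = 1e-6
num_grad_b = (mse_demo(W_demo, b_demo + eps) - mse_demo(W_demo, b_demo - eps)) / (2 * eps)
print(grad_b, num_grad_b)                 # the two values should match closely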
Coding Implementation
Using the above concept, we can make our own linear regression model with gradient descent.
class LinearRegressionGD():
    def __init__(self, lr=0.01):
        self.W = None         # Slope coefficients, initialized in fit()
        self.b = 0            # Intercept
        self.lr = lr          # Learning rate
        self.history = []     # History of the loss over epochs

    def predict(self, X):
        # Weighted sum of the features plus the intercept
        return np.asarray(X) @ self.W + self.b

    def fit(self, X, y, epochs=100):
        X, y = np.asarray(X), np.asarray(y)
        self.W = np.zeros(X.shape[1])   # Start all parameters at zero
        n = len(y)
        for i in range(epochs):
            pred = self.predict(X)
            self.history.append(mean_squared_error(y, pred))
            diff = pred - y
            # Gradients of the MSE (the constant 2 is absorbed into the learning rate)
            grad_W = np.sum((1/n) * diff[:, np.newaxis] * X, axis=0)
            grad_b = np.sum((1/n) * diff)
            # Move in the opposite direction of the gradient
            self.W -= self.lr * grad_W
            self.b -= self.lr * grad_b
        return self.history
Let's fit the above model to the training data and see the result.
# Instantiating LinearRegressionGD() Model
lrgd = LinearRegressionGD()
# Training/Fitting the Model
history = lrgd.fit(X_train, y_train)
As we keep track of the MSE over each epoch, we can plot it to see how gradient descent is working.
import matplotlib.pyplot as plt
plt.plot(history)
plt.title("MSE vs Epoch")
plt.ylabel("MSE")
plt.xlabel("Epoch")
plt.show()

We can see from the above plot that the MSE gradually decreased over the epochs, which is a good sign that gradient descent is working. Let's evaluate the model with MSE and MAE.
pred = lrgd.predict(X_test)
print('Mean Absolute Error:', mean_absolute_error(y_test, pred))
print('Mean Squared Error:', mean_squared_error(y_test, pred))
# Result
# Mean Absolute Error: 0.2809225334352517
# Mean Squared Error: 0.11678672963320082
We can see that both MAE and MSE from LinearRegressionGD are close to those of LinearRegression. Since we don't have to solve a system of equations, gradient descent comes in handy for the more complex models we will learn about later, so make sure you understand the concept fully.
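You can also compare the parameters themselves. They may not be identical, since gradient descent ran for only 100 epochs, but they should be in the same ballpark:
# Compare the parameters found by gradient descent with sklearn's solution
print('sklearn coefficients:', lr.coef_, 'intercept:', lr.intercept_)
print('GD coefficients:     ', lrgd.W, 'intercept:', lrgd.b)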
Resources
- Shen, S. W. Darryl. 2020. Linear Regression using Iris Dataset — ‘Hello, World!’ of Machine Learning. Medium.