What Machine Learning Truly Looks Like

Last Edited: 7/10/2024

I am writing this blog to reflect on my experiences with machine learning so far.


According to Wikipedia, the definition of machine learning is the following:

Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data and thus perform tasks without explicit instructions.

There is a lot of jargon in the above definition, so let's put it simply. Here is my version (I know it's far from perfect, but it's just for the sake of simplification):

Machine learning is all about finding patterns in data and/or making predictions about unseen data by using existing data and learning algorithms.

I hope the above made it a little clearer what machine learning is about. If you still don't see what it is all about, don't worry. In this article, I would like to walk through the simplest example of a machine learning pipeline to illustrate what machine learning truly looks like.

WARNING: The pipeline I will walk through here does not generalize to all ML pipelines. It is just for building a basic intuition of machine learning.

Machine Learning Pipeline

Step 1. Data Exploration

To repeat, machine learning is all about finding patterns in data and/or making predictions about unseen data by using existing data and learning algorithms. Hence, we first need to look for data related to whatever we want to learn patterns of and/or make predictions about. Let's say you want to learn the relationship between the time people spend studying for a test and the score they receive on it, in order to predict how many hours you will need to study to pass that test. Then, the first step is to find appropriate data and observe it.

You found out that your school keeps records of all the test results, along with survey results from the students who took the test, in a computer system. (This could be an unsafe way to store such data.) You talked to the school, and they agreed to send you the data. (This, too, would be unsafe and unethical in practice.) The following is how the data looks:

Students' Survey Results on Test A

| Student ID | Test Score (0-100) | Study Hours (0-24) | Sleep Hours (0-24) |
| --- | --- | --- | --- |
| 0001 | 32 | NA | NA |
| 0002 | 90 | 21 | 11 |
| 0003 | 12 | 1 | 6 |
| 0004 | 60 | 12 | 8 |
| 0005 | 36 | 7 | 7 |
| 0006 | 60 | 9.5 | 7.5 |
| 0007 | 24 | 3 | 4 |
| 0008 | 75 | NA | NA |
| 0009 | 86 | 23.5 | 11 |
| 0010 | 14 | NA | NA |
| 0011 | 90 | 19 | 10 |
| 0012 | 95 | 19 | 9 |
| 0013 | 90 | 1 | 12 |
| 0014 | 80 | 16 | 5 |

You can see that the data includes unnecessary information (the student IDs and sleep hours are irrelevant to our question) and missing values (the NA entries). This is an important thing to note. Let's plot study hours against test scores to visualize the data.

We can see that there is one data point that does not follow the general trend: student 0013, who scored 90 after a single hour of study. This is called an outlier. This student might be a genius or a liar, but regardless, this is another important observation.
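If you want to follow along in code, here is a minimal sketch of this exploration step using pandas and matplotlib. The file name survey.csv and the column names are my own assumptions, not part of the original data:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the survey results. The file name and column names are
# hypothetical; adapt them to however your data is actually stored.
df = pd.read_csv("survey.csv")  # columns: student_id, test_score, study_hours, sleep_hours

# Inspect the first few rows and count the missing values per column.
print(df.head())
print(df.isna().sum())

# Scatter plot of study hours against test scores to eyeball the
# general trend and spot outliers.
plt.scatter(df["study_hours"], df["test_score"])
plt.xlabel("Study Hours (0-24)")
plt.ylabel("Test Score (0-100)")
plt.show()
```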

Step 2. Data Preprocessing

We found that the data includes unnecessary information, missing values, and an outlier. In this step, we need to handle them properly so that the algorithms that follow can analyze the data correctly. For simplicity, let's remove them all. This step is called data cleaning. After data cleaning, the data looks like this:

Cleaned Data

| Test Score (0-100) | Study Hours (0-24) |
| --- | --- |
| 90 | 21 |
| 12 | 1 |
| 60 | 12 |
| 36 | 7 |
| 60 | 9.5 |
| 24 | 3 |
| 86 | 23.5 |
| 90 | 19 |
| 95 | 19 |
| 80 | 16 |

In the next step, we will pass this data to some kind of learning algorithm executed by a machine (a computer), hence the name machine learning. However, machines typically don't perform well when the scales of the numbers differ between variables, such as 0 to 100 and 0 to 24. (This is not universally true, but please accept it for the sake of understanding the general framework.) Thus, let's convert both variables to a scale of 0 to 1 by dividing them by their largest possible values, 100 and 24. This process is called normalization.

Preprocessed Data

| Test Score (0-1) | Study Hours (0-1) |
| --- | --- |
| 0.90 | 0.88 |
| 0.12 | 0.04 |
| 0.60 | 0.50 |
| 0.36 | 0.29 |
| 0.60 | 0.40 |
| 0.24 | 0.13 |
| 0.86 | 0.98 |
| 0.90 | 0.79 |
| 0.95 | 0.79 |
| 0.80 | 0.67 |
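Here is a minimal sketch of the cleaning and normalization steps above, continuing from the earlier pandas sketch. The hard-coded outlier filter is purely for illustration; real pipelines use more principled outlier detection:

```python
# Drop the columns we don't need and the rows with missing values.
df = df.drop(columns=["student_id", "sleep_hours"])
df = df.dropna()

# Remove the outlier spotted in the plot (the student who scored 90
# after 1 study hour). Hard-coded here purely for illustration.
df = df[~((df["test_score"] == 90) & (df["study_hours"] == 1))]

# Normalize both columns to a 0-to-1 scale by dividing by the largest
# possible values, 100 and 24.
df["test_score"] = df["test_score"] / 100
df["study_hours"] = df["study_hours"] / 24
```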

Also, we need to split the data into training data and testing data so that we can evaluate how well the predictions perform on unseen data. From here on, we work only with the training data when we make the machine learn.

Training Data

| Test Score (0-1) | Study Hours (0-1) |
| --- | --- |
| 0.90 | 0.88 |
| 0.12 | 0.04 |
| 0.60 | 0.50 |
| 0.60 | 0.40 |
| 0.24 | 0.13 |
| 0.86 | 0.98 |
| 0.90 | 0.79 |

Testing Data

| Test Score (0-1) | Study Hours (0-1) |
| --- | --- |
| 0.36 | 0.29 |
| 0.95 | 0.79 |
| 0.80 | 0.67 |
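A common way to perform this split in code is scikit-learn's train_test_split. Here is a sketch; the 70/30 ratio matches the tables above, though exactly which rows land in each split depends on the random seed:

```python
from sklearn.model_selection import train_test_split

X = df["study_hours"].to_numpy()  # input: normalized study hours
y = df["test_score"].to_numpy()   # target: normalized test scores

# Hold out 30% of the rows as testing data; random_state just makes
# the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
```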

Now, we are ready for the next step. The entire process is called data preprocessing, and the best practices for preparing data depend on what kind of data we are analyzing (image, text, video, audio, etc.) and what kind of machines we are trying to use.

Step 3. Model

I am sure you have been waiting for this part. We are at the stage where we build a machine learning model! Let's define a model like this:

$$
y = f(x, \phi) \\
f(x, \phi) = \phi_1 x + \phi_2
$$

where $x$ is the hours spent studying, $y$ is the test score, $f$ is the (linear) function, and $\phi$ is the set of parameters: the slope ($\phi_1$) and the intercept ($\phi_2$). The task is to find the best set of parameters so that the function can best predict the test score. How do we do that?

Cost Function

We first need to quantify what we mean by the best set of parameters. In this case, it should be how well the line fits the training data, which can be measured with the Mean Squared Error (MSE). The following is how we calculate the MSE:

$$
\mathrm{MSE} = \frac{\sum_{i=1}^{n}\left(y_i - f(x_i, \phi)\right)^2}{n}
$$

This might look complicated, but it isn't: it is just the average of the squared differences between the actual test scores and the predicted test scores. We square the differences to make all the values positive. If we can find a set of parameters that minimizes the MSE, we have found the best set of parameters. The function we use to evaluate the fit to the training data is called the cost (or loss) function.
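As a quick sanity check, here is a sketch of the model and its MSE in NumPy, continuing with X_train and y_train from the split above. The parameter values passed in at the end are made up:

```python
import numpy as np

def f(x, phi1, phi2):
    """The linear model: predicted score = phi1 * x + phi2."""
    return phi1 * x + phi2

def mse(phi1, phi2, x, y):
    """Mean squared error of the model's predictions on (x, y)."""
    return np.mean((y - f(x, phi1, phi2)) ** 2)

# Evaluate an arbitrary first guess at the parameters (made-up values).
print(mse(1.0, 0.0, X_train, y_train))
```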

Minimize Cost Function

If you have learned calculus before, you might remember that to find a local minimum of a function, we can take its derivative and set it equal to 0. That is exactly what we can do here: take the partial derivatives of the cost function with respect to $\phi_1$ and $\phi_2$, set them equal to 0, and solve the equations to get the $\phi_1$ and $\phi_2$ that minimize the cost function.

Let's first expand the MSE so that the derivatives are easier to take.

$$
\begin{aligned}
\mathrm{MSE} &= \frac{1}{n}\sum_{i=1}^{n}\left(y_i-(\phi_1 x_i+\phi_2)\right)^2 \\
&= \frac{1}{n}\sum_{i=1}^{n}\left(-\phi_1 x_i+(y_i-\phi_2)\right)^2 \\
&= \frac{1}{n}\sum_{i=1}^{n}\left(\phi_1^2 x_i^2-2\phi_1 x_i(y_i-\phi_2)+(y_i-\phi_2)^2\right) \\
&= \frac{1}{n}\sum_{i=1}^{n}\left(\phi_1^2 x_i^2-2\phi_1 x_i y_i+2\phi_1 x_i \phi_2+y_i^2-2 y_i \phi_2+\phi_2^2\right)
\end{aligned}
$$

Let's take the partial derivative of the MSE with respect to $\phi_1$:

$$
\frac{\partial \mathrm{MSE}}{\partial \phi_1} = \frac{1}{n}\sum_{i=1}^{n}\left(2 x_i^2 \phi_1 - 2 x_i y_i + 2 x_i \phi_2\right)
$$

Let's take the partial derivative of the MSE with respect to $\phi_2$:

$$
\frac{\partial \mathrm{MSE}}{\partial \phi_2} = \frac{1}{n}\sum_{i=1}^{n}\left(2\phi_1 x_i - 2 y_i + 2\phi_2\right)
$$

Let's set them equal to 0 and try solving them. For $\phi_1$,

$$
\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n}\left(2 x_i^2 \phi_1-2 x_i y_i+2 x_i \phi_2\right) &= 0 \\
\sum_{i=1}^{n}\left(x_i^2 \phi_1-x_i y_i+x_i \phi_2\right) &= 0 \\
\phi_1 &= \frac{\sum_{i=1}^{n}\left(x_i y_i-x_i \phi_2\right)}{\sum_{i=1}^{n} x_i^2}
\end{aligned}
$$

For $\phi_2$,

$$
\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n}\left(2\phi_1 x_i-2 y_i+2\phi_2\right) &= 0 \\
\sum_{i=1}^{n}\left(\phi_1 x_i-y_i+\phi_2\right) &= 0 \\
\phi_2 &= \frac{1}{n}\sum_{i=1}^{n}\left(y_i-\phi_1 x_i\right)
\end{aligned}
$$

We could substitute one equation into the other and eventually solve them, but I am going to stop here for presentation purposes. When we have a large amount of data or more complex functions, solving the equations analytically becomes infeasible, so ML engineers have developed clever ways of finding a set of parameters that lead to a minimum cost. Anyway, this is essentially how we can leverage math to help machines learn the best parameters.
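The best known of those clever methods is gradient descent: start from an arbitrary guess and repeatedly nudge the parameters in the direction that decreases the cost, using exactly the partial derivatives we computed above. Here is a sketch continuing the earlier NumPy code; the learning rate and iteration count are arbitrary choices:

```python
phi1, phi2 = 0.0, 0.0   # arbitrary starting parameters
learning_rate = 0.1     # step size (an arbitrary choice)
n = len(X_train)

for _ in range(5000):
    residual = f(X_train, phi1, phi2) - y_train  # prediction errors
    # The partial derivatives derived above, with the sign folded
    # into the residual.
    grad_phi1 = (2 / n) * np.sum(residual * X_train)
    grad_phi2 = (2 / n) * np.sum(residual)
    # Nudge each parameter downhill on the cost surface.
    phi1 -= learning_rate * grad_phi1
    phi2 -= learning_rate * grad_phi2

print(phi1, phi2)  # the learned slope and intercept
```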

What Is a Machine Learning Model?

The set of functions, parameters, cost function, and learning mechanism is called a machine learning model. The example above is called a Linear Regression Model, but there are many radically different kinds of models. In this scenario, we treated the $\phi$s as the only parameters in the model, but technically, the function itself can be regarded as something the model learns to optimize as well. For example, we can make the model choose between linear, quadratic, and multivariate functions based on the MSE. We will later see that many building blocks of a model can be optimized automatically based on the cost function (or some other function) if done correctly. This is where "without explicit instructions" in the Wikipedia definition comes from.
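As a toy illustration of that idea, here is a sketch that fits a linear and a quadratic function with numpy.polyfit and compares their training MSE. (In practice you would compare on held-out data, since the more flexible function will almost always fit the training data better.)

```python
import numpy as np

# Fit polynomials of degree 1 (linear) and 2 (quadratic) by least
# squares, then compare their MSE on the training data.
for degree in (1, 2):
    coeffs = np.polyfit(X_train, y_train, degree)
    preds = np.polyval(coeffs, X_train)
    print(degree, np.mean((y_train - preds) ** 2))
```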

Step 4. Model Evaluation

Okay, let's get back to the machine learning pipeline. Do you remember that we split the data into training data and testing data, and only worked with the training data in Step 3? This is where we use the testing data to evaluate how the model truly performs on unseen data. A quantity we use to measure the quality of the model's predictions is called a metric, and there are many metrics to choose from. Here, let's use the Mean Absolute Error (MAE) for the sake of simplicity. (We take the absolute value instead of the squared value.)

$$
\mathrm{MAE} = \frac{\sum_{i=1}^{n}\left|y_i - f(x_i, \phi)\right|}{n}
$$

Great. We can interpret it as "the model's predictions are likely to be off by $\pm\mathrm{MAE}$ on average on unseen data." There is an appropriate set of metrics for each task at hand (for example, we might also want to take speed into account), and ML engineers around the world compete to find ways to achieve the best results on those metrics.
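Here is that evaluation step in code, continuing from the parameters learned by the gradient-descent sketch above:

```python
# Mean absolute error of the learned model on the held-out testing data.
preds = f(X_test, phi1, phi2)
print(np.mean(np.abs(y_test - preds)))
```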

Research Questions in Machine Learning

The above is the simplest example of a machine learning pipeline, and a similar pipeline is used everywhere for different tasks and data (natural language processing, computer vision, audio processing, etc.). The pipeline naturally gives rise to many research questions like:

  • Step 1: What is the most ethical approach to collecting and handling data?
  • Step 1: How can we best visualize or represent the data?
  • Step 2: How can we best handle invalid values, missing values, and outliers?
  • Step 2: How can we best prepare the data for the models?
  • Step 3: How can we choose the best $f$ for different tasks?
  • Step 3: How can we choose the best cost function for different tasks?
  • Step 3: How can we best design learning mechanisms?
  • Step 4: How can we best measure the performance of the model for different tasks?

The field of machine learning is rapidly evolving to find better and better answers to the questions above. Even fancy machine learning models like Convolutional Neural Networks, LSTMs, and Transformers are just incredibly clever versions of $f$, invented in pursuit of the above research questions. In this blog post series, I would like to walk you through how researchers have tackled those questions and the incredible insights they've found.