This blog post covers the prerequisites for logistic regression.

There are two major tasks in machine learning: regression and classification. In the last article, we learned how to use linear regression to tackle regression tasks in linearly correlated data. Next up, we will talk about how we can perform binary classification with logistic regression. However, logistic regression requires a good understanding of probability and statistics. Hence, this article is dedicated to learning the prerequisites to ensure we are all on the same page.
Log-Odds
I'm sure you know about probability. If you play 5 games of rock-paper-scissors and win 3 games, you can say you had a 60% probability of winning. But how about the odds in favor of winning? To calculate the odds in favor of winning, you divide the number of times you won by the number of times you lost: here, $3/2 = 1.5$.
We can calculate the odds from the probability too, because the denominator (the total number of games played) is the same for both winning and losing.
Hence, we can establish that

$$\text{odds} = \frac{p}{1 - p}$$

where $p$ is the probability of winning. Let's look at another example of playing 500 games and winning 499 games. The odds in favor of winning are 499, while the odds against winning are 1/499, or around 0.002. We can see from this example that the odds are not symmetric at all, even though both values describe the same event: the odds in favor sit 498 above 1, while the odds against sit less than 1 below it. To make them symmetric, we can use the log function: $\log(499) \approx 6.21$ and $\log(1/499) \approx -6.21$, which are symmetric around 0.
When we express this in terms of the probability $p$, we get

$$\log(\text{odds}) = \log\left(\frac{p}{1 - p}\right)$$
This is called the logit function and is used in various domains.
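As a quick sketch in Python (the function name `logit` is mine, and the numbers reuse the 500-game example from above):

```python
import math

def logit(p):
    """Log-odds of a probability p, i.e. log(p / (1 - p))."""
    return math.log(p / (1 - p))

# Winning 499 out of 500 games: the raw odds are wildly asymmetric...
odds_for = 499 / 1        # 499
odds_against = 1 / 499    # ~0.002
# ...but the log-odds are mirror images around 0
print(logit(499 / 500))   # ~6.21
print(logit(1 / 500))     # ~-6.21
```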
Logistic Function
We now know how to go from probability to log-odds. Let's take the inverse of the function to see how we can go from log-odds to probability.
We can set the log-odds as $x = \log\left(\frac{p}{1 - p}\right)$ like the above, and isolate $p$ to get the inverse of the logit function. Let's continue the computation. Exponentiating both sides gives

$$e^x = \frac{p}{1 - p} \quad \Rightarrow \quad p = \frac{e^x}{1 + e^x}$$

Multiplying $e^{-x}$ into both the numerator and denominator, we get

$$p = \frac{1}{1 + e^{-x}}$$
This is the equation that converts log-odds to probability; it is called the standard logistic function, or sometimes the sigmoid function. It takes log-odds ranging from $-\infty$ to $\infty$ and maps them to the probability range of 0 to 1.
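A minimal sketch of the standard logistic (sigmoid) function in Python, checking that it inverts the logit:

```python
import math

def sigmoid(x):
    """Standard logistic function: maps log-odds x to a probability."""
    return 1 / (1 + math.exp(-x))

# The sigmoid undoes the logit: start from p = 0.6,
# convert to log-odds, then map back to a probability
p = 0.6
x = math.log(p / (1 - p))   # log-odds of 0.6
print(sigmoid(x))           # ~0.6
print(sigmoid(0))           # 0.5 -- log-odds of 0 means a 50/50 chance
```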
The general logistic function has more parameters:

$$f(x) = \frac{L}{1 + e^{-k(x - x_0)}}$$

where $L$ is the upper bound of the function, $k$ is the steepness of the curve, and $x_0$ is the midpoint of the function. You might already be able to imagine fitting this logistic function to data, generated by plotting an explanatory variable against a binary outcome variable coded as 0 and 1, by changing the parameters $L$, $k$, and $x_0$.
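A sketch of the general logistic function with these parameters (the function name and default values are my choice):

```python
import math

def logistic(x, L=1.0, k=1.0, x0=0.0):
    """General logistic: L = upper bound, k = steepness, x0 = midpoint."""
    return L / (1 + math.exp(-k * (x - x0)))

# With L=1, k=1, x0=0 this reduces to the standard sigmoid
print(logistic(0.0))                 # 0.5
# Shifting the midpoint to x0=2 moves the 50% point to x=2
print(logistic(2.0, k=3.0, x0=2.0))  # 0.5
```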
Likelihood
In English, we use probability and likelihood interchangeably, but they mean different things in the math world. While probability concerns the possibility of the observations given the distribution or the model (P(Observations | Model)), likelihood considers how likely it is that the distribution or the model is the right one that generated the observations (L(Model | Observations)).
Maximum Likelihood Estimation
The likelihood can be used as a measure of how well the model fits the data; the higher the likelihood, the more likely that the model is the correct one that generated the data. The process of finding parameter(s) of the model that leads to maximum likelihood is called Maximum Likelihood Estimation (MLE). There are multiple ways of achieving this.
For example, we can write the likelihood function with respect to the parameter(s) of the model, compute its derivative, set it to 0, and solve the resulting system of equations. We can also use gradient descent by setting the negative log-likelihood as the cost function.
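As a toy illustration of MLE (not from the article), we can estimate the win probability from the earlier 3-wins-in-5-games example by minimizing the negative log-likelihood over a grid of candidate parameters:

```python
import math

# Observations: 3 wins and 2 losses in 5 Bernoulli trials
wins, losses = 3, 2

def neg_log_likelihood(p):
    """Negative log-likelihood of win probability p given the data."""
    return -(wins * math.log(p) + losses * math.log(1 - p))

# Crude stand-in for setting the derivative to zero or running
# gradient descent: evaluate the cost on a grid, keep the best candidate
candidates = [i / 1000 for i in range(1, 1000)]
p_hat = min(candidates, key=neg_log_likelihood)
print(p_hat)  # 0.6 -- matches the analytic MLE, wins / (wins + losses)
```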
KL Divergence
When you want to quantify how different two distributions or models are, you can compare the probability of the observations given each model as follows:

$$\frac{P(\text{Observations} \mid P)}{P(\text{Observations} \mid Q)} = \frac{p^{n}(1-p)^{N-n}}{q^{n}(1-q)^{N-n}}$$

The above equation assumes that both models $P$ (the actual model) and $Q$ (the predicted model) are probability distributions over binary outcomes (with probabilities $p$ and $q$ for the first outcome), and that there were $N$ observations, of which $n$ resulted in the first outcome. By examining how far the above ratio is from 1, we can get an intuition of how different the two distributions are.
However, we have the same problem as before: the distance from 1 differs depending on whether the value is smaller or bigger than 1. To solve this problem, we can use the log function. (The actual reason for applying the log function is related to information theory, which is beyond the scope of this article.)

$$\log\frac{p^{n}(1-p)^{N-n}}{q^{n}(1-q)^{N-n}} = n\log\frac{p}{q} + (N-n)\log\frac{1-p}{1-q}$$
We can also divide it by the total number of observations $N$, since dividing by a positive constant is a monotonic transformation. (This also relates to information theory.)

$$\frac{n}{N}\log\frac{p}{q} + \frac{N-n}{N}\log\frac{1-p}{1-q}$$
We can rearrange this equation further: as the number of observations grows, the observed frequency $\frac{n}{N}$ approaches $p$, and we get

$$D_{KL}(P \parallel Q) = p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}$$

By deriving the log ratio of the probability of the observations given the two models, we arrive at the KL divergence formula for binary outcomes. The general formula for KL divergence is:

$$D_{KL}(P \parallel Q) = \sum_{x} P(x)\log\frac{P(x)}{Q(x)}$$
The larger the KL divergence gets, the more different the models and are.
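A small sketch of the KL divergence for binary outcomes in Python (the function name is mine):

```python
import math

def kl_binary(p, q):
    """KL divergence between two Bernoulli models with parameters p and q."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

print(kl_binary(0.6, 0.6))  # 0.0 -- identical models
print(kl_binary(0.6, 0.5))  # small -- models are close
print(kl_binary(0.6, 0.1))  # larger -- models differ more
```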
Cross-Entropy
We can use the above KL divergence formula as a way of comparing how different a predicted model $Q$ is from the actual model $P$. We can try minimizing the KL divergence by adjusting the parameters of $Q$, so that we get a model that approximates $P$ well.
However, when we look closely at the KL divergence formula, we can notice that the first part of the equation, $\sum_x P(x)\log P(x)$, does not change depending on the parameters of model $Q$. (This term is the negative of the entropy of $P$, but entropy is beyond the scope of this article. If you are interested, check out the YouTube video from StatQuest on entropy.) Hence, when we want to optimize the KL divergence by changing the parameter(s) of model $Q$, we only need the latter part. The latter part of the KL divergence corresponds to the cross-entropy between $P$ and $Q$:

$$H(P, Q) = -\sum_{x} P(x)\log Q(x)$$
By minimizing the cross-entropy, we minimize the KL divergence and end up with a model $Q$ that approximates the model $P$ that generated the observations.
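This decomposition can be checked numerically. The sketch below (helper names are mine) verifies that, for Bernoulli models, the KL divergence equals the cross-entropy minus the entropy of $P$:

```python
import math

def entropy(p):
    """Entropy of a Bernoulli model with parameter p."""
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def cross_entropy(p, q):
    """Cross-entropy between Bernoulli models P (param p) and Q (param q)."""
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))

def kl(p, q):
    """KL divergence between Bernoulli models P and Q."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

p, q = 0.6, 0.3
# KL(P || Q) = H(P, Q) - H(P): only the cross-entropy term depends on q
print(kl(p, q))
print(cross_entropy(p, q) - entropy(p))  # same value
```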
Resources
- Liusie, A. 2021. Intuitively Understanding the KL Divergence YouTube.
- StatQuest. 2018. Odds and Log(Odds), Clearly Explained!!! YouTube.
- StatQuest. 2018. Probability is not Likelihood. Find out why!!! YouTube.