This blog post introduces softmax regression, also known as multinomial logistic regression.

Before diving into how classification models can be evaluated, I would like to introduce softmax regression, which can perform multiclass classification, unlike logistic regression, which is only for binary classification.
Softmax Function
One intuitive way of classifying multiple classes is to fit a logistic function for each class and take the class with the highest probability. However, this is not what we want in multiclass classification, because we want a single categorical distribution over all the classes, not the argmax over separate distributions fitted for each class. Then, you might be wondering why we don't simply divide the probability of each class by the total probability, like below:

$$P(y = k \mid \mathbf{x}) = \frac{\sigma_k(\mathbf{x})}{\sum_{j=1}^{K} \sigma_j(\mathbf{x})} = \frac{\dfrac{1}{1 + e^{-(\boldsymbol{\beta}_k \cdot \mathbf{x} + \beta_{k,0})}}}{\sum_{j=1}^{K} \dfrac{1}{1 + e^{-(\boldsymbol{\beta}_j \cdot \mathbf{x} + \beta_{j,0})}}}$$

where $P(y = k \mid \mathbf{x})$ is the probability of the category being $k$ given $\mathbf{x}$, derived by dividing $\sigma_k(\mathbf{x})$, the probability of category $k$ computed by the logistic function for $k$, by the total of the probabilities from the logistic functions of all categories. However, as you can see, the above equation is...messy.
Ok, then how about log-odds? We modeled the log-odds as a linear function of the explanatory variables for logistic regression. Surely we can simplify things with log-odds here, right? Let's see.

$$P(y = k \mid \mathbf{x}) = \frac{\boldsymbol{\beta}_k \cdot \mathbf{x} + \beta_{k,0}}{\sum_{j=1}^{K} (\boldsymbol{\beta}_j \cdot \mathbf{x} + \beta_{j,0})}$$

Well, it looks better at first glance. However, this is not good because log-odds range from $-\infty$ to $\infty$ and can become negative. As a probability must range between 0 and 1, you cannot use log-odds, unfortunately. Instead of probability and log-odds, we can use odds. When we take the exponent of the log-odds, we arrive at the odds ($e^{\boldsymbol{\beta}_k \cdot \mathbf{x} + \beta_{k,0}}$).

$$P(y = k \mid \mathbf{x}) = \frac{e^{\boldsymbol{\beta}_k \cdot \mathbf{x} + \beta_{k,0}}}{\sum_{j=1}^{K} e^{\boldsymbol{\beta}_j \cdot \mathbf{x} + \beta_{j,0}}}$$

Since exponentiation maps numbers ranging from $-\infty$ to $\infty$ to the range from $0$ to $\infty$, you can be sure that $P(y = k \mid \mathbf{x})$ will range from 0 to 1. It is also much simpler than using multiple logistic functions. The above function is called the softmax function, and it turns out that it is a generalized version of logistic regression and a convenient function for multiclass classification.
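To make the formula concrete, here is a minimal NumPy sketch of the softmax function for a single sample (the names logits and softmax are mine, for illustration; subtracting the maximum log-odds before exponentiating is a common numerical-stability trick that does not change the output, a consequence of the overparameterization discussed next).
import numpy as np

def softmax(logits):
    # Map a vector of log-odds (one entry per class) to a probability distribution
    shifted = logits - np.max(logits)  # numerical stability; output is unchanged
    odds = np.exp(shifted)
    return odds / np.sum(odds)

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs)        # approximately [0.659, 0.242, 0.099]
print(probs.sum())  # 1.0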
Softmax Function is Overparameterized
One unique feature of the softmax function is that it is overparameterized, meaning it has parameters that are redundant. Let's see why that is the case. Suppose we subtract a parameter vector $\boldsymbol{\psi}$ (and an intercept $\psi_0$) from the parameters $\boldsymbol{\beta}_j$ and $\beta_{j,0}$ of all the classes. Now, the softmax function becomes

$$P(y = k \mid \mathbf{x}) = \frac{e^{(\boldsymbol{\beta}_k - \boldsymbol{\psi}) \cdot \mathbf{x} + (\beta_{k,0} - \psi_0)}}{\sum_{j=1}^{K} e^{(\boldsymbol{\beta}_j - \boldsymbol{\psi}) \cdot \mathbf{x} + (\beta_{j,0} - \psi_0)}}$$

Let's rearrange the above equation.

$$P(y = k \mid \mathbf{x}) = \frac{e^{\boldsymbol{\beta}_k \cdot \mathbf{x} + \beta_{k,0}} \, e^{-(\boldsymbol{\psi} \cdot \mathbf{x} + \psi_0)}}{\sum_{j=1}^{K} e^{\boldsymbol{\beta}_j \cdot \mathbf{x} + \beta_{j,0}} \, e^{-(\boldsymbol{\psi} \cdot \mathbf{x} + \psi_0)}}$$

As $e^{-(\boldsymbol{\psi} \cdot \mathbf{x} + \psi_0)}$ cancels out, we get back to the original softmax function. What does that show about the softmax function? It means we have multiple sets of parameters that make exactly the same predictions. This makes sense intuitively as well, because one set of parameters can produce 10 times the odds of another for every class, yet the result is exactly the same due to the normalization.
More importantly though, it also means we can set $\boldsymbol{\psi}$ to be the parameter vector of one class, say $\boldsymbol{\psi} = \boldsymbol{\beta}_K$ and $\psi_0 = \beta_{K,0}$ (such that $\boldsymbol{\beta}_K - \boldsymbol{\psi} = \mathbf{0}$ and $\beta_{K,0} - \psi_0 = 0$), and practically eliminate the parameters of that class while arriving at the same result. This also makes intuitive sense, because if we have the probabilities of all classes but one, we can deduce the probability of the missing class by subtracting the sum of all the other classes from 1.
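Here is a quick NumPy check of this property, reusing the softmax sketch from above. The numbers are arbitrary; the point is only that subtracting the same vector (and intercept) from every class's parameters leaves the probabilities untouched.
x = np.array([1.5, -0.5])                             # one sample with two features
B = np.array([[0.2, 1.0], [0.5, -0.3], [-0.1, 0.4]])  # slopes, one row per class
b0 = np.array([0.1, 0.0, -0.2])                       # intercepts, one per class

psi, psi0 = B[2], b0[2]  # subtract the last class's parameters from every class
print(softmax(B @ x + b0))
print(softmax((B - psi) @ x + (b0 - psi0)))  # identical probabilities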
Relationship to Logistic Function
We can show that the softmax function, when the number of classes is 2, is equivalent to the logistic function by using the fact that the softmax function is overparameterized. When the number of classes is 2 (with classes expressed as 1 and 2), we can express the probability for class 1 using the softmax function as follows:

$$P(y = 1 \mid \mathbf{x}) = \frac{e^{\boldsymbol{\beta}_1 \cdot \mathbf{x} + \beta_{1,0}}}{e^{\boldsymbol{\beta}_1 \cdot \mathbf{x} + \beta_{1,0}} + e^{\boldsymbol{\beta}_2 \cdot \mathbf{x} + \beta_{2,0}}}$$

Now, let's set $\boldsymbol{\psi} = \boldsymbol{\beta}_2$ and $\psi_0 = \beta_{2,0}$ and subtract them from the parameters of both classes, using the fact that the softmax function is overparameterized.

$$P(y = 1 \mid \mathbf{x}) = \frac{e^{(\boldsymbol{\beta}_1 - \boldsymbol{\beta}_2) \cdot \mathbf{x} + (\beta_{1,0} - \beta_{2,0})}}{e^{(\boldsymbol{\beta}_1 - \boldsymbol{\beta}_2) \cdot \mathbf{x} + (\beta_{1,0} - \beta_{2,0})} + 1} = \frac{1}{1 + e^{-\left[(\boldsymbol{\beta}_1 - \boldsymbol{\beta}_2) \cdot \mathbf{x} + (\beta_{1,0} - \beta_{2,0})\right]}}$$

Replacing $\boldsymbol{\beta}_1 - \boldsymbol{\beta}_2$ with $\boldsymbol{\beta}$ and $\beta_{1,0} - \beta_{2,0}$ with $\beta_0$, we get

$$P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\boldsymbol{\beta} \cdot \mathbf{x} + \beta_0)}}$$
This is the same equation as the logistic function! Hence, we can say that the softmax function is a generalized version of the logistic function.
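As a quick numerical sanity check of this equivalence (a small sketch with arbitrary parameter values, reusing the softmax sketch from above):
beta1, beta10 = np.array([0.8, -0.4]), 0.3  # class 1 slopes and intercept
beta2, beta20 = np.array([0.1, 0.5]), -0.2  # class 2 slopes and intercept
x = np.array([1.2, 0.7])

p_softmax = softmax(np.array([beta1 @ x + beta10, beta2 @ x + beta20]))[0]
p_logistic = 1 / (1 + np.exp(-((beta1 - beta2) @ x + (beta10 - beta20))))
print(np.isclose(p_softmax, p_logistic))  # True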
Step 1. Data Exploration
Here, we will use the same dataset we used for logistic regression: the Iris dataset. Let's skip this part since we've already done this in the previous article on logistic regression. Instead of classifying Setosa vs. others, we now aim to classify Setosa, Versicolor, and Virginica.
Step 2. Data Preprocessing
Instead of expressing the species as numerical values like 0, 1, and 2, we can use a vector with each index corresponding to each species.
This process is called one-hot encoding, which can be done by to_categorical from keras.utils.
from keras.utils import to_categorical
iris_df.drop('species', axis=1, inplace=True)
target_df = pd.DataFrame(columns=['species'], data=iris.target)
X = np.array(iris_df)
y = to_categorical(np.array(target_df['species']))
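To see what the encoding produces, you can inspect the resulting array (assuming the integer labels 0, 1, and 2 from iris.target, with 0 corresponding to Setosa):
print(y.shape)  # (150, 3): one row per sample, one column per species
print(y[0])     # [1. 0. 0.] for a Setosa sample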
We tend to use X for the explanatory variables and y for the outcome variable we want to classify. Additionally, we need to split the dataset into training data and testing data.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)
In the code above, we are splitting the dataset randomly so that the training data and testing data will have 67% and 33% of the original dataset, respectively.
Step 3. Model
Model Definition
As we discussed in the first part of the article, we can use the softmax function to perform multiclass classification. The function is:

$$P(y = k \mid \mathbf{x}) = \frac{e^{\boldsymbol{\beta}_k \cdot \mathbf{x} + \beta_{k,0}}}{\sum_{j=1}^{K} e^{\boldsymbol{\beta}_j \cdot \mathbf{x} + \beta_{j,0}}}, \quad k = 1, \dots, K$$
We can rewrite the above in vector form to express the entire probability distribution for this classification task:

$$\hat{\mathbf{y}} = \begin{bmatrix} P(y = 1 \mid \mathbf{x}) \\ \vdots \\ P(y = K \mid \mathbf{x}) \end{bmatrix} = \frac{1}{\sum_{j=1}^{K} e^{\boldsymbol{\beta}_j \cdot \mathbf{x} + \beta_{j,0}}} \begin{bmatrix} e^{\boldsymbol{\beta}_1 \cdot \mathbf{x} + \beta_{1,0}} \\ \vdots \\ e^{\boldsymbol{\beta}_K \cdot \mathbf{x} + \beta_{K,0}} \end{bmatrix}$$
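In code, it is convenient to compute these probabilities for a whole data matrix at once. Below is a minimal NumPy sketch (the name softmax_matrix and the shapes are my own convention: a weight matrix W of shape (n_features, n_classes) holding the slopes, and a vector b of intercepts); it mirrors what the model's predict method will do later.
def softmax_matrix(X, W, b):
    # Return an (n_samples, n_classes) matrix of class probabilities
    odds = np.exp(X @ W + b)                       # exponentiated log-odds per class
    return odds / odds.sum(axis=1, keepdims=True)  # each row sums to 1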
Cost Function
You should remember by now that we can perform maximum likelihood estimation (MLE) to find the set of parameters that best approximate the model by which the data is generated. Let's determine the likelihood function for this situation.
The same logic we used in logistic regression applies here. (In fact, the likelihood function for multiclass classification is a generalization of the one for binary classification.) The more samples the model classifies correctly, the higher the likelihood of the model given the data should be. This means that the higher the probability associated with the correct label, the higher the likelihood of the model should be. Writing $B$ for the full set of parameters and $y_{i,k}$ for the one-hot label that is 1 if sample $i$ belongs to class $k$ and 0 otherwise, the likelihood function becomes the following:

$$L(B \mid X, \mathbf{y}) = \prod_{i=1}^{n} \prod_{k=1}^{K} P(y_i = k \mid \mathbf{x}_i)^{y_{i,k}}$$
Taking the negative log-likelihood for the cost function, we arrive at the following:

$$J(B) = -\ln L(B \mid X, \mathbf{y}) = -\sum_{i=1}^{n} \sum_{k=1}^{K} y_{i,k} \ln P(y_i = k \mid \mathbf{x}_i)$$
Well, it does look familiar, doesn't it? Yes, this is exactly the same as cross-entropy loss in a multiclass situation! Similar to logistic regression, the negative log-likelihood coincides with the cross-entropy between the true and predicted probability distributions.
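As a small sketch (cross_entropy is my own helper name), this is how the cost can be computed from one-hot labels and predicted probabilities; averaged over the samples, it is what sklearn's log_loss computes, which we will use to track training later.
def cross_entropy(Y, P, eps=1e-12):
    # Average negative log-likelihood of one-hot labels Y under predicted probabilities P
    P = np.clip(P, eps, 1.0)  # avoid log(0)
    return -np.mean(np.sum(Y * np.log(P), axis=1))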
Learning Mechanism
Since the softmax function has multiple sets of parameters that produce the same predicted distribution, and there is no closed-form solution for the parameters that minimize the negative log-likelihood, we use gradient descent, which requires us to take the partial derivative of the cost function with respect to the parameters. Let's first express the cost function in terms of the parameters:

$$J(B) = -\sum_{i=1}^{n} \sum_{k=1}^{K} y_{i,k} \ln \frac{e^{z_{i,k}}}{\sum_{j=1}^{K} e^{z_{i,j}}}$$

where $z_{i,k} = \boldsymbol{\beta}_k \cdot \mathbf{x}_i + \beta_{k,0}$ is an abbreviation for the linear equation for the log-odds of class $k$. Let's take the partial derivative of the above cost function with respect to the slopes and intercept using the chain rule. For the slopes of class $k$,

$$\frac{\partial J}{\partial \boldsymbol{\beta}_k} = -\sum_{i=1}^{n} \left( y_{i,k} - P(y_i = k \mid \mathbf{x}_i) \right) \mathbf{x}_i$$

For the intercept,

$$\frac{\partial J}{\partial \beta_{k,0}} = -\sum_{i=1}^{n} \left( y_{i,k} - P(y_i = k \mid \mathbf{x}_i) \right)$$
The above applies to all the classes. These derivatives imply that a sample belonging to class $k$ (where $y_{i,k} = 1$) pushes the predicted probability for class $k$ up, while samples of the other classes push it down. This highlights the importance of having a well-balanced dataset when building and training a classification model, since a class with few samples contributes few updates that raise its predicted probability.
Additionally, in gradient descent, we take the negative of the gradient and add it to the parameters, which means we just need to add the sum of the differences between the one-hot labels and the predicted probabilities, multiplied by the data, to the corresponding parameters. This makes the code implementation very straightforward.
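Before building the full model, we can verify these derivatives numerically with finite differences. This is a small sketch reusing softmax_matrix from above; the helper name nll and the tiny random data are mine, purely for illustration.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 4))                # 5 samples, 4 features
Y_demo = np.eye(3)[rng.integers(0, 3, size=5)]  # one-hot labels for 3 classes
W_demo = rng.normal(size=(4, 3))
b_demo = rng.normal(size=3)

def nll(W, b):
    # Negative log-likelihood (summed over samples)
    return -np.sum(Y_demo * np.log(softmax_matrix(X_demo, W, b)))

# Analytic gradient with respect to the slopes: X^T (P - Y)
grad_W = X_demo.T @ (softmax_matrix(X_demo, W_demo, b_demo) - Y_demo)

# Central finite-difference estimate for one entry of W
eps = 1e-6
W_plus, W_minus = W_demo.copy(), W_demo.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
numeric = (nll(W_plus, b_demo) - nll(W_minus, b_demo)) / (2 * eps)
print(np.isclose(grad_W[0, 0], numeric))  # True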
Code Implementation
By using the above information, let's create the model SoftmaxRegressionGD.
from sklearn.metrics import log_loss

class SoftmaxRegressionGD():
    def __init__(self, lr=0.001):
        # Shapes are taken from the X and y arrays created during preprocessing
        self.W = np.zeros((X.shape[1], y.shape[1]))  # slopes: (n_features, n_classes)
        self.b = np.zeros(y.shape[1])                # intercepts: one per class
        self.lr = lr        # Learning rate
        self.history = []   # History of loss

    def predict(self, X):
        odds = np.exp(np.matmul(X, self.W) + self.b)
        total_odds = np.sum(odds, axis=1)
        return odds / total_odds[:, np.newaxis]

    def fit(self, X, y, epochs=100):
        for i in range(epochs):
            pred = self.predict(X)
            self.history.append(log_loss(y, pred))
            diff = y - pred                  # one-hot labels minus predicted probabilities
            grad_W = np.matmul(X.T, diff)    # negative gradient w.r.t. the slopes
            grad_b = np.sum(diff, axis=0)    # negative gradient w.r.t. the intercepts
            self.W += self.lr * grad_W
            self.b += self.lr * grad_b
        return self.history
The structure of the code is very similar to that of LogisticRegressionGD. The vectorization of the operations might be less intuitive compared to logistic regression, so I encourage you to take time to understand what is happening at each step.
You can initialize SoftmaxRegressionGD and fit it to the training data as follows:
sm = SoftmaxRegressionGD()
history = sm.fit(X_train, y_train, epochs=500)
As we keep track of the negative log-likelihood loss over each epoch, we can plot it to see how gradient descent is working.
plt.plot(history)
plt.title("Loss vs Epoch")
plt.ylabel("Loss")
plt.xlabel("Epoch")
plt.show()

We can see from the above plot that the loss gradually decreases over each epoch, which is a good sign that gradient descent is working.
In the next article, we will finally discuss how classification models can be evaluated. Stay tuned!
Resources
- UFLDL Tutorial. n.d. "Softmax Regression." Stanford University.