This blog post discusses recurrent neural networks in deep learning.

Text is a type of sequential data with variable length, but the neural network layers we have covered so far, such as those in ANNs and CNNs, require a fixed input size to determine the appropriate size of the weights. To address the challenge posed by the sequential nature of the data, we need an architecture that accommodates sequences, and one such model is the Recurrent Neural Network (RNN).
Recurrent Neural Networks
Unlike other neural networks that take all the data at once, an RNN processes input values one at a time and passes them to its neurons. In addition to the input value at the current time step, the RNN also takes the activation from the previous time step as an additional input. This allows the model to learn the relationships between past data and current data. The following image illustrates the structure of the RNN.

We can express the RNN mathematically in more detail as follows.
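Following the notation of the Recurrent Neural Networks cheatsheet by Amidi & Amidi listed in the resources below ($x^{<t>}$ is the input and $a^{<t>}$ the activation at time step $t$):

$$
a^{<t>} = g_1\left(W_{aa}\, a^{<t-1>} + W_{ax}\, x^{<t>} + b_a\right), \qquad \hat{y}^{<t>} = g_2\left(W_{ya}\, a^{<t>} + b_y\right)
$$

Here, $W_{aa}$, $W_{ax}$, $W_{ya}$, $b_a$, and $b_y$ are the weights and biases shared across all time steps, and $g_1$ and $g_2$ are activation functions.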
The above shows the structure of the RNN. Due to its sequential nature, the model can handle input sequences of variable length, and it shares the same weights across all time steps, allowing the model to remain compact regardless of the input size. Additionally, the model can produce either a sequence of values or a single value as output, making the architecture well-suited for a variety of tasks (such as language translation, spam detection, and next-word prediction). However, because the activation at each time step depends on the previous time step, the sequential processing means that RNNs cannot take full advantage of parallelization, which is a key factor behind the efficiency of large deep learning models.
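To make the forward pass concrete, here is a minimal NumPy sketch of an RNN that consumes a sequence one step at a time and produces a single output at the end (the variable names and shapes are illustrative, not taken from the implementation later in this post):

import numpy as np

def rnn_forward(x_seq, W_aa, W_ax, W_ya, b_a, b_y):
    # x_seq has shape (T, input_dim): one input vector per time step
    a = np.zeros(W_aa.shape[0])  # initial activation a^<0>
    for x_t in x_seq:  # process the sequence one time step at a time
        a = np.tanh(W_aa @ a + W_ax @ x_t + b_a)  # a^<t> depends on a^<t-1>
    return W_ya @ a + b_y  # single output at the end (e.g., spam detection)

# Toy usage: 5 time steps, 3 input features, 4 hidden units, 2 output classes
rng = np.random.default_rng(0)
x_seq = rng.normal(size=(5, 3))
W_aa, W_ax, W_ya = rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
print(rnn_forward(x_seq, W_aa, W_ax, W_ya, np.zeros(4), np.zeros(2)))

Notice that the same weights are reused at every iteration of the loop, which is why the model stays compact regardless of the sequence length.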
Backpropagation
To optimize the weights and biases of the neurons in an RNN, we need to compute the gradient of the loss function with respect to the weights and biases. To understand how the gradients can be propagated back, let's use an example of spam detection, where the model has only one output at the end. We can start by computing the gradient of the loss function with respect to the output weight $W_{ya}$ and bias $b_y$.
To propagate the gradient back, we also need to compute the gradient with respect to the activation at the last time step, $a^{<T>}$ (where $T$ denotes the number of time steps).
Using this, we can begin computing the gradients with respect to $W_{aa}$, $W_{ax}$, and $b_a$ at time step $T$.
Next, let's consider the gradient with respect to $a^{<T-1>}$, the activation at the previous time step.
From the above, we can compute the gradients with respect to $W_{aa}$, $W_{ax}$, and $b_a$ at time step $T-1$ in the same way.
From here, we can compute the gradient with respect to $a^{<T-2>}$ and continue the same process for the earlier time steps.
We can observe an emerging pattern: the gradients with respect to $W_{aa}$, $W_{ax}$, and $b_a$ at a given time step can be determined by multiplying $\frac{\partial a^{<k+1>}}{\partial a^{<k>}}$ for all succeeding time steps with the corresponding local gradients. The gradients at each time step are summed up to update the parameters. Hence, the following can express the backpropagation:
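In this notation, with $L$ denoting the loss and $T$ the final time step, the gradient with respect to $W_{aa}$, for example, takes the form

$$
\frac{\partial L}{\partial W_{aa}} = \sum_{t=1}^{T} \frac{\partial L}{\partial a^{<t>}} \frac{\partial a^{<t>}}{\partial W_{aa}}, \qquad \frac{\partial L}{\partial a^{<t>}} = \frac{\partial L}{\partial a^{<T>}} \prod_{k=t}^{T-1} \frac{\partial a^{<k+1>}}{\partial a^{<k>}}
$$

where $\frac{\partial a^{<t>}}{\partial W_{aa}}$ is the local gradient at time step $t$ (treating $a^{<t-1>}$ as a constant); the gradients with respect to $W_{ax}$ and $b_a$ take the same form.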
If the task involves sequential output, the summation term must incorporate the gradient from $\hat{y}^{<t>}$ at every time step, and $W_{ya}$ and $b_y$ will also need to be updated using gradients from every time step. I encourage you to compute the gradients for that scenario. Although the above equations might seem complex, you can always track the product by working backwards from the last time step, making the implementation easier.
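To make the idea of working backwards concrete, here is a minimal NumPy sketch of this accumulation for the single-output case, assuming a tanh activation and that the activations from the forward pass have been cached (the function and variable names are illustrative):

import numpy as np

def rnn_backward(x_seq, a_seq, W_aa, W_ax, W_ya, d_y):
    # x_seq: inputs of shape (T, input_dim); a_seq: cached activations of shape (T + 1, hidden_dim),
    # with a_seq[0] the initial activation; d_y: gradient of the loss w.r.t. the linear output W_ya a^<T> + b_y
    T = len(x_seq)
    dW_aa, dW_ax, db_a = np.zeros_like(W_aa), np.zeros_like(W_ax), np.zeros(W_aa.shape[0])
    da = W_ya.T @ d_y  # gradient flowing into the last activation a^<T>
    for t in range(T, 0, -1):  # walk backwards: t = T, T-1, ..., 1
        dz = da * (1 - a_seq[t] ** 2)  # tanh'(z) = 1 - tanh(z)^2
        dW_aa += np.outer(dz, a_seq[t - 1])  # sum the local gradients over time steps
        dW_ax += np.outer(dz, x_seq[t - 1])
        db_a += dz
        da = W_aa.T @ dz  # the product term passed back to a^<t-1>
    return dW_aa, dW_ax, db_a

Each iteration adds the local gradient for its time step and passes the running product back through $W_{aa}$, mirroring the summation and product terms in the equations above.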
The reason we expressed the gradients mathematically in this example is to show that a large number of time steps leads to a long chain of products. This can make the model increasingly prone to the exploding/vanishing gradient problem, even with some of the countermeasures discussed in the article Road to ML Engineer #15 - Unstable Gradients.
Moreover, the above reveals that the gradient at a time step depends on the gradients of succeeding time steps and that backpropagation, like forward propagation, cannot take advantage of parallel processing. This means training an RNN can be significantly slower compared to other neural networks like ANN and CNN.
Code Implementation
Now that we have covered the mechanism of RNNs, we can implement an RNN model in both TensorFlow and PyTorch. As a toy problem, we can use the Brown Corpus, which was the first million-word electronic corpus of English, created at Brown University in 1961 (NLTK, n.d.). The documents are categorized into 15 genres, such as news, editorials, reviews, and more. The task here is to classify the genre of the texts in the documents.
Step 1 & 2. Data Exploration & Preprocessing
Luckily, the NLTK library provides a convenient way to access various aspects of the data, as shown below.
import nltk
from nltk.corpus import brown

nltk.download('brown')  # download the corpus if it is not already available locally

fileids = brown.fileids()        # file IDs
words = list(brown.words())      # words across all documents
categories = brown.categories()  # genre categories
brown.raw(fileids[0])            # raw text of a file ID
From here, we can preprocess and tokenize the text in various ways, such as using Byte Pair Encoding (BPE). In this case, we will use BPE subword tokenization, as we have done previously (the code and explanation are omitted since we have already covered BPE). After obtaining the subword-token map, we can use it to prepare for the document classification task.
import numpy as np
import tensorflow as tf

# category -> index (int)
category_map = dict(zip(brown.categories(), range(len(brown.categories()))))

def generate_data():
    # text_preprocessing, tokenize, and the tokens map come from the BPE step covered previously
    tokenized = []
    categories = []
    for fileid in fileids:
        corpus = text_preprocessing(brown.raw(fileid))
        category = brown.categories(fileid)[0]
        tokenized.append(tokenize(corpus, tokens))
        categories.append(category_map[category])
    tokenized_len = [len(i) for i in tokenized]
    max_len = max(tokenized_len)
    for t in tokenized:
        while len(t) != max_len:
            # Pad every document to max_len. Make sure '[PAD]' is in the tokens map.
            # Padding is needed to convert the lists to NumPy arrays and tensors,
            # which do not accept ragged (variable-length) lists.
            # Alternatively, ragged tensors could be used.
            t.append(tokens['[PAD]'])
    return np.array(tokenized), np.array(categories)

tokenized, categories = generate_data()
categories = tf.keras.utils.to_categorical(categories)
From the NumPy array of the tokenized corpus and the categories, we can prepare a dataset for both TensorFlow and PyTorch. (You can convert the tokens into word embeddings as part of data preprocessing, but the process of converting tokens into embeddings typically happens within the model by including an embedding layer.)
import torch
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(tokenized, categories, test_size=0.2, random_state=101)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=101)

# PyTorch
X_train, X_val, X_test = map(lambda X: torch.tensor(X, dtype=torch.int32), (X_train, X_val, X_test))
y_train, y_val, y_test = map(lambda y: torch.tensor(y, dtype=torch.float32), (y_train, y_val, y_test))

train_dataset = torch.utils.data.TensorDataset(X_train, y_train)
val_dataset = torch.utils.data.TensorDataset(X_val, y_val)
test_dataset = torch.utils.data.TensorDataset(X_test, y_test)

train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=8, shuffle=True)
val_loader = torch.utils.data.DataLoader(dataset=val_dataset, batch_size=4, shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=1, shuffle=True)
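The loaders above cover the PyTorch side; a possible TensorFlow counterpart (a sketch that builds tf.data pipelines from the same splits with the same batch sizes; the `.numpy()` calls simply undo the torch conversion above) is:

# TensorFlow
train_ds = tf.data.Dataset.from_tensor_slices((X_train.numpy(), y_train.numpy())).shuffle(len(X_train)).batch(8)
val_ds = tf.data.Dataset.from_tensor_slices((X_val.numpy(), y_val.numpy())).batch(4)
test_ds = tf.data.Dataset.from_tensor_slices((X_test.numpy(), y_test.numpy())).batch(1)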
Step 3. Model
The following is an example of how to implement an RNN model in both TensorFlow and PyTorch.
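As a minimal sketch of such a model, an embedding layer followed by a single RNN layer and a softmax classifier could look like the following (the embedding and hidden dimensions are illustrative choices, and `tokens` is the BPE subword-token map from the preprocessing step):

# TensorFlow (Keras): Embedding -> SimpleRNN -> Dense classifier
vocab_size = len(tokens)
num_classes = categories.shape[1]  # 15 genres, one-hot encoded

tf_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=64),
    tf.keras.layers.SimpleRNN(64),  # processes the token embeddings one time step at a time
    tf.keras.layers.Dense(num_classes, activation='softmax'),
])
tf_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# PyTorch: the equivalent model with nn.Embedding and nn.RNN
import torch.nn as nn

class RNNClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=64, num_classes=15):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        embedded = self.embedding(x)       # (batch, seq_len, embed_dim)
        _, hidden = self.rnn(embedded)     # hidden state from the last time step: (1, batch, hidden_dim)
        return self.fc(hidden.squeeze(0))  # logits over the genre classes

torch_model = RNNClassifier(vocab_size)

The Keras model can then be trained with `tf_model.fit`, and the PyTorch model with a standard training loop over `train_loader` using a cross-entropy loss.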
I will omit the training results and Step 4 (model evaluation) in this article, but it is worth mentioning that the model moderately struggles to learn to classify the genre of the documents (this also depends on factors such as how many BPE merge iterations you run, among others). You can try training a new word embedding for this particular task to see if the performance improves.
If you are following along, you might also notice how long it takes to train the model and make predictions. While the padding added to handle variable input lengths partially contributes to this, the primary cause is the sequential nature of the RNN in both forward and backward propagation, which cannot be parallelized efficiently on GPUs the way ANNs and CNNs can.
Conclusion
In this article, we covered RNN, which is inherently sequential in its computations, allowing the model to process input sequences of variable lengths. Despite this advantage, we observed mathematically and empirically that RNN has critical issues with unstable gradients and lack of parallelizability. As a result, RNN is rarely used in practice today, but it serves as an important theoretical foundation for other models that are still widely used. In the next article, we will attempt to address the unstable gradient problem of RNNs.
Resources
- Amidi, A. & Amidi, S. (n.d.). Recurrent Neural Networks cheatsheet. Stanford University.
- NLTK. (n.d.). Accessing Text Corpora and Lexical Resources. NLTK.