This blog post discusses LSTM (Long Short-Term Memory) in deep learning.

In the last article on RNN, we discussed how RNNs typically face the unstable gradient problem during backpropagation. A similar problem also exists in forward propagation, where preceding inputs and activations that are relevant can be easily forgotten after a series of multiplications.
For example, when an RNN processes the sentence "I confirmed the validity of the document, so I put a seal on it," it can easily forget that the context was related to the "document", which is highly relevant to the meaning of the word "seal". Ideally, an RNN should remember this context and interpret "seal" as an official mark or stamp rather than a cute marine animal. To avoid the vanishing gradient problem and retain long-term memory, Long Short-Term Memory (LSTM) was invented as a new type of recurrent neural network.
LSTM
LSTM consists of four main gates: the relevance gate, input gate, forget gate, and output gate, each corresponding to tasks related to long- and short-term memory. Although the tasks of these gates are different, all of them perform the following operations with dedicated sets of weights and biases.
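A common way to write this shared operation, using x_t for the current input, a_{t-1} for the short-term memory (activation) from the previous time step, and W, U, b for each gate's own weights and bias, is:

$$
\Gamma = \text{activation}\left(W x_t + U a_{t-1} + b\right)
$$

Here Γ stands for any of the four gates (written below as Γ_r, Γ_i, Γ_f, and Γ_o), and the activation function is chosen per gate as described next.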
The relevance gate (Γ_r) determines how relevant the input is to long-term memory, given the short-term memory from the previous time step. The input gate (Γ_i) determines how much the long-term memory should be updated, given the input at the current time step and its relevance. The forget gate (Γ_f) determines how much of the long-term memory from the past should be forgotten. The output gate (Γ_o) determines the new short-term memory, which is the output of the cell. The activation functions are sigmoid, except for the relevance gate, which uses tanh so that it can also express negative values. The following shows a cell of the LSTM architecture:

The above can be expressed mathematically in more detail as follows:
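One standard way to write these updates, with σ denoting the sigmoid function and ⊙ the element-wise product, is:

$$
\begin{aligned}
\Gamma_r &= \tanh\left(W_r x_t + U_r a_{t-1} + b_r\right) \\
\Gamma_i &= \sigma\left(W_i x_t + U_i a_{t-1} + b_i\right) \\
\Gamma_f &= \sigma\left(W_f x_t + U_f a_{t-1} + b_f\right) \\
\Gamma_o &= \sigma\left(W_o x_t + U_o a_{t-1} + b_o\right) \\
c_t &= \Gamma_f \odot c_{t-1} + \Gamma_i \odot \Gamma_r \\
a_t &= \Gamma_o \odot \tanh\left(c_t\right)
\end{aligned}
$$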
The long-term memory is called the cell state, expressed as c_t, and the short-term memory is the activation, expressed as a_t. The symbol ⊙ above indicates element-wise multiplication, also known as the Hadamard product. From the above, we can see that the LSTM cell is designed with a clear intent to preserve long-term memory. Empirically, it has been shown to retain memory longer than a simple RNN.
You can also see from the architecture that the gradients can remain larger than in a simple RNN, thanks to the additional gradient path through the long-term memory, which undergoes fewer multiplications of local gradients than in a simple RNN. (LSTM performs only two element-wise multiplications and one tanh activation on the cell state c_t.) Hence, the model is less prone to the vanishing gradient problem, although it can still suffer from the unstable gradient problem (especially exploding gradients).
NOTE: The LSTM layer is predefined in both TensorFlow and PyTorch, working the same way as the predefined RNN layer we used in the previous article. If you're interested, you can swap the RNN layer with LSTM and see if there is any improvement in the previous model.
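For example, here is a minimal sketch of what that swap looks like in both frameworks (the layer sizes and input dimensions below are placeholders, not the exact values from the previous model):

```python
import tensorflow as tf
import torch.nn as nn

# TensorFlow / Keras: replace the SimpleRNN layer with LSTM; the rest of the model stays the same.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64),  # previously: tf.keras.layers.SimpleRNN(64)
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# PyTorch: replace nn.RNN with nn.LSTM. Note that nn.LSTM returns (output, (h_n, c_n)),
# i.e. it also returns the cell state, whereas nn.RNN returns (output, h_n).
lstm_layer = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)  # previously: nn.RNN(...)
```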
Other Solutions
In addition to LSTM, there are several other ways to improve the performance of recurrent neural networks. A similar approach to LSTM is the GRU (Gated Recurrent Unit), which also attempts to preserve long-term memory but uses two gates instead of four. I won’t go into too much detail here, but the GRU layer is also predefined in both TensorFlow and PyTorch. So, if you're interested, you can try it out by simply swapping the layer of the model we built in the previous article.
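As a rough sketch, the swap is just as small (placeholder sizes again):

```python
import tensorflow as tf
import torch.nn as nn

# Keras: GRU is a drop-in replacement for SimpleRNN or LSTM.
gru_tf = tf.keras.layers.GRU(64)

# PyTorch: nn.GRU takes the same constructor arguments and, like nn.RNN,
# returns (output, h_n) with no separate cell state.
gru_torch = nn.GRU(input_size=32, hidden_size=64, batch_first=True)
```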
Aside from changing the layers of the model, an alternative approach is to modify the model's overall architecture.
Bidirectional RNN is one such solution, where inputs are processed in both directions to utilize context from both the past and the future. This might be a suitable approach for sequence-to-sequence models in tasks like machine translation and speech recognition. You can easily implement this by using tf.keras.layers.Bidirectional in TensorFlow, or by setting the bidirectional parameter of a layer to True in PyTorch.
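A minimal sketch of both options (placeholder sizes, assuming an LSTM as the underlying recurrent layer):

```python
import tensorflow as tf
import torch.nn as nn

# Keras: wrap any recurrent layer with Bidirectional to process the sequence in both directions.
bi_lstm_tf = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))

# PyTorch: set bidirectional=True; the output feature dimension doubles
# because the forward and backward activations are concatenated.
bi_lstm_torch = nn.LSTM(input_size=32, hidden_size=64, batch_first=True, bidirectional=True)
```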
Additionally, you can easily stack multiple RNN layers to enable each layer to learn hidden features at different levels, which is known as a deep RNN.
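For instance, a sketch of a two-layer (deep) recurrent model with placeholder sizes:

```python
import tensorflow as tf
import torch.nn as nn

# Keras: stack recurrent layers; every layer except the last needs return_sequences=True
# so that the next layer receives the full sequence of activations.
deep_rnn_tf = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(64),
])

# PyTorch: num_layers stacks recurrent layers inside a single module.
deep_rnn_torch = nn.LSTM(input_size=32, hidden_size=64, num_layers=2, batch_first=True)
```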
Conclusion
In this article, we discussed how LSTM was invented to achieve longer memory retention and prevent the vanishing gradient problem, as well as how LSTM functions. We also briefly covered other solutions like GRU, bidirectional RNN, and deep RNN, all of which can provide incremental improvements in model performance. However, the issue of unstable gradients still remains. In the next article, we will explore how normalization techniques can help address this problem further.
Resources
- Amidi, A. & Amidi, S. Recurrent Neural Networks cheatsheet. Stanford University.