This blog post discusses variational autoencoders in deep learning.

Variational Autoencoders
In the last article, we discussed how images are organized in an unknown latent space, which prevents us from simply picking a vector and passing it to the decoder for generative purposes. One solution to this problem is to squeeze the latent space into a standard normal distribution ($\mathcal{N}(0, I)$) so that we can sample any vector from the standard normal distribution to generate images. This approach is called a variational autoencoder (VAE), and it is illustrated in the diagram below.

In the VAE, the encoder outputs the mean and variance of the normal distribution that the image belongs to in the latent space. Random samples are then picked from the encoded normal distributions, and the decoder takes the sample to reconstruct the original image. Backpropagation is used to train the decoder to be robust enough to generate images from any sample drawn from the standard normal distribution, while the encoder is trained to encode images to normal distributions that fit within the standard normal distribution. Here, the target distribution (the standard normal distribution) is called the prior, and the distribution produced by the encoder is called the posterior.
Reparameterization Trick
While this approach makes sense conceptually, it is practically impossible to compute the gradient of the sampling process for backpropagation. Hence, we need a way of simulating the sampling process with computable gradients. This is where the reparameterization trick comes into play. Instead of actually picking a sample from the normal distribution, we use the following equation:
$z = \mu + \sigma \odot \epsilon$

where $z$ is the sample, $\mu$ and $\sigma$ are the mean and the standard deviation (the square root of the variance) produced by the encoder, and $\epsilon$ is a random sample from the standard normal distribution ($\mathcal{N}(0, I)$). By applying this function instead of sampling $z$ directly, we can determine the gradients of the "sampling" with respect to $\mu$ and $\sigma$ as $\frac{\partial z}{\partial \mu} = 1$ and $\frac{\partial z}{\partial \sigma} = \epsilon$. This allows the gradients with respect to the loss to propagate back to the encoder.
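To make this concrete, here is a minimal PyTorch sketch of the trick (the tensors mu and sigma are illustrative placeholders standing in for the encoder's outputs):

import torch

# Placeholders standing in for the encoder's outputs
mu = torch.zeros(1, 10, requires_grad=True)    # mean
sigma = torch.ones(1, 10, requires_grad=True)  # standard deviation

eps = torch.randn_like(sigma)  # random sample from the standard normal distribution
z = mu + sigma * eps           # differentiable "sampling"

z.sum().backward()
print(mu.grad)     # all ones, since dz/dmu = 1
print(sigma.grad)  # equal to eps, since dz/dsigma = eps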
The Log-Var Trick
The encoder is set up to output the variance of the normal distribution, but the variance needs to be positive. To allow the encoder to also produce negative values, we make the encoder output the log of the variance instead. This means we cannot simply take the square root of the variance to get the standard deviation as we've been doing. To get the standard deviation from the log variance, we solve the following:
$\log(\sigma^2) = 2\log(\sigma) \implies \log(\sigma) = \frac{1}{2}\log(\sigma^2)$

From this, we know that we can obtain the standard deviation by $\sigma = \exp\left(\frac{1}{2}\log(\sigma^2)\right)$. Therefore, we can rewrite the function as follows:

$z = \mu + \exp\left(\frac{1}{2}\log(\sigma^2)\right) \odot \epsilon$
This trick of using the log variance is called the Log-Var trick.
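In code, the Log-Var trick adds only one extra step. Below is a small PyTorch sketch (mu and log_var are placeholders for the encoder's outputs):

import torch

mu = torch.zeros(1, 10)       # mean from the encoder
log_var = torch.zeros(1, 10)  # log variance from the encoder (any real value is allowed)

sigma = torch.exp(0.5 * log_var)  # standard deviation = exp(log(variance) / 2)
eps = torch.randn_like(sigma)
z = mu + sigma * eps              # reparameterized sample using the Log-Var trick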
Loss Function
Unlike with a standard autoencoder, we want the posteriors to be as close as possible to the prior. To achieve this, we need an additional loss term for the encoder, aside from the MSE loss, that penalizes posteriors that deviate from the prior. If you recall from the article Road to ML Engineer #2 - Logistic Regression Prerequisites, one metric we can use to measure the distance between two distributions is the KL divergence.
$L = \alpha \cdot \text{MSE}(x, \hat{x}) + D_{KL}(q(z \mid x) \,\|\, p(z))$

Here, $\text{MSE}(x, \hat{x})$ is the MSE loss for the reconstruction of images by the decoder, and $D_{KL}(q(z \mid x) \,\|\, p(z))$ is the KL divergence between the posterior $q(z \mid x) = \mathcal{N}(\mu, \sigma^2)$ and the prior $p(z) = \mathcal{N}(0, I)$. The hyperparameter $\alpha$ specifies the relative importance of the MSE over the KL divergence. The KL divergence between $\mathcal{N}(\mu, \sigma^2)$ and $\mathcal{N}(0, I)$ can be derived as

$D_{KL} = -\frac{1}{2}\sum_{i}\left(1 + \log(\sigma_i^2) - \mu_i^2 - \sigma_i^2\right)$

which makes sense since the terms will be minimized (become zero) when $\mu_i$ and $\sigma_i^2$ are 0 and 1, respectively. (The log variance becomes 0 when $\sigma_i^2$ is 1, the $\mu_i^2$ term becomes 0 when $\mu_i$ is 0, and the leftmost 1 cancels out with $-\sigma_i^2$ when $\sigma_i^2$ is 1. If you're interested in the details of the derivation, check out this page.)
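As a sketch of how this loss could look in PyTorch (the function name vae_loss and the use of a summed MSE are illustrative choices; alpha corresponds to the $\alpha$ above):

import torch
import torch.nn.functional as F

def vae_loss(x, x_recon, mu, log_var, alpha=1.0):
    # Reconstruction term: MSE between the original and reconstructed images
    mse = F.mse_loss(x_recon, x, reduction='sum')
    # KL divergence between N(mu, sigma^2) and the standard normal prior N(0, I)
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    # alpha controls the relative importance of the reconstruction loss
    return alpha * mse + kl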
Code Implementation
Now that we have all the pieces needed to construct a VAE, let's implement it in code. We will be using the same MNIST dataset, and we’ll omit steps 1 and 2 (data preparation and preprocessing) as we have already covered those in the previous article.
Step 3. Models
The following is the implementation of an example VAE using PyTorch and TensorFlow.
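As a reference point, here is one minimal PyTorch sketch of such a model. The 256-unit hidden layer is an assumption; the 10-dimensional latent space, the (mean, log variance, reconstruction) return order, and the decoder attribute are chosen to match how the model is used in Step 4 below.

import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=10):
        super().__init__()
        # Encoder maps the flattened image to the mean and log variance of the posterior
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
        )
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_log_var = nn.Linear(hidden_dim, latent_dim)
        # Decoder maps a latent sample back to a flattened image
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = x.view(x.size(0), -1)                 # flatten the image
        h = self.encoder(x)
        mu, log_var = self.fc_mu(h), self.fc_log_var(h)
        sigma = torch.exp(0.5 * log_var)          # Log-Var trick
        z = mu + sigma * torch.randn_like(sigma)  # reparameterization trick
        return mu, log_var, self.decoder(z)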
Step 4. Model Evaluation
After training, let's observe how well the VAE can reconstruct images by plotting the test dataset alongside the predicted test data. To do this, we need to make predictions with the VAE and reshape both the test data and predicted test data.
# TensorFlow
preds = vae.predict(X_test)
# Split off the first 20 values (the latent mean and log variance) and keep the 784 reconstructed pixels
_, preds = tf.split(preds, [20, 784], 1)
preds = preds.numpy()
X_test = X_test.reshape(X_test.shape[0], 28, 28)
preds = preds.reshape(preds.shape[0], 28, 28)
# PyTorch
# Run the test set through the VAE; after the loop, Xs and preds hold the last batch
for X, y in test_loader:
    Xs = X
    _, _, preds = vae(X)
Xs = Xs.numpy()
preds = preds.detach().numpy()
Xs = Xs.reshape(Xs.shape[0], 28, 28)
preds = preds.reshape(preds.shape[0], 28, 28)
Then, we can use the function below to plot 10 images from each.
def plotImgs(X):
    plt.figure(figsize=(10, 4))
    for i in range(10):
        plt.subplot(2, 5, i + 1)
        plt.imshow(X[i], cmap='gray')
        plt.axis('off')
    plt.tight_layout()
    plt.show()
plotImgs(X_test)
plotImgs(preds)
The following is the result of the VAE implementation in PyTorch.

We can see that the VAE successfully learned to encode and decode images. Now, let's observe how the decoder can generate images from random samples drawn from the standard normal distribution.
# TensorFlow
# Draw 10 random latent vectors of dimension 10 from the standard normal distribution
latent = np.random.normal(0, 1, (10, 10))
decoded = decoder.predict(latent)
decoded = decoded.reshape(decoded.shape[0], 28, 28)
# PyTorch
# Draw 10 random latent vectors of dimension 10 from the standard normal distribution
latent = torch.randn(10, 10)
decoded = vae.decoder(latent)
decoded = decoded.detach().numpy()
decoded = decoded.reshape(decoded.shape[0], 28, 28)
plotImgs(decoded)
The following is the result of the VAE decoder implemented in TensorFlow.

Although the images are still somewhat blurry, we can recognize the handwritten digits, unlike the images generated by the standard autoencoder's decoder in the previous article.
Conclusion
By having the encoder output posterior normal distributions within the prior standard normal distribution, we transformed the autoencoder into a variational autoencoder (VAE) that can be used for generative purposes. However, if you followed along and trained the model yourself, you might have noticed that the training takes a significant amount of time and can be quite unstable even on this small dataset. Therefore, we need to make some improvements to the model. (One hint is that the VAE is not limited to just using normal dense layers.)
Additionally, VAE is not the only approach we can use for generative purposes. There are more creative ways to design generative models. I encourage you to brainstorm other methods for creating generative models. (We will cover one of these in the next article.)
Resources
- Raschka, S. 2021. L17.3 The Log-Var Trick. YouTube.
- Raschka, S. 2021. L17.4 Variational Autoencoder Loss Function. YouTube.
- Raschka, S. 2021. L17.5 A Variational Autoencoder for Handwritten Digits in PyTorch -- Code Example. YouTube.