Road to ML Engineer #24 - GloVe

Last Edited: 10/2/2024

This blog post discusses GloVe, an approach to learning word embeddings in deep learning.


Latent Semantic Analysis

In Word2Vec, we create a dataset containing the token of the central word, the tokens of context words, and their labels for each row. We can also combine these results to create a window-based co-occurrence matrix, where the rows are central words, the columns are context words, and the values are the co-occurrence counts. Then, we can apply Singular Value Decomposition (SVD) on the co-occurrence matrix to extract word embeddings or their low-rank approximation. This technique is called Latent Semantic Analysis (LSA) and can capture word similarity.

X = U \Sigma V^T

Here, X is the co-occurrence matrix, U contains the left singular vectors (word vectors), Σ contains the singular values, and V contains the right singular vectors (context word vectors). Both U and V are orthogonal, and Σ is a diagonal matrix whose rank equals the rank of X. (For more details regarding SVD, check out Singular Value Decomposition : Data Science Basics by ritvikmath.) To obtain U, we can multiply X by its transpose as follows:

XX^T = U \Sigma V^T V \Sigma U^T = U \Sigma^2 U^{-1}

Since V is orthogonal and Σ is a diagonal matrix, V^T V becomes an identity matrix, and ΣΣ becomes Σ^2. Thus, U and Σ correspond to the eigenvectors and the square roots of the eigenvalues in the eigendecomposition of XX^T. Therefore, we can obtain U by computing this eigendecomposition. (For more details regarding eigendecomposition, check out Eigendecomposition : Data Science Basics by ritvikmath.) LSA does not require training a neural network and leverages global statistics to efficiently create word embeddings that capture syntactic and semantic similarities between words. However, the method can place disproportionate emphasis on high-frequency words and cannot capture certain nuances that Word2Vec successfully captures.
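
To make this concrete, here is a minimal NumPy sketch of LSA, assuming a small hypothetical co-occurrence matrix X and keeping only the top k singular values:

import numpy as np

# Hypothetical toy co-occurrence matrix (rows: central words, columns: context words).
X = np.array([
    [0., 2., 1., 0.],
    [2., 0., 3., 1.],
    [1., 3., 0., 2.],
    [0., 1., 2., 0.],
])

# Full SVD: X = U @ diag(S) @ Vt, with singular values sorted in descending order.
U, S, Vt = np.linalg.svd(X)

# Keep only the top-k components for a rank-k approximation and k-dimensional word vectors.
k = 2
word_vectors = U[:, :k]                            # one k-dimensional vector per central word
X_approx = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # low-rank approximation of X

print(word_vectors.shape)   # (4, 2)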

GloVe

GloVe aims to combine Word2Vec's ability to capture subtle linguistic nuances with LSA's efficient use of global statistics by performing matrix factorization on a window-based co-occurrence matrix. Matrix factorization seeks to capture latent representations of the rows and columns, mostly in lower dimensions, by approximating the original matrix as the product of smaller matrices for rows and columns.

X_{m,n} = U_{m,p} V_{n,p}^T

Matrix factorization typically uses learning algorithms like gradient descent to learn the best latent representations U and V, and this technique is often used in recommender systems (I might discuss recommender systems later in this series). In GloVe, matrix factorization is performed on the window-based co-occurrence matrix to obtain the word embeddings U + V. (In this case, both U and V share the same shape and represent words in a latent space, and it has been empirically shown that adding them produces good word embeddings.)
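
To illustrate the general idea (independent of GloVe's specific objective), below is a minimal NumPy sketch that factorizes a small hypothetical matrix by plain gradient descent on a squared reconstruction error; the sizes m, n, p, the learning rate, and the step count are all illustrative:

import numpy as np

rng = np.random.default_rng(0)

m, n, p = 6, 5, 3                    # m rows, n columns, p latent dimensions
X = rng.random((m, n))               # toy matrix to factorize

U = rng.normal(scale=0.1, size=(m, p))
V = rng.normal(scale=0.1, size=(n, p))

learning_rate = 0.05
for step in range(3000):
    error = U @ V.T - X              # reconstruction error
    grad_U = error @ V               # gradient of 0.5 * ||U V^T - X||^2 w.r.t. U
    grad_V = error.T @ U             # gradient w.r.t. V
    U -= learning_rate * grad_U
    V -= learning_rate * grad_V

# The residual shrinks toward the best rank-p approximation of X.
print(np.linalg.norm(U @ V.T - X))

GloVe uses the same mechanism but swaps in the weighted log-count objective shown next.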

J(\theta) = \frac{1}{2} \sum_{i,j=1}^{W} f(X_{i,j}) \left(u_i^T v_j - \log(X_{i,j})\right)^2 \\
f(X_{i,j}) = \begin{cases} \left(\frac{X_{i,j}}{X_{max}}\right)^\alpha & \text{if } X_{i,j} < X_{max} \\ 1 & \text{otherwise} \end{cases}

The above is the objective function we use for matrix factorization. It might look complicated, but it's quite simple: we want the dot product of the word vector and the context word vector, u_i^T v_j, to approximate the logarithm of the co-occurrence count X_ij. We minimize a squared loss over every word pair in the vocabulary of W words, weighted by f(X_ij) to balance the influence of low- and high-frequency pairs.
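
For example, with the values reported in the GloVe paper (X_max = 100, α = 0.75), the weighting behaves as follows; glove_weight is just an illustrative helper name:

import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    # Down-weights rare pairs and caps the influence of very frequent ones at 1.
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

counts = np.array([1., 10., 50., 100., 500.])
print(glove_weight(counts))   # ≈ [0.032 0.178 0.595 1.    1.   ]

The TensorFlow implementation later in this article uses a simplified variant of this weighting with α = 1.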

Code Implementation

To implement GloVe, we need to create a window-based co-occurrence matrix, which can be done efficiently using the following method. (All previous steps, such as text preprocessing and tokenization, remain the same as those used in the previous article.)

from collections import defaultdict

def create_cooccurrence_matrix(tokenized_corpus, window_size=5):
    # Map (central word, context word) pairs to their co-occurrence counts.
    co_occurrence = defaultdict(float)
    for i, word in enumerate(tokenized_corpus):
        # Consider context words within `window_size` positions on either side.
        start = max(0, i - window_size)
        end = min(len(tokenized_corpus), i + window_size + 1)
        for j in range(start, end):
            if i != j:
                context_word = tokenized_corpus[j]
                co_occurrence[(word, context_word)] += 1

    return co_occurrence
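
For instance, running it on a tiny hypothetical corpus of string tokens gives counts like the following:

toy_corpus = ["the", "cat", "sat", "on", "the", "mat"]
co_occurrence = create_cooccurrence_matrix(toy_corpus, window_size=2)

# Both occurrences of "the" have "sat" within two positions, so the pair count is 2.0.
print(co_occurrence[("the", "sat")])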

From the co-occurrence matrix, we can create the training data as follows.

import numpy as np

def create_training_data(co_occurrence):
    # Flatten the co-occurrence dictionary into parallel arrays of
    # central words, context words, and co-occurrence counts.
    words = []
    contexts = []
    counts = []

    for (word, context_word), count in co_occurrence.items():
        words.append(word)
        contexts.append(context_word)
        counts.append(count)

    return np.array(words), np.array(contexts), np.array(counts, dtype=np.float32)
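
Applied to the toy counts from the sketch above, this yields three aligned arrays, where words[k] co-occurs with contexts[k] exactly counts[k] times:

words, contexts, counts = create_training_data(co_occurrence)
print(words.shape, contexts.shape, counts.shape)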

(The creation of datasets for TensorFlow and PyTorch is abbreviated here, as it is covered in the previous article.) Then, we can build our GloVe model with the objective function described above and train it. Below is the TensorFlow implementation of GloVe.

import tensorflow as tf
from tensorflow.keras import layers

class GloVe(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim):
    super(GloVe, self).__init__()
    # One embedding table for central words and one for context words.
    self.word_embedding = layers.Embedding(vocab_size,
                                           embedding_dim,
                                           embeddings_initializer="glorot_normal",
                                           embeddings_regularizer="l2")
    self.context_embedding = layers.Embedding(vocab_size,
                                              embedding_dim,
                                              embeddings_initializer="glorot_normal",
                                              embeddings_regularizer="l2")

  def call(self, pair):
    # Predict the dot product u_i^T v_j for each (word, context) pair.
    word, context = pair
    word_emb = self.word_embedding(word)
    context_emb = self.context_embedding(context)
    dots = tf.reduce_sum(word_emb * context_emb, axis=-1)
    return dots

def custom_loss(y_true, y_pred):
    # y_true holds the co-occurrence counts X_ij; y_pred holds the dot products u_i^T v_j.
    # Clip the counts so the logarithm is defined and the weight is capped at 1
    # (a simplified weighting with X_max = 100 and alpha = 1).
    y_true = tf.clip_by_value(y_true, clip_value_min=1e-5, clip_value_max=100)
    f = y_true / 100
    log_y_true = tf.math.log(y_true)
    return 0.5 * f * tf.math.square(y_pred - log_y_true)

embedding_dim = 1024
vocab_size = len(tokens)
glove = GloVe(vocab_size, embedding_dim)
glove.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
              loss=custom_loss)
glove.fit(dataset, epochs=15)

You might observe that training takes significantly less time due to the use of global statistics, while the word embeddings produced by GloVe are as expressive as those produced by Word2Vec. As a challenge, you might consider adding a function that creates the final word embeddings by summing the word embedding and the context embedding, and implementing the model in PyTorch.
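
As a starting point for the first part of that challenge, one possible sketch (assuming the glove model trained above) is to sum the two learned embedding matrices:

# Each Embedding layer stores a single weight matrix of shape (vocab_size, embedding_dim).
word_vectors = glove.word_embedding.get_weights()[0]
context_vectors = glove.context_embedding.get_weights()[0]
embeddings = word_vectors + context_vectors   # GloVe-style combined word embeddings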

Conclusion

In this article, we covered Latent Semantic Analysis (LSA) as an alternative approach to creating word embeddings, and discussed its benefits and drawbacks compared to Word2Vec, which led to the motivation behind GloVe. There are other alternatives, such as FastText, that you might want to explore if you're interested. Now that we have word embeddings, we are ready to build language models.

Resources