This blog post discusses tokenization and word embeddings in NLP.

NLP, or Natural Language Processing, involves a variety of tasks related to natural languages, such as translation, sentiment analysis, question answering, and more. It is an area where we've seen remarkable results and societal impact with the rise of various large language models (LLMs). In this article, we will cover how natural languages are converted from strings or sequences of characters into numbers that are easier for algorithms to process.
Text Preprocessing
Before converting strings into numbers, we need to inspect the text and clean it up. For example, our data might contain an emoji digit such as "1️⃣", stray invisible characters, and an emoji such as "😘", and we need to decide how to handle them. We could remove the invisible characters, standardize every "1️⃣" to "1", and convert "😘" into "(Face Blowing a Kiss)" or simply remove it from the data. We also face many challenges related to data quality, such as inconsistent whitespace, grammatical errors, spelling mistakes, and so on, which researchers constantly deal with.
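As a tiny sketch of such cleanup rules (the characters and replacements here are hypothetical examples, not taken from any particular dataset), the decisions above could be encoded as a simple mapping:
# Hypothetical cleanup rules; real pipelines typically need many more
replacements = {
    '\u200b': '',                      # drop zero-width spaces
    '1️⃣': '1',                         # standardize emoji digits to plain digits
    '😘': '(Face Blowing a Kiss)',     # describe emojis, or drop them entirely
}

def clean_special_characters(text):
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text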
Tokenization
After preprocessing, we're ready to convert strings into numerical values or tokens, a process called tokenization. There are several approaches to tokenization, such as rule-based tokenization, where the text is split into words by whitespace or punctuation, and each new word is assigned a unique token ID.
This rule-based approach is simple and effective enough in some scenarios, especially when combined with stemming or lemmatization, which reduce word variations by removing suffixes like "ing", "able", etc. (Stemming simply cuts off suffixes, while lemmatization ensures the removed suffixes make sense.) However, this approach may not be sufficient when we want LLMs to understand subtle nuances and generate grammatically accurate sentences.
To understand why this approach may fall short, consider the word "fishing". After stemming or lemmatization, it becomes "fish", changing the action of catching fish into the noun itself. Without stemming or lemmatization, "fishing" and "fish" would be treated as entirely different words, even though they share a common root, making it harder for a machine learning model to recognize their relationship.
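As a quick illustration, here is a minimal sketch of rule-based tokenization combined with NLTK's stemmer and lemmatizer (assuming the wordnet data has been downloaded); it shows how "fishing" collapses to "fish" in both cases:
import nltk
nltk.download('wordnet')  # required by the lemmatizer
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Naive rule-based tokenization: lowercase and split on whitespace
words = "He went fishing and caught two fish".lower().split()

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(w) for w in words])                   # 'fishing' -> 'fish'
print([lemmatizer.lemmatize(w, pos='v') for w in words])  # 'fishing' -> 'fish' (as a verb)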
Subword Tokenization: BPE
A more intuitive solution is to create tokens for both "fish" and "ing", allowing the word "fishing" to be broken down into meaningful subwords. This approach, called subword tokenization, is often done using statistical methods like Byte Pair Encoding (BPE). First, characters are split and represented as bytes. Then, the algorithm recursively identifies the most frequent pairs of bytes and merges them into a single entity or subword. Ideally, frequent byte pairs like "n" and "g" or "i" and "ng" will merge to form the subword "ing".
aabaabfc: the pair "aa" is among the most frequent, so merge it as X = aa
-> XbXbfc: the pair "Xb" is now the most frequent, so merge it as Y = Xb (i.e., aab)
-> YYfc
After recursively processing the text, the subwords are assigned new token IDs, which are then used to convert the character sequence into a sequence of tokens. For instance, if "fishing" isn't in the training data, but "fish" and "ing" have been encoded as subwords, "fishing" will be tokenized into two tokens: "fish" and "ing". If an unknown subword appears, like "buffling", the algorithm uses a special token for unknown words, typically "[UNK]", and tokenizes it as "[UNK]" and "ing". (There are many subword tokenization approaches like WordPiece, so if you're interested, I recommend checking them out. Many tokenizers are publicly available, allowing you to select the one that suits your needs.)
Word2Vec
While tokens have now been assigned to words or subwords, a single token ID carries no information about the meaning of the word. Ideally, we want to represent words as multidimensional vectors, where each number in the vector expresses a feature of the word, and similar words that appear in similar contexts have similar vectors. These vector representations are called word embeddings, and we can obtain them through self-supervised training using a neural network called Word2Vec.

Word2Vec takes a one-hot encoded vector of a token and passes it through an embedding layer of arbitrary dimension. There are two common approaches to training Word2Vec: CBOW (Continuous Bag of Words) and Skip-gram. CBOW captures word embeddings by predicting a masked word from its surrounding context, while Skip-gram predicts the context window from a word. As the model trains, the weights corresponding to each word adjust to capture the contexts in which the words are used, ultimately representing the meanings of the words. After training, we can convert a token into a one-hot encoded vector, then into a word embedding by taking the corresponding set of weights.
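To make the difference concrete, here is a toy sketch (an invented sentence, unrelated to the corpus used later) of how the two approaches form training pairs around a center word:
sentence = ["i", "went", "fishing", "at", "the", "lake"]
window = 2
center_idx = 2  # "fishing"

context = [sentence[i] for i in range(center_idx - window, center_idx + window + 1)
           if i != center_idx]

# CBOW: predict the center word from its context
cbow_pair = (context, sentence[center_idx])   # (['i', 'went', 'at', 'the'], 'fishing')

# Skip-gram: predict each context word from the center word
skipgram_pairs = [(sentence[center_idx], c) for c in context]
# [('fishing', 'i'), ('fishing', 'went'), ('fishing', 'at'), ('fishing', 'the')]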
Subsampling & Negative Sampling
Training Word2Vec, however, is no easy task. When dealing with large datasets, there are often millions or even billions of tokens. This means the one-hot encoded vectors can have millions or billions of elements, and the output softmax layer can become equally large. Training such a model would be infeasible, so we rely on certain techniques to simplify the process. One such technique is subsampling, where high-frequency words (like "the", "a") and low-frequency words (like "gobbledygook") are ignored, as they tend to be less informative. Another technique is negative sampling, where only 10 to 20 negative samples are used per training step, rather than evaluating the entire vocabulary. (There's also hierarchical softmax, but that's beyond the scope of this article.)
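As a minimal sketch of subsampling, the original Word2Vec paper discards each occurrence of a word with probability 1 - sqrt(t / f(w)), where f(w) is the word's relative frequency and t is a small threshold (around 1e-5); rare words are usually handled separately with a minimum-count cutoff. In code, the keep probability might look like this:
import numpy as np

def keep_probability(word_fraction, t=1e-5):
    # Probability of keeping a word whose relative frequency is word_fraction;
    # very frequent words like "the" are dropped most of the time
    return min(1.0, np.sqrt(t / word_fraction))

print(keep_probability(0.05))   # a very frequent word is kept only rarely
print(keep_probability(1e-6))   # a rare word is always kept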
Mathematical Formulation
From the above, we can finally formulate the objective function for Word2Vec. Here, we will use Skip-gram with subsampling and negative sampling.
The input (center) word embedding is expressed as $v_w$, and the correct context word embedding and the negative-sample context word embeddings are $u_c$ and $u_{c_i}$, respectively. We use the sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$ to simulate probability, and we apply the monotonically increasing $\log$ function to simulate log probability. In Skip-gram with negative sampling, we want to maximize the log probability of getting $u_c$ with $v_w$ and minimize the log probability of getting the negative samples $u_{c_i}$ with $v_w$, so for a single training pair with $k$ negative samples we maximize

$$\log \sigma(u_c^\top v_w) + \sum_{i=1}^{k} \log \sigma(-u_{c_i}^\top v_w).$$

The negative samples are drawn from the distribution $P_n(w) \propto U(w)^{3/4}$, which is the unigram distribution raised to the power $3/4$. The unigram distribution $U(w)$ is simply the probability of seeing a word, which can be determined by counting, and the exponent $3/4$ was chosen empirically so that high-frequency words are sampled relatively less often.
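To make the objective concrete, here is a small NumPy sketch that evaluates this quantity for one training pair, using random toy vectors rather than embeddings learned from data:
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v_center = np.random.randn(8)         # center word embedding v_w
u_context = np.random.randn(8)        # true context embedding u_c
u_negatives = np.random.randn(5, 8)   # k = 5 negative context embeddings u_{c_i}

# Skip-gram negative-sampling objective for this single pair (to be maximized)
objective = np.log(sigmoid(u_context @ v_center)) \
          + np.sum(np.log(sigmoid(-u_negatives @ v_center)))
print(objective)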
Code Implementation
Now that we understand the process, we can implement it in Python. For the demonstration, we will use the Reuters Corpus from the NLTK library.
import nltk
nltk.download('reuters')
from nltk.corpus import reuters
You can download the Reuters Corpus as shown above. The corpus already provides many useful methods, such as fileids for obtaining file IDs, raw for getting the raw text, and words for viewing the list of words.
fileids = reuters.fileids()
words = list(reuters.words())
corpus = reuters.raw()
Preprocessing
In the text preprocessing phase, we only remove \n and unnecessary whitespace from the raw text as follows.
import re

def text_preprocessing(text):
    text = text.replace('\n', ' ')            # replace newlines with spaces
    text = re.sub(r'\s+', ' ', text).strip()  # collapse repeated whitespace
    return text

corpus = text_preprocessing(corpus)
Tokenization (BPE)
We can start by implementing the Byte Pair Encoding algorithm. First, we need to split each word into whitespace-separated characters, add the end tag </w>, and store the words with their counts in a dictionary.
from collections import defaultdict

def get_vocab(words):
    vocab = defaultdict(int)
    for word in words:
        # Split the word into characters and append the end-of-word tag
        vocab[' '.join(list(word)) + ' </w>'] += 1
    return vocab
After that, we count the frequency of every byte pair (character pair in English) and store it in the dictionary.
def get_pairs(vocab):
    pairs = defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs
Using the dictionary, we find the pair with the maximum frequency, which we then merge by eliminating the whitespace.
def merge_vocab(pair, v_in):
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out
Combining all of the above, we can create a BPE function.
def byte_pair_encoding(words, n):
    vocab = get_vocab(words)
    for i in range(n):
        pairs = get_pairs(vocab)
        max_pair = max(pairs, key=pairs.get)
        vocab = merge_vocab(max_pair, vocab)
    return vocab

vocab = byte_pair_encoding(words, 10000)
vocab = dict(sorted(vocab.items(), key=lambda x: x[1], reverse=True))
Here, n is the hyperparameter that dictates how many pairs to merge. Using the vocab dictionary, we can build dictionaries that map each subword token to an ID (and back), along with its frequency.
def build_token_dictionary(vocab):
    token_to_id = {}
    id_to_token = {}
    id_frequency = {}
    current_id = 0
    for word, frequency in vocab.items():
        tokens = word.split()  # Split the word based on spaces
        for token in tokens:
            if token not in token_to_id:
                token_to_id[token] = current_id
                id_to_token[current_id] = token
                id_frequency[current_id] = frequency
                current_id += 1
            else:
                id_frequency[token_to_id[token]] += frequency
    token_to_id['[UNK]'] = current_id
    id_to_token[current_id] = '[UNK]'
    id_frequency[current_id] = 0
    return token_to_id, id_to_token, id_frequency
tokens, reverse_tokens, token_frequency = build_token_dictionary(vocab)
Using the dictionary that maps subwords to token IDs, we can create a tokenization function that checks for the longest subword match at every position of each word and assigns tokens accordingly.
def tokenize(text, token_dict, verbose=False):
    tokens = []
    words = text.split()
    for word in words:
        word_with_marker = word + '</w>'
        i = 0
        while i < len(word_with_marker):
            # Try the longest possible subword starting at position i
            for j in range(len(word_with_marker), i, -1):
                subword = word_with_marker[i:j]
                if subword in token_dict:
                    if verbose:
                        print(subword)
                    tokens.append(token_dict[subword])
                    i = j
                    break
            else:
                # No subword matched: emit [UNK] and advance by one character
                tokens.append(token_dict['[UNK]'])
                i += 1
    return tokens
tokenized_corpus = tokenize(corpus, tokens)
Word2Vec: Skip-gram
When training the Skip-gram Word2Vec model, we need to prepare negative samples. Using the frequency table, we can create probability tables and use those for negative sampling.
import numpy as np

def transform_frequencies(id_frequency):
    # Raise counts to the 3/4 power and normalize them into a probability distribution
    transformed = {key: value ** (3 / 4) for key, value in id_frequency.items()}
    total = np.sum(list(transformed.values()))
    normalized = {key: value / total for key, value in transformed.items()}
    return normalized
normalized_frequency = transform_frequencies(token_frequency)
def negative_sampling(normalized_frequency, num_samples, avoid_ids=[]):
    # Create a list of ids based on their normalized frequency
    ids = list(normalized_frequency.keys())
    probabilities = list(normalized_frequency.values())
    # Filter out avoided IDs
    filtered_ids = []
    filtered_probabilities = []
    for i, id in enumerate(ids):
        if id not in avoid_ids:
            filtered_ids.append(id)
            filtered_probabilities.append(probabilities[i])
    # Normalize the filtered probabilities
    filtered_probabilities = np.array(filtered_probabilities)
    filtered_probabilities /= np.sum(filtered_probabilities)
    # Sample `num_samples` IDs based on the filtered probabilities
    negative_samples = np.random.choice(filtered_ids, size=num_samples, p=filtered_probabilities)
    return list(negative_samples)
Based on the above, we can generate training data using the function below.
import tqdm
def generate_skipgram_training_data(tokens, normalized_frequency, window_size=2):
    center_words = []
    contexts = []
    labels = []
    for center_idx in tqdm.tqdm(range(window_size, len(tokens) - window_size)):
        center_word = tokens[center_idx]
        context_words = [tokens[i] for i in range(center_idx - window_size, center_idx + window_size + 1) if i != center_idx]
        negative_samples = negative_sampling(normalized_frequency, 2 * window_size)
        context = context_words + negative_samples
        label = [1 for _ in context_words] + [0 for _ in negative_samples]
        center_words.append(center_word)
        contexts.append(context)
        labels.append(label)
    return center_words, contexts, labels
center_words, contexts, labels = generate_skipgram_training_data(tokenized_corpus, normalized_frequency, window_size=5)
center_words, contexts, labels = np.array(center_words), np.array(contexts), np.array(labels)
From the numpy arrays, we can create datasets for either TensorFlow or PyTorch.
BATCH_SIZE = 1024
BUFFER_SIZE = 10000
# TensorFlow
import tensorflow as tf
dataset = tf.data.Dataset.from_tensor_slices(((center_words, contexts), labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
dataset = dataset.cache().prefetch(buffer_size=tf.data.AUTOTUNE)
# PyTorch
import torch
import torch.nn as nn
# Token IDs must be integer tensors for embedding lookups; labels stay float for the loss
center_words, contexts = map(lambda X: torch.tensor(X, dtype=torch.long), (center_words, contexts))
labels = torch.tensor(labels, dtype=torch.float32)
dataset = torch.utils.data.TensorDataset(center_words, contexts, labels)
data_loader = torch.utils.data.DataLoader(dataset=dataset, batch_size=BATCH_SIZE, shuffle=True)
Finally, we can define the model in either TensorFlow or PyTorch and train it. The following is the TensorFlow approach, slightly modified from the TensorFlow official implementation to include weight initializers and L1 regularization to stabilize the training.
from tensorflow.keras import layers

class Word2Vec(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim):
        super(Word2Vec, self).__init__()
        self.word_embedding = layers.Embedding(vocab_size,
                                               embedding_dim,
                                               embeddings_initializer="he_normal",
                                               embeddings_regularizer="l1",
                                               name="w2v_embedding")
        self.context_embedding = layers.Embedding(vocab_size,
                                                  embedding_dim,
                                                  embeddings_initializer="he_normal",
                                                  embeddings_regularizer="l1")

    def call(self, pair):
        word, context = pair
        # word: (batch, dummy?)  # The dummy axis doesn't exist in TF2.7+
        # context: (batch, context)
        if len(word.shape) == 2:
            word = tf.squeeze(word, axis=1)
        # word: (batch,)
        word_emb = self.word_embedding(word)
        # word_emb: (batch, embed)
        context_emb = self.context_embedding(context)
        # context_emb: (batch, context, embed)
        dots = tf.einsum('be,bce->bc', word_emb, context_emb)
        # dots: (batch, context)
        return dots

def custom_loss(y_true, y_pred):
    # Keras passes (y_true, y_pred); the predictions here are raw logits
    return tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.cast(y_true, tf.float32), logits=y_pred)

embedding_dim = 1024
vocab_size = len(tokens)

word2vec = Word2Vec(vocab_size, embedding_dim)
word2vec.compile(optimizer='adam',
                 loss=custom_loss,
                 metrics=['accuracy'])
word2vec.fit(dataset, epochs=20)

## Accessing a word embedding
# word2vec.word_embedding(center_words[0])
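Once training is done, you can also compare embeddings directly. As a rough, optional check (assuming the model above has finished training, and picking two arbitrary subword tokens purely as an example), a cosine similarity between learned vectors could be computed like this:
# Learned embedding matrix of shape (vocab_size, embedding_dim)
embedding_matrix = word2vec.word_embedding.get_weights()[0]

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Pick two subword token IDs to compare (falling back to IDs 0 and 1 if absent)
id_a = tokens.get('the</w>', 0)
id_b = tokens.get('a</w>', 1)
print(cosine_similarity(embedding_matrix[id_a], embedding_matrix[id_b]))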
I encourage you to apply the same technique in PyTorch to gain hands-on experience in constructing a Word2Vec model to obtain word embeddings.
Conclusion
Converting text into numerical representations through preprocessing, tokenization, and word embeddings is key to enabling machine learning models to process language. Techniques like Byte Pair Encoding and models like Word2Vec help retain the structure while capturing the meaning of words. However, Word2Vec is not the only approach to generating word embeddings. Hence, in the next article, we'll explore an alternative approach to creating word embeddings.