This week we will discuss natural language processing (NLP) models and how they process text. NLP models are an interface that lets computers understand, interpret, and generate human language. They achieve this by representing words, phrases, and sentences as vectors (also known as word embeddings) and training on large amounts of these vectors (i.e., large amounts of text) to learn their underlying patterns and relationships. These vectors capture the different meanings that a word can have and are used to relate words to one another. Similar words, such as “cat” and “dog”, will be close to each other in the vector space, while words like “car” and “kitchen” will be further apart. In summary, these vectors tell the model how all the words in a sentence relate to each other [1].

The process of producing these vectors is important to understand. For simplicity, let’s consider the phrase “I love dogs.” A computer understands 1’s and 0’s, not text, so we must first figure out how to convert this text into numbers. The first step is to tokenize the text, which means breaking the phrase up into single units called tokens. In our example, the phrase would be broken up as follows:

“I love dogs.” -> [“I”, “love”, “dogs”, “.”]

Notice that the spaces are not preserved and the punctuation becomes its own token. Depending on the tokenizer, spaces may or may not be preserved. Each entry in the list is then converted to a number called a token ID. These IDs are in turn converted into vectors using an embedding matrix. This is essentially a predefined lookup table that tells the model which vector corresponds to each token. All models have this table built in, and it can typically be accessed with something like model.embeddings.word_embeddings.
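
To make this concrete, here is a minimal sketch of the tokenize → token ID → embedding lookup chain using the Hugging Face transformers library. The distilbert-base-uncased checkpoint is just an example choice, and the exact tokens and IDs you see will depend on the tokenizer.

```python
# A minimal sketch of tokenization and the embedding lookup, using the
# Hugging Face "transformers" library and the "distilbert-base-uncased"
# checkpoint as an example.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

# 1) Break the phrase into tokens (the exact pieces depend on the tokenizer).
tokens = tokenizer.tokenize("I love dogs.")
print(tokens)

# 2) Map each token to its integer token ID.
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)

# 3) Look up each ID in the embedding matrix to get one vector per token.
embedding_matrix = model.embeddings.word_embeddings  # the lookup table
vectors = embedding_matrix(torch.tensor([token_ids]))
print(vectors.shape)  # (1, number_of_tokens, embedding_dimension)
```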

One may think, “I get that there is a table where one can see which vector corresponds to each token, but doesn’t context matter?” Yes! Context is important when building one of these models. A word at the beginning of a phrase may have a completely different meaning when it appears towards the end of the phrase. Multiple mechanisms come into play. First, there is the concept of attention, which means that each word gets to see all the other words in the sentence so that long-range relationships can be captured. The advantage is that the full context of a sentence can be understood, and the process can be parallelized so that all tokens are processed simultaneously. Second, positional embeddings come into play. Once the vector from the lookup table is found, positional information is added to it. This tells the model where in the phrase that specific token appears, so the same word at position 0 in one phrase and at position 5 in another phrase will get slightly different vectors. It is these position-aware vectors that the rest of the model (including the attention layers) then works with. An example of why this matters is the following sentence:

“The dog chases the cat.”

If positional embeddings were not implemented, the model would not know the positions of the words, since the tokens are processed in parallel. The following sentence would then look identical to the model:

“The cat chases the dog.”

The meaning of the sentence would be lost.
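
As a rough sketch of why this matters, the snippet below (again assuming the distilbert-base-uncased checkpoint, which uses learned positional embeddings) shows that both sentences contain exactly the same tokens, and that it is the added position vectors that distinguish them.

```python
# Sketch: why position information matters. Both sentences contain exactly
# the same tokens, so without positional embeddings their token vectors
# would form an identical (unordered) set.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

a = tokenizer("The dog chases the cat.", return_tensors="pt")
b = tokenizer("The cat chases the dog.", return_tensors="pt")

# Same multiset of token IDs, just in a different order.
print(sorted(a["input_ids"][0].tolist()) == sorted(b["input_ids"][0].tolist()))  # expected: True

emb = model.embeddings
word_vecs = emb.word_embeddings(a["input_ids"])              # lookup-table vectors
positions = torch.arange(word_vecs.shape[1]).unsqueeze(0)    # 0, 1, 2, ...
pos_vecs = emb.position_embeddings(positions)                # "where am I?" vectors

# The model works with word + position, so "dog" early in one sentence and
# "dog" late in the other end up with slightly different vectors.
position_aware = word_vecs + pos_vecs
print(position_aware.shape)
```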

As a note, one token does not always correspond to one word. Depending on the tokenizer’s vocabulary, a word like “running” might be split into smaller pieces such as “run” and “ning”, hence becoming two tokens.
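
A quick way to see this is to print what a tokenizer does to a few words; the exact splits depend entirely on the tokenizer’s vocabulary, so no particular output is guaranteed here.

```python
# Sketch: one word is not always one token. Rare or long words tend to be
# split into subword pieces; just print what a given tokenizer produces.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
for word in ["running", "dogs", "tokenization", "antidisestablishmentarianism"]:
    print(word, "->", tokenizer.tokenize(word))
```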

Once trained, the tasks that can be performed by NLP models include (but are not limited to) the following; a quick code sketch of how to try them comes after the list:

  1. Translations
  2. Summarizing information
  3. Identifying sentiment
  4. Generating text (e.g. telling a story)
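
One convenient (though by no means the only) way to try these tasks is the transformers pipeline API, which downloads a default pre-trained model for each task on first use; the inputs below are only placeholders.

```python
# Sketch: trying a few of these tasks with the Hugging Face "pipeline" API.
# Each pipeline downloads a default pre-trained model the first time it runs.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")
print(sentiment("I love dogs."))

translator = pipeline("translation_en_to_fr")
print(translator("I love dogs."))

summarizer = pipeline("summarization")
article = ("NLP models represent words as vectors and are trained on large "
           "amounts of text. Once trained, they can translate, summarize, "
           "classify sentiment, and generate new text.")
print(summarizer(article, max_length=25, min_length=5))

generator = pipeline("text-generation", model="gpt2")
print(generator("Once upon a time", max_new_tokens=20))
```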

Some examples of popular NLP models that you may have heard of are the Large Language Models (LLMs) behind ChatGPT from OpenAI and Gemini from Google. These models are built on the transformer architecture, which comes in two main flavors: encoders and decoders. Encoders are used for understanding text, which is useful for tasks such as identifying sentiment, classification, and recognition. On the other hand, decoders are used for generating text, such as continuing a story or writing a summary. Fundamentally, the difference between the two is that encoders are bidirectional, which means that token $T_n$ is influenced by tokens on both sides (e.g., $T_{n-1}$ and $T_{n+1}$), while decoders are autoregressive, which means that token $T_n$ is influenced solely by the tokens $T_m$ with $m < n$. They can also work hand-in-hand for tasks such as translation and summarization, because the model first has to understand the full input text and then generate the new text.
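
A toy way to picture this difference is with attention masks. The sketch below illustrates the bidirectional versus autoregressive pattern; it is an illustration, not the internals of any particular model.

```python
# Sketch: encoder-style (bidirectional) vs decoder-style (autoregressive)
# attention, shown as attention masks for a 5-token sentence. A 1 means
# "the token in this row may look at the token in this column".
import torch

n = 5  # number of tokens

# Encoder: every token can attend to every other token (past and future).
encoder_mask = torch.ones(n, n)

# Decoder: token T_n can only attend to tokens T_m with m <= n (its past).
decoder_mask = torch.tril(torch.ones(n, n))

print(encoder_mask)
print(decoder_mask)
```

In practice such masks are applied inside the attention computation, but the triangular shape is exactly what makes a decoder generate text one token at a time.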

The process of working with NLP models primarily revolves around fine-tuning instead of training from scratch. The reason is that these models are pre-trained on huge amounts of general language data so that they learn how to produce language. This type of training is computationally expensive and is generally not achievable on a personal laptop. Instead, we can take a model that has already been trained on a huge amount of data and tweak it to fit our problem. This is the fine-tuning process. Fine-tuning takes a labeled dataset and a loss function, and uses them to nudge the weights of the model towards the problem you are trying to solve. Another question that may arise is: “why would we tweak the model weights instead of just fine-tuning the classifier?” Tweaking the model weights allows the pre-trained model to adapt to the new domain for better accuracy. If one were to fine-tune only the classifier, the underlying model may not produce accurate results for the specific task. In other words, you are orienting the whole model towards what you need it to do.
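The sketch below illustrates the two options for a DistilBERT classification model: freezing the pre-trained body so that only the classifier head trains, versus leaving everything trainable. The FREEZE_BODY flag is a made-up switch for illustration, not part of the library.

```python
# Sketch: "classifier-only" vs full fine-tuning, using a DistilBERT
# classification model as an example. Freezing the pre-trained body means
# only the small classifier layers get updated; leaving everything trainable
# lets the whole model adapt to the new domain.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

FREEZE_BODY = False  # set True to train only the classifier head

if FREEZE_BODY:
    for param in model.distilbert.parameters():  # the pre-trained encoder body
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```

Printing the trainable parameter count with and without freezing makes the trade-off concrete: far fewer weights to update, but also far less capacity to adapt to the new domain.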

For this set of posts, we will be working with two different models: an encoder and a decoder. For the encoder, we will be performing sentiment classification using DistilBERT (Distilled BERT), which is available on the Hugging Face website [2-3]. This pre-trained encoder is derived from a bigger model known as BERT (Bidirectional Encoder Representations from Transformers). Compared to BERT, DistilBERT is about 40% smaller and 60% faster, while retaining about 97% of the performance; it is essentially a compressed version of BERT. Overall, it is a great choice for a small project since it is lightweight, fast, and accurate. Since this model is already pre-trained on large datasets, our job is to fine-tune it to the specific problem we are considering.
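
As a preview of the coding post, a bare-bones version of that fine-tuning setup might look like the following; the two-example dataset and the hyperparameters are placeholders, not values we will actually use.

```python
# Minimal sketch of fine-tuning DistilBERT for sentiment classification.
# The tiny in-line dataset and hyperparameters are placeholders only.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

texts = ["I love dogs.", "This movie was terrible."]
labels = [1, 0]  # 1 = positive, 0 = negative
enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

class ToyDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(labels)
    def __getitem__(self, i):
        item = {k: v[i] for k, v in enc.items()}
        item["labels"] = torch.tensor(labels[i])
        return item

args = TrainingArguments(output_dir="distilbert-sentiment",
                         num_train_epochs=1,
                         per_device_train_batch_size=2)
trainer = Trainer(model=model, args=args, train_dataset=ToyDataset())
trainer.train()
```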

For the decoder, we will be performing text generation using GPT-2 (Generative Pre-trained Transformer 2). This pre-trained decoder is a transformer-based neural network that has been trained on a massive amount of text found on the internet. Its purpose is to generate the next word in a sentence in a coherent and contextual way. This model was developed by OpenAI in 2019, and a smaller version (~124M parameters) is available on the Hugging Face website [4]. Like the encoder, this model is already pre-trained on large datasets, so our job is to fine-tune it to the specific problem we are considering.
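
A minimal generation sketch, assuming the small gpt2 checkpoint from Hugging Face, might look like this; the sampling settings are illustrative rather than tuned.

```python
# Sketch: loading the small GPT-2 checkpoint and letting it continue a prompt.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The dog chased the cat because", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True,
                         top_k=50, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```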

We have now gone over what natural language models are and how they process text. We have also discussed encoders versus decoders and where each is useful. We finished off by introducing the models we will use and the concept of fine-tuning. Overall, NLP models are foundational to LLMs since they supply the methods and frameworks that make human-computer language interaction possible.

Feel free to reach out if you have any questions about what we covered this week. Next, we will go through a concrete coding example showing how to fine-tune these models. Stay tuned!

  1. [1] Vinija Jain & Aman Chadha, Word Vectors, Distilled AI, 2021.
  2. [2] Victor Sanh, Lysandre Debut, Julien Chaumond, & Thomas Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, arXiv preprint arXiv:1910.01108, 2019.
  3. [3] Thomas Wolf, Lysandre Debut, Victor Sanh, et al., HuggingFace’s Transformers: State-of-the-art Natural Language Processing, arXiv preprint arXiv:1910.03771, 2019.
  4. [4] Alec Radford, Jeffrey Wu, Rewon Child, et al., Language models are unsupervised multitask learners, OpenAI blog, 2019.

