This week I will walk through the process of training a model using low-rank adaptation (LoRA). For this, we will use the GPT-2 model we’ve already used in previous posts. We cannot use the GPT-4.1 nano model, since it does not expose adjustable weights.

LoRA is a parameter-efficient fine-tuning (PEFT) technique that introduces small trainable low-rank matrices into layers of a pretrained model while keeping the original weights frozen, which retains the pretrained knowledge [1]. The goal is to change the model’s behavior without retraining all of its parameters. It works by training a LoRA adapter (the new low-rank weights) on the information we want to add and then merging it into the base model before inference. Essentially, we inject information into a large language model (LLM) by attaching low-rank update matrices to selected weight matrices inside the transformer layers; these adapter weights are a tiny fraction of the full model’s weights. This is ideal when you want to make a model domain specific instead of expending resources retraining and adjusting the billions of parameters found in current models.

As always, the first step is to define our parameters in a .yaml file and use SimpleNamespace to store them. In this case, the parameters are the following:

{
  "model_name": "gpt2",
  "seed": 42,
  "max_token_length": 128,
  "lr": 0.0001,
  "warmup_ratio": 0.0,
  "epochs": 500,
  "train_batch_size": 1,
  "strategy": "steps",
  "logging_steps": 5,
  "res_path": "../results/"
}

Most of these parameters should be familiar from previous posts, so I will not go over them again. Note that I am logging the training loss every $5$ steps.
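
For completeness, here is a minimal sketch of the configuration-loading step. In the actual project the parameters live in a file on disk; the snippet inlines them (using the stdlib json module) so it is self-contained, and a YAML file read with yaml.safe_load would work the same way:

```python
import json
from types import SimpleNamespace

# In the actual project this block lives in a config file on disk;
# it is inlined here so the sketch runs on its own.
raw = """
{
  "model_name": "gpt2",
  "seed": 42,
  "max_token_length": 128,
  "lr": 0.0001,
  "warmup_ratio": 0.0,
  "epochs": 500,
  "train_batch_size": 1,
  "strategy": "steps",
  "logging_steps": 5,
  "res_path": "../results/"
}
"""

# SimpleNamespace turns the parsed dict into attribute-style access.
cfg = SimpleNamespace(**json.loads(raw))
print(cfg.model_name, cfg.lr, cfg.epochs)  # gpt2 0.0001 500
```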

Let’s first look at how we build the dataset that will be used to train the LoRA adapter. There are many different ways a dataset can be built. In this case, I just wrapped a list of Q&As in a function and converted it into a Dataset using the datasets library provided by Hugging Face. Here is the list:

texts = [
    "Q: How do transformers work?\nA: Transformers process sequences using self-attention.",
    "Q: What does physics explain?\nA: Physics explains natural phenomena using mathematical models.",
    "Q: What do machine learning models do?\nA: Machine learning models learn patterns from data.",
    "Q: How do transformers work?\nA: Transformers process sequences using self-attention mechanisms.",
    "Q: What is self-attention?\nA: Self-attention allows tokens in a sequence to attend to each other.",
]

Notice the format of the data. I purposely wrote Q: to label the question, included a line break \n, and A: to label the answer. This way the model can tell where the question ends and the answer begins. While this may not be necessary for more recent models, since I am using GPT-2 I wanted to be sure the data was properly labeled before passing it to the model. With the data ready, let’s move on to creating the LoRA model.
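
As a rough sketch of how such a list could be assembled from raw question–answer pairs (format_qa and get_dataset are hypothetical helper names, not necessarily those used in the repo):

```python
# Hypothetical helpers that build the labeled Q/A strings shown above.
def format_qa(question, answer):
    """Label the question with 'Q:' and the answer with 'A:' so the
    model learns where one ends and the other begins."""
    return f"Q: {question}\nA: {answer}"

def get_dataset():
    """Wrap the raw pairs and return the formatted training strings."""
    pairs = [
        ("How do transformers work?",
         "Transformers process sequences using self-attention."),
        ("What is self-attention?",
         "Self-attention allows tokens in a sequence to attend to each other."),
    ]
    return [format_qa(q, a) for q, a in pairs]

texts = get_dataset()
print(texts[0])
# Q: How do transformers work?
# A: Transformers process sequences using self-attention.
```

From here, the list can be converted into a Hugging Face Dataset with datasets.Dataset.from_dict({"text": texts}).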

For this task we will use the PEFT library from the Hugging Face ecosystem. While we will be focusing exclusively on LoRA, this library contains many different methods for adapting models. The function we will use is:

def create_lora_model(model):
    config = LoraConfig(
        r=8,
        lora_alpha=16,
        target_modules=["c_attn"],
        lora_dropout=0.1,
        bias="none",
        task_type="CAUSAL_LM",
        fan_in_fan_out=True,
    )
    return get_peft_model(model, config)

This function takes a model and applies a LoRA configuration. Let’s look at the parameters:

  1. r
    • The LoRA rank, i.e., the inner dimension of the low-rank matrices. A higher rank means more parameters to train, while a lower rank means fewer.
  2. lora_alpha
    • The scaling factor. This parameter controls how much influence the LoRA adapter contributes relative to the original (frozen) weights; the update is scaled by lora_alpha / r.
  3. target_modules
    • Designates which modules get augmented by the LoRA adapter. In this case, c_attn refers to the combined attention projection in GPT-2, whose job is to map the hidden states to the query, key, and value vectors used for self-attention.
  4. lora_dropout
    • The dropout probability applied to the LoRA layers during training, a standard regularization technique.
  5. bias
    • Controls which bias terms are trained. Setting it to "none" keeps all biases frozen.
  6. task_type
    • Tells PEFT what kind of model you are passing in; "CAUSAL_LM" is used for autoregressive language models like GPT-2.
  7. fan_in_fan_out
    • Dictates how the model weights are stored. GPT-2 uses Conv1D layers whose weight matrices are transposed relative to nn.Linear, so this should be set to True.
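
To make the mechanics concrete, here is a framework-free sketch of what a LoRA update does to a single weight matrix: the frozen weight W stays untouched, while the trainable low-rank pair (B, A) contributes a correction scaled by lora_alpha / r. The dimensions here are toy-sized, not GPT-2’s:

```python
# Toy LoRA update on one linear layer, in pure Python for clarity.
# Dimensions are illustrative; GPT-2's c_attn is 768 x 2304.
r, alpha = 2, 4
d_in, d_out = 3, 3
scale = alpha / r  # LoRA scaling factor: lora_alpha / r

def matmul(X, Y):
    """Naive matrix multiply for small lists-of-lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

# Frozen pretrained weight (identity here, for a readable result).
W = [[1.0 if i == j else 0.0 for j in range(d_out)] for i in range(d_in)]

# Trainable low-rank factors: B is d_in x r, A is r x d_out,
# so together they hold far fewer entries than W itself.
B = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Merged weight used at inference: W' = W + (lora_alpha / r) * B A.
delta = matmul(B, A)
W_merged = [[W[i][j] + scale * delta[i][j] for j in range(d_out)]
            for i in range(d_in)]
print(W_merged)
```

Training only ever updates B and A; merging them back into W before inference is what makes the adapter free at serving time.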

From here, the LoRA model setup is complete. The class that contains all the GPT-2 functionality was covered in great detail in this post. While I’ve made some small modifications since then, I will only show where I create the LoRA model (see the file model_local.py for more information). The LoRA model is built in the class constructor and looks like the following:

train_model = AutoModelForCausalLM.from_pretrained(model_name)
self.lora_model = create_lora_model(train_model)

That’s all that’s needed for setting up a LoRA model. Now, all we have to do is evaluate the model, which is done in the Jupyter Notebook (and in this previous post). Let’s look at the parameters of the base and LoRA model. The base model:

model = LocalLM(cfg.model_name)

total_params = sum(p.numel() for p in model.model.parameters())
trainable_params = sum(p.numel() for p in model.model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")

The output is:

Total parameters: 124,439,808
Trainable parameters: 124,439,808

So the base model has over one hundred million parameters and they are all trainable. The LoRA model:

model.lora_model.print_trainable_parameters()

The output is:

trainable params: 294,912 || all params: 124,734,720 || trainable%: 0.2364

The trainable parameters in the LoRA model amount to only $0.24\%$ of the trainable parameters in the base model. This is a huge computational saving and makes fine-tuning much more lightweight overall.
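
That 294,912 figure can be reproduced by hand: GPT-2 small has 12 transformer blocks, and each c_attn maps the 768-dimensional hidden state to 2304 dimensions (the stacked query, key, and value projections). With rank r = 8, each adapted layer adds r * (d_in + d_out) parameters:

```python
# Back-of-the-envelope check of the trainable-parameter count above.
n_layers = 12            # GPT-2 small has 12 transformer blocks
d_in, d_out = 768, 2304  # c_attn: hidden size -> stacked Q, K, V
r = 8                    # LoRA rank from LoraConfig

# Each adapted layer adds a pair of low-rank matrices holding
# r * (d_in + d_out) entries in total.
per_layer = r * (d_in + d_out)
total = n_layers * per_layer
print(f"{total:,}")  # 294,912 -- matches print_trainable_parameters()
```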

The evaluation set that we will use to evaluate the model is the following:

eval_prompts = [
    "Q: How do transformers work?\nA:",
    "Q: What does physics explain?\nA:",
    "Q: What do machine learning models do?\nA:",
    "Q: What is a language model?\nA:",
    "Q: What is overfitting?\nA:",
    "Q: What is a LoRA adapter?\nA:",
]

As you may note, some of these questions are exactly the same as in the training set. The purpose is to verify that the LoRA model learned correctly, since we expect it to reproduce the answers we trained it on. We also include questions that are not in the dataset to see how well the model generalizes to information it was not directly trained on. We evaluate these questions with both the base model and the LoRA model to compare their outputs:

Base model output:
Q: How do transformers work?
A: An inverted S is a function which takes three parameters, and multiplies it with the corresponding magnitude.

LoRA model output:
Q: How do transformers work?
A: Transformers process sequences using self-attention mechanisms. Transformers allow tokens in a sequence to attend each other


Base model output:
Q: What does physics explain?
A: It doesn't necessarily have to be physical. You could say the same thing about quantum mechanics, and

LoRA model output:
Q: What does physics explain?
A: Physics explains natural phenomena using mathematical models. Model results show that objects in a sequence do not exist,


Base model output:
Q: What do machine learning models do?
A: In the past, I've discussed what we're talking about here. We can model a whole range

LoRA model output:
Q: What do machine learning models do?
A: Machine-learning model predictions learn patterns from data. Model results can be self-attributing or


Base model output:
Q: What is a language model?
A: Language models are the basic tools that allow you to create systems, algorithms, and procedures for using information

LoRA model output:
Q: What is a language model?
A: Machine learning models learn patterns from data. Model results are self-attentionable.
Rational


Base model output:
Q: What is overfitting?
A: It's not a new problem. But if the players don't understand how to use it, then

LoRA model output:
Q: What is overfitting?
A: Over fitting allows an object to fit together in a fashion. Objects can be fitted without self-att


Base model output:
Q: What is a LoRA adapter?
A: It's basically the same as an XC line. The difference being that there are two different interfaces

LoRA model output:
Q: What is a LoRA adapter?
A: A self-attention mechanism allows tokens in an array to attend to each other. Tokens can be

We see that the LoRA model outputs the expected answers for the questions we trained it on, but the questions we did not include in the training dataset are not adequately answered. While this may seem worrisome, it is not surprising. GPT-2’s base capabilities are not as strong as those of recent models. Since the base model for the LoRA adapter was GPT-2, it will most likely only outperform GPT-2 on data it was directly trained on. With a more advanced base model, there are much better chances of successful extrapolation to information the LoRA adapter was not directly trained on. Either way, the LoRA model was trained successfully, since it reproduced the answers contained in the training dataset.

We have now learned how LLMs can be adapted in practice using parameter-efficient fine-tuning. We took an existing model and wrapped it with a LoRA adapter. Using this model, we fine-tuned on a small dataset over $500$ epochs and tested the outputs using an evaluation dataset. From the results, we see that the LoRA model outperformed the base model on the data it was trained on, but did not effectively extrapolate to other information. This was expected behavior since we used GPT-2 as the base model. A non-exhaustive list of failure modes includes: training for too few epochs, not including enough weights in the LoRA adapter, not having a properly constructed dataset, and cross-contamination between the base and LoRA models (when testing outputs). Overall, we gained hands-on experience with a fine-tuning strategy widely used in LLM work.

Some extensions involve using a larger base model that is trained on millions of domain-specific examples. With enough resources, one could set up a training scheme that involves hyperparameter tuning to ensure the best adapter is chosen. Additionally, multiple adapters can be trained for various tasks and used in conjunction. This enables flexibility and customization on the outputs without retraining the base model or having one large LoRA model.

The code presented in this post can be found on my GitHub under the lora-fine-tune repo [2]. Feel free to reach out if you have any questions about what we covered this week. Next time we will combine RAG with a LoRA-adapted model into one framework. Stay tuned!

  [1] Edward J. Hu, Yelong Shen, Phillip Wallis, et al., LoRA: Low-Rank Adaptation of Large Language Models, arXiv preprint arXiv:2106.09685, 2021.
  [2] Alberto J. Garcia, Parameter-Efficient Fine-Tuning with LoRA, 2026. Available at: https://github.com/AJG91/lora-fine-tune.
