The Rich Don't Fine-tune Like You and Me: Intro to LoRA and QLoRA

Rich people fine-tuning models
People with Hundreds of Millions of Dollars in Cloud Computing Credits to Burn

Your choice in methods of fine-tuning Large Language Models is determined by your answer to the question:

"Do you have access to hundreds of millions of dollars in capital?"

If your answer is no, you're probably looking at single or multi-device LoRA or qLoRA - two methods of "steering" a foundational model toward different outputs without incurring the massive resource and time needs of full fine-tuning.

Transfer learning enables models trained on large datasets to be adapted for specific tasks or domains through fine-tuning, a crucial aspect of modern AI development.

Transfer learning
Transfer learning is the process of taking a model trained on one task and using it for another task.

However, the resource requirements for traditional fine-tuning methods starkly divide the AI community, separating those with access to enormous computational resources from those without.

AI Resource Gap
The AI Resource Gap: People with Hundreds of Millions of Dollars in Cloud Computing Credits vs. People with $0 in Cloud Computing Credits and Old Gaming Rigs Lying Around

Fine-tuning is crucial because it allows us to take a general-purpose model and specialize it for particular applications, improving its performance on specific tasks without the need to train a new model from scratch.

For example, fine-tuning can be used to adapt a language model to generate more accurate customer support responses for a specific company or improve a medical image classifier's ability to detect certain diseases from X-rays.

For those without access to vast financial resources, techniques like LoRA (Low-Rank Adaptation) and qLoRA (Quantized LoRA) have emerged as powerful alternatives.

These methods allow for efficient "steering" of foundational models towards desired outputs, bypassing the need for extensive computational resources and time typically associated with full fine-tuning.

The Resource Divide in AI Model Adaptation

Full Fine-tuning: The Resource-Intensive Approach

Full fine-tuning is the process of updating all the weights of a pre-trained model's parameters to adapt it to a new task or domain. This method can potentially yield the best results, as it allows the model to adjust its entire knowledge base to the new data.

Fine-tuning on an Android
Full Fine-Tuning: The Resource-Intensive Approach

However, the resource requirements for this approach are staggering and often prohibitive for all but the largest tech companies and research institutions.

To put this into perspective, consider the example of OpenAI's ChatGPT 4. Sam Altman, CEO of OpenAI, has publicly stated that the company spent around $100 million on cloud resources to train this model.

This level of investment is simply not feasible for the vast majority of developers, researchers, or even medium-sized companies.

Hacker Joy
A machine learning engineer burning through $100,000 of cloud computing credits on a single fine-tuning run.

PEFT: Democratizing Model Adaptation

Given the prohibitive resource requirements of full fine-tuning, researchers have been actively seeking more efficient alternatives. This search has led to the development of Parameter-Efficient Fine-Tuning (PEFT) techniques.

PEFT methods aim to achieve comparable performance to full fine-tuning while updating only a small subset of the model's parameters.

PEFT
Parameter Efficient Fine-Tuning: Democratizing Model Adaptation

Several PEFT techniques have emerged in recent years, each with its own approach to efficient model adaptation:

  1. Prefix Tuning: This method prepends a trainable prefix to the input of each layer in the model. By optimizing only these prefixes, the model can be adapted to new tasks with minimal parameter updates.
  2. Prompt Tuning: Similar to prefix tuning, this technique focuses on optimizing a small set of continuous task-specific vectors (soft prompts) that are prepended to the input.
  3. Adapter Layers: This approach involves inserting small trainable modules (adapters) between the layers of a pre-trained model. Only these adapter layers are updated during fine-tuning, leaving the original model parameters unchanged.

Among these PEFT techniques, one method has gained particular prominence due to its effectiveness and efficiency: Low-Rank Adaptation, or LoRA.

What is LoRA?

Low-Rank Adaptation (LoRA) is a technique that efficiently fine-tunes large language models without modifying all their parameters.

Developed by Microsoft, LoRA has gained traction for delivering similar performance to full fine-tuning while significantly reducing memory and computational costs. LoRA works by adding small, trainable rank decomposition matrices to the model's existing weights, allowing it to adapt to new tasks or domains without altering most of its original parameters.

For a practical example, consider the image generation platform civit.ai, where users can create and share LoRA adapters.

Civit.ai, a website for sharing LoRA adapters and other models
Civit.ai: A platform for creating and sharing LoRA adapters

Suppose you want to generate images of a pumpkin patch for Halloween. A base model might produce fall-themed images—pumpkins, hot cider, and hay bales.

Pumpkin Patch
Your Standard run of the mill pumpkin patch

However, by applying a Halloween-themed LoRA adapter, you can add spooky elements without retraining the entire model.

Pumpkin Patch
A LoRA adapter that will "steer" a base model toward a Halloween theme

This method efficiently alters the output of the base model, without retraining, and you get a spooky pumpkin patch.

Pumpkin Patch
LoRA adapter result: A pumpkin patch with a Halloween theme

The same technique can be applied to text generation models to adapt them to specific writing styles or topics.

How LoRA Works

LoRA uses low-rank matrix decomposition to update a model's behavior efficiently. Instead of modifying the dense weight matrices of the pre-trained model, LoRA introduces two smaller matrices—down-projection and up-projection—alongside the original weights.

The down-projection reduces the input to a lower-dimensional space, and the up-projection maps it back. These matrices form a low-rank update, and only they are trained, leaving the original parameters unchanged.

Inference time means using a trained model, or model plus LoRA adapter, to generate a "prediction" - or output, given some "input" (usually a user prompt).

At inference, the LoRA updates merge with the original weights, resulting in no extra computational overhead. Here's an example in Python using Hugging Face's Transformers library where I'm loading a LoRA adapter I trained and merging its weights with the base model at inference time:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from peft import PeftModel
import torch

# Specify the base model and the LoRA adapter - which are both available on Hugging Face's model hub
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
new_model = "zackproser/Meta-Llama-3.1-8B-instruct-zp-writing-ft-qlora"

# Load the base model and the LoRA adapter
base_model = AutoModelForCausalLM.from_pretrained(model_name)

model = PeftModel.from_pretrained(base_model, new_model)
# Merge the LoRA adapter with the base model
model = model.merge_and_unload()

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Create a pipeline for text generation
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=2000, device=0)

 # Inference: Generate a response, using the base model merged with the LoRA adapter's weights
prompt = "Write me an article about getting faster as a developer"
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result)

Advantages of LoRA

  1. Memory Efficiency: LoRA reduces memory use by updating fewer parameters, allowing larger models to run on consumer hardware.
  2. Faster Training: With fewer parameters to update, training is faster, enabling quicker iterations.
  3. Reduced Computational Requirements: LoRA lowers the computational burden, making model adaptation accessible to more developers.
  4. Flexibility: LoRA adapters can be easily swapped or combined, enabling modular experimentation without retraining.
  5. Preservation of Original Model: Since LoRA doesn’t modify original weights, you can revert to the base model or switch adaptations seamlessly.

LoRA vs. Full Fine-tuning

While LoRA offers substantial efficiency gains, it’s useful to compare its performance and resource requirements to full fine-tuning:

AspectFull Fine-tuningLoRA
Parameter UpdatesAll model parametersOnly LoRA matrices (typically < 1% of model size)
Memory RequirementsVery High (Full model + gradients)Low (Only LoRA parameters + gradients)
Training TimeLong (Days to weeks for large models)Short (Hours to days)
PerformancePotentially highestComparable to full fine-tuning in many cases
FlexibilityLimited (entire model is task-specific)High (adapters can be swapped/combined)
Hardware RequirementsHigh-end GPUs, often multipleCan run on consumer-grade hardware

While full fine-tuning can potentially achieve the best performance, the trade-off in terms of resources and time is substantial. LoRA provides a compelling alternative that can achieve comparable results in many cases, with only a fraction of the resources.

Practical Example: Using Torchtune for LoRA and qLoRA

Torchtune is a native PyTorch library that helps you implement LoRA (Low-Rank Adaptation) and qLoRA (Quantized Low-Rank Adaptation) for efficient model fine-tuning.

PyTorch
PyTorch's Torchtune: The Library I Used to Implement LoRA

Accessing and Customizing Recipes

Torchtune provides pre-configured recipes for various fine-tuning scenarios. Users can view available recipes with the tune ls command:

tune ls
RECIPE CONFIG
full_finetune_single_device llama2/7B_full_low_memory mistral/7B_full_low_memory
full_finetune_distributed llama2/7B_full llama2/13B_full mistral/7B_full
lora_finetune_single_device llama2/7B_lora_single_device llama2/7B_qlora_single_device mistral/7B_lora_single_device

To use a recipe, copy it to your local directory with tune cp:

tune cp llama3_1/8B_qlora_single_device my_conf

This creates a YAML configuration file you can customize.

Configuration File Structure

The resulting YAML file contains detailed settings for the fine-tuning process. Key sections include:

  1. Model Arguments: Specifies the base model and LoRA parameters.
  2. Tokenizer: Defines the tokenizer path and settings.
  3. Checkpointer: Manages model checkpoints.
  4. Dataset and Sampler: Configures the training data.
  5. Optimizer and Scheduler: Sets learning rate and optimization parameters.
  6. Training: Defines epochs, batch size, and other training specifics.
  7. Logging: Configures metrics logging, often using Weights & Biases.

LoRA-specific Configuration

The configuration file includes LoRA-specific parameters:

lora_attn_modules: ['q_proj', 'v_proj', 'k_proj', 'output_proj']
apply_lora_to_mlp: True
apply_lora_to_output: False
lora_rank: 8
lora_alpha: 16

These settings determine which model components are adapted and the extent of the adaptation.

qLoRA Implementation

For qLoRA, Torchtune uses quantization to reduce memory usage while maintaining performance. The configuration specifies:

*component*: torchtune.models.llama3_1.qlora_llama3_1_8b

This component implements quantization alongside LoRA, enabling efficient fine-tuning on consumer-grade hardware.

Practical Application

To initiate fine-tuning with the configured recipe, users run:

tune run lora_finetune_single_device --config llama3_1/8B_qlora_single_device

This command applies the qLoRA technique to the specified Llama 3.1 8B model, using the parameters defined in the configuration file.

Torchtune's LoRA and qLoRA implementations enable efficient model adaptation to specific tasks or domains, minimizing the computational costs of full fine-tuning—ideal for resource-constrained projects and rapid iteration.

Further Reading

If you're interested in learning more about LoRA and related techniques, here are some valuable resources:

Thanks for reading!

Thanks for reading! If you have any questions, feel free to reach out to me.