Fine-tuning Techniques and Parameter-Efficient Methods
Overview
In our previous lessons, we've explored how to train language models from scratch and how to monitor training and engineer datasets. However, training models from scratch is resource-intensive and often unnecessary. Fine-tuning existing pre-trained models is a more efficient approach for most applications.
This lesson focuses on fine-tuning techniques for large language models, with special emphasis on parameter-efficient methods. As models grow to billions of parameters, traditional fine-tuning becomes prohibitively expensive. We'll explore how methods like LoRA, QLoRA, and other PEFT (Parameter-Efficient Fine-Tuning) approaches make it possible to adapt these massive models with limited computational resources.
Learning Objectives
After completing this lesson, you will be able to:
- Understand the differences between pre-training and fine-tuning
- Implement full fine-tuning for smaller models
- Apply parameter-efficient fine-tuning techniques like LoRA and adapters
- Select appropriate fine-tuning strategies based on available resources
- Diagnose and fix common fine-tuning issues
- Evaluate fine-tuned models effectively
From Pre-training to Fine-tuning
The Two-phase Learning Paradigm
Modern NLP follows a two-phase approach:
- Pre-training: Learning general language patterns from vast amounts of data
- Fine-tuning: Adapting the pre-trained model to specific tasks or domains
Analogy: Fine-tuning as Specialized Education
Think of pre-training and fine-tuning as education stages:
- Pre-training: General education that builds foundational knowledge (like K-12 and undergraduate studies)
- Fine-tuning: Specialized training for specific professions (like medical school, law school, or vocational training)
Just as a medical student builds upon general knowledge to develop specialized skills, fine-tuning builds upon a pre-trained model's general language understanding to develop task-specific capabilities.
Why Fine-tune?
Fine-tuning a pre-trained model is usually preferable to training from scratch because it:
- Reuses the general language knowledge already captured during pre-training
- Requires far less task-specific data to reach strong performance
- Converges much faster and at a fraction of the compute cost
- Often outperforms models trained from scratch on the same task data
Full Fine-tuning: The Traditional Approach
How Full Fine-tuning Works
Full fine-tuning updates all parameters of a pre-trained model on a downstream task:
- Initialize with pre-trained weights
- Add task-specific head if needed (e.g., classification layer)
- Train on task-specific data with a lower learning rate
- Update all parameters throughout the network
Implementing Full Fine-tuning
{"tool": "code-editor", "defaultValue": "from transformers import AutoModelForSequenceClassification, AutoTokenizer from transformers import Trainer, TrainingArguments from datasets import load_dataset
Load pre-trained model
model_name = "bert-base-uncased" model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2) tokenizer = AutoTokenizer.from_pretrained(model_name)
Prepare dataset (example: IMDB sentiment analysis)
dataset = load_dataset("imdb")
def tokenize_function(examples): return tokenizer(examples["text"], padding="max_length", truncation=True)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
Define training arguments
training_args = TrainingArguments( output_dir="./results", learning_rate=2e-5, per_device_train_batch_size=8, per_device_eval_batch_size=8, num_train_epochs=3, weight_decay=0.01, evaluation_strategy="epoch", save_strategy="epoch", load_best_model_at_end=True, )
Initialize Trainer
trainer = Trainer( model=model, args=training_args, train_dataset=tokenized_datasets["train"], eval_dataset=tokenized_datasets["test"], )
Fine-tune the model
trainer.train()"}
Challenges with Full Fine-tuning
As models grow larger, full fine-tuning faces significant challenges:
- Memory Requirements:
  - A 7B parameter model in FP16 requires ~14 GB just to store the weights
  - Backpropagation requires additional memory for gradients and optimizer states
  - A rule of thumb: full fine-tuning needs at least 3-4x the model's memory footprint, and substantially more with Adam-style optimizers (a rough estimate follows this list)
- Computational Cost:
  - Training cost scales roughly linearly with parameter count
  - Fine-tuning 175B-parameter models can cost thousands of dollars per run
- Catastrophic Forgetting:
  - Aggressive fine-tuning can cause the model to "forget" general capabilities
  - Finding the right balance between adaptation and retention is challenging
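To make the memory requirement concrete, here is a rough back-of-the-envelope estimate. It assumes FP16 weights, FP32 gradients, and two FP32 Adam moment tensors per parameter; this is one common accounting, not an exact figure for any specific framework, and it ignores activation memory, which also depends on batch size and sequence length.

```python
# Rough memory estimate for full fine-tuning of a 7B-parameter model.
# Assumptions (illustrative, not exact for any specific framework):
# FP16 weights, FP32 gradients, two FP32 Adam moment tensors per parameter.
params = 7e9

weights_gb = params * 2 / 1e9        # FP16 weights: 2 bytes per parameter
grads_gb = params * 4 / 1e9          # FP32 gradients: 4 bytes per parameter
optimizer_gb = params * 2 * 4 / 1e9  # Adam: two FP32 states per parameter

total_gb = weights_gb + grads_gb + optimizer_gb
print(f"Weights:   {weights_gb:.0f} GB")
print(f"Gradients: {grads_gb:.0f} GB")
print(f"Optimizer: {optimizer_gb:.0f} GB")
print(f"Total (excluding activations): {total_gb:.0f} GB")
```

Under these assumptions, a single 7B model already needs on the order of 100 GB of GPU memory for full fine-tuning, which is exactly the pressure that motivates PEFT methods.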
Parameter-Efficient Fine-tuning (PEFT)
The PEFT Revolution
Parameter-Efficient Fine-Tuning methods fine-tune only a small subset of parameters while keeping most of the pre-trained model frozen.
Analogy: PEFT as Adding Specialized Tools
Think of PEFT as adding specialized tools to a well-equipped workshop:
- The workshop (pre-trained model) already has general-purpose tools
- Instead of rebuilding the entire workshop, you add a few specialized tools (trainable parameters)
- These specialized tools enable specific tasks while leveraging the existing equipment
Core PEFT Methods
Adapter-based Methods
How Adapters Work
Adapters are small neural network modules inserted between layers of a pre-trained model:
- Freeze the pre-trained model parameters
- Insert adapter modules after certain layers (typically attention or feed-forward)
- Train only the adapter parameters
- Adapters typically use bottleneck architecture to limit parameter count
Adapter Architecture
Adapters typically use a bottleneck architecture:
- Down-project to a small dimension (e.g., 64)
- Apply non-linearity (e.g., ReLU or GELU)
- Up-project back to original dimension
- Add a residual connection
{"tool": "code-editor", "defaultValue": "import torch import torch.nn as nn
class Adapter(nn.Module): def init(self, input_dim, bottleneck_dim=64): super().init() self.down_project = nn.Linear(input_dim, bottleneck_dim) self.activation = nn.GELU() self.up_project = nn.Linear(bottleneck_dim, input_dim) self.layer_norm = nn.LayerNorm(input_dim)
1 def forward(self, x): 2 residual = x 3 x = self.down_project(x) 4 x = self.activation(x) 5 x = self.up_project(x) 6 x = x + residual # Residual connection 7 x = self.layer_norm(x) 8 return x
Example usage
input_dim = 768 # Hidden dimension for BERT-base adapter = Adapter(input_dim, bottleneck_dim=64)
Input tensor [batch_size, sequence_length, hidden_dim]
sample_input = torch.randn(2, 128, input_dim) output = adapter(sample_input) print(f"Input shape: {sample_input.shape}") print(f"Output shape: {output.shape}") print(f"Number of trainable parameters: {sum(p.numel() for p in adapter.parameters())}")"}
Implementing Adapters with Transformers
{"tool": "code-editor", "defaultValue": "from transformers import AutoModelForSequenceClassification, AutoTokenizer from transformers.adapters import AdapterConfig, PfeifferConfig from datasets import load_dataset
Load pre-trained model
model_name = "bert-base-uncased" model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2) tokenizer = AutoTokenizer.from_pretrained(model_name)
Add and activate adapters
adapter_config = PfeifferConfig(reduction_factor=16) # Creates a bottleneck model.add_adapter("imdb", config=adapter_config) model.train_adapter("imdb") # Only train the adapter parameters model.set_active_adapters("imdb")
Check trainable parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad) total_params = sum(p.numel() for p in model.parameters()) print(f"Trainable parameters: {trainable_params:,} ({trainable_params/total_params:.2%} of total)")
The rest of the fine-tuning process is the same as full fine-tuning
You would use the same Trainer setup as in the full fine-tuning example"}
Low-Rank Adaptation (LoRA)
The LoRA Principle
LoRA is based on a key insight: the updates to pre-trained weights during fine-tuning often have a low "intrinsic rank".
Analogy: LoRA as Efficient Communication
Think of LoRA like compressing a high-resolution image:
- Instead of sending the full image (all parameter updates), you send a compressed version
- The compression works by capturing the most important patterns
- You can reconstruct a close approximation to the original image with much less data
How LoRA Works
- Freeze the pre-trained model weights
- For selected weight matrices, learn low-rank update matrices
- The original operation Y = WX becomes Y = WX + ΔWX, where:
  - W is the frozen pre-trained weight matrix
  - ΔW = BA is the low-rank update (rank r)
  - B is a matrix of shape [original_dim, r]
  - A is a matrix of shape [r, original_dim]
Implementing LoRA
{"tool": "code-editor", "defaultValue": "import torch import torch.nn as nn
class LoRALayer(nn.Module): def init(self, in_features, out_features, rank=8, alpha=32): super().init() self.rank = rank self.alpha = alpha self.scaling = alpha / rank
1 # Initialize A with zeros (or small random values) 2 self.A = nn.Parameter(torch.zeros(in_features, rank)) 3 nn.init.kaiming_uniform_(self.A, a=math.sqrt(5)) 4 5 # Initialize B with zeros 6 self.B = nn.Parameter(torch.zeros(rank, out_features)) 7 8 def forward(self, x, orig_weights): 9 # Original operation: x @ orig_weights 10 # LoRA operation: x @ orig_weights + x @ (A @ B) * scaling 11 return x @ orig_weights + (x @ self.A @ self.B) * self.scaling
Example usage with a pre-trained Linear layer
class LinearWithLoRA(nn.Module): def init(self, base_layer, rank=8, alpha=32): super().init() self.base_layer = base_layer # Freeze the original layer for param in self.base_layer.parameters(): param.requires_grad = False
1 # Add LoRA components 2 self.lora = LoRALayer( 3 base_layer.in_features, 4 base_layer.out_features, 5 rank=rank, 6 alpha=alpha 7 ) 8 9 def forward(self, x): 10 return self.lora(x, self.base_layer.weight)"}
LoRA with PEFT Library
{"tool": "code-editor", "defaultValue": "from transformers import AutoModelForCausalLM, AutoTokenizer from peft import get_peft_model, LoraConfig, TaskType from datasets import load_dataset
Load pre-trained model
model_name = "facebook/opt-1.3b" # Using a 1.3B parameter model as example model = AutoModelForCausalLM.from_pretrained(model_name) tokenizer = AutoTokenizer.from_pretrained(model_name)
Define LoRA configuration
lora_config = LoraConfig( r=16, # Rank lora_alpha=32, # Alpha parameter target_modules=["q_proj", "v_proj"], # Apply LoRA to query and value projections lora_dropout=0.05, # Dropout probability for LoRA layers bias="none", # Don't add bias terms task_type=TaskType.CAUSAL_LM # The task type )
Create PEFT model
model = get_peft_model(model, lora_config)
Print the number of trainable parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad) total_params = sum(p.numel() for p in model.parameters()) print(f"Trainable parameters: {trainable_params:,} ({trainable_params/total_params:.2%} of total)")
Now we can fine-tune this model with much fewer resources
The training process would be similar to the previous examples
but would require significantly less memory and compute"}
Quantized LoRA (QLoRA)
Combining Quantization and LoRA
QLoRA combines two powerful techniques:
- Quantization: Reduces the precision of model weights (e.g., from FP16 to 4-bit)
- LoRA: Adds trainable low-rank adapters
Why QLoRA Works
- Memory Efficiency:
  - 4-bit quantization reduces the memory footprint of the frozen weights by 4x compared to FP16
  - Only the small LoRA modules are kept in higher precision for training
- Minimal Performance Loss:
  - Quantization techniques such as NF4 and double quantization minimize precision loss and memory overhead
  - The trainable LoRA updates compensate for remaining quantization artifacts
Implementing QLoRA
{"tool": "code-editor", "defaultValue": "from transformers import AutoModelForCausalLM, AutoTokenizer from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model import torch from datasets import load_dataset
Load pre-trained model in 4-bit quantization
model_name = "meta-llama/Llama-2-7b-hf" # Example with a 7B parameter model model = AutoModelForCausalLM.from_pretrained( model_name, load_in_4bit=True, # Load model in 4-bit precision device_map="auto", # Automatically distribute model across available GPUs quantization_config=BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16, bnb_4bit_use_double_quant=True, # Double quantization bnb_4bit_quant_type="nf4" # Normalized float 4 ) ) tokenizer = AutoTokenizer.from_pretrained(model_name)
Prepare model for k-bit training
model = prepare_model_for_kbit_training(model)
Define LoRA configuration
lora_config = LoraConfig( r=64, # Higher rank for better performance lora_alpha=16, target_modules=[ "q_proj", "k_proj", "v_proj", "o_proj", # Attention modules "gate_proj", "up_proj", "down_proj" # MLP modules ], lora_dropout=0.05, bias="none", task_type="CAUSAL_LM" )
Apply LoRA adapters
model = get_peft_model(model, lora_config)
Print trainable parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad) total_params = sum(p.numel() for p in model.parameters()) print(f"Trainable parameters: {trainable_params:,} ({trainable_params/total_params:.2%} of total)")
Fine-tune as usual, but with much lower memory requirements
This setup allows fine-tuning a 7B parameter model on a single consumer GPU"}
Other PEFT Methods
Prefix Tuning
Prefix tuning prepends trainable vectors (virtual tokens) to the input of each transformer layer (a configuration sketch follows the list below):
- Freeze the pre-trained model
- Add trainable prefix tokens to each layer
- These prefix tokens influence the model's behavior through attention
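As a minimal sketch (not from the original lesson), prefix tuning can be configured through the Hugging Face PEFT library's PrefixTuningConfig; the model choice and the number of virtual tokens below are illustrative assumptions, not recommended defaults.

```python
from transformers import AutoModelForCausalLM
from peft import PrefixTuningConfig, get_peft_model, TaskType

# Illustrative model choice; any causal LM from the Hub could be used.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

# 30 trainable prefix tokens per layer is an example value, not a tuned setting.
prefix_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=30,
)

model = get_peft_model(model, prefix_config)
model.print_trainable_parameters()  # Only the prefix parameters are trainable
```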
Prompt Tuning and P-Tuning
- Prompt Tuning: Adds trainable tokens only to the input layer (see the sketch after this list)
- P-Tuning: Uses a small neural network to generate soft prompts
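The following hedged sketch shows prompt tuning with the PEFT library's PromptTuningConfig, initializing the soft prompt from a natural-language phrase; the model, token count, and initialization text are illustrative assumptions.

```python
from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, PromptTuningInit, get_peft_model, TaskType

model_name = "facebook/opt-1.3b"  # Illustrative choice
model = AutoModelForCausalLM.from_pretrained(model_name)

# Soft prompt of 20 virtual tokens, initialized from a natural-language phrase.
prompt_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text="Classify the sentiment of this review:",
    tokenizer_name_or_path=model_name,
)

model = get_peft_model(model, prompt_config)
model.print_trainable_parameters()  # Only the soft prompt embeddings are trainable
```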
IA³ (Infused Adapter by Inhibiting and Amplifying Inner Activations)
A highly parameter-efficient method that scales activations with learned vectors (sketched below):
- Requires minimal additional parameters (often <0.1%)
- Simple element-wise multiplication operation
- Often works well for cross-lingual transfer
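The PEFT library also implements IA³ through IA3Config. In this hedged sketch, the module names assume an OPT-style decoder and are illustrative; other architectures use different projection names.

```python
from transformers import AutoModelForCausalLM
from peft import IA3Config, get_peft_model, TaskType

model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")  # Illustrative choice

# Module names assume an OPT-style decoder; adjust them for other architectures.
ia3_config = IA3Config(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["k_proj", "v_proj", "fc2"],
    feedforward_modules=["fc2"],
)

model = get_peft_model(model, ia3_config)
model.print_trainable_parameters()  # Typically well under 0.1% of total parameters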
Practical Considerations for Fine-tuning
Selecting the Right Method
Choosing among full fine-tuning, LoRA, QLoRA, and prompt-based methods comes down to model size, available hardware, and how much performance you are willing to trade for efficiency. The decision framework below summarizes these trade-offs.
Decision Framework
Use this framework to select the appropriate fine-tuning method:
- When to use Full Fine-tuning:
  - Smaller models (<1B parameters)
  - Abundant computational resources
  - Need for maximum performance
- When to use LoRA/Adapters:
  - Medium to large models (1B-13B parameters)
  - Limited but substantial resources
  - Need for a balance of performance and efficiency
- When to use QLoRA:
  - Very large models (>7B parameters)
  - Highly constrained resources
  - Consumer-grade hardware
- When to use Prefix/Prompt Tuning:
  - Extremely large models
  - Minimal resources
  - Acceptable performance trade-off
Hyperparameter Considerations
Key hyperparameters for PEFT methods:
- LoRA-specific:
  - Rank (r): higher values generally give better performance but use more parameters (see the sketch after this list)
  - Alpha (α): scaling factor, typically set to 2r
  - Target modules: which layers to apply LoRA to
- Adapter-specific:
  - Bottleneck dimension: controls adapter size
  - Adapter placement: which layers to add adapters to
- General fine-tuning:
  - Learning rate: typically lower for fine-tuning (1e-5 to 5e-5)
  - Weight decay: helps prevent overfitting (0.01 to 0.1)
  - Training epochs: often fewer for fine-tuning (2-5)
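To build intuition for how the rank drives the size of a LoRA update, here is a small back-of-the-envelope calculation. The hidden dimension, layer count, and targeted modules are illustrative values for a 7B-class model, not measurements of any particular checkpoint.

```python
# Trainable parameters added by LoRA to one weight matrix of shape [d_out, d_in]:
# params = r * (d_in + d_out), since LoRA adds A (d_in x r) and B (r x d_out).
def lora_params_per_matrix(d_in, d_out, r):
    return r * (d_in + d_out)

# Illustrative numbers: hidden size 4096, LoRA on q_proj and v_proj in 32 layers.
d = 4096
num_layers = 32
matrices_per_layer = 2  # q_proj and v_proj

for r in (4, 8, 16, 64):
    total = lora_params_per_matrix(d, d, r) * matrices_per_layer * num_layers
    print(f"rank={r:>2}: {total / 1e6:6.1f}M trainable parameters")
```

Even at rank 64 this stays in the tens of millions of trainable parameters, a small fraction of the billions in the frozen base model.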
Avoiding Catastrophic Forgetting
Strategies to preserve general capabilities:
- Use lower learning rates
- Implement early stopping (see the sketch after this list)
- Apply regularization techniques
- Balance task-specific data with general data
- Consider multi-task fine-tuning
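As a sketch of the early-stopping strategy above, the transformers Trainer accepts an EarlyStoppingCallback. The setup below assumes the same BERT/IMDB configuration as the earlier full fine-tuning example; the patience value and monitored metric are illustrative choices.

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    EarlyStoppingCallback,
)
from datasets import load_dataset

# Same setup as the full fine-tuning example above
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("imdb")
tokenized = dataset.map(
    lambda ex: tokenizer(ex["text"], padding="max_length", truncation=True), batched=True
)

# Early stopping requires periodic evaluation and keeping the best checkpoint.
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    num_train_epochs=10,  # Upper bound; training may stop earlier
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],  # Stop after 2 epochs without improvement
)
trainer.train()
```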
Advanced Topics in Fine-tuning
Domain Adaptation vs. Task Adaptation
- Domain Adaptation:
  - Adapts to a specific domain (e.g., medical, legal)
  - Preserves general capabilities
  - Often requires continued pre-training
- Task Adaptation:
  - Focuses on specific tasks (e.g., classification, summarization)
  - May specialize at the expense of generality
  - Typically uses supervised fine-tuning
Instruction Tuning
Fine-tuning models on instruction-following data:
- Input format: Typically uses a template like "Instruction: {instruction}\nInput: {input}\nOutput:"
- Dataset composition: Mix of different task types and formats
- Evaluation: Measures ability to follow diverse instructions
{"tool": "code-editor", "defaultValue": "from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments from peft import get_peft_model, LoraConfig, TaskType from datasets import load_dataset import torch
Load model and tokenizer
model_name = "facebook/opt-1.3b" model = AutoModelForCausalLM.from_pretrained(model_name) tokenizer = AutoTokenizer.from_pretrained(model_name)
Add special tokens if needed
tokenizer.pad_token = tokenizer.eos_token
Prepare instruction dataset
instruction_dataset = load_dataset("tatsu-lab/alpaca") # Example instruction dataset
Format data with instruction template
def format_instruction(example): if example["input"]: text = f"""Instruction: {example["instruction"]} Input: {example["input"]} Output: {example["output"]}""" else: text = f"""Instruction: {example["instruction"]} Output: {example["output"]}"""
1 return {"text": text}
formatted_dataset = instruction_dataset.map(format_instruction)
Tokenize dataset
def tokenize_function(examples): return tokenizer( examples["text"], truncation=True, max_length=512, padding="max_length" )
tokenized_dataset = formatted_dataset.map(tokenize_function, batched=True)
Apply LoRA for efficient fine-tuning
lora_config = LoraConfig( r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], lora_dropout=0.05, bias="none", task_type=TaskType.CAUSAL_LM )
model = get_peft_model(model, lora_config)
Set up training arguments
training_args = TrainingArguments( output_dir="./instruction-tuned-model", learning_rate=2e-5, per_device_train_batch_size=4, gradient_accumulation_steps=4, max_steps=1000, save_steps=200, logging_steps=50, )
Train the model
trainer = Trainer( model=model, args=training_args, train_dataset=tokenized_dataset["train"], data_collator=lambda data: {'input_ids': torch.stack([f['input_ids'] for f in data]), 'attention_mask': torch.stack([f['attention_mask'] for f in data]), 'labels': torch.stack([f['input_ids'] for f in data])}, )
trainer.train()"}
Multi-task Fine-tuning
Training on multiple tasks simultaneously:
- Benefits:
  - Improves generalization
  - Prevents overfitting to a single task
  - Reduces catastrophic forgetting
- Implementation:
  - Collect datasets for multiple tasks
  - Balance task representation (see the sketch after this list)
  - Add task-specific identifiers or prompts
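As a hedged sketch of these implementation steps using the Hugging Face datasets library, the datasets and task prefixes below are illustrative; any text datasets could be substituted.

```python
from datasets import load_dataset, interleave_datasets

# Two illustrative tasks; any text datasets with compatible columns would work.
sentiment = load_dataset("imdb", split="train")
summarization = load_dataset("cnn_dailymail", "3.0.0", split="train")

# Add a task-specific prefix so the model can tell the tasks apart.
sentiment = sentiment.map(
    lambda ex: {"text": f"Task: sentiment\n{ex['text']}"},
    remove_columns=[c for c in sentiment.column_names if c != "text"],
)
summarization = summarization.map(
    lambda ex: {"text": f"Task: summarize\n{ex['article']}\nSummary: {ex['highlights']}"},
    remove_columns=summarization.column_names,
)

# Balance task representation by sampling each task with a fixed probability.
mixed = interleave_datasets(
    [sentiment, summarization],
    probabilities=[0.5, 0.5],
    seed=42,
)
print(mixed[0]["text"][:200])
```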
Continual Learning and Sequential Fine-tuning
Strategies for learning new tasks without forgetting:
- Elastic Weight Consolidation (EWC):
  - Identifies important parameters for previous tasks
  - Penalizes changes to these parameters when learning new tasks (see the sketch after this list)
- Knowledge Distillation:
  - Uses the original model as a teacher
  - Prevents the new model from diverging too far
- Replay Methods:
  - Maintain a buffer of examples from previous tasks
  - Intermix these with new-task examples during training
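A minimal sketch of the EWC penalty follows. It assumes you have already estimated a diagonal Fisher importance for each parameter on the previous task and stored a copy of the old parameter values; the Fisher estimation step and the regularization strength are assumptions left to the reader.

```python
import torch
import torch.nn as nn

def ewc_penalty(model: nn.Module, old_params: dict, fisher: dict, lam: float = 0.1):
    """Quadratic penalty that discourages moving parameters that were
    important for the previous task (diagonal EWC approximation)."""
    penalty = 0.0
    for name, param in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (param - old_params[name]) ** 2).sum()
    return lam * penalty

# Usage inside a training step (task_loss comes from the new task's objective):
# loss = task_loss + ewc_penalty(model, old_params, fisher)
# loss.backward()
```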
Practical Exercises
Exercise 1: LoRA Fine-tuning
Implement LoRA fine-tuning for a sentiment classification task:
- Load a pre-trained model (e.g., BERT or RoBERTa)
- Configure LoRA adapters
- Fine-tune on a sentiment dataset (e.g., SST-2 or IMDB)
- Evaluate performance and parameter efficiency
Exercise 2: QLoRA for Large Models
Use QLoRA to fine-tune a large language model (>7B parameters) on a single GPU:
- Set up 4-bit quantization
- Configure LoRA adapters
- Fine-tune on an instruction dataset
- Compare performance before and after fine-tuning
Exercise 3: Method Comparison
Compare different PEFT methods on the same task:
- Implement Full Fine-tuning, LoRA, Adapters, and Prefix Tuning
- Train each method with the same dataset and hyperparameters
- Analyze performance, memory usage, and training time
- Recommend the best method for different scenarios
Conclusion
Parameter-efficient fine-tuning methods have democratized access to large language models, making it possible to adapt billion-parameter models with limited resources. These techniques not only reduce computational requirements but often provide comparable performance to full fine-tuning.
As models continue to grow, PEFT methods will become increasingly important. The rapid pace of innovation in this area—from adapters to LoRA to QLoRA—suggests that even more efficient techniques may emerge in the future, further lowering the barrier to working with advanced language models.
In our next lesson, we will explore distributed training infrastructure, enabling you to work with even larger models across multiple devices or machines.
Additional Resources
Papers
- "LoRA: Low-Rank Adaptation of Large Language Models" (Hu et al., 2021)
- "QLoRA: Efficient Finetuning of Quantized LLMs" (Dettmers et al., 2023)
- "Parameter-Efficient Transfer Learning for NLP" (Houlsby et al., 2019, Adapters)
- "The Power of Scale for Parameter-Efficient Prompt Tuning" (Lester et al., 2021)
Libraries and Tools
- PEFT Library by Hugging Face
- Adapter-Transformers
- bitsandbytes for quantization