Fine-tuning Techniques and Parameter-Efficient Methods

Overview

In our previous lessons, we explored how to train language models from scratch, monitor training, and engineer datasets. However, training models from scratch is resource-intensive and often unnecessary. Fine-tuning existing pre-trained models is a more efficient approach for most applications.

This lesson focuses on fine-tuning techniques for large language models, with special emphasis on parameter-efficient methods. As models grow to billions of parameters, traditional fine-tuning becomes prohibitively expensive. We'll explore how methods like LoRA, QLoRA, and other PEFT (Parameter-Efficient Fine-Tuning) approaches make it possible to adapt these massive models with limited computational resources.

Learning Objectives

After completing this lesson, you will be able to:

  • Understand the differences between pre-training and fine-tuning
  • Implement full fine-tuning for smaller models
  • Apply parameter-efficient fine-tuning techniques like LoRA and adapters
  • Select appropriate fine-tuning strategies based on available resources
  • Diagnose and fix common fine-tuning issues
  • Evaluate fine-tuned models effectively

From Pre-training to Fine-tuning

The Two-phase Learning Paradigm

Modern NLP follows a two-phase approach:

  1. Pre-training: Learning general language patterns from vast amounts of data
  2. Fine-tuning: Adapting the pre-trained model to specific tasks or domains

Analogy: Fine-tuning as Specialized Education

Think of pre-training and fine-tuning as education stages:

  • Pre-training: General education that builds foundational knowledge (like K-12 and undergraduate studies)
  • Fine-tuning: Specialized training for specific professions (like medical school, law school, or vocational training)

Just as a medical student builds upon general knowledge to develop specialized skills, fine-tuning builds upon a pre-trained model's general language understanding to develop task-specific capabilities.

Why Fine-tune?

The trade-off between resources and task performance can be summarized as follows:

  • Pre-training from Scratch: highest resource requirements, good but general performance
  • Full Fine-tuning: moderate resources, excellent task-specific performance
  • Parameter-Efficient Fine-tuning: low resources, very good task-specific performance
  • Prompt Engineering: minimal resources, moderate task-specific performance

Full Fine-tuning: The Traditional Approach

How Full Fine-tuning Works

Full fine-tuning updates all parameters of a pre-trained model on a downstream task:

  1. Initialize with pre-trained weights
  2. Add task-specific head if needed (e.g., classification layer)
  3. Train on task-specific data with a lower learning rate
  4. Update all parameters throughout the network

Implementing Full Fine-tuning

{"tool": "code-editor", "defaultValue": "from transformers import AutoModelForSequenceClassification, AutoTokenizer from transformers import Trainer, TrainingArguments from datasets import load_dataset

Load pre-trained model

model_name = "bert-base-uncased" model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2) tokenizer = AutoTokenizer.from_pretrained(model_name)

Prepare dataset (example: IMDB sentiment analysis)

dataset = load_dataset("imdb")

def tokenize_function(examples): return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

Define training arguments

training_args = TrainingArguments( output_dir="./results", learning_rate=2e-5, per_device_train_batch_size=8, per_device_eval_batch_size=8, num_train_epochs=3, weight_decay=0.01, evaluation_strategy="epoch", save_strategy="epoch", load_best_model_at_end=True, )

Initialize Trainer

trainer = Trainer( model=model, args=training_args, train_dataset=tokenized_datasets["train"], eval_dataset=tokenized_datasets["test"], )

Fine-tune the model

trainer.train()"}

Challenges with Full Fine-tuning

As models grow larger, full fine-tuning faces significant challenges:

  1. Memory Requirements:

    • A 7B parameter model in FP16 requires ~14GB just to store
    • Backpropagation requires additional memory for gradients and optimizer states
    • As a rule of thumb, expect to need 3-4x the model size in GPU memory (see the estimate sketched after this list)
  2. Computational Cost:

    • Training cost scales linearly with parameter count
    • Fine-tuning 175B parameter models can cost thousands of dollars
  3. Catastrophic Forgetting:

    • Aggressive fine-tuning can cause the model to "forget" general capabilities
    • Finding the right balance is challenging
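
To make these numbers concrete, here is a rough back-of-the-envelope estimate in Python. The overhead factor encodes the 3-4x rule of thumb above; real usage also depends on the optimizer, batch size, and activation memory.

```python
def estimate_full_ft_memory_gb(params_in_billions, bytes_per_param=2, overhead_factor=4):
    """Back-of-the-envelope GPU memory estimate for full fine-tuning.

    bytes_per_param=2 corresponds to FP16 weights; overhead_factor covers
    gradients and optimizer states (the 3-4x rule of thumb). Activations
    add more memory on top of this.
    """
    weights_gb = params_in_billions * bytes_per_param  # 1B params * 2 bytes = 2 GB
    return weights_gb * overhead_factor

# A 7B parameter model: ~14 GB for the weights alone,
# roughly 42-56 GB once gradients and optimizer states are included
print(f"Weights only: ~{7 * 2} GB")
print(f"Full fine-tuning: ~{estimate_full_ft_memory_gb(7, overhead_factor=3):.0f}"
      f"-{estimate_full_ft_memory_gb(7, overhead_factor=4):.0f} GB")
```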

Parameter-Efficient Fine-tuning (PEFT)

The PEFT Revolution

Parameter-Efficient Fine-Tuning methods fine-tune only a small subset of parameters while keeping most of the pre-trained model frozen.

Analogy: PEFT as Adding Specialized Tools

Think of PEFT as adding specialized tools to a well-equipped workshop:

  • The workshop (pre-trained model) already has general-purpose tools
  • Instead of rebuilding the entire workshop, you add a few specialized tools (trainable parameters)
  • These specialized tools enable specific tasks while leveraging the existing equipment

Core PEFT Methods

Adapter-based Methods

How Adapters Work

Adapters are small neural network modules inserted between layers of a pre-trained model:

  1. Freeze the pre-trained model parameters
  2. Insert adapter modules after certain layers (typically attention or feed-forward)
  3. Train only the adapter parameters
  4. Adapters typically use bottleneck architecture to limit parameter count

Adapter Architecture

Adapters typically use a bottleneck architecture:

  1. Down-project to a small dimension (e.g., 64)
  2. Apply non-linearity (e.g., ReLU or GELU)
  3. Up-project back to original dimension
  4. Add a residual connection

{"tool": "code-editor", "defaultValue": "import torch import torch.nn as nn

class Adapter(nn.Module): def init(self, input_dim, bottleneck_dim=64): super().init() self.down_project = nn.Linear(input_dim, bottleneck_dim) self.activation = nn.GELU() self.up_project = nn.Linear(bottleneck_dim, input_dim) self.layer_norm = nn.LayerNorm(input_dim)

1
def forward(self, x):
2
residual = x
3
x = self.down_project(x)
4
x = self.activation(x)
5
x = self.up_project(x)
6
x = x + residual # Residual connection
7
x = self.layer_norm(x)
8
return x

Example usage

input_dim = 768 # Hidden dimension for BERT-base adapter = Adapter(input_dim, bottleneck_dim=64)

Input tensor [batch_size, sequence_length, hidden_dim]

sample_input = torch.randn(2, 128, input_dim) output = adapter(sample_input) print(f"Input shape: {sample_input.shape}") print(f"Output shape: {output.shape}") print(f"Number of trainable parameters: {sum(p.numel() for p in adapter.parameters())}")"}

Implementing Adapters with Transformers

{"tool": "code-editor", "defaultValue": "from transformers import AutoModelForSequenceClassification, AutoTokenizer from transformers.adapters import AdapterConfig, PfeifferConfig from datasets import load_dataset

Load pre-trained model

model_name = "bert-base-uncased" model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2) tokenizer = AutoTokenizer.from_pretrained(model_name)

Add and activate adapters

adapter_config = PfeifferConfig(reduction_factor=16) # Creates a bottleneck model.add_adapter("imdb", config=adapter_config) model.train_adapter("imdb") # Only train the adapter parameters model.set_active_adapters("imdb")

Check trainable parameters

trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad) total_params = sum(p.numel() for p in model.parameters()) print(f"Trainable parameters: {trainable_params:,} ({trainable_params/total_params:.2%} of total)")

The rest of the fine-tuning process is the same as full fine-tuning

You would use the same Trainer setup as in the full fine-tuning example"}

Low-Rank Adaptation (LoRA)

The LoRA Principle

LoRA is based on a key insight: the updates to pre-trained weights during fine-tuning often have a low "intrinsic rank".

Analogy: LoRA as Efficient Communication

Think of LoRA like compressing a high-resolution image:

  • Instead of sending the full image (all parameter updates), you send a compressed version
  • The compression works by capturing the most important patterns
  • You can reconstruct a close approximation to the original image with much less data

How LoRA Works

  1. Freeze the pre-trained model weights
  2. For selected weight matrices, learn low-rank update matrices
  3. The original operation Y = WX becomes Y = WX + ΔWX, where:

    • W is the frozen pre-trained weight
    • ΔW = BA is the low-rank update of rank r
    • B is a matrix of shape [original_dim, r]
    • A is a matrix of shape [r, original_dim]

Because r is much smaller than the original dimension, B and A together contain far fewer trainable parameters than W (see the quick calculation below).
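
As a quick sanity check on the savings, the following sketch compares the size of a full weight update with its rank-r LoRA factorization (the hidden dimension is illustrative):

```python
# Trainable parameters: full update of a square weight matrix vs. a rank-r LoRA update
d = 4096   # hidden dimension (illustrative)
r = 8      # LoRA rank

full_update_params = d * d          # updating W directly
lora_update_params = d * r + r * d  # B is [d, r], A is [r, d]

print(f"Full update: {full_update_params:,} parameters")
print(f"LoRA update: {lora_update_params:,} parameters")
print(f"Reduction:   {full_update_params // lora_update_params}x fewer trainable parameters")
```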

Implementing LoRA

{"tool": "code-editor", "defaultValue": "import torch import torch.nn as nn

class LoRALayer(nn.Module): def init(self, in_features, out_features, rank=8, alpha=32): super().init() self.rank = rank self.alpha = alpha self.scaling = alpha / rank

1
# Initialize A with zeros (or small random values)
2
self.A = nn.Parameter(torch.zeros(in_features, rank))
3
nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
4
5
# Initialize B with zeros
6
self.B = nn.Parameter(torch.zeros(rank, out_features))
7
8
def forward(self, x, orig_weights):
9
# Original operation: x @ orig_weights
10
# LoRA operation: x @ orig_weights + x @ (A @ B) * scaling
11
return x @ orig_weights + (x @ self.A @ self.B) * self.scaling

Example usage with a pre-trained Linear layer

class LinearWithLoRA(nn.Module): def init(self, base_layer, rank=8, alpha=32): super().init() self.base_layer = base_layer # Freeze the original layer for param in self.base_layer.parameters(): param.requires_grad = False

1
# Add LoRA components
2
self.lora = LoRALayer(
3
base_layer.in_features,
4
base_layer.out_features,
5
rank=rank,
6
alpha=alpha
7
)
8
9
def forward(self, x):
10
return self.lora(x, self.base_layer.weight)"}

LoRA with PEFT Library

{"tool": "code-editor", "defaultValue": "from transformers import AutoModelForCausalLM, AutoTokenizer from peft import get_peft_model, LoraConfig, TaskType from datasets import load_dataset

Load pre-trained model

model_name = "facebook/opt-1.3b" # Using a 1.3B parameter model as example model = AutoModelForCausalLM.from_pretrained(model_name) tokenizer = AutoTokenizer.from_pretrained(model_name)

Define LoRA configuration

lora_config = LoraConfig( r=16, # Rank lora_alpha=32, # Alpha parameter target_modules=["q_proj", "v_proj"], # Apply LoRA to query and value projections lora_dropout=0.05, # Dropout probability for LoRA layers bias="none", # Don't add bias terms task_type=TaskType.CAUSAL_LM # The task type )

Create PEFT model

model = get_peft_model(model, lora_config)

Print the number of trainable parameters

trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad) total_params = sum(p.numel() for p in model.parameters()) print(f"Trainable parameters: {trainable_params:,} ({trainable_params/total_params:.2%} of total)")

Now we can fine-tune this model with much fewer resources

The training process would be similar to the previous examples

but would require significantly less memory and compute"}

Quantized LoRA (QLoRA)

Combining Quantization and LoRA

QLoRA combines two powerful techniques:

  1. Quantization: Reduces the precision of model weights (e.g., from FP16 to 4-bit)
  2. LoRA: Adds trainable low-rank adapters

Why QLoRA Works

  1. Memory Efficiency:

    • 4-bit quantization reduces memory footprint by 4x compared to FP16
    • Only small LoRA modules are kept in higher precision for training
  2. Minimal Performance Loss:

    • Novel quantization techniques like Double Quantization minimize precision loss
    • LoRA updates compensate for any quantization artifacts

Implementing QLoRA

{"tool": "code-editor", "defaultValue": "from transformers import AutoModelForCausalLM, AutoTokenizer from peft import prepare_model_for_kbit_training, LoraConfig, get_peft_model import torch from datasets import load_dataset

Load pre-trained model in 4-bit quantization

model_name = "meta-llama/Llama-2-7b-hf" # Example with a 7B parameter model model = AutoModelForCausalLM.from_pretrained( model_name, load_in_4bit=True, # Load model in 4-bit precision device_map="auto", # Automatically distribute model across available GPUs quantization_config=BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16, bnb_4bit_use_double_quant=True, # Double quantization bnb_4bit_quant_type="nf4" # Normalized float 4 ) ) tokenizer = AutoTokenizer.from_pretrained(model_name)

Prepare model for k-bit training

model = prepare_model_for_kbit_training(model)

Define LoRA configuration

lora_config = LoraConfig( r=64, # Higher rank for better performance lora_alpha=16, target_modules=[ "q_proj", "k_proj", "v_proj", "o_proj", # Attention modules "gate_proj", "up_proj", "down_proj" # MLP modules ], lora_dropout=0.05, bias="none", task_type="CAUSAL_LM" )

Apply LoRA adapters

model = get_peft_model(model, lora_config)

Print trainable parameters

trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad) total_params = sum(p.numel() for p in model.parameters()) print(f"Trainable parameters: {trainable_params:,} ({trainable_params/total_params:.2%} of total)")

Fine-tune as usual, but with much lower memory requirements

This setup allows fine-tuning a 7B parameter model on a single consumer GPU"}

Other PEFT Methods

Prefix Tuning

Prefix tuning prepends trainable vectors (virtual tokens) to the input of each transformer layer:

  1. Freeze the pre-trained model
  2. Add trainable prefix tokens to each layer
  3. These prefix tokens influence the model's behavior through attention
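
With the PEFT library, prefix tuning can be set up in a few lines. The sketch below is a minimal example; the model name and number of virtual tokens are illustrative choices, not prescribed values.

```python
from transformers import AutoModelForCausalLM
from peft import get_peft_model, PrefixTuningConfig, TaskType

model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

prefix_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,  # Trainable prefix length prepended at each layer
)

model = get_peft_model(model, prefix_config)
model.print_trainable_parameters()  # Only the prefix parameters are trainable
```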

Prompt Tuning and P-Tuning

  • Prompt Tuning: Adds trainable tokens only to the input layer
  • P-Tuning: Uses a small neural network to generate soft prompts
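
Prompt tuning follows the same pattern. The sketch below (with illustrative values) initializes the soft prompt from a natural-language phrase using the PEFT library:

```python
from transformers import AutoModelForCausalLM
from peft import get_peft_model, PromptTuningConfig, PromptTuningInit, TaskType

model_name = "facebook/opt-1.3b"
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=16,
    prompt_tuning_init=PromptTuningInit.TEXT,  # Initialize the soft prompt from text
    prompt_tuning_init_text="Classify the sentiment of this review:",
    tokenizer_name_or_path=model_name,
)

model = get_peft_model(model, prompt_config)
model.print_trainable_parameters()  # Only the soft prompt embeddings are trainable
```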

IA³ (Infused Adapter by Inhibiting and Amplifying Inner Activations)

A highly parameter-efficient method that scales activations with learned vectors:

  • Requires minimal additional parameters (often <0.1%)
  • Simple element-wise multiplication operation
  • Often works well for cross-lingual transfer
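
The PEFT library also provides an IA³ configuration. The sketch below targets attention and feed-forward projections of an OPT-style model; the module names are model-specific assumptions and would need to be adjusted for other architectures.

```python
from transformers import AutoModelForCausalLM
from peft import get_peft_model, IA3Config, TaskType

model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

ia3_config = IA3Config(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["k_proj", "v_proj", "fc2"],  # Modules whose activations are rescaled
    feedforward_modules=["fc2"],                 # Which of those are feed-forward layers
)

model = get_peft_model(model, ia3_config)
model.print_trainable_parameters()  # Typically well under 0.1% of all parameters
```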

Practical Considerations for Fine-tuning

Selecting the Right Method

Each approach trades resource efficiency against task performance:

  • Full Fine-tuning: highest performance, highest resource usage
  • Adapters: good performance, moderate efficiency
  • LoRA: excellent performance, high efficiency
  • QLoRA: good performance, highest efficiency
  • Prefix Tuning: moderate performance, very high efficiency

Decision Framework

Use this framework to select the appropriate fine-tuning method:

  1. When to use Full Fine-tuning:

    • Smaller models (<1B parameters)
    • Abundant computational resources
    • Need maximum performance
  2. When to use LoRA/Adapters:

    • Medium to large models (1B-13B parameters)
    • Limited but substantial resources
    • Need balance of performance and efficiency
  3. When to use QLoRA:

    • Very large models (>7B parameters)
    • Highly constrained resources
    • Consumer-grade hardware
  4. When to use Prefix/Prompt Tuning:

    • Extremely large models
    • Minimal resources
    • Acceptable performance trade-off

Hyperparameter Considerations

Key hyperparameters for PEFT methods:

  1. LoRA-specific:

    • Rank (r): Higher values give better performance but use more parameters
    • Alpha (α): Scaling factor, typically set to 2r
    • Target modules: Which layers to apply LoRA to
  2. Adapter-specific:

    • Bottleneck dimension: Controls adapter size
    • Adapter placement: Which layers to add adapters to
  3. General fine-tuning:

    • Learning rate: Typically lower for fine-tuning (1e-5 to 5e-5)
    • Weight decay: Helps prevent overfitting (0.01 to 0.1)
    • Training epochs: Often fewer for fine-tuning (2-5)
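
Putting these defaults together, a reasonable starting configuration might look like the following sketch; treat the exact values as assumptions to be tuned per task:

```python
from peft import LoraConfig, TaskType
from transformers import TrainingArguments

# LoRA: alpha set to 2x the rank, applied to the attention projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,                        # 2 * r
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)

# General fine-tuning hyperparameters
training_args = TrainingArguments(
    output_dir="./peft-run",
    learning_rate=2e-5,             # 1e-5 to 5e-5 is a typical range
    weight_decay=0.01,              # 0.01 to 0.1 helps prevent overfitting
    num_train_epochs=3,             # Fine-tuning usually needs only 2-5 epochs
    per_device_train_batch_size=8,
)
```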

Avoiding Catastrophic Forgetting

Strategies to preserve general capabilities:

  1. Use lower learning rates
  2. Implement early stopping
  3. Apply regularization techniques
  4. Balance task-specific data with general data
  5. Consider multi-task fine-tuning
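
Early stopping, for example, is available directly through the Trainer API. The sketch below reuses the tokenized IMDB datasets from the full fine-tuning example; the patience value and monitored metric are illustrative:

```python
from transformers import (
    AutoModelForSequenceClassification,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,        # Required for early stopping
    metric_for_best_model="eval_loss",
    num_train_epochs=10,                # Early stopping will usually end training sooner
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],   # From the full fine-tuning example
    eval_dataset=tokenized_datasets["test"],
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
```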

Advanced Topics in Fine-tuning

Domain Adaptation vs. Task Adaptation

  1. Domain Adaptation:

    • Adapts to a specific domain (e.g., medical, legal)
    • Preserves general capabilities
    • Often requires continued pre-training
  2. Task Adaptation:

    • Focuses on specific tasks (e.g., classification, summarization)
    • May specialize at the expense of generality
    • Typically uses supervised fine-tuning

Instruction Tuning

Fine-tuning models on instruction-following data:

  1. Input format: Typically uses a template like "Instruction: {instruction}\nInput: {input}\nOutput:"
  2. Dataset composition: Mix of different task types and formats
  3. Evaluation: Measures ability to follow diverse instructions

{"tool": "code-editor", "defaultValue": "from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments from peft import get_peft_model, LoraConfig, TaskType from datasets import load_dataset import torch

Load model and tokenizer

model_name = "facebook/opt-1.3b" model = AutoModelForCausalLM.from_pretrained(model_name) tokenizer = AutoTokenizer.from_pretrained(model_name)

Add special tokens if needed

tokenizer.pad_token = tokenizer.eos_token

Prepare instruction dataset

instruction_dataset = load_dataset("tatsu-lab/alpaca") # Example instruction dataset

Format data with instruction template

def format_instruction(example): if example["input"]: text = f"""Instruction: {example["instruction"]} Input: {example["input"]} Output: {example["output"]}""" else: text = f"""Instruction: {example["instruction"]} Output: {example["output"]}"""

1
return {"text": text}

formatted_dataset = instruction_dataset.map(format_instruction)

Tokenize dataset

def tokenize_function(examples): return tokenizer( examples["text"], truncation=True, max_length=512, padding="max_length" )

tokenized_dataset = formatted_dataset.map(tokenize_function, batched=True)

Apply LoRA for efficient fine-tuning

lora_config = LoraConfig( r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], lora_dropout=0.05, bias="none", task_type=TaskType.CAUSAL_LM )

model = get_peft_model(model, lora_config)

Set up training arguments

training_args = TrainingArguments( output_dir="./instruction-tuned-model", learning_rate=2e-5, per_device_train_batch_size=4, gradient_accumulation_steps=4, max_steps=1000, save_steps=200, logging_steps=50, )

Train the model

trainer = Trainer( model=model, args=training_args, train_dataset=tokenized_dataset["train"], data_collator=lambda data: {'input_ids': torch.stack([f['input_ids'] for f in data]), 'attention_mask': torch.stack([f['attention_mask'] for f in data]), 'labels': torch.stack([f['input_ids'] for f in data])}, )

trainer.train()"}

Multi-task Fine-tuning

Training on multiple tasks simultaneously:

  1. Benefits:

    • Improves generalization
    • Prevents overfitting to a single task
    • Reduces catastrophic forgetting
  2. Implementation:

    • Collect datasets for multiple tasks
    • Balance task representation
    • Add task-specific identifiers or prompts
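
One way to balance task representation is to interleave datasets with explicit sampling probabilities and task prefixes. The sketch below uses the datasets library; the chosen datasets and prefix strings are illustrative:

```python
from datasets import load_dataset, interleave_datasets

# Tag each example with a task identifier so the model can tell the tasks apart
sentiment = load_dataset("imdb", split="train").map(
    lambda ex: {"text": f"[sentiment] {ex['text']}"},
    remove_columns=["label"],
)
paraphrase = load_dataset("glue", "mrpc", split="train")
paraphrase = paraphrase.map(
    lambda ex: {"text": f"[paraphrase] {ex['sentence1']} </s> {ex['sentence2']}"},
    remove_columns=paraphrase.column_names,
)

# Interleave so each batch mixes tasks; the probabilities control task balance
multi_task_dataset = interleave_datasets(
    [sentiment, paraphrase],
    probabilities=[0.5, 0.5],
    seed=42,
)
print(multi_task_dataset[0]["text"])
```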

Continual Learning and Sequential Fine-tuning

Strategies for learning new tasks without forgetting:

  1. Elastic Weight Consolidation (EWC):

    • Identifies important parameters for previous tasks
    • Penalizes changes to these parameters when learning new tasks
  2. Knowledge Distillation:

    • Uses original model as teacher
    • Prevents new model from diverging too far
  3. Replay Methods:

    • Maintains a buffer of examples from previous tasks
    • Intermixes these with new task examples during training
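
To make the EWC idea concrete, the following sketch shows the quadratic penalty added to the task loss. It is a simplified illustration; in practice the Fisher information would be estimated from gradients on the previous task's data.

```python
import torch

def ewc_penalty(model, old_params, fisher, lam=0.4):
    """Elastic Weight Consolidation penalty.

    old_params: parameter values saved after training on the previous task
    fisher:     per-parameter importance estimates (diagonal Fisher information)
    lam:        strength of the penalty
    """
    penalty = 0.0
    for name, param in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (param - old_params[name]) ** 2).sum()
    return (lam / 2) * penalty

# During training on a new task:
# loss = task_loss + ewc_penalty(model, old_params, fisher)
```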

Practical Exercises

Exercise 1: LoRA Fine-tuning

Implement LoRA fine-tuning for a sentiment classification task:

  1. Load a pre-trained model (e.g., BERT or RoBERTa)
  2. Configure LoRA adapters
  3. Fine-tune on a sentiment dataset (e.g., SST-2 or IMDB)
  4. Evaluate performance and parameter efficiency

Exercise 2: QLoRA for Large Models

Use QLoRA to fine-tune a large language model (>7B parameters) on a single GPU:

  1. Set up 4-bit quantization
  2. Configure LoRA adapters
  3. Fine-tune on an instruction dataset
  4. Compare performance before and after fine-tuning

Exercise 3: Method Comparison

Compare different PEFT methods on the same task:

  1. Implement Full Fine-tuning, LoRA, Adapters, and Prefix Tuning
  2. Train each method with the same dataset and hyperparameters
  3. Analyze performance, memory usage, and training time
  4. Recommend the best method for different scenarios

Conclusion

Parameter-efficient fine-tuning methods have democratized access to large language models, making it possible to adapt billion-parameter models with limited resources. These techniques not only reduce computational requirements but often provide comparable performance to full fine-tuning.

As models continue to grow, PEFT methods will become increasingly important. The rapid pace of innovation in this area—from adapters to LoRA to QLoRA—suggests that even more efficient techniques may emerge in the future, further lowering the barrier to working with advanced language models.

In our next lesson, we will explore distributed training infrastructure, enabling you to work with even larger models across multiple devices or machines.

Additional Resources

Papers

  • "LoRA: Low-Rank Adaptation of Large Language Models" (Hu et al., 2021)
  • "QLoRA: Efficient Finetuning of Quantized LLMs" (Dettmers et al., 2023)
  • "Parameter-Efficient Transfer Learning for NLP" (Houlsby et al., 2019, Adapters)
  • "The Power of Scale for Parameter-Efficient Prompt Tuning" (Lester et al., 2021)

Libraries and Tools

Blog Posts and Tutorials