Transformer Architecture Deep Dive

Transformer Architecture: Attention Is All You Need

Overview

In our previous lesson on RNNs, LSTMs, and GRUs, we explored the sequential approach to modeling language. While these architectures revolutionized NLP, they still suffered from fundamental limitations in handling long-range dependencies and parallelization.

This lesson introduces the Transformer architecture, a paradigm shift that replaced recurrence with attention mechanisms. First introduced in the seminal 2017 paper "Attention Is All You Need" by Vaswani et al., Transformers have become the foundation of modern NLP models like BERT, GPT, and T5 that have dramatically advanced the state of the art.

Learning Objectives

After completing this lesson, you will be able to:

  • Understand the key innovations and motivations behind the Transformer architecture
  • Explain self-attention and multi-head attention mechanisms in detail
  • Describe positional encoding and why it's necessary
  • Compare encoder-only, decoder-only, and encoder-decoder transformer variants
  • Implement basic transformer components
  • Recognize how transformers enable modern language models

The Need for a New Architecture

The Limitations of RNNs Revisited

As we saw in the previous lesson, RNNs and their variants face several critical limitations:

  1. Sequential Processing: Processing tokens one at a time creates a bottleneck for training and inference
  2. Limited Context Window: Even LSTMs struggle with very long-range dependencies
  3. Vanishing Gradients: Despite improvements, still an issue for very long sequences

Analogy: Information Highways vs. Relay Races

Think of an RNN as a relay race where information is passed from one runner (time step) to the next. If the race is long, messages can get distorted or lost along the way, and the entire race is only as fast as the slowest runner.

In contrast, a Transformer is like a highway system where every location has direct high-speed connections to every other location. Information doesn't have to flow sequentially but can take direct routes, and all routes can be traveled simultaneously.

The Transformer Architecture: A High-Level View

Architectural Overview

[Interactive visualization: high-level overview of the encoder-decoder Transformer, with masking annotated, translating "I love natural language processing" into "J'aime le traitement du langage naturel".]

Key Innovations

The Transformer introduced several groundbreaking innovations:

  1. Self-Attention: Allows each position to directly attend to all positions
  2. Multi-Head Attention: Enables attention across different representation subspaces
  3. Positional Encoding: Captures sequence order without recurrence
  4. Residual Connections + Layer Normalization: Facilitates training of deep networks
  5. Feed-Forward Networks: Adds non-linearity and transforms representations
  6. Parallel Processing: Processes all positions simultaneously, enabling far more efficient training than sequential RNNs

Self-Attention: The Core Mechanism

Understanding Attention

Attention allows a model to focus on relevant parts of the input sequence when making predictions. It computes a weighted sum of values, where weights reflect the relevance of each value to the current context.

The Intuition Behind Self-Attention

[Interactive visualization: attention weights over the sentence "The cat sat on the mat because it was comfortable", highlighting which words "it" attends to.]

In the example above, to understand what "it" refers to, the model must determine which previous words are most relevant. Self-attention allows the model to learn these relevance patterns.

Query, Key, Value (QKV) Framework

Self-attention can be conceptualized using the Query-Key-Value framework:

  1. Query (Q): What we're looking for
  2. Key (K): What we match against
  3. Value (V): What we retrieve if there's a match

Think of it as a sophisticated dictionary lookup:

  • The Query is like your search term
  • The Keys are like the dictionary entries
  • The Values are the definitions you retrieve

Self-Attention Computation: Step-by-Step

  1. Projection: Generate Query, Key, and Value vectors by multiplying the input embeddings by learned weight matrices: $\mathbf{Q} = \mathbf{X}\mathbf{W}^Q$, $\mathbf{K} = \mathbf{X}\mathbf{W}^K$, $\mathbf{V} = \mathbf{X}\mathbf{W}^V$

  2. Score Calculation: Compute attention scores by multiplying the Q and K matrices: $\text{Score} = \mathbf{Q}\mathbf{K}^T$

  3. Scaling: Divide by the square root of the key dimension so that large dot products do not push the softmax into regions with extremely small gradients: $\text{Score}_{\text{scaled}} = \frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}$

  4. Masking (Decoder Only): Apply a mask to prevent attending to future positions: $\text{Score}_{\text{masked}} = \text{Score}_{\text{scaled}} + \text{Mask}$

  5. Softmax: Apply softmax to turn the (masked) scores into a probability distribution over positions: $\text{Attention Weights} = \text{softmax}(\text{Score}_{\text{scaled}})$

  6. Weighted Sum: Multiply the attention weights by the values: $\text{Attention Output} = \text{Attention Weights} \cdot \mathbf{V}$
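
To make these steps concrete, here is a minimal PyTorch sketch of single-head scaled dot-product attention. The tensor sizes are illustrative, and for simplicity the input X is reused directly as queries, keys, and values instead of being projected with learned weight matrices (step 1); the full multi-head implementation appears later in this lesson.

python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: [seq_len, d_k] -- a toy single-head, single-example version
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5               # steps 2-3: scores, then scaling
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))   # step 4: optional mask
    weights = F.softmax(scores, dim=-1)                         # step 5: attention weights
    return weights @ V, weights                                 # step 6: weighted sum of values

# Toy example: 4 tokens with d_k = 8; here X stands in for Q, K, and V
X = torch.randn(4, 8)
output, weights = scaled_dot_product_attention(X, X, X)
print(output.shape, weights.shape)  # torch.Size([4, 8]) torch.Size([4, 4])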

Visualizing Self-Attention

[Interactive tool: step-by-step self-attention calculation for the sequence "I attend an NLP class", showing the matrix computations as heatmaps.]

Multi-Head Attention: Attending to Different Aspects

Why Multiple Attention Heads?

Self-attention with a single attention mechanism (or "head") can only capture one type of relationship between words. But language has many types of relationships (syntactic, semantic, referential, etc.).

Multiple attention heads allow the model to:

  • Attend to different representation subspaces simultaneously
  • Capture different types of dependencies (e.g., syntactic vs. semantic)
  • Create a richer representation by combining these diverse perspectives

Multi-Head Attention Mechanism

[Interactive visualization: four attention heads over "The scientist who discovered the neutron won the Nobel Prize" and how their outputs are combined.]

Mathematical Formulation

For each head $i$: $\text{head}_i = \text{Attention}(\mathbf{X}\mathbf{W}_i^Q, \mathbf{X}\mathbf{W}_i^K, \mathbf{X}\mathbf{W}_i^V)$

The outputs from all heads are concatenated and linearly transformed: $\text{MultiHead}(\mathbf{X}) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h)\mathbf{W}^O$
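
The shapes are often the confusing part. The sketch below uses illustrative dimensions (d_model = 512 split across h = 8 heads of 64 dimensions each) and random tensors standing in for the per-head attention outputs, just to show how the concatenation followed by W^O maps the heads back to d_model.

python
import torch
import torch.nn as nn

# Illustrative dimensions (assumptions): d_model = 512, h = 8 heads of size 64
d_model, h, d_head, seq_len = 512, 8, 64, 10

# Random tensors standing in for the outputs head_1 ... head_h
heads = [torch.randn(1, seq_len, d_head) for _ in range(h)]

# Concatenate along the feature dimension, then apply the output projection W^O
W_O = nn.Linear(h * d_head, d_model)
multi_head_output = W_O(torch.cat(heads, dim=-1))
print(multi_head_output.shape)  # torch.Size([1, 10, 512])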

Analogy: Multiple Expert Consultants

Think of multi-head attention as consulting multiple experts who each focus on different aspects of a problem:

  • One linguist focuses on grammar
  • Another focuses on vocabulary
  • A third focuses on cultural context
  • A fourth focuses on tone

Each provides valuable insights from their perspective, and together they create a more comprehensive understanding than any single expert could provide.

Positional Encoding: Preserving Sequence Order

The Problem: Transformers Don't Know Position

Unlike RNNs, the self-attention mechanism is inherently permutation-invariant—it doesn't consider the order of tokens. This is a problem because word order is crucial in language understanding.

For example, these sentences have very different meanings despite using the same words:

  • "The dog chased the cat"
  • "The cat chased the dog"

Solution: Positional Encoding

Transformers add positional information to each word embedding using sinusoidal functions:

$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$

$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$

Where:

  • $pos$ is the position of the token in the sequence
  • $i$ is the dimension index
  • $d_{\text{model}}$ is the embedding dimension

Visualizing Positional Encoding

[Interactive visualization: heatmap and waveform views of sinusoidal positional encodings for a 20-position, 64-dimensional example.]

Key Properties of Sinusoidal Positional Encoding

  1. Unique Pattern: Each position gets a unique encoding
  2. Fixed Offset: The relative encoding between positions at a fixed offset is constant
  3. Extrapolation: Can generalize to longer sequences than seen in training
  4. No New Parameters: Unlike learned positional embeddings, requires no additional parameters

Embedding + Positional Encoding

The final input to the transformer is the sum of the word embeddings and the positional encodings:

$\text{Input} = \text{WordEmbedding} + \text{PositionalEncoding}$

[Interactive visualization: word embeddings, positional encodings, and their combined sum for "The transformer architecture revolutionized NLP".]
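
A small sketch of this sum, with illustrative dimensions (a 5-token sequence and a 32-dimensional embedding); the sinusoidal encoding follows the formulas above. Note that the original paper additionally scales the word embeddings by $\sqrt{d_{\text{model}}}$ before adding the positional encodings, which is omitted here for brevity.

python
import math
import torch
import torch.nn as nn

# Illustrative sizes: vocabulary of 10,000, 5 tokens, 32-dimensional embeddings
vocab_size, d_model, seq_len = 10000, 32, 5
token_ids = torch.randint(0, vocab_size, (1, seq_len))

# Word embeddings
embed = nn.Embedding(vocab_size, d_model)
word_emb = embed(token_ids)                                # [1, seq_len, d_model]

# Sinusoidal positional encodings for the first seq_len positions
position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
pe = torch.zeros(seq_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)               # even dimensions: sine
pe[:, 1::2] = torch.cos(position * div_term)               # odd dimensions: cosine

# The transformer input is the element-wise sum
x = word_emb + pe.unsqueeze(0)
print(x.shape)  # torch.Size([1, 5, 32])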

The Building Blocks: Encoder and Decoder

Transformer Encoder

The encoder processes the input sequence and consists of:

  1. Multi-Head Self-Attention: Each position attends to all positions
  2. Feed-Forward Neural Network: A two-layer network with ReLU activation
  3. Residual Connections: Helps gradient flow and stabilizes training
  4. Layer Normalization: Normalizes inputs to each sub-layer

[Interactive visualization: a single encoder layer processing "The transformer is revolutionary", showing attention weights, feed-forward activations, layer normalization, and residual connections.]

Feed-Forward Network (FFN)

The FFN applies the same transformation to each position independently:

$\text{FFN}(x) = \max(0, x\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2$

This is equivalent to two dense layers with a ReLU activation in between. The FFN allows the model to transform its representations and introduces non-linearity.
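
A minimal sketch of this position-wise FFN in PyTorch, using the dimensions from the original paper (d_model = 512, inner size 2048) as illustrative values:

python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),   # x W_1 + b_1
    nn.ReLU(),                  # max(0, .)
    nn.Linear(d_ff, d_model),   # (.) W_2 + b_2
)

x = torch.randn(2, 10, d_model)   # [batch, seq_len, d_model]
print(ffn(x).shape)               # torch.Size([2, 10, 512]), applied independently at each position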

Transformer Decoder

The decoder generates the output sequence and has three main components:

  1. Masked Multi-Head Self-Attention: Each position attends only to previous positions
  2. Cross-Attention: Attends to the encoder's output
  3. Feed-Forward Neural Network: Same structure as in the encoder

{"tool": "transformer-decoder-visualizer", "defaultValue": { "inputSequence": ["", "transformers", "are", "powerful"], "encoderOutput": ["Transformers", "are", "changing", "NLP"], "showMasking": true, "showCrossAttention": true, "showSelfAttention": true, "showFFN": true }}

Masking in the Decoder

The decoder must generate text autoregressively (one token at a time), so it can't "see" future tokens during training. This is achieved using a look-ahead mask:

[Interactive visualization: a causal (look-ahead) mask over a 5-token sequence and its effect on the attention weights.]
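
Concretely, a look-ahead mask is just a lower-triangular matrix. The sketch below builds one with torch.tril, using the 0/1 convention assumed by the SelfAttention implementation later in this lesson (0 means the position is masked out):

python
import torch

seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.int))
print(causal_mask)
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]], dtype=torch.int32)
# Positions where the mask is 0 are filled with a large negative value before the
# softmax, so their attention weights become effectively zero.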

Cross-Attention

Cross-attention allows the decoder to focus on relevant parts of the input sequence:

[Interactive visualization: cross-attention alignment between the source "The transformer architecture is powerful" and its French translation.]

The Full Architecture: Putting It All Together

Complete Transformer Architecture

The complete transformer architecture consists of a stack of encoder and decoder layers:

[Interactive visualization: the complete architecture with a stack of 6 encoder and 6 decoder layers, showing data flow and tensor dimensions.]

Training the Transformer

Transformers are typically trained with:

  1. Teacher forcing: Using ground truth as decoder input during training
  2. Label smoothing: Preventing overconfidence by softening the target distribution
  3. Learning rate scheduling: Using warmup and decay for optimal convergence (see the sketch after this list)
  4. Large batch sizes: Stabilizing training with more examples per update
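
For the learning-rate schedule, the original paper uses a linear warmup followed by inverse-square-root decay. A small sketch of that schedule, with the paper's warmup_steps = 4000 and d_model = 512 as illustrative defaults:

python
# Learning-rate schedule from "Attention Is All You Need": linear warmup, then inverse-sqrt decay
def transformer_lr(step, d_model=512, warmup_steps=4000):
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in [100, 1000, 4000, 10000, 100000]:
    print(step, transformer_lr(step))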

Computational Complexity

The self-attention mechanism has quadratic complexity with respect to sequence length:

$\mathcal{O}(n^2 \cdot d)$

Where:

  • $n$ is the sequence length
  • $d$ is the representation dimension

This can be a limitation for very long sequences, leading to various efficient transformer variants that reduce this complexity.
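
As a quick illustration of this growth, each attention head materializes an n × n matrix of scores, so doubling the sequence length quadruples the number of entries:

python
# Number of attention scores per head grows quadratically with sequence length
for n in [128, 512, 2048, 8192]:
    print(f"seq_len={n:5d}  attention matrix entries={n * n:,}")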

Transformer Variants: Encoder-Only, Decoder-Only, and Encoder-Decoder

Encoder-Only Models

Encoder-only models are suitable for understanding tasks like classification, named entity recognition, and sentiment analysis.

Examples: BERT, RoBERTa, DistilBERT

[Interactive comparison of BERT, RoBERTa, and DistilBERT: encoder structure, parameter counts, pre-training objectives, and typical applications.]

Decoder-Only Models

Decoder-only models are used for text generation tasks.

Examples: GPT, GPT-2, GPT-3, GPT-4

[Interactive comparison of GPT, GPT-2, GPT-3, and GPT-4: decoder structure, scaling, parameter counts, and typical applications.]

Encoder-Decoder Models

Encoder-decoder models excel at sequence-to-sequence tasks like translation and summarization.

Examples: T5, BART, Pegasus

[Interactive comparison of T5, BART, and Pegasus: encoder-decoder structure, pre-training objectives, parameter counts, and typical applications.]

Implementation: Building a Simple Transformer

Implementing Self-Attention in PyTorch

python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super(SelfAttention, self).__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads

        assert (self.head_dim * heads == embed_size), "Embed size must be divisible by heads"

        # Linear projections for Q, K, V
        self.q_linear = nn.Linear(embed_size, embed_size)
        self.k_linear = nn.Linear(embed_size, embed_size)
        self.v_linear = nn.Linear(embed_size, embed_size)
        self.out_linear = nn.Linear(embed_size, embed_size)

    def forward(self, query, key, value, mask=None):
        batch_size = query.shape[0]

        # Linear projections and split into heads
        q = self.q_linear(query).view(batch_size, -1, self.heads, self.head_dim).permute(0, 2, 1, 3)
        k = self.k_linear(key).view(batch_size, -1, self.heads, self.head_dim).permute(0, 2, 1, 3)
        v = self.v_linear(value).view(batch_size, -1, self.heads, self.head_dim).permute(0, 2, 1, 3)

        # Compute attention scores
        scores = torch.matmul(q, k.permute(0, 1, 3, 2)) / math.sqrt(self.head_dim)

        # Apply mask if provided (for decoder)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-1e20"))

        # Apply softmax and compute attention weights
        attention_weights = F.softmax(scores, dim=-1)

        # Compute output
        out = torch.matmul(attention_weights, v)
        out = out.permute(0, 2, 1, 3).contiguous()
        out = out.view(batch_size, -1, self.embed_size)
        out = self.out_linear(out)

        return out
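
As a quick sanity check (sizes are illustrative), the module can be run on random inputs, using the same tensor for queries, keys, and values as in self-attention:

python
attention = SelfAttention(embed_size=512, heads=8)
x = torch.randn(2, 10, 512)        # [batch, seq_len, embed_size]
out = attention(x, x, x)           # self-attention: query, key, and value are all x
print(out.shape)                   # torch.Size([2, 10, 512])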

Implementing Positional Encoding

python
class PositionalEncoding(nn.Module):
    def __init__(self, embed_size, max_len=5000):
        super(PositionalEncoding, self).__init__()

        pe = torch.zeros(max_len, embed_size)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, embed_size, 2).float() * (-math.log(10000.0) / embed_size))

        # Apply sin to even indices
        pe[:, 0::2] = torch.sin(position * div_term)
        # Apply cos to odd indices
        pe[:, 1::2] = torch.cos(position * div_term)

        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x has shape [batch_size, seq_len, embed_size]
        return x + self.pe[:, :x.size(1), :]
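
A brief check that the encodings broadcast over a batch of (here random) embeddings and that the sine channels are zero at position 0, as expected from the formula:

python
pos_enc = PositionalEncoding(embed_size=512)
x = torch.randn(2, 10, 512)
print(pos_enc(x).shape)                                           # torch.Size([2, 10, 512])
print(torch.allclose(pos_enc.pe[0, 0, 0::2], torch.zeros(256)))   # True: sin(0) = 0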

Transformer Encoder Layer

python
class TransformerEncoderLayer(nn.Module):
    def __init__(self, embed_size, heads, dropout, forward_expansion):
        super(TransformerEncoderLayer, self).__init__()

        self.attention = SelfAttention(embed_size, heads)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)

        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, forward_expansion * embed_size),
            nn.ReLU(),
            nn.Linear(forward_expansion * embed_size, embed_size)
        )

        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention block with residual connection and layer norm
        attention_output = self.attention(x, x, x, mask)
        x = self.norm1(x + self.dropout(attention_output))

        # Feed forward block with residual connection and layer norm
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))

        return x

Transformer Decoder Layer

python
class TransformerDecoderLayer(nn.Module):
    def __init__(self, embed_size, heads, dropout, forward_expansion):
        super(TransformerDecoderLayer, self).__init__()

        self.attention = SelfAttention(embed_size, heads)
        self.cross_attention = SelfAttention(embed_size, heads)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)
        self.norm3 = nn.LayerNorm(embed_size)

        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, forward_expansion * embed_size),
            nn.ReLU(),
            nn.Linear(forward_expansion * embed_size, embed_size)
        )

        self.dropout = nn.Dropout(dropout)

    def forward(self, x, encoder_output, source_mask, target_mask):
        # Self-attention block with residual connection and layer norm
        attention_output = self.attention(x, x, x, target_mask)
        x = self.norm1(x + self.dropout(attention_output))

        # Cross-attention block with residual connection and layer norm
        cross_attention_output = self.cross_attention(
            x, encoder_output, encoder_output, source_mask
        )
        x = self.norm2(x + self.dropout(cross_attention_output))

        # Feed forward block with residual connection and layer norm
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))

        return x
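
To see how the pieces fit together, the sketch below wires one encoder layer into one decoder layer on random embeddings (all sizes illustrative), with a causal mask built the same way as in the masking section above:

python
embed_size, heads = 512, 8
enc_layer = TransformerEncoderLayer(embed_size, heads, dropout=0.1, forward_expansion=4)
dec_layer = TransformerDecoderLayer(embed_size, heads, dropout=0.1, forward_expansion=4)

src = torch.randn(2, 12, embed_size)   # source-side embeddings (word + positional)
tgt = torch.randn(2, 9, embed_size)    # shifted target-side embeddings

# Causal mask: [1, 1, tgt_len, tgt_len], broadcast over batch and heads
tgt_mask = torch.tril(torch.ones(9, 9)).unsqueeze(0).unsqueeze(0)

memory = enc_layer(src)                                               # no padding mask in this toy example
out = dec_layer(tgt, memory, source_mask=None, target_mask=tgt_mask)
print(out.shape)                                                      # torch.Size([2, 9, 512])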

Applications: How Transformers Revolutionized NLP

Machine Translation

The original transformer model was designed for machine translation and significantly improved the state of the art on the WMT English-to-German and English-to-French translation tasks.


Language Modeling and Text Generation

Transformer-based language models like GPT can generate remarkably coherent and contextually appropriate text.

Code Example: Text Generation with a Pre-trained Model

python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained model and tokenizer
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Generate text
prompt = "The transformer architecture"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    inputs["input_ids"],
    max_length=100,
    num_return_sequences=1,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

# Decode the generated text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Bidirectional Understanding and Masked Language Modeling

BERT and its variants use transformer encoders with masked language modeling to develop bidirectional understanding of text.

Code Example: Masked Language Modeling with BERT

python
from transformers import BertTokenizer, BertForMaskedLM
import torch

# Load pre-trained model and tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

# Prepare masked input
text = "The transformer architecture has [MASK] natural language processing."
inputs = tokenizer(text, return_tensors="pt")
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

# Forward pass
with torch.no_grad():
    outputs = model(**inputs)
    predictions = outputs.logits

# Get top 5 predictions for the masked token
predicted_token_ids = torch.topk(predictions[0, mask_token_index], 5).indices
predicted_tokens = [tokenizer.decode([token_id]) for token_id in predicted_token_ids[0]]

print(f"Predictions for masked token: {predicted_tokens}")

Limitations and Future Directions

Current Limitations

  1. Quadratic Complexity: Self-attention scales poorly with sequence length
  2. Context Window: Limited by training and architecture constraints
  3. Interpretability: Understanding attention patterns isn't straightforward
  4. Data Hunger: Requires massive amounts of data for best performance
  5. Compute Resources: Training large models requires significant resources

Efficient Transformer Variants

Researchers have developed a range of efficient Transformer variants that reduce the quadratic cost of self-attention, for example through sparse, low-rank, or linear approximations of the attention matrix.

The Future of Transformers

Transformers continue to evolve in several exciting directions:

  1. Multimodal Transformers: Processing text, images, audio, and video together
  2. Domain-Specific Architectures: Specialized for specific fields (science, medicine)
  3. Mixture of Experts: Using sparse activation to scale to trillions of parameters
  4. Retrieval-Augmented Models: Enhancing LLMs with external knowledge access
  5. More Efficient Attention: Continuing to reduce the quadratic complexity

Summary

In this lesson, we've covered:

  1. The fundamental innovations of the Transformer architecture:

    • Self-attention mechanisms
    • Multi-head attention
    • Positional encoding
    • Layer normalization and residual connections
  2. The detailed workings of the architecture:

    • Encoder and decoder structure
    • Query, key, value projections
    • The feed-forward network
    • Masking in the decoder
  3. Transformer variants and applications:

    • Encoder-only models like BERT
    • Decoder-only models like GPT
    • Encoder-decoder models like T5
    • Applications in translation, generation, and understanding
  4. Implementation details and practical examples:

    • Building transformer components in PyTorch
    • Using pre-trained models for generation and MLM
  5. Current limitations and future directions:

    • Efficient transformer variants
    • Emerging research directions

Transformers have fundamentally changed how we approach NLP tasks, enabling a new generation of powerful language models. Understanding their architecture is crucial for working with modern NLP systems and developing new applications.

In our next lesson, we'll explore how transformer architectures are used in large language models (LLMs) and how techniques like fine-tuning, prompt engineering, and RLHF have enabled the creation of increasingly capable AI systems.

Practice Exercises

  1. Implement Self-Attention:

    • Write a simplified version of the self-attention mechanism
    • Visualize attention weights for a sample sentence
    • Experiment with different scaling factors
  2. Positional Encoding Analysis:

    • Implement sinusoidal positional encoding
    • Analyze how different positions are represented
    • Visualize positional encoding vectors
  3. Transformer Architecture Comparison:

    • Compare performance of RNN vs. Transformer on a simple task
    • Measure inference time for both architectures
    • Analyze computational complexity at different sequence lengths
  4. Pre-trained Model Exploration:

    • Fine-tune a small pre-trained transformer for a classification task
    • Analyze attention patterns in different heads
    • Experiment with different layer freezing strategies

Additional Resources