Pre-Transformer Models: RNN, LSTM, and GRU

Recurrent Neural Networks (RNNs): Processing Sequential Data

Overview

In our previous lessons, we've explored word representations from static embeddings to contextual embeddings. But a critical question remains: how do we effectively process sequences of these word representations to understand the full meaning of sentences, paragraphs, and documents?

This lesson introduces Recurrent Neural Networks (RNNs), the foundational architecture for sequential data processing in NLP. Before transformers became the dominant paradigm, RNNs and their variants (LSTM, GRU) were the state-of-the-art for tasks like language modeling, machine translation, and sentiment analysis.

Learning Objectives

After completing this lesson, you will be able to:

  • Understand why sequential data requires specialized neural architectures
  • Explain the basic RNN architecture and its recurrence mechanism
  • Describe the vanishing/exploding gradient problems in vanilla RNNs
  • Compare LSTM and GRU architectures and their advantages
  • Implement RNN variants for common NLP tasks
  • Recognize the limitations that led to the transformer revolution

The Sequential Nature of Language

The Challenge of Variable-Length Input

Traditional neural networks expect fixed-size inputs, but language is inherently variable in length:

  • Sentences can be short ("I agree.") or very long
  • Documents can range from tweets to novels
  • Conversations can have arbitrary turns and lengths

How do we design neural networks that can handle this variability while preserving the sequential relationships?

Analogy: Understanding Music

Consider how you understand music. A single note in isolation gives limited information, but as you hear sequences of notes, you build an understanding of the melody, rhythm, and emotional tone.

If you were to hear only random isolated notes, you'd lose the temporal patterns that make music meaningful. Similarly, to understand language, we need to process words not in isolation, but as part of a meaningful sequence while maintaining the memory of what came before.

Why Feed-Forward Networks Fall Short

Standard feed-forward networks are a poor fit for language for two reasons:

  • They expect a fixed-size input, so variable-length sentences must be truncated or padded.
  • They have no memory: each input is processed independently, so word order and the context accumulated from earlier words are lost.

Recurrent Neural Networks: The Basic Architecture

The Recurrence Mechanism

The key innovation in RNNs is the recurrence mechanism: the network maintains a hidden state (or "memory") that is updated at each time step based on both the current input and the previous hidden state.

Basic RNN Architecture


Mathematical Formulation

At each time step $t$, the RNN computes:

$$\mathbf{h}_t = f(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{hx}\mathbf{x}_t + \mathbf{b}_h)$$

$$\mathbf{y}_t = g(\mathbf{W}_{yh}\mathbf{h}_t + \mathbf{b}_y)$$

Where:

  • $\mathbf{x}_t$ is the input at time step $t$ (e.g., a word embedding)
  • $\mathbf{h}_t$ is the hidden state at time step $t$
  • $\mathbf{h}_{t-1}$ is the hidden state from the previous time step
  • $\mathbf{y}_t$ is the output at time step $t$
  • $\mathbf{W}_{hh}$, $\mathbf{W}_{hx}$, and $\mathbf{W}_{yh}$ are weight matrices
  • $\mathbf{b}_h$ and $\mathbf{b}_y$ are bias vectors
  • $f$ is typically a tanh or ReLU activation function
  • $g$ is an output activation function (e.g., softmax for classification)
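
To make the recurrence concrete, here is a minimal NumPy sketch of the forward pass under these equations. It assumes a tanh hidden activation and a softmax output; the dimensions, random initialization, and `softmax` helper are illustrative choices, not part of the lesson.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Illustrative dimensions (assumptions for this sketch)
input_dim, hidden_dim, output_dim = 50, 64, 10
rng = np.random.default_rng(0)

W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hx = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_yh = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

def rnn_forward(inputs):
    """Run the recurrence over a list of input vectors x_1..x_T."""
    h = np.zeros(hidden_dim)          # initial hidden state h_0
    outputs = []
    for x_t in inputs:
        # h_t = tanh(W_hh h_{t-1} + W_hx x_t + b_h)
        h = np.tanh(W_hh @ h + W_hx @ x_t + b_h)
        # y_t = softmax(W_yh h_t + b_y)
        outputs.append(softmax(W_yh @ h + b_y))
    return outputs, h

# The same weights are reused at every step, so any sequence length works
seq = [rng.normal(size=input_dim) for _ in range(7)]
ys, h_final = rnn_forward(seq)
```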

Parameter Sharing

A key advantage of RNNs is parameter sharing across time steps. The same weights are used at each step, which:

  • Drastically reduces the number of parameters
  • Allows processing sequences of any length
  • Enables the network to recognize patterns regardless of position

Training RNNs: Backpropagation Through Time (BPTT)

RNNs are trained using an extension of backpropagation called Backpropagation Through Time (BPTT), which unfolds the recurrent network through time and treats it as a deep feed-forward network.

Unfolding the RNN

Unfolded across time, the RNN becomes a chain of identical layers, one per time step, all sharing the same weights. Gradients are computed by propagating the loss backward along this chain, which is why long sequences behave like very deep networks during training.

The Vanishing and Exploding Gradient Problems

When training RNNs on long sequences, two critical problems emerge:

  1. Vanishing Gradients:

    • Gradients become extremely small as they're propagated back in time
    • Early time steps receive minimal updates
    • Network fails to learn long-range dependencies
    • Example: Forgetting the subject of a sentence when predicting the verb
  2. Exploding Gradients:

    • Gradients grow exponentially large
    • Weights update by huge amounts
    • Training becomes unstable
    • Often results in NaN values

Visualizing Gradient Flow in RNNs

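The effect is easy to demonstrate numerically. In the sketch below (a simplification that ignores the tanh derivative), a gradient vector is repeatedly multiplied by $\mathbf{W}_{hh}^\top$ as it flows back through time, so its norm shrinks or grows depending on the largest singular value of the recurrent weight matrix. All dimensions and scales are illustrative.

```python
import numpy as np

def gradient_norms(W_hh, steps=50):
    """Track the norm of a gradient vector propagated back through time.
    Backpropagating through h_t = tanh(W_hh h_{t-1} + ...) multiplies by
    diag(1 - h^2) W_hh^T at each step; here we drop the tanh factor."""
    rng = np.random.default_rng(0)
    grad = rng.normal(size=W_hh.shape[0])
    norms = []
    for _ in range(steps):
        grad = W_hh.T @ grad
        norms.append(np.linalg.norm(grad))
    return norms

hidden_dim = 64
rng = np.random.default_rng(1)
base = rng.normal(size=(hidden_dim, hidden_dim))
base /= np.linalg.norm(base, 2)      # rescale so the largest singular value is 1

vanishing = gradient_norms(0.9 * base)   # spectral norm < 1: gradients shrink
exploding = gradient_norms(1.1 * base)   # spectral norm > 1: gradients grow

print("after 50 steps:", vanishing[-1], exploding[-1])
```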

Long Short-Term Memory (LSTM): Solving the Long-Term Dependency Problem

To address the vanishing gradient problem, Hochreiter and Schmidhuber introduced the Long Short-Term Memory (LSTM) architecture in 1997. LSTMs use a more complex recurrent unit with gates that control information flow.

LSTM Architecture


The Gate Mechanism

An LSTM cell contains three gates that regulate information flow:

  1. Forget Gate: Decides what information to discard from the cell state
  2. Input Gate: Decides what new information to store in the cell state
  3. Output Gate: Decides what parts of the cell state to output

Mathematical Formulation

For input $\mathbf{x}_t$ at time step $t$:

Forget Gate:
$$\mathbf{f}_t = \sigma(\mathbf{W}_f \cdot [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f)$$

Input Gate:
$$\mathbf{i}_t = \sigma(\mathbf{W}_i \cdot [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i)$$
$$\tilde{\mathbf{C}}_t = \tanh(\mathbf{W}_C \cdot [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_C)$$

Cell State Update:
$$\mathbf{C}_t = \mathbf{f}_t \odot \mathbf{C}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{C}}_t$$

Output Gate:
$$\mathbf{o}_t = \sigma(\mathbf{W}_o \cdot [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o)$$
$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{C}_t)$$

Where:

  • $\sigma$ is the sigmoid function
  • $\odot$ denotes element-wise multiplication
  • $\mathbf{C}_t$ is the cell state at time $t$
  • $\mathbf{h}_t$ is the hidden state at time $t$
  • $\mathbf{W}$ and $\mathbf{b}$ are the corresponding weight matrices and bias vectors
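
A minimal NumPy sketch of a single LSTM step following the equations above; the dimensions, the random initialization, and the `sigmoid` helper are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

input_dim, hidden_dim = 50, 64
rng = np.random.default_rng(0)

# One weight matrix and bias per gate, acting on the concatenation [h_{t-1}, x_t]
def init():
    return rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim)), np.zeros(hidden_dim)

W_f, b_f = init()
W_i, b_i = init()
W_C, b_C = init()
W_o, b_o = init()

def lstm_step(x_t, h_prev, C_prev):
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # forget gate
    i_t = sigmoid(W_i @ z + b_i)           # input gate
    C_tilde = np.tanh(W_C @ z + b_C)       # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde     # cell state update
    o_t = sigmoid(W_o @ z + b_o)           # output gate
    h_t = o_t * np.tanh(C_t)               # new hidden state
    return h_t, C_t

h, C = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in [rng.normal(size=input_dim) for _ in range(5)]:
    h, C = lstm_step(x_t, h, C)
```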

Memory Management Analogy

Think of the LSTM cell as a skilled personal assistant managing your information flow:

  • Forget Gate: Like clearing your desk of irrelevant papers
  • Input Gate: Like deciding which new information deserves to be filed away
  • Cell State: Like your organized filing cabinet of important information
  • Output Gate: Like preparing a briefing of only the relevant information you need right now

Addressing Long-Term Dependencies

LSTMs excel at capturing long-term dependencies through their explicit memory mechanism: because the cell state is updated additively and the forget gate can stay close to 1, information (and gradients) can flow across many time steps largely unchanged instead of being squashed through a nonlinearity at every step.

Gated Recurrent Unit (GRU): A Streamlined Alternative

Introduced in 2014 by Cho et al., the Gated Recurrent Unit (GRU) is a simplified variant of the LSTM that combines the forget and input gates into a single "update gate."

GRU Architecture


Mathematical Formulation

For input $\mathbf{x}_t$ at time step $t$:

Update Gate:
$$\mathbf{z}_t = \sigma(\mathbf{W}_z \cdot [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_z)$$

Reset Gate:
$$\mathbf{r}_t = \sigma(\mathbf{W}_r \cdot [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_r)$$

Candidate Hidden State:
$$\tilde{\mathbf{h}}_t = \tanh(\mathbf{W} \cdot [\mathbf{r}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b})$$

Final Hidden State:
$$\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t$$
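
For comparison, the same kind of NumPy sketch for a single GRU step (again with illustrative dimensions and initialization):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

input_dim, hidden_dim = 50, 64
rng = np.random.default_rng(0)

W_z = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim)); b_z = np.zeros(hidden_dim)
W_r = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim)); b_r = np.zeros(hidden_dim)
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim)); b_h = np.zeros(hidden_dim)

def gru_step(x_t, h_prev):
    z_in = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ z_in + b_z)                                      # update gate
    r_t = sigmoid(W_r @ z_in + b_r)                                      # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)   # candidate state
    return (1 - z_t) * h_prev + z_t * h_tilde                            # final hidden state

h = np.zeros(hidden_dim)
for x_t in [rng.normal(size=input_dim) for _ in range(5)]:
    h = gru_step(x_t, h)
```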

LSTM vs. GRU: Comparison

  • Gates: LSTM uses three gates (forget, input, output); GRU uses two (update, reset)
  • Memory: LSTM maintains a separate cell state alongside the hidden state; GRU uses a single hidden state
  • Parameters: GRU has fewer parameters per unit, so it trains somewhat faster and can be less prone to overfitting on small datasets
  • Performance: in practice the two often perform comparably, and the better choice is usually determined empirically for each task

Bidirectional RNNs: Capturing Context from Both Directions

In many NLP tasks, understanding a word requires context from both past and future words. Bidirectional RNNs process the sequence in both forward and backward directions.

Bidirectional Architecture

Two recurrent layers run over the input, one left to right and one right to left, and their hidden states are concatenated at each position, so every output sees both past and future context.
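
As a sketch of how this looks in code: Keras wraps any recurrent layer in `Bidirectional`, which runs a forward and a backward copy and concatenates their outputs. The vocabulary size, tag count, and other hyperparameters below are placeholders for illustration.

```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, TimeDistributed

vocab_size = 10000   # placeholder vocabulary size
num_tags = 17        # placeholder number of POS/NER tags
max_len = 50         # placeholder sequence length

# A simple bidirectional LSTM sequence tagger: each position gets a label,
# predicted from both left and right context.
model = Sequential([
    Embedding(vocab_size, 128, input_length=max_len),
    Bidirectional(LSTM(64, return_sequences=True)),   # outputs are 2 * 64 = 128-dimensional
    TimeDistributed(Dense(num_tags, activation='softmax'))
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()
```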

Benefits for NLP Tasks

Bidirectional processing is especially valuable for:

  • Named Entity Recognition
  • Part-of-Speech Tagging
  • Machine Translation
  • Question Answering

Example: Disambiguating Word Sense

The word "bank" has different meanings depending on context:

  • "The bank approved my loan application." (financial institution)
  • "The bank was steep and covered in wildflowers." (edge of a river)

A purely left-to-right model has not yet seen the disambiguating words ("approved my loan", "steep and covered in wildflowers") when it reaches "bank"; a bidirectional model incorporates them through its backward pass.

Common NLP Applications of RNNs

Language Modeling

Language modeling is the task of predicting the next word given a sequence of previous words. This is a fundamental NLP task with applications in:

  • Speech recognition
  • Machine translation
  • Text generation
  • Spelling correction

Code Example: Simple Character-Level Language Model

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Sample text data
text = """Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence
concerned with the interactions between computers and human language, in particular how to program computers to
process and analyze large amounts of natural language data."""

# Prepare the character vocabulary for a character-level model
chars = sorted(list(set(text)))
char_to_idx = {char: i for i, char in enumerate(chars)}
idx_to_char = {i: char for i, char in enumerate(chars)}

# Create training sequences
seq_length = 40
sequences = []
next_chars = []

for i in range(len(text) - seq_length):
    sequences.append(text[i:i + seq_length])
    next_chars.append(text[i + seq_length])

# One-hot encode sequences
X = np.zeros((len(sequences), seq_length, len(chars)), dtype=bool)
y = np.zeros((len(sequences), len(chars)), dtype=bool)

for i, sequence in enumerate(sequences):
    for t, char in enumerate(sequence):
        X[i, t, char_to_idx[char]] = 1
    y[i, char_to_idx[next_chars[i]]] = 1

# Build the model
model = Sequential([
    LSTM(128, input_shape=(seq_length, len(chars)), return_sequences=True),
    LSTM(128),
    Dense(len(chars), activation='softmax')
])

model.compile(loss='categorical_crossentropy', optimizer='adam')

# Model summary
model.summary()

# Example of a text generation function
def generate_text(model, seed_text, num_chars=100):
    generated = seed_text

    for _ in range(num_chars):
        # One-hot encode the current seed sequence
        x_pred = np.zeros((1, seq_length, len(chars)))
        for t, char in enumerate(seed_text):
            x_pred[0, t, char_to_idx[char]] = 1

        # Predict the next-character distribution and sample from it
        preds = model.predict(x_pred, verbose=0)[0].astype('float64')
        preds = preds / preds.sum()   # renormalize so probabilities sum to 1 for sampling
        next_index = np.random.choice(len(chars), p=preds)
        next_char = idx_to_char[next_index]

        # Update the generated text and slide the seed window forward
        generated += next_char
        seed_text = seed_text[1:] + next_char

    return generated

# After training, you could generate text like (seed should be seq_length characters):
# generated_text = generate_text(model, text[:seq_length])
```

Sentiment Analysis

Sentiment analysis determines the emotional tone behind text, often used for customer reviews, social media monitoring, and brand analysis.

Code Example: Sentiment Classification with LSTM

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

# Sample data
texts = [
    "This movie was fantastic! I really enjoyed it.",
    "The plot was intriguing and kept me engaged.",
    "Terrible movie, waste of time and money.",
    "I hated the characters and the story made no sense.",
    "The acting was superb and the cinematography was beautiful.",
    "What a disappointment, I expected much better."
]
labels = [1, 1, 0, 0, 1, 0]  # 1 for positive, 0 for negative

# Tokenize the texts
max_words = 1000
max_len = 100
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
data = pad_sequences(sequences, maxlen=max_len)

# Build LSTM model for sentiment analysis
model = Sequential([
    Embedding(max_words, 128, input_length=max_len),
    LSTM(64, dropout=0.2, recurrent_dropout=0.2),
    Dense(32, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

# Train the model
# model.fit(data, np.array(labels), epochs=10, batch_size=2, validation_split=0.2)

# Example prediction
def predict_sentiment(text):
    sequence = tokenizer.texts_to_sequences([text])
    padded = pad_sequences(sequence, maxlen=max_len)
    prediction = model.predict(padded, verbose=0)[0][0]
    return f"Positive sentiment: {prediction:.2f}, Negative sentiment: {1 - prediction:.2f}"
```

Machine Translation with Encoder-Decoder Architecture

Machine translation uses a sequence-to-sequence (Seq2Seq) architecture with an encoder RNN and a decoder RNN.


Code Example: Simple Encoder-Decoder for Translation

```python
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding

# Parameters
num_encoder_tokens = 5000   # Source vocabulary size
num_decoder_tokens = 6000   # Target vocabulary size
latent_dim = 256            # LSTM units
embedding_dim = 128         # Embedding dimensions

# Encoder
encoder_inputs = Input(shape=(None,))
encoder_embedding = Embedding(num_encoder_tokens, embedding_dim)(encoder_inputs)
encoder_lstm = LSTM(latent_dim, return_state=True)
_, state_h, state_c = encoder_lstm(encoder_embedding)
encoder_states = [state_h, state_c]  # LSTM states summarizing the source sequence

# Decoder
decoder_inputs = Input(shape=(None,))
decoder_embedding = Embedding(num_decoder_tokens, embedding_dim)(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Model definition
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

# For inference (after training):
# 1. Encode the input sequence to get the state vectors
encoder_model = Model(encoder_inputs, encoder_states)

# 2. Set up a decoder model that accepts states and produces outputs
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_embedding, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)
```

RNNs with Attention Mechanism: A Step Toward Transformers

The attention mechanism, introduced by Bahdanau et al. in 2014, was a critical advancement that addressed limitations of the encoder-decoder architecture, particularly for long sequences.

The Problem: Information Bottleneck

In the basic encoder-decoder architecture, the entire source sequence is compressed into a fixed-size vector, creating an information bottleneck.

Attention Mechanism

Attention allows the decoder to "focus" on different parts of the source sequence at each decoding step.


Mathematical Formulation

  1. Calculate alignment scores between the previous decoder state $\mathbf{s}_{t-1}$ and each encoder state $\mathbf{h}_j$: $e_{tj} = f(\mathbf{s}_{t-1}, \mathbf{h}_j)$

  2. Normalize the scores to obtain attention weights: $\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{T_x}\exp(e_{tk})}$

  3. Compute the context vector as the weighted sum of encoder states: $\mathbf{c}_t = \sum_{j=1}^{T_x} \alpha_{tj} \mathbf{h}_j$

  4. Generate the output from the current decoder state and the context vector: $\mathbf{y}_t = g(\mathbf{s}_t, \mathbf{c}_t)$
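
A small NumPy sketch of these four steps, using a dot-product scoring function for $f$ (Bahdanau's original formulation uses a small feed-forward network instead); the dimensions and random values are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
hidden_dim, T_x = 64, 8

H = rng.normal(size=(T_x, hidden_dim))     # encoder states h_1 .. h_Tx
s_prev = rng.normal(size=hidden_dim)       # previous decoder state s_{t-1}

# 1. Alignment scores e_tj = f(s_{t-1}, h_j); here f is a dot product
e = H @ s_prev

# 2. Attention weights via softmax
alpha = softmax(e)

# 3. Context vector as a weighted sum of encoder states
c_t = alpha @ H

# 4. The decoder would now combine s_t and c_t to produce y_t
print(alpha.round(3), c_t.shape)
```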

The Bridge to Transformers

The attention mechanism was a crucial step toward the transformer architecture:

  • Eliminated the bottleneck of fixed-size context vectors
  • Allowed direct connections between distant positions
  • Introduced the concept of weighted importance between elements
  • Provided a foundation for self-attention in transformers

Limitations of RNNs and the Path to Transformers

Despite their innovations, RNNs (even with LSTM/GRU and attention) have several limitations:

Sequential Processing

RNNs process tokens one at a time: the hidden state at step $t$ cannot be computed until step $t-1$ is finished. This makes them inherently difficult to parallelize across the time dimension, so training on long sequences and large corpora is slow compared to architectures that can process all positions at once.

Limited Effective Context

Even with gating mechanisms, RNNs struggle to maintain very long-range dependencies: information must still pass through every intermediate hidden state, so for long documents the usable context is in practice much shorter than the full sequence.

Emergence of Transformers

The transformer architecture addressed these limitations by:

  1. Parallelization: Processing all tokens simultaneously
  2. Direct connections: Allowing each position to attend to all positions
  3. Multi-head attention: Capturing different types of relationships
  4. Positional encoding: Maintaining sequence order without recurrence

Summary

In this lesson, we've covered:

  1. The sequential nature of language and why it requires specialized architectures
  2. Vanilla RNN architecture and its limitations
  3. LSTM and GRU cells that address the vanishing gradient problem
  4. Bidirectional RNNs for capturing context from both directions
  5. Applications in language modeling, sentiment analysis, and machine translation
  6. Attention mechanisms that paved the way for transformers
  7. Limitations of RNNs that led to the transformer revolution

RNNs represent a crucial chapter in the evolution of NLP architectures. Although transformers have largely superseded them for most tasks, understanding RNNs is essential for appreciating the motivations behind modern architectures, and they remain relevant in settings where streaming, token-by-token processing and a small memory footprint are advantages.

In our next lesson, we'll explore transformers in depth, understanding how they revolutionized NLP and enabled the powerful language models we use today.

Practice Exercises

  1. RNN from Scratch:

    • Implement a vanilla RNN in PyTorch or TensorFlow
    • Observe the vanishing gradient problem firsthand
    • Compare training stability across different sequence lengths
  2. LSTM Language Model:

    • Build a character-level language model using LSTMs
    • Generate text samples and analyze coherence
    • Experiment with temperature settings in sampling
  3. Sentiment Analysis Comparison:

    • Implement sentiment classifiers using:
      • Bag-of-words + Logistic Regression
      • Word embeddings + Vanilla RNN
      • Word embeddings + LSTM
      • Word embeddings + Bidirectional LSTM
    • Compare performance and training time
  4. Neural Machine Translation:

    • Implement a simple encoder-decoder model for translation
    • Add an attention mechanism
    • Analyze which source words receive attention for different target words

Additional Resources