Advanced Tokenization Techniques

Overview

In our previous lesson, we introduced basic tokenization methods like word and character tokenization. While these approaches are intuitive, they have significant limitations when handling large vocabularies, out-of-vocabulary words, and morphologically rich languages.

Modern NLP models like BERT, GPT, and T5 rely on more sophisticated tokenization strategies. This lesson focuses on subword tokenization techniques that have revolutionized NLP by finding a sweet spot between character-level and word-level representations.

Learning Objectives

After completing this lesson, you will be able to:

  • Understand the limitations of traditional tokenization approaches
  • Explain how modern subword tokenization algorithms work
  • Compare different subword tokenization methods (BPE, WordPiece, SentencePiece)
  • Implement and use subword tokenizers in practice
  • Select appropriate tokenization strategies for different NLP tasks

The Need for Subword Tokenization

Limitations of Word-Level Tokenization

Word tokenization seemed intuitive in our previous lesson, but it has several critical weaknesses:

  1. Vocabulary Explosion: Languages are productive — they can generate a virtually unlimited number of words through compounding, inflection, and derivation.

  2. Out-of-Vocabulary (OOV) Words: Any word not seen during training becomes an <UNK> (unknown) token, losing all semantic information (see the toy example after this list).

  3. Morphological Blindness: The tokens "play", "playing", and "played" are treated as completely different words, even though they share the same root.

  4. Rare Words Problem: Infrequent words have sparse statistics, making it difficult for models to learn good representations.
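
To make limitation 2 concrete, here is a toy word-level tokenizer; the five-word vocabulary is made up purely for illustration:

python
# A toy word-level tokenizer: anything outside the training vocabulary collapses to <UNK>
vocab = {"the", "cat", "sat", "on", "mat"}

def word_tokenize(text):
    return [w if w in vocab else "<UNK>" for w in text.lower().split()]

print(word_tokenize("The cat sat on the hammock"))
# ['the', 'cat', 'sat', 'on', 'the', '<UNK>']  -- 'hammock' loses all meaning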

Analogy: Word Construction as Lego Blocks

Think of words as structures built from smaller reusable pieces, like Lego blocks. Rather than trying to pre-manufacture every possible structure (word), we can provide the fundamental blocks and rules for combining them.

  • In English: "un" + "break" + "able" = "unbreakable"
  • In German: "Grund" + "gesetz" = "Grundgesetz" (Germany's constitution, literally "basic law")


Byte-Pair Encoding (BPE)

BPE is one of the most widely used subword tokenization algorithms, employed by models like GPT (OpenAI) and BART (Facebook).

History and Origins

Originally developed as a data compression algorithm by Philip Gage in 1994, BPE was adapted for NLP by Rico Sennrich and colleagues in 2016 for neural machine translation.

How BPE Works

BPE follows a simple yet effective procedure:

  1. Initialize vocabulary with individual characters
  2. Count all symbol pairs in the corpus
  3. Merge the most frequent pair
  4. Repeat steps 2-3 until desired vocabulary size or stopping criterion is reached

Detailed Example

Let's walk through a simplified example:

Starting text corpus:

low lower lowest

Initialization: Split into characters (plus a special end-of-word token "_")

l o w _ l o w e r _ l o w e s t _

Iterative merging:

  1. Most frequent pair: 'l' and 'o' (3 occurrences) → 'lo'
     lo w _ lo w e r _ lo w e s t _
  2. Most frequent pair: 'lo' and 'w' (3 occurrences) → 'low'
     low _ low e r _ low e s t _
  3. Most frequent pair: 'low' and 'e' (2 occurrences) → 'lowe'
     low _ lowe r _ lowe s t _
  4. Every remaining pair now occurs only once, so the next merges depend on the tie-breaking rule (or we simply stop here).

The final vocabulary would be: {'l', 'o', 'w', 'e', 'r', 's', 't', '_', 'lo', 'low', 'lowe'}


Python Implementation

Here's a simplified implementation of BPE training:

python
from collections import Counter
import re

def get_stats(vocab):
    """Count how often each adjacent symbol pair occurs across the corpus."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, v_in):
    """Replace every occurrence of the given symbol pair with its merged form."""
    v_out = {}
    bigram = re.escape(' '.join(pair))
    p = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word in v_in:
        w_out = p.sub(''.join(pair), word)
        v_out[w_out] = v_in[word]
    return v_out

def train_bpe(text, num_merges):
    # Tokenize into words
    words = text.split()

    # Initial vocabulary: character-level
    vocab = Counter()
    for word in words:
        word = ' '.join(list(word)) + ' </w>'  # split into chars, add end-of-word token
        vocab[word] += 1

    # Perform merges
    merges = []
    for i in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break

        best = max(pairs, key=pairs.get)
        merges.append(best)

        vocab = merge_vocab(best, vocab)

    return vocab, merges
Applications of BPE

  • OpenAI's GPT models (GPT-2, GPT-3, GPT-4)
  • Facebook's BART and RoBERTa
  • Hugging Face's Tokenizers library

WordPiece Tokenization

WordPiece is another subword algorithm, famously used in Google's BERT and related models.

How WordPiece Differs from BPE

Unlike BPE, which selects pairs based on frequency, WordPiece uses a likelihood-based approach:

  1. Initialize vocabulary with individual characters
  2. Calculate the likelihood increase for each possible merge
  3. Perform the merge that maximizes likelihood
  4. Repeat until desired vocabulary size

The score reflects how much the likelihood of the training corpus under a language model increases when the two symbols are merged.

WordPiece Algorithm

Given a language model $p(w_1, \ldots, w_n)$, the likelihood gain from merging symbols $a$ and $b$ is:

$$\text{score}(a, b) = \frac{p(ab)}{p(a) \cdot p(b)}$$

Intuitively, this prioritizes merges that create meaningful subwords over just frequent ones.
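
For a quick sense of scale (with made-up probabilities): if $p(\text{er}) = 0.001$, $p(\text{e}) = 0.05$, and $p(\text{r}) = 0.01$, the score is $0.001 / (0.05 \times 0.01) = 2$, meaning 'er' occurs twice as often as the two symbols would co-occur by chance, so it is a good merge candidate; a score near 1 indicates the pair co-occurs only at chance level.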

Unique Characteristics

  1. Prefix Marking: WordPiece marks subword units with '##' prefix (except for the first piece)
  2. Out-of-Vocabulary Handling: Unknown words are broken into smaller subwords or individual characters
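
Both characteristics show up in the greedy longest-match-first segmentation that WordPiece-style tokenizers (including BERT's) apply at inference time. Here is a minimal sketch; the toy vocabulary is assumed purely for illustration:

python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        cur_piece = None
        # Try the longest remaining substring first, shrinking until it is in the vocabulary
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the '##' prefix
            if piece in vocab:
                cur_piece = piece
                break
            end -= 1
        if cur_piece is None:
            return [unk_token]  # no valid segmentation: fall back to the unknown token
        tokens.append(cur_piece)
        start = end
    return tokens

# Toy vocabulary, assumed purely for illustration
vocab = {"un", "break", "##break", "##able", "play", "##ing"}
print(wordpiece_tokenize("unbreakable", vocab))  # ['un', '##break', '##able']
print(wordpiece_tokenize("playing", vocab))      # ['play', '##ing']
print(wordpiece_tokenize("xyzzy", vocab))        # ['[UNK]']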


Implementation Example

While Google has not released its original WordPiece training code, we can train a comparable tokenizer with Hugging Face's Tokenizers library:

python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

# Initialize a tokenizer
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# Pre-tokenize on whitespace
tokenizer.pre_tokenizer = Whitespace()

# Train the tokenizer
trainer = WordPieceTrainer(
    vocab_size=30000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
)

# Load text files and train
tokenizer.train(files=["data.txt"], trainer=trainer)

# Save the tokenizer
tokenizer.save("wordpiece-tokenizer.json")
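
Once trained, the tokenizer can be applied directly. The exact pieces depend on whatever training data was used, so treat the output comments below as illustrative only:

python
output = tokenizer.encode("Tokenization splits text into subword units!")
print(output.tokens)  # e.g. ['token', '##ization', 'splits', ...] -- depends on the training corpus
print(output.ids)     # the corresponding vocabulary indices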

Applications of WordPiece

  • Google's BERT
  • Google's DistilBERT, ALBERT, and ELECTRA
  • Many multilingual models

SentencePiece

SentencePiece, developed by Google, is a language-agnostic tokenizer that treats the input as a raw stream of Unicode characters.

Key Features

  1. Language Agnostic: Works with any language without language-specific preprocessing
  2. Whitespace Preservation: Treats spaces as ordinary characters (internally marking them with the visible '▁' symbol)
  3. Direct Raw Text Processing: No need for pre-tokenization
  4. Reversible Tokenization: Can perfectly recover the original text

How SentencePiece Works

SentencePiece combines principles from both BPE and Unigram language models:

  1. BPE Mode: Similar to standard BPE, but operates on raw text
  2. Unigram Mode: Uses a unigram language model to find the most likely segmentation

SentencePiece Unigram Model

The Unigram model defines the probability of a tokenized sequence $\mathbf{x} = (x_1, \ldots, x_m)$ as:

$$P(\mathbf{x}) = \prod_{i=1}^{m} p(x_i)$$

where $x_i$ is a subword token and $p(x_i)$ is its probability.

It starts with a large candidate vocabulary and iteratively prunes the tokens whose removal least reduces the likelihood of the training data, until the target vocabulary size is reached.
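
As a small worked example with made-up probabilities: suppose $p(\text{low}) = 0.04$, $p(\text{est}) = 0.03$, $p(\text{lo}) = 0.05$, and $p(\text{west}) = 0.001$. For the word "lowest", the segmentation (low, est) has probability $0.04 \times 0.03 = 1.2 \times 10^{-3}$, while (lo, west) has probability $0.05 \times 0.001 = 5 \times 10^{-5}$, so the model prefers (low, est). The best segmentation over all candidates is found efficiently with the Viterbi algorithm.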


Implementation

python
import sentencepiece as spm

# Train SentencePiece model
spm.SentencePieceTrainer.train(
    input='data.txt',
    model_prefix='sentencepiece',
    vocab_size=8000,
    model_type='unigram',  # or 'bpe'
    character_coverage=0.9995,
    normalization_rule_name='nmt_nfkc'
)

# Load the model
sp = spm.SentencePieceProcessor()
sp.load('sentencepiece.model')

# Encode and decode
text = "SentencePiece is an unsupervised text tokenizer."
encoded = sp.encode(text, out_type=str)
decoded = sp.decode(encoded)

print(f"Original: {text}")
print(f"Tokens: {encoded}")
print(f"Decoded: {decoded}")

Applications of SentencePiece

  • Google's T5 and PaLM models
  • Meta AI's LLaMA models
  • XLNet and many multilingual models
  • Particularly popular for non-English and multilingual models

Comparison of Tokenization Methods

Performance Across Languages

Because it needs no language-specific pre-tokenization, SentencePiece copes best with languages that lack whitespace word boundaries, which is why it is the usual choice for non-English and multilingual models; BPE and WordPiece, as used above, assume the text has already been split into words.

Feature Comparison

  • BPE: merges the most frequent symbol pair at each step; operates on pre-tokenized words; used by GPT-2/3/4, RoBERTa, and BART.
  • WordPiece: merges the pair that most increases the training data's likelihood; marks continuation pieces with '##'; used by BERT, DistilBERT, ALBERT, and ELECTRA.
  • SentencePiece: works directly on raw text, treating whitespace as an ordinary character; offers both BPE and Unigram modes; fully reversible; used by T5, XLNet, and LLaMA.

Advanced Topics

Tokenization Implications for Model Performance

The choice of tokenization strategy has profound effects on:

  1. Model Size: Vocabulary size directly determines the number of embedding-layer parameters (see the estimate after this list)
  2. Training Efficiency: Tokenizers that produce shorter token sequences reduce the computation needed per training example
  3. Language Support: Some tokenizers handle certain languages better
  4. Model Generalization: Good subword tokenization improves generalization to new words
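
To put the model-size point in perspective: the embedding layer holds roughly vocab_size × hidden_size parameters, so a 30,000-token vocabulary with 768-dimensional embeddings (BERT-base-like settings) already accounts for about 23 million parameters, before counting a single transformer layer.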

Tokenization Challenges

  1. Language Boundaries: Not all languages use spaces or have clear word boundaries
  2. Morphologically Rich Languages: Languages like Finnish or Turkish have complex word structures
  3. Code-Switching: Handling text that mixes multiple languages
  4. Non-linguistic Content: Emojis, URLs, hashtags, code snippets

Beyond Subword Tokenization

Research continues to improve tokenization:

  1. Character-level Transformers: Bypass tokenization entirely
  2. Byte-level BPE: GPT-2/3/4 use byte-level BPE to handle any Unicode character
  3. Dynamic Tokenization: Adapt tokenization based on the input
  4. Tokenization-free Models: Some experimental approaches try to work directly with raw text

Practical Implementation

Choosing the Right Tokenizer

Guidelines for selecting a tokenizer:

  1. Task Alignment: Match your tokenizer with your downstream task
  2. Model Compatibility: If fine-tuning, use the original model's tokenizer
  3. Language Support: Consider language-specific needs
  4. Vocabulary Size: Balance between coverage and computational efficiency

Tokenization in the Hugging Face Ecosystem

The Hugging Face transformers library, backed by the fast Tokenizers library, provides ready-to-use tokenizers for all major tokenization algorithms:

python
from transformers import AutoTokenizer

# Load pre-trained tokenizers
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
t5_tokenizer = AutoTokenizer.from_pretrained("t5-base")

# Example text
text = "Tokenization splits text into subword units!"

# Compare tokenization results
print("BERT (WordPiece):", bert_tokenizer.tokenize(text))
print("GPT-2 (BPE):", gpt2_tokenizer.tokenize(text))
print("T5 (SentencePiece):", t5_tokenizer.tokenize(text))


Summary

In this lesson, we've covered:

  1. The limitations of traditional tokenization approaches
  2. Byte-Pair Encoding (BPE) algorithm and its applications
  3. WordPiece tokenization used in BERT and related models
  4. SentencePiece for language-agnostic tokenization
  5. Practical considerations for choosing and implementing tokenizers

Modern NLP's success relies heavily on these advanced tokenization techniques, which bridge the gap between character-level and word-level representations.

In our next lesson, we'll explore word embeddings, starting from traditional approaches like Word2Vec and GloVe, before moving to the contextual representations that power today's most advanced models.

Practice Exercises

  1. Implement a simple BPE tokenizer from scratch and train it on a small corpus.
  2. Compare tokenization results from different algorithms on texts from various languages and domains.
  3. Experiment with vocabulary size to see how it affects tokenization granularity.
  4. Fine-tune a pretrained model using a different tokenizer and evaluate the performance impact.

Additional Resources