Introduction to Text Preprocessing

Overview

Text preprocessing is the foundation of Natural Language Processing (NLP) - think of it as preparing ingredients before cooking a gourmet meal. Just as a chef washes, peels, and chops vegetables before cooking, we must clean, normalize, and transform raw text before feeding it to sophisticated NLP models.

In this lesson, you'll learn the essential techniques for preparing text data so that machines can process and understand language more effectively.

Learning Objectives

After completing this lesson, you will be able to:

  • Understand why text preprocessing is crucial for NLP tasks
  • Apply key text cleaning and normalization techniques
  • Implement different tokenization approaches
  • Compare stemming and lemmatization methods
  • Extract numerical features from text using Bag-of-Words (BoW) and TF-IDF

Why Preprocess Text?

Human language is beautiful, messy, and complex. Consider these examples:

  • "I love NLP!"
  • "i loooove n.l.p!!!!!"
  • "I <3 Natural Language Processing"

A human understands these sentences express the same sentiment, but machines struggle with this variability. Text preprocessing creates consistency that allows models to focus on meaning rather than surface variations.

Analogy: Signal Processing

Think of text preprocessing like cleaning an audio signal. An audio engineer removes background noise, normalizes volume, and enhances clarity before analysis. Similarly, we remove "noise" from text (irrelevant symbols, inconsistent capitalization) and normalize it to reveal the underlying linguistic "signal."

Text Cleaning and Normalization

Common Cleaning Operations

  1. Removing HTML tags and markup

    Raw text scraped from websites often contains HTML tags that add no semantic value.

  2. Converting to lowercase

    Standardizing case prevents models from treating "Apple" and "apple" as different words.

  3. Removing punctuation and numbers

    Punctuation and numbers are often (but not always) removed to focus on the core text.

  4. Removing stopwords

    Common words like "the," "and," "is" that occur frequently but carry little semantic meaning.

  5. Handling contractions

    Expanding contractions like "don't" to "do not" for consistency (a small sketch follows this list).
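
Contraction handling is the one operation not covered by the cleaning function below, so here is a minimal sketch using a small hand-built mapping (the dictionary is illustrative, not exhaustive):

python

import re

# Illustrative (not exhaustive) contraction map
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "it's": "it is",
    "i'm": "i am",
    "won't": "will not",
}

def expand_contractions(text):
    # Replace each known contraction with its expanded form (case-insensitive)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("I don't think it's ready"))
# -> I do not think it is ready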

Implementation Example

Here's a comprehensive text cleaning function:

python

import re
import nltk
from nltk.corpus import stopwords

# Download resources
nltk.download('stopwords')

def clean_text(text, remove_stopwords=True):
    # Convert to lowercase
    text = text.lower()

    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)

    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # Remove stopwords (optional)
    if remove_stopwords:
        stop_words = set(stopwords.words('english'))
        words = text.split()
        text = ' '.join([word for word in words if word not in stop_words])

    return text

Considerations

Text cleaning should be approached thoughtfully, considering your specific task (a small task-aware sketch follows this list):

  • For sentiment analysis, punctuation (like "!") might carry important emotional signals
  • For technical text, numbers and special characters might be crucial
  • For named entity recognition, preserving case information is valuable
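
As an illustration, here is a minimal sketch of a task-aware cleaner; the `task` parameter and its values are illustrative choices, not a standard API:

python

import re

def clean_for_task(text, task="topic"):
    """Toy illustration: keep emphatic punctuation when the task is sentiment analysis."""
    text = text.lower()
    if task == "sentiment":
        # Keep letters, whitespace, and sentiment-bearing punctuation like "!" and "?"
        text = re.sub(r"[^a-z\s!?]", "", text)
    else:
        # Keep letters and whitespace only
        text = re.sub(r"[^a-z\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

print(clean_for_task("I loooove NLP!!!", task="sentiment"))  # keeps the exclamation marks
print(clean_for_task("I loooove NLP!!!"))                    # drops them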

Tokenization: Breaking Text into Meaningful Units

Tokenization is the process of splitting text into smaller units (tokens). Think of it as the difference between seeing a sentence as one long string versus recognizing the individual words or subwords that compose it.

Analogy: Breaking a Puzzle

Consider a jigsaw puzzle. When the pieces arrive in the box, they're all mixed together. Tokenization is like sorting the pieces - before we can understand the picture, we need to identify the individual components.

Types of Tokenization

1. Word Tokenization

The most intuitive approach - split text at whitespace (with some handling of punctuation).

python

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer models required by word_tokenize

text = "Natural language processing is fascinating!"
tokens = word_tokenize(text)
print(tokens)
# Output: ['Natural', 'language', 'processing', 'is', 'fascinating', '!']

Limitations:

  • Out-of-vocabulary words: a model built on a fixed word vocabulary cannot represent unseen words
  • Large vocabulary sizes for morphologically rich languages
  • Struggles with multiword expressions (like "New York")

2. Character Tokenization

Breaks text into individual characters.

python

text = "NLP is cool"
char_tokens = list(text)
print(char_tokens)
# Output: ['N', 'L', 'P', ' ', 'i', 's', ' ', 'c', 'o', 'o', 'l']

Advantages:

  • Small, fixed vocabulary
  • No out-of-vocabulary issues

Disadvantages:

  • Much longer sequences
  • Loses word-level semantics

3. N-gram Tokenization

Creates tokens of n contiguous items (characters or words).

python

from nltk.util import ngrams

text = "Natural language processing"
words = text.split()

# Word bigrams
word_bigrams = list(ngrams(words, 2))
print(word_bigrams)
# Output: [('Natural', 'language'), ('language', 'processing')]

# Character trigrams
char_trigrams = list(ngrams(text.replace(" ", ""), 3))
print(char_trigrams)
# Output: [('N', 'a', 't'), ('a', 't', 'u'), ...]

4. Subword Tokenization

Strikes a balance between word and character tokenization by breaking words into meaningful subunits.

Example: Byte-Pair Encoding (BPE) Intuition

  1. Start with character-level tokens
  2. Identify most frequent adjacent pairs
  3. Merge these pairs
  4. Repeat until the desired vocabulary size is reached (a toy sketch of this loop follows below)
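
Here is a minimal, self-contained sketch of that merge loop on a tiny hand-built corpus (the word frequencies and the `</w>` end-of-word marker are illustrative conventions):

python

from collections import Counter

# Toy corpus: each word is a tuple of symbols plus an end-of-word marker,
# mapped to its frequency in the corpus.
vocab = {
    ('l', 'o', 'w', '</w>'): 5,
    ('l', 'o', 'w', 'e', 'r', '</w>'): 2,
    ('n', 'e', 'w', 'e', 's', 't', '</w>'): 6,
    ('w', 'i', 'd', 'e', 's', 't', '</w>'): 3,
}

def pair_counts(vocab):
    # Count how often each adjacent symbol pair occurs, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the pair with a single merged symbol
    merged = {}
    for word, freq in vocab.items():
        new_word, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

# Run a few merges and watch subword units such as "es", "est", "lo" emerge
for step in range(5):
    pairs = pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(f"Step {step + 1}: merged {best}")

print("Final segmentation:", list(vocab.keys()))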

We'll explore advanced subword tokenization methods in the next lesson.

Stemming and Lemmatization

Both techniques reduce words to their root forms, but they use different approaches.

Stemming

Stemming uses a simple rule-based approach to cut off word endings. It is crude but fast, and the resulting stem is not always a valid word.

python

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "runs", "ran", "easily", "fairly"]

stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)
# Output: ['run', 'run', 'ran', 'easili', 'fairli']

Lemmatization

Lemmatization uses vocabulary and morphological analysis to return dictionary forms (lemmas). Think of it as looking up words in a dictionary to find their base forms.

python

from nltk.stem import WordNetLemmatizer
import nltk

nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

words = ["running", "runs", "ran", "better", "mice"]

# With POS tagging for better results
lemmatized_verbs = [lemmatizer.lemmatize(word, pos='v') for word in words[:3]]
print(lemmatized_verbs)
# Output: ['run', 'run', 'run']

Comparison: Stemming vs. Lemmatization

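One quick way to compare the two is to run them on the same words. Here is a small side-by-side sketch (the word list is arbitrary; the lemmatizer is called with both noun and verb POS tags to show that the part of speech matters):

python

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["studies", "studying", "running", "better", "mice"]

print(f"{'word':<10} {'stem':<10} {'lemma (noun)':<14} {'lemma (verb)':<14}")
for w in words:
    print(f"{w:<10} {stemmer.stem(w):<10} "
          f"{lemmatizer.lemmatize(w, pos='n'):<14} "
          f"{lemmatizer.lemmatize(w, pos='v'):<14}")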

When to Use Each Approach

  • Stemming: When speed is more important than precision
  • Lemmatization: When accuracy matters more than processing time

Feature Extraction: Converting Text to Numbers

Machine learning models work with numbers, not text. Feature extraction converts text into numerical representations.

Bag of Words (BoW)

BoW represents text as an unordered collection (a "bag") of words, keeping word counts or frequencies but discarding word order.

Mathematical Representation

For a vocabulary $V = \{w_1, w_2, ..., w_n\}$ and a document $d$, the BoW representation is a vector $\vec{x}$ where:

$\vec{x} = [c(w_1, d), c(w_2, d), ..., c(w_n, d)]$

where $c(w, d)$ is the count of word $w$ in document $d$.
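
As a tiny hand computation of this formula (the three-word vocabulary is chosen for illustration):

python

# Hand computation of a BoW vector for one document
vocabulary = ["cat", "sat", "the"]
document = "the cat sat on the mat the cat".split()

bow_vector = [document.count(w) for w in vocabulary]
print(bow_vector)  # [2, 1, 3] -> counts of "cat", "sat", "the"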

Implementation

python

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Natural language processing is fascinating.",
    "I love working with text data.",
    "Machine learning models need numerical features."
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("Feature matrix shape:", X.shape)
print("Features for first document:\n", X[0].toarray())

TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF extends BoW by weighting terms based on their importance in a document relative to the entire corpus.

Mathematical Representation

$\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D)$

Where:

  • $\text{TF}(t, d)$ is the term frequency of term $t$ in document $d$
  • $\text{IDF}(t, D) = \log\frac{N}{|\{d \in D : t \in d\}|}$, where $N$ is the total number of documents (a small hand computation follows below)
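
Here is a tiny hand computation of these formulas on a toy corpus, using counts normalized by document length as the TF (one common convention). Note that scikit-learn's TfidfVectorizer applies IDF smoothing and L2 normalization by default, so its values will differ from this textbook version:

python

import math

corpus = [
    "natural language processing",
    "language models",
    "processing text data",
]

term = "language"
doc = corpus[0].split()

tf = doc.count(term) / len(doc)              # term frequency in the first document
df = sum(term in d.split() for d in corpus)  # number of documents containing the term
idf = math.log(len(corpus) / df)             # inverse document frequency
print(round(tf * idf, 3))                    # TF-IDF weight of "language" in document 0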

Implementation

python

from sklearn.feature_extraction.text import TfidfVectorizer

# Reuses the `corpus` list from the BoW example above
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print("TF-IDF features for first document:\n", X[0].toarray())

Limitations of BoW and TF-IDF

  • Lose word order and context
  • Sparse high-dimensional vectors
  • Struggle with synonyms and polysemy (see the small illustration after this list)
  • Out-of-vocabulary words
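
As a small illustration of the synonym problem, here is the cosine similarity between BoW vectors of two near-synonymous sentences (the sentence pair is chosen for illustration):

python

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two phrases with the same meaning but no shared content words
docs = ["a fast car", "a quick automobile"]

X = CountVectorizer().fit_transform(docs)
print(cosine_similarity(X[0], X[1]))  # 0.0 -> BoW sees them as completely unrelated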

Complete Text Processing Pipeline

Let's put everything together in a complete pipeline:

python

import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Download necessary resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

def preprocess_text(text, remove_stopwords=True):
    # Step 1: Clean text
    # Convert to lowercase
    text = text.lower()

    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)

    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    # Step 2: Tokenize
    tokens = word_tokenize(text)

    # Step 3: Remove stopwords (optional)
    if remove_stopwords:
        stop_words = set(stopwords.words('english'))
        tokens = [token for token in tokens if token not in stop_words]

    # Step 4: Lemmatize
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token, pos='v') for token in tokens]

    # Return processed tokens
    return tokens

# Example usage
documents = [
    "Natural Language Processing (NLP) techniques are being used in many applications today!",
    "Machine learning models require numerical data to work properly.",
    "Text preprocessing is the first step in any NLP pipeline."
]

# Process each document
processed_docs = []
for doc in documents:
    processed_tokens = preprocess_text(doc)
    processed_docs.append(' '.join(processed_tokens))

print("Processed documents:")
for doc in processed_docs:
    print(doc)

# Convert to TF-IDF features
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(processed_docs)

print("\nFeature matrix shape:", X.shape)
print("Features for first document:\n", X[0].toarray())

Practical Considerations

Processing Pipeline Choices

The choices you make in your preprocessing pipeline should be informed by:

  1. Language characteristics: Different languages may require different approaches
  2. Task requirements: Some tasks need more preservation of original text
  3. Computational constraints: Lemmatization is more resource-intensive than stemming
  4. Domain specificity: Technical or specialized text might need custom preprocessing

Common Pitfalls

  • Over-preprocessing: Removing too much information (like punctuation for sentiment analysis)
  • Under-preprocessing: Not handling important variations (like case differences)
  • Ignoring domain-specific needs: Medical or legal text requires specialized preprocessing
  • Not validating results: Always inspect your preprocessing output

Beyond Basic Preprocessing

Text preprocessing continues to evolve with:

  • Contextualized preprocessing: Adapting cleaning based on context
  • Learned tokenization: Models that learn how to tokenize
  • End-to-end approaches: Models that process raw text directly

Summary

In this lesson, we've covered:

  1. Text cleaning and normalization: Making text consistent and removing noise
  2. Tokenization: Breaking text into smaller meaningful units
  3. Stemming and lemmatization: Reducing words to their base forms
  4. Feature extraction: Converting text to numerical features

These foundational preprocessing techniques form the backbone of traditional NLP pipelines. In the next lesson, we'll explore advanced tokenization techniques that are used in modern transformer-based models.

Practice Exercises

  1. Basic Preprocessing: Implement a function that:

    • Cleans text (lowercase, removes special characters)
    • Tokenizes using word tokenization
    • Removes stopwords

    Test it on a paragraph of your choice.
  2. Comparative Analysis: Compare the effects of:

    • Different stemmers (Porter, Snowball, Lancaster)
    • Lemmatization
    • With and without stopword removal

    How do these choices affect the final representation?
  3. Advanced Pipeline: Build a complete preprocessing pipeline that:

    • Takes raw text
    • Applies cleaning
    • Offers choice of tokenization
    • Applies stemming or lemmatization
    • Extracts features using TF-IDF
    • Returns a feature matrix ready for machine learning

Additional Resources