Word Embeddings: From Word2Vec to FastText

Overview

In our previous lessons, we explored how to preprocess text and tokenize it into meaningful units. While these are crucial steps, they still don't solve a fundamental challenge in NLP: how do we represent words in a way that captures their meaning and relationships?

This lesson introduces word embeddings: dense vector representations that encode semantic relationships between words. These representations revolutionized NLP by enabling models to capture semantic similarity, analogies, and other relationships between words that were previously difficult to represent.

Learning Objectives

After completing this lesson, you will be able to:

  • Understand the limitations of traditional one-hot encoding for word representation
  • Explain the intuition and theory behind word embeddings
  • Differentiate between Word2Vec approaches (CBOW and Skip-gram)
  • Understand how GloVe captures global statistics
  • Recognize how FastText handles subword information
  • Implement and use pre-trained word embeddings in practical applications

The Challenge of Word Representation

One-Hot Encoding: A Starting Point

Before embeddings, the standard approach to represent words was one-hot encoding:

"cat" → [0, 0, 1, 0, 0, ..., 0]
"dog" → [0, 0, 0, 1, 0, ..., 0]

In a one-hot encoding, each word gets a unique position in a very high-dimensional vector (the size of your vocabulary). Only one element is "hot" (set to 1), and all others are 0.

Limitations of One-Hot Encoding

  1. Dimensionality: For a vocabulary of 50,000 words, each vector has 50,000 dimensions but only contains a single piece of information.

  2. No Semantic Information: "cat" and "kitten" are as different as "cat" and "spacecraft"; all word pairs are equidistant.

  3. No Generalization: A model can't transfer knowledge between similar words.
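
A few lines of NumPy make the "no semantic information" point concrete: every pair of distinct one-hot vectors has cosine similarity 0 and the same Euclidean distance, regardless of meaning. The toy vocabulary below is purely illustrative.

python
import numpy as np

# Toy vocabulary: each word gets one position in the vector
vocab = ["cat", "kitten", "dog", "spacecraft"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Every distinct pair is equally (un)related under one-hot encoding
print(cosine(one_hot["cat"], one_hot["kitten"]))        # 0.0
print(cosine(one_hot["cat"], one_hot["spacecraft"]))    # 0.0
print(np.linalg.norm(one_hot["cat"] - one_hot["dog"]))  # sqrt(2), same for any pair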

Analogy: Library with No Organization

Imagine a library where books are simply assigned arbitrary shelf numbers without any organizing principle. Similar books might be placed on opposite ends of the building. Finding related content would require memorizing each book's exact location, with no way to guess where related titles might be.

Word embeddings are like organizing this library topically, where similar books are placed near each other, allowing you to browse naturally based on subject matter.

Distributional Semantics: The Foundation

The theoretical foundation for word embeddings comes from distributional semantics, captured in J.R. Firth's famous quote:

"You shall know a word by the company it keeps."

This idea suggests that words appearing in similar contexts likely have similar meanings. For example, "cat" and "dog" often appear near words like "pet," "animal," "fur," etc.
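
A quick way to see the distributional hypothesis in action is to count, for a tiny corpus, which words appear within a small window of each other: words with similar co-occurrence profiles ("cat" and "dog" below) end up with similar count rows. This is only an illustrative sketch over a made-up three-sentence corpus.

python
from collections import Counter, defaultdict

corpus = [
    "the cat is a small pet animal".split(),
    "the dog is a loyal pet animal".split(),
    "the spacecraft entered orbit around the planet".split(),
]

window = 2
cooc = defaultdict(Counter)

# Count words appearing within `window` positions of each target word
for sentence in corpus:
    for i, target in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if i != j:
                cooc[target][sentence[j]] += 1

# 'cat' and 'dog' share context words such as 'the', 'is', 'a'; 'spacecraft' does not
print(cooc["cat"])
print(cooc["dog"])
print(cooc["spacecraft"])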


Word2Vec: Making Words Computable

In 2013, Tomas Mikolov and colleagues at Google introduced Word2Vec, a groundbreaking approach to learning word representations from large text corpora.

The Word2Vec Intuition

Word2Vec transforms words into dense vectors (typically 100-300 dimensions) where:

  1. Similar words are close together in vector space
  2. Relationships between words are preserved as vector operations
  3. Different aspects of meaning are captured in different dimensions

Two Architecture Variants

Word2Vec comes in two flavors:

  1. Continuous Bag of Words (CBOW): Predicts a target word from its context words
  2. Skip-gram: Predicts context words from a target word

Continuous Bag of Words (CBOW)

CBOW predicts a target word given its surrounding context words.

Architecture

  1. Context words are one-hot encoded
  2. These encodings are projected through a shared weight matrix
  3. The projections are averaged
  4. The result passes through an output layer to predict the target word

Mathematical Formulation

For a target word $w_t$ and context words $w_{t-n}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+n}$:

  1. Input layer: One-hot vectors $\mathbf{x}_{t-n}, \ldots, \mathbf{x}_{t-1}, \mathbf{x}_{t+1}, \ldots, \mathbf{x}_{t+n}$
  2. Hidden layer: $\mathbf{h} = \frac{1}{2n}\mathbf{W}^T(\mathbf{x}_{t-n} + \ldots + \mathbf{x}_{t-1} + \mathbf{x}_{t+1} + \ldots + \mathbf{x}_{t+n})$
  3. Output layer: $u_j = \mathbf{v'}_j^T\mathbf{h}$ for each word $j$ in the vocabulary
  4. Softmax: $p(w_j \mid w_{t-n}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+n}) = \frac{\exp(u_j)}{\sum_{j'=1}^{V} \exp(u_{j'})}$

Where $\mathbf{W}$ and $\mathbf{W'}$ are the input-to-hidden and hidden-to-output weight matrices, and $\mathbf{v'}_j$ is the $j$-th column of $\mathbf{W'}$.
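
The following NumPy sketch mirrors these four steps for a single training example; the weight matrices are random stand-ins for learned parameters and the dimensions are illustrative only.

python
import numpy as np

V, d, n = 10, 8, 2                # vocabulary size, embedding size, context radius
rng = np.random.default_rng(0)
W = rng.normal(size=(V, d))       # input-to-hidden weights (input embeddings)
W_out = rng.normal(size=(d, V))   # hidden-to-output weights

context_ids = [1, 2, 4, 5]        # indices of the 2n context words
x = np.zeros((len(context_ids), V))
x[np.arange(len(context_ids)), context_ids] = 1.0    # one-hot context vectors

h = x.mean(axis=0) @ W            # hidden layer: average of the context embeddings
u = h @ W_out                     # one score u_j per vocabulary word
p = np.exp(u - u.max()) / np.exp(u - u.max()).sum()  # softmax over the vocabulary

print(p.argmax(), p.sum())        # predicted target word id; probabilities sum to 1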

Skip-gram

Skip-gram is the reverse of CBOW: it predicts context words given a target word.

Architecture

  1. Target word is one-hot encoded
  2. This encoding is projected through a weight matrix
  3. The result is used to predict each context word independently

Mathematical Formulation

For a target word $w_t$ and context words $w_{t-n}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+n}$:

  1. Input layer: One-hot vector $\mathbf{x}_t$
  2. Hidden layer: $\mathbf{h} = \mathbf{W}^T\mathbf{x}_t$
  3. Output layer: $u_j = \mathbf{v'}_j^T\mathbf{h}$ for each word $j$ in the vocabulary
  4. Softmax: For each position $i$ in the context window, calculate $p(w_{t+i} \mid w_t) = \frac{\exp(u_{w_{t+i}})}{\sum_{j=1}^{V} \exp(u_j)}$
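
In practice, Skip-gram training data is just (target, context) pairs harvested from a sliding window. The small helper below (illustrative only) shows how a single sentence expands into many training examples.

python
def skipgram_pairs(tokens, window=2):
    """Yield (target, context) pairs for every position in the sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps".split()
for target, context in skipgram_pairs(sentence):
    print(target, "->", context)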


Training Optimizations

Computing the full softmax for large vocabularies (e.g., millions of words) is computationally expensive. Two main optimization techniques are used:

  1. Hierarchical Softmax: Uses a binary tree structure to reduce complexity from O(V) to O(log V)
  2. Negative Sampling: Updates only a small subset of weights in each iteration

Negative Sampling Explained

Instead of updating all output neurons, negative sampling:

  1. Updates the weights for the correct output
  2. Updates weights for a few randomly chosen "negative" outputs
  3. Significantly speeds up training

The objective function becomes:

$$\log \sigma(v_{w_O}^T v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma(-v_{w_i}^T v_{w_I})\right]$$

Where:

  • $v_{w_I}$ is the input vector of the target word
  • $v_{w_O}$ is the output vector of the context word
  • $w_i$ are the negative samples, drawn from a noise distribution $P_n(w)$
  • $\sigma$ is the sigmoid function
  • $k$ is the number of negative samples (typically 5-20)
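
Given vectors for the target word, the true context word, and $k$ sampled negatives, the objective above is a few lines of NumPy. The vectors here are random placeholders; in real training they come from the model's embedding matrices.

python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d, k = 100, 5
v_input = rng.normal(scale=0.1, size=d)           # v_{w_I}: target word vector
v_context = rng.normal(scale=0.1, size=d)         # v_{w_O}: true context word vector
v_negatives = rng.normal(scale=0.1, size=(k, d))  # k sampled negative word vectors

# Negative-sampling objective for this (target, context) pair (to be maximized)
objective = (
    np.log(sigmoid(v_context @ v_input))
    + np.sum(np.log(sigmoid(-v_negatives @ v_input)))
)
print(objective)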

CBOW vs Skip-gram: When to Use Each

In practice, Skip-gram tends to work better on smaller corpora and for rare words, while CBOW trains faster and performs slightly better on frequent words; with large corpora and careful tuning, both produce embeddings of similar overall quality.

Word Analogies: Vector Arithmetic

One of the most fascinating properties of word embeddings is their ability to capture linguistic regularities through vector arithmetic.

The Famous Example

vec("king")vec("man")+vec("woman")vec("queen")\text{vec}(\text{"king"}) - \text{vec}(\text{"man"}) + \text{vec}(\text{"woman"}) \approx \text{vec}(\text{"queen"})

This shows how the model captures gender relationships between words.
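
Using raw vectors rather than a library helper, an analogy query is literally vector arithmetic followed by a nearest-neighbour search. The sketch below assumes `vectors` is a dict mapping words to NumPy arrays, such as the GloVe dictionary built later in this lesson; gensim's most_similar (shown in the practical section) performs essentially the same computation.

python
import numpy as np

def analogy(a, b, c, vectors, topn=1):
    """Solve a : b :: c : ? via the nearest cosine neighbour of vec(b) - vec(a) + vec(c)."""
    query = vectors[b] - vectors[a] + vectors[c]
    scores = {}
    for word, vec in vectors.items():
        if word in (a, b, c):          # exclude the input words themselves
            continue
        scores[word] = np.dot(vec, query) / (np.linalg.norm(vec) * np.linalg.norm(query))
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:topn]

# Example usage (assuming `vectors` has been loaded):
# print(analogy("man", "king", "woman", vectors))   # expect "queen" near the top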

Other Analogies

The same arithmetic captures many other relationships, for example $\text{vec}(\text{"Paris"}) - \text{vec}(\text{"France"}) + \text{vec}(\text{"Italy"}) \approx \text{vec}(\text{"Rome"})$ (country-capital) and $\text{vec}(\text{"walking"}) - \text{vec}(\text{"walk"}) + \text{vec}(\text{"swim"}) \approx \text{vec}(\text{"swimming"})$ (verb inflection).

GloVe: Global Vectors for Word Representation

While Word2Vec learns from local context windows, GloVe (Global Vectors) incorporates global statistics about word co-occurrences across the entire corpus.

GloVe's Approach

GloVe combines the advantages of two paradigms:

  1. Matrix factorization methods like LSA (captures global statistics)
  2. Local context window methods like Word2Vec (captures local context)

GloVe's Mathematical Foundation

GloVe trains on global word-word co-occurrence statistics from a corpus. The objective function is:

$$J = \sum_{i,j=1}^{V} f(X_{ij})\left(\mathbf{w}_i^T\tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2$$

Where:

  • $X_{ij}$ is the number of times word $j$ appears in the context of word $i$
  • $\mathbf{w}_i$ and $\tilde{\mathbf{w}}_j$ are the word and context word vectors
  • $b_i$ and $\tilde{b}_j$ are bias terms
  • $f(X_{ij})$ is a weighting function that gives less weight to rare co-occurrences and caps the influence of very frequent ones
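
The weighting function used in the GloVe paper is $f(x) = (x/x_{max})^{\alpha}$ for $x < x_{max}$ and $1$ otherwise, with $x_{max} = 100$ and $\alpha = 0.75$ as the published defaults. The sketch below computes $f$ and the squared-error term for a single word pair; the vectors and count are placeholders, not learned values.

python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Down-weight rare co-occurrences; cap the weight of frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

rng = np.random.default_rng(0)
d = 50
w_i = rng.normal(scale=0.1, size=d)   # word vector
w_j = rng.normal(scale=0.1, size=d)   # context word vector
b_i, b_j = 0.0, 0.0                   # bias terms
X_ij = 12.0                           # co-occurrence count for this pair

# One term of the GloVe objective J
loss_ij = glove_weight(X_ij) * (w_i @ w_j + b_i + b_j - np.log(X_ij)) ** 2
print(loss_ij)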

GloVe vs Word2Vec

The practical difference is mainly in how the statistics are used: Word2Vec learns by making predictions over local context windows, while GloVe fits a weighted least-squares model to the global co-occurrence matrix. With comparable corpora and tuning, the two approaches yield embeddings of similar quality, so the choice often comes down to tooling and training convenience.

FastText: Improving with Subword Information

FastText, developed by Facebook AI Research (FAIR), extends Word2Vec by incorporating subword information, addressing a major limitation of previous models: handling out-of-vocabulary and rare words.

The Subword Approach

While Word2Vec and GloVe treat each word as an atomic unit, FastText represents each word as a bag of character n-grams plus the whole word.

For example, the word "where" with n-grams of length 3-6 would be represented as:

  • Whole word: "where"
  • Character n-grams: "<wh", "whe", "her", "ere", "re>", "<whe", "wher", "here", "ere>", "<wher", "where", "here>", "<where", "where>"

(Note: < and > are special boundary symbols)

Mathematical Formulation

In FastText, a word's embedding is the sum of its character n-gram embeddings:

$$\mathbf{v}_w = \sum_{g \in G_w} \mathbf{z}_g$$

Where:

  • $G_w$ is the set of n-grams appearing in word $w$ (plus the whole word itself)
  • $\mathbf{z}_g$ is the vector representation of n-gram $g$
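
The n-gram decomposition is easy to reproduce: wrap the word in boundary symbols, slide windows of length 3-6 over it, and sum the corresponding n-gram vectors. The embedding table below is a random stand-in for learned n-gram vectors, and real FastText additionally hashes n-grams into a fixed number of buckets.

python
import numpy as np

def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in boundary symbols, plus the whole word."""
    wrapped = f"<{word}>"
    grams = {wrapped}                        # the whole word is kept as its own unit
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams

print(sorted(char_ngrams("where")))

# A word vector is the sum of its n-gram vectors (random placeholders here)
rng = np.random.default_rng(0)
ngram_vectors = {g: rng.normal(size=8) for g in char_ngrams("where")}
v_where = sum(ngram_vectors[g] for g in char_ngrams("where"))
print(v_where[:5])

Because an unseen word can still be decomposed into known n-grams, the same summation yields a vector for out-of-vocabulary words, which is the property discussed next.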

Benefits of FastText

  1. Handles out-of-vocabulary words: Can generate embeddings for words never seen during training
  2. Better for morphologically rich languages: Captures prefixes, suffixes, and roots
  3. Robust to misspellings: Similar spellings result in similar embeddings
  4. Smaller models: Can represent larger vocabularies efficiently


Analogy: Character-Based Recognition

Think of how humans recognize related words. If you've never seen the word "unhappiness" but know "happy," "unhappy," and "happiness," you can deduce its meaning from its parts. FastText follows a similar principle, building word meaning from component parts.

Practical Implementation

Using Word2Vec with Gensim

python
import gensim.downloader as api
from gensim.models import Word2Vec

# Load a pre-trained model (large download on first use)
word2vec_model = api.load('word2vec-google-news-300')

# Find similar words
similar_words = word2vec_model.most_similar('computer', topn=5)
print("Words similar to 'computer':")
for word, similarity in similar_words:
    print(f"  {word}: {similarity:.4f}")

# Word analogies: king - man + woman ≈ ?
result = word2vec_model.most_similar(
    positive=['woman', 'king'],
    negative=['man'],
    topn=1
)
print(f"\nking - man + woman = {result[0][0]}")

# Train your own model on tokenized sentences
sentences = [
    ["cat", "say", "meow"],
    ["dog", "say", "woof"],
    # Add more sentences here
]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Get vector for a word
cat_vector = model.wv['cat']
print(f"\nVector for 'cat' (first 5 dimensions): {cat_vector[:5]}")

# Save and load
model.save("word2vec.model")
loaded_model = Word2Vec.load("word2vec.model")

Using GloVe with Python

python
import numpy as np
import urllib.request
import os
import zipfile

# Download and extract the GloVe vectors (large download on first use)
glove_url = "http://nlp.stanford.edu/data/glove.6B.zip"
glove_path = "glove.6B.zip"

if not os.path.exists(glove_path):
    urllib.request.urlretrieve(glove_url, glove_path)
    with zipfile.ZipFile(glove_path, 'r') as zip_ref:
        zip_ref.extractall(".")

# Load GloVe vectors into a dictionary: word -> vector
glove_file = 'glove.6B.100d.txt'
glove_model = {}

with open(glove_file, 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype='float32')
        glove_model[word] = vector

# Find the n closest words to a vector by cosine similarity
def find_closest_words(word_vector, model, n=5):
    similarities = {}
    for word, vector in model.items():
        if len(vector) == len(word_vector):
            similarity = np.dot(vector, word_vector) / (
                np.linalg.norm(vector) * np.linalg.norm(word_vector)
            )
            similarities[word] = similarity

    return sorted(similarities.items(), key=lambda x: x[1], reverse=True)[:n]

# Example usage
if 'computer' in glove_model:
    computer_vector = glove_model['computer']
    closest_words = find_closest_words(computer_vector, glove_model)
    print("\nWords similar to 'computer' (GloVe):")
    for word, similarity in closest_words:
        print(f"  {word}: {similarity:.4f}")

Using FastText

python
import fasttext
import fasttext.util

# Download the pre-trained English model (cc.en.300.bin, very large download)
fasttext.util.download_model('en', if_exists='ignore')

# Load the model
ft_model = fasttext.load_model('cc.en.300.bin')

# Reduce model dimensions for faster processing (optional)
fasttext.util.reduce_model(ft_model, 100)

# Get word vectors
computer_vector = ft_model.get_word_vector('computer')
print(f"\nVector for 'computer' (first 5 dimensions): {computer_vector[:5]}")

# Get vectors for OOV words
typo_vector = ft_model.get_word_vector('computre')  # Misspelling
print(f"Vector for misspelled 'computre' (first 5 dimensions): {typo_vector[:5]}")

# Train your own FastText model on a plain-text corpus
# Model modes: 'cbow' or 'skipgram'
model = fasttext.train_unsupervised(
    'data.txt',
    model='skipgram',
    dim=100,
    epoch=5,
    lr=0.05,
    minn=3,   # shortest character n-gram
    maxn=6    # longest character n-gram
)

# Save and load
model.save_model("fasttext_model.bin")
loaded_model = fasttext.load_model("fasttext_model.bin")

Evaluating Word Embeddings

Intrinsic Evaluation

  1. Word Similarity: How well do embedding distances correlate with human judgments?

    • WordSim-353, SimLex-999, MEN datasets
  2. Word Analogies: How well do embeddings capture relationships?

    • Google analogy dataset (semantic and syntactic analogies)
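
Gensim's KeyedVectors provides helpers for both kinds of intrinsic evaluation. The sketch below reuses the pre-trained model from the practical section and the evaluation files shipped with gensim's test data; the file names ('wordsim353.tsv', 'questions-words.txt') are the ones commonly bundled, but treat their availability as an assumption.

python
import gensim.downloader as api
from gensim.test.utils import datapath

# Re-use the pre-trained vectors from the practical section
word2vec_model = api.load('word2vec-google-news-300')

# Word similarity: correlation with human judgments on WordSim-353
pearson, spearman, oov_ratio = word2vec_model.evaluate_word_pairs(datapath('wordsim353.tsv'))
print(f"WordSim-353 Spearman correlation: {spearman[0]:.3f} (OOV: {oov_ratio:.1f}%)")

# Word analogies: accuracy on the Google analogy dataset
analogy_score, sections = word2vec_model.evaluate_word_analogies(datapath('questions-words.txt'))
print(f"Google analogy accuracy: {analogy_score:.3f}")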

Extrinsic Evaluation

Test performance on downstream tasks:

  • Named Entity Recognition
  • Sentiment Analysis
  • Part-of-Speech Tagging


Limitations of Traditional Word Embeddings

Despite their revolutionary impact, traditional word embeddings have several limitations:

  1. Static Word Representations: Each word has a single vector, regardless of context

    • "bank" has the same representation in "river bank" and "bank account"
  2. Limited Compositional Understanding: Poor at representing phrases and sentences

  3. Bias and Fairness Issues: Embeddings learn and amplify biases in training data

    • Example: "man : doctor :: woman : nurse"
  4. Requires Large Corpora: Need substantial training data for good quality
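
Bias can be probed directly with vector arithmetic. The sketch below projects a few occupation words onto a simple he-she gender direction; the gensim-data model name is assumed to be available, and this is a rough illustration rather than a rigorous bias metric.

python
import numpy as np
import gensim.downloader as api

# Small pre-trained GloVe vectors from gensim-data (model name assumed available)
glove = api.load('glove-wiki-gigaword-100')

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Project occupation words onto a simple he-she gender direction
gender_direction = glove['he'] - glove['she']
for occupation in ['doctor', 'nurse', 'engineer', 'teacher', 'librarian']:
    score = cosine(glove[occupation], gender_direction)
    print(f"{occupation:>10}: {score:+.3f}")  # positive leans 'he', negative leans 'she'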


Summary

In this lesson, we've covered:

  1. The evolution from sparse to dense word representations
  2. Word2Vec approaches: CBOW and Skip-gram
  3. GloVe's incorporation of global statistics
  4. FastText's handling of subword information
  5. Practical implementations of word embedding models
  6. Limitations of traditional embedding approaches

These foundational models revolutionized NLP by transforming words into rich, meaningful vector spaces. However, they represent just the beginning of the embedding journey.

In our next lesson, we'll explore contextual embeddings from models like ELMo, BERT, and modern language models, which address many limitations of the traditional approaches we've covered here.

Practice Exercises

  1. Word Embedding Exploration:

    • Download pre-trained Word2Vec, GloVe, and FastText models
    • Compare their performance on a set of word analogies
    • Visualize word clusters in 2D using dimensionality reduction
  2. Training Custom Embeddings:

    • Train Word2Vec and FastText embeddings on a domain-specific corpus
    • Compare their performance against general pre-trained models
    • Analyze how domain focus affects quality
  3. Word Similarity Application:

    • Build a simple document similarity system using word embeddings
    • Create an average-of-embeddings representation for sentences
    • Compute distances between documents
  4. Embedding Bias Analysis:

    • Investigate gender, racial, or other biases in pre-trained embeddings
    • Implement a simple debiasing approach
    • Measure the impact of debiasing on analogy tasks

Additional Resources