Word Embeddings: From Word2Vec to FastText
Overview
In our previous lessons, we explored how to preprocess text and tokenize it into meaningful units. While these are crucial steps, they still don't solve a fundamental challenge in NLP: how do we represent words in a way that captures their meaning and relationships?
This lesson introduces word embeddings - dense vector representations that encode semantic relationships between words. These representations revolutionized NLP by enabling machines to understand semantic similarity, analogies, and other relationships between words that were previously difficult to capture.
Learning Objectives
After completing this lesson, you will be able to:
- Understand the limitations of traditional one-hot encoding for word representation
- Explain the intuition and theory behind word embeddings
- Differentiate between Word2Vec approaches (CBOW and Skip-gram)
- Understand how GloVe captures global statistics
- Recognize how FastText handles subword information
- Implement and use pre-trained word embeddings in practical applications
The Challenge of Word Representation
One-Hot Encoding: A Starting Point
Before embeddings, the standard approach to represent words was one-hot encoding:
1 "cat" → [0, 0, 1, 0, 0, ... 0] 2 "dog" → [0, 0, 0, 1, 0, ... 0]
In a one-hot encoding, each word gets a unique position in a very high-dimensional vector (the size of your vocabulary). Only one element is "hot" (set to 1), and all others are 0.
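Below is a minimal NumPy sketch of one-hot encoding over a toy five-word vocabulary (the vocabulary and words are illustrative):

```python
import numpy as np

# Toy vocabulary (illustrative); real vocabularies contain tens of thousands of words
vocab = ["the", "a", "cat", "dog", "sat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

cat, dog = one_hot("cat"), one_hot("dog")
print(cat)               # [0. 0. 1. 0. 0.]
print(np.dot(cat, dog))  # 0.0 -- every pair of distinct words is orthogonal
```

Note that the dot product between any two distinct words is zero, which previews the limitations discussed next.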
Limitations of One-Hot Encoding
- Dimensionality: For a vocabulary of 50,000 words, each vector has 50,000 dimensions but only contains a single piece of information.
- No Semantic Information: "cat" and "kitten" are as different as "cat" and "spacecraft" - all word pairs are equidistant.
- No Generalization: A model can't transfer knowledge between similar words.
Analogy: Library with No Organization
Imagine a library where books are simply assigned arbitrary shelf numbers without any organizing principle. Similar books might be placed on opposite ends of the building. Finding related content would require memorizing each book's exact location, with no way to guess where related titles might be.
Word embeddings are like organizing this library topically, where similar books are placed near each other, allowing you to browse naturally based on subject matter.
Distributional Semantics: The Foundation
The theoretical foundation for word embeddings comes from distributional semantics, captured in J.R. Firth's famous quote:
"You shall know a word by the company it keeps."
This idea suggests that words appearing in similar contexts likely have similar meanings. For example, "cat" and "dog" often appear near words like "pet," "animal," "fur," etc.
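As a rough sketch of this idea, the snippet below counts which words co-occur within a small window in a toy two-sentence corpus (the corpus and window size are illustrative):

```python
from collections import defaultdict

# Toy corpus (illustrative)
sentences = [
    "the cat is a pet with soft fur".split(),
    "the dog is a pet with thick fur".split(),
]

window = 3
cooccurrence = defaultdict(lambda: defaultdict(int))

for tokens in sentences:
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                cooccurrence[target][tokens[j]] += 1

# "cat" and "dog" end up with nearly identical context counts ("pet", "is", "a", ...),
# which is exactly the signal that embedding models exploit
print(dict(cooccurrence["cat"]))
print(dict(cooccurrence["dog"]))
```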
Word2Vec: Making Words Computable
In 2013, Tomas Mikolov and colleagues at Google introduced Word2Vec, a groundbreaking approach to learning word representations from large text corpora.
The Word2Vec Intuition
Word2Vec transforms words into dense vectors (typically 100-300 dimensions) where:
- Similar words are close together in vector space
- Relationships between words are preserved as vector operations
- Different aspects of meaning are captured in different dimensions
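To make the first two points concrete, here is a toy illustration with hand-crafted 3-dimensional vectors and cosine similarity; the numbers are invented purely to show the geometry (real embeddings are learned and have far more dimensions):

```python
import numpy as np

# Hand-crafted toy vectors (illustrative only; real embeddings are learned from data)
vectors = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    """Cosine similarity: close to 1 for similar directions, close to 0 for unrelated ones."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["cat"], vectors["dog"]))  # high: related words sit close together
print(cosine(vectors["cat"], vectors["car"]))  # low: unrelated words sit far apart
```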
Two Architecture Variants
Word2Vec comes in two flavors:
- Continuous Bag of Words (CBOW): Predicts a target word from its context words
- Skip-gram: Predicts context words from a target word
Continuous Bag of Words (CBOW)
CBOW predicts a target word given its surrounding context words.
Architecture
- Context words are one-hot encoded
- These encodings are projected through a shared weight matrix
- The projections are averaged
- The result passes through an output layer to predict the target word
Mathematical Formulation
For a target word $w_t$ and context words $w_{t-c}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+c}$:
- Input layer: One-hot vectors $x_{t-c}, \dots, x_{t+c}$ for the context words
- Hidden layer: $h = \frac{1}{2c} W^\top \sum_{-c \le j \le c,\ j \ne 0} x_{t+j}$ (the average of the projected context vectors)
- Output layer: $u_k = {v'_{w_k}}^\top h$ for each word $w_k$ in the vocabulary
- Softmax: $P(w_t \mid \text{context}) = \dfrac{\exp(u_t)}{\sum_{k=1}^{V} \exp(u_k)}$
Where $W$ and $W'$ are the input-to-hidden and hidden-to-output weight matrices, and $v'_{w_k}$ is the $k$-th column of $W'$.
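The following NumPy sketch runs a single CBOW forward pass using the notation above; the vocabulary size, embedding dimension, and word indices are toy values, and the weights are random rather than trained:

```python
import numpy as np

V, N = 10, 4                        # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))         # input-to-hidden weights (one row per word)
W_prime = rng.normal(size=(N, V))   # hidden-to-output weights

context_ids = [1, 3, 5, 7]          # indices of the surrounding context words
target_id = 4                       # index of the word to predict

# Hidden layer: average of the projected context vectors (selecting rows of W
# is equivalent to multiplying W^T by the one-hot context vectors)
h = np.mean([W[i] for i in context_ids], axis=0)

# Output scores and softmax over the vocabulary
scores = h @ W_prime
probs = np.exp(scores - scores.max())
probs /= probs.sum()

print(f"P(target | context) = {probs[target_id]:.4f}")
```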
Skip-gram
Skip-gram is the reverse of CBOW: it predicts context words given a target word.
Architecture
- Target word is one-hot encoded
- This encoding is projected through a weight matrix
- The result is used to predict each context word independently
Mathematical Formulation
For a target word $w_t$ and context words $w_{t+j}$ with $-c \le j \le c$, $j \ne 0$:
- Input layer: One-hot vector $x_t$ for the target word
- Hidden layer: $h = W^\top x_t$ (the target word's embedding)
- Output layer: $u_k = {v'_{w_k}}^\top h$ for each word $w_k$ in the vocabulary
- Softmax: For each position $j$ in the context window, calculate $P(w_{t+j} \mid w_t) = \dfrac{\exp({v'_{w_{t+j}}}^\top h)}{\sum_{k=1}^{V} \exp({v'_{w_k}}^\top h)}$
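To see what "predicting each context word independently" means in practice, the sketch below generates (target, context) training pairs from one sentence; the sentence and window size are illustrative:

```python
def skipgram_pairs(tokens, window=2):
    """Pair each target word with every word in its context window."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps".split()
for target, context in skipgram_pairs(sentence):
    print(f"{target} -> {context}")   # each pair is a separate prediction problem
```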
Training Optimizations
Computing the full softmax for large vocabularies (e.g., millions of words) is computationally expensive. Two main optimization techniques are used:
- Hierarchical Softmax: Uses a binary tree structure to reduce complexity from O(V) to O(log V)
- Negative Sampling: Updates only a small subset of weights in each iteration
Negative Sampling Explained
Instead of updating all output neurons, negative sampling:
- Updates the weights for the correct output
- Updates weights for a few randomly chosen "negative" outputs
- Significantly speeds up training
The objective function becomes:
$$\log \sigma\!\left({v'_{w_O}}^\top v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma\!\left(-{v'_{w_i}}^\top v_{w_I}\right)\right]$$
Where:
- $v_{w_I}$ is the input vector for the target word
- $v'_{w_O}$ is the output vector for the context word
- $w_i$ are the negative samples, drawn from a noise distribution $P_n(w)$
- $\sigma$ is the sigmoid function
- $k$ is the number of negative samples (typically 5-20)
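Here is a minimal NumPy sketch of the negative-sampling objective for one (target, context) pair, following the notation above; the vectors are small random values and the negative samples are drawn uniformly for simplicity (real Word2Vec samples from a smoothed unigram distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, k = 1000, 100, 5                         # toy vocabulary size, dimension, negative samples

v_input = rng.normal(scale=0.1, size=(V, N))   # input (target-word) vectors
v_output = rng.normal(scale=0.1, size=(V, N))  # output (context-word) vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

target_id, context_id = 42, 17              # one observed (target, context) pair
negative_ids = rng.integers(0, V, size=k)   # k "negative" words (uniform here for simplicity)

v_wI = v_input[target_id]
v_wO = v_output[context_id]

# log sigma(v'_wO . v_wI) + sum_i log sigma(-v'_wi . v_wI)
positive_term = np.log(sigmoid(v_wO @ v_wI))
negative_term = np.sum(np.log(sigmoid(-v_output[negative_ids] @ v_wI)))
objective = positive_term + negative_term

print(f"Negative-sampling objective for this pair: {objective:.4f}")
```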
CBOW vs Skip-gram: When to Use Each
- CBOW: Trains faster and works slightly better for frequent words; a good default for large corpora.
- Skip-gram: Works better with smaller training data and represents rare words more accurately, at the cost of slower training.
Word Analogies: Vector Arithmetic
One of the most fascinating properties of word embeddings is their ability to capture linguistic regularities through vector arithmetic.
The Famous Example
$$\text{king} - \text{man} + \text{woman} \approx \text{queen}$$
This shows how the model captures gender relationships between words: subtracting the "man" vector from "king" and adding the "woman" vector yields a point in the embedding space whose nearest word is "queen".
Other Analogies
The same vector arithmetic captures many other relationships, for example:
- Country-capital: Paris - France + Italy ≈ Rome
- Verb form: walking - walk + swim ≈ swimming
- Comparative: bigger - big + small ≈ smaller
GloVe: Global Vectors for Word Representation
While Word2Vec learns from local context windows, GloVe (Global Vectors) incorporates global statistics about word co-occurrences across the entire corpus.
GloVe's Approach
GloVe combines the advantages of two paradigms:
- Matrix factorization methods like LSA (captures global statistics)
- Local context window methods like Word2Vec (captures local context)
GloVe's Mathematical Foundation
GloVe trains on global word-word co-occurrence statistics from a corpus. The objective function is:
$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2$$
Where:
- $X_{ij}$ is the number of times word $j$ appears in the context of word $i$
- $w_i$ and $\tilde{w}_j$ are the word and context word vectors
- $b_i$ and $\tilde{b}_j$ are bias terms
- $f$ is a weighting function that gives less weight to rare co-occurrences
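The sketch below evaluates the weighting function and one term of this objective, assuming the standard choices x_max = 100 and alpha = 0.75 from the GloVe paper; the vectors, biases, and co-occurrence count are toy values:

```python
import numpy as np

def weight(x, x_max=100, alpha=0.75):
    """GloVe weighting f(x): down-weights rare co-occurrences, caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

rng = np.random.default_rng(0)
dim = 50
w_i = rng.normal(scale=0.1, size=dim)   # word vector (toy)
w_j = rng.normal(scale=0.1, size=dim)   # context word vector (toy)
b_i, b_j = 0.0, 0.0                     # bias terms
X_ij = 25                               # co-occurrence count of words i and j (toy)

# One term of the objective: f(X_ij) * (w_i . w_j + b_i + b_j - log X_ij)^2
term = weight(X_ij) * (w_i @ w_j + b_i + b_j - np.log(X_ij)) ** 2
print(f"f(X_ij) = {weight(X_ij):.3f}, loss term = {term:.4f}")
```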
GloVe vs Word2Vec
- Training signal: Word2Vec learns by predicting words from local context windows; GloVe fits vectors directly to global co-occurrence counts.
- Quality: The two produce embeddings of broadly comparable quality; which works better depends on the corpus, hyperparameters, and downstream task.
- Availability: Both offer widely used pre-trained vectors (e.g., Google News vectors for Word2Vec; Wikipedia and Common Crawl vectors for GloVe).
FastText: Improving with Subword Information
FastText, developed by Facebook AI Research (FAIR), extends Word2Vec by incorporating subword information, addressing a major limitation of previous models: handling out-of-vocabulary and rare words.
The Subword Approach
While Word2Vec and GloVe treat each word as an atomic unit, FastText represents each word as a bag of character n-grams plus the whole word.
For example, the word "where" with n-grams of length 3-6 would be represented as:
- Whole word: "where"
- Character n-grams: "<wh", "whe", "her", "ere", "re>", "<whe", "wher", "here", "ere>", "<wher", "where", "here>", "<where", "where>"
(Note: < and > are special boundary symbols)
Mathematical Formulation
In FastText, a word's embedding is the sum of its character n-gram embeddings:
$$v_w = \sum_{g \in G_w} z_g$$
Where:
- $G_w$ is the set of n-grams appearing in word $w$ (plus the word itself)
- $z_g$ is the vector representation of n-gram $g$
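A short sketch of the subword idea: extract character n-grams with boundary symbols and sum their vectors to get a word vector. The n-gram vectors here are randomly initialized for illustration; in real FastText they are learned during training and hashed into a fixed number of buckets:

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of '<word>' plus the whole word itself."""
    wrapped = f"<{word}>"
    grams = {wrapped[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(wrapped) - n + 1)}
    grams.add(word)
    return grams

rng = np.random.default_rng(0)
dim = 8
ngram_vectors = {}   # toy store of n-gram vectors (learned in real FastText)

def word_vector(word):
    """v_w = sum of z_g over all n-grams g of the word."""
    grams = char_ngrams(word)
    for g in grams:
        if g not in ngram_vectors:
            ngram_vectors[g] = rng.normal(size=dim)
    return sum(ngram_vectors[g] for g in grams)

print(sorted(char_ngrams("where")))
print(word_vector("computre")[:4])   # even a misspelled, unseen word gets a vector
```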
Benefits of FastText
- Handles out-of-vocabulary words: Can generate embeddings for words never seen during training
- Better for morphologically rich languages: Captures prefixes, suffixes, and roots
- Robust to misspellings: Similar spellings result in similar embeddings
- Smaller models: Can represent larger vocabularies efficiently
Analogy: Character-Based Recognition
Think of how humans recognize related words. If you've never seen the word "unhappiness" but know "happy," "unhappy," and "happiness," you can deduce its meaning from its parts. FastText follows a similar principle, building word meaning from component parts.
Practical Implementation
Using Word2Vec with Gensim
```python
import gensim.downloader as api
from gensim.models import Word2Vec
import numpy as np

# Load pre-trained model
word2vec_model = api.load('word2vec-google-news-300')

# Find similar words
similar_words = word2vec_model.most_similar('computer', topn=5)
print("Words similar to 'computer':")
for word, similarity in similar_words:
    print(f"  {word}: {similarity:.4f}")

# Word analogies
result = word2vec_model.most_similar(
    positive=['woman', 'king'],
    negative=['man'],
    topn=1
)
print(f"\nking - man + woman = {result[0][0]}")

# Train your own model
sentences = [
    ["cat", "say", "meow"],
    ["dog", "say", "woof"]
    # Add more sentences here
]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Get vector for a word
cat_vector = model.wv['cat']
print(f"\nVector for 'cat' (first 5 dimensions): {cat_vector[:5]}")

# Save and load
model.save("word2vec.model")
loaded_model = Word2Vec.load("word2vec.model")
```
Using GloVe with Python
```python
import numpy as np
from gensim.models import KeyedVectors
import urllib.request
import os
import zipfile

# Download and extract GloVe vectors
glove_url = "http://nlp.stanford.edu/data/glove.6B.zip"
glove_path = "glove.6B.zip"

if not os.path.exists(glove_path):
    urllib.request.urlretrieve(glove_url, glove_path)
    with zipfile.ZipFile(glove_path, 'r') as zip_ref:
        zip_ref.extractall(".")

# Load GloVe vectors
glove_file = 'glove.6B.100d.txt'
glove_model = {}

with open(glove_file, 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype='float32')
        glove_model[word] = vector

# Function to find closest words (cosine similarity)
def find_closest_words(word_vector, model, n=5):
    similarities = {}
    for word, vector in model.items():
        if len(vector) == len(word_vector):
            similarity = np.dot(vector, word_vector) / (np.linalg.norm(vector) * np.linalg.norm(word_vector))
            similarities[word] = similarity

    return sorted(similarities.items(), key=lambda x: x[1], reverse=True)[:n]

# Example usage
if 'computer' in glove_model:
    computer_vector = glove_model['computer']
    closest_words = find_closest_words(computer_vector, glove_model)
    print("\nWords similar to 'computer' (GloVe):")
    for word, similarity in closest_words:
        print(f"  {word}: {similarity:.4f}")
```
Using FastText
```python
import fasttext
import fasttext.util

# Download pre-trained FastText model
fasttext.util.download_model('en', if_exists='ignore')

# Load the model
ft_model = fasttext.load_model('cc.en.300.bin')

# Reduce model dimensions for faster processing (optional)
fasttext.util.reduce_model(ft_model, 100)

# Get word vectors
computer_vector = ft_model.get_word_vector('computer')
print(f"\nVector for 'computer' (first 5 dimensions): {computer_vector[:5]}")

# Get vectors for OOV words
typo_vector = ft_model.get_word_vector('computre')  # Misspelling
print(f"Vector for misspelled 'computre' (first 5 dimensions): {typo_vector[:5]}")

# Train your own FastText model
# Model modes: cbow or skipgram
model = fasttext.train_unsupervised(
    'data.txt',
    model='skipgram',
    dim=100,
    epoch=5,
    lr=0.05,
    wordNgrams=2
)

# Save and load
model.save_model("fasttext_model.bin")
loaded_model = fasttext.load_model("fasttext_model.bin")
```
Evaluating Word Embeddings
Intrinsic Evaluation
- Word Similarity: How well do embedding distances correlate with human judgments?
  - WordSim-353, SimLex-999, MEN datasets
- Word Analogies: How well do embeddings capture relationships?
  - Google analogy dataset (semantic and syntactic analogies)
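As an example of the word-similarity protocol, the sketch below compares model cosine similarities against human ratings using Spearman correlation (SciPy assumed available); the word pairs and scores are invented for illustration, whereas a real evaluation would iterate over a dataset such as WordSim-353:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical human similarity ratings (illustrative, not real dataset values)
human_scores = {("cat", "dog"): 7.5, ("cat", "car"): 1.2, ("king", "queen"): 8.1}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def evaluate_similarity(embeddings):
    """Spearman correlation between model similarities and human judgments."""
    model_sims, human_sims = [], []
    for (w1, w2), score in human_scores.items():
        if w1 in embeddings and w2 in embeddings:
            model_sims.append(cosine(embeddings[w1], embeddings[w2]))
            human_sims.append(score)
    corr, _ = spearmanr(model_sims, human_sims)
    return corr

# Usage with any dict-like word -> vector mapping, e.g. the glove_model built earlier:
# print(evaluate_similarity(glove_model))
```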
Extrinsic Evaluation
Test performance on downstream tasks:
- Named Entity Recognition
- Sentiment Analysis
- Part-of-Speech Tagging
Limitations of Traditional Word Embeddings
Despite their revolutionary impact, traditional word embeddings have several limitations:
- Static Word Representations: Each word has a single vector, regardless of context
  - "bank" has the same representation in "river bank" and "bank account"
- Limited Compositional Understanding: Poor at representing phrases and sentences
- Bias and Fairness Issues: Embeddings learn and amplify biases in training data
  - Example: "man : doctor :: woman : nurse"
- Requires Large Corpora: Need substantial training data for good quality
Summary
In this lesson, we've covered:
- The evolution from sparse to dense word representations
- Word2Vec approaches: CBOW and Skip-gram
- GloVe's incorporation of global statistics
- FastText's handling of subword information
- Practical implementations of word embedding models
- Limitations of traditional embedding approaches
These foundational models revolutionized NLP by transforming words into rich, meaningful vector spaces. However, they represent just the beginning of the embedding journey.
In our next lesson, we'll explore contextual embeddings from models like ELMo, BERT, and modern language models, which address many limitations of the traditional approaches we've covered here.
Practice Exercises
- Word Embedding Exploration:
  - Download pre-trained Word2Vec, GloVe, and FastText models
  - Compare their performance on a set of word analogies
  - Visualize word clusters in 2D using dimensionality reduction
- Training Custom Embeddings:
  - Train Word2Vec and FastText embeddings on a domain-specific corpus
  - Compare their performance against general pre-trained models
  - Analyze how domain focus affects quality
- Word Similarity Application:
  - Build a simple document similarity system using word embeddings
  - Create an average-of-embeddings representation for sentences
  - Compute distances between documents
- Embedding Bias Analysis:
  - Investigate gender, racial, or other biases in pre-trained embeddings
  - Implement a simple debiasing approach
  - Measure the impact of debiasing on analogy tasks
Additional Resources
- Word2Vec Paper: Efficient Estimation of Word Representations in Vector Space
- GloVe Project at Stanford
- FastText: Library for Efficient Text Classification and Representation Learning
- Gensim: Topic Modelling for Humans
- Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings
- Book: "Speech and Language Processing" by Dan Jurafsky and James H. Martin (Chapter on Vector Semantics)