Unveiling Word2Vec: A Step-by-Step Guide to Understanding What It Learns and How
Overview
Word2Vec revolutionized natural language processing by producing dense vector embeddings that capture semantic relationships. Despite its widespread use, the precise learning mechanism remained elusive for years. A breakthrough theory now reveals that, under realistic conditions, Word2Vec simplifies to unweighted least-squares matrix factorization, with the final embeddings given by Principal Component Analysis (PCA). This tutorial breaks down that result into an accessible, detailed guide.

Prerequisites
To follow this guide, you should be comfortable with:
- Basic machine learning concepts (supervised vs. self-supervised learning)
- Linear algebra (vectors, matrices, eigenvalues)
- Neural network fundamentals (two-layer networks, gradient descent)
- Word embeddings (the idea of representing words as vectors)
Familiarity with the Skip-Gram and Negative Sampling details is helpful but not required.
Step-by-Step Instructions
Step 1: Setting Up Word2Vec as a Minimal Language Model
Word2Vec trains a two-layer linear network using a contrastive objective. The input is a one-hot encoded word, the hidden layer is a set of word embeddings (the weight matrix W), and the output layer produces scores for context words. Through self-supervised learning on a text corpus, the network learns to predict surrounding words.
Key insight: This setup is equivalent to a linear neural language model. The learning dynamics thus reveal how feature learning occurs in more advanced transformers.
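The setup above can be sketched in a few lines of NumPy. Everything here (vocabulary size, embedding dimension, learning rate, and the toy word/context indices) is an illustrative assumption, not the reference implementation:

```python
import numpy as np

# Minimal sketch of Word2Vec's two-layer linear network trained with the
# skip-gram negative-sampling (SGNS) contrastive objective.
rng = np.random.default_rng(0)
vocab_size, dim = 20, 8

# Input->hidden weights (word embeddings) and hidden->output weights
# (context embeddings), initialized very close to the origin.
U = 1e-3 * rng.standard_normal((vocab_size, dim))  # word vectors
V = 1e-3 * rng.standard_normal((vocab_size, dim))  # context vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(w, c, negatives, lr=0.1):
    """One SGD step on the skip-gram negative-sampling objective."""
    # Positive pair: push the score u_w . v_c up.
    g = sigmoid(U[w] @ V[c]) - 1.0
    dU = g * V[c]
    V[c] -= lr * g * U[w]
    # Negative pairs: push their scores down.
    for n in negatives:
        g = sigmoid(U[w] @ V[n])
        dU += g * V[n]
        V[n] -= lr * g * U[w]
    U[w] -= lr * dU

# Toy usage: word 3 seen in context 7, with two sampled negatives.
before = sigmoid(U[3] @ V[7])
for _ in range(300):
    sgns_step(3, 7, negatives=[1, 5])
after = sigmoid(U[3] @ V[7])
print(before, after)  # the positive-pair probability rises with training
```

Because the network is linear, the "model" is nothing but the two embedding tables; the contrastive objective is what forces structure into them.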
Step 2: Initializing Near the Origin – From Zero to One Dimension
When all embeddings are initialized randomly but extremely close to the origin (essentially zero vectors), the learning process begins from a collapsed state. The gradient flow drives the system to escape this symmetric point by learning one concept at a time, each concept corresponding to a direction (a one-dimensional linear subspace) of the embedding space that gets encoded in the weight matrix.
Mathematically, the loss landscape contains saddle points. The network incrementally increases the rank of the weight matrix, each time adding a new orthogonal direction. This is visible as discrete steps in the training loss curve.
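The rank-incrementing dynamics can be reproduced on a small synthetic factorization problem. The target matrix, step size, and initialization scale below are assumptions chosen to make the effect visible, not the paper's exact setup:

```python
import numpy as np

# Gradient descent on the unweighted loss 0.5 * ||M - U V^T||^2 from a
# tiny random initialization. With well-separated singular values, the
# factors pick up one direction at a time, giving step-like loss drops.
rng = np.random.default_rng(1)
n, k = 6, 3

# Target matrix with clearly separated singular values 8 > 3 > 1.
A = np.linalg.qr(rng.standard_normal((n, n)))[0]
B = np.linalg.qr(rng.standard_normal((n, n)))[0]
M = A[:, :3] * np.array([8.0, 3.0, 1.0]) @ B[:, :3].T

U = 1e-4 * rng.standard_normal((n, k))
V = 1e-4 * rng.standard_normal((n, k))

losses, lr = [], 0.01
for _ in range(4000):
    R = M - U @ V.T     # residual
    U += lr * R @ V     # descent step for U
    V += lr * R.T @ U   # descent step for V
    losses.append(0.5 * np.linalg.norm(M - U @ V.T) ** 2)

# The singular values of the learned product approach those of M,
# largest first; plotting `losses` shows the plateaus between escapes.
learned = np.linalg.svd(U @ V.T, compute_uv=False)
print(losses[0], losses[-1], learned[:3])
```

The escape time from each saddle scales with log(1/initialization scale), which is why the plateaus only become crisp when the factors start near zero.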
Step 3: Reduction to Unweighted Least-Squares Matrix Factorization
Under mild approximations (like ignoring the non-linearities from the softmax), the gradient flow dynamics of Word2Vec become equivalent to solving an unweighted least-squares problem. Specifically, the learning task reduces to factorizing a pointwise mutual information (PMI) matrix into two low-rank matrices.
The goal becomes: minimize ||M − UVᵀ||², where M is the PMI matrix and U, V hold the word and context embeddings (note that M is a statistic of the corpus, not the weight matrix W from Step 1). This is a classic matrix factorization problem with a closed-form solution via truncated SVD (the Eckart-Young theorem).
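The reduced objective can be solved directly. Below is a minimal sketch using a made-up 3-word co-occurrence table; the counts and the rank k are illustrative assumptions:

```python
import numpy as np

# Build a PMI matrix from toy co-occurrence counts, then factorize it by
# truncated SVD, which minimizes the unweighted least-squares error.
counts = np.array([[10.0, 2.0, 1.0],
                   [ 2.0, 8.0, 3.0],
                   [ 1.0, 3.0, 6.0]])
total = counts.sum()
p_wc = counts / total                      # joint probabilities
p_w = p_wc.sum(axis=1, keepdims=True)      # word marginals
p_c = p_wc.sum(axis=0, keepdims=True)      # context marginals
pmi = np.log(p_wc / (p_w * p_c))           # pointwise mutual information

# Rank-k factorization pmi ≈ U_emb @ V_emb.T via truncated SVD.
k = 2
left, sigma, right_t = np.linalg.svd(pmi)
U_emb = left[:, :k] * np.sqrt(sigma[:k])   # word embeddings
V_emb = right_t[:k].T * np.sqrt(sigma[:k]) # context embeddings

err = np.linalg.norm(pmi - U_emb @ V_emb.T)
print(err)  # equals the dropped singular value: the best rank-2 error
```

Splitting the singular values symmetrically between U and V (the sqrt factors) is one conventional choice; any split UVᵀ with the same product is equally optimal for this objective.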

Step 4: Final Representations Given by PCA
Because the factorization is unweighted (all entries treated equally), the optimal embeddings are the top principal components of the PMI matrix. In other words, the word vectors after training align with the eigenvectors corresponding to the largest eigenvalues of that matrix.
This explains the linear structure observed in practice: analogies like 'king - man + woman ≈ queen' arise because embeddings capture directions of maximal variance in the co-occurrence statistics.
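The PCA correspondence can be checked numerically. In the sketch below, a random symmetric matrix stands in for the PMI matrix (an illustrative assumption): the rank-2 least-squares factor spans exactly the same subspace as the top-2 principal directions.

```python
import numpy as np

# Verify that the optimal rank-k factor of a symmetric matrix spans the
# same subspace as its top-k eigenvectors (principal components).
rng = np.random.default_rng(2)
X = rng.standard_normal((5, 5))
M = (X + X.T) / 2                          # symmetric stand-in for PMI

eigvals, eigvecs = np.linalg.eigh(M)
order = np.argsort(np.abs(eigvals))[::-1]  # largest magnitude first
top = eigvecs[:, order[:2]]                # top-2 principal directions

left, sigma, right_t = np.linalg.svd(M)
emb = left[:, :2]                          # optimal rank-2 factor basis

# Projecting one basis onto the other loses nothing: same subspace.
residual = np.linalg.norm(emb - top @ (top.T @ emb))
print(residual)  # ~0
```

Individual vectors may differ by sign or rotation, which is why the comparison is between subspaces rather than between raw coordinates.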
Common Mistakes
- Ignoring the initialization: If embeddings are not tiny, the learning dynamics may not follow the discrete steps described. Small initialization is crucial for the rank-incrementing behavior.
- Misunderstanding linear representations: The linear directions are emergent from the factorization, not explicitly supervised. Expecting analogies to always work perfectly ignores that some relations require non-linear combinations.
- Overlooking the approximation: The theory relies on dropping the softmax and weighting factors. In practice, word frequency imbalances can alter the factorization. Use caution when applying the PCA interpretation to raw embeddings.
- Assuming instantaneous learning: Even with the theory, training takes time. The stepwise nature appears only when the learning rate is small and initialization is near zero.
Summary
Word2Vec's learning process can now be quantitatively understood: starting from near-zero initialization, it learns one concept at a time, ultimately performing unweighted matrix factorization of the PMI matrix. The final embeddings are the principal components, explaining the geometric properties of word vectors. This tutorial walked through the key steps – from setup to the PCA result – and highlighted common pitfalls. The insights extend to modern LLMs, where similar linear representations emerge from self-supervised learning.