An important technology after n-gram is NNLM (Neural Network Language Model), proposed by Yoshua Bengio and others in 2003. It was originally intended to improve language modeling using neural networks, but unexpectedly led to the invention of word embeddings.
Representation Problem: From Ordinal Encoding to One-Hot#
Before delving into NNLM, it is essential to understand a more fundamental question: how can we make computers understand vocabulary?
In the early stages of machine learning, researchers faced a fundamental problem: computers can only understand numbers, but a large amount of real-world data is categorical. For example, gender has male/female categories, and colors have red/green/blue/black, etc.
The initial idea was simple and crude: assign a number to each category. For instance, for gender, male=1 and female=2; for color, red=1, green=2, blue=3. However, this ordinal encoding has a serious problem: it implies an ordering and a distance relationship between categories that does not actually exist.
For example, if red=1, green=2, and blue=3, when calculating distances in machine learning:
- The distance between red and green: |1-2|=1
- The distance between red and blue: |1-3|=2
- The distance between green and blue: |2-3|=1
The model would consider red and green to be more similar, while red and blue to be quite different. However, in reality, the primary colors red, green, and blue should be equidistant; none is closer to the other.
To solve this semantic ambiguity, one-hot encoding emerged. Its core idea is to represent the independence of categories using the orthogonality of vectors. Taking red, green, and blue as an example:
- Red: [1, 0, 0]
- Green: [0, 1, 0]
- Blue: [0, 0, 1]
Each category is an orthogonal vector, equidistant from each other, with no implied size relationship. This idea corresponds to the concept of basis vectors in linear algebra—each category is a standard basis vector in a high-dimensional space, mutually orthogonal and of equal magnitude.
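A minimal NumPy sketch of this encoding (using `np.eye` to build the standard basis vectors); it checks that the categories are mutually orthogonal and equidistant:

```python
import numpy as np

colors = ["red", "green", "blue"]
one_hot = {c: np.eye(len(colors))[i] for i, c in enumerate(colors)}

# Every pair of categories is orthogonal (dot product 0) and equally far apart.
print(one_hot["red"] @ one_hot["green"])                  # 0.0
print(np.linalg.norm(one_hot["red"] - one_hot["green"]))  # 1.414...
print(np.linalg.norm(one_hot["red"] - one_hot["blue"]))   # 1.414...
```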
The Dilemma of One-Hot#
One-hot encoding solves the problem of ordinal encoding but brings new troubles:
Curse of Dimensionality: For a vocabulary of 10,000 words, using one-hot encoding requires a 10,000-dimensional vector space. In reality, vocabularies are often much larger, making the encoding space extremely vast.
Sparsity: When the dimensionality is this high, each vector carries almost no information relative to its size. In a 10,000-dimensional vector, only one position is 1 while the rest are all 0, and this sparsity is wasteful for both computation and storage.
Semantic Gap: One-hot completely discards the true relational connections between entities. For example, "dog" and "cat" are both animals and semantically similar, but in one-hot encoding, their distance is the same as that between "dog" and "car."
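A small NumPy illustration of the last two points at the V=10,000 scale mentioned above (the word positions in the vocabulary are arbitrary):

```python
import numpy as np

V = 10_000
dog = np.zeros(V); dog[0] = 1   # "dog"
cat = np.zeros(V); cat[1] = 1   # "cat"
car = np.zeros(V); car[2] = 1   # "car"

print(np.count_nonzero(dog), "of", V)   # 1 of 10000: extremely sparse
# The distances are identical, so the encoding carries no hint that
# "dog" is semantically closer to "cat" than to "car".
print(np.linalg.norm(dog - cat), np.linalg.norm(dog - car))
```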
These issues are particularly prominent in language modeling. Whether using the exact matching method of n-gram or one-hot encoding in neural networks, neither can capture the semantic similarity between words, leading to poor model generalization.
The Solution Approach of NNLM#
The emergence of NNLM simultaneously addresses two problems: the curse of dimensionality in n-gram and the limitations of one-hot encoding.
The curse of dimensionality faced by n-gram: when a sequence of words has never appeared in the training data, traditional methods struggle to assign it a reasonable probability. It is like rote memorization: the model can only recall the sentence patterns it has seen and is helpless when it encounters new combinations.
The idea behind NNLM is clever: if a sequence of words consists of semantically similar words, even if this sequence has never been seen, it should still receive a high probability. For instance, if the model has seen "I love cats," it should also give a reasonable probability to "I adore dogs," because "love" and "adore," as well as "cats" and "dogs," are semantically similar.
To achieve this capability, the model needs to learn the similarities between words, which requires representing vocabulary in some continuous vector space rather than the discrete orthogonal vectors of one-hot.
The Working Mechanism of NNLM#
The architecture of NNLM is not complex: an embedding lookup followed by a small two-layer MLP. It learns two things simultaneously: the vector representation of each word, and the probability of the next word based on these vectors.
Embedding Layer: From Sparse to Dense Transformation#
Assuming we want to predict the next word in a sentence, NNLM first converts the one-hot vectors of the preceding words (for example, the first three words) into low-dimensional dense vectors through an embedding matrix C:

e = C · v

Assuming the vocabulary size is V=10,000 and the embedding dimension is d=300, C is a 300×10,000 matrix. When v is a one-hot vector (dimension 10,000×1, with only the j-th position being 1 and the rest being 0), the multiplication C · v effectively extracts the j-th column of C, resulting in a 300-dimensional vector e.
In other words, each column of the embedding matrix C corresponds to the vector representation of a word in the vocabulary. At the start of training, these vectors are randomly initialized, but as training progresses, semantically similar words gradually cluster together in this 300-dimensional space.
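A quick NumPy check of this column-extraction view, with the sizes used above and random values standing in for learned embeddings:

```python
import numpy as np

V, d = 10_000, 300
rng = np.random.default_rng(0)
C = rng.normal(size=(d, V))     # embedding matrix: one column per vocabulary word

j = 42                          # index of some word in the vocabulary
v = np.zeros(V)
v[j] = 1.0                      # its one-hot vector

# Multiplying C by a one-hot vector just selects the j-th column of C.
assert np.allclose(C @ v, C[:, j])
print((C @ v).shape)            # (300,)
```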
Concatenation Layer: Constructing Context Representation#
Next, these word vectors are concatenated to form a complete context representation:

x = [e1; e2; e3]

If each word vector is 300-dimensional, the concatenation of three words becomes a 900-dimensional vector x.
Why simple concatenation instead of addition or other operations?
Concatenation preserves the order information of the words, allowing the model to distinguish between different combinations like "A B C" and "B A C."
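A tiny NumPy sketch with made-up 4-dimensional word vectors, confirming that concatenation distinguishes "A B C" from "B A C" while plain addition does not:

```python
import numpy as np

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=4) for w in "ABC"}   # toy word vectors

abc = np.concatenate([emb[w] for w in "ABC"])  # context "A B C"
bac = np.concatenate([emb[w] for w in "BAC"])  # context "B A C"

print(np.allclose(abc, bac))                   # False: concatenation keeps word order
print(np.allclose(sum(emb[w] for w in "ABC"),
                  sum(emb[w] for w in "BAC"))) # True: addition throws it away
```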
Hidden Layer: Non-linear Feature Extraction#
The concatenated vector x is sent to the hidden layer for a non-linear transformation:

h = tanh(W · x + b)

The role of this hidden layer is to extract higher-level features. The linear transformation W · x + b maps the 900-dimensional input to the hidden-layer dimension (for example, 500 dimensions), and the tanh activation then introduces non-linearity.
Why is non-linearity needed?
Without an activation function, the entire network would be a series of linear transformations, equivalent to a single-layer linear model, significantly reducing its expressive power. The tanh function allows the model to learn complex patterns of word combinations.
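A short NumPy sketch of the hidden-layer transform with the illustrative sizes from above; it also checks the claim that, without an activation, stacking another linear layer collapses into a single linear map:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=900)          # concatenated context vector
W = rng.normal(size=(500, 900))   # hidden-layer weights (illustrative sizes)
b = np.zeros(500)

h = np.tanh(W @ x + b)            # non-linear hidden representation
print(h.shape)                    # (500,)

# Without tanh, a second linear layer U would collapse with W into one matrix:
U = rng.normal(size=(300, 500))
assert np.allclose(U @ (W @ x), (U @ W) @ x)
```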
Output Layer: Probability Distribution Calculation#
Finally, the output of the hidden layer is mapped to a score vector of vocabulary size, and the softmax function is applied to obtain the probability distribution:

P(w_i | context) = exp(z_i) / Σ_j exp(z_j)

Here, the output weight matrix U is a 500×10,000 matrix that maps the 500-dimensional output of the hidden layer to a 10,000-dimensional score vector z (one score per word in the vocabulary), where z_i is the logit value (score) corresponding to the i-th word. The softmax function then ensures that the probabilities of all words sum to 1.
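A minimal softmax implementation in NumPy; subtracting the maximum is only for numerical stability and does not change the result:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)        # stability trick: exp of large logits would overflow
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # toy scores for a 3-word vocabulary
probs = softmax(logits)
print(probs, probs.sum())            # the probabilities sum to 1
```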
Parameter Update#
During training, the cross-entropy loss function is used, aiming to maximize the probability of the correct word. Backpropagation simultaneously updates all parameters: the hidden-layer weights W and bias b, the output matrix U, and, most importantly, the embedding matrix C.
The update of the embedding matrix C is the most crucial part. To better predict the next word, the gradient will "encourage" semantically similar words to be closer in the vector space. For example, if "cat" and "dog" frequently appear in similar contexts, their vector representations will gradually move closer together.
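A small PyTorch sketch of this update path, using a tiny vocabulary, made-up word indices, and a random matrix standing in for the rest of the network. It shows the mechanics of the update: each training step produces gradients only for the embedding rows of the words that actually appeared, and over many such steps words seen in similar contexts drift toward similar vectors:

```python
import torch
import torch.nn.functional as F

V, d = 10, 4                                   # tiny vocabulary for illustration
C = torch.randn(V, d, requires_grad=True)      # embedding matrix (one row per word)

context = torch.tensor([2, 5, 7])              # hypothetical indices of 3 context words
target = torch.tensor([3])                     # hypothetical index of the correct next word

x = C[context].reshape(1, -1)                  # look up and concatenate: (1, 12)
logits = x @ torch.randn(3 * d, V)             # stand-in for the hidden and output layers
loss = F.cross_entropy(logits, target)         # softmax + negative log-likelihood
loss.backward()

# Only the rows of C that were looked up receive gradient, so this step
# nudges exactly the word vectors of the context words.
print(C.grad.abs().sum(dim=1))                 # non-zero only at rows 2, 5, 7
```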
Complete Process of Dimensional Changes#
The dimensional changes throughout the process are as follows (see the sketch after this list):
- Input: 3 one-hot vectors of 10,000 dimensions
- After embedding: 3 dense vectors of 300 dimensions
- After concatenation: 1 vector of 900 dimensions
- Hidden layer: 1 vector of 500 dimensions
- Output layer: 1 vector of 10,000 dimensions (the probability distribution)
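Putting the pieces together, here is a minimal PyTorch sketch of the whole forward pass with the sizes listed above (the class and layer names are illustrative, not from the original paper):

```python
import torch
import torch.nn as nn

class NNLM(nn.Module):
    """Bengio-style NNLM: embedding -> concatenation -> tanh hidden layer -> softmax logits."""
    def __init__(self, vocab_size=10_000, embed_dim=300, context_size=3, hidden_dim=500):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)               # the matrix C
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)  # 900 -> 500
        self.output = nn.Linear(hidden_dim, vocab_size)                # 500 -> 10,000

    def forward(self, context_ids):               # (batch, 3) word indices
        e = self.embed(context_ids)               # (batch, 3, 300)
        x = e.flatten(start_dim=1)                # (batch, 900) concatenation
        h = torch.tanh(self.hidden(x))            # (batch, 500)
        return self.output(h)                     # (batch, 10000) logits for softmax

model = NNLM()
logits = model(torch.randint(0, 10_000, (1, 3))) # one context of 3 word ids
print(logits.shape)                              # torch.Size([1, 10000])
```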
This architecture can learn semantic similarities because the model is forced to find suitable positions for each word in a limited embedding space, and the prediction task naturally clusters words that appear in similar contexts together.
Unexpected Gains#
At the time, Bengio and his colleagues' main goal was to improve language modeling, but after training the model they found that the embedding matrix C had learned something very interesting: similar words really did cluster together in the vector space. Word vectors of this kind were later even shown to support arithmetic, such as the famous king - man + woman ≈ queen (an observation popularized by Word2Vec a decade later).
This discovery solved the three main problems of one-hot encoding:
- Dimensionality Problem: Reduced from 10,000 dimensions to a few hundred dimensions
- Sparsity Problem: Transformed into dense vectors
- Semantic Gap Problem: Similar words are close together in space
Although the term "word embeddings" did not exist at that time, this idea quickly caught the attention of other researchers. In 2008, Collobert and Weston demonstrated the power of pre-trained word vectors in downstream tasks. By 2013, Mikolov and others released the Word2Vec toolkit specifically designed to learn word vectors, marking the true popularization of word embedding technology.
Word2Vec simplified the architecture of NNLM and proposed two models: CBOW (predicting the target word based on context) and Skip-gram (predicting context based on the target word). Skip-gram performs better on small datasets and rare words, while CBOW trains faster and is more effective on frequent words.
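For illustration only, both variants are available in the gensim library; the toy corpus and parameter values below are placeholders, and meaningful analogies require training on a large corpus:

```python
from gensim.models import Word2Vec

# A toy corpus; in practice this would be millions of tokenized sentences.
sentences = [["i", "love", "cats"], ["i", "adore", "dogs"], ["cats", "and", "dogs"]]

cbow = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)      # CBOW
skipgram = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)  # Skip-gram

print(skipgram.wv["cats"].shape)   # (100,): the learned word vector
# With vectors trained on a large corpus, the famous analogy becomes a query like:
# skipgram.wv.most_similar(positive=["king", "woman"], negative=["man"])
```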
Advantages and Limitations of NNLM#
Compared to n-gram, the greatest advantage of NNLM is its generalization ability. In experiments by Bengio and his colleagues, NNLM significantly outperformed the state-of-the-art trigram model on two text corpora. Moreover, the "by-product" of word embeddings later proved to be immensely valuable.
However, NNLM also has several issues.
First is computational complexity: training a neural network is much slower than simple n-gram counting. Second, it still relies on a fixed-length context window, essentially the same kind of Markov assumption as n-gram, so it cannot handle arbitrarily long contexts. Additionally, each word has only one fixed vector representation, which cannot address the polysemy problem.
Another practical issue is that computing softmax on large vocabularies is very slow.
These problems inspired subsequent research.
Some Thoughts#
From the perspective of representation learning, NNLM proves that good representations are key to solving problems. From ordinal encoding to one-hot to word embeddings, each step optimizes the way data is represented, and improvements in representation often lead to significant performance gains.
Technological progress is incremental, with each generation building on the previous one to solve problems. Ordinal encoding addressed the basic need for a numerical representation, one-hot resolved the semantic ambiguity issue, and word embeddings tackled the sparsity and semantic similarity issues. Moreover, sometimes the by-product matters more than the original goal, just as the word embeddings that fell out of NNLM eventually grew into Word2Vec.