Reinterpreting n-gram: A Bridge from Statistics to Deep Learning

Recently, while organizing my notes on early NLP techniques, I revisited the n-gram model. Most attention these days goes to architectures like the Transformer, but these "ancient" methods still have a lot to teach when you look back at them.

The Basic Idea of n-gram

The core idea of n-gram is quite intuitive: the probability of a word depends only on the few words immediately before it (specifically, the previous n-1 words). This sounds like an oversimplification of language, but when we speak, we do often choose the next word based on the words just before it.

Several commonly used n-gram models:

  • Unigram: Each word appears independently, without considering any context.
  • Bigram: Each word depends only on the 1 word before it.
  • Trigram: Each word depends on the 2 words before it.

At first this assumption may seem too crude, but in practice the simplification works well in many scenarios.
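As a concrete illustration (a sketch of my own, not from the original post), a few lines of Python show what "looking at the previous n-1 words" amounts to when extracting n-grams from a tokenized sentence:

```python
# Minimal sketch: extract the n-grams of a tokenized sentence.
def ngrams(tokens, n):
    """Return all n-grams (as tuples) contained in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["I", "love", "deep", "learning"]
print(ngrams(tokens, 1))  # unigrams: ('I',), ('love',), ('deep',), ('learning',)
print(ngrams(tokens, 2))  # bigrams:  ('I', 'love'), ('love', 'deep'), ('deep', 'learning')
print(ngrams(tokens, 3))  # trigrams: ('I', 'love', 'deep'), ('love', 'deep', 'learning')
```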

Markov Assumption

The theoretical foundation behind n-gram is the Markov assumption. Mathematically expressed as:

$$P(x_{i+1} \mid x_i, x_{i-1}, x_{i-2}, \dots) = P(x_{i+1} \mid x_i)$$

The key to this assumption is "memorylessness"—the next state depends only on the current state and not on the historical path.

For example, calculating the probability of "I love deep learning":

Bigram Calculation:
P("Ilovedeeplearning")=P("I")×P("love""I")×P("deep""love")×P("learning""deep")P("I love deep learning") = P("I") \times P("love"|"I") \times P("deep"|"love") \times P("learning"|"deep")

Trigram Calculation:
P("Ilovedeeplearning")=P("I")×P("love""I")×P("deep""Ilove")×P("learning""lovedeep")P("I love deep learning") = P("I") \times P("love"|"I") \times P("deep"|"I love") \times P("learning"|"love deep")

The probability of complex sentences is decomposed into simple conditional probability products, which is both elegant and practical.
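To make the decomposition concrete, here is a small sketch (the toy corpus is made up for illustration) that estimates each conditional probability by maximum likelihood, i.e. P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1}), and multiplies the factors along the sentence:

```python
from collections import Counter

# Hypothetical toy corpus, for illustration only.
corpus = [["I", "love", "deep", "learning"],
          ["I", "love", "machine", "learning"],
          ["I", "like", "deep", "learning"]]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(
    (sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1)
)

def bigram_sentence_prob(sentence):
    """P(w1) * prod_i P(w_i | w_{i-1}), using maximum-likelihood estimates."""
    total = sum(unigram_counts.values())
    p = unigram_counts[sentence[0]] / total                      # P(w1)
    for prev, cur in zip(sentence, sentence[1:]):
        p *= bigram_counts[(prev, cur)] / unigram_counts[prev]   # P(cur | prev)
    return p

print(bigram_sentence_prob(["I", "love", "deep", "learning"]))   # ~0.083
```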

Data Sparsity Problem

n-gram has a very practical problem: data sparsity. If a word combination never appears in the training data, its estimated probability is 0, which in turn makes the probability of the entire sentence 0.

To solve this problem, earlier researchers devised several smoothing strategies:

Add-One Smoothing

Add 1 to the count of every n-gram so that no probability is ever exactly zero.
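For a bigram model, the standard Laplace (add-one) estimate (not written out in the original post) is:

$$P(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) + 1}{C(w_{i-1}) + V}$$

where V is the vocabulary size; adding V to the denominator keeps the conditional probabilities summing to 1.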

Add-K Smoothing

This is a generalized version of add-one smoothing: the degree of smoothing is controlled by the value of K (see the formula after the list), and it is generally more effective than always adding 1.

  • Add-one smoothing is a special case of add-K smoothing; the value of K should be chosen with the training data in mind and tuned on a validation set.
  • When the amount of training data is small, the empirical distribution can be extreme, so a larger K is generally used to even out the probabilities of individual words.
  • When the amount of training data is large, a smaller K is used to stay close to the original statistical distribution.
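Written out for a bigram model, the add-K estimate is the same formula with 1 replaced by k (so k = 1 recovers add-one smoothing):

$$P(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) + k}{C(w_{i-1}) + kV}$$

A k between 0 and 1 smooths less aggressively than add-one, which is usually what a larger corpus calls for.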

Kneser-Ney Smoothing

A more complex but more effective method that looks at how many distinct contexts a word appears in, rather than only at its raw frequency.
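As a rough sketch of the idea (in its interpolated bigram form; the post itself does not give the formula), Kneser-Ney subtracts a fixed discount d from every observed bigram count and redistributes the freed mass according to how many distinct contexts a word follows:

$$P_{KN}(w_i \mid w_{i-1}) = \frac{\max(C(w_{i-1} w_i) - d, 0)}{C(w_{i-1})} + \lambda(w_{i-1}) \, P_{\text{cont}}(w_i), \qquad P_{\text{cont}}(w_i) = \frac{|\{w' : C(w' w_i) > 0\}|}{|\{(w', w'') : C(w' w'') > 0\}|}$$

Here λ(w_{i-1}) is the weight that makes the distribution sum to 1. The continuation probability P_cont is what sets Kneser-Ney apart: a word like "Francisco" may be frequent overall, but it follows very few distinct words, so it receives little probability mass in unseen contexts.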

These smoothing methods reflect the practical spirit of early NLP researchers—finding ways to solve problems without getting entangled in the perfection of theory.

Characteristics of n-gram

Simple and Effective: Although the assumptions are simple, they perform well in many tasks. Current input methods and spell checkers still use similar technologies.

Computationally Friendly: Both training and inference are fast, which was important in an era of limited computational resources.

Obvious Limitations: The inability to capture long-distance dependencies is a fundamental issue. In "It rained yesterday... I'm very tired today," for example, the semantic relationship spans many words and cannot be captured by an n-gram model.

From n-gram to Modern Models

Looking back now, n-gram actually laid an important foundation for later models:

  1. Probabilistic Modeling Approach: Viewing language as a probability distribution, this idea has continued to this day.
  2. Sequence Prediction Framework: The basic framework of predicting the future based on history.
  3. Statistical Learning Methods: Methodologies for learning patterns from data.

Even the latest large language models essentially do similar things—predicting the next token based on the preceding tokens. However, modern models can consider longer contexts and capture more complex dependencies.

Reflections on Technological Evolution

Understanding the historical context of technological development is very important. Many of today's "innovations" can actually be traced back to early technologies. Although n-gram is simple, the core idea it embodies—understanding and predicting language through statistical patterns—remains effective today.

Just like learning mathematics, mastering foundational concepts is more important than pursuing the latest and most complex technologies. n-gram may not be the strongest model, but it provides a good framework for thinking, helping to understand the essence of language modeling.

Technological progress is gradual, with each generation of technology improving upon the previous one. n-gram teaches us how to view language through a probabilistic lens, a perspective that is still valuable today. Moreover, simple methods often reveal deep patterns, and understanding the essence of problems is more important than pursuing complex solutions.

Hands-On Implementation

Refer to the GitHub repository
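The repository itself is not reproduced here, but as a self-contained sketch of the ideas above (the class name, toy corpus, and parameter values are mine, not from the original code), a bigram model with add-K smoothing might look like this:

```python
from collections import Counter

class BigramLM:
    """Minimal bigram language model with add-K smoothing (illustrative sketch)."""

    def __init__(self, k=1.0):
        self.k = k
        self.unigram_counts = Counter()   # counts of context words
        self.bigram_counts = Counter()    # counts of (prev, cur) pairs
        self.vocab = set()

    def fit(self, sentences):
        for sent in sentences:
            tokens = ["<s>"] + sent + ["</s>"]
            self.vocab.update(tokens)
            self.unigram_counts.update(tokens[:-1])             # each token serves as a context once
            self.bigram_counts.update(zip(tokens, tokens[1:]))

    def prob(self, word, prev):
        """Add-K smoothed conditional probability P(word | prev)."""
        v = len(self.vocab)
        return (self.bigram_counts[(prev, word)] + self.k) / (self.unigram_counts[prev] + self.k * v)

    def sentence_prob(self, sent):
        """Probability of a sentence as a product of smoothed bigram probabilities."""
        tokens = ["<s>"] + sent + ["</s>"]
        p = 1.0
        for prev, cur in zip(tokens, tokens[1:]):
            p *= self.prob(cur, prev)
        return p

# Toy usage with a made-up two-sentence corpus.
lm = BigramLM(k=0.5)
lm.fit([["I", "love", "deep", "learning"],
        ["I", "love", "machine", "learning"]])
print(lm.sentence_prob(["I", "love", "deep", "learning"]))
print(lm.sentence_prob(["I", "love", "deep", "thinking"]))  # unseen bigrams still get nonzero probability
```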
