Learning NLP Language Models with Real Data

Language Models (LMs) estimate the relative likelihood of different phrases and are useful in many different Natural Language Processing applications (NLP). For example, they have been used in Twitter Bots for ‘robot’ accounts to form their own sentences.

In this post, we will first formally define LMs and then demonstrate how they can be computed with real data. All the methods shown are demonstrated fully with code in the following Kaggle notebook.


Part 1: Defining Language Models

The goal of probabilistic language modelling is to calculate the probability of a sentence of sequence of words:

and can be used to find the probability of the next word in the sequence:

A model that computes either of these is called a Language Model.

Initial Method for Calculating Probabilities

Definition: Conditional Probability

let A and B be two events with P(B) =/= 0, the conditional probability of A given B is:

Definition: Chain Rule

In general cases, the formula is as follows:

The chain rule applied to compute the joined probability of words in a sequence is therefore:

For example:

This is a lot to calculate, could we not simply estimate this by counting and dividing the results as shown in the following formula:

In general, no! There are far to many possible sentences in this method that would need to be calculated and we would like have very sparse data making results unreliable.

Methods using the Markov Assumption

Definition: Markov Property

A stochastic process has the Markov property if the conditional probability distribution of future states of the process (conditional on both past and present states) depends only upon the present state, not on the sequence of events that preceded it. A process with this property is called a Markov process. (Wikipedia)

In other words, the probability of the next word can be estimated given only the previous k number of words.

For example, if k=1:

or if k=2:

General equation for the Markov Assumption, k=i :

N-gram Models

From the Markov Assumption, we can formally define N-gram models where k = n-1 as the following:

And the simplest versions of this are defined as the Unigram Model (k = 1) and the Bigram Model (k=2).

Unigram Model (k=1):

Bigram Model (k=2):

These equations can be extended to compute trigrams, 4-grams, 5-grams, etc. In general, this is an insufficient model of language because sentences often have long distance dependencies. For example, the subject of a sentence may be at the start whilst our next word to be predicted occurs mode than 10 words later.

Estimating Bigram Probabilities using the Maximum Likelihood Estimate:

Small Example

Given a corpus with the following three sentences, we would like to find the probability that “I” starts the sentence.

<s I am Sam /s>

<s Sam I am /s>

<s I do not like green eggs and ham /s>

where “<s” and “/s>” denote the start and end of the sentence respectively.

Therefore, we have:

I other words, of the three times the sentence started in our corpus, “I” appeared as the first word.

Part 2: Applying Language Models to Real Data

Data Source and Pre-Processing

For this demonstration, we will be using the IMDB large movie review dataset made available by Stanford. The data contains the rating given by the reviewer, the polarity and the full comment.

For example, the first negative comment here in full is the following:

First, we convert the full comments into their individua sentences, introduce notation for the start and end of sentence and clean the text by removing any punctuation and lowercase all words.

Unigram Model

As this is the easiest to compute, we can find the probability of each word occurring as use this to estimate the probability of the whole sentence occurring by the following:

Alternatively, we can compute this using logarithms as by log rules, the following holds true:

We do this because addition is typically computationally faster than multiplication.

For example, with the unigram model, we can calculate the probability of the following words.

Bigram Model

The unigram model is perhaps not accurate, therefore we introduce the bigram estimation instead. Applying this is somewhat more complex, first we find the co-occurrences of each word into a word-word matrix. The counts are then normalised by the counts of the previous word as shown in the following equation:

So, for example, if we wanted to improve our calculation for the P(a|to) shown previously, we first count the occurrences of (to,a) and divide this by the count of occurrences of (t0).

and likewise, if we were to change the initial word to ‘has’:

As mentioned, to properly utilise the bigram model we need to compute the word-word matrix for all word pair occurrences. With this, we can find the most likely word to follow the current one. However, this also requires an exceptional amount of time if the corpus is large and so it may be better to compute this for words as required rather than doing so exhaustively.

With this, we can find some examples of the most likely word to follow the given word:

Some words have many likely words to follow but others, such as “unnatural” have only one. This is likely due to there being few instances of the word occurring in the first place.

Forming Sentences with Probabilities

These computations can be use to form basic sentences. We will fix the start and end of the sentence to the respective notations “<s” and “/s>” and will vary the columns chosen from the word-word matrix so that the sentences become varied.

Even calculating the next word ‘on the fly’, the time required to do this is exceptionally large. For example, even one sentence on my machine took almost two hours to compute:

Checking some of these probabilities manually we find that the words are very likely, for example:

Tri-grams and beyond!

If we continue the estimation equation, we can form one for trigrams:

For example:

Therefore, the tri-gram phrase ‘to a movie’ is used more commonly than ‘to a film’ and is the choice our algorithm would take when forming sentences.

Part 3: Training and Testing the Language Models

The corpus used to train our LMs will impact the output predictions. Therefore we need to introduce a methodology for evaluating how well our trained LMs perform. The best trained LM is the one that can correctly predict the next word of sentences in an unseen test set.

This can be time consuming, to build multiple LMs for comparison could take hours to compute. Therefore, we introduce the intrinsic evaluation method of perplexity. In short perplexity is a measure of how well a probability distribution or probability model predicts a sample.

Definition: Perplexity

Perplexity is the inverse probability of the test set normalised by the number of words, more specifically can be defined by the following equation:

e.g. Suppose a sentence consists of random digits [0–9], what is the perplexity of this sentence by a model that assigns an equal probability (i.e. P=1/10) to each digit?

Entropy in Mechanics and Information Theory

In mechanics, Boltzmann defined the entropy of a system is related to the natural log of the number of micro-states:

W is the number of micro-states and is calculated by:

where n is the number of positions and x is the number of molecules.

In information Theory, entropy (denoted H(X)) of a random variable X is the expected log probability defined by:

and is a measure of uncertainty.

In other words, entropy is the number of possible states that a system can be.

Example: Entropy of a bias coin toss

Say we have the probabilities of heads and tails in a coin toss defined by:

  • P(heads) = P(heads) = p
  • P(tails) = 1−P(tails) = 1−p

Then the entropy of this is:

If the coin is fair, i.e. p = 0.5, then we have:

The full entropy distribution over varying bias probabilities is shown below.

Entropy of Language

Perplexity and Entropy

For example:

Part 4: Challenges in Fitting Language Models

Due to the output of LMs being dependent on the training corpus, N-grams only work well if the training corpus is similar to the testing dataset and we risk overfitting in training.

As with any machine learning method, we would like results that are generalisable to new information.

Even harder is how we deal with words that do not even appear in training but are in the test data.

Dealing with Zero Counts in Training: Laplace +1 Smoothing

To deal with words that are unseen in training we can introduce add-one smoothing. To do this, we simply add one to the count of each word.

This shifts the distribution slightly and is often used in text classification and domains where the number of zeros isn’t large. However, this is not often used for n-grams, instead we use more complex methods.

First, let us create a dummy training corpus and test set from the original data. With this in our example, we found that 25% of the words contained in the small test set did not appear in our limited corpus.

Therefore, we applying Laplace +1 smoothing by adding these unseen words to the training set and add 1 to all counts:

Further Smoothing Methods

Laplace +1 smoothing is used in text classification and domains where the number of zeros isn’t large. However, it is not often used for n-grams, some better smoothing methods for n-grams are:

  • Add-k Laplace Smoothing
  • Good-Turing
  • Kenser-Ney
  • Witten-Bell

Part 5: Selecting the Language Model to Use

We have introduced the first three LMs (unigram, bigram and trigram) but which is best to use?

Trigrams are generally provide better outputs than bigrams and bigrams provide better outputs than unigrams but as we increase the complexity the computation time becomes increasingly large. Furthermore, the amount of data available decreases as we increase n (i.e. there will be far fewer next words available in a 10-gram than a bigram model).

Back-off Method

Use trigrams (or higher n model) if there is good evidence to, else use bigrams (or other simpler n-gram model).


In interpolation, we use a mixture of n-gram models.

Simple Interpolation:


To calculate the lambdas, a held-out subset of the corpus is used and parameters are tried until a combination that maximises the probability of the held out data is found.

Small Example

Say we are given the following corpus:

  • <s I am Sam /s>
  • <s Sam I am /s>
  • <s I am Sam /s>
  • <s I do not like green eggs and Sam /s>

Example with IMDB Data

We again apply the formula:

and with the corpus defined previously from the IMDB source, we take a small subset (10%) of this as the ‘hold-out’ set.

Testing a range of possible lambda values (noting that λ1 + λ2 = 1), we find the following:

Therefore, the optimal lambda values for this example are:

λ1 = 0

λ2 = 1

I hope this provides you with a decent introduction to language models and the code assists with your learning.



Leave a Reply