
Vocabulary Laws

Contents

  1. Introduction
    1. Frequency of the words
  2. Zipf’s Law
    1. Ordinary Least Squares
  3. Herdan’s Law / Heaps’ Law
    1. Optimization
  4. Activities
    1. Zipf’s Law - \(f=\frac{c}{r}\)
    2. Zipf’s Law - \(f=\frac{c}{r^\alpha}\)
    3. Heaps’ Law - \(\mid v \mid = kn^\beta\)
    4. Comparison of Spanish-speaking countries

Libraries used

from microtc.utils import tweet_iterator
from EvoMSA.tests.test_base import TWEETS
from text_models import Vocabulary
from text_models.utils import date_range
from wordcloud import WordCloud as WC
import numpy as np
from collections import Counter
from matplotlib import pylab as plt
from scipy.optimize import minimize
from joblib import Parallel, delayed
from tqdm import tqdm

Installing external libraries

pip install microtc
pip install evomsa
pip install text_models

Introduction

The journey of natural language processing starts with the simple procedure of counting words; with it, we will be able to model some characteristics of the languages, identify patterns in a text, and perform exploratory data analysis on a collection of Tweets.

At this point, let us define a word as a sequence of characters bounded by a space - this is a shallow definition; however, it is suitable for most words written in Latin languages and English.
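As an illustration of this definition (a toy sentence, not taken from the corpus), splitting a string on whitespace yields the words; note that punctuation stays attached to the word, which is one reason the definition is shallow.

text = 'hello world, this is a toy sentence'
text.split()
['hello', 'world,', 'this', 'is', 'a', 'toy', 'sentence']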

Frequency of the words

The frequency of the words in a document can be computed using a dictionary. A dictionary is a data structure that associates a key with a value. The following code uses a dictionary (variable words) to count the word frequencies of texts stored in JSON format, one per line. It uses the function tweet_iterator, which iterates over the file, reading one line at a time and converting each JSON line into a dictionary where the key text contains the text.

words = dict()
for tw in tweet_iterator(TWEETS):
    text = tw['text']
    for w in text.split():
        key = w.strip()
        try:
            words[key] += 1
        except KeyError:
            words[key] = 1

The split method returns a list of strings, splitting on whitespace, which follows the definition of a word given above. Please note the use of an exception to handle the case where the key is not yet in the dictionary. As an example, the word si (yes in Spanish) appears 29 times in the corpus.

words['si']
29
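An equivalent formulation, shown here only as a sketch, replaces the exception handling with the dictionary's get method.

words = dict()
for tw in tweet_iterator(TWEETS):
    for w in tw['text'].split():
        key = w.strip()
        # get returns 0 when the key is not yet in the dictionary
        words[key] = words.get(key, 0) + 1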

This counting pattern is so common that it is implemented in the Counter class of the collections package. The following code re-implements the procedure using Counter; the key element is the update method, and a list comprehension keeps the code short.

words = Counter()
for tw in tweet_iterator(TWEETS):
    text = tw['text']
    words.update([x.strip() for x in text.split()])

Counter has another helpful method to analyze the frequencies; in particular, the most_common method returns the most frequent keywords. For example, the following code gets the five most common keywords.

words.most_common(5)
[('de', 299), ('que', 282), ('a', 205), ('la', 182), ('el', 147)]

These correspond to the English words of, that, to, the, and the, respectively.

Zipf’s Law

The word frequency makes it possible to define some empirical characteristics of the language. These characteristics are essential to understand the challenges of developing an algorithm capable of understanding language.

Let us start with Zipf’s law. The law relates the frequency of a word to its rank. In an ordered set, the rank corresponds to the element’s position, e.g., the first element has rank one, the second has rank two, and so on. Explicitly, the relationship is defined as \(f \cdot r = c\), where \(f\) is the frequency, \(r\) is the rank, and \(c\) is a constant. Equivalently, the frequency is \(f=\frac{c}{r}\); as can be seen, when the rank equals the constant, the frequency is one, meaning that most words are infrequent and that there are only a few frequent words.
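As a quick check on the toy corpus (a sketch; the law is only a rough approximation, so the products are not expected to be exactly equal), the product \(f \cdot r\) can be computed for the most frequent words.

# product of frequency and rank for the five most frequent words
[(w, f * r) for r, (w, f) in enumerate(words.most_common(5), start=1)]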

The following figure depicts the described characteristic. It shows a scatter plot between rank and frequency.

Zipf's Law

The frequency and the rank can be computed with the following two lines using the dataset of the previous example.

freq = [f for _, f  in words.most_common()]
rank = range(1, len(freq) + 1)

Once the frequency and the rank are computed, the figure is created using the following code.

plt.plot(rank, freq, '.')
plt.grid()
plt.xlabel('Rank')
plt.ylabel('Frequency')
plt.tight_layout()

The previous figure does not directly depict the relation \(f=\frac{c}{r}\); to show it, let us draw a scatter plot of the inverse rank against the frequency. This can be obtained by changing the following lines in the previous procedure; numpy is used in the following code to create the array of inverse ranks.

freq = [f for _, f  in words.most_common()]
rank = 1 / np.arange(1, len(freq) + 1)

As observed from the following figure, Zipf’s Law is a rough estimate of the relationship between rank and frequency. Nonetheless, it provides all the elements to understand the importance of this relationship.

Log Zipf's Law

Ordinary Least Squares

The missing step is to estimate the value of \(c\). The constant \(c\) can be calculated using ordinary least squares (OLS). The idea is to create a system of equations where the unknown is \(c\) and the dependent variable is \(f\). This can be represented in matrix notation as \(A \cdot \mathbf c = \mathbf f\), where \(A \in \mathbb R^{N \times 1}\) is composed of \(N\) inverse-rank measurements, \(\mathbf c \in \mathbb R\) is the parameter to be identified, and \(\mathbf f \in \mathbb R^N\) is a column vector containing the frequency measurements.

The coefficients can be computed using the function np.linalg.lstsq described in the following code.

X = np.atleast_2d(rank).T
c = np.linalg.lstsq(X, freq, rcond=None)[0]
c
array([461.40751913])
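Since the model has a single coefficient, the OLS solution also has a closed form given by the normal equation, \(c = \frac{A^\intercal \mathbf f}{A^\intercal A}\); the following sketch computes it directly and should give the same value as np.linalg.lstsq.

# closed-form OLS solution for a single coefficient (normal equation);
# rank holds the inverse ranks computed above
np.dot(rank, freq) / np.dot(rank, rank)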

Once \(c\) has been identified, the model can be used to predict the frequency; the following figure presents the measurements and the points predicted with the identified coefficient.

Zipf's Law - model

The previous figure was created with the following code; the variable hy contains the predicted frequency.

hy = np.dot(X, c)
plt.plot(rank, freq, '.')
plt.plot(rank, hy)
plt.legend(['Measured', 'Predicted'])
plt.grid()
plt.xlabel('Inverse Rank')
plt.ylabel('Frequency')
plt.tight_layout()

Herdan’s Law / Heaps’ Law

As a language is used, it evolves and new words are incorporated; the relationship between the vocabulary size and the number of words (tokens) is expressed by Heaps’ Law. Let \(\mid v \mid\) represent the vocabulary size and \(n\) the number of words; then the relationship between these quantities is \(\mid v \mid = k n^\beta\), where \(k\) and \(\beta\) are the parameters that need to be identified.

The following figure depicts the relation between \(n\) and \(\mid v \mid\) using the dataset of the previous examples.

Heaps' Law

Taking the procedure used in Zipf’s Law as a starting point, it is only necessary to record, after processing each line, the number of words and the vocabulary size.

words = Counter()
tokens_voc = list()
for tw in tweet_iterator(TWEETS):
    text = tw['text']
    words.update([x.strip() for x in text.split()])
    tokens_voc.append([sum(list(words.values())),
                       len(words)])

n = [x[0] for x in tokens_voc]
v = [x[1] for x in tokens_voc]
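The figure above can be reproduced with a sketch analogous to the plotting code used for Zipf’s Law (the axis labels are assumptions).

plt.plot(n, v, '.')
plt.grid()
plt.xlabel('Number of words (n)')
plt.ylabel('Vocabulary size (|v|)')
plt.tight_layout()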

Once the points (number of words and vocabulary size) are measured, it is time to estimate the parameters \(k\) and \(\beta\). As can be observed, Heaps’ Law cannot be expressed as a system of linear equations of the form \(A \cdot [k, \beta]^\intercal = \mid \mathbf v \mid\) because the model is not linear in \(\beta\); consequently, these parameters cannot be estimated with OLS.

Optimization

The values of these parameters can be estimated by posing them as an optimization problem. An optimization problem minimizes (maximizes) an objective function. The goal is to find the inputs corresponding to the minimum (maximum) of the function, i.e.,

\[\min_{\mathbf w \in \Omega} f(\mathbf w),\]

where \(\mathbf w\) are the inputs of the function \(f\), and \(\Omega\) is the search space, namely the set containing all feasible values of the inputs, e.g., \(\Omega = \{\mathbf w \mid \mathbf w \in \mathbb R^d, \forall_i \mathbf w_i \geq 0\}\) represents all the vectors of dimension \(d\) whose components are greater than or equal to zero.

There are many optimization algorithms; some are designed to tackle all possible optimization problems, and others are tailored for a particular class of problems. For example, OLS is an optimization algorithm where the objective function \(f\) is defined as:

\[f(\mathbf w) = \sum_{i=1}^N (y_i - \mathbf a_i \cdot \mathbf w)^2,\]

where \(\mathbf w\) is a vector containing the parameters, \(\mathbf a_i\) is the \(i\)-th measurement of the independent variables, and \(y_i\) is the corresponding dependent variable. In the Zipf’s Law example, \(\mathbf w\) corresponds to the parameter \(c\), \(\mathbf a_i\) is the \(i\)-th measurement of the inverse rank, and \(y_i\) is the corresponding frequency.

Let us analyze the previous objective function in more detail. The sum goes over all the measurements; there are \(N\) pairs of observations composed of the response (dependent variable) and the independent variables. For each observation \(i\), the squared error is computed between the dependent variable (\(y_i\)) and the dot product of the independent variables (\(\mathbf a_i\)) and the parameters (\(\mathbf w\)).

This notation is specific to OLS; however, it is helpful for defining the general case. The first step is to define the set \(\mathcal D\) composed of the \(N\) observations, i.e., \(\mathcal D=\{(\mathbf x_i, y_i) \mid 1 \leq i \leq N\}\), where \(y_i\) is the response variable and \(\mathbf x_i\) contains the independent variables. The second step is to replace the squared error with a generic loss function \(L\), e.g., \(L(y, \hat y) = (y - \hat y)^2\). The last step is to abstract the model, that is, to replace \(\mathbf a_i \cdot \mathbf w\) with a function \(g(\mathbf x_i) = \mathbf a_i \cdot \mathbf w\). Combining these components, the general version of the OLS optimization problem can be expressed as:

\[\min_{g \in \Omega} \sum_{(\mathbf x, y) \in \mathcal D} L(y, g(\mathbf x)),\]

where \(g\) is a function in the search space \(\Omega\), \(L\) is a loss function, and \(\mathcal D\) contains the \(N\) pairs of observations. This optimization problem is known in machine learning as supervised learning.

Returning to the Heaps’ Law problem, the goal is to identify the coefficients \(k\) and \(\beta\) of the model \(\mid v \mid = kn^\beta\). That is, the function \(g\) can be defined as \(g_{k,\beta}(n) = kn^\beta\), and using the squared error as \(L\), the optimization problem is:

\[\min_{(k, \beta) \in \mathbb R^2} \sum_{(n, \mid v \mid) \in \mathcal D} (\mid v \mid - kn^\beta)^2.\]

As mentioned previously, there are many optimization algorithms; several are available through the function scipy.optimize.minimize. The function accepts as arguments the objective function and the initial values of the parameters.
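As a toy illustration (not part of the running example), minimize can solve the one-dimensional problem \(\min_w (w - 2)^2\), whose minimum is at \(w=2\).

# minimize (w - 2)^2 starting from w = 0; res.x is approximately [2.]
res = minimize(lambda w: (w[0] - 2)**2, np.array([0.0]))
res.x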

The following code solves the optimization problem using the minimize function.

def f(w, y, x):
    k, beta = w
    return ((y - k * x**beta)**2).sum()

n = np.array(n)
v = np.array(v)
res = minimize(f, np.array([1, 0.5]), (v, n))
k, beta = res.x
k, beta
(2.351980550243793, 0.8231559504194587)

Once \(k\) and \(\beta\) have been identified, it is possible to use them in the Heaps’ Law and produce a figure containing the measurements and predicted values.

Heaps' Law - model
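A sketch that could generate a figure like the one above, using the identified coefficients (the legend and axis labels are assumptions).

hy = k * n**beta
plt.plot(n, v, '.')
plt.plot(n, hy)
plt.legend(['Measured', 'Predicted'])
plt.grid()
plt.xlabel('Number of words (n)')
plt.ylabel('Vocabulary size (|v|)')
plt.tight_layout()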

Activities

Zipf’s Law and Heaps’ Law model two language characteristics; these characteristics are summarized in the values of the parameters \(c\), and \(k\) and \(\beta\), respectively. So far, we have identified these parameters using a toy dataset. In order to illustrate how these parameters can be used to compare languages, we are going to use a dataset of words (and bigrams) available in the library text_models, described in A Python library for exploratory data analysis on twitter data based on tokens and aggregated origin–destination information.

text_models contains the frequency of words and bigrams of words measured from tweets written in different languages (Arabic, Spanish, and English) and collected from the Twitter open stream.

The first step is to retrieve the information from a particular date, location, and language. As an example, let us retrieve the data corresponding to January 10, 2022, in Mexico and written in Spanish, which can be done with the following code.

date = dict(year=2022, month=1, day=10)
voc = Vocabulary(date, lang='Es', country='MX')
words = {k: v for k, v in voc.voc.items() if not k.count('~')}

The variable voc contains the frequency of the words and bigrams, as well as some useful functions to distill the information and perform an exploratory data analysis. The raw data is stored in a dictionary in the attribute voc.voc. The bigrams are identified with the character ‘~’, e.g., ‘of~the’ corresponds to the bigram of the words of and the. Finally, the variable words contains only the frequency of the words.
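For instance, the complementary filter keeps only the bigrams (a small sketch using the same ‘~’ convention).

# bigrams are the keys containing the '~' separator
bigrams = {k: v for k, v in voc.voc.items() if k.count('~')}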

A traditional approach to explore the information of a list of words and their frequencies is to create a word cloud. The following figure is the word cloud of the frequency of the words retrieved from text_models.

Word Cloud (MX)

Code of the word cloud
wc = WC().generate_from_frequencies(words)
plt.imshow(wc)
plt.axis('off')
plt.tight_layout()

Zipf’s Law - \(f=\frac{c}{r}\)

We are now in a position to identify Zipf’s Law coefficient from a dataset retrieved with text_models. In this part, the goal is to estimate \(c\) for all the Spanish-speaking countries on a particular date, January 10, 2022. On this occasion, the library joblib is used to parallelize the code, in particular through the class Parallel. As can be seen, the variable words is a list of dictionaries containing the word frequencies for all the countries.

countries = ['MX', 'CO', 'ES', 'AR',
             'PE', 'VE', 'CL', 'EC',
             'GT', 'CU', 'BO', 'DO', 
             'HN', 'PY', 'SV', 'NI', 
             'CR', 'PA', 'UY']
vocs = Parallel(n_jobs=-1)(delayed(Vocabulary)(date,
                                               lang='Es',
                                               country=country)
                               for country in tqdm(countries))
words = [{k: v for k, v in voc.voc.items() if not k.count('~')}
         for voc in vocs]

The next step is to estimate the \(c\) coefficient of Zipf’s Law. The function zipf identifies the coefficient using OLS from a dictionary of word frequencies. The variable zipf_c contains the coefficient values for each country; we also compute the number of words seen, which is stored in the variable tokens.

def zipf(data):
    freq = [f for _, f  in Counter(data).most_common()]
    rank = 1 / np.arange(1, len(freq) + 1)
    X = np.atleast_2d(rank).T
    return np.linalg.lstsq(X, freq, rcond=None)[0]

zipf_c = [zipf(w) for w in words]
tokens = [sum(list(w.values())) for w in words]

The following table presents the coefficient \(c\) and the number of words (\(n\)) for all the countries, sorted by \(c\) in descending order. It can be seen that \(c\) and \(n\) are highly correlated, so this version of Zipf’s Law only characterizes the total number of words seen in the dataset.

| Country | \(c\) | \(n\) |
|---------|-------|-------|
| AR | 28383.27 | 405802 |
| ES | 21392.39 | 267724 |
| MX | 20166.90 | 269211 |
| CO | 11549.12 | 153952 |
| CL | 8324.78 | 114394 |
| VE | 4704.40 | 56019 |
| UY | 3714.85 | 52058 |
| PE | 2996.51 | 39945 |
| EC | 2768.21 | 35822 |
| PY | 1914.56 | 24544 |
| DO | 1843.01 | 22969 |
| PA | 1575.23 | 20726 |
| GT | 1312.07 | 17237 |
| NI | 756.56 | 8967 |
| HN | 704.90 | 8243 |
| CR | 607.04 | 6994 |
| CU | 604.96 | 6989 |
| SV | 513.01 | 5703 |
| BO | 416.46 | 4306 |

The following table shows the correlation between \(c\) and \(n\); it is observed that the correlation between them is higher than \(0.99\), which means that measuring the number of words is enough and there is no need to estimate \(c\).

|       | \(c\) | \(n\) |
|-------|-------|-------|
| \(c\) | 1.0000 | 0.9979 |
| \(n\) | 0.9979 | 1.0000 |

The correlation is obtained using the following code.

X = np.array([(b[0], c) for b, c in zip(zipf_c, tokens)])
corr = np.corrcoef(X.T)

Zipf’s Law - \(f=\frac{c}{r^\alpha}\)

Zipf’s Law has different versions; we started with the simplest one; however, that version does not provide more information than the one contained in the number of words. A straightforward improvement is to replace a constant in Zipf’s Law with a parameter and then use an algorithm to estimate it. Zipf’s Law can be written as \(f=\frac{c}{r^1}\), where the exponent is \(1\); replacing \(1\) with \(\alpha\) gives \(f=\frac{c}{r^\alpha}\). Without considering the meaning of the variables, this latter version of Zipf’s Law is similar to the Heaps’ Law formulation; both have a multiplying constant (\(c\) and \(k\)) and an exponent (\(\alpha\) and \(\beta\)).

Coefficients \(c\) and \(\alpha\) can be estimated using the dataset stored in the variable words; the procedure is a combination of the approach used to identify \(c\) in the previous formulation and the algorithm described to estimate \(k\) and \(\beta\) of Heaps’ Law, as sketched below.
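A sketch of this combination; it mirrors the minimize-based objective used for Heaps’ Law, and the function name zipf_alpha and the initial values are assumptions.

def zipf_alpha(data):
    # frequency and rank, as in the OLS-based zipf function
    freq = np.array([f for _, f in Counter(data).most_common()])
    rank = np.arange(1, len(freq) + 1)
    # squared error of the model f = c / r**alpha
    def error(w):
        c, alpha = w
        return ((freq - c / rank**alpha)**2).sum()
    return minimize(error, np.array([float(freq[0]), 1.0])).x

zipf_alpha_c = [zipf_alpha(w) for w in words]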

The following table shows the values of the coefficients \(c\), \(\alpha\), and the number of words for the different countries.

| Country | \(c\) | \(\alpha\) | \(n\) |
|---------|-------|-----------|-------|
| AR | 24606.53 | 0.7627 | 405802 |
| MX | 17533.38 | 0.7715 | 269211 |
| ES | 18745.24 | 0.7842 | 267724 |
| CO | 10027.21 | 0.7691 | 153952 |
| CL | 7231.78 | 0.7716 | 114394 |
| VE | 4075.39 | 0.7695 | 56019 |
| UY | 3175.76 | 0.7432 | 52058 |
| PE | 2592.78 | 0.7690 | 39945 |
| EC | 2391.38 | 0.7672 | 35822 |
| PY | 1636.43 | 0.7488 | 24544 |
| DO | 1570.40 | 0.7425 | 22969 |
| PA | 1347.33 | 0.7514 | 20726 |
| GT | 1114.08 | 0.7408 | 17237 |
| NI | 645.04 | 0.7541 | 8967 |
| HN | 596.58 | 0.7379 | 8243 |
| CR | 512.80 | 0.7349 | 6994 |
| CU | 514.40 | 0.7470 | 6989 |
| SV | 435.75 | 0.7460 | 5703 |
| BO | 352.77 | 0.7409 | 4306 |

The correlation of these variables can be seen in the following table. It is observed that \(c\) is highly correlated with the number of words \(n\). On the other hand, the lower correlation between \(\alpha\) and \(n\) indicates that these variables carry different information.

|           | \(c\) | \(\alpha\) | \(n\) |
|-----------|-------|-----------|-------|
| \(c\) | 1.0000 | 0.6588 | 0.9975 |
| \(\alpha\) | 0.6588 | 1.0000 | 0.6308 |
| \(n\) | 0.9975 | 0.6308 | 1.0000 |

Heaps’ Law - \(\mid v \mid = kn^\beta\)

Heaps’ Law models the relationship between the number of words (\(n\)) and the size of the vocabulary \(\mid v \mid\). In order to estimate the parameters, it is necessary to obtain a dataset where \(n\) varies and, for each value, \(\mid v \mid\) is computed. The dataset used to estimate the coefficients of Zipf’s Law contains \(n\) and \(\mid v \mid\) for a particular date, country, and language; \(\mid v \mid\) corresponds to the length of the dictionary, that is, the number of different words, namely the number of keys in it. Consequently, the dataset needed to estimate the coefficients of Heaps’ Law can be obtained by measuring \(n\) and \(\mid v \mid\) on different days.

The dataset can be obtained by first wrapping the code used to retrieve the words in a helper function, get_words.

COUNTRIES = ['MX', 'CO', 'ES', 'AR',
             'PE', 'VE', 'CL', 'EC',
             'GT', 'CU', 'BO', 'DO', 
             'HN', 'PY', 'SV', 'NI', 
             'CR', 'PA', 'UY'] 

def get_words(date=dict(year=2022, month=1, day=10)):
    vocs = Parallel(n_jobs=-1)(delayed(Vocabulary)(date,
                                                   lang='Es',
                                                   country=country)
                                   for country in tqdm(COUNTRIES))
    words = [{k: v for k, v in voc.voc.items() if not k.count('~')}
             for voc in vocs]
    return words

The function get_words retrieves the words for all the countries on a particular date; consequently, the next step is to decide the dates used to create the dataset. The following code uses the function date_range to create a list of dates starting on November 1, 2021, and finishing on November 30, 2021 (inclusive). The variable words is a list where each element corresponds to a different day and contains the word frequencies of all the countries; however, to estimate the coefficients, a dataset per country covering all the days is needed. The last line creates the desired dataset; it is a list where each element contains the word frequencies, for all the days, of a particular country.

init = dict(year=2021, month=11, day=1)
end = dict(year=2021, month=11, day=30)
dates = date_range(init, end)
words = [get_words(d) for d in dates]
ww = [[w[index] for w in words] for index in range(len(COUNTRIES))]

The information stored in the variable ww contains, for each country, the information needed to create \(n\) and \(\mid v \mid\). The function voc_tokens computes \(n\) and \(\mid v \mid\) for the dataset given, i.e., an element of ww. It uses the variable cnt as an accumulator; note that the pair of vocabulary size and number of tokens, [len(cnt), sum(cnt.values())], is computed every time cnt is updated.

def voc_tokens(data):
    # accumulate the word frequencies of the first day
    cnt = Counter(data[0])
    output = [[len(cnt), sum(list(cnt.values()))]]
    for x in data[1:]:
        # add the frequencies of the next day and record the
        # vocabulary size and the number of tokens seen so far
        cnt.update(x)
        _ = [len(cnt), sum(list(cnt.values()))]
        output.append(_)
    output = np.array(output)
    # return the number of tokens (n) and the vocabulary size (|v|),
    # matching the unpacking order used below
    return output[:, 1], output[:, 0]

For example, voc_tokens can be used to create \(n\) and \(\mid v \mid\) of Mexico as:

n_mx, v_mx = voc_tokens(ww[0])
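With these series, the coefficients of a single country can be estimated with the same minimize-based procedure used earlier; a sketch for Mexico, reusing the objective function f defined in the Optimization section (the names res_mx, k_mx, and beta_mx are assumptions).

res_mx = minimize(f, np.array([1, 0.5]), (v_mx, n_mx))
k_mx, beta_mx = res_mx.x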

Having described how to obtain \(n\) and \(\mid v \mid\) for each country, the same procedure estimates each country’s \(k\) and \(\beta\) coefficients. The following table shows the values of these coefficients for each country; it also includes the maximum number of words (\(\max n\)) in the last column.

| Country | \(k\) | \(\beta\) | \(\max n\) |
|---------|-------|-----------|------------|
| AR | 10.19 | 0.5262 | 12883620 |
| ES | 2.40 | 0.6316 | 10495462 |
| MX | 4.79 | 0.5749 | 9738471 |
| CO | 8.31 | 0.5311 | 5987400 |
| CL | 11.13 | 0.5199 | 5689730 |
| UY | 10.68 | 0.5085 | 1659338 |
| EC | 7.46 | 0.5319 | 1491032 |
| VE | 4.20 | 0.5651 | 1387172 |
| PE | 11.09 | 0.5090 | 1249352 |
| PY | 9.28 | 0.5091 | 969505 |
| PA | 8.49 | 0.5209 | 885552 |
| DO | 6.15 | 0.5322 | 749857 |
| GT | 8.01 | 0.5135 | 556313 |
| HN | 4.94 | 0.5445 | 422616 |
| CU | 4.78 | 0.5611 | 305530 |
| NI | 9.15 | 0.4918 | 243830 |
| SV | 3.35 | 0.5703 | 240215 |
| CR | 5.71 | 0.5276 | 223434 |
| BO | 2.55 | 0.5958 | 159700 |

As we have seen previously, the correlation between the variables reveals the relation between the coefficients; it is what allowed us to realize that the first version of Zipf’s Law was not needed. The following table shows the correlation between \(k\), \(\beta\), and \(\max n\). It is observed that \(k\) and \(\beta\) are only weakly correlated with \(\max n\), and that there is a negative correlation between \(k\) and \(\beta\).

|            | \(k\) | \(\beta\) | \(\max n\) |
|------------|-------|-----------|------------|
| \(k\) | 1.0000 | -0.8643 | 0.0587 |
| \(\beta\) | -0.8643 | 1.0000 | 0.3235 |
| \(\max n\) | 0.0587 | 0.3235 | 1.0000 |

Comparison of Spanish-speaking countries

Coefficients \(\alpha\) and \(\beta\) characterize the vocabulary of a particular region, language, and period. These coefficients can be used to compare the similarities between the countries. A simple approach is to create a scatter plot of \(\alpha\) against \(\beta\). The following figure shows the scatter plot for the Spanish-speaking countries.

Spanish - $$\alpha$$, $$\beta$$
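A sketch that could produce a figure like the one above, assuming the per-country coefficients are collected in arrays alpha (from the \(f=\frac{c}{r^\alpha}\) fit) and beta (from the Heaps’ Law fit), both following the order of COUNTRIES; the two arrays are hypothetical names, not defined in the code above.

# alpha and beta are hypothetical arrays with one value per country
plt.plot(alpha, beta, '.')
for country, x, y in zip(COUNTRIES, alpha, beta):
    plt.annotate(country, (x, y))
plt.grid()
plt.xlabel(r'$\alpha$')
plt.ylabel(r'$\beta$')
plt.tight_layout()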