Vocabulary Laws
Libraries used
from microtc.utils import tweet_iterator
from EvoMSA.tests.test_base import TWEETS
from text_models import Vocabulary
from text_models.utils import date_range
from wordcloud import WordCloud as WC
import numpy as np
from collections import Counter
from matplotlib import pylab as plt
from scipy.optimize import minimize
from joblib import Parallel, delayed
from tqdm import tqdm
Installing external libraries
pip install microtc
pip install evomsa
pip install text_models
Introduction
The journey of natural language processing starts with the simple procedure of counting words; with it, we will be able to model some characteristics of the languages, identify patterns in a text, and perform exploratory data analysis on a collection of Tweets.
At this point, let us define a word as a sequence of characters bounded by spaces; this is a shallow definition, but it is suitable for most words written in Latin languages and English.
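For instance, Python's split method, used throughout this section, already implements this definition; note in the following sketch that punctuation stays attached to the word, which is one reason the definition is shallow.
# Splitting on whitespace realizes the working definition of a word;
# punctuation remains attached to its neighboring characters.
'Hola, buenos días a todos.'.split()
['Hola,', 'buenos', 'días', 'a', 'todos.']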
Frequency of the words
The frequency of the words in a document can be computed using a dictionary. A dictionary is a data structure that associates a keyword with a value. The following code uses a dictionary (variable words) to count the word frequencies of texts stored in JSON format, one per line. It uses the function tweet_iterator, which iterates over the file, reading one line at a time and converting the JSON into a dictionary whose keyword text contains the text.
words = dict()
for tw in tweet_iterator(TWEETS):
    text = tw['text']
    for w in text.split():
        key = w.strip()
        try:
            words[key] += 1
        except KeyError:
            words[key] = 1
Method split returns a list of strings where the split occurs on the space character, which follows the definition of a word. Please note the use of an exception to handle the case where the keyword is not in the dictionary. As an example, the word si (yes in Spanish) appears 29 times in the corpus.
words['si']
29
The counting pattern is frequent, so it is implemented in the Counter class of the collections package; the following code implements the method described using Counter. The key element is the method update, and to make the code shorter, a list comprehension is used.
words = Counter()
for tw in tweet_iterator(TWEETS):
    text = tw['text']
    words.update([x.strip() for x in text.split()])
Counter has another helpful method to analyze the frequencies; in particular, the most_common method returns the most frequent keywords. For example, the following code gets the five most common keywords.
words.most_common(5)
[('de', 299), ('que', 282), ('a', 205), ('la', 182), ('el', 147)]
These correspond to the words of, that, to, the (feminine), and the (masculine), respectively.
Zipf’s Law
The word frequency allows defining some empirical characteristics of the language. These characteristics are essential to understand the challenges of developing an algorithm capable of understanding language.
Let us start with Zipf’s law. The law relates the frequency of a word with its rank. In an ordered set, the rank corresponds to the element’s position, e.g., the first element has rank one, the second has rank two, and so on. Explicitly, the relationship is defined as \(f \cdot r = c\), where \(f\) is the frequency, \(r\) is the rank, and \(c\) is a constant. Equivalently, the frequency is \(f=\frac{c}{r}\); as can be seen, when the rank equals the constant, the frequency is one, meaning that most words are infrequent and that there are only a few frequent words.
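A quick sanity check of \(f \cdot r = c\) can be done on the toy corpus used above: taking \(c\) as the frequency of the top-ranked word, the predicted frequencies \(\frac{c}{r}\) can be compared with the measured ones (a rough sketch; this choice of \(c\) is only an approximation).
# Approximate c with the frequency of the top-ranked word (r = 1).
c_approx = words.most_common(1)[0][1]
# Compare the measured frequency f with the prediction c / r.
for r, (w, f) in enumerate(words.most_common(10), start=1):
    print(w, f, round(c_approx / r, 1))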
The following figure depicts the described characteristic. It shows a scatter plot between rank and frequency.
The frequency and the rank can be computed with the following two lines using the dataset of the previous example.
freq = [f for _, f in words.most_common()]
rank = range(1, len(freq) + 1)
Once the frequency and the rank are computed, the figure is created using the following code.
plt.plot(rank, freq, '.')
plt.grid()
plt.xlabel('Rank')
plt.ylabel('Frequency')
plt.tight_layout()
The previous figure does not directly depict the relation \(f=\frac{c}{r}\); in order to show it, let us draw a scatter plot between the inverse rank and the frequency. This can be obtained by changing the following lines in the previous procedure; numpy is used in the following code to create the array of inverse ranks.
freq = [f for _, f in words.most_common()]
rank = 1 / np.arange(1, len(freq) + 1)
As observed from the following figure, Zipf’s Law is a rough estimate of the relationship between rank and frequency. Nonetheless, it provides all the elements to understand the importance of this relationship.
Ordinary Least Squares
The missing step is to estimate the value of \(c\). The constant \(c\) can be calculated using ordinary least squares (OLS). The idea is to create a system of equations where the unknown is \(c\) and the dependent variable is \(f\). This can be represented in matrix notation as \(A \cdot \mathbf c = \mathbf f\), where \(A \in \mathbb R^{N \times 1}\) is composed of the \(N\) inverse-rank measurements, \(\mathbf c \in \mathbb R^1\) is the parameter to be identified, and \(\mathbf f \in \mathbb R^N\) is a column vector containing the frequency measurements.
The coefficients can be computed using the function np.linalg.lstsq, as shown in the following code.
X = np.atleast_2d(rank).T
c = np.linalg.lstsq(X, freq, rcond=None)[0]
c
array([461.40751913])
Once \(c\) has been identified, it can be used to predict the frequency; the following figure presents the measurements and the predicted points using the identified coefficient.
The previous figure was created with the following code; the variable hy contains the predicted frequency.
hy = np.dot(X, c)
plt.plot(rank, freq, '.')
plt.plot(rank, hy)
plt.legend(['Measured', 'Predicted'])
plt.grid()
plt.xlabel('Inverse Rank')
plt.ylabel('Frequency')
plt.tight_layout()
Herdan’s Law / Heaps’ Law
A language evolves as new words are incorporated into it; the relationship between the vocabulary size and the number of words (tokens) is expressed by Heaps’ Law. Let \(\mid v \mid\) represent the vocabulary size and \(n\) the number of words; then the relationship between these elements is \(\mid v \mid = k n^\beta\), where \(k\) and \(\beta\) are the parameters that need to be identified.
The following figure depicts the relation between \(n\) and \(\mid v \mid\) using the dataset of the previous examples.
Taking the procedure used in Zipf’s Law as a starting point, it is only needed to collect, for each line, the number of words and the vocabulary size.
words = Counter()
tokens_voc = list()
for tw in tweet_iterator(TWEETS):
    text = tw['text']
    words.update([x.strip() for x in text.split()])
    tokens_voc.append([sum(list(words.values())),
                       len(words)])
n = [x[0] for x in tokens_voc]
v = [x[1] for x in tokens_voc]
Once the points (number of words and vocabulary size) are measured, it is time to estimate the parameters \(k\) and \(\beta\). As can be observed, it is not possible to express Heaps’ Law as a system of equations of the form \(A \cdot [k, \beta]^\intercal = \mid \mathbf v \mid\); consequently, these parameters cannot be estimated using OLS.
Optimization
The values of these parameters can be estimated by posing their identification as an optimization problem. An optimization problem minimizes (maximizes) an objective function. The goal is to find the inputs corresponding to the minimum (maximum) of the function, i.e.,
\[\min_{\mathbf w \in \Omega} f(\mathbf w),\]where \(\mathbf w\) are the inputs of the function \(f\), and \(\Omega\) is the search space, namely the set containing all feasible values of the inputs, e.g., \(\Omega = \{\mathbf w \mid \mathbf w \in \mathbb R^d, \forall_i \mathbf w_i \geq 0\}\), which represents all the vectors of dimension \(d\) whose components are greater than or equal to zero.
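To make the notation concrete, the following sketch (an illustration, not part of the procedure developed here) solves a small problem whose search space is \(\Omega = \{\mathbf w \in \mathbb R^2 \mid \mathbf w_i \geq 0\}\) using scipy.optimize.minimize with bounds; the libraries imported at the beginning are assumed.
def objective(w):
    # Quadratic function whose unconstrained minimum is at (1, -2).
    return (w[0] - 1)**2 + (w[1] + 2)**2

# The bounds encode the search space: every component must be non-negative.
res = minimize(objective, x0=np.array([0.5, 0.5]),
               bounds=[(0, None), (0, None)])
res.x  # approximately [1, 0]; the second component is clipped at the boundary of Ω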
There are many optimization algorithms; some are designed to tackle all possible optimization problems, and others are tailored for a particular class of problems. For example, OLS is an optimization algorithm where the objective function \(f\) is defined as:
\[f(\mathbf w) = \sum_{i=1}^N (y_i - \mathbf a_i \cdot \mathbf w)^2,\]where \(\mathbf w\) is a vector containing the parameters, \(\mathbf a_i\) is the \(i\)-th measurement of the independent variables, and \(y_i\) is the corresponding dependent variable. In the Zipf’s Law example, \(\mathbf w\) corresponds to the parameter \(c\), \(\mathbf a_i\) is the \(i\)-th measurement of the inverse rank, and \(y_i\) is the corresponding frequency.
Let us analyze the previous objective function in more detail. The sum runs over all the measurements; there are \(N\) pairs of observations composed of the response (dependent variable) and the independent variables. For each observation \(i\), the square error is computed between the dependent variable (\(y_i\)) and the dot product of the independent variables (\(\mathbf a_i\)) and the parameters (\(\mathbf w\)).
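As a quick check (a sketch, assuming the variables X, freq, and c computed in the Zipf’s Law example above), the OLS objective evaluated at the coefficient returned by np.linalg.lstsq should be smaller than at a perturbed value.
def ols_objective(w, A, y):
    # Sum of squared errors between the measurements and the linear model.
    return ((np.asarray(y) - A @ w)**2).sum()

ols_objective(c, X, freq) < ols_objective(c + 10, X, freq)
True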
This notation is specific to OLS; however, it is helpful for defining the general case. The first step is to define the set \(\mathcal D\) composed of the \(N\) observations, i.e., \(\mathcal D=\{(\mathbf x_i, y_i) \mid 1 \leq i \leq N\}\), where \(y_i\) is the response variable and \(\mathbf x_i\) contains the independent variables. The second step is to use a generic loss function \(L\), such as \(L(y, \hat y) = (y - \hat y)^2\). The last step is to abstract the model, that is, to replace \(\mathbf a_i \cdot \mathbf w\) with a function \(g(\mathbf x_i) = \mathbf a_i \cdot \mathbf w\). Combining these components, the general version of the OLS optimization problem can be expressed as:
\[\min_{g \in \Omega} \sum_{(\mathbf x, y) \in \mathcal D} L(y, g(\mathbf x)),\]where \(g\) ranges over the functions defined in the search space \(\Omega\), \(L\) is a loss function, and \(\mathcal D\) contains the \(N\) pairs of observations. This optimization problem is known as supervised learning in machine learning.
Returning to the Heaps’ Law problem, the goal is to identify the coefficients \(k\) and \(\beta\) of the model \(\mid v \mid = kn^\beta\). That is, the function \(g\) can be defined as \(g_{k,\beta}(n) = kn^\beta\) and, using the square error as \(L\), the optimization problem is:
\[\min_{(k, \beta) \in \mathbb R^2} \sum_{(n, \mid v \mid) \in \mathcal D} (\mid v \mid - kn^\beta)^2.\]As mentioned previously, there are many optimization algorithms; several of them are available through the function scipy.optimize.minimize. The function admits as arguments the objective function and the initial values of the parameters.
The following code solves the optimization problem using the minimize function.
def f(w, y, x):
    k, beta = w
    return ((y - k * x**beta)**2).sum()

n = np.array(n)
v = np.array(v)
res = minimize(f, np.array([1, 0.5]), (v, n))
k, beta = res.x
k, beta
(2.351980550243793, 0.8231559504194587)
Once \(k\) and \(\beta\) have been identified, it is possible to use them in the Heaps’ Law and produce a figure containing the measurements and predicted values.
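The figure can be produced following the same pattern used for Zipf’s Law; a minimal sketch, assuming the variables n, v, k, and beta from the previous code block:
hv = k * n**beta                      # predicted vocabulary size
plt.plot(n, v, '.')
plt.plot(n, hv)
plt.legend(['Measured', 'Predicted'])
plt.grid()
plt.xlabel('Number of words')
plt.ylabel('Vocabulary size')
plt.tight_layout()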
Activities
Zipf’s Law and Heaps’ Law model two language characteristics; these characteristics are summarized in the values of the parameters \(c\), and \(k\) and \(\beta\), respectively. So far, we have identified these parameters using a toy dataset. In order to illustrate how these parameters can be used to compare languages, we are going to use a dataset of words (and bigrams) available in the library text_models, described in A Python library for exploratory data analysis on twitter data based on tokens and aggregated origin–destination information.
text_models contains the frequency of words and bigrams of words measured from tweets written in different languages (Arabic, Spanish, and English) and collected from the Twitter open stream.
The first step is to retrieve the information from a particular date, location, and language. As an example, let us retrieve the data corresponding to January 10, 2022, in Mexico and written in Spanish, which can be done with the following code.
date = dict(year=2022, month=1, day=10)
voc = Vocabulary(date, lang='Es', country='MX')
words = {k: v for k, v in voc.voc.items() if not k.count('~')}
Variable voc contains the frequency of the words and bigrams and some useful functions to distill the information and perform an exploratory data analysis on the data. The raw data is stored in a dictionary in the variable voc.voc. The bigrams are identified with the character ‘~’, e.g., ‘of~the’ corresponds to the bigram of the words of and the. Finally, the variable words contains the frequency of the words.
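Analogously, the bigrams can be selected with the complementary condition; a small sketch using the same voc.voc dictionary:
# Keep only the keys that contain the bigram marker '~'.
bigrams = {k: v for k, v in voc.voc.items() if k.count('~')}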
A traditional approach to explore the information of a list of words and their frequencies is to create a word cloud. The following figure is the word cloud of the frequency of the words retrieved from text_models.
Code of the word cloud
wc = WC().generate_from_frequencies(words)
plt.imshow(wc)
plt.axis('off')
plt.tight_layout()
Zipf’s Law - \(f=\frac{c}{r}\)
We are now in a position to identify the Zipf’s Law coefficient from a dataset retrieved from text_models.
In this part, the goal is to estimate \(c\) for all the Spanish-speaking countries on a particular date, January 10, 2022. On this occasion, the library joblib is used to parallelize the code, in particular through the class Parallel.
As can be seen in the following code, the variable words is a list of dictionaries containing the word frequencies for all the countries.
countries = ['MX', 'CO', 'ES', 'AR',
             'PE', 'VE', 'CL', 'EC',
             'GT', 'CU', 'BO', 'DO',
             'HN', 'PY', 'SV', 'NI',
             'CR', 'PA', 'UY']
vocs = Parallel(n_jobs=-1)(delayed(Vocabulary)(date,
                                               lang='Es',
                                               country=country)
                           for country in tqdm(countries))
words = [{k: v for k, v in voc.voc.items() if not k.count('~')}
         for voc in vocs]
The next step is to estimate the \(c\) coefficient of Zipf’s Law. Function zipf identifies the coefficient using OLS from a dictionary of word frequencies. Variable zipf_c contains the coefficient values for each country; we also compute the number of words for each country, which is stored in the variable tokens.
def zipf(data):
    freq = [f for _, f in Counter(data).most_common()]
    rank = 1 / np.arange(1, len(freq) + 1)
    X = np.atleast_2d(rank).T
    return np.linalg.lstsq(X, freq, rcond=None)[0]

zipf_c = [zipf(w) for w in words]
tokens = [sum(list(w.values())) for w in words]
The following table presents the coefficient \(c\) and the number of words (\(n\)) for all the countries, sorted by \(c\) in descending order. It can be seen that \(c\) and \(n\) are highly correlated, so this version of Zipf’s Law only characterizes the total number of words seen in the dataset.
Country | \(c\) | \(n\) |
---|---|---|
AR | 28383.27 | 405802 |
ES | 21392.39 | 267724 |
MX | 20166.90 | 269211 |
CO | 11549.12 | 153952 |
CL | 8324.78 | 114394 |
VE | 4704.40 | 56019 |
UY | 3714.85 | 52058 |
PE | 2996.51 | 39945 |
EC | 2768.21 | 35822 |
PY | 1914.56 | 24544 |
DO | 1843.01 | 22969 |
PA | 1575.23 | 20726 |
GT | 1312.07 | 17237 |
NI | 756.56 | 8967 |
HN | 704.90 | 8243 |
CR | 607.04 | 6994 |
CU | 604.96 | 6989 |
SV | 513.01 | 5703 |
BO | 416.46 | 4306 |
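A table like the one above can be assembled in a few lines; the following sketch (assuming the variables countries, zipf_c, and tokens from the previous code) sorts the countries by \(c\) in descending order.
# zipf_c stores one-element arrays, so the coefficient is at position 0.
rows = sorted(zip(countries, zipf_c, tokens),
              key=lambda x: x[1][0], reverse=True)
for country, c, n in rows:
    print(f'{country} | {c[0]:0.2f} | {n}')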
The following table shows the correlation between \(c\) and \(n\); it is observed that the correlation is higher than \(0.99\), which means that measuring the number of words is enough and there is no need to estimate \(c\).
 | \(c\) | \(n\) |
---|---|---|
\(c\) | 1.0000 | 0.9979 |
\(n\) | 0.9979 | 1.0000 |
The correlation is obtained using the following code.
X = np.array([(b[0], c) for b, c in zip(zipf_c, tokens)])
corr = np.corrcoef(X.T)
Zipf’s Law - \(f=\frac{c}{r^\alpha}\)
Zipf’s Law has different versions; we started with the simplest one; however, that version does not provide more information than the one contained in the number of words. A straightforward approach is to replace the constant exponent in Zipf’s Law with a parameter and then use an algorithm to estimate it. Zipf’s Law has the form \(f=\frac{c}{r^1}\), where the exponent is \(1\); changing \(1\) to \(\alpha\) yields \(f=\frac{c}{r^\alpha}\). Without considering the meaning of the variables in this latter version of Zipf’s Law, one can see the similarities between the formulations of Zipf’s Law and Heaps’ Law; both have a multiplying constant, \(c\) and \(k\), and an exponent, \(\alpha\) and \(\beta\).
Coefficients \(c\) and \(\alpha\) can be estimated using the dataset stored in the variable words; the procedure is a combination of the approach used to identify \(c\) in the previous formulation and the algorithm described to estimate \(k\) and \(\beta\) of Heaps’ Law, as sketched below.
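A possible implementation of this combination (a sketch; the function name zipf2 and the initial point are choices made here) uses minimize with the square error between the measured frequency and \(\frac{c}{r^\alpha}\).
def zipf2(data):
    freq = np.array([f for _, f in Counter(data).most_common()])
    rank = np.arange(1, len(freq) + 1)
    def objective(w):
        c, alpha = w
        return ((freq - c / rank**alpha)**2).sum()
    # Start from the highest frequency and the exponent of the simple version.
    return minimize(objective, np.array([freq[0], 1.0])).x

zipf_c2 = [zipf2(w) for w in words]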
The following table shows the values of the coefficients \(c\), \(\alpha\), and the number of words for the different countries.
Country | \(c\) | \(\alpha\) | \(n\) |
---|---|---|---|
AR | 24606.53 | 0.7627 | 405802 |
MX | 17533.38 | 0.7715 | 269211 |
ES | 18745.24 | 0.7842 | 267724 |
CO | 10027.21 | 0.7691 | 153952 |
CL | 7231.78 | 0.7716 | 114394 |
VE | 4075.39 | 0.7695 | 56019 |
UY | 3175.76 | 0.7432 | 52058 |
PE | 2592.78 | 0.7690 | 39945 |
EC | 2391.38 | 0.7672 | 35822 |
PY | 1636.43 | 0.7488 | 24544 |
DO | 1570.40 | 0.7425 | 22969 |
PA | 1347.33 | 0.7514 | 20726 |
GT | 1114.08 | 0.7408 | 17237 |
NI | 645.04 | 0.7541 | 8967 |
HN | 596.58 | 0.7379 | 8243 |
CR | 512.80 | 0.7349 | 6994 |
CU | 514.40 | 0.7470 | 6989 |
SV | 435.75 | 0.7460 | 5703 |
BO | 352.77 | 0.7409 | 4306 |
The correlation of these variables can be seen in the following table. It is observed that \(c\) is highly correlated with the number of words \(n\). On the other hand, the correlation between \(\alpha\) and \(n\) indicates that these variables contain different information.
 | \(c\) | \(\alpha\) | \(n\) |
---|---|---|---|
\(c\) | 1.0000 | 0.6588 | 0.9975 |
\(\alpha\) | 0.6588 | 1.0000 | 0.6308 |
\(n\) | 0.9975 | 0.6308 | 1.0000 |
Heaps’ Law - \(\mid v \mid = kn^\beta\)
Heaps’ Law models the relationship between the number of words (\(n\)) and the size of the vocabulary (\(\mid v \mid\)). In order to estimate the parameters, it is necessary to obtain a dataset where \(n\) varies and, for each value, \(\mid v \mid\) is computed. The dataset used to estimate the coefficients of Zipf’s Law contains \(n\) and \(\mid v \mid\) for a particular date, country, and language; \(\mid v \mid\) corresponds to the length of the dictionary, that is, the number of different words, namely the number of keys in it. Consequently, the dataset needed to estimate the coefficients of Heaps’ Law can be obtained by measuring \(n\) and \(\mid v \mid\) on different days.
The dataset can be obtained by first creating a helper function, get_words, that contains the code used to retrieve the words.
COUNTRIES = ['MX', 'CO', 'ES', 'AR',
             'PE', 'VE', 'CL', 'EC',
             'GT', 'CU', 'BO', 'DO',
             'HN', 'PY', 'SV', 'NI',
             'CR', 'PA', 'UY']

def get_words(date=dict(year=2022, month=1, day=10)):
    vocs = Parallel(n_jobs=-1)(delayed(Vocabulary)(date,
                                                   lang='Es',
                                                   country=country)
                               for country in tqdm(COUNTRIES))
    words = [{k: v for k, v in voc.voc.items() if not k.count('~')}
             for voc in vocs]
    return words
Function get_words retrieves the words for all the countries on a particular date; consequently, the next step is to decide the dates used to create the dataset. The following code uses the function date_range to create a list of dates starting on November 1, 2021, and finishing on November 30, 2021 (inclusive). Variable words is a list where each element corresponds to a different day and contains the word frequencies of all the countries; however, to estimate the coefficients, a dataset for each country with all the days is needed. The last line creates the desired dataset; it is a list where each element has the frequency of the words for all the days of a particular country.
init = dict(year=2021, month=11, day=1)
end = dict(year=2021, month=11, day=30)
dates = date_range(init, end)
words = [get_words(d) for d in dates]
ww = [[w[index] for w in words] for index in range(len(COUNTRIES))]
The information stored in the variable ww contains, for each country, the data needed to create \(n\) and \(\mid v \mid\). Function voc_tokens computes \(\mid v \mid\) and \(n\) for a given dataset, i.e., an element of ww. It uses the variable cnt as an accumulator; note that the pair [len(cnt), sum(cnt.values())] is computed every time cnt is updated.
def voc_tokens(data):
    cnt = Counter(data[0])
    output = [[len(cnt), sum(list(cnt.values()))]]
    for x in data[1:]:
        cnt.update(x)
        _ = [len(cnt), sum(list(cnt.values()))]
        output.append(_)
    output = np.array(output)
    return output[:, 0], output[:, 1]
For example, voc_tokens can be used to create \(\mid v \mid\) and \(n\) for Mexico as:
v_mx, n_mx = voc_tokens(ww[0])
Having described how to obtain \(n\) and \(\mid v \mid\) for each country, it is time to use the optimization procedure to estimate each country’s \(k\) and \(\beta\) coefficients; a possible implementation is sketched below. The table that follows shows the values of these coefficients for each country; it also includes the maximum number of words (\(\max n\)) in the last column.
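The sketch reuses the objective function f and the minimize call from the Heaps’ Law section, applying them to each country’s data in ww; the helper name heaps is an assumption made here.
def heaps(data):
    # voc_tokens returns the vocabulary sizes first and the token counts second.
    v, n = voc_tokens(data)
    res = minimize(f, np.array([1, 0.5]), (v, n))
    return res.x

coefficients = [heaps(data) for data in ww]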
Country | \(k\) | \(\beta\) | \(\max n\) |
---|---|---|---|
AR | 10.19 | 0.5262 | 12883620 |
ES | 2.40 | 0.6316 | 10495462 |
MX | 4.79 | 0.5749 | 9738471 |
CO | 8.31 | 0.5311 | 5987400 |
CL | 11.13 | 0.5199 | 5689730 |
UY | 10.68 | 0.5085 | 1659338 |
EC | 7.46 | 0.5319 | 1491032 |
VE | 4.20 | 0.5651 | 1387172 |
PE | 11.09 | 0.5090 | 1249352 |
PY | 9.28 | 0.5091 | 969505 |
PA | 8.49 | 0.5209 | 885552 |
DO | 6.15 | 0.5322 | 749857 |
GT | 8.01 | 0.5135 | 556313 |
HN | 4.94 | 0.5445 | 422616 |
CU | 4.78 | 0.5611 | 305530 |
NI | 9.15 | 0.4918 | 243830 |
SV | 3.35 | 0.5703 | 240215 |
CR | 5.71 | 0.5276 | 223434 |
BO | 2.55 | 0.5958 | 159700 |
As we have seen previously, the correlation between the variables reveals the relationship between the coefficients; it was helpful to realize that the first version of Zipf’s Law was not needed. The following table shows the correlation between \(k\), \(\beta\), and \(\max n\). It is observed that \(k\) and \(\beta\) are only weakly correlated with \(\max n\), and that there is a negative correlation between \(k\) and \(\beta\).
 | \(k\) | \(\beta\) | \(\max n\) |
---|---|---|---|
\(k\) | 1.0000 | -0.8643 | 0.0587 |
\(\beta\) | -0.8643 | 1.0000 | 0.3235 |
\(\max n\) | 0.0587 | 0.3235 | 1.0000 |
Comparison of Spanish-speaking countries
Coefficients \(\alpha\) and \(\beta\) characterize the vocabulary of a particular region, language, and set of dates. These coefficients can be used to compare the similarities between the countries. A simple approach is to create a scatter plot between \(\alpha\) and \(\beta\). The following figure shows the scatter plot for the Spanish-speaking countries.
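A sketch of such a figure, assuming zipf_c2 (the per-country \(c\) and \(\alpha\) estimates from the sketch above) and coefficients (the per-country Heaps’ Law estimates); both lists follow the order of COUNTRIES.
alpha = [w[1] for w in zipf_c2]        # exponent of f = c / r^alpha
beta = [w[1] for w in coefficients]    # exponent of |v| = k n^beta
plt.plot(alpha, beta, '.')
for country, x, y in zip(COUNTRIES, alpha, beta):
    plt.annotate(country, (x, y))
plt.xlabel(r'$\alpha$')
plt.ylabel(r'$\beta$')
plt.grid()
plt.tight_layout()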