Introduction

EncExp is a set of tools for creating and using explainable embeddings. As with any embedding, the aim is to have a set of vectors that can be associated with tokens, so that an utterance can be represented in the vector space spanned by those vectors. However, the difference with respect to embeddings estimated with GloVe or Word2Vec, among others, is that EncExp produces vectors in which each component has a meaning: the component’s value indicates whether the word associated with that component might be present in the sentence.

The component’s meaning is a direct consequence of the procedure used to estimate the embedding. EncExp estimates the embedding by solving \(d\) binary self-supervised classification problems, where the label is the presence of a particular token. The classifier used is a linear Support Vector Machine.
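The idea can be sketched with scikit-learn alone. For each target token, a binary problem labels every text by whether the token occurs in it, a linear SVM is fit, and an utterance is then represented by the decision values of the \(d\) classifiers. This is a simplified illustration of the procedure, not EncExp's implementation: the toy corpus is hypothetical, the features are plain bag-of-words, and masking the target token's own column (so the classifier must rely on context) is an assumption made here to keep the task non-trivial.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy corpus; EncExp is trained on large Twitter collections.
corpus = ["the movie was good",
          "the movie was bad",
          "good food and good company",
          "bad weather and bad news",
          "the food was bad today",
          "good times with good friends"]
targets = ["good", "bad"]  # d = 2 target tokens

vec = CountVectorizer()
X = vec.fit_transform(corpus).toarray()

classifiers = []
for tok in targets:
    col = vec.vocabulary_[tok]
    y = (X[:, col] > 0).astype(int)  # self-supervised label: token present?
    Xmask = X.copy()
    Xmask[:, col] = 0                # hide the token itself (assumption)
    classifiers.append(LinearSVC().fit(Xmask, y))

def encode(text):
    """Represent `text` with one component per target token."""
    x = vec.transform([text]).toarray()
    for tok in targets:
        x[:, vec.vocabulary_[tok]] = 0
    return np.array([c.decision_function(x)[0] for c in classifiers])

print(encode("what a good movie"))
```

Each component of the resulting vector is tied to one target token, which is what makes the representation explainable.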

Installing using pip

A more general approach to installing EncExp is with the command pip, as illustrated in the following instruction.

pip install EncExp
Datasets and libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Normalizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from CompStats.metrics import macro_recall
from encexp import TextModel, SeqTM, EncExpT
from encexp.utils import load_dataset

# Load the 'mx' / 'ar' dataset and hold out a validation split
X, y = load_dataset(['mx', 'ar'], return_X_y=True)
Xtrain, Xval, ytrain, yval = train_test_split(X, y)
TextModel
tm = make_pipeline(TextModel(lang='es'),
                   LinearSVC()).fit(Xtrain, ytrain)
TextModel (Corpus)
corpus = make_pipeline(TextModel(lang='es', pretrained=False),
                       LinearSVC()).fit(Xtrain, ytrain)
SeqTM
seq = make_pipeline(SeqTM(lang='es'),
                    LinearSVC()).fit(Xtrain, ytrain)
EncExp
enc = make_pipeline(EncExpT(lang='es'),
                    Normalizer(),
                    LinearSVC()).fit(Xtrain, ytrain)
Performance
score = macro_recall(yval, tm.predict(Xval),
                     name='TextModel')
_ = score(corpus.predict(Xval), name='TextModel (Corpus)')                     
_ = score(seq.predict(Xval), name='SeqTM')
_ = score(enc.predict(Xval), name='EncExpT')
score.plot()
Country train test
ALL 6234204 2550
ae 5452729 4096
bh 2186848 4096
dj 2873 309
dz 749350 4096
eg 8388608 4096
iq 2298729 4096
jo 1639094 4096
kw 8388608 4096
lb 1774155 4096
ly 1577801 4096
ma 417223 4096
mr 41019 1809
om 3394449 4096
qa 2255733 4096
sa 8388608 4096
sd 357365 4096
so 17411 561
sy 250829 4093
td 4797 706
tn 277004 4096
ye 659052 4096
Country train test
ALL 2580641 4096
es 53258 4096
Country train test
ALL 7457826 4096
at 7004 4096
ch 4573 4096
de 83023 4096
Country train test
ALL 8388608 4096
ag 296869 4096
ai 23970 1250
au 8388608 4096
bb 723617 4096
bm 214699 4096
bs 1062581 4096
bz 121491 4096
ca 8388608 4096
ck 8054 274
cm 322337 4096
dm 53799 1140
fj 44874 1934
fk 11919 412
fm 7498 266
gb 8388608 4096
gd 112621 2761
gg 24073 1790
gh 6796293 4096
gi 159881 4096
gm 159419 4096
gu 302189 3008
gy 84253 4096
ie 8388608 4096
im 212844 1495
in 8388608 4096
jm 2632294 4096
ke 8023781 4096
kn 87501 3652
ky 170409 4096
lc 262461 4096
lr 105029 4096
ls 195553 4096
mp 85076 617
mt 320087 4096
mu 210286 4096
mw 595014 4096
na 1046724 4096
ng 8388608 4096
nz 5379853 4096
pg 71532 3904
ph 8388608 4096
pk 6824969 4096
pr 53302 3164
pw 6557 691
rw 373701 4096
sb 8142 458
sd 132625 4096
sg 4499238 4096
sh 2876 974
sl 140466 4096
sx 47527 1745
sz 222231 4096
tc 151340 3064
to 25733 901
tt 1416624 4096
ug 3432615 4096
us 8388608 4096
vc 132334 4096
vg 107615 1650
vi 86763 219
vu 13988 767
za 8388608 4096
zm 1193290 4096
zw 1436001 4096
Country train test
ALL 8388608 4096
ar 8388608 4096
bo 1296270 4096
cl 8388608 4096
co 8388608 4096
cr 5342256 4096
cu 825963 4096
do 7861274 4096
ec 8388608 4096
es 8388608 4096
gq 14090 4096
gt 4704647 4096
hn 2305931 4096
mx 8388608 4096
ni 2307983 4096
pa 6703302 4096
pe 8388608 4096
pr 357092 1487
py 8388608 4096
sv 2949477 4096
uy 8388608 4096
ve 8388608 4096
Country train test
ALL 8388608 4096
be 290117 4096
bf 28516 4096
bj 63350 4096
ca 541702 4096
cd 405742 4096
cf 13446 2122
cg 26443 4096
ch 116820 4096
ci 250861 4096
cm 421887 4096
dj 6331 2237
fr 6719963 4096
ga 27053 4096
gn 90623 4096
ht 25940 4096
km 6025 1736
lu 11359 4096
mc 13038 4096
ml 54498 4096
nc 7150 1715
ne 27470 4096
pf 6408 2304
rw 4695 2749
sn 251425 4096
td 15161 4096
tg 44079 4096
Country train test
ALL 3018844 4096
in 69489 4096
Country train test
ALL 8388608 4096
id 96992 4096
Country train test
ALL 8388608 4096
it 169433 4096
Country train test
ALL 8388608 4096
jp 119218 4096
Country train test
ALL 2396019 4096
kr 7799 4096
Country train test
ALL 5126009 4096
be 16880 4096
nl 102887 4096
Country train test
ALL 5980448 4096
pl 54719 4096
Country train test
ALL 8388608 4096
ao 53243 4096
br 8388608 4096
cv 8022 2434
mz 78572 4096
pt 1193837 4096
Country train test
ALL 8388608 4096
by 652824 4096
kg 145944 4096
kz 336193 4096
ru 8388608 4096
Country train test
ALL 8388608 4096
ph 85762 4096
Country train test
ALL 8388608 4096
cy 3085 924
tr 550664 4096
Country train test
ALL 340922 4096
cn 206313 4096
hk 13944 4096
sg 4452 3963
tw 115165 4096
Description

The dataset used to create the self-supervised problems is a collection of tweets gathered from the open stream over several years: the Spanish collection started on December 11, 2015; English on July 1, 2016; Arabic on January 25, 2017; Russian on October 16, 2018; and the remaining languages on June 1, 2021. In all cases, the last day collected was June 9, 2023. The collected tweets were filtered with the following restrictions: retweets were removed; URLs and usernames were replaced by the tokens _url and _usr, respectively; and only tweets with at least 50 characters were included in the final collection.
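The filtering restrictions can be sketched as follows. This is a simplified sketch: the regular expressions for URLs and usernames, and the retweet test, are illustrative stand-ins, not the patterns actually used by EncExp.

```python
import re

URL_RE = re.compile(r"https?://\S+")  # illustrative URL pattern
USR_RE = re.compile(r"@\w+")          # illustrative username pattern

def keep(text):
    """Apply the three restrictions; return the cleaned text or None."""
    if text.startswith("RT @"):       # drop retweets (simplified test)
        return None
    text = URL_RE.sub("_url", text)   # replace URLs with the token _url
    text = USR_RE.sub("_usr", text)   # replace usernames with the token _usr
    if len(text) < 50:                # keep only tweets with >= 50 characters
        return None
    return text

print(keep("Check this out @someone https://example.com it is a really interesting read"))
```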

The corpora are divided into two sets: the first is the training set, used to estimate the parameters, while the second is the test set, which can be used to measure the model’s performance. The division is based on a specific date: tweets published before October 1, 2022, form the training set, while those published on or after October 3, 2022, were used to create the test set.

The training set and test set were created using an equivalent procedure; the only difference is the maximum size: \(2^{23}\) (8,388,608) tweets for the training set and \(2^{12}\) (4,096) tweets for the test set.
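The date-based split with the size caps can be sketched as follows. This is a minimal sketch that simply shuffles and truncates to the maximum sizes; the actual procedure also samples the days as uniformly as possible, as described in the next paragraph.

```python
import random
from datetime import date

TRAIN_CUTOFF = date(2022, 10, 1)  # training: tweets published before this date
TEST_START = date(2022, 10, 3)    # test: tweets published on or after this date
MAX_TRAIN = 2 ** 23               # 8,388,608 tweets
MAX_TEST = 2 ** 12                # 4,096 tweets

def split(tweets):
    """tweets: iterable of (publication date, text) pairs."""
    train, test = [], []
    for day, text in tweets:
        if day < TRAIN_CUTOFF:
            train.append(text)
        elif day >= TEST_START:
            test.append(text)
    random.shuffle(train)         # remove the ordering by date
    random.shuffle(test)
    return train[:MAX_TRAIN], test[:MAX_TEST]

tweets = [(date(2022, 9, 30), "a"), (date(2022, 10, 2), "b"),
          (date(2022, 10, 5), "c")]
train, test = split(tweets)
# note that "b", published in the October 1-2 gap, belongs to neither set
```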

There is a pair of training and test sets for each country, built from tweets with geographic information, and a pair that groups all tweets without geographic information, labeled ALL. Each set was built to have a distribution over the days as close to uniform as possible. Within each day, near duplicates were removed; then, a three-day sliding window was used to remove near duplicates within the window. The final step was to shuffle the data to remove the ordering by date.
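The per-day and sliding-window deduplication can be sketched as follows. The notion of near duplicate is not specified here, so this sketch uses exact matching on whitespace-normalized, lowercased text as a stand-in for the actual criterion.

```python
from collections import deque

def normalize(text):
    # Stand-in near-duplicate key; the real criterion is not specified here.
    return " ".join(text.lower().split())

def dedup_by_window(days, window=3):
    """days: list of (day, [texts]) pairs sorted by date.
    Removes duplicates within each day, then against the
    previous `window - 1` days (a `window`-day sliding window)."""
    seen = deque(maxlen=window - 1)  # one set of keys per previous day
    out = []
    for day, texts in days:
        today, kept = set(), []
        for text in texts:
            key = normalize(text)
            if key in today or any(key in s for s in seen):
                continue             # near duplicate within the window
            today.add(key)
            kept.append(text)
        seen.append(today)
        out.append((day, kept))
    return out

days = [(1, ["a b", "a b", "c"]), (2, ["a  b", "d"]),
        (3, ["c", "e"]), (4, ["a b"])]
print(dedup_by_window(days))
```

Note that on day 4 the text "a b" is kept again, because day 1 has slid out of the three-day window.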