Introduction

EncExp is a set of tools for creating and using explainable embeddings. As with any embedding, the aim is to have a set of vectors that can be associated with tokens, so that an utterance can be represented in the vector space spanned by those vectors. However, the difference with respect to embeddings estimated with GloVe or Word2Vec, among others, is that EncExp produces vectors in which each component has a meaning: the component’s value indicates whether the word associated with that component might be present in the sentence.

The component’s meaning is a direct consequence of the procedure used to estimate the embedding. EncExp estimates the embedding by solving \(d\) binary self-supervised classification problems, where the label is the presence of a particular token. The classifier used is a linear Support Vector Machine.
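The idea can be sketched with scikit-learn alone. For each target token, a binary problem labels every text by whether the token occurs in it, a linear SVM is fit, and an utterance is then represented by the decision values of the \(d\) classifiers. This is a simplified illustration of the procedure, not EncExp's implementation: the toy corpus is hypothetical, the features are plain bag-of-words, and masking the target token's own column (so the classifier must rely on context) is an assumption made here to keep the task non-trivial.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy corpus; EncExp is trained on large Twitter collections.
corpus = ["the movie was good",
          "the movie was bad",
          "good food and good company",
          "bad weather and bad news",
          "the food was bad today",
          "good times with good friends"]
targets = ["good", "bad"]  # d = 2 target tokens

vec = CountVectorizer()
X = vec.fit_transform(corpus).toarray()

classifiers = []
for tok in targets:
    col = vec.vocabulary_[tok]
    y = (X[:, col] > 0).astype(int)  # self-supervised label: token present?
    Xmask = X.copy()
    Xmask[:, col] = 0                # hide the token itself (assumption)
    classifiers.append(LinearSVC().fit(Xmask, y))

def encode(text):
    """Represent `text` with one component per target token."""
    x = vec.transform([text]).toarray()
    for tok in targets:
        x[:, vec.vocabulary_[tok]] = 0
    return np.array([c.decision_function(x)[0] for c in classifiers])

print(encode("what a good movie"))
```

Each component of the resulting vector is tied to one target token, which is what makes the representation explainable.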

Installing using pip

A more general approach to installing EncExp is with the command pip, as illustrated in the following instruction.

pip install EncExp
Datasets and libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Normalizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from CompStats.metrics import macro_recall
from encexp import TextModel, SeqTM, EncExpT
from encexp.utils import load_dataset

# Load the 'mx' / 'ar' dataset and hold out a validation split
X, y = load_dataset(['mx', 'ar'], return_X_y=True)
Xtrain, Xval, ytrain, yval = train_test_split(X, y)
TextModel
tm = make_pipeline(TextModel(lang='es'),
                   LinearSVC()).fit(Xtrain, ytrain)
TextModel (Corpus)
corpus = make_pipeline(TextModel(lang='es', pretrained=False),
                       LinearSVC()).fit(Xtrain, ytrain)
SeqTM
seq = make_pipeline(SeqTM(lang='es'),
                    LinearSVC()).fit(Xtrain, ytrain)
EncExp
enc = make_pipeline(EncExpT(lang='es'),
                    Normalizer(),
                    LinearSVC()).fit(Xtrain, ytrain)
Performance
score = macro_recall(yval, tm.predict(Xval),
                     name='TextModel')
_ = score(corpus.predict(Xval), name='TextModel (Corpus)')                     
_ = score(seq.predict(Xval), name='SeqTM')
_ = score(enc.predict(Xval), name='EncExpT')
score.plot()
Country train test
ALL 6234204 2550
ae 5452729 4096
bh 2186848 4096
dj 2873 309
dz 749350 4096
eg 8388608 4096
iq 2298729 4096
jo 1639094 4096
kw 8388608 4096
lb 1774155 4096
ly 1577801 4096
ma 417223 4096
mr 41019 1809
om 3394449 4096
qa 2255733 4096
sa 8388608 4096
sd 357365 4096
so 17411 561
sy 250829 4093
td 4797 706
tn 277004 4096
ye 659052 4096
Country train test
ALL 2580641 4096
es 53258 4096
Country train test
ALL 7457826 4096
at 7004 4096
ch 4573 4096
de 83023 4096
Country train test
ALL 8388608 4096
ag 296869 4096
ai 23970 1250
au 8388608 4096
bb 723617 4096
bm 214699 4096
bs 1062581 4096
bz 121491 4096
ca 8388608 4096
ck 8054 274
cm 322337 4096
dm 53799 1140
fj 44874 1934
fk 11919 412
fm 7498 266
gb 8388608 4096
gd 112621 2761
gg 24073 1790
gh 6796293 4096
gi 159881 4096
gm 159419 4096
gu 302189 3008
gy 84253 4096
ie 8388608 4096
im 212844 1495
in 8388608 4096
jm 2632294 4096
ke 8023781 4096
kn 87501 3652
ky 170409 4096
lc 262461 4096
lr 105029 4096
ls 195553 4096
mp 85076 617
mt 320087 4096
mu 210286 4096
mw 595014 4096
na 1046724 4096
ng 8388608 4096
nz 5379853 4096
pg 71532 3904
ph 8388608 4096
pk 6824969 4096
pr 53302 3164
pw 6557 691
rw 373701 4096
sb 8142 458
sd 132625 4096
sg 4499238 4096
sh 2876 974
sl 140466 4096
sx 47527 1745
sz 222231 4096
tc 151340 3064
to 25733 901
tt 1416624 4096
ug 3432615 4096
us 8388608 4096
vc 132334 4096
vg 107615 1650
vi 86763 219
vu 13988 767
za 8388608 4096
zm 1193290 4096
zw 1436001 4096
Country train test
ALL 8388608 4096
ar 8388608 4096
bo 1296270 4096
cl 8388608 4096
co 8388608 4096
cr 5342256 4096
cu 825963 4096
do 7861274 4096
ec 8388608 4096
es 8388608 4096
gq 14090 4096
gt 4704647 4096
hn 2305931 4096
mx 8388608 4096
ni 2307983 4096
pa 6703302 4096
pe 8388608 4096
pr 357092 1487
py 8388608 4096
sv 2949477 4096
uy 8388608 4096
ve 8388608 4096
Country train test
ALL 8388608 4096
be 290117 4096
bf 28516 4096
bj 63350 4096
ca 541702 4096
cd 405742 4096
cf 13446 2122
cg 26443 4096
ch 116820 4096
ci 250861 4096
cm 421887 4096
dj 6331 2237
fr 6719963 4096
ga 27053 4096
gn 90623 4096
ht 25940 4096
km 6025 1736
lu 11359 4096
mc 13038 4096
ml 54498 4096
nc 7150 1715
ne 27470 4096
pf 6408 2304
rw 4695 2749
sn 251425 4096
td 15161 4096
tg 44079 4096
Country train test
ALL 3018844 4096
in 69489 4096
Country train test
ALL 8388608 4096
id 96992 4096
Country train test
ALL 8388608 4096
it 169433 4096
Country train test
ALL 8388608 4096
jp 119218 4096
Country train test
ALL 2396019 4096
kr 7799 4096
Country train test
ALL 5126009 4096
be 16880 4096
nl 102887 4096
Country train test
ALL 5980448 4096
pl 54719 4096
Country train test
ALL 8388608 4096
ao 53243 4096
br 8388608 4096
cv 8022 2434
mz 78572 4096
pt 1193837 4096
Country train test
ALL 8388608 4096
by 652824 4096
kg 145944 4096
kz 336193 4096
ru 8388608 4096
Country train test
ALL 8388608 4096
ph 85762 4096
Country train test
ALL 8388608 4096
cy 3085 924
tr 550664 4096
Country train test
ALL 340922 4096
cn 206313 4096
hk 13944 4096
sg 4452 3963
tw 115165 4096
Description

The dataset used to create the self-supervised problems is a collection of tweets gathered from the open stream over several years: the Spanish collection started on December 11, 2015; English on July 1, 2016; Arabic on January 25, 2017; Russian on October 16, 2018; and the remaining languages on June 1, 2021. In all cases, the last day collected was June 9, 2023. The collected tweets were filtered with the following restrictions: retweets were removed; URLs and usernames were replaced by the tokens _url and _usr, respectively; and only tweets with at least 50 characters were included in the final collection.
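The filtering restrictions can be sketched as follows. This is a simplified sketch: the regular expressions for URLs and usernames, and the retweet test, are illustrative stand-ins, not the patterns actually used by EncExp.

```python
import re

URL_RE = re.compile(r"https?://\S+")  # illustrative URL pattern
USR_RE = re.compile(r"@\w+")          # illustrative username pattern

def keep(text):
    """Apply the three restrictions; return the cleaned text or None."""
    if text.startswith("RT @"):       # drop retweets (simplified test)
        return None
    text = URL_RE.sub("_url", text)   # replace URLs with the token _url
    text = USR_RE.sub("_usr", text)   # replace usernames with the token _usr
    if len(text) < 50:                # keep only tweets with >= 50 characters
        return None
    return text

print(keep("Check this out @someone https://example.com it is a really interesting read"))
```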

The corpora are divided into two sets: the first is the training set, used to estimate the parameters, while the second is the test set, which can be used to measure the model’s performance. The division is based on a specific date: tweets published before October 1, 2022, form the training set, while those published on or after October 3, 2022, were used to create the test set.

The training set and test set were created using an equivalent procedure; the only difference is the maximum size: \(2^{23}\) (8,388,608) tweets for the training set and \(2^{12}\) (4,096) tweets for the test set.
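The date-based split with the size caps can be sketched as follows. This is a minimal sketch that simply shuffles and truncates to the maximum sizes; the actual procedure also samples the days as uniformly as possible, as described in the next paragraph.

```python
import random
from datetime import date

TRAIN_CUTOFF = date(2022, 10, 1)  # training: tweets published before this date
TEST_START = date(2022, 10, 3)    # test: tweets published on or after this date
MAX_TRAIN = 2 ** 23               # 8,388,608 tweets
MAX_TEST = 2 ** 12                # 4,096 tweets

def split(tweets):
    """tweets: iterable of (publication date, text) pairs."""
    train, test = [], []
    for day, text in tweets:
        if day < TRAIN_CUTOFF:
            train.append(text)
        elif day >= TEST_START:
            test.append(text)
    random.shuffle(train)         # remove the ordering by date
    random.shuffle(test)
    return train[:MAX_TRAIN], test[:MAX_TEST]

tweets = [(date(2022, 9, 30), "a"), (date(2022, 10, 2), "b"),
          (date(2022, 10, 5), "c")]
train, test = split(tweets)
# note that "b", published in the October 1-2 gap, belongs to neither set
```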

There is a pair of training and test sets for each country, built from tweets with geographic information, and a pair that groups all tweets without geographic information, labeled ALL. Each set was built to have a distribution over the days as close to uniform as possible. Within each day, near duplicates were removed; then, a three-day sliding window was used to remove near duplicates within the window. The final step was to shuffle the data to remove the ordering by date.
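The per-day and sliding-window deduplication can be sketched as follows. The notion of near duplicate is not specified here, so this sketch uses exact matching on whitespace-normalized, lowercased text as a stand-in for the actual criterion.

```python
from collections import deque

def normalize(text):
    # Stand-in near-duplicate key; the real criterion is not specified here.
    return " ".join(text.lower().split())

def dedup_by_window(days, window=3):
    """days: list of (day, [texts]) pairs sorted by date.
    Removes duplicates within each day, then against the
    previous `window - 1` days (a `window`-day sliding window)."""
    seen = deque(maxlen=window - 1)  # one set of keys per previous day
    out = []
    for day, texts in days:
        today, kept = set(), []
        for text in texts:
            key = normalize(text)
            if key in today or any(key in s for s in seen):
                continue             # near duplicate within the window
            today.add(key)
            kept.append(text)
        seen.append(today)
        out.append((day, kept))
    return out

days = [(1, ["a b", "a b", "c"]), (2, ["a  b", "d"]),
        (3, ["c", "e"]), (4, ["a b"])]
print(dedup_by_window(days))
```

Note that on day 4 the text "a b" is kept again, because day 1 has slid out of the three-day window.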