EncExp is a set of tools for creating and using explainable embeddings. As with any embedding, the aim is to have a set of vectors associated with tokens so that an utterance can be represented in the vector space spanned by them. However, the difference with respect to embeddings estimated with GloVe or Word2Vec, among others, is that EncExp associates vectors in which each component has a meaning: the component's value indicates whether the token associated with that component might be present in the text.
The components' meaning is a direct consequence of the procedure used to estimate the embedding. EncExp estimates the embedding by solving \(d\) binary self-supervised classification problems, where the label of each problem is the presence of a particular token; the classifier used is a linear Support Vector Machine.
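To make the construction concrete, the following toy sketch builds such an embedding for a three-token vocabulary, using a TF-IDF bag of words as the underlying representation. It is an illustration under simplifying assumptions (e.g., the target token is not masked from the features), not EncExp's implementation.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

corpus = ["the cat sat on the mat",
          "dogs chase the cat",
          "the dog sat down",
          "a mat on the floor"]
vocabulary = ["cat", "dog", "mat"]  # d = 3 tokens

vec = TfidfVectorizer().fit(corpus)
X = vec.transform(corpus)

# One binary self-supervised problem per token; the label is the
# token's presence in the text, and the classifier is a linear SVM.
W = []
for token in vocabulary:
    y = [token in text.split() for text in corpus]
    W.append(LinearSVC().fit(X, y).coef_[0])
W = np.vstack(W)  # one row (embedding vector) per vocabulary token

# An utterance is represented by one score per vocabulary token; a
# high value suggests the token might be present in the text.
u = W @ vec.transform(["the cat is on the mat"]).toarray()[0]
```

Here \(W\) plays the role of the embedding matrix: component \(i\) of the representation \(u\) is the decision value (up to the intercept) of the classifier associated with the \(i\)-th token.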
A more general approach to installing EncExp is through the pip command, as illustrated in the following instruction.
```
pip install EncExp
```
The following example illustrates the use of these models: it loads a dataset of tweets from Mexico (mx) and Argentina (ar), trains four classifiers that differ only in their text representation, and compares them with the macro-recall score.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Normalizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from CompStats.metrics import macro_recall
from encexp import TextModel, SeqTM, EncExpT
from encexp.utils import load_dataset

# Load the dataset and split it into training and validation sets.
X, y = load_dataset(['mx', 'ar'], return_X_y=True)
Xtrain, Xval, ytrain, yval = train_test_split(X, y)

# Pre-trained TextModel representation.
tm = make_pipeline(TextModel(lang='es'),
                   LinearSVC()).fit(Xtrain, ytrain)
# TextModel estimated on the training corpus.
corpus = make_pipeline(TextModel(lang='es', pretrained=False),
                       LinearSVC()).fit(Xtrain, ytrain)
# SeqTM representation.
seq = make_pipeline(SeqTM(lang='es'),
                    LinearSVC()).fit(Xtrain, ytrain)
# EncExp embedding, normalized before the linear SVM.
enc = make_pipeline(EncExpT(lang='es'),
                    Normalizer(), LinearSVC()).fit(Xtrain, ytrain)

# Compare the systems' performance on the validation set.
score = macro_recall(yval, tm.predict(Xval), name='TextModel')
_ = score(corpus.predict(Xval), name='TextModel (Corpus)')
_ = score(seq.predict(Xval), name='SeqTM')
_ = score(enc.predict(Xval), name='EncExpT')
score.plot()
```
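As the snippet shows, `macro_recall` returns a score object: the first call fixes the gold labels (`yval`) and registers the predictions of `TextModel`, each subsequent call adds a competing system under the given name, and `score.plot()` displays the comparison of all the registered systems.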
The following tables present, for each language, the number of tweets in the training and test sets per country; the row ALL corresponds to the sets built with tweets without geographic information.

**Arabic (ar)**

Country | train | test |
---|---|---|
ALL | 6234204 | 2550 |
ae | 5452729 | 4096 |
bh | 2186848 | 4096 |
dj | 2873 | 309 |
dz | 749350 | 4096 |
eg | 8388608 | 4096 |
iq | 2298729 | 4096 |
jo | 1639094 | 4096 |
kw | 8388608 | 4096 |
lb | 1774155 | 4096 |
ly | 1577801 | 4096 |
ma | 417223 | 4096 |
mr | 41019 | 1809 |
om | 3394449 | 4096 |
qa | 2255733 | 4096 |
sa | 8388608 | 4096 |
sd | 357365 | 4096 |
so | 17411 | 561 |
sy | 250829 | 4093 |
td | 4797 | 706 |
tn | 277004 | 4096 |
ye | 659052 | 4096 |
**Catalan (ca)**

Country | train | test |
---|---|---|
ALL | 2580641 | 4096 |
es | 53258 | 4096 |
**German (de)**

Country | train | test |
---|---|---|
ALL | 7457826 | 4096 |
at | 7004 | 4096 |
ch | 4573 | 4096 |
de | 83023 | 4096 |
**English (en)**

Country | train | test |
---|---|---|
ALL | 8388608 | 4096 |
ag | 296869 | 4096 |
ai | 23970 | 1250 |
au | 8388608 | 4096 |
bb | 723617 | 4096 |
bm | 214699 | 4096 |
bs | 1062581 | 4096 |
bz | 121491 | 4096 |
ca | 8388608 | 4096 |
ck | 8054 | 274 |
cm | 322337 | 4096 |
dm | 53799 | 1140 |
fj | 44874 | 1934 |
fk | 11919 | 412 |
fm | 7498 | 266 |
gb | 8388608 | 4096 |
gd | 112621 | 2761 |
gg | 24073 | 1790 |
gh | 6796293 | 4096 |
gi | 159881 | 4096 |
gm | 159419 | 4096 |
gu | 302189 | 3008 |
gy | 84253 | 4096 |
ie | 8388608 | 4096 |
im | 212844 | 1495 |
in | 8388608 | 4096 |
jm | 2632294 | 4096 |
ke | 8023781 | 4096 |
kn | 87501 | 3652 |
ky | 170409 | 4096 |
lc | 262461 | 4096 |
lr | 105029 | 4096 |
ls | 195553 | 4096 |
mp | 85076 | 617 |
mt | 320087 | 4096 |
mu | 210286 | 4096 |
mw | 595014 | 4096 |
na | 1046724 | 4096 |
ng | 8388608 | 4096 |
nz | 5379853 | 4096 |
pg | 71532 | 3904 |
ph | 8388608 | 4096 |
pk | 6824969 | 4096 |
pr | 53302 | 3164 |
pw | 6557 | 691 |
rw | 373701 | 4096 |
sb | 8142 | 458 |
sd | 132625 | 4096 |
sg | 4499238 | 4096 |
sh | 2876 | 974 |
sl | 140466 | 4096 |
sx | 47527 | 1745 |
sz | 222231 | 4096 |
tc | 151340 | 3064 |
to | 25733 | 901 |
tt | 1416624 | 4096 |
ug | 3432615 | 4096 |
us | 8388608 | 4096 |
vc | 132334 | 4096 |
vg | 107615 | 1650 |
vi | 86763 | 219 |
vu | 13988 | 767 |
za | 8388608 | 4096 |
zm | 1193290 | 4096 |
zw | 1436001 | 4096 |
**Spanish (es)**

Country | train | test |
---|---|---|
ALL | 8388608 | 4096 |
ar | 8388608 | 4096 |
bo | 1296270 | 4096 |
cl | 8388608 | 4096 |
co | 8388608 | 4096 |
cr | 5342256 | 4096 |
cu | 825963 | 4096 |
do | 7861274 | 4096 |
ec | 8388608 | 4096 |
es | 8388608 | 4096 |
gq | 14090 | 4096 |
gt | 4704647 | 4096 |
hn | 2305931 | 4096 |
mx | 8388608 | 4096 |
ni | 2307983 | 4096 |
pa | 6703302 | 4096 |
pe | 8388608 | 4096 |
pr | 357092 | 1487 |
py | 8388608 | 4096 |
sv | 2949477 | 4096 |
uy | 8388608 | 4096 |
ve | 8388608 | 4096 |
**French (fr)**

Country | train | test |
---|---|---|
ALL | 8388608 | 4096 |
be | 290117 | 4096 |
bf | 28516 | 4096 |
bj | 63350 | 4096 |
ca | 541702 | 4096 |
cd | 405742 | 4096 |
cf | 13446 | 2122 |
cg | 26443 | 4096 |
ch | 116820 | 4096 |
ci | 250861 | 4096 |
cm | 421887 | 4096 |
dj | 6331 | 2237 |
fr | 6719963 | 4096 |
ga | 27053 | 4096 |
gn | 90623 | 4096 |
ht | 25940 | 4096 |
km | 6025 | 1736 |
lu | 11359 | 4096 |
mc | 13038 | 4096 |
ml | 54498 | 4096 |
nc | 7150 | 1715 |
ne | 27470 | 4096 |
pf | 6408 | 2304 |
rw | 4695 | 2749 |
sn | 251425 | 4096 |
td | 15161 | 4096 |
tg | 44079 | 4096 |
**Hindi (hi)**

Country | train | test |
---|---|---|
ALL | 3018844 | 4096 |
in | 69489 | 4096 |
**Indonesian (id)**

Country | train | test |
---|---|---|
ALL | 8388608 | 4096 |
id | 96992 | 4096 |
**Italian (it)**

Country | train | test |
---|---|---|
ALL | 8388608 | 4096 |
it | 169433 | 4096 |
**Japanese (ja)**

Country | train | test |
---|---|---|
ALL | 8388608 | 4096 |
jp | 119218 | 4096 |
**Korean (ko)**

Country | train | test |
---|---|---|
ALL | 2396019 | 4096 |
kr | 7799 | 4096 |
**Dutch (nl)**

Country | train | test |
---|---|---|
ALL | 5126009 | 4096 |
be | 16880 | 4096 |
nl | 102887 | 4096 |
**Polish (pl)**

Country | train | test |
---|---|---|
ALL | 5980448 | 4096 |
pl | 54719 | 4096 |
**Portuguese (pt)**

Country | train | test |
---|---|---|
ALL | 8388608 | 4096 |
ao | 53243 | 4096 |
br | 8388608 | 4096 |
cv | 8022 | 2434 |
mz | 78572 | 4096 |
pt | 1193837 | 4096 |
**Russian (ru)**

Country | train | test |
---|---|---|
ALL | 8388608 | 4096 |
by | 652824 | 4096 |
kg | 145944 | 4096 |
kz | 336193 | 4096 |
ru | 8388608 | 4096 |
**Tagalog (tl)**

Country | train | test |
---|---|---|
ALL | 8388608 | 4096 |
ph | 85762 | 4096 |
**Turkish (tr)**

Country | train | test |
---|---|---|
ALL | 8388608 | 4096 |
cy | 3085 | 924 |
tr | 550664 | 4096 |
**Chinese (zh)**

Country | train | test |
---|---|---|
ALL | 340922 | 4096 |
cn | 206313 | 4096 |
hk | 13944 | 4096 |
sg | 4452 | 3963 |
tw | 115165 | 4096 |
The dataset used to create the self-supervised problems is a collection of tweets gathered from the Twitter open stream over several years; the starting date depends on the language: the Spanish collection started on December 11, 2015; English on July 1, 2016; Arabic on January 25, 2017; Russian on October 16, 2018; and the remaining languages on June 1, 2021. In all cases, the last day collected was June 9, 2023. The collected tweets were filtered with the following restrictions: retweets were removed; URLs and usernames were replaced with the tokens _url and _usr, respectively; and only tweets with at least 50 characters were included in the final collection.
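As an illustration, the following sketch applies those filters to a raw tweet's text. The regular expressions and the retweet test are stand-ins for the exact rules, which the description above does not fully specify.

```python
import re
from typing import Optional

URL = re.compile(r"https?://\S+")
USR = re.compile(r"@\w+")

def filter_tweet(text: str) -> Optional[str]:
    """Return the normalized text, or None if the tweet is discarded."""
    if text.startswith("RT @"):      # discard retweets (illustrative test)
        return None
    text = URL.sub("_url", text)     # replace URLs with the token _url
    text = USR.sub("_usr", text)     # replace usernames with the token _usr
    if len(text) < 50:               # keep only tweets with >= 50 characters
        return None
    return text
```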
The corpora are divided into two sets: the first is the training set, i.e., the one used to estimate the parameters, while the second is the test set, which can be used to measure the model's performance. The division is based on a specific date: tweets published before October 1, 2022, form the training set, while those published on or after October 3, 2022, were used to create the test set.
The training and test sets were created using an equivalent procedure; the only difference is the maximum size, which is \(2^{23}\) (8,388,608) tweets for the training set and \(2^{12}\) (4,096) tweets for the test set.
There is a pair of training and test sets for each country, built from the tweets with geographic information, and a pair, labeled ALL, that groups all the tweets without geographic information. Each set was built to have a distribution over the days as close to uniform as possible. Near duplicates were first removed within each day; then, a three-day sliding window was used to remove near duplicates across consecutive days. The final step was to shuffle the data to remove the ordering by date.
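The following sketch summarizes the construction of one of these sets. The near-duplicate detector (here a crude lowercased-text match) and the uniform sampling over days are simplified stand-ins for the actual, unspecified procedure.

```python
import random
from itertools import chain

def build_set(tweets_per_day, max_size=2**23):
    """tweets_per_day: lists of tweets ordered by date, one list per day."""
    # Remove near duplicates within each day.
    days = []
    for day in tweets_per_day:
        seen, kept = set(), []
        for text in day:
            if text.lower() not in seen:
                seen.add(text.lower())
                kept.append(text)
        days.append(kept)
    # Three-day sliding window: drop tweets already seen in the two
    # preceding days.
    clean = []
    for i, day in enumerate(days):
        window = {t.lower() for t in chain(*days[max(0, i - 2):i])}
        clean.append([t for t in day if t.lower() not in window])
    # Sample as uniformly as possible over the days, up to the maximum size.
    per_day = max_size // max(1, len(clean))
    data = list(chain(*(random.sample(d, min(per_day, len(d)))
                        for d in clean)))
    random.shuffle(data)  # remove the ordering by date
    return data
```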