Introduction

dialectid aims to develop a set of algorithms to detect the dialect of a given text. For example, given a text written in Spanish, dialectid predicts the Spanish-speaking country where the text comes from.

dialectid is available for Arabic (ar), German (de), English (en), Spanish (es), French (fr), Dutch (nl), Portuguese (pt), Russian (ru), Turkish (tr), and Chinese (zh).

Installing using conda

dialectid can be installed using the conda package manager with the following instruction.

conda install --channel conda-forge dialectid
Installing using pip

A more general approach to installing dialectid is through the pip command, as illustrated in the following instruction.

pip install dialectid
Countries
from dialectid import DialectId
detect = DialectId(lang='es')
detect.countries
array(['ar', 'bo', 'cl', 'co', 'cr', 'cu', 'do', 'ec', 'es', 'gq', 'gt',
       'hn', 'mx', 'ni', 'pa', 'pe', 'pr', 'py', 'sv', 'uy', 've'],
      dtype='<U2')
Dialect Identification
from dialectid import DialectId
detect = DialectId(lang='es')
detect.predict(['comiendo unos tacos',
                'acompañando el asado con un buen vino'])
array(['mx', 'uy'], dtype='<U2')
Decision Function
from dialectid import DialectId
detect = DialectId(lang='es')
df = detect.decision_function(['acompañando el asado con un buen vino'])[0]
index = df.argsort()[::-1]
[(detect.countries[i], df[i]) for i in index
 if df[i] > 0]
[(np.str_('uy'), np.float64(1.5416804610086077)),
 (np.str_('py'), np.float64(1.3321806689071498)),
 (np.str_('ar'), np.float64(1.2182585368838774))]
Probability
from dialectid import DialectId
detect = DialectId(lang='es', probability=True)
prob = detect.predict_proba(['acompañando el asado con un buen vino'])[0]
index = prob.argsort()[::-1]
[(detect.countries[i], prob[i])
 for i in index[:4]]
[(np.str_('uy'), np.float64(0.4595517299090438)),
 (np.str_('ar'), np.float64(0.35344213246401396)),
 (np.str_('py'), np.float64(0.186954384531818)),
 (np.str_('cl'), np.float64(2.812471659260275e-05))]
Table 1: Number of tweets in the training and test sets for the Arabic-speaking countries.
Country train test
ae 119600 4096
bh 119666 4096
dj 2873 309
dz 119143 4096
eg 119122 4096
iq 119655 4096
jo 119718 4096
kw 119117 4096
lb 119370 4096
ly 119659 4096
ma 119556 4096
mr 41017 1809
om 119771 4096
qa 119362 4096
sa 119009 4096
sd 120078 4096
so 17410 561
sy 119159 4093
td 4797 706
tn 119244 4096
ye 119823 4096
Table 2: Number of tweets in the training and test sets for the German-speaking countries.
Country train test
at 7004 4096
ch 4573 4096
de 83023 4096
Table 3: Number of tweets in the training and test sets for the English-speaking countries.
Country train test
ag 37015 4096
ai 23965 1250
au 36947 4096
bb 36702 4096
bm 36967 4096
bs 37280 4096
bz 36794 4096
ca 36979 4096
ck 8053 274
cm 37134 4096
dm 36623 1140
fj 36909 1934
fk 11917 412
fm 7497 266
gb 37005 4096
gd 37060 2761
gg 24068 1790
gh 37213 4096
gi 37003 4096
gm 36999 4096
gu 37116 3008
gy 36892 4096
ie 37158 4096
im 37255 1495
in 37048 4096
jm 37276 4096
ke 37294 4096
kn 37062 3652
ky 37184 4096
lc 36919 4096
lr 37093 4096
ls 37153 4096
mp 37032 617
mt 37158 4096
mu 37012 4096
mw 37248 4096
na 37137 4096
ng 37127 4096
nz 37442 4096
pg 37333 3904
ph 37281 4096
pk 37239 4096
pw 6556 691
rw 36569 4096
sb 8141 458
sd 37036 4096
sg 37215 4096
sh 2876 974
sl 37008 4096
sx 36672 1745
sz 36842 4096
tc 36996 3064
to 25728 901
tt 37304 4096
ug 37162 4096
us 37410 4096
vc 36742 4096
vg 37033 1650
vi 37113 219
vu 13985 767
za 36839 4096
zm 37079 4096
zw 37233 4096
Table 4: Number of tweets in the training and test sets for the Spanish-speaking countries.
Country train test
ar 108943 4096
bo 108317 4096
cl 108974 4096
co 109063 4096
cr 109556 4096
cu 109054 4096
do 109364 4096
ec 108953 4096
es 108583 4096
gq 13548 4096
gt 109749 4096
hn 108846 4096
mx 109120 4096
ni 109377 4096
pa 108577 4096
pe 108960 4096
pr 12407 1487
py 108992 4096
sv 108769 4096
uy 108672 4096
ve 109327 4096
Table 5: Number of tweets in the training and test sets for the French-speaking countries.
Country train test
be 214731 4096
bf 28514 4096
bj 63347 4096
ca 215745 4096
cd 214853 4096
cf 13445 2122
cg 26441 4096
ch 116815 4096
ci 215555 4096
cm 215520 4096
dj 6331 2237
fr 215940 4096
ga 27051 4096
gn 90619 4096
ht 25939 4096
km 6025 1736
lu 11358 4096
mc 13037 4096
ml 54495 4096
nc 7150 1715
ne 27468 4096
pf 6408 2304
rw 4695 2749
sn 216403 4096
td 15160 4096
tg 44077 4096
Table 6: Number of tweets in the training and test sets for the Dutch-speaking countries.
Country train test
be 16880 4096
nl 102887 4096
Table 7: Number of tweets in the training and test sets for the Portuguese-speaking countries.
Country train test
ao 53243 4096
br 978254 4096
cv 8022 2434
mz 78571 4096
pt 979061 4096
Table 8: Number of tweets in the training and test sets for the Russian-speaking countries.
Country train test
by 652818 4096
kg 145942 4096
kz 336189 4096
ru 962191 4096
Table 9: Number of tweets in the training and test sets for the Turkish-speaking countries.
Country train test
cy 3085 924
tr 550662 4096
Table 10: Number of tweets in the training and test sets for the Chinese-speaking countries.
Country train test
cn 206312 4096
hk 13944 4096
sg 4452 3963
tw 115165 4096
Description

The dataset used to create the self-supervised problems is a collection of tweets gathered from the open stream over several years: the Spanish collection started on December 11, 2015; English on July 1, 2016; Arabic on January 25, 2017; Russian on October 16, 2018; and the rest of the languages on June 1, 2021. In all cases, the last day collected was June 9, 2023. The collected tweets were filtered with the following restrictions: retweets were removed; URLs and usernames were replaced by the tokens _url and _usr, respectively; and only tweets with at least 50 characters were included in the final collection.
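The filtering steps above can be sketched as follows. This is a minimal illustration: the `_url` and `_usr` tokens come from the description, but the exact patterns used to match URLs and usernames are assumptions.

```python
import re

# Hypothetical patterns; the actual matching rules are not specified.
URL_RE = re.compile(r"https?://\S+")
USR_RE = re.compile(r"@\w+")

def filter_tweet(text, is_retweet):
    """Apply the collection filters: drop retweets, mask URLs and
    usernames, and keep only tweets with at least 50 characters."""
    if is_retweet:
        return None
    text = URL_RE.sub("_url", text)
    text = USR_RE.sub("_usr", text)
    return text if len(text) >= 50 else None
```

Note that the length restriction is applied after masking, so a tweet padded with a long URL does not pass on the URL's length alone; whether the authors applied the filters in this order is an assumption.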

The corpora are divided into two sets: the first is the training set, used to estimate the parameters, while the second is the test set, used to measure the model's performance. The division is based on a specific date: tweets published before October 1, 2022, form the training set, and those published on or after October 3, 2022, are used to create the test set.
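The date-based split can be sketched as follows; the function name is hypothetical, and the two-day gap between the cutoffs follows the dates given above.

```python
from datetime import date

TRAIN_END = date(2022, 10, 1)   # tweets before this date -> training set
TEST_START = date(2022, 10, 3)  # tweets on/after this date -> test set

def assign_split(created_at):
    """Return 'train', 'test', or None for the gap between the cutoffs."""
    if created_at < TRAIN_END:
        return "train"
    if created_at >= TEST_START:
        return "test"
    return None
```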

The procedure has two stages. In the first stage, two datasets were created for each country and language. The first contains \(2^{23}\) (roughly 8 million) tweets, and the second \(2^{12}\) (4,096) tweets; the former is used to create the training set, and the latter corresponds to the test set. These two sets were constructed from tweets with geographic information, filtered according to the language information provided by Twitter. Each set was built to follow, as closely as possible, a uniform distribution over the days. Within each day, near duplicates were removed; then, a three-day sliding window was used to remove near duplicates within the window. The final step was to shuffle the data to remove the ordering by date.
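The within-day and sliding-window deduplication can be sketched as follows. The near-duplicate criterion is not specified in the description, so this sketch simplifies it to exact-text equality; the window logic (a day checked against the preceding days in the window) is also an assumption.

```python
from collections import deque

def dedup_sliding_window(days):
    """Remove duplicates within each day and against a 3-day window.

    `days` is a list of (day, tweets) pairs sorted by date.  The
    near-duplicate test is simplified to exact-text equality here.
    """
    window = deque(maxlen=3)  # kept texts from the most recent days
    out = []
    for day, tweets in days:
        seen = set().union(*window) if window else set()
        kept = []
        for text in tweets:
            if text not in seen:  # new within the day and the window
                kept.append(text)
                seen.add(text)
        out.append((day, kept))
        window.append(set(kept))
    return out
```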

In the second stage, a training set is created for each language. Each training set contains \(2^{21}\) (roughly 2 million) tweets, drawn from the \(2^{23}\)-tweet sets created in the first stage. The sampling procedure aims to produce training sets that follow a uniform distribution by country. We also produce a smaller training set containing \(2^{18}\) (roughly 262 thousand) tweets; the procedure is equivalent to the previous one, with the same aim of a uniform distribution over the countries.
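The uniform-by-country sampling can be sketched as follows; the function name and the handling of under-represented countries (they contribute everything they have, as Tables 1-10 suggest) are assumptions.

```python
import random

def uniform_sample(by_country, total):
    """Draw `total` tweets, as uniformly as possible across countries.

    `by_country` maps a country code to its pool of tweets (the
    per-country sets from the first stage).  Countries with fewer
    tweets than their share contribute all of their tweets.
    """
    share = total // len(by_country)
    sample = []
    for pool in by_country.values():
        k = min(share, len(pool))
        sample.extend(random.sample(pool, k))
    random.shuffle(sample)  # remove the grouping by country
    return sample
```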

It is worth mentioning that we did not have enough information for all the countries and languages to follow an exactly uniform distribution. For example, it can be observed in Table 4 (Spanish) that for Puerto Rico (pr), there are only 12,407 tweets in the training set and 1,487 tweets in the test set, which correspond to the total number of available tweets that met the imposed restrictions.

Performance

The performance of the different algorithms is presented in Figure 1 using macro-recall. The best-performing system in almost all cases is DialectId trained on 2 million tweets with a vocabulary of 500,000 tokens. The exceptions are Turkish and Dutch, where the best system is StackBoW trained with only 262k tweets.
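Macro-recall is the mean of the per-country recalls, so every country contributes equally to the score regardless of its test-set size. A minimal sketch of the metric, with hypothetical labels (it is equivalent to scikit-learn's recall_score with average='macro'):

```python
from collections import defaultdict

def macro_recall(y_true, y_pred):
    """Mean of the per-class recalls: each country weighs the same."""
    hits = defaultdict(int)    # correct predictions per country
    totals = defaultdict(int)  # gold examples per country
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Toy example: per-country recalls are mx=1/2, ar=1/1, uy=2/3,
# so the macro-recall is their mean, (1/2 + 1 + 2/3) / 3.
score = macro_recall(["mx", "mx", "ar", "uy", "uy", "uy"],
                     ["mx", "ar", "ar", "uy", "uy", "mx"])
```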

The remaining figures provide details on macro-recall by presenting the system’s recall in each country.