dialectid aims to develop a set of algorithms to detect the dialect of a given text. For example, given a text written in Spanish, dialectid predicts the Spanish-speaking country where the text comes from.
dialectid is available for Arabic (ar), German (de), English (en), Spanish (es), French (fr), Dutch (nl), Portuguese (pt), Russian (ru), Turkish (tr), and Chinese (zh).
dialectid can be installed with the conda package manager using the following instruction.

```bash
conda install --channel conda-forge dialectid
```
A more general approach to installing dialectid is through pip, as illustrated in the following instruction.

```bash
pip install dialectid
```
```python
from dialectid import DialectId

detect = DialectId(lang='es')
detect.countries
# array(['ar', 'bo', 'cl', 'co', 'cr', 'cu', 'do', 'ec', 'es', 'gq', 'gt',
#        'hn', 'mx', 'ni', 'pa', 'pe', 'pr', 'py', 'sv', 'uy', 've'],
#       dtype='<U2')
```
```python
from dialectid import DialectId

detect = DialectId(lang='es')
detect.predict(['comiendo unos tacos',
                'acompañando el asado con un buen vino'])
# array(['mx', 'uy'], dtype='<U2')
```
```python
from dialectid import DialectId

detect = DialectId(lang='es')
df = detect.decision_function(['acompañando el asado con un buen vino'])[0]
index = df.argsort()[::-1]
[(detect.countries[i], df[i]) for i in index if df[i] > 0]
# [(np.str_('uy'), np.float64(1.5416804610086077)),
#  (np.str_('py'), np.float64(1.3321806689071498)),
#  (np.str_('ar'), np.float64(1.2182585368838774))]
```
```python
from dialectid import DialectId

detect = DialectId(lang='es', probability=True)
prob = detect.predict_proba(['acompañando el asado con un buen vino'])[0]
index = prob.argsort()[::-1]
[(detect.countries[i], prob[i]) for i in index[:4]]
# [(np.str_('uy'), np.float64(0.4595517299090438)),
#  (np.str_('ar'), np.float64(0.35344213246401396)),
#  (np.str_('py'), np.float64(0.186954384531818)),
#  (np.str_('cl'), np.float64(2.812471659260275e-05))]
```
Table 1: Arabic (ar)

Country | train | test |
---|---|---|
ae | 119600 | 4096 |
bh | 119666 | 4096 |
dj | 2873 | 309 |
dz | 119143 | 4096 |
eg | 119122 | 4096 |
iq | 119655 | 4096 |
jo | 119718 | 4096 |
kw | 119117 | 4096 |
lb | 119370 | 4096 |
ly | 119659 | 4096 |
ma | 119556 | 4096 |
mr | 41017 | 1809 |
om | 119771 | 4096 |
qa | 119362 | 4096 |
sa | 119009 | 4096 |
sd | 120078 | 4096 |
so | 17410 | 561 |
sy | 119159 | 4093 |
td | 4797 | 706 |
tn | 119244 | 4096 |
ye | 119823 | 4096 |
Table 2: German (de)

Country | train | test |
---|---|---|
at | 7004 | 4096 |
ch | 4573 | 4096 |
de | 83023 | 4096 |
Table 3: English (en)

Country | train | test |
---|---|---|
ag | 37015 | 4096 |
ai | 23965 | 1250 |
au | 36947 | 4096 |
bb | 36702 | 4096 |
bm | 36967 | 4096 |
bs | 37280 | 4096 |
bz | 36794 | 4096 |
ca | 36979 | 4096 |
ck | 8053 | 274 |
cm | 37134 | 4096 |
dm | 36623 | 1140 |
fj | 36909 | 1934 |
fk | 11917 | 412 |
fm | 7497 | 266 |
gb | 37005 | 4096 |
gd | 37060 | 2761 |
gg | 24068 | 1790 |
gh | 37213 | 4096 |
gi | 37003 | 4096 |
gm | 36999 | 4096 |
gu | 37116 | 3008 |
gy | 36892 | 4096 |
ie | 37158 | 4096 |
im | 37255 | 1495 |
in | 37048 | 4096 |
jm | 37276 | 4096 |
ke | 37294 | 4096 |
kn | 37062 | 3652 |
ky | 37184 | 4096 |
lc | 36919 | 4096 |
lr | 37093 | 4096 |
ls | 37153 | 4096 |
mp | 37032 | 617 |
mt | 37158 | 4096 |
mu | 37012 | 4096 |
mw | 37248 | 4096 |
na | 37137 | 4096 |
ng | 37127 | 4096 |
nz | 37442 | 4096 |
pg | 37333 | 3904 |
ph | 37281 | 4096 |
pk | 37239 | 4096 |
pw | 6556 | 691 |
rw | 36569 | 4096 |
sb | 8141 | 458 |
sd | 37036 | 4096 |
sg | 37215 | 4096 |
sh | 2876 | 974 |
sl | 37008 | 4096 |
sx | 36672 | 1745 |
sz | 36842 | 4096 |
tc | 36996 | 3064 |
to | 25728 | 901 |
tt | 37304 | 4096 |
ug | 37162 | 4096 |
us | 37410 | 4096 |
vc | 36742 | 4096 |
vg | 37033 | 1650 |
vi | 37113 | 219 |
vu | 13985 | 767 |
za | 36839 | 4096 |
zm | 37079 | 4096 |
zw | 37233 | 4096 |
Table 4: Spanish (es)

Country | train | test |
---|---|---|
ar | 108943 | 4096 |
bo | 108317 | 4096 |
cl | 108974 | 4096 |
co | 109063 | 4096 |
cr | 109556 | 4096 |
cu | 109054 | 4096 |
do | 109364 | 4096 |
ec | 108953 | 4096 |
es | 108583 | 4096 |
gq | 13548 | 4096 |
gt | 109749 | 4096 |
hn | 108846 | 4096 |
mx | 109120 | 4096 |
ni | 109377 | 4096 |
pa | 108577 | 4096 |
pe | 108960 | 4096 |
pr | 12407 | 1487 |
py | 108992 | 4096 |
sv | 108769 | 4096 |
uy | 108672 | 4096 |
ve | 109327 | 4096 |
Table 5: French (fr)

Country | train | test |
---|---|---|
be | 214731 | 4096 |
bf | 28514 | 4096 |
bj | 63347 | 4096 |
ca | 215745 | 4096 |
cd | 214853 | 4096 |
cf | 13445 | 2122 |
cg | 26441 | 4096 |
ch | 116815 | 4096 |
ci | 215555 | 4096 |
cm | 215520 | 4096 |
dj | 6331 | 2237 |
fr | 215940 | 4096 |
ga | 27051 | 4096 |
gn | 90619 | 4096 |
ht | 25939 | 4096 |
km | 6025 | 1736 |
lu | 11358 | 4096 |
mc | 13037 | 4096 |
ml | 54495 | 4096 |
nc | 7150 | 1715 |
ne | 27468 | 4096 |
pf | 6408 | 2304 |
rw | 4695 | 2749 |
sn | 216403 | 4096 |
td | 15160 | 4096 |
tg | 44077 | 4096 |
Table 6: Dutch (nl)

Country | train | test |
---|---|---|
be | 16880 | 4096 |
nl | 102887 | 4096 |
Table 7: Portuguese (pt)

Country | train | test |
---|---|---|
ao | 53243 | 4096 |
br | 978254 | 4096 |
cv | 8022 | 2434 |
mz | 78571 | 4096 |
pt | 979061 | 4096 |
Table 8: Russian (ru)

Country | train | test |
---|---|---|
by | 652818 | 4096 |
kg | 145942 | 4096 |
kz | 336189 | 4096 |
ru | 962191 | 4096 |
Table 9: Turkish (tr)

Country | train | test |
---|---|---|
cy | 3085 | 924 |
tr | 550662 | 4096 |
Table 10: Chinese (zh)

Country | train | test |
---|---|---|
cn | 206312 | 4096 |
hk | 13944 | 4096 |
sg | 4452 | 3963 |
tw | 115165 | 4096 |
The dataset used to create the self-supervised problems is a collection of tweets obtained from the open stream over several years: the Spanish collection started on December 11, 2015; English on July 1, 2016; Arabic on January 25, 2017; Russian on October 16, 2018; and the rest of the languages on June 1, 2021. In all cases, the last day collected was June 9, 2023. The collected tweets were filtered with the following restrictions: retweets were removed; URLs and usernames were replaced by the tokens _url and _usr, respectively; and only tweets with at least 50 characters were included in the final collection.
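The filtering step can be sketched as follows. This is a minimal illustration, not the actual collection code; the `tweet` dictionary and its `retweeted` field are hypothetical, and the regular expressions are one plausible way to match URLs and usernames.

```python
import re


def preprocess(tweet):
    """Apply the filters described above: drop retweets, mask URLs and
    usernames with _url and _usr, and keep only texts with at least
    50 characters.  Returns the cleaned text, or None if discarded."""
    text = tweet["text"]
    # Retweets are removed entirely (hypothetical metadata field).
    if tweet.get("retweeted") or text.startswith("RT @"):
        return None
    # Replace URLs and usernames with the tokens _url and _usr.
    text = re.sub(r"https?://\S+", "_url", text)
    text = re.sub(r"@\w+", "_usr", text)
    # Keep only tweets with at least 50 characters.
    return text if len(text) >= 50 else None
```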
The corpora are divided into two sets: the first is the training set, i.e., used to estimate the parameters, while the second is the test set, which can be used to measure the model’s performance. The division is based on a specific date: tweets published before October 1, 2022, form the training set, and those published on or after October 3, 2022, are used to create the test set.
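The date-based split can be sketched as below; the two cutoff dates come from the text, and note that tweets from the two days in between belong to neither set.

```python
from datetime import date

# Cutoff dates taken from the text above; the two days in between
# are not assigned to either set.
TRAIN_END = date(2022, 10, 1)
TEST_START = date(2022, 10, 3)


def split_by_date(tweets):
    """Partition (publication day, text) pairs into training and test texts."""
    train = [text for day, text in tweets if day < TRAIN_END]
    test = [text for day, text in tweets if day >= TEST_START]
    return train, test
```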
The procedure has two stages. In the first stage, two datasets were created for each country and language: the first contains \(2^{23}\) (8 million) tweets, and the second \(2^{12}\) (4,096) tweets; the former is used to create the training set, and the latter corresponds to the test set. These two sets were constructed using tweets with geographic information, filtered according to the language information provided by Twitter, and crafted to follow, as closely as possible, a uniform distribution over days. Near duplicates were removed within each day, and then a three-day sliding window was used to remove near duplicates across days. The final step was to shuffle the data to remove the ordering by date.
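The two-level deduplication can be sketched as follows. For simplicity, exact string matching stands in for the near-duplicate detection used in practice; the function name and data layout are illustrative.

```python
def dedup(days):
    """days: one list of tweet texts per consecutive calendar day.
    Removes duplicates within each day and then across a three-day
    sliding window (current day plus the two previous ones)."""
    kept = []
    window = []  # sets of texts kept on the two previous days
    for day in days:
        seen = set()
        today = []
        for text in day:
            # Drop texts already kept today or within the window.
            if text in seen or any(text in prev for prev in window):
                continue
            seen.add(text)
            today.append(text)
        kept.append(today)
        window = (window + [seen])[-2:]  # slide the window forward
    return kept
```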
In the second stage, a training set is created for each language. Each training set contains \(2^{21}\) (2 million) tweets, drawn from the \(2^{23}\)-tweet sets created in the first stage; the sampling procedure aims to produce training sets that follow a uniform distribution by country. We also produced a smaller training set containing \(2^{18}\) (262,144) tweets with an equivalent procedure, again aiming at a uniform distribution over countries.
It is worth mentioning that, for some countries and languages, there was not enough information to follow an exactly uniform distribution. For example, Table 4 (Spanish) shows that for Puerto Rico (pr) there are only 12,407 tweets in the training set and 1,487 tweets in the test set, which correspond to the total number of available tweets that met the imposed restrictions.
The performance of the different algorithms is presented in Figure 1 using macro-recall. The best-performing system in almost all cases is DialectId, which is trained on 2 million tweets and has a vocabulary of 500,000 tokens. The exceptions are Turkish and Dutch, where the best system is StackBoW trained with only 262k tweets.
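Macro-recall is the unweighted mean of the per-country recall, so every country weighs the same regardless of its test-set size. A minimal pure-Python sketch (the function name is illustrative):

```python
from collections import defaultdict


def macro_recall(y_true, y_pred):
    """Mean of the per-class recall: each country contributes equally,
    no matter how many test tweets it has."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        hits[truth] += int(truth == pred)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)
```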
The remaining figures break down the macro-recall by presenting each system’s recall in each country.