DialectId aims to develop a set of algorithms that detect the dialect of a given text. For example, given a text in Spanish, DialectId predicts the Spanish-speaking country from which the text comes.
DialectId is available for Arabic (ar), German (de), English (en), Spanish (es), French (fr), Dutch (nl), Portuguese (pt), Russian (ru), Turkish (tr), and Chinese (zh).
DialectId can be installed using the conda package manager with the following instruction.
conda install --channel conda-forge dialectid
A more general approach to installing DialectId is to use pip, as illustrated in the following instruction.
pip install dialectid
DialectId can be used to predict the dialect of a list of texts using the method predict, as seen in the following lines. The first line imports the DialectId class, the second instantiates the class for the Spanish language, and the third predicts the dialect of two utterances. The first utterance is an expression that would be common in Mexico, and the second could be associated with Argentina, Uruguay, Chile, and other South American countries.
from dialectid import DialectId
detect = DialectId(lang='es')
detect.predict(['comiendo unos tacos',
                'acompañando el asado con un buen vino'])
array(['mx', 'uy'], dtype='<U2')
The available dialects for each language can be found in the attribute countries, as seen in the following snippet for Spanish.
from dialectid import DialectId
detect = DialectId(lang='es')
detect.countries
array(['ar', 'bo', 'cl', 'co', 'cr', 'cu', 'do', 'ec', 'es', 'gq', 'gt',
'hn', 'mx', 'ni', 'pa', 'pe', 'pr', 'py', 'sv', 'uy', 've'],
dtype='<U2')
One might be interested in all the countries from which the speaker could come. To facilitate this, one can use the decision_function method. DialectId uses linear Support Vector Machines (SVM) as classifiers; consequently, a positive value in the decision function is interpreted as belonging to the positive class, i.e., a particular country. The following code exemplifies this idea: the first two lines import and instantiate the DialectId class in Spanish. The third line computes the decision-function values; the method returns a two-dimensional array whose first dimension corresponds to the number of texts, and since only one text is given, the code keeps the values of that text (index 0), where positive values indicate the presence of a particular country. The fourth line computes the indices that sort the values from highest to lowest. The fifth line retrieves each country and its associated decision-function value, considering only those countries with positive values.
from dialectid import DialectId
detect = DialectId(lang='es')
df = detect.decision_function(['acompañando el asado con un buen vino'])[0]
index = df.argsort()[::-1]
[(detect.countries[i], df[i]) for i in index if df[i] > 0]
[(np.str_('uy'), np.float32(1.5416805)),
(np.str_('py'), np.float32(1.3321806)),
(np.str_('ar'), np.float32(1.2182581))]
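The selection performed in the last line can also be written as a small helper that works on any pair of labels and scores. The sketch below is only an illustration and is not part of the library: `positive_countries` is a hypothetical name, and the scores are rounded toy values whose positive entries mimic the output above, while the negative entries are made up.

```python
def positive_countries(countries, scores):
    """Return (country, score) pairs with a positive score, highest first."""
    pairs = [(c, s) for c, s in zip(countries, scores) if s > 0]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Toy values: the negative scores are invented for illustration.
countries = ['ar', 'cl', 'mx', 'py', 'uy']
scores = [1.22, -0.35, -1.10, 1.33, 1.54]
ranked = positive_countries(countries, scores)
print(ranked)
```

Keeping every country with a positive score, rather than only the top one, reflects that each one-vs-all classifier votes independently, so several countries can be plausible for the same text.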
In some situations, one is interested in the probability instead of the decision-function values of a linear SVM. The probability can be computed using the predict_proba method. The following code exemplifies this idea: the first line imports the DialectId class as in previous examples. The second line differs from the last example in that the parameter probability is set to True. The remaining lines are almost equivalent to the previous example, except that the four countries with the highest probability are kept.
from dialectid import DialectId
detect = DialectId(lang='es', probability=True)
prob = detect.predict_proba(['acompañando el asado con un buen vino'])[0]
index = prob.argsort()[::-1]
[(detect.countries[i], prob[i]) for i in index[:4]]
[(np.str_('uy'), np.float32(0.45955184)),
(np.str_('ar'), np.float32(0.353442)),
(np.str_('py'), np.float32(0.18695451)),
(np.str_('cl'), np.float32(2.8124754e-05))]
Country | train | test | train (orig. dist.) | test (orig. dist.) | Corpus |
---|---|---|---|---|---|
Saudi Arabia | 119009 | 4096 | 1139707 | 1101214 | 61578662 |
Egypt | 119122 | 4096 | 271439 | 287583 | 14665935 |
Kuwait | 119117 | 4096 | 188944 | 187432 | 10208696 |
United Arab Emirates | 119600 | 4096 | 115345 | 105957 | 6232153 |
Oman | 119771 | 4096 | 70484 | 70730 | 3808309 |
Iraq | 119655 | 4096 | 50912 | 63215 | 2750834 |
Qatar | 119362 | 4096 | 48860 | 46962 | 2639967 |
Bahrain | 119666 | 4096 | 45196 | 38131 | 2441971 |
Lebanon | 119370 | 4096 | 35812 | 30455 | 1934983 |
Jordan | 119718 | 4096 | 34619 | 33242 | 1870514 |
Libya | 119659 | 4096 | 31495 | 29417 | 1701716 |
Yemen | 119823 | 4096 | 16917 | 33165 | 914053 |
Algeria | 119143 | 4096 | 16609 | 18617 | 897394 |
Morocco | 119556 | 4096 | 9600 | 16093 | 518739 |
Sudan | 120078 | 4096 | 7662 | 16291 | 413993 |
Tunisia | 119244 | 4096 | 6405 | 7435 | 346082 |
Syria | 119159 | 4093 | 5768 | 9596 | 311660 |
Mauritania | 41017 | 1809 | 844 | 760 | 45624 |
Somalia | 17410 | 561 | 355 | 234 | 19215 |
Chad | 4797 | 706 | 105 | 295 | 5706 |
Djibouti | 2873 | 309 | 63 | 152 | 3420 |
Sum | 2097149 | 73014 | 2097141 | 2096976 | 113309626 |
Country | train | test | train (orig. dist.) | test (orig. dist.) | Corpus |
---|---|---|---|---|---|
Germany | 83023 | 4096 | 80620 | 1110160 | 1262931 |
Austria | 7004 | 4096 | 7004 | 100180 | 109718 |
Switzerland | 4573 | 4096 | 4547 | 64578 | 71231 |
Sum | 94600 | 12288 | 92171 | 1274918 | 1443880 |
Country | train | test | train (orig. dist.) | test (orig. dist.) | Corpus |
---|---|---|---|---|---|
United States | 36492 | 4096 | 1411215 | 1269614 | 1241245184 |
United Kingdom | 36417 | 4096 | 284076 | 304237 | 249861873 |
Canada | 36338 | 4096 | 79421 | 81213 | 69855356 |
India | 36348 | 4096 | 76678 | 128000 | 67442862 |
Nigeria | 36199 | 4096 | 43566 | 73111 | 38319549 |
South Africa | 36323 | 4096 | 42569 | 43805 | 37442472 |
Australia | 36373 | 4096 | 38466 | 46482 | 33833515 |
Philippines | 36599 | 4096 | 36427 | 21966 | 32039887 |
Ireland | 36352 | 4096 | 20796 | 24196 | 18291944 |
Kenya | 36231 | 4096 | 10383 | 20803 | 9132974 |
Pakistan | 36376 | 4096 | 9236 | 16466 | 8124273 |
Ghana | 36395 | 4096 | 8702 | 14811 | 7654670 |
New Zealand | 36361 | 4096 | 6610 | 8397 | 5813959 |
Singapore | 36048 | 4096 | 5608 | 4189 | 4933008 |
Uganda | 36511 | 4096 | 4662 | 15003 | 4100771 |
Jamaica | 36332 | 4096 | 3185 | 3332 | 2801604 |
Zimbabwe | 36237 | 4096 | 1809 | 3387 | 1591827 |
Trinidad and Tobago | 36459 | 4096 | 1725 | 1968 | 1517980 |
Zambia | 36686 | 4096 | 1468 | 2464 | 1291544 |
Namibia | 36553 | 4096 | 1268 | 1752 | 1115587 |
Bahamas | 36223 | 4096 | 1265 | 1110 | 1113202 |
Barbados | 36478 | 4096 | 868 | 766 | 764085 |
Malawi | 36373 | 4096 | 753 | 1944 | 662789 |
Rwanda | 36374 | 4096 | 496 | 946 | 436529 |
Cameroon | 36461 | 4096 | 416 | 785 | 365902 |
Malta | 36405 | 4096 | 398 | 560 | 350352 |
Antigua and Barbuda | 36526 | 4096 | 356 | 347 | 313582 |
Guam | 36494 | 3008 | 351 | 101 | 309229 |
St. Lucia | 36223 | 4096 | 313 | 235 | 275897 |
Eswatini | 36408 | 4096 | 268 | 354 | 236190 |
Mauritius | 36306 | 4096 | 263 | 211 | 231391 |
Bermuda | 36319 | 4096 | 259 | 299 | 227865 |
Isle of Man | 36220 | 1495 | 248 | 50 | 218569 |
Lesotho | 35926 | 4096 | 241 | 491 | 212309 |
Cayman Islands | 36161 | 4096 | 204 | 191 | 180023 |
Gambia | 36296 | 4096 | 204 | 516 | 179764 |
Gibraltar | 36216 | 4096 | 193 | 224 | 170041 |
Sierra Leone | 36278 | 4096 | 183 | 532 | 161814 |
Turks and Caicos Islands | 36277 | 3064 | 179 | 106 | 158077 |
Sudan | 36460 | 4096 | 165 | 177 | 145226 |
St. Vincent and the Grenadines | 36324 | 4096 | 160 | 209 | 140768 |
Belize | 36538 | 4096 | 154 | 211 | 136040 |
Liberia | 36247 | 4096 | 136 | 389 | 120223 |
Grenada | 36573 | 2761 | 134 | 97 | 118559 |
British Virgin Islands | 36276 | 1650 | 126 | 57 | 111011 |
Guyana | 36654 | 4096 | 106 | 193 | 93531 |
St. Kitts and Nevis | 36601 | 3652 | 106 | 125 | 93321 |
United States Virgin Islands | 36592 | 219 | 103 | 7 | 90888 |
Northern Mariana Islands | 36550 | 617 | 100 | 21 | 88606 |
Papua New Guinea | 36038 | 3904 | 89 | 136 | 78435 |
Puerto Rico | 36594 | 3164 | 74 | 113 | 65874 |
Dominica | 36452 | 1140 | 63 | 38 | 55815 |
Sint Maarten | 36311 | 1745 | 57 | 59 | 50880 |
Fiji | 36538 | 1934 | 53 | 65 | 47474 |
Guernsey | 24068 | 1790 | 31 | 62 | 27863 |
Tonga | 25728 | 901 | 31 | 31 | 27690 |
Anguilla | 23965 | 1250 | 30 | 42 | 26826 |
Vanuatu | 13985 | 767 | 17 | 26 | 15601 |
Falkland Islands | 11917 | 412 | 14 | 13 | 13158 |
Micronesia, Fed. Sts. | 7497 | 266 | 10 | 9 | 9306 |
Cook Islands | 8053 | 274 | 10 | 9 | 8961 |
Solomon Islands | 8141 | 458 | 10 | 15 | 9479 |
Palau | 6556 | 691 | 8 | 23 | 7491 |
St. Helena | 2876 | 974 | 5 | 32 | 4458 |
Sum | 2097128 | 204072 | 2097120 | 2097123 | 1844565933 |
Country | train | test | train (orig. dist.) | test (orig. dist.) | Corpus |
---|---|---|---|---|---|
Argentina | 104466 | 4096 | 536137 | 415844 | 156910687 |
Spain | 103924 | 4096 | 401933 | 421172 | 117633385 |
Mexico | 104318 | 4096 | 353432 | 388367 | 103438764 |
Colombia | 104267 | 4096 | 204831 | 261796 | 59947766 |
Chile | 104027 | 4096 | 156319 | 143770 | 45749886 |
Venezuela | 104496 | 4096 | 109073 | 88022 | 31922346 |
Uruguay | 103733 | 4096 | 66209 | 60004 | 19377563 |
Ecuador | 104408 | 4096 | 53037 | 68303 | 15522286 |
Peru | 103907 | 4096 | 52144 | 59480 | 15261118 |
Paraguay | 104617 | 4096 | 33486 | 37244 | 9800404 |
Dominican Republic | 104000 | 4096 | 30142 | 36468 | 8821881 |
Panama | 104014 | 4096 | 25525 | 27081 | 7470575 |
Costa Rica | 104415 | 4096 | 19730 | 16252 | 5774617 |
Guatemala | 103800 | 4096 | 17401 | 22567 | 5092733 |
El Salvador | 104111 | 4096 | 10990 | 12949 | 3216498 |
Honduras | 104020 | 4096 | 8660 | 14988 | 2534698 |
Nicaragua | 104438 | 4096 | 8435 | 6951 | 2468938 |
Bolivia | 103537 | 4096 | 4913 | 6523 | 1438141 |
Cuba | 104570 | 4096 | 3359 | 8783 | 983104 |
Puerto Rico | 103994 | 1487 | 1320 | 149 | 386595 |
Equatorial Guinea | 14090 | 4096 | 64 | 429 | 18783 |
Sum | 2097152 | 83407 | 2097140 | 2097142 | 613770768 |
Country | train | test | train (orig. dist.) | test (orig. dist.) | Corpus |
---|---|---|---|---|---|
France | 215940 | 4096 | 1488423 | 1475608 | 9241569 |
Canada | 215745 | 4096 | 122173 | 134958 | 758570 |
DR Congo | 214853 | 4096 | 93877 | 100146 | 582879 |
Cameroon | 215520 | 4096 | 92835 | 97799 | 576414 |
Belgium | 214731 | 4096 | 65710 | 72143 | 407994 |
Senegal | 216403 | 4096 | 53106 | 47977 | 329734 |
Cote d’Ivoire | 215555 | 4096 | 52423 | 44793 | 325496 |
Switzerland | 116815 | 4096 | 24217 | 19484 | 150366 |
Guinea | 90619 | 4096 | 19163 | 16753 | 118984 |
Benin | 63347 | 4096 | 14310 | 14802 | 88856 |
Mali | 54495 | 4096 | 12232 | 12809 | 75952 |
Togo | 44077 | 4096 | 9698 | 10088 | 60220 |
Burkina Faso | 28514 | 4096 | 6957 | 7326 | 43197 |
Gabon | 27051 | 4096 | 5986 | 6156 | 37173 |
Haiti | 25939 | 4096 | 5909 | 5850 | 36694 |
Niger | 27468 | 4096 | 5900 | 5742 | 36638 |
Congo Republic | 26441 | 4096 | 5584 | 5094 | 34676 |
Chad | 15160 | 4096 | 3452 | 4008 | 21439 |
Monaco | 13037 | 4096 | 2901 | 3029 | 18014 |
Luxembourg | 11358 | 4096 | 2820 | 3506 | 17511 |
Central African Republic | 13445 | 2122 | 2551 | 1453 | 15840 |
New Caledonia | 7150 | 1715 | 1486 | 1222 | 9230 |
French Polynesia | 6408 | 2304 | 1459 | 1610 | 9065 |
Djibouti | 6331 | 2237 | 1429 | 1577 | 8873 |
Comoros | 6025 | 1736 | 1273 | 1197 | 7908 |
Rwanda | 4695 | 2749 | 1263 | 2010 | 7848 |
Sum | 2097122 | 94783 | 2097137 | 2097140 | 13021140 |
Country | train | test | train (orig. dist.) | test (orig. dist.) | Corpus |
---|---|---|---|---|---|
Netherlands | 102887 | 4096 | 102887 | 947365 | 1092054 |
Belgium | 16880 | 4096 | 15640 | 144451 | 166015 |
Sum | 119767 | 8192 | 118527 | 1091816 | 1258069 |
Country | train | test | train (orig. dist.) | test (orig. dist.) | Corpus |
---|---|---|---|---|---|
Brazil | 978254 | 4096 | 2049142 | 2046379 | 79748620 |
Portugal | 979061 | 4096 | 43008 | 45744 | 1673795 |
Mozambique | 78571 | 4096 | 2656 | 2366 | 103393 |
Angola | 53243 | 4096 | 2071 | 2400 | 80610 |
Cabo Verde | 8022 | 2434 | 273 | 260 | 10627 |
Sum | 2097151 | 18818 | 2097150 | 2097149 | 81617045 |
Country | train | test | train (orig. dist.) | test (orig. dist.) | Corpus |
---|---|---|---|---|---|
Russia | 962191 | 4096 | 1887634 | 311120 | 11487343 |
Belarus | 652818 | 4096 | 114108 | 25498 | 694419 |
Kazakhstan | 336189 | 4096 | 67909 | 35634 | 413267 |
Kyrgyz Republic | 145942 | 4096 | 27499 | 18562 | 167350 |
Sum | 2097140 | 16384 | 2097150 | 390814 | 12762379 |
Country | train | test | train (orig. dist.) | test (orig. dist.) | Corpus |
---|---|---|---|---|---|
Türkiye | 550662 | 4096 | 550664 | 184575 | 756902 |
Cyprus | 3085 | 924 | 2936 | 924 | 4036 |
Sum | 553747 | 5020 | 553600 | 185499 | 760938 |
Country | train | test | train (orig. dist.) | test (orig. dist.) | Corpus |
---|---|---|---|---|---|
China | 206312 | 4096 | 206313 | 175991 | 510998 |
Taiwan | 115165 | 4096 | 100741 | 110266 | 249518 |
Hong Kong | 13944 | 4096 | 10315 | 9315 | 25549 |
Singapore | 4452 | 3963 | 4130 | 4623 | 10230 |
Sum | 339873 | 16251 | 321499 | 300195 | 796295 |
The dataset used to create the self-supervised problems is a collection of tweets gathered from the open stream over several years. Specifically, the Spanish collection started on December 11, 2015; English on July 1, 2016; Arabic on January 25, 2017; Russian on October 16, 2018; and the rest of the languages on June 1, 2021. In all cases, the last day collected was June 9, 2023. The collected tweets were filtered with the following restrictions: retweets were removed; URLs and usernames were replaced by the tokens _url and _usr, respectively; and only tweets with at least 50 characters were included in the final collection. Table 1 (column Corpus) and Figure 1 show the number of tweets collected for the Arabic-speaking countries. The figure shows that more tweets are collected on some days than on others, and that fewer tweets tend to be collected in 2023 due to changes in the Twitter API. The data corresponding to German, English, Spanish, French, Dutch, Portuguese, Russian, Turkish, and Chinese are shown in Tables 2 to 10 and Figures 2 to 10, in that order.
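The filtering restrictions can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used to build the corpus: the function name `clean_tweet`, the regular expressions, the use of the `RT @` prefix to detect retweets, and checking the length after the replacements are all assumptions.

```python
import re

URL_RE = re.compile(r'https?://\S+')   # assumed URL pattern
USER_RE = re.compile(r'@\w+')          # assumed username pattern


def clean_tweet(text, min_chars=50):
    """Return the filtered tweet, or None when the tweet is discarded."""
    if text.startswith('RT @'):        # assumption: retweets carry this prefix
        return None
    text = URL_RE.sub('_url', text)
    text = USER_RE.sub('_usr', text)
    # Assumption: the 50-character limit is checked after the replacements.
    return text if len(text) >= min_chars else None


examples = [
    'RT @someone: this is a retweet and it is dropped regardless of its length',
    '@friend mira esto https://example.com/page que bueno estuvo el partido de ayer',
    'too short @friend',
]
cleaned = [clean_tweet(t) for t in examples]
print(cleaned)
```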
The corpora are used to create two pairs of training and test sets. The training sets are drawn from tweets published before October 1, 2022, and the test sets are taken from tweets published on or after October 3, 2022. The procedure for creating the set pairs consists of two stages. In the first stage, the tweets were organized by country and then selected to form a uniform distribution by day. Within each day, near duplicates were removed. Then, a three-day sliding window was used to remove near duplicates within the window. The final step was to shuffle the data to remove the ordering by date, respecting the limit between the training and test sets.
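The date boundary between the training and test sets can be illustrated as follows. The sketch only covers the split by publication date; the per-day sampling and the near-duplicate removal are not shown because the text does not specify how they are implemented. The name `split_by_date` and the `(day, text)` tuple layout are hypothetical.

```python
from datetime import date

TRAIN_END = date(2022, 10, 1)    # training tweets are published before this day
TEST_START = date(2022, 10, 3)   # test tweets are published on or after this day


def split_by_date(tweets):
    """Split (day, text) pairs into training and test lists.

    Tweets published in the gap between the two dates belong to neither set.
    """
    train = [t for t in tweets if t[0] < TRAIN_END]
    test = [t for t in tweets if t[0] >= TEST_START]
    return train, test


tweets = [(date(2022, 9, 30), 'a'), (date(2022, 10, 1), 'b'),
          (date(2022, 10, 2), 'c'), (date(2022, 10, 3), 'd')]
train, test = split_by_date(tweets)
print(train, test)
```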
The tweets of the first pair were selected to follow a uniform distribution by country as closely as possible. In this pair, the size of the training set is roughly 2 million tweets, whereas the test set contains \(2^{12}\) (4,096) tweets per country. We also produced a smaller training set containing 262 thousand tweets; the procedure is equivalent to the previous one, aiming for a uniform distribution over the countries. In Tables 1 to 10, the column train shows the size of the training set, and the column test indicates the size of the test set in the first pair of training and test sets. It is worth mentioning that there was not enough data for every country and language to follow an exactly uniform distribution. For example, in Table 4 (Spanish), the 1,487 tweets in the test set of Puerto Rico (pr) are all the available tweets that meet the imposed restrictions.
The second pair of training and test sets was selected to follow the original distribution of the corpus; in this case, each set has a maximum size of 2 million tweets. The selection was posed as a convex optimization problem whose objective is to maximize the number of tweets subject to three constraints: a maximum of 2 million (\(2^{21}\)) tweets, the availability of tweets for each country, and a country distribution equal to the one observed in all the available tweets. In Tables 1 to 10, the column train (orig. dist.) shows the size of the training set, and the column test (orig. dist.) indicates the size of the test set in the second pair of training and test sets.
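One way to read this optimization problem: when the country shares are fixed by the corpus, maximizing the total size amounts to finding the largest total for which no country needs more tweets than it has available. The sketch below implements that reading in closed form; it is an interpretation, not the procedure used to build the sets. The corpus counts are the Corpus column for the German-speaking countries, whereas the availability figures are stand-ins borrowed from the train column, since the text does not list them. The name `proportional_sizes` is hypothetical.

```python
def proportional_sizes(corpus, available, max_size=2**21):
    """Largest per-country sizes that keep the corpus distribution.

    corpus: tweets observed per country (defines the target distribution).
    available: tweets that can actually be selected per country.
    """
    total = sum(corpus.values())
    share = {c: n / total for c, n in corpus.items()}
    # The total size is bounded by max_size and by each country's availability.
    size = min([max_size] + [available[c] / share[c] for c in corpus])
    return {c: int(size * share[c]) for c in corpus}


corpus = {'de': 1262931, 'at': 109718, 'ch': 71231}
available = {'de': 83023, 'at': 7004, 'ch': 4573}   # stand-in values
sizes = proportional_sizes(corpus, available)
print(sizes, sum(sizes.values()))
```

In this example the country with the fewest available tweets relative to its share (Austria) is the binding constraint, so the total stays far below \(2^{21}\).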
DialectId is a text classifier based on a Bag of Words (BoW) representation with a linear Support Vector Machine (SVM) as the classifier.
The normalization procedure used in the BoW corresponds to setting all characters to lowercase, removing diacritics, and replacing usernames and URLs with the tags “_usr” and “_url”, respectively.
The BoW representation weights the tokens with the term frequency and inverse document frequency (TF-IDF). The tokens correspond to words, bi-grams of words, and q-grams of characters (with q=4, 3, 2). The tokens and weights were estimated using each language’s training dataset (2 million tweets). The tokens (vocabulary) with higher frequency in the training set were kept. We developed systems for different vocabulary sizes, i.e., \(2^{17}\), \(2^{18}\), and \(2^{19}\).
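To make the normalization and the token types concrete, the following sketch lowercases a text, strips diacritics, and extracts words, word bi-grams, and character q-grams. It is an illustration only and does not reproduce the tokenizer used by the library: the function names, the `~` separator for bi-grams, and letting q-grams span whitespace are assumptions, and the TF-IDF weighting is not shown.

```python
import unicodedata


def normalize(text):
    """Lowercase the text and remove diacritics (combining marks)."""
    decomposed = unicodedata.normalize('NFD', text.lower())
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokens(text, qs=(4, 3, 2)):
    """Return words, word bi-grams, and character q-grams of the text."""
    text = normalize(text)
    words = text.split()
    # Assumption: bi-grams are joined with '~' to tell them apart from words.
    bigrams = [f'{a}~{b}' for a, b in zip(words, words[1:])]
    # Assumption: q-grams are taken over the whole string, spaces included.
    qgrams = [text[i:i + q] for q in qs for i in range(len(text) - q + 1)]
    return words + bigrams + qgrams


toks = tokens('Buen día')
print(toks)
```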
The BoW can be used by importing the BoW class, as seen in the following example, where the text "good morning" is transformed into a vector space. The first line imports the class; the second line instantiates the class, where the parameter token_max_filter indicates the vocabulary size; and the third line converts the text into a vector space.
from dialectid import BoW
bow = BoW(lang='en', token_max_filter=2**18)
bow.transform(['good morning'])
<Compressed Sparse Row sparse matrix of dtype 'float32'
with 36 stored elements and shape (1, 262144)>
Each text in the training set is represented in the vector space, and the associated country is retained for use in a linear SVM using the one-vs-all strategy. The approach creates as many binary classification problems as there are different classes. In the binary problems, each class is the positive class exactly once and is part of the negative class in the remaining cases. Traditionally, one uses all the examples in each binary problem, which is the case for the reduced training set (262 thousand tweets). For the whole training set, however, the negative examples were limited to the maximum number of positive classes or \(2^{14}\) tweets. In both cases, the examples are weighted inversely proportional to class frequencies to deal with the class imbalance.
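Weighting examples inversely proportional to class frequencies can be illustrated with the heuristic that scikit-learn applies for `class_weight='balanced'`, namely n_samples / (n_classes * class_count). Whether DialectId uses exactly this formula is an assumption; the sketch only shows the idea, and `balanced_weights` is a hypothetical name.

```python
from collections import Counter


def balanced_weights(labels):
    """Weight of each class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * m) for c, m in counts.items()}


# Toy labels: 'mx' is frequent and 'uy' is rare.
labels = ['mx'] * 6 + ['ar'] * 3 + ['uy'] * 1
weights = balanced_weights(labels)
print(weights)
```

With these weights, rare classes contribute as much to the training loss as frequent ones, because the weights summed over all the examples of a class are the same for every class.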
Complementing the previous example, the following code instantiates the DialectId class in Spanish using a vocabulary size of \(2^{18}\), indicated by the parameters lang and token_max_filter, respectively.
from dialectid import DialectId
detect = DialectId(lang='es', token_max_filter=2**18)
detect.predict(['comiendo unos tacos'])
array(['mx'], dtype='<U2')
A drawback of using an SVM is that it does not estimate the classification probability. For some applications, it is more convenient to work with a probability than with a decision-function value. Thus, the developed systems are calibrated to estimate the probability by training a logistic regression that uses the SVM’s decision function as input. The calibration procedure involves predicting the SVM’s decision function on the reduced training set using stratified k-fold cross-validation (k = 3). The predicted decision-function values are the inputs of the logistic regression, and the classes are the ones in the reduced training set; as before, each example is weighted inversely proportional to its class frequency. To invoke the model using probability, the parameter probability must be set to True, as shown in the following example.
from dialectid import DialectId
detect = DialectId(lang='es', probability=True)
detect.predict_proba(['comiendo unos tacos'])
array([[1.9695617e-06, 4.8897579e-07, 1.8095196e-05, 3.6204598e-05,
1.5155481e-03, 1.1557303e-05, 5.4983921e-06, 2.8960928e-06,
2.2128135e-05, 5.0534654e-05, 1.7656431e-01, 2.0665383e-02,
6.3909459e-01, 1.3323617e-01, 1.0678391e-04, 7.2897665e-06,
2.6716828e-02, 6.4306505e-08, 1.9328916e-03, 8.0165564e-06,
2.7745039e-06]], dtype=float32)
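The calibration idea described above can be sketched with scikit-learn on synthetic data. This is not the training code of DialectId: the synthetic dataset, the hyperparameters, and the variable names are assumptions, and the sketch calibrates on the same data used to train the SVM, whereas the text uses the reduced training set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.svm import LinearSVC

# Synthetic three-class problem standing in for the BoW vectors and countries.
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)

svm = LinearSVC(class_weight='balanced', max_iter=5000)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Out-of-fold decision-function values: one column per class (one-vs-rest).
scores = cross_val_predict(svm, X, y, cv=cv, method='decision_function')

# The logistic regression maps decision-function values to probabilities.
calibrator = LogisticRegression(class_weight='balanced', max_iter=1000)
calibrator.fit(scores, y)

# At prediction time: SVM decision function first, then the calibrator.
svm.fit(X, y)
proba = calibrator.predict_proba(svm.decision_function(X[:2]))
print(proba)
```

Using out-of-fold predictions keeps the calibrator from being fitted on scores that the SVM produced for its own training examples, which would tend to be overconfident.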
As described previously, there are two training sets: one that follows a uniform distribution over the countries as closely as possible, and a second one that follows the distribution seen in the corpus, namely the original distribution (identified as orig. dist.). The parameter uniform_distribution indicates which training set is used to estimate the parameters. By default, the parameter is set to True to use the training sets with a uniform distribution over the countries. The following example sets it to False to use the model trained with the original distribution.
from dialectid import DialectId
detect = DialectId(lang='es',
                   uniform_distribution=False,
                   probability=True)
detect.predict_proba(['comiendo unos tacos'])
array([[5.8839246e-06, 4.0277046e-06, 1.5494716e-05, 3.9904633e-05,
1.5577762e-02, 6.6911220e-05, 9.4599171e-05, 2.6315629e-05,
1.4272113e-05, 4.7489698e-06, 5.5686768e-02, 2.2258271e-02,
8.8939697e-01, 1.3783798e-02, 3.9202135e-04, 8.4752110e-06,
4.6369270e-05, 9.1043790e-07, 2.5635022e-03, 5.8783221e-06,
7.1929417e-06]], dtype=float32)
Language | Spanish | English | Arabic | German | French | Dutch | Portuguese | Russian | Turkish | Chinese |
---|---|---|---|---|---|---|---|---|---|---|
DialectId[19] | 0.3936 | 0.2992 | 0.4080 | 0.5418 | 0.3647 | 0.7708 | 0.6053 | 0.4922 | 0.5423 | 0.7012 |
DialectId[18] | 0.3886 | 0.2960 | 0.4036 | 0.5426 | 0.3607 | 0.7734 | 0.6049 | 0.4885 | 0.5450 | 0.7032 |
DialectId[17] | 0.3818 | 0.2917 | 0.3980 | 0.5426 | 0.3556 | 0.7742 | 0.6026 | 0.4852 | 0.5477 | 0.7026 |
DialectId[19] (prob) | 0.3898 | 0.2796 | 0.4050 | 0.5482 | 0.3632 | 0.7708 | 0.6190 | 0.4837 | 0.5423 | 0.7213 |
DialectId[18] (prob) | 0.3854 | 0.2779 | 0.4011 | 0.5548 | 0.3595 | 0.7734 | 0.6170 | 0.4769 | 0.5450 | 0.7213 |
DialectId[17] (prob) | 0.3779 | 0.2743 | 0.3953 | 0.5519 | 0.3545 | 0.7742 | 0.6151 | 0.4793 | 0.5477 | 0.7187 |
DialectId[19] (262k) | 0.3528 | 0.2828 | 0.3803 | 0.4883 | 0.3587 | 0.7701 | 0.6136 | 0.4604 | 0.5500 | 0.6723 |
DialectId[18] (262k) | 0.3476 | 0.2787 | 0.3767 | 0.4945 | 0.3540 | 0.7731 | 0.6091 | 0.4590 | 0.5532 | 0.6732 |
DialectId[17] (262k) | 0.3412 | 0.2726 | 0.3700 | 0.5002 | 0.3476 | 0.7731 | 0.6041 | 0.4598 | 0.5562 | 0.6740 |
StackBoW (262k) | 0.3331 | 0.2441 | 0.3673 | 0.4893 | 0.3339 | 0.7823 | 0.6000 | 0.4468 | 0.5649 | 0.6859 |
DialectId[19] (Orig. Dist.) | 0.3263 | 0.1262 | 0.3207 | 0.5408 | 0.3025 | 0.7667 | 0.4199 | 0.4477 | 0.5401 | 0.6924 |
Language | Spanish | English | Arabic | German | French | Dutch | Portuguese | Russian | Turkish | Chinese |
---|---|---|---|---|---|---|---|---|---|---|
DialectId[19] | 0.3939 | 0.2973 | 0.4041 | 0.5398 | 0.3625 | 0.7661 | 0.5977 | 0.4823 | 0.5421 | 0.7011 |
DialectId[18] | 0.3879 | 0.2922 | 0.3982 | 0.5413 | 0.3590 | 0.7689 | 0.5980 | 0.4797 | 0.5450 | 0.7030 |
DialectId[17] | 0.3808 | 0.2871 | 0.3942 | 0.5427 | 0.3543 | 0.7711 | 0.5979 | 0.4780 | 0.5473 | 0.7028 |
DialectId[19] (prob) | 0.3899 | 0.2744 | 0.4022 | 0.5443 | 0.3612 | 0.7661 | 0.6141 | 0.4731 | 0.5421 | 0.7229 |
DialectId[18] (prob) | 0.3849 | 0.2721 | 0.3972 | 0.5506 | 0.3579 | 0.7689 | 0.6175 | 0.4698 | 0.5450 | 0.7221 |
DialectId[17] (prob) | 0.3769 | 0.2680 | 0.3924 | 0.5496 | 0.3527 | 0.7711 | 0.6116 | 0.4701 | 0.5473 | 0.7190 |
DialectId[19] (Orig. Dist.) | 0.3264 | 0.1237 | 0.3156 | 0.5389 | 0.3017 | 0.7620 | 0.4207 | 0.4436 | 0.5401 | 0.6922 |
The performance of the different algorithms is presented in Figure 11 using macro-recall. The best-performing system in almost all cases is DialectId trained on 2 million tweets with a vocabulary of \(2^{19}\) (roughly 500,000) tokens. The exceptions are Turkish and Dutch, where the best system is StackBoW trained with only 262k tweets.
The remaining figures provide details on macro-recall by presenting the system’s recall in each country.
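For reference, macro-recall is the unweighted average of the recall obtained in each country, so small countries count as much as large ones. The sketch below computes it from scratch on toy labels; the function name and the data are illustrative only.

```python
from collections import Counter


def macro_recall(y_true, y_pred):
    """Return the macro-recall and the recall of each class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = {c: hits[c] / totals[c] for c in totals}
    return sum(recalls.values()) / len(recalls), recalls


# Toy example with three countries.
y_true = ['mx', 'mx', 'mx', 'mx', 'ar', 'ar', 'uy', 'uy']
y_pred = ['mx', 'mx', 'mx', 'ar', 'ar', 'uy', 'uy', 'uy']
score, per_country = macro_recall(y_true, y_pred)
print(score, per_country)
```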
Country | United States | Brazil | United Kingdom | Italy | France | Canada | Germany | Portugal |
---|---|---|---|---|---|---|---|---|
Argentina | 0.000 | 0.523 | 0.000 | 0.445 | 0.000 | 0.000 | 0.032 | 0.000 |
Bolivia | 0.000 | 0.640 | 0.132 | 0.227 | 0.000 | 0.000 | 0.001 | 0.000 |
Chile | 0.000 | 0.114 | 0.162 | 0.091 | 0.015 | 0.283 | 0.336 | 0.000 |
Colombia | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 |
Costa Rica | 0.002 | 0.244 | 0.000 | 0.000 | 0.000 | 0.151 | 0.602 | 0.000 |
Cuba | 0.000 | 0.361 | 0.051 | 0.312 | 0.000 | 0.076 | 0.096 | 0.104 |
Dominican Republic | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Ecuador | 0.010 | 0.112 | 0.000 | 0.098 | 0.000 | 0.518 | 0.070 | 0.192 |
Spain | 0.000 | 0.000 | 0.325 | 0.000 | 0.035 | 0.000 | 0.000 | 0.640 |
Equatorial Guinea | 0.000 | 0.122 | 0.084 | 0.000 | 0.593 | 0.000 | 0.000 | 0.200 |
Guatemala | 0.001 | 0.000 | 0.000 | 0.000 | 0.021 | 0.863 | 0.115 | 0.000 |
Honduras | 0.890 | 0.000 | 0.000 | 0.030 | 0.000 | 0.055 | 0.000 | 0.026 |
Mexico | 0.260 | 0.032 | 0.000 | 0.000 | 0.000 | 0.707 | 0.000 | 0.000 |
Nicaragua | 0.790 | 0.000 | 0.000 | 0.001 | 0.000 | 0.176 | 0.033 | 0.000 |
Panama | 0.854 | 0.000 | 0.000 | 0.028 | 0.000 | 0.032 | 0.069 | 0.017 |
Peru | 0.000 | 0.038 | 0.000 | 0.944 | 0.000 | 0.018 | 0.000 | 0.000 |
Puerto Rico | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Paraguay | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
El Salvador | 0.665 | 0.000 | 0.000 | 0.000 | 0.000 | 0.335 | 0.000 | 0.000 |
Uruguay | 0.000 | 0.989 | 0.000 | 0.000 | 0.000 | 0.000 | 0.011 | 0.000 |
Venezuela | 0.042 | 0.000 | 0.000 | 0.164 | 0.000 | 0.455 | 0.000 | 0.339 |
Country | Malaysia | Indonesia | Brazil | Germany | Spain | France | Italy | United Arab Emirates |
---|---|---|---|---|---|---|---|---|
Antigua and Barbuda | 0.044 | 0.000 | 0.545 | 0.000 | 0.412 | 0.000 | 0.000 | 0.000 |
Anguilla | 0.000 | 0.001 | 0.000 | 0.076 | 0.000 | 0.657 | 0.037 | 0.230 |
Australia | 0.000 | 0.000 | 0.000 | 0.253 | 0.510 | 0.236 | 0.000 | 0.000 |
Barbados | 0.000 | 0.000 | 0.958 | 0.000 | 0.034 | 0.009 | 0.000 | 0.000 |
Bermuda | 0.000 | 0.000 | 0.028 | 0.410 | 0.070 | 0.374 | 0.119 | 0.000 |
Bahamas | 0.000 | 0.000 | 0.875 | 0.000 | 0.000 | 0.000 | 0.046 | 0.079 |
Belize | 0.000 | 0.000 | 0.239 | 0.000 | 0.695 | 0.063 | 0.003 | 0.000 |
Canada | 0.000 | 0.000 | 0.000 | 0.934 | 0.000 | 0.066 | 0.000 | 0.000 |
Cook Islands | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
Cameroon | 0.000 | 0.000 | 0.091 | 0.000 | 0.000 | 0.907 | 0.000 | 0.002 |
Dominica | 0.876 | 0.000 | 0.107 | 0.018 | 0.000 | 0.000 | 0.000 | 0.000 |
Fiji | 0.117 | 0.563 | 0.034 | 0.000 | 0.000 | 0.000 | 0.286 | 0.000 |
Falkland Islands | 0.000 | 0.000 | 0.000 | 0.284 | 0.323 | 0.394 | 0.000 | 0.000 |
Micronesia, Fed. Sts. | 0.989 | 0.011 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
United Kingdom | 0.000 | 0.000 | 0.000 | 0.046 | 0.887 | 0.067 | 0.000 | 0.000 |
Grenada | 0.966 | 0.034 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Guernsey | 0.000 | 0.000 | 0.000 | 0.088 | 0.052 | 0.583 | 0.278 | 0.000 |
Ghana | 0.000 | 0.161 | 0.119 | 0.000 | 0.124 | 0.000 | 0.000 | 0.595 |
Gibraltar | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 |
Gambia | 0.726 | 0.139 | 0.022 | 0.000 | 0.000 | 0.000 | 0.000 | 0.114 |
Guam | 0.691 | 0.000 | 0.229 | 0.000 | 0.000 | 0.000 | 0.080 | 0.000 |
Guyana | 0.000 | 0.000 | 0.958 | 0.000 | 0.000 | 0.016 | 0.026 | 0.000 |
Ireland | 0.000 | 0.000 | 0.000 | 0.000 | 0.825 | 0.175 | 0.000 | 0.000 |
Isle of Man | 0.000 | 0.000 | 0.000 | 0.066 | 0.860 | 0.074 | 0.000 | 0.000 |
India | 0.000 | 0.032 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.968 |
Jamaica | 0.009 | 0.112 | 0.047 | 0.832 | 0.000 | 0.000 | 0.000 | 0.000 |
Kenya | 0.000 | 0.109 | 0.007 | 0.000 | 0.000 | 0.000 | 0.000 | 0.884 |
St. Kitts and Nevis | 0.000 | 0.000 | 0.673 | 0.000 | 0.022 | 0.032 | 0.139 | 0.135 |
Cayman Islands | 0.000 | 0.000 | 0.063 | 0.200 | 0.000 | 0.020 | 0.716 | 0.000 |
St. Lucia | 0.054 | 0.000 | 0.166 | 0.000 | 0.780 | 0.000 | 0.000 | 0.000 |
Liberia | 0.003 | 0.017 | 0.756 | 0.000 | 0.042 | 0.000 | 0.132 | 0.050 |
Lesotho | 0.902 | 0.000 | 0.098 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Northern Mariana Islands | 0.757 | 0.000 | 0.213 | 0.000 | 0.000 | 0.000 | 0.030 | 0.000 |
Malta | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 |
Mauritius | 0.000 | 0.000 | 0.291 | 0.127 | 0.029 | 0.411 | 0.000 | 0.142 |
Malawi | 0.405 | 0.000 | 0.429 | 0.058 | 0.108 | 0.000 | 0.000 | 0.000 |
Namibia | 0.999 | 0.000 | 0.001 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Nigeria | 0.000 | 0.446 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.554 |
New Zealand | 0.000 | 0.000 | 0.067 | 0.637 | 0.072 | 0.224 | 0.000 | 0.000 |
Papua New Guinea | 0.694 | 0.243 | 0.000 | 0.063 | 0.000 | 0.000 | 0.000 | 0.000 |
Philippines | 0.000 | 0.753 | 0.052 | 0.195 | 0.000 | 0.000 | 0.000 | 0.000 |
Pakistan | 0.000 | 0.001 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.999 |
Puerto Rico | 0.000 | 0.000 | 0.788 | 0.000 | 0.212 | 0.000 | 0.000 | 0.000 |
Palau | 0.760 | 0.021 | 0.000 | 0.152 | 0.000 | 0.000 | 0.030 | 0.037 |
Rwanda | 0.000 | 0.009 | 0.123 | 0.299 | 0.094 | 0.178 | 0.173 | 0.123 |
Solomon Islands | 0.048 | 0.519 | 0.015 | 0.000 | 0.308 | 0.000 | 0.109 | 0.000 |
Sudan | 0.000 | 0.000 | 0.031 | 0.000 | 0.000 | 0.000 | 0.000 | 0.969 |
Singapore | 0.898 | 0.102 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
St. Helena | 0.000 | 0.000 | 0.095 | 0.093 | 0.164 | 0.205 | 0.443 | 0.000 |
Sierra Leone | 0.062 | 0.094 | 0.291 | 0.176 | 0.000 | 0.000 | 0.015 | 0.362 |
Sint Maarten | 0.000 | 0.000 | 0.137 | 0.000 | 0.000 | 0.863 | 0.000 | 0.000 |
Eswatini | 0.794 | 0.031 | 0.150 | 0.000 | 0.000 | 0.000 | 0.025 | 0.000 |
Turks and Caicos Islands | 0.000 | 0.022 | 0.926 | 0.000 | 0.000 | 0.000 | 0.005 | 0.047 |
Tonga | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 |
Trinidad and Tobago | 0.013 | 0.000 | 0.903 | 0.000 | 0.084 | 0.000 | 0.000 | 0.000 |
Uganda | 0.000 | 0.119 | 0.000 | 0.056 | 0.021 | 0.000 | 0.079 | 0.725 |
United States | 0.000 | 0.000 | 0.088 | 0.912 | 0.000 | 0.000 | 0.000 | 0.000 |
St. Vincent and the Grenadines | 0.932 | 0.000 | 0.068 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
British Virgin Islands | 0.027 | 0.000 | 0.745 | 0.000 | 0.000 | 0.007 | 0.202 | 0.018 |
United States Virgin Islands | 0.000 | 0.000 | 0.000 | 0.360 | 0.000 | 0.609 | 0.032 | 0.000 |
Vanuatu | 0.000 | 0.393 | 0.390 | 0.000 | 0.002 | 0.003 | 0.211 | 0.000 |
South Africa | 0.000 | 0.079 | 0.459 | 0.383 | 0.000 | 0.079 | 0.000 | 0.000 |
Zambia | 0.831 | 0.000 | 0.000 | 0.055 | 0.000 | 0.000 | 0.000 | 0.114 |
Zimbabwe | 0.000 | 0.187 | 0.359 | 0.199 | 0.105 | 0.064 | 0.018 | 0.069 |
Country | United States | United Kingdom | Türkiye | Germany | France | Canada | Australia | Italy |
---|---|---|---|---|---|---|---|---|
United Arab Emirates | 0.057 | 0.617 | 0.027 | 0.080 | 0.000 | 0.040 | 0.152 | 0.028 |
Bahrain | 0.017 | 0.806 | 0.083 | 0.022 | 0.022 | 0.000 | 0.048 | 0.000 |
Djibouti | 0.254 | 0.060 | 0.002 | 0.055 | 0.150 | 0.125 | 0.107 | 0.249 |
Algeria | 0.000 | 0.000 | 0.000 | 0.028 | 0.940 | 0.033 | 0.000 | 0.000 |
Egypt | 0.000 | 0.000 | 0.000 | 0.025 | 0.000 | 0.000 | 0.000 | 0.975 |
Iraq | 0.002 | 0.000 | 0.174 | 0.485 | 0.000 | 0.022 | 0.317 | 0.000 |
Jordan | 0.016 | 0.000 | 0.486 | 0.357 | 0.000 | 0.029 | 0.075 | 0.036 |
Kuwait | 0.004 | 0.885 | 0.061 | 0.000 | 0.009 | 0.000 | 0.042 | 0.000 |
Lebanon | 0.000 | 0.000 | 0.000 | 0.010 | 0.105 | 0.578 | 0.307 | 0.000 |
Libya | 0.000 | 0.000 | 0.319 | 0.151 | 0.009 | 0.144 | 0.000 | 0.377 |
Morocco | 0.041 | 0.000 | 0.018 | 0.092 | 0.749 | 0.101 | 0.000 | 0.000 |
Mauritania | 0.011 | 0.000 | 0.207 | 0.183 | 0.348 | 0.139 | 0.048 | 0.063 |
Oman | 0.000 | 0.393 | 0.000 | 0.000 | 0.000 | 0.000 | 0.607 | 0.000 |
Qatar | 0.065 | 0.829 | 0.040 | 0.053 | 0.000 | 0.000 | 0.000 | 0.013 |
Saudi Arabia | 0.150 | 0.229 | 0.012 | 0.000 | 0.000 | 0.092 | 0.518 | 0.000 |
Sudan | 0.000 | 0.000 | 0.000 | 0.061 | 0.531 | 0.180 | 0.000 | 0.228 |
Somalia | 0.057 | 0.000 | 0.454 | 0.000 | 0.066 | 0.374 | 0.047 | 0.001 |
Syria | 0.000 | 0.000 | 0.599 | 0.401 | 0.000 | 0.000 | 0.000 | 0.000 |
Chad | 0.026 | 0.038 | 0.667 | 0.049 | 0.051 | 0.065 | 0.000 | 0.105 |
Tunisia | 0.000 | 0.000 | 0.018 | 0.040 | 0.591 | 0.160 | 0.027 | 0.165 |
Yemen | 0.654 | 0.078 | 0.046 | 0.118 | 0.000 | 0.042 | 0.044 | 0.016 |
Country | United States | Morocco | Spain | Guadeloupe | United Kingdom | Italy | Algeria | Tanzania |
---|---|---|---|---|---|---|---|---|
Belgium | 0.000 | 0.608 | 0.269 | 0.000 | 0.000 | 0.124 | 0.000 | 0.000 |
Burkina Faso | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
Benin | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
Canada | 0.095 | 0.000 | 0.000 | 0.000 | 0.546 | 0.000 | 0.000 | 0.359 |
DR Congo | 0.000 | 0.000 | 0.000 | 0.000 | 0.862 | 0.138 | 0.000 | 0.000 |
Central African Republic | 0.000 | 0.140 | 0.000 | 0.539 | 0.046 | 0.016 | 0.260 | 0.000 |
Congo Republic | 0.000 | 0.027 | 0.000 | 0.973 | 0.000 | 0.000 | 0.000 | 0.000 |
Switzerland | 0.746 | 0.254 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Cote d’Ivoire | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
Cameroon | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Djibouti | 0.000 | 0.037 | 0.000 | 0.000 | 0.000 | 0.000 | 0.963 | 0.000 |
France | 0.000 | 0.000 | 0.470 | 0.000 | 0.000 | 0.530 | 0.000 | 0.000 |
Gabon | 0.000 | 0.000 | 0.000 | 0.705 | 0.000 | 0.000 | 0.295 | 0.000 |
Guinea | 0.000 | 0.383 | 0.000 | 0.000 | 0.000 | 0.000 | 0.617 | 0.000 |
Haiti | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Comoros | 0.000 | 0.343 | 0.000 | 0.083 | 0.000 | 0.000 | 0.574 | 0.000 |
Luxembourg | 0.002 | 0.104 | 0.025 | 0.309 | 0.168 | 0.392 | 0.000 | 0.000 |
Monaco | 0.000 | 0.025 | 0.000 | 0.000 | 0.000 | 0.975 | 0.000 | 0.000 |
Mali | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 |
New Caledonia | 0.000 | 0.000 | 0.064 | 0.859 | 0.000 | 0.000 | 0.078 | 0.000 |
Niger | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 |
French Polynesia | 0.000 | 0.000 | 0.585 | 0.415 | 0.000 | 0.000 | 0.000 | 0.000 |
Rwanda | 0.000 | 0.000 | 0.187 | 0.000 | 0.620 | 0.193 | 0.000 | 0.000 |
Senegal | 0.072 | 0.777 | 0.000 | 0.000 | 0.000 | 0.000 | 0.150 | 0.000 |
Chad | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 |
Togo | 0.000 | 0.000 | 0.000 | 0.276 | 0.035 | 0.000 | 0.170 | 0.519 |