Introduction

DialectId aims to develop a set of algorithms that detect the dialect of a given text. For example, given a text in Spanish, DialectId predicts the Spanish-speaking country from which the text comes.

DialectId is available for Arabic (ar), German (de), English (en), Spanish (es), French (fr), Dutch (nl), Portuguese (pt), Russian (ru), Turkish (tr), and Chinese (zh).

Installing using conda

DialectId can be installed using the conda package manager with the following instruction.

conda install --channel conda-forge dialectid
Installing using pip

A more general approach to installing DialectId is through the pip command, as illustrated in the following instruction.

pip install dialectid
Dialect Identification

DialectId can be used to predict the dialect of a list of texts with the method predict, as seen in the following lines. The first line imports the DialectId class, the second instantiates the class for the Spanish language, and the third predicts the dialect of two utterances. The first utterance is an expression that would be common in Mexico, and the second is an expression that could be associated with Argentina, Uruguay, Chile, and other South American countries.

from dialectid import DialectId
detect = DialectId(lang='es')
detect.predict(['comiendo unos tacos',
                'acompañando el asado con un buen vino'])
array(['mx', 'uy'], dtype='<U2')
Countries

The available dialects for each language are listed in the attribute countries, as seen in the following snippet for Spanish.

from dialectid import DialectId
detect = DialectId(lang='es')
detect.countries
array(['ar', 'bo', 'cl', 'co', 'cr', 'cu', 'do', 'ec', 'es', 'gq', 'gt',
       'hn', 'mx', 'ni', 'pa', 'pe', 'pr', 'py', 'sv', 'uy', 've'],
      dtype='<U2')
Decision Function

One might be interested in all the countries from which the speaker could come. To facilitate this, one can use the decision_function method. DialectId uses linear Support Vector Machines (SVM) as classifiers; consequently, positive values of the decision function are interpreted as belonging to the positive class, i.e., a particular country. The following code exemplifies this idea: the first two lines import and instantiate the DialectId class in Spanish. The third line computes the decision-function values; the method returns a two-dimensional array whose first dimension corresponds to the number of texts, so indexing with [0] keeps only the values for the single text, where positive values indicate the presence of the corresponding country. The fourth line sorts the indices so that the highest value comes first. The fifth line retrieves each country and its associated decision-function value, keeping only the countries with positive values.

from dialectid import DialectId
detect = DialectId(lang='es')
df = detect.decision_function(['acompañando el asado con un buen vino'])[0]
index = df.argsort()[::-1]
[(detect.countries[i], df[i]) for i in index
 if df[i] > 0]
[(np.str_('uy'), np.float32(1.5416805)),
 (np.str_('py'), np.float32(1.3321806)),
 (np.str_('ar'), np.float32(1.2182581))]
Probability

In some situations, one is interested in the probability instead of the decision-function values of a linear SVM. The probability can be computed using the predict_proba method. The following code exemplifies this idea: the first line imports the DialectId class as in the previous examples. The second line differs from the previous example in that the parameter probability is set to True. The remaining lines are almost equivalent to the previous example.

from dialectid import DialectId
detect = DialectId(lang='es', probability=True)
prob = detect.predict_proba(['acompañando el asado con un buen vino'])[0]
index = prob.argsort()[::-1]
[(detect.countries[i], prob[i])
 for i in index[:4]]
[(np.str_('uy'), np.float32(0.45955184)),
 (np.str_('ar'), np.float32(0.353442)),
 (np.str_('py'), np.float32(0.18695451)),
 (np.str_('cl'), np.float32(2.8124754e-05))]
Figure 1: Number of tweets in the collection for the Arabic-speaking countries.
Table 1: Number of tweets in the training and test sets for the Arabic-speaking countries.
Country train test train (orig. dist.) test (orig. dist.) Corpus
Saudi Arabia 119009 4096 1139707 1101214 61578662
Egypt 119122 4096 271439 287583 14665935
Kuwait 119117 4096 188944 187432 10208696
United Arab Emirates 119600 4096 115345 105957 6232153
Oman 119771 4096 70484 70730 3808309
Iraq 119655 4096 50912 63215 2750834
Qatar 119362 4096 48860 46962 2639967
Bahrain 119666 4096 45196 38131 2441971
Lebanon 119370 4096 35812 30455 1934983
Jordan 119718 4096 34619 33242 1870514
Libya 119659 4096 31495 29417 1701716
Yemen 119823 4096 16917 33165 914053
Algeria 119143 4096 16609 18617 897394
Morocco 119556 4096 9600 16093 518739
Sudan 120078 4096 7662 16291 413993
Tunisia 119244 4096 6405 7435 346082
Syria 119159 4093 5768 9596 311660
Mauritania 41017 1809 844 760 45624
Somalia 17410 561 355 234 19215
Chad 4797 706 105 295 5706
Djibouti 2873 309 63 152 3420
Sum 2097149 73014 2097141 2096976 113309626
Figure 2: Number of tweets in the collection for the German-speaking countries.
Table 2: Number of tweets in the training and test sets for the German-speaking countries.
Country train test train (orig. dist.) test (orig. dist.) Corpus
Germany 83023 4096 80620 1110160 1262931
Austria 7004 4096 7004 100180 109718
Switzerland 4573 4096 4547 64578 71231
Sum 94600 12288 92171 1274918 1443880
Figure 3: Number of tweets in the collection for the English-speaking countries.
Table 3: Number of tweets in the training and test sets for the English-speaking countries.
Country train test train (orig. dist.) test (orig. dist.) Corpus
United States 36492 4096 1411215 1269614 1241245184
United Kingdom 36417 4096 284076 304237 249861873
Canada 36338 4096 79421 81213 69855356
India 36348 4096 76678 128000 67442862
Nigeria 36199 4096 43566 73111 38319549
South Africa 36323 4096 42569 43805 37442472
Australia 36373 4096 38466 46482 33833515
Philippines 36599 4096 36427 21966 32039887
Ireland 36352 4096 20796 24196 18291944
Kenya 36231 4096 10383 20803 9132974
Pakistan 36376 4096 9236 16466 8124273
Ghana 36395 4096 8702 14811 7654670
New Zealand 36361 4096 6610 8397 5813959
Singapore 36048 4096 5608 4189 4933008
Uganda 36511 4096 4662 15003 4100771
Jamaica 36332 4096 3185 3332 2801604
Zimbabwe 36237 4096 1809 3387 1591827
Trinidad and Tobago 36459 4096 1725 1968 1517980
Zambia 36686 4096 1468 2464 1291544
Namibia 36553 4096 1268 1752 1115587
Bahamas 36223 4096 1265 1110 1113202
Barbados 36478 4096 868 766 764085
Malawi 36373 4096 753 1944 662789
Rwanda 36374 4096 496 946 436529
Cameroon 36461 4096 416 785 365902
Malta 36405 4096 398 560 350352
Antigua and Barbuda 36526 4096 356 347 313582
Guam 36494 3008 351 101 309229
St. Lucia 36223 4096 313 235 275897
Eswatini 36408 4096 268 354 236190
Mauritius 36306 4096 263 211 231391
Bermuda 36319 4096 259 299 227865
Isle of Man 36220 1495 248 50 218569
Lesotho 35926 4096 241 491 212309
Cayman Islands 36161 4096 204 191 180023
Gambia 36296 4096 204 516 179764
Gibraltar 36216 4096 193 224 170041
Sierra Leone 36278 4096 183 532 161814
Turks and Caicos Islands 36277 3064 179 106 158077
Sudan 36460 4096 165 177 145226
St. Vincent and the Grenadines 36324 4096 160 209 140768
Belize 36538 4096 154 211 136040
Liberia 36247 4096 136 389 120223
Grenada 36573 2761 134 97 118559
British Virgin Islands 36276 1650 126 57 111011
Guyana 36654 4096 106 193 93531
St. Kitts and Nevis 36601 3652 106 125 93321
United States Virgin Islands 36592 219 103 7 90888
Northern Mariana Islands 36550 617 100 21 88606
Papua New Guinea 36038 3904 89 136 78435
Puerto Rico 36594 3164 74 113 65874
Dominica 36452 1140 63 38 55815
Sint Maarten 36311 1745 57 59 50880
Fiji 36538 1934 53 65 47474
Guernsey 24068 1790 31 62 27863
Tonga 25728 901 31 31 27690
Anguilla 23965 1250 30 42 26826
Vanuatu 13985 767 17 26 15601
Falkland Islands 11917 412 14 13 13158
Micronesia, Fed. Sts. 7497 266 10 9 9306
Cook Islands 8053 274 10 9 8961
Solomon Islands 8141 458 10 15 9479
Palau 6556 691 8 23 7491
St. Helena 2876 974 5 32 4458
Sum 2097128 204072 2097120 2097123 1844565933
Figure 4: Number of tweets in the collection for the Spanish-speaking countries.
Table 4: Number of tweets in the training and test sets for the Spanish-speaking countries.
Country train test train (orig. dist.) test (orig. dist.) Corpus
Argentina 104466 4096 536137 415844 156910687
Spain 103924 4096 401933 421172 117633385
Mexico 104318 4096 353432 388367 103438764
Colombia 104267 4096 204831 261796 59947766
Chile 104027 4096 156319 143770 45749886
Venezuela 104496 4096 109073 88022 31922346
Uruguay 103733 4096 66209 60004 19377563
Ecuador 104408 4096 53037 68303 15522286
Peru 103907 4096 52144 59480 15261118
Paraguay 104617 4096 33486 37244 9800404
Dominican Republic 104000 4096 30142 36468 8821881
Panama 104014 4096 25525 27081 7470575
Costa Rica 104415 4096 19730 16252 5774617
Guatemala 103800 4096 17401 22567 5092733
El Salvador 104111 4096 10990 12949 3216498
Honduras 104020 4096 8660 14988 2534698
Nicaragua 104438 4096 8435 6951 2468938
Bolivia 103537 4096 4913 6523 1438141
Cuba 104570 4096 3359 8783 983104
Puerto Rico 103994 1487 1320 149 386595
Equatorial Guinea 14090 4096 64 429 18783
Sum 2097152 83407 2097140 2097142 613770768
Figure 5: Number of tweets in the collection for the French-speaking countries.
Table 5: Number of tweets in the training and test sets for the French-speaking countries.
Country train test train (orig. dist.) test (orig. dist.) Corpus
France 215940 4096 1488423 1475608 9241569
Canada 215745 4096 122173 134958 758570
DR Congo 214853 4096 93877 100146 582879
Cameroon 215520 4096 92835 97799 576414
Belgium 214731 4096 65710 72143 407994
Senegal 216403 4096 53106 47977 329734
Cote d’Ivoire 215555 4096 52423 44793 325496
Switzerland 116815 4096 24217 19484 150366
Guinea 90619 4096 19163 16753 118984
Benin 63347 4096 14310 14802 88856
Mali 54495 4096 12232 12809 75952
Togo 44077 4096 9698 10088 60220
Burkina Faso 28514 4096 6957 7326 43197
Gabon 27051 4096 5986 6156 37173
Haiti 25939 4096 5909 5850 36694
Niger 27468 4096 5900 5742 36638
Congo Republic 26441 4096 5584 5094 34676
Chad 15160 4096 3452 4008 21439
Monaco 13037 4096 2901 3029 18014
Luxembourg 11358 4096 2820 3506 17511
Central African Republic 13445 2122 2551 1453 15840
New Caledonia 7150 1715 1486 1222 9230
French Polynesia 6408 2304 1459 1610 9065
Djibouti 6331 2237 1429 1577 8873
Comoros 6025 1736 1273 1197 7908
Rwanda 4695 2749 1263 2010 7848
Sum 2097122 94783 2097137 2097140 13021140
Figure 6: Number of tweets in the collection for the Dutch-speaking countries.
Table 6: Number of tweets in the training and test sets for the Dutch-speaking countries.
Country train test train (orig. dist.) test (orig. dist.) Corpus
Netherlands 102887 4096 102887 947365 1092054
Belgium 16880 4096 15640 144451 166015
Sum 119767 8192 118527 1091816 1258069
Figure 7: Number of tweets in the collection for the Portuguese-speaking countries.
Table 7: Number of tweets in the training and test sets for the Portuguese-speaking countries.
Country train test train (orig. dist.) test (orig. dist.) Corpus
Brazil 978254 4096 2049142 2046379 79748620
Portugal 979061 4096 43008 45744 1673795
Mozambique 78571 4096 2656 2366 103393
Angola 53243 4096 2071 2400 80610
Cabo Verde 8022 2434 273 260 10627
Sum 2097151 18818 2097150 2097149 81617045
Figure 8: Number of tweets in the collection for the Russian-speaking countries.
Table 8: Number of tweets in the training and test sets for the Russian-speaking countries.
Country train test train (orig. dist.) test (orig. dist.) Corpus
Russia 962191 4096 1887634 311120 11487343
Belarus 652818 4096 114108 25498 694419
Kazakhstan 336189 4096 67909 35634 413267
Kyrgyz Republic 145942 4096 27499 18562 167350
Sum 2097140 16384 2097150 390814 12762379
Figure 9: Number of tweets in the collection for the Turkish-speaking countries.
Table 9: Number of tweets in the training and test sets for the Turkish-speaking countries.
Country train test train (orig. dist.) test (orig. dist.) Corpus
Türkiye 550662 4096 550664 184575 756902
Cyprus 3085 924 2936 924 4036
Sum 553747 5020 553600 185499 760938
Figure 10: Number of tweets in the collection for the Chinese-speaking countries.
Table 10: Number of tweets in the training and test sets for the Chinese-speaking countries.
Country train test train (orig. dist.) test (orig. dist.) Corpus
China 206312 4096 206313 175991 510998
Taiwan 115165 4096 100741 110266 249518
Hong Kong 13944 4096 10315 9315 25549
Singapore 4452 3963 4130 4623 10230
Sum 339873 16251 321499 300195 796295
Description

The dataset used to create the self-supervised problems is a collection of tweets gathered from the Twitter open stream over several years: the Spanish collection started on December 11, 2015; English on July 1, 2016; Arabic on January 25, 2017; Russian on October 16, 2018; and the rest of the languages on June 1, 2021. In all cases, the last day collected was June 9, 2023. The collected tweets were filtered with the following restrictions: retweets were removed; URLs and usernames were replaced by the tokens _url and _usr, respectively; and only tweets with at least 50 characters were included in the final collection. The column Corpus in Table 1 and Figure 1 show the number of tweets collected for the Arabic-speaking countries. The figure shows that there are days when more tweets are collected and a tendency to collect fewer tweets in 2023 due to changes in the Twitter API. The data corresponding to German, English, Spanish, French, Dutch, Portuguese, Russian, Turkish, and Chinese are shown in Table 2, Table 3, Table 4, Table 5, Table 6, Table 7, Table 8, Table 9, and Table 10; and Figure 2, Figure 3, Figure 4, Figure 5, Figure 6, Figure 7, Figure 8, Figure 9, and Figure 10.

The corpora are used to create two pairs of training and test sets. The training sets are drawn from tweets published before October 1, 2022, and the test sets are taken from tweets published on or after October 3, 2022. The procedure for creating each pair consists of two stages. In the first stage, the tweets were organized by country and then sampled to form a uniform distribution by day. Within each day, near duplicates were removed; then, a three-day sliding window was used to remove near duplicates within the window. The final step was to shuffle the data to remove the ordering by date, respecting the boundary between the training and test sets.
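The three-day near-duplicate filter can be sketched as follows; this is only an illustration of the procedure described above, and the function is_near_duplicate is a hypothetical placeholder for the actual near-duplicate criterion.

from collections import deque

def sliding_window_dedup(days, is_near_duplicate, window=3):
    # days is a list of lists of tweets, one list per day, already sampled
    # uniformly by day within each country.
    kept = []
    recent = deque(maxlen=window - 1)   # tweets kept in the previous window-1 days
    for tweets in days:
        kept_today = []
        previous = [t for day in recent for t in day]
        for tweet in tweets:
            # compare against the current day and the previous days in the window
            if not any(is_near_duplicate(tweet, other)
                       for other in previous + kept_today):
                kept_today.append(tweet)
        kept.extend(kept_today)
        recent.append(kept_today)
    return kept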

The tweets of the first pair were selected to follow a uniform distribution by country as closely as possible. In this pair, the size of the training set is roughly 2 million tweets, whereas the test set size is \(2^{12}\) (4,096) tweets per country. We also produced a smaller training set containing 262 thousand tweets, following an equivalent procedure that again aims at a uniform distribution over the countries. The column labeled train in Table 1, Table 2, Table 3, Table 4, Table 5, Table 6, Table 7, Table 8, Table 9, and Table 10 shows the size of the training set, and the column test indicates the size of the test set in the first pair. It is worth mentioning that we did not have enough data for all countries and languages to follow an exactly uniform distribution. For example, in Table 4 (Spanish), the 1,487 tweets in the test set for Puerto Rico (pr) correspond to the total number of available tweets that meet the imposed restrictions.

The second pair was selected to follow the original distribution of the corpus; in this case, the training and test sets each have a maximum size of 2 million (\(2^{21}\)) tweets. The selection was posed as a convex optimization problem whose objective is to maximize the number of selected tweets, subject to the \(2^{21}\) budget, the availability of tweets for each country, and the constraint that the selection follows the distribution of all the available tweets. The column labeled train (orig. dist.) in Table 1, Table 2, Table 3, Table 4, Table 5, Table 6, Table 7, Table 8, Table 9, and Table 10 shows the size of the training set, and the column test (orig. dist.) indicates the size of the test set in the second pair.
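A minimal sketch of such a selection problem, assuming the cvxpy package; the availability numbers are placeholders, and the formulation only illustrates the constraints described above rather than reproducing the exact problem that was solved.

import numpy as np
import cvxpy as cp

available = np.array([1_500_000, 900_000, 600_000])   # placeholder: tweets available per country
proportion = available / available.sum()              # distribution given by all the available tweets
budget = 2**21                                        # at most 2 million tweets in total

scale = cp.Variable(nonneg=True)                      # how much of the corpus can be kept
selected = scale * proportion                         # tweets selected per country
problem = cp.Problem(cp.Maximize(cp.sum(selected)),
                     [cp.sum(selected) <= budget,     # total budget
                      selected <= available])         # availability per country
problem.solve()
np.round(selected.value)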

DialectId is a text classifier based on a Bag-of-Words (BoW) representation combined with a linear Support Vector Machine (SVM).

The normalization procedure used in the BoW corresponds to setting all characters to lowercase, removing diacritics, and replacing usernames and URLs with the tags “_usr” and “_url”, respectively.
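The following is a minimal sketch of that normalization using only the standard library; the exact implementation inside DialectId may differ in details such as the URL and username patterns.

import re
import unicodedata

def normalize(text):
    # lowercase, replace URLs and usernames, and remove diacritics
    text = text.lower()
    text = re.sub(r'https?://\S+', '_url', text)
    text = re.sub(r'@\w+', '_usr', text)
    text = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in text if not unicodedata.combining(c))

normalize('Buenos días @usuario https://example.com')
'buenos dias _usr _url'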

The BoW representation weights the tokens with the term frequency-inverse document frequency (TF-IDF) scheme. The tokens correspond to words, bi-grams of words, and character q-grams (with q = 2, 3, 4). The tokens and weights were estimated using each language's training set (2 million tweets). The tokens (vocabulary) with the highest frequency in the training set were kept. We developed systems for different vocabulary sizes, namely \(2^{17}\), \(2^{18}\), and \(2^{19}\).
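To give an idea of what these tokens look like, the following sketch lists the words, word bi-grams, and character q-grams of a short text; it is only illustrative and not the tokenizer used by the library.

def tokens(text, qs=(2, 3, 4)):
    words = text.split()
    bigrams = [' '.join(pair) for pair in zip(words, words[1:])]
    qgrams = [text[i:i + q] for q in qs for i in range(len(text) - q + 1)]
    return words + bigrams + qgrams

tokens('good morning')[:8]
['good', 'morning', 'good morning', 'go', 'oo', 'od', 'd ', ' m']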

The BoW can be used by importing the BoW class, as seen in the following example, where the text good morning is transformed into the vector space. The first line imports the class, the second instantiates it, where the parameter token_max_filter indicates the vocabulary size, and the third converts the text into the vector space.

from dialectid import BoW
bow = BoW(lang='en', token_max_filter=2**18)
bow.transform(['good morning'])
<Compressed Sparse Row sparse matrix of dtype 'float32'
    with 36 stored elements and shape (1, 262144)>

Each text in the training set is represented in the vector space, and the associated country is used as the label of a linear SVM trained with the one-vs-all strategy. This strategy creates as many binary classification problems as there are classes: each class is the positive class exactly once and the negative class in the remaining problems. Traditionally, one uses all the available examples, which is the case for the reduced training set (262 thousand tweets). For the whole training set, however, the negative examples were limited to the maximum of the number of positive examples and \(2^{14}\) tweets. In both cases, the examples are weighted inversely proportional to the class frequencies to handle the class imbalance.
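A minimal sketch of this one-vs-all scheme, assuming scikit-learn; X stands for the BoW matrix and y for the country labels (as a NumPy array), and the negative-sampling step is a simplification of the procedure described above.

import numpy as np
from sklearn.svm import LinearSVC

def one_vs_all(X, y, max_neg=2**14, seed=0):
    rng = np.random.default_rng(seed)
    classifiers = {}
    for country in np.unique(y):
        pos = np.flatnonzero(y == country)
        neg = np.flatnonzero(y != country)
        limit = max(len(pos), max_neg)      # cap on the negative examples
        if len(neg) > limit:
            neg = rng.choice(neg, size=limit, replace=False)
        idx = np.concatenate([pos, neg])
        labels = (y[idx] == country).astype(int)
        # class_weight='balanced' weights examples inversely to class frequency
        classifiers[country] = LinearSVC(class_weight='balanced').fit(X[idx], labels)
    return classifiers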

Complementing the BoW example, the following code instantiates DialectId in Spanish with a vocabulary size of \(2^{18}\), indicated by the parameters lang and token_max_filter, respectively.

from dialectid import DialectId
detect = DialectId(lang='es', token_max_filter=2**18)
detect.predict(['comiendo unos tacos'])
array(['mx'], dtype='<U2')

A drawback of the SVM is that it does not estimate the classification probability. For some applications, it is more convenient to work with probabilities than with decision-function values. Thus, the developed systems are calibrated to estimate probabilities by training a logistic regression that takes the SVM's decision function as input. The calibration procedure predicts the SVM's decision function on the reduced training set using stratified k-fold cross-validation (k = 3). The predicted decision-function values are the inputs of the logistic regression, the targets are the classes of the reduced training set, and the examples are again weighted inversely proportional to the class frequencies. To invoke the model with probabilities, the parameter probability must be set to True, as shown in the following example.

from dialectid import DialectId
detect = DialectId(lang='es', probability=True)
detect.predict_proba(['comiendo unos tacos'])
array([[1.9695617e-06, 4.8897579e-07, 1.8095196e-05, 3.6204598e-05,
        1.5155481e-03, 1.1557303e-05, 5.4983921e-06, 2.8960928e-06,
        2.2128135e-05, 5.0534654e-05, 1.7656431e-01, 2.0665383e-02,
        6.3909459e-01, 1.3323617e-01, 1.0678391e-04, 7.2897665e-06,
        2.6716828e-02, 6.4306505e-08, 1.9328916e-03, 8.0165564e-06,
        2.7745039e-06]], dtype=float32)
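For reference, the calibration step described above can be sketched with scikit-learn as follows; the data and the SVM below are toy stand-ins, not the models shipped with DialectId.

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))            # stand-in for the BoW vectors
y = rng.integers(0, 3, size=300)          # stand-in for the country labels

svm = LinearSVC(class_weight='balanced', max_iter=5000)
# out-of-fold decision-function values with stratified k-fold (k = 3)
df = cross_val_predict(svm, X, y,
                       cv=StratifiedKFold(n_splits=3),
                       method='decision_function')
# logistic regression on the decision functions, with balanced class weights
calibrator = LogisticRegression(class_weight='balanced').fit(df, y)
calibrator.predict_proba(df[:1])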

As described previously, there are two training sets: one that follows a uniform distribution over the countries as closely as possible, and a second that follows the distribution seen in the corpus, namely the original distribution (identified as orig. dist.). The parameter uniform_distribution indicates which training set is used to estimate the parameters. By default, it is set to True, so the training set with a uniform distribution over the countries is used.

from dialectid import DialectId
detect = DialectId(lang='es',
                   uniform_distribution=False,
                   probability=True)
detect.predict_proba(['comiendo unos tacos'])
array([[5.8839246e-06, 4.0277046e-06, 1.5494716e-05, 3.9904633e-05,
        1.5577762e-02, 6.6911220e-05, 9.4599171e-05, 2.6315629e-05,
        1.4272113e-05, 4.7489698e-06, 5.5686768e-02, 2.2258271e-02,
        8.8939697e-01, 1.3783798e-02, 3.9202135e-04, 8.4752110e-06,
        4.6369270e-05, 9.1043790e-07, 2.5635022e-03, 5.8783221e-06,
        7.1929417e-06]], dtype=float32)
Table 11: Performance of the different algorithms and languages.
Language Spanish English Arabic German French Dutch Portuguese Russian Turkish Chinese
DialectId[19] 0.3936 0.2992 0.4080 0.5418 0.3647 0.7708 0.6053 0.4922 0.5423 0.7012
DialectId[18] 0.3886 0.2960 0.4036 0.5426 0.3607 0.7734 0.6049 0.4885 0.5450 0.7032
DialectId[17] 0.3818 0.2917 0.3980 0.5426 0.3556 0.7742 0.6026 0.4852 0.5477 0.7026
DialectId[19] (prob) 0.3898 0.2796 0.4050 0.5482 0.3632 0.7708 0.6190 0.4837 0.5423 0.7213
DialectId[18] (prob) 0.3854 0.2779 0.4011 0.5548 0.3595 0.7734 0.6170 0.4769 0.5450 0.7213
DialectId[17] (prob) 0.3779 0.2743 0.3953 0.5519 0.3545 0.7742 0.6151 0.4793 0.5477 0.7187
DialectId[19] (262k) 0.3528 0.2828 0.3803 0.4883 0.3587 0.7701 0.6136 0.4604 0.5500 0.6723
DialectId[18] (262k) 0.3476 0.2787 0.3767 0.4945 0.3540 0.7731 0.6091 0.4590 0.5532 0.6732
DialectId[17] (262k) 0.3412 0.2726 0.3700 0.5002 0.3476 0.7731 0.6041 0.4598 0.5562 0.6740
StackBoW (262k) 0.3331 0.2441 0.3673 0.4893 0.3339 0.7823 0.6000 0.4468 0.5649 0.6859
DialectId[19] (Orig. Dist.) 0.3263 0.1262 0.3207 0.5408 0.3025 0.7667 0.4199 0.4477 0.5401 0.6924
Figure 11: Performance of the different algorithms and languages.
Table 12: Performance of the different algorithms and languages on the original distribution.
Language Spanish English Arabic German French Dutch Portuguese Russian Turkish Chinese
DialectId[19] 0.3939 0.2973 0.4041 0.5398 0.3625 0.7661 0.5977 0.4823 0.5421 0.7011
DialectId[18] 0.3879 0.2922 0.3982 0.5413 0.3590 0.7689 0.5980 0.4797 0.5450 0.7030
DialectId[17] 0.3808 0.2871 0.3942 0.5427 0.3543 0.7711 0.5979 0.4780 0.5473 0.7028
DialectId[19] (prob) 0.3899 0.2744 0.4022 0.5443 0.3612 0.7661 0.6141 0.4731 0.5421 0.7229
DialectId[18] (prob) 0.3849 0.2721 0.3972 0.5506 0.3579 0.7689 0.6175 0.4698 0.5450 0.7221
DialectId[17] (prob) 0.3769 0.2680 0.3924 0.5496 0.3527 0.7711 0.6116 0.4701 0.5473 0.7190
DialectId[19] (Orig. Dist.) 0.3264 0.1237 0.3156 0.5389 0.3017 0.7620 0.4207 0.4436 0.5401 0.6922
Figure 12: Distributions of Arabic-speaking countries.
Figure 13: Distributions of German-speaking countries.
Figure 14: Distributions of English-speaking countries.
Figure 15: Distributions of Spanish-speaking countries.
Figure 16: Distributions of French-speaking countries.
Figure 17: Distributions of Dutch-speaking countries.
Figure 18: Distributions of Portuguese-speaking countries.
Figure 19: Distributions of Russian-speaking countries.
Figure 20: Distributions of Turkish-speaking countries.
Figure 21: Distributions of Chinese-speaking countries.
Performance

The performance of the different algorithms is presented in Figure 11 using macro-recall. The best-performing system in almost all cases is DialectId trained on 2 million tweets with a vocabulary of \(2^{19}\) tokens. The exceptions are Turkish and Dutch, where the best system is StackBoW trained with only 262 thousand tweets.

The remaining figures provide details on the macro-recall by presenting the system's recall for each country.
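As a reference for how these numbers could be reproduced, the following sketch computes the macro-recall with scikit-learn; the two texts and gold labels are placeholders rather than the actual test set.

from sklearn.metrics import recall_score
from dialectid import DialectId

detect = DialectId(lang='es')
texts = ['comiendo unos tacos',
         'acompañando el asado con un buen vino']   # placeholder test texts
gold = ['mx', 'uy']                                  # placeholder gold labels
predictions = detect.predict(texts)
# macro-recall averages the per-country recall with equal weight
recall_score(gold, predictions, average='macro')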

Figure 22: Distribution estimated with DialectId on the Tweets whose geographic information is the United States.
Figure 23: Distribution estimated with DialectId on the Tweets whose geographic information is Brazil.
Figure 24: Distribution estimated with DialectId on the Tweets whose geographic information is Great Britain.
Figure 25: Distribution estimated with DialectId on the Tweets whose geographic information is Italy.
Figure 26: Distribution estimated with DialectId on the Tweets whose geographic information is France.
Figure 27: Distribution estimated with DialectId on the Tweets whose geographic information is Canada.
Figure 28: Distribution estimated with DialectId on the Tweets whose geographic information is Germany.
Figure 29: Distribution estimated with DialectId on the Tweets whose geographic information is Portugal.
Description
Table 13: Probability of the origin of Tweets in different non-Spanish-speaking countries.
Country United States Brazil United Kingdom Italy France Canada Germany Portugal
Argentina 0.000 0.523 0.000 0.445 0.000 0.000 0.032 0.000
Bolivia 0.000 0.640 0.132 0.227 0.000 0.000 0.001 0.000
Chile 0.000 0.114 0.162 0.091 0.015 0.283 0.336 0.000
Colombia 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000
Costa Rica 0.002 0.244 0.000 0.000 0.000 0.151 0.602 0.000
Cuba 0.000 0.361 0.051 0.312 0.000 0.076 0.096 0.104
Dominican Republic 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Ecuador 0.010 0.112 0.000 0.098 0.000 0.518 0.070 0.192
Spain 0.000 0.000 0.325 0.000 0.035 0.000 0.000 0.640
Equatorial Guinea 0.000 0.122 0.084 0.000 0.593 0.000 0.000 0.200
Guatemala 0.001 0.000 0.000 0.000 0.021 0.863 0.115 0.000
Honduras 0.890 0.000 0.000 0.030 0.000 0.055 0.000 0.026
Mexico 0.260 0.032 0.000 0.000 0.000 0.707 0.000 0.000
Nicaragua 0.790 0.000 0.000 0.001 0.000 0.176 0.033 0.000
Panama 0.854 0.000 0.000 0.028 0.000 0.032 0.069 0.017
Peru 0.000 0.038 0.000 0.944 0.000 0.018 0.000 0.000
Puerto Rico 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Paraguay 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000
El Salvador 0.665 0.000 0.000 0.000 0.000 0.335 0.000 0.000
Uruguay 0.000 0.989 0.000 0.000 0.000 0.000 0.011 0.000
Venezuela 0.042 0.000 0.000 0.164 0.000 0.455 0.000 0.339
Figure 30: Distribution estimated with DialectId on the Tweets whose geographic information is Malaysia.
Figure 31: Distribution estimated with DialectId on the Tweets whose geographic information is Indonesia.
Figure 32: Distribution estimated with DialectId on the Tweets whose geographic information is Brazil.
Figure 33: Distribution estimated with DialectId on the Tweets whose geographic information is Germany.
Figure 34: Distribution estimated with DialectId on the Tweets whose geographic information is Spain.
Figure 35: Distribution estimated with DialectId on the Tweets whose geographic information is France.
Figure 36: Distribution estimated with DialectId on the Tweets whose geographic information is Italy.
Figure 37: Distribution estimated with DialectId on the Tweets whose geographic information is the United Arab Emirates.
Description
Table 14: Probability of the origin of Tweets in different non-English-speaking countries.
Country Malaysia Indonesia Brazil Germany Spain France Italy United Arab Emirates
Antigua and Barbuda 0.044 0.000 0.545 0.000 0.412 0.000 0.000 0.000
Anguilla 0.000 0.001 0.000 0.076 0.000 0.657 0.037 0.230
Australia 0.000 0.000 0.000 0.253 0.510 0.236 0.000 0.000
Barbados 0.000 0.000 0.958 0.000 0.034 0.009 0.000 0.000
Bermuda 0.000 0.000 0.028 0.410 0.070 0.374 0.119 0.000
Bahamas 0.000 0.000 0.875 0.000 0.000 0.000 0.046 0.079
Belize 0.000 0.000 0.239 0.000 0.695 0.063 0.003 0.000
Canada 0.000 0.000 0.000 0.934 0.000 0.066 0.000 0.000
Cook Islands 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000
Cameroon 0.000 0.000 0.091 0.000 0.000 0.907 0.000 0.002
Dominica 0.876 0.000 0.107 0.018 0.000 0.000 0.000 0.000
Fiji 0.117 0.563 0.034 0.000 0.000 0.000 0.286 0.000
Falkland Islands 0.000 0.000 0.000 0.284 0.323 0.394 0.000 0.000
Micronesia, Fed. Sts. 0.989 0.011 0.000 0.000 0.000 0.000 0.000 0.000
United Kingdom 0.000 0.000 0.000 0.046 0.887 0.067 0.000 0.000
Grenada 0.966 0.034 0.000 0.000 0.000 0.000 0.000 0.000
Guernsey 0.000 0.000 0.000 0.088 0.052 0.583 0.278 0.000
Ghana 0.000 0.161 0.119 0.000 0.124 0.000 0.000 0.595
Gibraltar 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000
Gambia 0.726 0.139 0.022 0.000 0.000 0.000 0.000 0.114
Guam 0.691 0.000 0.229 0.000 0.000 0.000 0.080 0.000
Guyana 0.000 0.000 0.958 0.000 0.000 0.016 0.026 0.000
Ireland 0.000 0.000 0.000 0.000 0.825 0.175 0.000 0.000
Isle of Man 0.000 0.000 0.000 0.066 0.860 0.074 0.000 0.000
India 0.000 0.032 0.000 0.000 0.000 0.000 0.000 0.968
Jamaica 0.009 0.112 0.047 0.832 0.000 0.000 0.000 0.000
Kenya 0.000 0.109 0.007 0.000 0.000 0.000 0.000 0.884
St. Kitts and Nevis 0.000 0.000 0.673 0.000 0.022 0.032 0.139 0.135
Cayman Islands 0.000 0.000 0.063 0.200 0.000 0.020 0.716 0.000
St. Lucia 0.054 0.000 0.166 0.000 0.780 0.000 0.000 0.000
Liberia 0.003 0.017 0.756 0.000 0.042 0.000 0.132 0.050
Lesotho 0.902 0.000 0.098 0.000 0.000 0.000 0.000 0.000
Northern Mariana Islands 0.757 0.000 0.213 0.000 0.000 0.000 0.030 0.000
Malta 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000
Mauritius 0.000 0.000 0.291 0.127 0.029 0.411 0.000 0.142
Malawi 0.405 0.000 0.429 0.058 0.108 0.000 0.000 0.000
Namibia 0.999 0.000 0.001 0.000 0.000 0.000 0.000 0.000
Nigeria 0.000 0.446 0.000 0.000 0.000 0.000 0.000 0.554
New Zealand 0.000 0.000 0.067 0.637 0.072 0.224 0.000 0.000
Papua New Guinea 0.694 0.243 0.000 0.063 0.000 0.000 0.000 0.000
Philippines 0.000 0.753 0.052 0.195 0.000 0.000 0.000 0.000
Pakistan 0.000 0.001 0.000 0.000 0.000 0.000 0.000 0.999
Puerto Rico 0.000 0.000 0.788 0.000 0.212 0.000 0.000 0.000
Palau 0.760 0.021 0.000 0.152 0.000 0.000 0.030 0.037
Rwanda 0.000 0.009 0.123 0.299 0.094 0.178 0.173 0.123
Solomon Islands 0.048 0.519 0.015 0.000 0.308 0.000 0.109 0.000
Sudan 0.000 0.000 0.031 0.000 0.000 0.000 0.000 0.969
Singapore 0.898 0.102 0.000 0.000 0.000 0.000 0.000 0.000
St. Helena 0.000 0.000 0.095 0.093 0.164 0.205 0.443 0.000
Sierra Leone 0.062 0.094 0.291 0.176 0.000 0.000 0.015 0.362
Sint Maarten 0.000 0.000 0.137 0.000 0.000 0.863 0.000 0.000
Eswatini 0.794 0.031 0.150 0.000 0.000 0.000 0.025 0.000
Turks and Caicos Islands 0.000 0.022 0.926 0.000 0.000 0.000 0.005 0.047
Tonga 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000
Trinidad and Tobago 0.013 0.000 0.903 0.000 0.084 0.000 0.000 0.000
Uganda 0.000 0.119 0.000 0.056 0.021 0.000 0.079 0.725
United States 0.000 0.000 0.088 0.912 0.000 0.000 0.000 0.000
St. Vincent and the Grenadines 0.932 0.000 0.068 0.000 0.000 0.000 0.000 0.000
British Virgin Islands 0.027 0.000 0.745 0.000 0.000 0.007 0.202 0.018
United States Virgin Islands 0.000 0.000 0.000 0.360 0.000 0.609 0.032 0.000
Vanuatu 0.000 0.393 0.390 0.000 0.002 0.003 0.211 0.000
South Africa 0.000 0.079 0.459 0.383 0.000 0.079 0.000 0.000
Zambia 0.831 0.000 0.000 0.055 0.000 0.000 0.000 0.114
Zimbabwe 0.000 0.187 0.359 0.199 0.105 0.064 0.018 0.069
Figure 38: Distribution estimated with DialectId on the Tweets whose geographic information is the United States.
Figure 39: Distribution estimated with DialectId on the Tweets whose geographic information is Great Britain.
Figure 40: Distribution estimated with DialectId on the Tweets whose geographic information is Turkey.
Figure 41: Distribution estimated with DialectId on the Tweets whose geographic information is Germany.
Figure 42: Distribution estimated with DialectId on the Tweets whose geographic information is France.
Figure 43: Distribution estimated with DialectId on the Tweets whose geographic information is Canada.
Figure 44: Distribution estimated with DialectId on the Tweets whose geographic information is Australia.
Figure 45: Distribution estimated with DialectId on the Tweets whose geographic information is Italy.
Description
Table 15: Probability of the origin of Tweets in different non-Arabic-speaking countries.
Country United States United Kingdom Türkiye Germany France Canada Australia Italy
United Arab Emirates 0.057 0.617 0.027 0.080 0.000 0.040 0.152 0.028
Bahrain 0.017 0.806 0.083 0.022 0.022 0.000 0.048 0.000
Djibouti 0.254 0.060 0.002 0.055 0.150 0.125 0.107 0.249
Algeria 0.000 0.000 0.000 0.028 0.940 0.033 0.000 0.000
Egypt 0.000 0.000 0.000 0.025 0.000 0.000 0.000 0.975
Iraq 0.002 0.000 0.174 0.485 0.000 0.022 0.317 0.000
Jordan 0.016 0.000 0.486 0.357 0.000 0.029 0.075 0.036
Kuwait 0.004 0.885 0.061 0.000 0.009 0.000 0.042 0.000
Lebanon 0.000 0.000 0.000 0.010 0.105 0.578 0.307 0.000
Libya 0.000 0.000 0.319 0.151 0.009 0.144 0.000 0.377
Morocco 0.041 0.000 0.018 0.092 0.749 0.101 0.000 0.000
Mauritania 0.011 0.000 0.207 0.183 0.348 0.139 0.048 0.063
Oman 0.000 0.393 0.000 0.000 0.000 0.000 0.607 0.000
Qatar 0.065 0.829 0.040 0.053 0.000 0.000 0.000 0.013
Saudi Arabia 0.150 0.229 0.012 0.000 0.000 0.092 0.518 0.000
Sudan 0.000 0.000 0.000 0.061 0.531 0.180 0.000 0.228
Somalia 0.057 0.000 0.454 0.000 0.066 0.374 0.047 0.001
Syria 0.000 0.000 0.599 0.401 0.000 0.000 0.000 0.000
Chad 0.026 0.038 0.667 0.049 0.051 0.065 0.000 0.105
Tunisia 0.000 0.000 0.018 0.040 0.591 0.160 0.027 0.165
Yemen 0.654 0.078 0.046 0.118 0.000 0.042 0.044 0.016
Figure 46: Distribution estimated with DialectId on the Tweets whose geographic information is the United States.
Figure 47: Distribution estimated with DialectId on the Tweets whose geographic information is Morocco.
Figure 48: Distribution estimated with DialectId on the Tweets whose geographic information is Spain.
Figure 49: Distribution estimated with DialectId on the Tweets whose geographic information is Guadeloupe.
Figure 50: Distribution estimated with DialectId on the Tweets whose geographic information is Great Britain.
Figure 51: Distribution estimated with DialectId on the Tweets whose geographic information is Italy.
Figure 52: Distribution estimated with DialectId on the Tweets whose geographic information is Algeria.
Figure 53: Distribution estimated with DialectId on the Tweets whose geographic information is Tanzania.
Description
Table 16: Probability of the origin of Tweets in different non-French-speaking countries.
Country United States Morocco Spain Guadeloupe United Kingdom Italy Algeria Tanzania
Belgium 0.000 0.608 0.269 0.000 0.000 0.124 0.000 0.000
Burkina Faso 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000
Benin 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000
Canada 0.095 0.000 0.000 0.000 0.546 0.000 0.000 0.359
DR Congo 0.000 0.000 0.000 0.000 0.862 0.138 0.000 0.000
Central African Republic 0.000 0.140 0.000 0.539 0.046 0.016 0.260 0.000
Congo Republic 0.000 0.027 0.000 0.973 0.000 0.000 0.000 0.000
Switzerland 0.746 0.254 0.000 0.000 0.000 0.000 0.000 0.000
Cote d’Ivoire 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000
Cameroon 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000
Djibouti 0.000 0.037 0.000 0.000 0.000 0.000 0.963 0.000
France 0.000 0.000 0.470 0.000 0.000 0.530 0.000 0.000
Gabon 0.000 0.000 0.000 0.705 0.000 0.000 0.295 0.000
Guinea 0.000 0.383 0.000 0.000 0.000 0.000 0.617 0.000
Haiti 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Comoros 0.000 0.343 0.000 0.083 0.000 0.000 0.574 0.000
Luxembourg 0.002 0.104 0.025 0.309 0.168 0.392 0.000 0.000
Monaco 0.000 0.025 0.000 0.000 0.000 0.975 0.000 0.000
Mali 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000
New Caledonia 0.000 0.000 0.064 0.859 0.000 0.000 0.078 0.000
Niger 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000
French Polynesia 0.000 0.000 0.585 0.415 0.000 0.000 0.000 0.000
Rwanda 0.000 0.000 0.187 0.000 0.620 0.193 0.000 0.000
Senegal 0.072 0.777 0.000 0.000 0.000 0.000 0.150 0.000
Chad 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000
Togo 0.000 0.000 0.000 0.276 0.035 0.000 0.170 0.519