DialectId aims to develop a set of algorithms that detect the dialect of a given text. For example, given a text in Spanish, DialectId predicts the Spanish-speaking country from which the text comes.
DialectId is available for Arabic (ar), German (de), English (en), Spanish (es), French (fr), Dutch (nl), Portuguese (pt), Russian (ru), Turkish (tr), and Chinese (zh).
DialectId can be installed using the conda package manager with the following instruction.
conda install --channel conda-forge dialectid
A more general approach to installing DialectId is to use pip, as illustrated in the following instruction.
pip install dialectid
DialectId can be used to predict the dialect of a list of texts using the method predict, as seen in the following lines. The first line imports the DialectId class, the second instantiates it for Spanish, and the third predicts the dialect of two utterances. The first corresponds to an expression that would be common in Mexico, and the second to an expression that could be associated with Argentina, Uruguay, Chile, and other South American countries.
from dialectid import DialectId
detect = DialectId(lang='es')
detect.predict(['comiendo unos tacos',
                'acompañando el asado con un buen vino'])
array(['mx', 'uy'], dtype='<U2')
The available dialects for each language are listed in the attribute countries, as seen in the following snippet for Spanish.
from dialectid import DialectId
detect = DialectId(lang='es')
detect.countries
array(['ar', 'bo', 'cl', 'co', 'cr', 'cu', 'do', 'ec', 'es', 'gq', 'gt',
'hn', 'mx', 'ni', 'pa', 'pe', 'pr', 'py', 'sv', 'uy', 've'],
dtype='<U2')
One might be interested in all the countries from which the speaker could come. To facilitate this, one can use the decision_function method. DialectId uses linear Support Vector Machines (SVM) as classifiers; consequently, positive decision-function values are interpreted as belonging to the positive class, i.e., a particular country. The following code exemplifies this idea: the first two lines import and instantiate the DialectId class for Spanish. The third line computes the decision-function values; the method returns a two-dimensional array whose first dimension corresponds to the number of texts, and the indexing keeps only the values of the single text, where positive values indicate the presence of the corresponding country. The fourth line sorts the indices so that the highest value comes first. The fifth line retrieves each country and its decision-function value, considering only the countries with positive values.
from dialectid import DialectId
detect = DialectId(lang='es')
df = detect.decision_function(['acompañando el asado con un buen vino'])[0]
index = df.argsort()[::-1]
[(detect.countries[i], df[i]) for i in index
 if df[i] > 0]
[(np.str_('uy'), np.float32(1.5416805)),
(np.str_('py'), np.float32(1.3321806)),
(np.str_('ar'), np.float32(1.2182581))]
When one is interested in the positive classes, as in the previous example, DialectId implements the DialectId.positive method to retrieve the positive labels for a list of texts, as shown in the following example.
from dialectid import DialectId
detect = DialectId(lang='es')
pos = detect.positive(['acompañando el asado con un buen vino'])[0]
pos
{'ar': np.float32(1.2182581),
'py': np.float32(1.3321806),
'uy': np.float32(1.5416805)}
In some situations, one is interested in the probability instead of the decision-function values of a linear SVM. The probability can be computed with the predict_proba method. The following code exemplifies this idea: the first line imports the DialectId class, as in previous examples. The second line differs from the previous example in that the parameter probability is set to True. The remaining lines are almost equivalent to the previous example.
from dialectid import DialectId
detect = DialectId(lang='es', probability=True)
prob = detect.predict_proba(['acompañando el asado con un buen vino'])[0]
index = prob.argsort()[::-1]
[(detect.countries[i], prob[i])
 for i in index[:4]]
[(np.str_('uy'), np.float32(0.45955184)),
(np.str_('ar'), np.float32(0.353442)),
(np.str_('py'), np.float32(0.18695451)),
(np.str_('cl'), np.float32(2.8124754e-05))]
The DialectId.positive method can also be used when one is interested in the probabilities of the positive classes, as shown in the following lines.
from dialectid import DialectId
detect = DialectId(lang='es', probability=True)
pos = detect.positive(['acompañando el asado con un buen vino'])[0]
pos
{'ar': np.float32(0.353442),
'py': np.float32(0.18695451),
'uy': np.float32(0.45955184)}
| Country | train | test | train (orig. dist.) | test (orig. dist.) | Corpus |
|---|---|---|---|---|---|
| Saudi Arabia | 119009 | 4096 | 1139707 | 1101214 | 61578662 |
| Egypt | 119122 | 4096 | 271439 | 287583 | 14665935 |
| Kuwait | 119117 | 4096 | 188944 | 187432 | 10208696 |
| United Arab Emirates | 119600 | 4096 | 115345 | 105957 | 6232153 |
| Oman | 119771 | 4096 | 70484 | 70730 | 3808309 |
| Iraq | 119655 | 4096 | 50912 | 63215 | 2750834 |
| Qatar | 119362 | 4096 | 48860 | 46962 | 2639967 |
| Bahrain | 119666 | 4096 | 45196 | 38131 | 2441971 |
| Lebanon | 119370 | 4096 | 35812 | 30455 | 1934983 |
| Jordan | 119718 | 4096 | 34619 | 33242 | 1870514 |
| Libya | 119659 | 4096 | 31495 | 29417 | 1701716 |
| Yemen | 119823 | 4096 | 16917 | 33165 | 914053 |
| Algeria | 119143 | 4096 | 16609 | 18617 | 897394 |
| Morocco | 119556 | 4096 | 9600 | 16093 | 518739 |
| Sudan | 120078 | 4096 | 7662 | 16291 | 413993 |
| Tunisia | 119244 | 4096 | 6405 | 7435 | 346082 |
| Syria | 119159 | 4093 | 5768 | 9596 | 311660 |
| Mauritania | 41017 | 1809 | 844 | 760 | 45624 |
| Somalia | 17410 | 561 | 355 | 234 | 19215 |
| Chad | 4797 | 706 | 105 | 295 | 5706 |
| Djibouti | 2873 | 309 | 63 | 152 | 3420 |
| Sum | 2097149 | 73014 | 2097141 | 2096976 | 113309626 |
| Country | train | test | train (orig. dist.) | test (orig. dist.) | Corpus |
|---|---|---|---|---|---|
| Germany | 83023 | 4096 | 80620 | 1110160 | 1262931 |
| Austria | 7004 | 4096 | 7004 | 100180 | 109718 |
| Switzerland | 4573 | 4096 | 4547 | 64578 | 71231 |
| Sum | 94600 | 12288 | 92171 | 1274918 | 1443880 |
| Country | train | test | train (orig. dist.) | test (orig. dist.) | Corpus |
|---|---|---|---|---|---|
| United States | 36492 | 4096 | 1411215 | 1269614 | 1241245184 |
| United Kingdom | 36417 | 4096 | 284076 | 304237 | 249861873 |
| Canada | 36338 | 4096 | 79421 | 81213 | 69855356 |
| India | 36348 | 4096 | 76678 | 128000 | 67442862 |
| Nigeria | 36199 | 4096 | 43566 | 73111 | 38319549 |
| South Africa | 36323 | 4096 | 42569 | 43805 | 37442472 |
| Australia | 36373 | 4096 | 38466 | 46482 | 33833515 |
| Philippines | 36599 | 4096 | 36427 | 21966 | 32039887 |
| Ireland | 36352 | 4096 | 20796 | 24196 | 18291944 |
| Kenya | 36231 | 4096 | 10383 | 20803 | 9132974 |
| Pakistan | 36376 | 4096 | 9236 | 16466 | 8124273 |
| Ghana | 36395 | 4096 | 8702 | 14811 | 7654670 |
| New Zealand | 36361 | 4096 | 6610 | 8397 | 5813959 |
| Singapore | 36048 | 4096 | 5608 | 4189 | 4933008 |
| Uganda | 36511 | 4096 | 4662 | 15003 | 4100771 |
| Jamaica | 36332 | 4096 | 3185 | 3332 | 2801604 |
| Zimbabwe | 36237 | 4096 | 1809 | 3387 | 1591827 |
| Trinidad and Tobago | 36459 | 4096 | 1725 | 1968 | 1517980 |
| Zambia | 36686 | 4096 | 1468 | 2464 | 1291544 |
| Namibia | 36553 | 4096 | 1268 | 1752 | 1115587 |
| Bahamas | 36223 | 4096 | 1265 | 1110 | 1113202 |
| Barbados | 36478 | 4096 | 868 | 766 | 764085 |
| Malawi | 36373 | 4096 | 753 | 1944 | 662789 |
| Rwanda | 36374 | 4096 | 496 | 946 | 436529 |
| Cameroon | 36461 | 4096 | 416 | 785 | 365902 |
| Malta | 36405 | 4096 | 398 | 560 | 350352 |
| Antigua and Barbuda | 36526 | 4096 | 356 | 347 | 313582 |
| Guam | 36494 | 3008 | 351 | 101 | 309229 |
| St. Lucia | 36223 | 4096 | 313 | 235 | 275897 |
| Eswatini | 36408 | 4096 | 268 | 354 | 236190 |
| Mauritius | 36306 | 4096 | 263 | 211 | 231391 |
| Bermuda | 36319 | 4096 | 259 | 299 | 227865 |
| Isle of Man | 36220 | 1495 | 248 | 50 | 218569 |
| Lesotho | 35926 | 4096 | 241 | 491 | 212309 |
| Cayman Islands | 36161 | 4096 | 204 | 191 | 180023 |
| Gambia | 36296 | 4096 | 204 | 516 | 179764 |
| Gibraltar | 36216 | 4096 | 193 | 224 | 170041 |
| Sierra Leone | 36278 | 4096 | 183 | 532 | 161814 |
| Turks and Caicos Islands | 36277 | 3064 | 179 | 106 | 158077 |
| Sudan | 36460 | 4096 | 165 | 177 | 145226 |
| St. Vincent and the Grenadines | 36324 | 4096 | 160 | 209 | 140768 |
| Belize | 36538 | 4096 | 154 | 211 | 136040 |
| Liberia | 36247 | 4096 | 136 | 389 | 120223 |
| Grenada | 36573 | 2761 | 134 | 97 | 118559 |
| British Virgin Islands | 36276 | 1650 | 126 | 57 | 111011 |
| Guyana | 36654 | 4096 | 106 | 193 | 93531 |
| St. Kitts and Nevis | 36601 | 3652 | 106 | 125 | 93321 |
| United States Virgin Islands | 36592 | 219 | 103 | 7 | 90888 |
| Northern Mariana Islands | 36550 | 617 | 100 | 21 | 88606 |
| Papua New Guinea | 36038 | 3904 | 89 | 136 | 78435 |
| Puerto Rico | 36594 | 3164 | 74 | 113 | 65874 |
| Dominica | 36452 | 1140 | 63 | 38 | 55815 |
| Sint Maarten | 36311 | 1745 | 57 | 59 | 50880 |
| Fiji | 36538 | 1934 | 53 | 65 | 47474 |
| Guernsey | 24068 | 1790 | 31 | 62 | 27863 |
| Tonga | 25728 | 901 | 31 | 31 | 27690 |
| Anguilla | 23965 | 1250 | 30 | 42 | 26826 |
| Vanuatu | 13985 | 767 | 17 | 26 | 15601 |
| Falkland Islands | 11917 | 412 | 14 | 13 | 13158 |
| Micronesia, Fed. Sts. | 7497 | 266 | 10 | 9 | 9306 |
| Cook Islands | 8053 | 274 | 10 | 9 | 8961 |
| Solomon Islands | 8141 | 458 | 10 | 15 | 9479 |
| Palau | 6556 | 691 | 8 | 23 | 7491 |
| St. Helena | 2876 | 974 | 5 | 32 | 4458 |
| Sum | 2097128 | 204072 | 2097120 | 2097123 | 1844565933 |
| Country | train | test | train (orig. dist.) | test (orig. dist.) | Corpus |
|---|---|---|---|---|---|
| Argentina | 104466 | 4096 | 536137 | 415844 | 156910687 |
| Spain | 103924 | 4096 | 401933 | 421172 | 117633385 |
| Mexico | 104318 | 4096 | 353432 | 388367 | 103438764 |
| Colombia | 104267 | 4096 | 204831 | 261796 | 59947766 |
| Chile | 104027 | 4096 | 156319 | 143770 | 45749886 |
| Venezuela | 104496 | 4096 | 109073 | 88022 | 31922346 |
| Uruguay | 103733 | 4096 | 66209 | 60004 | 19377563 |
| Ecuador | 104408 | 4096 | 53037 | 68303 | 15522286 |
| Peru | 103907 | 4096 | 52144 | 59480 | 15261118 |
| Paraguay | 104617 | 4096 | 33486 | 37244 | 9800404 |
| Dominican Republic | 104000 | 4096 | 30142 | 36468 | 8821881 |
| Panama | 104014 | 4096 | 25525 | 27081 | 7470575 |
| Costa Rica | 104415 | 4096 | 19730 | 16252 | 5774617 |
| Guatemala | 103800 | 4096 | 17401 | 22567 | 5092733 |
| El Salvador | 104111 | 4096 | 10990 | 12949 | 3216498 |
| Honduras | 104020 | 4096 | 8660 | 14988 | 2534698 |
| Nicaragua | 104438 | 4096 | 8435 | 6951 | 2468938 |
| Bolivia | 103537 | 4096 | 4913 | 6523 | 1438141 |
| Cuba | 104570 | 4096 | 3359 | 8783 | 983104 |
| Puerto Rico | 103994 | 1487 | 1320 | 149 | 386595 |
| Equatorial Guinea | 14090 | 4096 | 64 | 429 | 18783 |
| Sum | 2097152 | 83407 | 2097140 | 2097142 | 613770768 |
| Country | train | test | train (orig. dist.) | test (orig. dist.) | Corpus |
|---|---|---|---|---|---|
| France | 215940 | 4096 | 1488423 | 1475608 | 9241569 |
| Canada | 215745 | 4096 | 122173 | 134958 | 758570 |
| DR Congo | 214853 | 4096 | 93877 | 100146 | 582879 |
| Cameroon | 215520 | 4096 | 92835 | 97799 | 576414 |
| Belgium | 214731 | 4096 | 65710 | 72143 | 407994 |
| Senegal | 216403 | 4096 | 53106 | 47977 | 329734 |
| Cote d’Ivoire | 215555 | 4096 | 52423 | 44793 | 325496 |
| Switzerland | 116815 | 4096 | 24217 | 19484 | 150366 |
| Guinea | 90619 | 4096 | 19163 | 16753 | 118984 |
| Benin | 63347 | 4096 | 14310 | 14802 | 88856 |
| Mali | 54495 | 4096 | 12232 | 12809 | 75952 |
| Togo | 44077 | 4096 | 9698 | 10088 | 60220 |
| Burkina Faso | 28514 | 4096 | 6957 | 7326 | 43197 |
| Gabon | 27051 | 4096 | 5986 | 6156 | 37173 |
| Haiti | 25939 | 4096 | 5909 | 5850 | 36694 |
| Niger | 27468 | 4096 | 5900 | 5742 | 36638 |
| Congo Republic | 26441 | 4096 | 5584 | 5094 | 34676 |
| Chad | 15160 | 4096 | 3452 | 4008 | 21439 |
| Monaco | 13037 | 4096 | 2901 | 3029 | 18014 |
| Luxembourg | 11358 | 4096 | 2820 | 3506 | 17511 |
| Central African Republic | 13445 | 2122 | 2551 | 1453 | 15840 |
| New Caledonia | 7150 | 1715 | 1486 | 1222 | 9230 |
| French Polynesia | 6408 | 2304 | 1459 | 1610 | 9065 |
| Djibouti | 6331 | 2237 | 1429 | 1577 | 8873 |
| Comoros | 6025 | 1736 | 1273 | 1197 | 7908 |
| Rwanda | 4695 | 2749 | 1263 | 2010 | 7848 |
| Sum | 2097122 | 94783 | 2097137 | 2097140 | 13021140 |
| Country | train | test | train (orig. dist.) | test (orig. dist.) | Corpus |
|---|---|---|---|---|---|
| Netherlands | 102887 | 4096 | 102887 | 947365 | 1092054 |
| Belgium | 16880 | 4096 | 15640 | 144451 | 166015 |
| Sum | 119767 | 8192 | 118527 | 1091816 | 1258069 |
| Country | train | test | train (orig. dist.) | test (orig. dist.) | Corpus |
|---|---|---|---|---|---|
| Brazil | 978254 | 4096 | 2049142 | 2046379 | 79748620 |
| Portugal | 979061 | 4096 | 43008 | 45744 | 1673795 |
| Mozambique | 78571 | 4096 | 2656 | 2366 | 103393 |
| Angola | 53243 | 4096 | 2071 | 2400 | 80610 |
| Cabo Verde | 8022 | 2434 | 273 | 260 | 10627 |
| Sum | 2097151 | 18818 | 2097150 | 2097149 | 81617045 |
| Country | train | test | train (orig. dist.) | test (orig. dist.) | Corpus |
|---|---|---|---|---|---|
| Russia | 962191 | 4096 | 1887634 | 311120 | 11487343 |
| Belarus | 652818 | 4096 | 114108 | 25498 | 694419 |
| Kazakhstan | 336189 | 4096 | 67909 | 35634 | 413267 |
| Kyrgyz Republic | 145942 | 4096 | 27499 | 18562 | 167350 |
| Sum | 2097140 | 16384 | 2097150 | 390814 | 12762379 |
| Country | train | test | train (orig. dist.) | test (orig. dist.) | Corpus |
|---|---|---|---|---|---|
| Türkiye | 550662 | 4096 | 550664 | 184575 | 756902 |
| Cyprus | 3085 | 924 | 2936 | 924 | 4036 |
| Sum | 553747 | 5020 | 553600 | 185499 | 760938 |
| Country | train | test | train (orig. dist.) | test (orig. dist.) | Corpus |
|---|---|---|---|---|---|
| China | 206312 | 4096 | 206313 | 175991 | 510998 |
| Taiwan | 115165 | 4096 | 100741 | 110266 | 249518 |
| Hong Kong | 13944 | 4096 | 10315 | 9315 | 25549 |
| Singapore | 4452 | 3963 | 4130 | 4623 | 10230 |
| Sum | 339873 | 16251 | 321499 | 300195 | 796295 |
The dataset used to create the self-supervised problems is a collection of tweets gathered from the open stream over several years: the Spanish collection started on December 11, 2015; English on July 1, 2016; Arabic on January 25, 2017; Russian on October 16, 2018; and the rest of the languages on June 1, 2021. In all cases, the last day collected was June 9, 2023. The collected tweets were filtered with the following restrictions: retweets were removed; URLs and usernames were replaced by the tokens _url and _usr, respectively; and only tweets with at least 50 characters were included in the final collection. The column Corpus in Table 1 and Figure 1 show the number of tweets collected for the Arabic-speaking countries. The figure shows that there are days when more tweets are collected, and a tendency to collect fewer tweets in 2023 due to changes in the Twitter API. The data corresponding to German, English, Spanish, French, Dutch, Portuguese, Russian, Turkish, and Chinese are shown in Table 2, Table 3, Table 4, Table 5, Table 6, Table 7, Table 8, Table 9, and Table 10; and Figure 2, Figure 3, Figure 4, Figure 5, Figure 6, Figure 7, Figure 8, Figure 9, and Figure 10.
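The filtering steps above can be sketched as follows. The regular expressions for URLs and usernames are assumptions for illustration; the exact patterns used to build the corpus are not shown here.

```python
import re
from typing import Optional

def filter_tweet(text: str, is_retweet: bool) -> Optional[str]:
    """Apply the collection restrictions: drop retweets, replace URLs and
    usernames with _url/_usr, and keep only texts with >= 50 characters."""
    if is_retweet:                                 # retweets are removed
        return None
    text = re.sub(r'https?://\S+', '_url', text)   # replace URLs
    text = re.sub(r'@\w+', '_usr', text)           # replace usernames
    return text if len(text) >= 50 else None       # length restriction
```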
The corpora are used to create two pairs of training and test sets. The training sets are drawn from tweets published before October 1, 2022, and the test sets from tweets published on or after October 3, 2022. The procedure for creating each pair consists of two stages. In the first stage, the tweets were organized by country and then selected to form a uniform distribution by day; within each day, near duplicates were removed. Then, a three-day sliding window was used to remove near duplicates within the window. The final step was to shuffle the data to remove the ordering by date, while respecting the boundary between the training and test sets.
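The sliding-window deduplication can be illustrated with a minimal sketch, where exact matching on the text stands in for the near-duplicate detection actually used.

```python
from collections import defaultdict

def dedup_sliding_window(tweets, window=3):
    """Drop repeated texts within a sliding window of days.

    `tweets` is a list of (day, text) pairs; exact matching is used
    here as a stand-in for the real near-duplicate detection."""
    by_day = defaultdict(list)
    for day, text in tweets:
        by_day[day].append(text)
    kept, seen_by_day = [], {}
    for day in sorted(by_day):
        recent = set()
        for d in range(day - window + 1, day):   # previous window-1 days
            recent |= seen_by_day.get(d, set())
        today = set()
        for text in by_day[day]:
            if text not in recent and text not in today:
                kept.append((day, text))
                today.add(text)
        seen_by_day[day] = today
    return kept
```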
The tweets of the first pair were selected to follow a uniform distribution by country as closely as possible. In this pair, the size of the training set is roughly 2 million tweets, whereas the test set contains \(2^{12}\) (4,096) tweets per country. We also produced a smaller training set containing 262 thousand tweets, built with an equivalent procedure and likewise aiming at a uniform distribution across countries. The column labeled train in Table 1, Table 2, Table 3, Table 4, Table 5, Table 6, Table 7, Table 8, Table 9, and Table 10 shows the size of the training set, and the column test indicates the size of the test set in the first pair. It is worth mentioning that we did not have enough data for every country and language to follow an exactly uniform distribution. For example, in Table 4 (Spanish), the 1,487 tweets in the Puerto Rico (pr) test set correspond to all the available tweets that meet the imposed restrictions.
The second pair of tweets was selected to follow the original distribution of the corpus; in this case, the training and test sets each have a maximum size of 2 million (\(2^{21}\)) tweets. The selection was posed as a convex optimization problem whose objective is to maximize the number of tweets, subject to the \(2^{21}\) cap and the availability of tweets for each country, with the target distribution given by all the available tweets. The column labeled train (orig. dist.) in Table 1, Table 2, Table 3, Table 4, Table 5, Table 6, Table 7, Table 8, Table 9, and Table 10 shows the size of the training set, and the column test (orig. dist.) indicates the size of the test set in the second pair.
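Because the target distribution here is the corpus distribution itself, the selection admits a simple closed-form sketch: scale the corpus proportions by the largest factor compatible with the per-country availability and the overall cap. The function below is an illustration, not the optimizer actually used.

```python
import numpy as np

def select_counts(avail, cap=2**21):
    """Per-country sample sizes following the corpus distribution.

    `avail` holds the number of tweets available per country; the
    proportions are scaled by the largest feasible total size."""
    avail = np.asarray(avail, dtype=float)
    p = avail / avail.sum()               # original distribution
    scale = min(cap, (avail / p).min())   # largest feasible total
    return np.floor(scale * p).astype(int)
```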
DialectId is a text classifier based on a Bag of Words (BoW) representation with a linear Support Vector Machine (SVM) as the classifier.
The normalization procedure used in the BoW corresponds to setting all characters to lowercase, removing diacritics, and replacing usernames and URLs with the tags “_usr” and “_url”, respectively.
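A minimal sketch of this normalization, assuming standard patterns for usernames and URLs, is the following.

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Lowercase, strip diacritics, and replace usernames/URLs with
    the tags _usr/_url (patterns are assumptions)."""
    text = re.sub(r'https?://\S+', '_url', text)
    text = re.sub(r'@\w+', '_usr', text)
    text = text.lower()
    # decompose characters and drop the combining marks (diacritics)
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed
                   if not unicodedata.combining(ch))
```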
The BoW representation weights the tokens with the term frequency and inverse document frequency (TF-IDF). The tokens correspond to words, bi-grams of words, and q-grams of characters (with q = 4, 3, 2). The tokens and weights were estimated on each language's training dataset (2 million tweets), keeping the tokens (vocabulary) with the highest frequency in the training set. We developed systems for different vocabulary sizes, i.e., \(2^{17}\), \(2^{18}\), and \(2^{19}\).
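The token types can be illustrated with a small stand-alone function; the real tokenizer works on the normalized text and applies the frequency cut-off that defines the vocabulary.

```python
def tokens(text, qs=(2, 3, 4)):
    """Enumerate the token types used by the BoW: words, word bi-grams,
    and character q-grams with q = 2, 3, 4 (illustrative sketch)."""
    words = text.split()
    toks = list(words)
    toks += [' '.join(p) for p in zip(words, words[1:])]  # word bi-grams
    for q in qs:                                          # char q-grams
        toks += [text[i:i + q] for i in range(len(text) - q + 1)]
    return toks
```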
The BoW can be used by importing the BoW class, as seen in the following example, where the text good morning is transformed into the vector space. The first line imports the class, the second instantiates it, where the parameter token_max_filter indicates the vocabulary size, and the third converts the text into the vector space.
from dialectid import BoW
bow = BoW(lang='en', token_max_filter=2**18)
bow.transform(['good morning'])
<Compressed Sparse Row sparse matrix of dtype 'float32'
with 36 stored elements and shape (1, 262144)>
Each text in the training set is represented in the vector space, and the associated country is kept as the label for a linear SVM trained with the one-vs-all strategy. This approach creates as many binary classification problems as there are classes; in each binary problem, one class is the positive class and the remaining classes form the negative class. Traditionally, all the examples are used, which is the case for the reduced training set (262 thousand tweets). Nonetheless, in the full training set, the negative examples were limited to the number of positive examples or \(2^{14}\) tweets, whichever is larger. In both cases, the examples are weighted inversely proportional to the class frequencies to handle the class imbalance.
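The negative-example capping can be sketched as follows; the exact rule (here, at most max(#positives, \(2^{14}\)) negatives per binary problem) and the random source are assumptions.

```python
import numpy as np

def ovr_indices(y, label, floor=2**14, seed=0):
    """Index set for the one-vs-all binary problem of `label`: all
    positive examples, plus negatives capped at max(#positives, floor)."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == label)
    neg = np.flatnonzero(y != label)
    cap = max(len(pos), floor)
    if len(neg) > cap:
        neg = rng.choice(neg, size=cap, replace=False)
    return np.concatenate([pos, neg])
```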
Complementing the previous example, the following code instantiates DialectId in Spanish with a vocabulary size of \(2^{18}\), indicated by the parameters lang and token_max_filter, respectively.
from dialectid import DialectId
detect = DialectId(lang='es', token_max_filter=2**18)
detect.predict(['comiendo unos tacos'])
array(['mx'], dtype='<U2')
A drawback of using an SVM is that it does not estimate the classification probability. For some applications, it is more convenient to work with probabilities than with decision-function values. Thus, the developed systems are calibrated to estimate the probability by training a logistic regression on the SVM's decision function. The calibration procedure predicts the SVM's decision function on the reduced training set using stratified k-fold cross-validation (k = 3). The predicted decision functions are the inputs of the logistic regression, and the classes are those of the reduced training set; the examples are again weighted inversely proportional to the class frequencies. To invoke the model using probability, the parameter probability must be set to True, as shown in the following example.
from dialectid import DialectId
detect = DialectId(lang='es', probability=True)
detect.predict_proba(['comiendo unos tacos'])
array([[1.9695617e-06, 4.8897579e-07, 1.8095196e-05, 3.6204598e-05,
1.5155481e-03, 1.1557303e-05, 5.4983921e-06, 2.8960928e-06,
2.2128135e-05, 5.0534654e-05, 1.7656431e-01, 2.0665383e-02,
6.3909459e-01, 1.3323617e-01, 1.0678391e-04, 7.2897665e-06,
2.6716828e-02, 6.4306505e-08, 1.9328916e-03, 8.0165564e-06,
2.7745039e-06]], dtype=float32)
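The calibration procedure can be reproduced with scikit-learn on toy data; the data and names below are illustrative, not the actual training pipeline.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Toy data standing in for the reduced training set (3 "countries").
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(60, 5)) for c in (-2, 0, 2)])
y = np.repeat(['ar', 'mx', 'uy'], 60)

# Out-of-fold SVM decision functions via stratified 3-fold CV ...
svm = LinearSVC(class_weight='balanced')
df = cross_val_predict(svm, X, y, cv=StratifiedKFold(3),
                       method='decision_function')

# ... become the inputs of a class-balanced logistic regression that
# maps them to calibrated class probabilities.
calib = LogisticRegression(class_weight='balanced').fit(df, y)
proba = calib.predict_proba(df)
```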
As described previously, there are two training sets: one that follows a uniform distribution across countries as closely as possible, and one that follows the distribution seen in the corpus, namely the original distribution (identified as orig. dist.). The parameter uniform_distribution indicates which training set is used to estimate the parameters. By default, it is set to True, so the training set with a uniform distribution across countries is used.
from dialectid import DialectId
detect = DialectId(lang='es',
uniform_distribution=False,
probability=True)
detect.predict_proba(['comiendo unos tacos'])
array([[5.8839246e-06, 4.0277046e-06, 1.5494716e-05, 3.9904633e-05,
1.5577762e-02, 6.6911220e-05, 9.4599171e-05, 2.6315629e-05,
1.4272113e-05, 4.7489698e-06, 5.5686768e-02, 2.2258271e-02,
8.8939697e-01, 1.3783798e-02, 3.9202135e-04, 8.4752110e-06,
4.6369270e-05, 9.1043790e-07, 2.5635022e-03, 5.8783221e-06,
7.1929417e-06]], dtype=float32)
| Language | Spanish | English | Arabic | German | French | Dutch | Portuguese | Russian | Turkish | Chinese |
|---|---|---|---|---|---|---|---|---|---|---|
| DialectId[19] | 0.4380 | 0.3322 | 0.4330 | 0.5424 | 0.3653 | 0.7708 | 0.6153 | 0.4894 | 0.5421 | 0.7009 |
| DialectId[18] | 0.4310 | 0.3284 | 0.4269 | 0.5441 | 0.3623 | 0.7734 | 0.6134 | 0.4866 | 0.5445 | 0.7035 |
| DialectId[17] | 0.4204 | 0.3221 | 0.4219 | 0.5470 | 0.3570 | 0.7742 | 0.6110 | 0.4839 | 0.5476 | 0.7049 |
| DialectId[19] (prob) | 0.4338 | 0.3132 | 0.4282 | 0.5586 | 0.3661 | 0.7708 | 0.6250 | 0.4810 | 0.5421 | 0.7287 |
| DialectId[18] (prob) | 0.4273 | 0.3082 | 0.4233 | 0.5585 | 0.3613 | 0.7734 | 0.6214 | 0.4758 | 0.5445 | 0.7305 |
| DialectId[17] (prob) | 0.4162 | 0.3034 | 0.4176 | 0.5575 | 0.3564 | 0.7742 | 0.6190 | 0.4752 | 0.5476 | 0.7267 |
| DialectId[19] (262k) | 0.3530 | 0.2827 | 0.3821 | 0.4901 | 0.3594 | 0.7708 | 0.6127 | 0.4624 | 0.5521 | 0.6718 |
| DialectId[18] (262k) | 0.3480 | 0.2792 | 0.3771 | 0.4944 | 0.3549 | 0.7734 | 0.6081 | 0.4590 | 0.5528 | 0.6753 |
| DialectId[17] (262k) | 0.3421 | 0.2738 | 0.3715 | 0.5006 | 0.3495 | 0.7742 | 0.6058 | 0.4612 | 0.5577 | 0.6761 |
| StackBoW (262k) | 0.3331 | 0.2441 | 0.3673 | 0.4893 | 0.3339 | 0.7823 | 0.6000 | 0.4468 | 0.5649 | 0.6859 |
| Language | Spanish | English | Arabic | German | French | Dutch | Portuguese | Russian | Turkish | Chinese |
|---|---|---|---|---|---|---|---|---|---|---|
| DialectId[19] | 0.301 (1.0) | 0.103 (1.0) | 0.245 (1.0) | 0.499 (1.0) | 0.166 (1.0) | 0.754 (1.0) | 0.230 (1.0) | 0.359 (1.0) | 0.575 (1.0) | 0.714 (1.0) |
| DialectId[19] (prob) | 0.275 (1.0) | 0.081 (1.0) | 0.222 (1.0) | 0.471 (1.0) | 0.165 (1.0) | 0.754 (1.0) | 0.227 (1.0) | 0.341 (1.0) | 0.577 (1.0) | 0.637 (1.0) |
| DialectId[19] (Bias) | 0.337 (1.0) | 0.135 (1.0) | 0.297 (1.0) | 0.545 (1.0) | 0.205 (1.0) | 0.754 (1.0) | 0.316 (1.0) | 0.243 (1.0) | 0.498 (1.0) | 0.791 (1.0) |
| DialectId[19] (Orig. Dist.) | 0.288 (1.0) | 0.217 (1.0) | 0.232 (1.0) | 0.496 (1.0) | 0.185 (1.0) | 0.758 (1.0) | 0.374 (1.0) | 0.370 (1.0) | 0.580 (1.0) | 0.729 (1.0) |
| DialectId[19] (Pos.) | 0.843 (0.11) | 0.927 (0.02) | 0.793 (0.10) | 0.574 (0.77) | 0.833 (0.04) | 0.766 (1.00) | 0.701 (0.68) | 0.611 (0.42) | 0.542 (1.00) | 0.740 (0.85) |
| DialectId[19] (One) | 0.760 (0.33) | 0.600 (0.30) | 0.686 (0.33) | 0.682 (0.28) | 0.661 (0.27) | 0.921 (0.39) | 0.862 (0.21) | 0.871 (0.13) | 0.528 (0.82) | 0.753 (0.65) |
The performance, measured with macro-recall, of DialectId with different parameters and of the baseline StackBoW (Graff, Moctezuma, and Téllez (2025); Graff et al. (2020)) is presented in Table 11 and Table 12. Table 11 shows the performance on the test set with at most 4,096 tweets per country, and Table 12 presents the performance on the test set that follows the original distribution across countries.
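As a reminder of the metric, macro-recall averages the per-country recall with equal weight, which is why minority countries matter as much as majority ones. A minimal implementation:

```python
import numpy as np

def macro_recall(y_true, y_pred):
    """Average of the per-class recalls, each class weighted equally."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c)
               for c in np.unique(y_true)]
    return float(np.mean(recalls))
```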
The notation used is as follows: the number in brackets indicates the exponent of the vocabulary size (e.g., [19] corresponds to \(2^{19}\)), and the systems with 262k in parentheses show the performance of the systems trained with the small training set. The system identified with the label Orig. Dist. is trained on the training set that follows the original distribution; the rest of the systems are trained on the training set with a uniform distribution across countries.
It can be observed in Table 11 that DialectId outperforms the baseline (StackBoW) in almost all languages, the exceptions being Dutch and Turkish. It is worth noting that the training sets of these languages contain fewer than 600k tweets. DialectId trained with the uniform-distribution training set outperformed the system trained with the original-distribution training set; this behaviour is expected because the original-distribution training set provides fewer examples of the minority classes, whereas macro-recall gives the same weight to all classes. DialectId with a vocabulary size of \(2^{19}\) obtained the best performance in Spanish, English, Arabic, French, Portuguese, Russian, and Chinese. DialectId with a vocabulary size of \(2^{18}\) obtained the best performance in German; this could be because the training set contains only 94k tweets, which might not be enough to exploit a larger vocabulary. The other language with few examples is Chinese; in this case, the difference in performance between DialectId with \(2^{18}\) and \(2^{19}\) is not statistically significant, as can be seen in Figure 11.
DialectId aims to estimate the likely origin of a text; one of its applications is to calculate the distribution of dialects in a set of texts. The performance has been presented using macro-recall; however, this measure does not indicate how close the distribution computed with DialectId is to the actual one. To provide this information, Table 13 presents the Pearson correlation coefficient, on the test set that follows the original distribution, of DialectId with a vocabulary size of \(2^{19}\) trained with the two training sets. It can be observed that in all the languages the correlation is above 0.9. The lowest value is in Spanish, where the system trained with the uniform distribution achieved 0.9063, while the other system achieved 0.9824. DialectId trained with the original distribution has correlation coefficients above 0.98 in all cases; however, it is also the system with the lowest macro-recall in all cases. The table includes the system DialectId (Bias), which is equivalent to DialectId trained on the uniform distribution, except that the probabilities are weighted by the proportion of each country measured in the original-distribution training set. It can be observed that in all cases DialectId (Bias) has a correlation greater than 0.97.
To complement the information presented in Table 13, Figure 12, Figure 13, Figure 14, Figure 15, Figure 16, Figure 17, Figure 18, Figure 19, Figure 20, and Figure 21 present these distributions for Arabic, German, English, Spanish, French, Dutch, Portuguese, Russian, Turkish, and Chinese; all the figures follow an equivalent notation. For example, in Figure 15 the blue line shows the distribution measured in the test set, the wide orange line shows the distribution obtained with the predictions of DialectId trained on the uniform distribution, and the wide green line shows the distribution obtained with the predictions of DialectId trained on the original distribution.
The figure also includes, as thin lines, distributions estimated with the two versions of DialectId from a dataset that lacks geographic information, where it is therefore impossible to measure the actual distribution. This dataset comes from the same period as the test set and received an equivalent treatment (e.g., near duplicates were removed, among other constraints). It can be observed that the thin lines follow the wide lines in almost all countries, except the Dominican Republic. These distributions (the thin lines) illustrate one of the applications of DialectId: estimating the dialect distribution of a collection of texts for which geographic information is not available.
| Language | Spanish | English | Arabic | German | French | Dutch | Portuguese | Russian | Turkish | Chinese |
|---|---|---|---|---|---|---|---|---|---|---|
| DialectId[19] | 0.9776 | 0.9408 | 0.9731 | 1.0000 | 0.9598 | 1.0000 | 0.9955 | 0.9623 | 1.0000 | 1.0000 |
| DialectId[19] (prob) | 0.9242 | 0.7442 | 0.9195 | 0.9997 | 0.6631 | 1.0000 | 0.9960 | 0.8960 | 1.0000 | 0.9997 |
| DialectId[19] (Bias) | 0.9820 | 0.9870 | 0.9936 | 1.0000 | 0.9982 | 1.0000 | 0.9999 | 0.9993 | 1.0000 | 1.0000 |
| DialectId[19] (Orig. Dist.) | 0.9866 | 0.9824 | 0.9806 | 0.9999 | 0.9899 | 1.0000 | 0.9991 | 0.9931 | 1.0000 | 1.0000 |
| DialectId (Pos.) | 0.8796 | 0.9695 | 0.9825 | 1.0000 | 0.8818 | 1.0000 | 0.9981 | 0.9890 | 1.0000 | 0.9954 |
| DialectId (One) | 0.9719 | 0.9668 | 0.9755 | 0.9998 | 0.9743 | 1.0000 | 0.9999 | 0.9998 | 1.0000 | 0.9973 |
| Country | United States | Brazil | United Kingdom | Italy | France | Canada | Germany | Portugal |
|---|---|---|---|---|---|---|---|---|
| Argentina | 0.000 | 0.925 | 0.000 | 0.075 | 0.000 | 0.000 | 0.000 | 0.000 |
| Bolivia | 0.005 | 0.638 | 0.023 | 0.239 | 0.000 | 0.095 | 0.000 | 0.000 |
| Chile | 0.000 | 0.279 | 0.073 | 0.021 | 0.010 | 0.340 | 0.255 | 0.021 |
| Colombia | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 |
| Costa Rica | 0.961 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.039 | 0.000 |
| Cuba | 0.000 | 0.445 | 0.034 | 0.241 | 0.000 | 0.106 | 0.117 | 0.057 |
| Dominican Republic | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Ecuador | 0.304 | 0.118 | 0.022 | 0.135 | 0.000 | 0.421 | 0.000 | 0.000 |
| Spain | 0.000 | 0.000 | 0.131 | 0.000 | 0.004 | 0.000 | 0.000 | 0.865 |
| Equatorial Guinea | 0.790 | 0.143 | 0.033 | 0.000 | 0.033 | 0.000 | 0.000 | 0.000 |
| Guatemala | 0.658 | 0.126 | 0.000 | 0.000 | 0.000 | 0.170 | 0.045 | 0.000 |
| Honduras | 0.997 | 0.003 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Mexico | 0.606 | 0.000 | 0.000 | 0.000 | 0.000 | 0.394 | 0.000 | 0.000 |
| Nicaragua | 0.865 | 0.000 | 0.000 | 0.000 | 0.000 | 0.135 | 0.000 | 0.000 |
| Panama | 0.904 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.096 | 0.000 |
| Peru | 0.049 | 0.430 | 0.000 | 0.174 | 0.000 | 0.301 | 0.046 | 0.000 |
| Puerto Rico | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Paraguay | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| El Salvador | 0.464 | 0.263 | 0.000 | 0.013 | 0.000 | 0.260 | 0.000 | 0.000 |
| Uruguay | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Venezuela | 0.375 | 0.000 | 0.000 | 0.154 | 0.000 | 0.361 | 0.000 | 0.111 |
| Country | Malaysia | Indonesia | Brazil | Germany | Spain | France | Italy | United Arab Emirates |
|---|---|---|---|---|---|---|---|---|
| Antigua and Barbuda | 0.000 | 0.000 | 0.726 | 0.036 | 0.049 | 0.043 | 0.126 | 0.020 |
| Anguilla | 0.894 | 0.106 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Australia | 0.000 | 0.000 | 0.000 | 0.077 | 0.674 | 0.119 | 0.130 | 0.000 |
| Barbados | 0.045 | 0.033 | 0.726 | 0.028 | 0.006 | 0.102 | 0.061 | 0.000 |
| Bermuda | 0.080 | 0.101 | 0.150 | 0.073 | 0.200 | 0.248 | 0.145 | 0.002 |
| Bahamas | 0.000 | 0.000 | 0.751 | 0.000 | 0.000 | 0.005 | 0.244 | 0.000 |
| Belize | 0.560 | 0.122 | 0.040 | 0.036 | 0.083 | 0.015 | 0.072 | 0.072 |
| Canada | 0.000 | 0.000 | 0.045 | 0.225 | 0.000 | 0.728 | 0.002 | 0.000 |
| Cook Islands | 0.894 | 0.106 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Cameroon | 0.000 | 0.045 | 0.464 | 0.000 | 0.000 | 0.491 | 0.000 | 0.000 |
| Dominica | 0.828 | 0.100 | 0.036 | 0.000 | 0.000 | 0.036 | 0.000 | 0.000 |
| Fiji | 0.654 | 0.170 | 0.000 | 0.100 | 0.000 | 0.000 | 0.000 | 0.076 |
| Falkland Islands | 0.894 | 0.106 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Micronesia, Fed. Sts. | 0.894 | 0.106 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| United Kingdom | 0.000 | 0.000 | 0.000 | 0.000 | 0.999 | 0.001 | 0.000 | 0.000 |
| Grenada | 0.832 | 0.096 | 0.036 | 0.000 | 0.036 | 0.000 | 0.000 | 0.000 |
| Guernsey | 0.860 | 0.104 | 0.000 | 0.000 | 0.000 | 0.036 | 0.000 | 0.000 |
| Ghana | 0.134 | 0.059 | 0.254 | 0.229 | 0.009 | 0.075 | 0.010 | 0.230 |
| Gibraltar | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 | 0.000 | 0.000 |
| Gambia | 0.012 | 0.029 | 0.000 | 0.284 | 0.024 | 0.123 | 0.516 | 0.012 |
| Guam | 0.059 | 0.064 | 0.091 | 0.044 | 0.172 | 0.298 | 0.226 | 0.045 |
| Guyana | 0.735 | 0.092 | 0.074 | 0.031 | 0.000 | 0.000 | 0.018 | 0.050 |
| Ireland | 0.000 | 0.000 | 0.000 | 0.131 | 0.597 | 0.067 | 0.205 | 0.000 |
| Isle of Man | 0.585 | 0.093 | 0.000 | 0.028 | 0.047 | 0.000 | 0.247 | 0.000 |
| India | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| Jamaica | 0.001 | 0.000 | 0.590 | 0.008 | 0.110 | 0.280 | 0.000 | 0.011 |
| Kenya | 0.019 | 0.224 | 0.000 | 0.057 | 0.000 | 0.000 | 0.067 | 0.633 |
| St. Kitts and Nevis | 0.883 | 0.106 | 0.000 | 0.000 | 0.000 | 0.011 | 0.000 | 0.000 |
| Cayman Islands | 0.495 | 0.093 | 0.108 | 0.115 | 0.120 | 0.027 | 0.036 | 0.006 |
| St. Lucia | 0.247 | 0.031 | 0.300 | 0.088 | 0.064 | 0.080 | 0.133 | 0.057 |
| Liberia | 0.358 | 0.160 | 0.479 | 0.000 | 0.000 | 0.000 | 0.003 | 0.000 |
| Lesotho | 0.549 | 0.061 | 0.036 | 0.123 | 0.036 | 0.072 | 0.123 | 0.000 |
| Northern Mariana Islands | 0.901 | 0.099 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Malta | 0.000 | 0.000 | 0.046 | 0.068 | 0.047 | 0.037 | 0.801 | 0.000 |
| Mauritius | 0.070 | 0.057 | 0.000 | 0.060 | 0.076 | 0.433 | 0.093 | 0.213 |
| Malawi | 0.000 | 0.041 | 0.548 | 0.044 | 0.027 | 0.026 | 0.239 | 0.076 |
| Namibia | 0.096 | 0.009 | 0.551 | 0.008 | 0.000 | 0.015 | 0.296 | 0.026 |
| Nigeria | 0.048 | 0.238 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.715 |
| New Zealand | 0.083 | 0.009 | 0.046 | 0.568 | 0.028 | 0.180 | 0.087 | 0.000 |
| Papua New Guinea | 0.761 | 0.167 | 0.000 | 0.000 | 0.000 | 0.000 | 0.036 | 0.036 |
| Philippines | 0.128 | 0.506 | 0.107 | 0.258 | 0.000 | 0.000 | 0.000 | 0.000 |
| Pakistan | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| Puerto Rico | 0.771 | 0.093 | 0.009 | 0.000 | 0.000 | 0.049 | 0.077 | 0.000 |
| Palau | 0.894 | 0.106 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Rwanda | 0.000 | 0.013 | 0.000 | 0.037 | 0.055 | 0.332 | 0.192 | 0.370 |
| Solomon Islands | 0.894 | 0.106 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Sudan | 0.125 | 0.023 | 0.350 | 0.025 | 0.082 | 0.036 | 0.099 | 0.260 |
| Singapore | 0.905 | 0.095 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| St. Helena | 0.894 | 0.106 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Sierra Leone | 0.060 | 0.319 | 0.095 | 0.067 | 0.002 | 0.055 | 0.231 | 0.171 |
| Sint Maarten | 0.797 | 0.131 | 0.000 | 0.000 | 0.000 | 0.072 | 0.000 | 0.000 |
| Eswatini | 0.358 | 0.177 | 0.185 | 0.001 | 0.125 | 0.110 | 0.044 | 0.000 |
| Turks and Caicos Islands | 0.410 | 0.017 | 0.291 | 0.058 | 0.062 | 0.073 | 0.060 | 0.029 |
| Tonga | 0.862 | 0.138 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Trinidad and Tobago | 0.365 | 0.000 | 0.630 | 0.000 | 0.000 | 0.001 | 0.004 | 0.000 |
| Uganda | 0.000 | 0.000 | 0.314 | 0.066 | 0.000 | 0.000 | 0.210 | 0.410 |
| United States | 0.000 | 0.000 | 0.839 | 0.049 | 0.000 | 0.000 | 0.113 | 0.000 |
| St. Vincent and the Grenadines | 0.756 | 0.080 | 0.036 | 0.000 | 0.020 | 0.036 | 0.072 | 0.000 |
| British Virgin Islands | 0.784 | 0.094 | 0.072 | 0.000 | 0.014 | 0.000 | 0.036 | 0.000 |
| United States Virgin Islands | 0.860 | 0.097 | 0.000 | 0.000 | 0.000 | 0.043 | 0.000 | 0.000 |
| Vanuatu | 0.894 | 0.106 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| South Africa | 0.008 | 0.000 | 0.893 | 0.023 | 0.034 | 0.021 | 0.021 | 0.000 |
| Zambia | 0.104 | 0.037 | 0.340 | 0.063 | 0.000 | 0.049 | 0.333 | 0.073 |
| Zimbabwe | 0.000 | 0.000 | 0.541 | 0.089 | 0.064 | 0.029 | 0.218 | 0.059 |
| Country | United States | United Kingdom | Türkiye | Germany | France | Canada | Australia | Italy |
|---|---|---|---|---|---|---|---|---|
| United Arab Emirates | 0.046 | 0.422 | 0.270 | 0.061 | 0.003 | 0.055 | 0.124 | 0.019 |
| Bahrain | 0.004 | 0.670 | 0.267 | 0.000 | 0.038 | 0.000 | 0.022 | 0.000 |
| Djibouti | 0.810 | 0.150 | 0.000 | 0.000 | 0.000 | 0.041 | 0.000 | 0.000 |
| Algeria | 0.048 | 0.000 | 0.039 | 0.000 | 0.891 | 0.000 | 0.022 | 0.000 |
| Egypt | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| Iraq | 0.000 | 0.000 | 0.287 | 0.379 | 0.000 | 0.034 | 0.299 | 0.000 |
| Jordan | 0.000 | 0.000 | 0.496 | 0.441 | 0.000 | 0.040 | 0.000 | 0.022 |
| Kuwait | 0.020 | 0.798 | 0.159 | 0.000 | 0.000 | 0.000 | 0.022 | 0.000 |
| Lebanon | 0.000 | 0.000 | 0.000 | 0.016 | 0.089 | 0.426 | 0.470 | 0.000 |
| Libya | 0.000 | 0.000 | 0.448 | 0.229 | 0.006 | 0.128 | 0.000 | 0.189 |
| Morocco | 0.000 | 0.000 | 0.000 | 0.070 | 0.887 | 0.043 | 0.000 | 0.000 |
| Mauritania | 0.035 | 0.031 | 0.177 | 0.245 | 0.214 | 0.107 | 0.025 | 0.167 |
| Oman | 0.000 | 0.281 | 0.209 | 0.000 | 0.000 | 0.000 | 0.509 | 0.000 |
| Qatar | 0.093 | 0.625 | 0.252 | 0.000 | 0.030 | 0.000 | 0.000 | 0.000 |
| Saudi Arabia | 0.237 | 0.241 | 0.002 | 0.000 | 0.000 | 0.088 | 0.432 | 0.000 |
| Sudan | 0.000 | 0.026 | 0.000 | 0.070 | 0.576 | 0.169 | 0.000 | 0.160 |
| Somalia | 0.150 | 0.202 | 0.251 | 0.036 | 0.074 | 0.246 | 0.041 | 0.000 |
| Syria | 0.000 | 0.000 | 0.479 | 0.463 | 0.026 | 0.000 | 0.032 | 0.000 |
| Chad | 0.843 | 0.157 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Tunisia | 0.047 | 0.041 | 0.064 | 0.066 | 0.517 | 0.133 | 0.016 | 0.115 |
| Yemen | 0.713 | 0.069 | 0.085 | 0.052 | 0.000 | 0.040 | 0.040 | 0.000 |
| Country | United States | Morocco | Spain | Guadeloupe | United Kingdom | Italy | Algeria | Tanzania |
|---|---|---|---|---|---|---|---|---|
| Belgium | 0.000 | 0.283 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.717 |
| Burkina Faso | 0.307 | 0.076 | 0.276 | 0.000 | 0.202 | 0.139 | 0.000 | 0.000 |
| Benin | 0.000 | 0.135 | 0.000 | 0.583 | 0.000 | 0.000 | 0.283 | 0.000 |
| Canada | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| DR Congo | 0.056 | 0.000 | 0.000 | 0.000 | 0.426 | 0.000 | 0.000 | 0.518 |
| Central African Republic | 0.260 | 0.643 | 0.070 | 0.000 | 0.027 | 0.000 | 0.000 | 0.000 |
| Congo Republic | 0.000 | 0.052 | 0.000 | 0.410 | 0.011 | 0.002 | 0.428 | 0.096 |
| Switzerland | 0.135 | 0.639 | 0.000 | 0.000 | 0.186 | 0.000 | 0.040 | 0.000 |
| Côte d’Ivoire | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| Cameroon | 0.013 | 0.031 | 0.000 | 0.462 | 0.090 | 0.000 | 0.298 | 0.105 |
| Djibouti | 0.000 | 0.267 | 0.000 | 0.000 | 0.000 | 0.000 | 0.545 | 0.188 |
| France | 0.000 | 0.000 | 0.152 | 0.529 | 0.000 | 0.318 | 0.000 | 0.000 |
| Gabon | 0.000 | 0.170 | 0.000 | 0.166 | 0.000 | 0.092 | 0.455 | 0.117 |
| Guinea | 0.000 | 0.204 | 0.000 | 0.000 | 0.000 | 0.000 | 0.796 | 0.000 |
| Haiti | 0.955 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.045 |
| Comoros | 0.000 | 0.404 | 0.000 | 0.000 | 0.000 | 0.000 | 0.596 | 0.000 |
| Luxembourg | 0.000 | 0.067 | 0.231 | 0.054 | 0.339 | 0.300 | 0.009 | 0.000 |
| Monaco | 0.000 | 0.000 | 0.000 | 0.303 | 0.000 | 0.682 | 0.000 | 0.016 |
| Mali | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 | 0.000 |
| New Caledonia | 0.108 | 0.000 | 0.173 | 0.045 | 0.204 | 0.339 | 0.132 | 0.000 |
| Niger | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.800 | 0.200 |
| French Polynesia | 0.074 | 0.168 | 0.020 | 0.336 | 0.000 | 0.235 | 0.166 | 0.000 |
| Rwanda | 0.096 | 0.000 | 0.000 | 0.000 | 0.753 | 0.000 | 0.150 | 0.000 |
| Senegal | 0.000 | 0.783 | 0.022 | 0.000 | 0.000 | 0.000 | 0.195 | 0.000 |
| Chad | 0.000 | 0.148 | 0.000 | 0.164 | 0.000 | 0.000 | 0.688 | 0.000 |
| Togo | 0.000 | 0.070 | 0.000 | 0.406 | 0.000 | 0.000 | 0.352 | 0.173 |