dialectid aims to develop a set of algorithms to detect the dialect of a given text. For example, given a text written in Spanish, dialectid predicts the Spanish-speaking country where the text comes from.
dialectid is available for Arabic (ar), German (de), English (en), Spanish (es), French (fr), Dutch (nl), Portuguese (pt), Russian (ru), Turkish (tr), and Chinese (zh).
dialectid can be installed with the conda package manager using the following instruction.

```bash
conda install --channel conda-forge dialectid
```
A more general approach to installing dialectid is through pip, as illustrated in the following instruction.

```bash
pip install dialectid
```
```python
from dialectid import DialectId

detect = DialectId(lang='es')
detect.countries
# array(['ar', 'bo', 'cl', 'co', 'cr', 'cu', 'do', 'ec', 'es', 'gq', 'gt',
#        'hn', 'mx', 'ni', 'pa', 'pe', 'pr', 'py', 'sv', 'uy', 've'],
#       dtype='<U2')
```
```python
from dialectid import DialectId

detect = DialectId(lang='es')
detect.predict(['comiendo unos tacos',
                'acompañando el asado con un buen vino'])
# array(['mx', 'uy'], dtype='<U2')
```
```python
from dialectid import DialectId

detect = DialectId(lang='es')
df = detect.decision_function(['acompañando el asado con un buen vino'])[0]
index = df.argsort()[::-1]
[(detect.countries[i], df[i]) for i in index if df[i] > 0]
# [(np.str_('uy'), np.float64(1.5416804610086077)),
#  (np.str_('py'), np.float64(1.3321806689071498)),
#  (np.str_('ar'), np.float64(1.2182585368838774))]
```
```python
from dialectid import DialectId

detect = DialectId(lang='es', probability=True)
prob = detect.predict_proba(['acompañando el asado con un buen vino'])[0]
index = prob.argsort()[::-1]
[(detect.countries[i], prob[i]) for i in index[:4]]
# [(np.str_('uy'), np.float64(0.4595517299090438)),
#  (np.str_('ar'), np.float64(0.35344213246401396)),
#  (np.str_('py'), np.float64(0.186954384531818)),
#  (np.str_('cl'), np.float64(2.812471659260275e-05))]
```
Table 1: Arabic (ar)

Country | train | test |
---|---|---|
ae | 119600 | 4096 |
bh | 119666 | 4096 |
dj | 2873 | 309 |
dz | 119143 | 4096 |
eg | 119122 | 4096 |
iq | 119655 | 4096 |
jo | 119718 | 4096 |
kw | 119117 | 4096 |
lb | 119370 | 4096 |
ly | 119659 | 4096 |
ma | 119556 | 4096 |
mr | 41017 | 1809 |
om | 119771 | 4096 |
qa | 119362 | 4096 |
sa | 119009 | 4096 |
sd | 120078 | 4096 |
so | 17410 | 561 |
sy | 119159 | 4093 |
td | 4797 | 706 |
tn | 119244 | 4096 |
ye | 119823 | 4096 |
Table 2: German (de)

Country | train | test |
---|---|---|
at | 7004 | 4096 |
ch | 4573 | 4096 |
de | 83023 | 4096 |
Table 3: English (en)

Country | train | test |
---|---|---|
ag | 37015 | 4096 |
ai | 23965 | 1250 |
au | 36947 | 4096 |
bb | 36702 | 4096 |
bm | 36967 | 4096 |
bs | 37280 | 4096 |
bz | 36794 | 4096 |
ca | 36979 | 4096 |
ck | 8053 | 274 |
cm | 37134 | 4096 |
dm | 36623 | 1140 |
fj | 36909 | 1934 |
fk | 11917 | 412 |
fm | 7497 | 266 |
gb | 37005 | 4096 |
gd | 37060 | 2761 |
gg | 24068 | 1790 |
gh | 37213 | 4096 |
gi | 37003 | 4096 |
gm | 36999 | 4096 |
gu | 37116 | 3008 |
gy | 36892 | 4096 |
ie | 37158 | 4096 |
im | 37255 | 1495 |
in | 37048 | 4096 |
jm | 37276 | 4096 |
ke | 37294 | 4096 |
kn | 37062 | 3652 |
ky | 37184 | 4096 |
lc | 36919 | 4096 |
lr | 37093 | 4096 |
ls | 37153 | 4096 |
mp | 37032 | 617 |
mt | 37158 | 4096 |
mu | 37012 | 4096 |
mw | 37248 | 4096 |
na | 37137 | 4096 |
ng | 37127 | 4096 |
nz | 37442 | 4096 |
pg | 37333 | 3904 |
ph | 37281 | 4096 |
pk | 37239 | 4096 |
pw | 6556 | 691 |
rw | 36569 | 4096 |
sb | 8141 | 458 |
sd | 37036 | 4096 |
sg | 37215 | 4096 |
sh | 2876 | 974 |
sl | 37008 | 4096 |
sx | 36672 | 1745 |
sz | 36842 | 4096 |
tc | 36996 | 3064 |
to | 25728 | 901 |
tt | 37304 | 4096 |
ug | 37162 | 4096 |
us | 37410 | 4096 |
vc | 36742 | 4096 |
vg | 37033 | 1650 |
vi | 37113 | 219 |
vu | 13985 | 767 |
za | 36839 | 4096 |
zm | 37079 | 4096 |
zw | 37233 | 4096 |
Table 4: Spanish (es)

Country | train | test |
---|---|---|
ar | 108943 | 4096 |
bo | 108317 | 4096 |
cl | 108974 | 4096 |
co | 109063 | 4096 |
cr | 109556 | 4096 |
cu | 109054 | 4096 |
do | 109364 | 4096 |
ec | 108953 | 4096 |
es | 108583 | 4096 |
gq | 13548 | 4096 |
gt | 109749 | 4096 |
hn | 108846 | 4096 |
mx | 109120 | 4096 |
ni | 109377 | 4096 |
pa | 108577 | 4096 |
pe | 108960 | 4096 |
pr | 12407 | 1487 |
py | 108992 | 4096 |
sv | 108769 | 4096 |
uy | 108672 | 4096 |
ve | 109327 | 4096 |
Table 5: French (fr)

Country | train | test |
---|---|---|
be | 214731 | 4096 |
bf | 28514 | 4096 |
bj | 63347 | 4096 |
ca | 215745 | 4096 |
cd | 214853 | 4096 |
cf | 13445 | 2122 |
cg | 26441 | 4096 |
ch | 116815 | 4096 |
ci | 215555 | 4096 |
cm | 215520 | 4096 |
dj | 6331 | 2237 |
fr | 215940 | 4096 |
ga | 27051 | 4096 |
gn | 90619 | 4096 |
ht | 25939 | 4096 |
km | 6025 | 1736 |
lu | 11358 | 4096 |
mc | 13037 | 4096 |
ml | 54495 | 4096 |
nc | 7150 | 1715 |
ne | 27468 | 4096 |
pf | 6408 | 2304 |
rw | 4695 | 2749 |
sn | 216403 | 4096 |
td | 15160 | 4096 |
tg | 44077 | 4096 |
Table 6: Dutch (nl)

Country | train | test |
---|---|---|
be | 16880 | 4096 |
nl | 102887 | 4096 |
Table 7: Portuguese (pt)

Country | train | test |
---|---|---|
ao | 53243 | 4096 |
br | 978254 | 4096 |
cv | 8022 | 2434 |
mz | 78571 | 4096 |
pt | 979061 | 4096 |
Table 8: Russian (ru)

Country | train | test |
---|---|---|
by | 652818 | 4096 |
kg | 145942 | 4096 |
kz | 336189 | 4096 |
ru | 962191 | 4096 |
Table 9: Turkish (tr)

Country | train | test |
---|---|---|
cy | 3085 | 924 |
tr | 550662 | 4096 |
Table 10: Chinese (zh)

Country | train | test |
---|---|---|
cn | 206312 | 4096 |
hk | 13944 | 4096 |
sg | 4452 | 3963 |
tw | 115165 | 4096 |
The dataset used to create the self-supervised problems is a collection of tweets obtained from the open stream over several years: the Spanish collection started on December 11, 2015; English on July 1, 2016; Arabic on January 25, 2017; Russian on October 16, 2018; and the rest of the languages on June 1, 2021. In all cases, the last day collected was June 9, 2023. The collected tweets were filtered with the following restrictions: retweets were removed; URLs and usernames were replaced by the tokens _url and _usr, respectively; and only tweets with at least 50 characters were included in the final collection.
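The filtering step can be sketched as follows. This is a minimal illustration, not the actual collection code; the `tweet` dictionary and its `retweeted` field are hypothetical, and the regular expressions are one plausible way to match URLs and usernames.

```python
import re


def preprocess(tweet):
    """Apply the filters described above: drop retweets, mask URLs and
    usernames with _url and _usr, and keep only texts with at least
    50 characters.  Returns the cleaned text, or None if discarded."""
    text = tweet["text"]
    # Retweets are removed entirely (hypothetical metadata field).
    if tweet.get("retweeted") or text.startswith("RT @"):
        return None
    # Replace URLs and usernames with the tokens _url and _usr.
    text = re.sub(r"https?://\S+", "_url", text)
    text = re.sub(r"@\w+", "_usr", text)
    # Keep only tweets with at least 50 characters.
    return text if len(text) >= 50 else None
```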
The corpora are divided into two sets: the first is the training set, i.e., used to estimate the parameters, while the second is the test set, which can be used to measure the model’s performance. The division is based on a specific date: tweets published before October 1, 2022, form the training set, and those published on or after October 3, 2022, are used to create the test set.
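The date-based split can be sketched as below; the two cutoff dates come from the text, and note that tweets from the two days in between belong to neither set.

```python
from datetime import date

# Cutoff dates taken from the text above; the two days in between
# are not assigned to either set.
TRAIN_END = date(2022, 10, 1)
TEST_START = date(2022, 10, 3)


def split_by_date(tweets):
    """Partition (publication day, text) pairs into training and test texts."""
    train = [text for day, text in tweets if day < TRAIN_END]
    test = [text for day, text in tweets if day >= TEST_START]
    return train, test
```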
The procedure has two stages. In the first stage, two datasets were created for each country and language: the first contains \(2^{23}\) (8 million) tweets, and the second \(2^{12}\) (4,096) tweets; the former is used to create the training set, and the latter corresponds to the test set. These two sets were constructed using tweets with geographic information, filtered according to the language information provided by Twitter, and crafted to follow, as closely as possible, a uniform distribution over days. Near duplicates were removed within each day, and then a three-day sliding window was used to remove near duplicates across days. The final step was to shuffle the data to remove the ordering by date.
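The two-level deduplication can be sketched as follows. For simplicity, exact string matching stands in for the near-duplicate detection used in practice; the function name and data layout are illustrative.

```python
def dedup(days):
    """days: one list of tweet texts per consecutive calendar day.
    Removes duplicates within each day and then across a three-day
    sliding window (current day plus the two previous ones)."""
    kept = []
    window = []  # sets of texts kept on the two previous days
    for day in days:
        seen = set()
        today = []
        for text in day:
            # Drop texts already kept today or within the window.
            if text in seen or any(text in prev for prev in window):
                continue
            seen.add(text)
            today.append(text)
        kept.append(today)
        window = (window + [seen])[-2:]  # slide the window forward
    return kept
```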
In the second stage, a training set is created for each language. Each training set contains \(2^{21}\) (2 million) tweets, drawn from the \(2^{23}\)-tweet sets created in the first stage; the sampling procedure aims to produce training sets that follow a uniform distribution by country. We also produced a smaller training set containing \(2^{18}\) (262,144) tweets with an equivalent procedure, again aiming at a uniform distribution over countries.
It is worth mentioning that, for some countries and languages, there was not enough information to follow an exactly uniform distribution. For example, Table 4 (Spanish) shows that for Puerto Rico (pr) there are only 12,407 tweets in the training set and 1,487 tweets in the test set, which correspond to the total number of available tweets that met the imposed restrictions.
The performance of the different algorithms is presented in Figure 1 using macro-recall. The best-performing system in almost all cases is DialectId, which is trained on 2 million tweets and has a vocabulary of 500,000 tokens. The exceptions are Turkish and Dutch, where the best system is StackBoW trained with only 262k tweets.
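Macro-recall is the unweighted mean of the per-country recall, so every country weighs the same regardless of its test-set size. A minimal pure-Python sketch (the function name is illustrative):

```python
from collections import defaultdict


def macro_recall(y_true, y_pred):
    """Mean of the per-class recall: each country contributes equally,
    no matter how many test tweets it has."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        hits[truth] += int(truth == pred)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)
```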
The remaining figures break down the macro-recall by presenting each system’s recall in each country.