Regionalized resources for the Spanish language
Lexical and semantic comparison between regions
Affinity matrices of the regional Spanish vocabularies, computed with lexical (left) and semantic (right) models.
Visualization of these affinity matrices
2D UMAP projections (colors are computed from a 3D projection). These visualizations are based on $k$-nearest-neighbor graphs ($k=3$), which capture local features of the affinity matrix.
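The following is a minimal sketch of how such projections can be produced with the `umap-learn` package, assuming the affinity matrix is available as a NumPy array; the file name and the similarity-to-distance conversion are assumptions, not part of the released resources.

```python
# Sketch only: project a regional affinity matrix with UMAP using a
# 3-nearest-neighbor graph, as in the figures above. "affinity.npy" is a
# hypothetical file name; higher values are assumed to mean more similar.
import numpy as np
import umap

affinity = np.load("affinity.npy")            # square region-by-region matrix
distance = 1.0 - affinity                     # turn similarities into distances
np.fill_diagonal(distance, 0.0)

# 2D layout used to place the regions in the plot
xy = umap.UMAP(n_neighbors=3, n_components=2,
               metric="precomputed", random_state=0).fit_transform(distance)

# 3D layout rescaled to [0, 1] and used as RGB colors, so that similar
# regions receive similar colors
xyz = umap.UMAP(n_neighbors=3, n_components=3,
                metric="precomputed", random_state=0).fit_transform(distance)
rgb = (xyz - xyz.min(axis=0)) / (np.ptp(xyz, axis=0) + 1e-9)
```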
Geographic view
Vocabulary and geographical similarity visualization, i.e., similar colors imply similar vocabularies (colors computed from 3D UMAP projections). This notebook shows more maps for different UMAP configurations.
Regional word embeddings (semantic) and vocabularies (lexical)
Regional word embeddings for variations of the Spanish language. For the following 2D UMAP projections, we selected a subset of more than 100K tokens that appear in at least five regions (see this notebook for more details).
Colors also come from 3D UMAP projections; therefore, color clusters are meaningful as well. Note that UMAP projections are fairly stable regarding local shapes, but they can vary from a global perspective. The different shapes indicate that different regions assign different meanings to common tokens; therefore, tasks that depend strongly on regional meanings may benefit from regional resources.
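As a rough illustration of the token-selection step, the sketch below loads a few regional `.vec` files with gensim (the file names and the region subset are assumptions), keeps tokens that appear in at least five regional vocabularies, and stacks their vectors as input for UMAP.

```python
# Sketch only: select tokens shared by at least five regions and stack their
# regional vectors. File names such as "MX.vec" are assumptions.
from collections import Counter
import numpy as np
from gensim.models import KeyedVectors

regions = ["AR", "CL", "CO", "ES", "MX", "US"]     # illustrative subset
kvs = {r: KeyedVectors.load_word2vec_format(f"{r}.vec") for r in regions}

# count in how many regional vocabularies each token appears
counts = Counter(tok for kv in kvs.values() for tok in kv.key_to_index)
shared = [tok for tok, c in counts.items() if c >= 5]

rows, labels = [], []
for region, kv in kvs.items():
    for tok in shared:
        if tok in kv.key_to_index:
            rows.append(kv[tok])
            labels.append((region, tok))           # which region's sense this row is

X = np.vstack(rows)                                # rows feed the UMAP projection
```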
Look at this notebook to see some vertices of the graph used to generate the visualizations. Each word is a vertex of the graph connected to other words. Note that meanings vary from region to region. For instance, see the neighborhood of iglesia (church), which US Spanish speakers associate with Evangelical churches while other regions associate it with Catholic ones. Another example is the america token, which refers to geographic terms in almost every region but to a football (soccer) team in the MX region.
You can find projections for all Spanish-speaking countries in this notebook.
Resources
We extracted vocabularies and created BERT-based language models and fastText word-embedding models for 26 countries that have Spanish as one of their official or de facto languages. We segmented our collection per country so that the models could learn regionalisms.
We created nine language models, called BILMA: one for each of the eight countries with enough data in our corpus to learn them, plus a ninth, larger model trained on data from all 26 countries. Our BILMA models use the Keras framework; please visit the BILMA repository for more details and usage tutorials. For fastText word embeddings, we learned four different dimensionalities per region using default hyper-parameters. We provide two kinds of models, `bin` and `vec`; the former is a binary model, and the latter is an ASCII (plain-text) version of the same model. The ASCII version can be parsed and used without fastText, and fastText itself can use it to create supervised (classification) models with pretrained word embeddings.
Our ALL model is learned from the entire corpus.
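As an example of both uses, here is a minimal sketch assuming the official fastText Python bindings and hypothetical file names such as `MX.bin`, `MX.vec`, and a labeled `train.txt`.

```python
# Sketch only: the .bin model is loaded with fastText; the .vec file is plain
# text and can be parsed without fastText, or passed back to fastText as
# pretrained vectors for a supervised classifier. File names are assumptions.
import fasttext

model = fasttext.load_model("MX.bin")      # binary model; handles OOV words via subwords
vec = model.get_word_vector("palabra")

# Parse the ASCII .vec file by hand: the first line is "<n_words> <dim>",
# each following line is "<token> <v1> ... <v_dim>".
with open("MX.vec", encoding="utf-8") as f:
    n_words, dim = map(int, f.readline().split())
    vectors = {}
    for line in f:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]

# Use the .vec file to seed a supervised (classification) model; the dimension
# must match the pretrained vectors.
clf = fasttext.train_supervised(input="train.txt",
                                pretrainedVectors="MX.vec", dim=dim)
```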
Our BILMA models are also available on Hugging Face at https://huggingface.co/guillermoruiz/bilma. You can see how to work with BILMA in Colab to complete phrases and how to use it as an embedding generator.
Note 1: All BILMA models use a common vocabulary file.
Note 2: You may need to use "Save link as" instead of just clicking to download the models.
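A minimal sketch for fetching the published files with `huggingface_hub` is shown below; the file names are listed at runtime rather than hard-coded, since we do not assume them here, and loading the Keras-based weights is done with the code from the BILMA repository.

```python
# Sketch only: list and download BILMA files from the Hugging Face Hub.
# Loading the weights is done with the Keras-based code from the BILMA
# repository, not shown here.
from huggingface_hub import list_repo_files, hf_hub_download

repo = "guillermoruiz/bilma"
files = list_repo_files(repo)      # inspect the available model/vocabulary files
print(files)

# Download one of the listed files to the local cache
path = hf_hub_download(repo_id=repo, filename=files[0])
print("downloaded to", path)
```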
Corpora
We collected Spanish tweets from 2016 to 2019 using the Twitter API (public stream) to build the resources described in our manuscript. The final corpus is described below:
Country | Code | Number of users | Number of tweets | Number of tokens |
---|---|---|---|---|
Argentina | AR | 1,376K | 234.22M | 2,887.92M |
Bolivia | BO | 36K | 1.15M | 20.99M |
Chile | CL | 415K | 45.29M | 719.24M |
Colombia | CO | 701K | 61.54M | 918.51M |
Costa Rica | CR | 79K | 7.51M | 101.67M |
Cuba | CU | 32K | 0.37M | 6.30M |
Dominican Republic | DO | 112K | 7.65M | 122.06M |
Ecuador | EC | 207K | 13.76M | 226.03M |
El Salvador | SV | 49K | 2.71M | 44.46M |
Equatorial Guinea | GQ | 1K | 8.93K | 0.14M |
Guatemala | GT | 74K | 5.22M | 75.79M |
Honduras | HN | 35K | 2.14M | 31.26M |
Mexico | MX | 1,517K | 115.53M | 1,635.69M |
Nicaragua | NI | 35K | 3.34M | 42.47M |
Panama | PA | 83K | 6.62M | 108.74M |
Paraguay | PY | 106K | 10.28M | 141.75M |
Peru | PE | 271K | 15.38M | 241.60M |
Puerto Rico | PR | 18K | 0.58M | 7.64M |
Spain | ES | 1,278K | 121.42M | 1,908.07M |
Uruguay | UY | 157K | 30.83M | 351.81M |
Venezuela | VE | 421K | 35.48M | 556.12M |
Brazil | BR | 1,604K | 27.20M | 142.22M |
Canada | CA | 149K | 1.55M | 21.58M |
France | FR | 292K | 2.43M | 27.73M |
Great Britain | GB | 380K | 2.68M | 34.62M |
United States of America | US | 2,652K | 40.83M | 501.86M |
Total | | 12M | 795.74M | 10,876.25M |
Preprocessing
We preprocessed messages as follows (a rough sketch in code is shown after the list):
- lowercasing
- diacritic marks removed
- hashtags, user mentions, and numbers grouped under single tokens
- symbol repetitions normalized (at most two repeats)
- laughs normalized to four letters
- words, punctuation, and emojis used as tokens
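The sketch below illustrates these steps in Python; the regular expressions and placeholder tags are assumptions and not the exact rules used to build the corpora.

```python
# Sketch only: an approximation of the preprocessing steps listed above.
import re
import unicodedata

def preprocess(text: str) -> str:
    text = text.lower()
    # remove diacritic marks (note: this simple rule also maps "ñ" to "n")
    text = unicodedata.normalize("NFD", text)
    text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    # group hashtags, user mentions, and numbers under single tokens
    text = re.sub(r"#\w+", "_htag", text)
    text = re.sub(r"@\w+", "_usr", text)
    text = re.sub(r"\d+", "_num", text)
    # normalize symbol repetitions to at most two repeats
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # normalize laughs to four letters
    text = re.sub(r"\b(?:[jh][aeiou]){2,}\b", "jaja", text)
    return text

print(preprocess("JAJAJAJA me encantooooo @amigo #FelizLunes 2024!!!!"))
# -> "jaja me encantoo _usr _htag _num !!"
```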
We only consider Twitter as a data source; messages containing URLs were discarded, and retweets, Foursquare check-ins, and other automatic messages were removed. Short messages were also discarded. Please check our manuscript for more details (https://arxiv.org/abs/2110.06128) or contact us.
More resources
Citing
If you find these resources useful, please give us a star on the GitHub repository. Also, if you use them for research purposes, please consider citing us:
Regionalized models for Spanish language variations based on Twitter. Eric S. Tellez, Daniela Moctezuma, Sabino Miranda, Mario Graff, and Guillermo Ruiz. https://arxiv.org/abs/2110.06128.
@misc{tellez2022regionalized,
title={Regionalized models for Spanish language variations based on Twitter},
author={Eric S. Tellez and Daniela Moctezuma and Sabino Miranda and Mario Graff and Guillermo Ruiz},
year={2022},
eprint={2110.06128},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Contact us
If you want to know more about these resources, or you need something not shared in this repo/site, please contact us.
- Eric S. Tellez - eric.tellez@infotec.mx
- Daniela Moctezuma
- Sabino Miranda
- Mario Graff
- Guillermo Ruiz