regional-spanish-models

Regionalized resources for the Spanish language (NLP)

Lexical and semantic comparison between regions

Affinity matrices of the regional vocabularies of the Spanish language, computed with lexical (left) and semantic (right) models.

Lexical affinity matrix: uses the raw vocabularies as bags of words under the cosine distance; the weights are the empirical probability of each token occurring in each country in our Twitter corpus. You can see more details in this notebook.
Semantic affinity matrix: represents each region as an all-kNN graph computed from its regional word embeddings (we used k=33). The procedure is detailed here and in this notebook.
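As a minimal sketch, the lexical affinity between two regions is the cosine similarity of their token-probability vectors. The vocabularies and weights below are toy values standing in for the real Twitter-derived ones:

```python
import math

# Toy empirical token probabilities per country (hypothetical values;
# the real weights come from each country's Twitter corpus).
vocab = {
    "MX": {"chido": 0.4, "hola": 0.5, "che": 0.1},
    "AR": {"che": 0.5, "hola": 0.4, "boludo": 0.1},
    "ES": {"hola": 0.6, "vale": 0.4},
}

def cosine_similarity(p, q):
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(w * q.get(tok, 0.0) for tok, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q)

# One entry of the affinity matrix; the cosine *distance* is 1 - similarity.
sim = cosine_similarity(vocab["MX"], vocab["AR"])
print(f"MX-AR lexical similarity: {sim:.3f}")
```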

Visualization of these affinity matrices

2D UMAP projections (colors are computed using a 3D projection). These visualizations are based on $k$ nearest neighbor graphs ($k=3$), which capture local features of the affinity matrix.

Lexical: 2D projection from the lexical affinity matrix. Take a look at this notebook for more details and different $knn$ graphs.
Semantic: 2D projection from the semantic affinity matrix. Take a look at this notebook for more details and different $knn$ graphs.
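The $knn$ graph step above can be sketched as follows, over a hypothetical affinity matrix:

```python
import numpy as np

# Hypothetical symmetric affinity matrix for five regions (higher = more similar).
rng = np.random.default_rng(0)
A = rng.random((5, 5))
A = (A + A.T) / 2
np.fill_diagonal(A, 1.0)

def knn_graph(affinity, k=3):
    """Boolean adjacency keeping each row's k most-affine neighbors (self excluded)."""
    n = affinity.shape[0]
    adj = np.zeros((n, n), dtype=bool)
    for i in range(n):
        order = np.argsort(-affinity[i])             # most similar first
        neighbors = [j for j in order if j != i][:k]
        adj[i, neighbors] = True
    return adj

G = knn_graph(A, k=3)
print(G.sum(axis=1))  # every region keeps exactly k = 3 neighbors
```

With umap-learn, one would then feed a distance matrix (e.g., 1 - affinity) to `umap.UMAP(metric="precomputed")` to obtain the 2D layout.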

Geographic view

Lexical 2D projection: countries' vocabularies.
Semantic 2D projection.

Vocabulary and geographical similarity visualization, i.e., similar colors imply similarity (computed from 3D UMAP projections). This notebook shows more maps for different UMAP configurations.
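One common way to turn a 3D projection into colors, presumably similar to what is done here, is to min-max normalize each axis into [0, 1] and read the coordinates as RGB channels. A sketch with hypothetical coordinates:

```python
import numpy as np

# Hypothetical 3D coordinates for six regions (e.g., a 3D UMAP output).
rng = np.random.default_rng(42)
coords3d = rng.normal(size=(6, 3))

def coords_to_rgb(coords):
    """Min-max normalize each 3D axis into [0, 1] and read the result as RGB."""
    lo, hi = coords.min(axis=0), coords.max(axis=0)
    return (coords - lo) / (hi - lo)

colors = coords_to_rgb(coords3d)
print(colors.min(), colors.max())  # all channels lie within [0, 1]
```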

Regional word embeddings (semantic) and vocabularies (lexical)

Regional word embeddings for variations of the Spanish language. For the following 2D UMAP projections, we selected a subset of more than 100K tokens that appear in at least five regions (see this notebook for more details).

All regions combined - Spanish language.
Argentinean variation of the Spanish language.
Mexican variation of the Spanish language.

Colors are also 3D UMAP projections; therefore, color clusters are also meaningful. Please note that UMAP projections are pretty stable regarding local shapes, but they can vary from a global perspective. The different shapes indicate that different regions define common tokens differently; therefore, tasks strongly influenced by regional meanings may benefit from regional resources.

Look at this notebook to see some vertices of the graph used to generate the visualizations. Each word is a vertex of the graph, connected with other words; note that definitions vary from region to region. For instance, see the description of iglesia (church), which US Spanish speakers relate to Evangelical churches while other regions relate to Catholic ones. Another example is the america token, which refers to geographic terms in almost every region but to football (soccer) teams in the MX region.

You can find projections for all Spanish-speaking countries in this notebook.
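These regional differences in meaning can be illustrated with a toy example: the same token can have different nearest neighbors in different regional embedding spaces. The vectors below are made up for illustration; the real models are the fastText embeddings listed under Resources:

```python
import numpy as np

# Toy regional embedding spaces (hypothetical 4-d vectors; the real models
# are per-region fastText embeddings with up to 300 dimensions).
us_space = {
    "iglesia":    np.array([0.9, 0.1, 0.0, 0.0]),
    "evangelica": np.array([0.8, 0.2, 0.1, 0.0]),
    "catolica":   np.array([0.1, 0.9, 0.0, 0.1]),
}
mx_space = {
    "iglesia":    np.array([0.1, 0.9, 0.0, 0.1]),
    "evangelica": np.array([0.8, 0.2, 0.1, 0.0]),
    "catolica":   np.array([0.2, 0.9, 0.1, 0.0]),
}

def nearest(word, space):
    """Return the other token closest (by cosine) to `word` in a regional space."""
    v = space[word]
    def cos(u, w):
        return float(u @ w / (np.linalg.norm(u) * np.linalg.norm(w)))
    return max((t for t in space if t != word), key=lambda t: cos(v, space[t]))

print("US:", nearest("iglesia", us_space))  # evangelica
print("MX:", nearest("iglesia", mx_space))  # catolica
```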

Resources

We extracted vocabularies and created BERT language models and fastText word-embedding models for 26 countries that have Spanish as one of their official or de facto languages. We segmented our collection per country so that models could learn regionalisms. We created nine language models, called BILMA: one for each of the eight countries with enough data in our corpus, plus a ninth, larger model trained on data from all 26 countries. Our BILMA models use the Keras framework; please visit the BILMA repository for more details and usage tutorials. For the fastText word embeddings, we learned four different dimensionalities per region using default hyper-parameters. We provide two kinds of models, bin and vec: the former is a binary model, and the latter is a plain-text (ASCII) version of the same model. The ASCII version can be parsed and used without fastText, and fastText itself can also use it to train supervised (classification) models with pretrained word embeddings.
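The vec format is easy to parse without fastText: the first line holds the vocabulary size and dimensionality, and each following line holds a token and its vector components separated by spaces. A minimal parser sketch over a tiny in-memory example:

```python
import io
import numpy as np

# A tiny example in fastText's .vec text format:
# first line "vocab_size dim", then one token and its vector per line.
vec_text = """3 4
hola 0.1 0.2 0.3 0.4
chido 0.5 0.6 0.7 0.8
che 0.9 1.0 1.1 1.2
"""

def load_vec(fp):
    """Parse a fastText .vec stream into a {token: np.ndarray} dictionary."""
    n, dim = map(int, fp.readline().split())
    vectors = {}
    for _ in range(n):
        parts = fp.readline().rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

emb = load_vec(io.StringIO(vec_text))
print(len(emb), emb["hola"].shape)  # 3 (4,)
```

The bin files require the fastText library itself (e.g., `fasttext.load_model("region.bin")`), and a vec file can be passed as the `pretrainedVectors` argument of `fasttext.train_supervised` to initialize a classifier with the regional embeddings.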

country code sample voc BILMA FT 300d FT 32d FT 16d FT 8d
Argentina AR id voc model bin vec bin vec bin vec bin vec
Bolivia BO id voc - bin vec bin vec bin vec bin vec
Brazil BR id voc - bin vec bin vec bin vec bin vec
Canadá CA id voc - bin vec bin vec bin vec bin vec
Chile CL id voc model bin vec bin vec bin vec bin vec
Colombia CO id voc model bin vec bin vec bin vec bin vec
Costa Rica CR id voc - bin vec bin vec bin vec bin vec
Cuba CU id voc - bin vec bin vec bin vec bin vec
República Dominicana DO id voc - bin vec bin vec bin vec bin vec
Ecuador EC id voc - bin vec bin vec bin vec bin vec
España ES id voc model bin vec bin vec bin vec bin vec
Francia FR id voc - bin vec bin vec bin vec bin vec
Great Britain GB id voc - bin vec bin vec bin vec bin vec
Guinea Equatorial GQ id voc - bin vec bin vec bin vec bin vec
Guatemala GT id voc - bin vec bin vec bin vec bin vec
Honduras HN id voc - bin vec bin vec bin vec bin vec
México MX id voc model bin vec bin vec bin vec bin vec
Nicaragua NI id voc - bin vec bin vec bin vec bin vec
Panamá PA id voc - bin vec bin vec bin vec bin vec
Perú PE id voc - bin vec bin vec bin vec bin vec
Puerto Rico PR id voc - bin vec bin vec bin vec bin vec
Paraguay PY id voc - bin vec bin vec bin vec bin vec
El Salvador SV id voc - bin vec bin vec bin vec bin vec
United States of America US id voc model bin vec bin vec bin vec bin vec
Uruguay UY id voc model bin vec bin vec bin vec bin vec
Venezuela VE id voc model bin vec bin vec bin vec bin vec
ALL ALL - voc model bin vec bin vec bin vec bin vec

Our ALL model is learned from the entire corpus.

Our BILMA models are also available on Hugging Face: https://huggingface.co/guillermoruiz/bilma. You can see how to work with BILMA in Colab to complete phrases and how to use it as an embedding generator.

Note 1: All BILMA models use a common vocabulary file.

Note 2: You may need to use "save link as" instead of just clicking to download the models.

Corpora

We collected Spanish tweets from 2016 to 2019 using the Twitter API (public stream) to create our manuscript and our resources. The final corpus is described below:

country code number of users number of tweets number of tokens
Argentina AR 1,376K 234.22M 2,887.92M
Bolivia BO 36K 1.15M 20.99M
Chile CL 415K 45.29M 719.24M
Colombia CO 701K 61.54M 918.51M
Costa Rica CR 79K 7.51M 101.67M
Cuba CU 32K 0.37M 6.30M
Dominican Republic DO 112K 7.65M 122.06M
Ecuador EC 207K 13.76M 226.03M
El Salvador SV 49K 2.71M 44.46M
Equatorial Guinea GQ 1K 8.93K 0.14M
Guatemala GT 74K 5.22M 75.79M
Honduras HN 35K 2.14M 31.26M
Mexico MX 1,517K 115.53M 1,635.69M
Nicaragua NI 35K 3.34M 42.47M
Panama PA 83K 6.62M 108.74M
Paraguay PY 106K 10.28M 141.75M
Peru PE 271K 15.38M 241.60M
Puerto Rico PR 18K 0.58M 7.64M
Spain ES 1,278K 121.42M 1,908.07M
Uruguay UY 157K 30.83M 351.81M
Venezuela VE 421K 35.48M 556.12M
         
Brazil BR 1,604K 27.20M 142.22M
Canada CA 149K 1.55M 21.58M
France FR 292K 2.43M 27.73M
Great Britain GB 380K 2.68M 34.62M
United States of America US 2,652K 40.83M 501.86M
Total   12M 795.74M 10,876.25M

Preprocessing

We preprocessed messages as follows:

We only considered Twitter as a data source. Messages containing URLs were discarded; retweets, Foursquare check-ins, and other automatic messages were removed; short messages were also discarded. Please check our manuscript (https://arxiv.org/abs/2110.06128) for more details or contact us.
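A sketch of this kind of filtering (the minimum-length threshold and the exact automatic-message patterns here are assumptions for illustration; see the manuscript for the real criteria):

```python
import re

URL_RE = re.compile(r"https?://\S+")

def keep_message(text, min_tokens=4):
    """Return True if a tweet passes the filters described above.

    The minimum-length threshold and check-in pattern are assumptions
    for illustration; see the manuscript for the exact criteria.
    """
    if URL_RE.search(text):          # discard messages with URLs
        return False
    if text.startswith("RT @"):      # discard retweets
        return False
    if "I'm at" in text:             # discard Foursquare-style check-ins
        return False
    return len(text.split()) >= min_tokens  # discard short messages

print(keep_message("que chido este clima en la ciudad"))  # True
print(keep_message("RT @user: hola que tal amigos"))      # False
```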

More resources

Clone our GitHub repository, where you can find more metadata, resources, and code: regional vocabularies, regional embeddings, knn graphs of concepts (tokens), token comparisons among regions, and more.

Citing

If you find these resources useful, please give us a star on the GitHub repository. Also, if you use them for research purposes, please consider citing us:

Regionalized models for Spanish language variations based on Twitter. Eric S. Tellez, Daniela Moctezuma, Sabino Miranda, Mario Graff, and Guillermo Ruiz. https://arxiv.org/abs/2110.06128.

@misc{tellez2022regionalized,
      title={Regionalized models for Spanish language variations based on Twitter}, 
      author={Eric S. Tellez and Daniela Moctezuma and Sabino Miranda and Mario Graff and Guillermo Ruiz},
      year={2022},
      eprint={2110.06128},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contact us

If you want to know more about these resources, or you need something not shared in this repo/site, please contact us.