Regionalized resources for the Spanish language
Lexical and semantic comparison between regions
Affinity matrices of the regional Spanish vocabularies, computed with lexical (left) and semantic (right) models.
Visualization of these affinity matrices
2D UMAP projections (colors are computed from a 3D projection). These visualizations are based on $k$-nearest-neighbor graphs ($k=3$), which capture local features of the affinity matrix.
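The following is a minimal sketch of how such projections can be produced with the `umap-learn` package, assuming the affinity matrix is available as a NumPy array; the file name and the similarity-to-distance conversion are assumptions, not part of the released resources.

```python
# Sketch only: project a regional affinity matrix with UMAP using a
# 3-nearest-neighbor graph, as in the figures above. "affinity.npy" is a
# hypothetical file name; higher values are assumed to mean more similar.
import numpy as np
import umap

affinity = np.load("affinity.npy")            # square region-by-region matrix
distance = 1.0 - affinity                     # turn similarities into distances
np.fill_diagonal(distance, 0.0)

# 2D layout used to place the regions in the plot
xy = umap.UMAP(n_neighbors=3, n_components=2,
               metric="precomputed", random_state=0).fit_transform(distance)

# 3D layout rescaled to [0, 1] and used as RGB colors, so that similar
# regions receive similar colors
xyz = umap.UMAP(n_neighbors=3, n_components=3,
                metric="precomputed", random_state=0).fit_transform(distance)
rgb = (xyz - xyz.min(axis=0)) / (np.ptp(xyz, axis=0) + 1e-9)
```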
Geographic view
Vocabulary and geographical similarity visualization, i.e., similar colors imply similar vocabularies (colors computed from 3D UMAP projections). This notebook shows more maps for different UMAP configurations.
Regional word embeddings (semantic) and vocabularies (lexical)
Regional word embeddings for variations of the Spanish language. For the following 2D UMAP projections, we selected a subset of more than 100K tokens that appear in at least five regions (see this notebook for more details).
Colors also come from 3D UMAP projections; therefore, color clusters are meaningful as well. Note that UMAP projections are fairly stable regarding local shapes, but they can vary from a global perspective. The different shapes indicate that different regions assign different meanings to common tokens; therefore, tasks that depend strongly on regional meanings may benefit from regional resources.
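As a rough illustration of the token-selection step, the sketch below loads a few regional `.vec` files with gensim (the file names and the region subset are assumptions), keeps tokens that appear in at least five regional vocabularies, and stacks their vectors as input for UMAP.

```python
# Sketch only: select tokens shared by at least five regions and stack their
# regional vectors. File names such as "MX.vec" are assumptions.
from collections import Counter
import numpy as np
from gensim.models import KeyedVectors

regions = ["AR", "CL", "CO", "ES", "MX", "US"]     # illustrative subset
kvs = {r: KeyedVectors.load_word2vec_format(f"{r}.vec") for r in regions}

# count in how many regional vocabularies each token appears
counts = Counter(tok for kv in kvs.values() for tok in kv.key_to_index)
shared = [tok for tok, c in counts.items() if c >= 5]

rows, labels = [], []
for region, kv in kvs.items():
    for tok in shared:
        if tok in kv.key_to_index:
            rows.append(kv[tok])
            labels.append((region, tok))           # which region's sense this row is

X = np.vstack(rows)                                # rows feed the UMAP projection
```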
Look at this notebook to see some vertices of the graph used to generate the visualizations. Each word is a vertex of the graph connected to other words. Note that meanings vary from region to region. For instance, see the neighborhood of iglesia (church), which US Spanish speakers associate with Evangelical churches while other regions associate it with Catholic ones. Another example is the america token, which refers to geographic terms in almost every region but to a football (soccer) team in the MX region.
You can find projections for all Spanish-speaking countries in this notebook.
Resources
We extracted vocabularies and created BERT-based language models and fastText word-embedding models for 26 countries that have Spanish as one of their official or de facto languages. We segmented our collection per country so that the models could learn regionalisms.
We created nine language models, called BILMA: one for each of the eight countries with enough data in our corpus to learn them, plus a ninth, larger model trained on data from all 26 countries. Our BILMA models use the Keras framework; please visit the BILMA repository for more details and usage tutorials. For fastText word embeddings, we learned four different dimensionalities per region using default hyper-parameters. We provide two kinds of models, `bin` and `vec`; the former is a binary model, and the latter is an ASCII (plain-text) version of the same model. The ASCII version can be parsed and used without fastText, and fastText itself can use it to create supervised (classification) models with pretrained word embeddings.
Our ALL model is learned from the entire corpus.
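As an example of both uses, here is a minimal sketch assuming the official fastText Python bindings and hypothetical file names such as `MX.bin`, `MX.vec`, and a labeled `train.txt`.

```python
# Sketch only: the .bin model is loaded with fastText; the .vec file is plain
# text and can be parsed without fastText, or passed back to fastText as
# pretrained vectors for a supervised classifier. File names are assumptions.
import fasttext

model = fasttext.load_model("MX.bin")      # binary model; handles OOV words via subwords
vec = model.get_word_vector("palabra")

# Parse the ASCII .vec file by hand: the first line is "<n_words> <dim>",
# each following line is "<token> <v1> ... <v_dim>".
with open("MX.vec", encoding="utf-8") as f:
    n_words, dim = map(int, f.readline().split())
    vectors = {}
    for line in f:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]

# Use the .vec file to seed a supervised (classification) model; the dimension
# must match the pretrained vectors.
clf = fasttext.train_supervised(input="train.txt",
                                pretrainedVectors="MX.vec", dim=dim)
```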
Our BILMA models are also available on Hugging Face at https://huggingface.co/guillermoruiz/bilma. You can see how to work with BILMA in Colab to complete phrases and how to use it as an embedding generator.
Note 1: All BILMA models use a common vocabulary file.
Note 2: You may need to use "Save link as" instead of just clicking to download the models.
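A minimal sketch for fetching the published files with `huggingface_hub` is shown below; the file names are listed at runtime rather than hard-coded, since we do not assume them here, and loading the Keras-based weights is done with the code from the BILMA repository.

```python
# Sketch only: list and download BILMA files from the Hugging Face Hub.
# Loading the weights is done with the Keras-based code from the BILMA
# repository, not shown here.
from huggingface_hub import list_repo_files, hf_hub_download

repo = "guillermoruiz/bilma"
files = list_repo_files(repo)      # inspect the available model/vocabulary files
print(files)

# Download one of the listed files to the local cache
path = hf_hub_download(repo_id=repo, filename=files[0])
print("downloaded to", path)
```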
Corpora
We collected Spanish tweets from 2016 to 2019 using the Twitter API (public stream) to build the resources described in our manuscript. The final corpus is described below:
Country | Code | Number of users | Number of tweets | Number of tokens |
---|---|---|---|---|
Argentina | AR | 1,376K | 234.22M | 2,887.92M |
Bolivia | BO | 36K | 1.15M | 20.99M |
Chile | CL | 415K | 45.29M | 719.24M |
Colombia | CO | 701K | 61.54M | 918.51M |
Costa Rica | CR | 79K | 7.51M | 101.67M |
Cuba | CU | 32K | 0.37M | 6.30M |
Dominican Republic | DO | 112K | 7.65M | 122.06M |
Ecuador | EC | 207K | 13.76M | 226.03M |
El Salvador | SV | 49K | 2.71M | 44.46M |
Equatorial Guinea | GQ | 1K | 8.93K | 0.14M |
Guatemala | GT | 74K | 5.22M | 75.79M |
Honduras | HN | 35K | 2.14M | 31.26M |
Mexico | MX | 1,517K | 115.53M | 1,635.69M |
Nicaragua | NI | 35K | 3.34M | 42.47M |
Panama | PA | 83K | 6.62M | 108.74M |
Paraguay | PY | 106K | 10.28M | 141.75M |
Peru | PE | 271K | 15.38M | 241.60M |
Puerto Rico | PR | 18K | 0.58M | 7.64M |
Spain | ES | 1,278K | 121.42M | 1,908.07M |
Uruguay | UY | 157K | 30.83M | 351.81M |
Venezuela | VE | 421K | 35.48M | 556.12M |
Brazil | BR | 1,604K | 27.20M | 142.22M |
Canada | CA | 149K | 1.55M | 21.58M |
France | FR | 292K | 2.43M | 27.73M |
Great Britain | GB | 380K | 2.68M | 34.62M |
United States of America | US | 2,652K | 40.83M | 501.86M |
Total | | 12M | 795.74M | 10,876.25M |
Preprocessing
We preprocessed messages as follows (a rough sketch in code is shown after the list):
- lowercasing
- diacritic marks removed
- hashtags, user mentions, and numbers grouped under single tokens
- symbol repetitions normalized (at most two repeats)
- laughs normalized to four letters
- words, punctuation, and emojis used as tokens
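The sketch below illustrates these steps in Python; the regular expressions and placeholder tags are assumptions and not the exact rules used to build the corpora.

```python
# Sketch only: an approximation of the preprocessing steps listed above.
import re
import unicodedata

def preprocess(text: str) -> str:
    text = text.lower()
    # remove diacritic marks (note: this simple rule also maps "ñ" to "n")
    text = unicodedata.normalize("NFD", text)
    text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    # group hashtags, user mentions, and numbers under single tokens
    text = re.sub(r"#\w+", "_htag", text)
    text = re.sub(r"@\w+", "_usr", text)
    text = re.sub(r"\d+", "_num", text)
    # normalize symbol repetitions to at most two repeats
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # normalize laughs to four letters
    text = re.sub(r"\b(?:[jh][aeiou]){2,}\b", "jaja", text)
    return text

print(preprocess("JAJAJAJA me encantooooo @amigo #FelizLunes 2024!!!!"))
# -> "jaja me encantoo _usr _htag _num !!"
```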
We only consider Twitter as a data source; messages containing URLs were discarded, and retweets, Foursquare check-ins, and other automatic messages were removed. Short messages were also discarded. Please check our manuscript for more details (https://arxiv.org/abs/2110.06128) or contact us.
More resources
Citing
If you find these resources useful, please give us a star on the GitHub repository. Also, if you use them for research purposes, please consider citing us:
Regionalized models for Spanish language variations based on Twitter. Eric S. Tellez, Daniela Moctezuma, Sabino Miranda, Mario Graff, and Guillermo Ruiz. https://arxiv.org/abs/2110.06128.
@misc{tellez2022regionalized,
title={Regionalized models for Spanish language variations based on Twitter},
author={Eric S. Tellez and Daniela Moctezuma and Sabino Miranda and Mario Graff and Guillermo Ruiz},
year={2022},
eprint={2110.06128},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Contact us
If you want to know more about these resources, or you need something not shared in this repo/site, please contact us.
- Eric S. Tellez - eric.tellez@infotec.mx
- Daniela Moctezuma
- Sabino Miranda
- Mario Graff
- Guillermo Ruiz