GitHub: https://github.com/INGEOTEC
WebPage: https://ingeotec.github.io/
Definition
The study of people’s opinions, appraisals, attitudes, and emotions toward entities, individuals, events, and their attributes.
Formal Definition
Entity
Product, service, person, event, organization, or topic
Aspect
Entity’s component or attribute
Tasks
Definition
The aim is the classification of documents into a fixed number of predefined categories.
Polarity
El día de mañana no podré ir con ustedes a la librería (Tomorrow I will not be able to go with you to the bookstore)
Negative
Polarity
Positive, Negative, Neutral
Emotion (Multiclass)
Event (Binary)
Gender
Man, Woman, Nonbinary, …
Age
Child, Teen, Adult
Language Variety
Definition
Machine learning (ML) is a subfield of artificial intelligence that focuses on the development and implementation of algorithms capable of learning from data without being explicitly programmed.
Types of ML algorithms
Decision function
\(g(\mathbf x) = -0.78 x_1 + 0.60 x_2 - 0.88\)
Decision function
\(g(\mathbf x) = -0.78 x_1 + 0.60 x_2 - 0.88\)
\(w_0\)
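A quick check of the decision function above in code: the weights come from the slide, the point \(\mathbf x\) is made up for illustration, and the sign of \(g(\mathbf x)\) selects the class.

```python
import numpy as np

# Weights from the slide: g(x) = -0.78*x1 + 0.60*x2 - 0.88
w = np.array([-0.78, 0.60])   # coefficients of x1 and x2
w0 = -0.88                    # bias term

def g(x):
    """Linear decision function; the sign of g(x) gives the class."""
    return w @ x + w0

# Hypothetical point: -0.78*1.0 + 0.60*4.0 - 0.88 = 0.74 > 0,
# so it falls on the positive side of the hyperplane.
x = np.array([1.0, 4.0])
print(g(x), g(x) > 0)
```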
Decision function
| | text | klass |
|---|---|---|
| 0 | Ustedes están mal enserio, generar disforia???... | 0 |
| 1 | CaraotaDigital : #LoMásLeído #Video captó mome... | 1 |
| 2 | @vixin_twit Lo que te dan de puntos te lo ahor... | 0 |
| 3 | INVESTIGACIÓN.\nLas autoridades Investigan rob... | 1 |
| 4 | @AilenMormandoo No encontraste el indicado | 0 |
| 5 | El incidente en el que un individuo que transi... | 1 |
| 6 | Emergencia en @Codazzi_online por fuertes agu... | 0 |
| 7 | La policía boliviana arrestó a 9 venezolanos c... | 1 |
Question
Which of the following tasks does the previous training set belong to?
Problem
The independent variables are texts
Solution
Token as vector
\(d \ll \lvert \mathcal V \rvert\) (Dense Vector)
\(d = \lvert \mathcal V \rvert\) (Sparse Vector)
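A toy illustration of the two regimes with a made-up four-token vocabulary: one-hot vectors give \(d = \lvert \mathcal V \rvert\) (sparse, a single nonzero entry per token), while a low-dimensional embedding gives \(d \ll \lvert \mathcal V \rvert\) (dense); the embedding values here are random placeholders, not trained vectors.

```python
import numpy as np

# Toy vocabulary; in practice |V| is in the tens of thousands.
vocab = ['buenos', 'dias', 'mal', 'bien']
V = len(vocab)

# Sparse (one-hot) representation: d = |V|, one nonzero entry per token.
one_hot = {tok: np.eye(V)[i] for i, tok in enumerate(vocab)}

# Dense representation: d << |V|; here d = 2 with made-up random values.
rng = np.random.default_rng(0)
dense = {tok: rng.normal(size=2) for tok in vocab}

print(one_hot['mal'])        # one nonzero out of |V| entries
print(dense['mal'].shape)    # (2,)
```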
Algorithm
Procedure
\[\mathbf x = \sum_{t \in \mathcal U} \mathbf{v}_t\]
Unit Vector
\[\mathbf x = \frac{\sum_{t \in \mathcal U} \mathbf v_t}{\lVert \sum_{t \in \mathcal U} \mathbf v_t \rVert} \]
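A minimal sketch of both formulas above, using hypothetical 3-dimensional token vectors \(\mathbf v_t\) for a short utterance \(\mathcal U\): the document vector is the sum of its token vectors, optionally normalized to unit length.

```python
import numpy as np

# Hypothetical token vectors v_t (d = 3); real ones would come from a model.
v = {'no': np.array([0.1, -0.4, 0.2]),
     'podre': np.array([-0.3, 0.5, 0.1]),
     'ir': np.array([0.2, 0.1, -0.6])}

U = ['no', 'podre', 'ir']   # tokens of the utterance

# x = sum of the token vectors (first formula).
x = np.sum([v[t] for t in U], axis=0)

# Unit vector: divide by the Euclidean norm (second formula).
x_unit = x / np.linalg.norm(x)

print(x)
print(np.linalg.norm(x_unit))   # length 1 by construction
```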
flowchart LR
    Input([Text]) --> Norm[Text Normalizer]
    Norm --> Seg[Tokenizer]
    Seg --> Terms(...)
Common Types
Term Frequency - IDF
import numpy as np

text = training_set[3]['text']
tokens = tm.tokenize(text)
# Sparse TF-IDF representation: (token id, tf * idf weight) pairs.
vector = []
for token, tf in zip(*np.unique(tokens, return_counts=True)):
    if token not in token2id:
        continue  # skip out-of-vocabulary tokens
    vector.append((token2id[token], tf * token2beta[token]))
vector[:4]
[(59, 7.906890595608518),
(52, 5.8479969065549495),
(1, 1.3219280948873617),
(6, 2.5224042154522373)]
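The `(id, weight)` pairs above are a sparse encoding of a single document. A minimal sketch of expanding them into a full vector, assuming a hypothetical vocabulary size of 64 (in practice it would be `len(token2id)`); the weights are rounded stand-ins for the output shown above.

```python
import numpy as np

# (token id, tf*idf) pairs as produced above; values are illustrative.
pairs = [(59, 7.9069), (52, 5.8480), (1, 1.3220), (6, 2.5224)]

V = 64  # assumed vocabulary size; in practice len(token2id)
x = np.zeros(V)
for idx, w in pairs:
    x[idx] = w

# Only a handful of the |V| entries are nonzero: a sparse vector.
print(np.count_nonzero(x), 'nonzero entries out of', V)
```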
Question
Which of the following representations do you consider to produce a larger vocabulary?
Prediction
from b4msa.textmodel import TextModel
from microtc.params import OPTION_NONE, OPTION_DELETE
from sklearn.svm import LinearSVC
import numpy as np

# Word unigrams/bigrams (-1, -2) and character 3-/4-grams (3, 4);
# usernames and URLs are deleted, case and diacritics are normalized.
tm = TextModel(token_list=[-2, -1, 3, 4], num_option=OPTION_NONE,
               usr_option=OPTION_DELETE, url_option=OPTION_DELETE,
               emo_option=OPTION_NONE, lc=True, del_dup=False,
               del_punc=True, del_diac=True
              ).fit(training_set)
X = tm.transform(training_set)
labels = np.array([x['klass'] for x in training_set])
# Linear SVM; class_weight='balanced' compensates for label imbalance.
m = LinearSVC(dual='auto', class_weight='balanced').fit(X, labels)
hy = m.predict(tm.transform(test_set))
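Once `hy` is available, the predictions can be scored against the gold labels of the test set. A hedged sketch using scikit-learn's macro-averaged metrics, which weight both classes equally under label imbalance; the label arrays below are illustrative stand-ins, not results from this training set.

```python
from sklearn.metrics import f1_score, recall_score

# Illustrative gold labels and predictions; in practice gold comes from
# test_set and the predictions are hy from m.predict above.
gold = [0, 1, 1, 0, 1, 0]
hy = [0, 1, 0, 0, 1, 1]

# Macro-F1 averages the per-class F1 scores with equal weight.
print(f1_score(gold, hy, average='macro'))
print(recall_score(gold, hy, average='macro'))
```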
Wordcloud
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud

path = './emoji_text.ttf'
# Keep tokens whose (externally computed) normalized weight w_norm is
# finite and at least 2, scaled by the model's token weight.
tokens = {tm.id2token[id]: w_norm[id] * _w
          for id, _w in tm.token_weight.items()
          if w_norm[id] >= 2.0 and np.isfinite(w_norm[id])}
word_cloud = WordCloud(font_path=path,
                       background_color='white'
                      ).generate_from_frequencies(tokens)
plt.imshow(word_cloud, interpolation='bilinear')
# Hide the axis ticks and labels around the image.
plt.tick_params(left=False, right=False, labelleft=False,
                labelbottom=False, bottom=False)