GitHub: https://github.com/INGEOTEC
WebPage: https://ingeotec.github.io/
Definition
The study of people’s opinions, appraisals, attitudes, and emotions toward entities, individuals, events, and their attributes.
Formal Definition
Entity
Product, service, person, event, organization, or topic
Aspect
Entity’s component or attribute
Tasks
Definition
The aim is to classify documents into a fixed number of predefined categories.
Polarity
El día de mañana no podré ir con ustedes a la librería ("Tomorrow I will not be able to go with you to the bookstore")
Negative
Polarity
Positive, Negative, Neutral
Emotion (Multiclass)
Event (Binary)
Gender
Man, Woman, Nonbinary, …
Age
Child, Teen, Adult
Language Variety
Definition
Machine learning (ML) is a subfield of artificial intelligence that focuses on the development and implementation of algorithms capable of learning from data without being explicitly programmed.
Types of ML algorithms
Decision function
\(g(\mathbf x) = -0.78 x_1 + 0.60 x_2 - 0.88\)
\(w_0 = -0.88\) (bias term)
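As a sketch, the decision function above can be evaluated directly, and the sign of \(g(\mathbf x)\) read off as the predicted class. The weights are the ones shown on the slide; the example point \(\mathbf x\) is made up:

```python
import numpy as np

# Weights and bias from the decision function above
w = np.array([-0.78, 0.60])
w0 = -0.88

def g(x):
    """Linear decision function g(x) = w . x + w0."""
    return float(np.dot(w, x) + w0)

x = np.array([-1.0, 2.0])        # made-up example point
label = 1 if g(x) >= 0 else -1   # classify by the sign of g(x)
```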
Training set
| | text | klass |
|---|---|---|
| 0 | Fallece un ganadero de Peñacerrada atrapado po... | 0 |
| 1 | Carlos 'El Yoyas', imputado por un presunto ma... | 1 |
| 2 | Espacio Municipalista critica la política de a... | 0 |
| 3 | Cuatro detenidos por presunto hurto de ganado ... | 1 |
| 4 | @Wini83 @robersantacruz @juanjoph_73 @DCCoruna... | 0 |
| 5 | OCURRIÓ FRENTE MISMO A SU DOMICILIO \n\nUn jov... | 1 |
| 6 | Leighton Meester y Adam Brody juntos por 2,3 s... | 0 |
| 7 | Brutal asesinato de una patota en Rafael Casti... | 1 |
Question
Which of the following tasks does the previous training set belong to?
Problem
The independent variables are texts
Solution
Token as vector
\(d \ll \lvert \mathcal V \rvert\) (Dense Vector)
\(d = \lvert \mathcal V \rvert\) (Sparse Vector)
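A minimal illustration of the two regimes over a toy vocabulary (names and values are made up): a sparse one-hot vector has dimension \(\lvert \mathcal V \rvert\), while a dense token vector uses \(d \ll \lvert \mathcal V \rvert\):

```python
import numpy as np

vocabulary = ['buy', 'cheap', 'good', 'bad', 'now']  # toy vocabulary
V = len(vocabulary)

# Sparse representation: one-hot vector, d = |V|
token2id = {tok: i for i, tok in enumerate(vocabulary)}
one_hot = np.zeros(V)
one_hot[token2id['good']] = 1.0

# Dense representation: d << |V| (here d = 2; values are illustrative)
rng = np.random.default_rng(0)
dense = {tok: rng.normal(size=2) for tok in vocabulary}
```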
Algorithm
Procedure
\[\mathbf x = \sum_{t \in \mathcal U} \mathbf{v}_t\]
Unit Vector
\[\mathbf x = \frac{\sum_{t \in \mathcal U} \mathbf v_t}{\lVert \sum_{t \in \mathcal U} \mathbf v_t \rVert} \]
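The procedure above, sketched with one-hot token vectors (the vocabulary and utterance are made up): sum the vectors of the tokens in \(\mathcal U\), then normalize the sum to unit length:

```python
import numpy as np

vocabulary = {'the': 0, 'movie': 1, 'was': 2, 'good': 3}

def one_hot(token):
    """One-hot vector v_t for a token in the toy vocabulary."""
    v = np.zeros(len(vocabulary))
    v[vocabulary[token]] = 1.0
    return v

utterance = ['the', 'movie', 'was', 'good', 'good']
x = sum(one_hot(t) for t in utterance)  # x = sum of token vectors
x_unit = x / np.linalg.norm(x)          # unit vector
```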
Common Types
Term Frequency - Inverse Document Frequency (TF-IDF)
import numpy as np

# TF-IDF vector of one document as (token id, tf * idf) pairs;
# tm, token2id, and token2beta (the IDF weights) come from the fitted text model.
text = training_set[3]['text']
tokens = tm.tokenize(text)
vector = []
for token, tf in zip(*np.unique(tokens, return_counts=True)):
    if token not in token2id:
        continue  # skip out-of-vocabulary tokens
    vector.append((token2id[token], tf * token2beta[token]))
vector[:4]
[(62, np.float64(7.906890595608518)),
(50, np.float64(5.8479969065549495)),
(31, np.float64(1.3219280948873617)),
(15, np.float64(2.5224042154522373))]
Question
Which of the following representations do you consider to produce a larger vocabulary?
Prediction
# Imports assume the microtc package and scikit-learn
from microtc.textmodel import TextModel
from microtc.params import OPTION_NONE, OPTION_DELETE
from sklearn.svm import LinearSVC
import numpy as np

# token_list=[-2, -1, 3, 4]: word bigrams and unigrams, character 3- and 4-grams
tm = TextModel(token_list=[-2, -1, 3, 4], num_option=OPTION_NONE,
               usr_option=OPTION_DELETE, url_option=OPTION_DELETE,
               emo_option=OPTION_NONE, lc=True, del_dup=False,
               del_punc=True, del_diac=True).fit(training_set)
X = tm.transform(training_set)
labels = np.array([x['klass'] for x in training_set])
m = LinearSVC(dual='auto', class_weight='balanced').fit(X, labels)
hy = m.predict(tm.transform(test_set))
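If the test set carries gold labels under 'klass' as the training set does, a quick sanity check is the accuracy of hy. A self-contained sketch with made-up labels (with the slides' variables, y would be np.array([x['klass'] for x in test_set])):

```python
import numpy as np

# Made-up gold labels and predictions, standing in for test_set labels and hy
y  = np.array([0, 1, 0, 1, 1, 0])
hy = np.array([0, 1, 1, 1, 0, 0])
accuracy = float((y == hy).mean())  # fraction of correct predictions
```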
Wordcloud
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Tokens weighted by their normalized weight; w_norm is assumed to be a
# per-token normalized weight array computed beforehand.
path = './emoji_text.ttf'
tokens = {tm.id2token[id]: w_norm[id] * _w
          for id, _w in tm.token_weight.items()
          if w_norm[id] >= 2.0 and np.isfinite(w_norm[id])}
word_cloud = WordCloud(font_path=path,
                       background_color='white'
                       ).generate_from_frequencies(tokens)
plt.imshow(word_cloud, interpolation='bilinear')
plt.tick_params(left=False, right=False, labelleft=False,
                labelbottom=False, bottom=False)