GitHub: https://github.com/INGEOTEC
WebPage: https://ingeotec.github.io/
Definition
Study people’s opinions, appraisals, attitudes, and emotions toward entities, individuals, events, and their attributes.
Formal Definition
Entity
Product, service, person, event, organization, or topic
Aspect
Entity’s component or attribute
Tasks
Definition
The aim is the classification of documents into a fixed number of predefined categories.
Polarity
El día de mañana no podré ir con ustedes a la librería ("Tomorrow I will not be able to go with you to the bookstore")
Negative
Polarity
Positive, Negative, Neutral
Emotion (Multiclass)
Event (Binary)
Gender
Man, Woman, Nonbinary, …
Age
Child, Teen, Adult
Language Variety
Definition
Machine learning (ML) is a subfield of artificial intelligence that focuses on the development and implementation of algorithms capable of learning from data without being explicitly programmed.
Types of ML algorithms
Decision function
\(g(\mathbf x) = -0.78 x_1 + 0.60 x_2 - 0.88\)
Decision function
\(g(\mathbf x) = -0.78 x_1 + 0.60 x_2 - 0.88\)
\(w_0\)
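The decision function above can be evaluated directly; a minimal sketch, where the point \(\mathbf x\) is made up for illustration and the sign of \(g(\mathbf x)\) selects the class:

```python
import numpy as np

# Weights and bias of the decision function g(x) = w1*x1 + w2*x2 + w0
w = np.array([-0.78, 0.60])
w0 = -0.88

def g(x):
    """Linear decision function; the sign of the value selects the class."""
    return w @ x + w0

# A hypothetical point for illustration
x = np.array([1.0, 2.0])
score = g(x)               # -0.78*1.0 + 0.60*2.0 - 0.88
label = 1 if score > 0 else -1
```

Since the score is negative for this point, the classifier assigns the negative class.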
Decision function
| | text | klass |
|---|---|---|
| 0 | @retrochenta aquí está el video vía @rtve \nht... | 0 |
| 1 | 🔴#ATENCIÓN \| Cámara de seguridad capta el mome... | 1 |
| 2 | @mandramas @GalataLudovisi Y falta que se va a... | 0 |
| 3 | Secuestran y asesinan a una estudiante de bach... | 1 |
| 4 | “Me va a tocar algo bueno porque me guiño el o... | 0 |
| 5 | Los Vilos: Cuatro detenidos por golpiza y apuñ... | 1 |
| 6 | Resto del Carlino 25.04.19 https://t.co/uCIjMp... | 0 |
| 7 | Detenidos por múltiples atracos con arma de fu... | 1 |
Question
Which of the following tasks does the previous training set belong to?
Problem
The independent variables are texts
Solution
Token as vector
\(d \ll \lvert \mathcal V \rvert\) (Dense Vector)
\(d = \lvert \mathcal V \rvert\) (Sparse Vector)
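The two regimes can be contrasted with a toy vocabulary; a minimal sketch where the vocabulary, the dimension \(d\), and the random embedding matrix are all made up for illustration:

```python
import numpy as np

# Toy vocabulary; in practice |V| is in the tens of thousands
vocab = ['good', 'bad', 'movie', 'great', 'boring']
V = len(vocab)

# Sparse (one-hot) representation: the vector has d = |V| components
one_hot = np.zeros(V)
one_hot[vocab.index('movie')] = 1.0

# Dense representation: d << |V|; random values stand in for learned embeddings
d = 2
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(V, d))
dense = embeddings[vocab.index('movie')]
```

The one-hot vector grows with the vocabulary, while the dense vector keeps a fixed, small dimension regardless of \(\lvert \mathcal V \rvert\).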
Algorithm
Procedure
\[\mathbf x = \sum_{t \in \mathcal U} \mathbf{v}_t\]
Unit Vector
\[\mathbf x = \frac{\sum_{t \in \mathcal U} \mathbf v_t}{\lVert \sum_{t \in \mathcal U} \mathbf v_t \rVert} \]
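The unit-vector formula above can be sketched in a few lines; the 3-dimensional token vectors here are made-up stand-ins for learned embeddings:

```python
import numpy as np

# Hypothetical token vectors v_t (stand-ins for learned embeddings)
vectors = {'good': np.array([0.5, 0.1, 0.0]),
           'movie': np.array([0.2, 0.7, 0.1])}

def text_vector(tokens):
    """Sum the vectors of the tokens present and scale the sum to unit norm."""
    s = sum(vectors[t] for t in tokens if t in vectors)
    return s / np.linalg.norm(s)

x = text_vector(['good', 'movie'])
# x points in the direction of the summed token vectors and has norm 1
```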
flowchart LR
    Entrada([Text]) --> Norm[Text Normalizer]
    Norm --> Seg[Tokenizer]
    Seg --> Terminos(...)
Common Types
Term Frequency - Inverse Document Frequency (TF-IDF)
import numpy as np

# TF-IDF vector of one training example
text = training_set[3]['text']
tokens = tm.tokenize(text)
vector = []
# np.unique returns each token together with its count, i.e., the term frequency (tf)
for token, tf in zip(*np.unique(tokens, return_counts=True)):
    if token not in token2id:  # skip out-of-vocabulary tokens
        continue
    # weight each term by tf times its inverse document frequency (token2beta)
    vector.append((token2id[token], tf * token2beta[token]))
vector[:4]
[(57, 7.906890595608518),
(62, 5.8479969065549495),
(14, 1.3219280948873617),
(24, 2.5224042154522373)]
Question
Which of the following representations do you consider to produce a larger vocabulary?
Prediction
import numpy as np
from sklearn.svm import LinearSVC
# TextModel and the OPTION_* constants come from INGEOTEC's microtc library
from microtc.textmodel import TextModel
from microtc.params import OPTION_NONE, OPTION_DELETE

# Text model: word bigrams and unigrams (-2, -1) plus character 3- and 4-grams;
# usernames and URLs are deleted, text is lowercased, punctuation and
# diacritics are removed
tm = TextModel(token_list=[-2, -1, 3, 4], num_option=OPTION_NONE,
               usr_option=OPTION_DELETE, url_option=OPTION_DELETE,
               emo_option=OPTION_NONE, lc=True, del_dup=False,
               del_punc=True, del_diac=True
              ).fit(training_set)
X = tm.transform(training_set)  # sparse TF-IDF matrix
labels = np.array([x['klass'] for x in training_set])
# Linear SVM; class_weight='balanced' compensates for label imbalance
m = LinearSVC(dual='auto', class_weight='balanced').fit(X, labels)
hy = m.predict(tm.transform(test_set))  # predicted labels for the test set
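Once predictions are available they can be scored against the gold labels; a minimal self-contained sketch with toy arrays standing in for the real labels and predictions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy gold labels and predictions standing in for `labels` and `hy`
y_true = np.array([0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1])

acc = accuracy_score(y_true, y_pred)  # fraction of correct predictions
f1 = f1_score(y_true, y_pred)         # harmonic mean of precision and recall
```

With an imbalanced training set (as suggested by `class_weight='balanced'` above), the F1 score is usually more informative than plain accuracy.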
Wordcloud
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud

path = './emoji_text.ttf'  # font with emoji support
# w_norm holds the normalized token weights computed in a previous step
items = tm.token_weight.items
# keep tokens whose normalized weight is finite and at least 2.0
tokens = {tm.id2token[id]: w_norm[id] * _w for id, _w in items()
          if w_norm[id] >= 2.0 and np.isfinite(w_norm[id])}
word_cloud = WordCloud(font_path=path,
                       background_color='white'
                      ).generate_from_frequencies(tokens)
plt.imshow(word_cloud, interpolation='bilinear')
# hide the axis ticks and labels
plt.tick_params(left=False, right=False, labelleft=False,
                labelbottom=False, bottom=False)