GitHub: https://github.com/INGEOTEC
WebPage: https://ingeotec.github.io/
Definition
The study of people’s opinions, appraisals, attitudes, and emotions toward entities, individuals, events, and their attributes.
Formal Definition
Entity
Product, service, person, event, organization, or topic
Aspect
Entity’s component or attribute
Tasks
Definition
The aim is to classify documents into a fixed number of predefined categories.
Polarity
El día de mañana no podré ir con ustedes a la librería (Tomorrow I will not be able to go with you to the bookstore)
Negative
Polarity
Positive, Negative, Neutral
Emotion (Multiclass)
Event (Binary)
Gender
Man, Woman, Nonbinary, …
Age
Child, Teen, Adult
Language Variety
Definition
Machine learning (ML) is a subfield of artificial intelligence that focuses on the development and implementation of algorithms capable of learning from data without being explicitly programmed.
Types of ML algorithms
Decision function
\(g(\mathbf x) = -0.78 x_1 + 0.60 x_2 - 0.88\)
Decision function
\(g(\mathbf x) = -0.78 x_1 + 0.60 x_2 - 0.88\)
\(w_0 = -0.88\) is the bias term
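The linear decision function above can be evaluated directly; a minimal sketch using the slide's weights, where the predicted class is the sign of \(g(\mathbf x)\):

```python
# Linear decision function from the slide:
# g(x) = -0.78 * x1 + 0.60 * x2 - 0.88, with bias w_0 = -0.88.

def g(x1, x2):
    """Evaluate the linear decision function."""
    return -0.78 * x1 + 0.60 * x2 - 0.88

def predict(x1, x2):
    """Positive class (1) if g(x) >= 0, negative class (-1) otherwise."""
    return 1 if g(x1, x2) >= 0 else -1

print(predict(0.0, 2.0))   # g = 0.32, so class 1
print(predict(1.0, 0.0))   # g = -1.66, so class -1
```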
Decision function
| | text | klass |
|---|---|---|
| 0 | El cartel pegado en la frente de "pelotuda" lo... | 0 |
| 1 | Tiroteo en bar de Carolina del Sur deja 2 muer... | 1 |
| 2 | Estamos en #Instagram. Seguí nuestro usuario p... | 0 |
| 3 | La Policía nos confirma la detención de un hom... | 1 |
| 4 | La ralentización de la globalización y sus can... | 0 |
| 5 | Golpeado y capturado un hombre que intentó rob... | 1 |
| 6 | @josemata1974 @amacasqui @diegobravorayo @Cami... | 0 |
| 7 | 🚨🚔🚑#Sucesos #SJR Vecinos detienen a sujeto por... | 1 |
Question
Which of the following tasks does the previous training set belong to?
Problem
The independent variables are texts
Solution
Token as vector
\(d \ll \lvert \mathcal V \rvert\) (Dense Vector)
\(d = \lvert \mathcal V \rvert\) (Sparse Vector)
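The two regimes above can be illustrated with a toy vocabulary (the vocabulary, dimensions, and values below are made up for illustration):

```python
import numpy as np

# Toy vocabulary; in the sparse case each token is a one-hot vector
# of size |V|, while a dense embedding uses d << |V| dimensions.
vocab = ['buenos', 'dias', 'mundo', 'hola']
V = len(vocab)

# Sparse representation: one-hot vectors with d = |V|.
one_hot = {t: np.eye(V)[i] for i, t in enumerate(vocab)}

# Dense representation: small random vectors with d = 2 (hypothetical).
rng = np.random.default_rng(0)
dense = {t: rng.normal(size=2) for t in vocab}

print(one_hot['hola'])       # length-|V| vector with a single 1
print(dense['hola'].shape)   # (2,)
```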
Algorithm
Procedure
\[\mathbf x = \sum_{t \in \mathcal U} \mathbf{v}_t\]
Unit Vector
\[\mathbf x = \frac{\sum_{t \in \mathcal U} \mathbf v_t}{\lVert \sum_{t \in \mathcal U} \mathbf v_t \rVert} \]
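The sum-and-normalize construction in the two formulas above can be sketched with NumPy (the token vectors here are made up):

```python
import numpy as np

# Hypothetical token vectors v_t (values chosen for illustration).
v = {'hola': np.array([1.0, 0.0]),
     'mundo': np.array([0.0, 1.0])}

tokens = ['hola', 'mundo']                    # tokens U in the document
x = np.sum([v[t] for t in tokens], axis=0)    # x = sum of v_t
x_unit = x / np.linalg.norm(x)                # normalized to unit length

print(x_unit)
print(np.linalg.norm(x_unit))   # 1.0
```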
flowchart LR
Entrada([Text]) --> Norm[Text Normalizer]
Norm --> Seg[Tokenizer]
Seg --> Terminos(...)
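The pipeline in the diagram could be sketched as follows; `normalize` and `tokenize` are hypothetical stand-ins for the Text Normalizer and Tokenizer stages, not the library's implementation:

```python
import re
import unicodedata

def normalize(text):
    """Hypothetical text normalizer: lowercase and strip diacritics."""
    text = unicodedata.normalize('NFD', text.lower())
    return ''.join(c for c in text if unicodedata.category(c) != 'Mn')

def tokenize(text):
    """Hypothetical tokenizer: normalize, then split into word tokens."""
    return re.findall(r'\w+', normalize(text))

print(tokenize('El día de mañana'))   # ['el', 'dia', 'de', 'manana']
```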
Common Types
Term Frequency - Inverse Document Frequency (TF-IDF)
import numpy as np

text = training_set[3]['text']
tokens = tm.tokenize(text)
vector = []
# Term frequency of each token, weighted by its IDF (token2beta)
for token, tf in zip(*np.unique(tokens, return_counts=True)):
    if token not in token2id:
        continue  # skip out-of-vocabulary tokens
    vector.append((token2id[token], tf * token2beta[token]))
vector[:4]
[(62, 7.906890595608518),
 (56, 5.8479969065549495),
 (10, 1.3219280948873617),
 (23, 2.5224042154522373)]
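The `token2beta` weights used above behave like IDF values; a minimal sketch of one common variant, \(\mathrm{idf}(t) = \log_2(N / \mathrm{df}(t))\), computed on a toy corpus (not the library's actual implementation):

```python
import numpy as np

# Toy corpus of tokenized documents (made up for illustration).
docs = [['hola', 'mundo'], ['hola', 'adios'], ['mundo', 'cruel']]
N = len(docs)

# Document frequency: number of documents containing each token.
df = {}
for doc in docs:
    for t in set(doc):
        df[t] = df.get(t, 0) + 1

# IDF(t) = log2(N / df(t)); rare tokens receive larger weights.
idf = {t: np.log2(N / c) for t, c in df.items()}
print(idf['hola'])    # log2(3/2)
print(idf['cruel'])   # log2(3/1)
```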
Question
Which of the following representations do you consider to produce a larger vocabulary?
Prediction
from microtc.textmodel import TextModel
from microtc.params import OPTION_NONE, OPTION_DELETE
from sklearn.svm import LinearSVC
import numpy as np

tm = TextModel(token_list=[-2, -1, 3, 4], num_option=OPTION_NONE,
               usr_option=OPTION_DELETE, url_option=OPTION_DELETE,
               emo_option=OPTION_NONE, lc=True, del_dup=False,
               del_punc=True, del_diac=True
               ).fit(training_set)
X = tm.transform(training_set)
labels = np.array([x['klass'] for x in training_set])
m = LinearSVC(dual='auto', class_weight='balanced').fit(X, labels)
hy = m.predict(tm.transform(test_set))
Wordcloud
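Once `hy` is predicted, a simple way to gauge performance is accuracy; a toy sketch with made-up labels (in practice the gold labels would come from the test set):

```python
import numpy as np

# Hypothetical gold labels y and predictions hy, for illustration only.
y = np.array([0, 1, 0, 1, 1, 0])
hy = np.array([0, 1, 1, 1, 0, 0])

# Accuracy: fraction of predictions matching the gold labels.
acc = (y == hy).mean()
print(acc)   # 4 of 6 correct
```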
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud

path = './emoji_text.ttf'
items = tm.token_weight.items
# Keep tokens whose normalized weight is finite and at least 2
tokens = {tm.id2token[id]: w_norm[id] * _w for id, _w in items()
          if w_norm[id] >= 2.0 and np.isfinite(w_norm[id])}
word_cloud = WordCloud(font_path=path,
                       background_color='white'
                       ).generate_from_frequencies(tokens)
plt.imshow(word_cloud, interpolation='bilinear')
plt.tick_params(left=False, right=False, labelleft=False,
                labelbottom=False, bottom=False)