Poornima Institute of Engineering & Technology
INFOTEC
GitHub: https://github.com/INGEOTEC
WebPage: https://ingeotec.github.io/
Definition
The aim is the classification of documents into a fixed number of predefined categories.
Polarity
El día de mañana no podré ir con ustedes a la librería (Tomorrow I will not be able to go with you to the bookstore)
Negative
Polarity
Positive, Negative, Neutral
Emotion (Multiclass)
Event (Binary)
Gender
Man, Woman, Nonbinary, …
Age
Child, Teen, Adult
Language Variety
Definition
Machine learning (ML) is a subfield of artificial intelligence that focuses on the development and implementation of algorithms capable of learning from data without being explicitly programmed.
Types of ML algorithms
Decision function
\(g(\mathbf x) = -0.78 x_1 + 0.60 x_2 - 0.88\)
\(w_0 = -0.88\) (bias term)
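A minimal sketch of how such a decision function could be evaluated and turned into a class label. The coefficients are those shown above; the rule "positive class when \(g(\mathbf x) \geq 0\)" and the names `decision_function` and `predict` are assumptions made for illustration.

```python
def decision_function(x1, x2, w=(-0.78, 0.60), w0=-0.88):
    """Evaluate g(x) = w1*x1 + w2*x2 + w0 with the coefficients from the slide."""
    return w[0] * x1 + w[1] * x2 + w0


def predict(x1, x2):
    """Assumed rule: label 1 when g(x) >= 0, label -1 otherwise."""
    return 1 if decision_function(x1, x2) >= 0 else -1


# At the origin only the bias term w0 remains.
g_origin = decision_function(0.0, 0.0)
# Two hypothetical points, one on each side of the boundary g(x) = 0.
label_a = predict(0.0, 2.0)
label_b = predict(1.0, 0.0)
```

The boundary \(g(\mathbf x) = 0\) is a line in the plane; the sign of \(g\) tells on which side of the line a point falls.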
Decision function
| | text | klass |
|---|---|---|
| 0 | Hace 1 hora la policia estaba persiguiendo a u... | 0 |
| 1 | Detenidos un hombre y a un menor de edad por r... | 1 |
| 2 | Gobernador la cava ud será el responsable de l... | 0 |
| 3 | Intento de Golpe de Estado en Turquía dejó 161... | 1 |
| 4 | Ni el café me deja tomarlo tranquilo y tumbadi... | 0 |
| 5 | Tres guardias civiles rescatan a una mujer que... | 1 |
| 6 | @jgarciaruminot @jacoloma Honorables Senadores... | 0 |
| 7 | “Siete personas detenidas”: Asesinaron de múlt... | 1 |
Question
Which of the following tasks does the previous training set belong to?
Problem
The independent variables are texts
Solution
Token as vector
\(d \ll \lvert \mathcal V \rvert\) (Dense Vector)
\(d = \lvert \mathcal V \rvert\) (Sparse Vector)
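The two options can be illustrated with a toy vocabulary. This is only a sketch: the vocabulary, the dimension `d = 2`, and the random dense values (standing in for learned embeddings) are hypothetical.

```python
import random

# Hypothetical vocabulary V with |V| = 4.
vocabulary = ["good", "bad", "movie", "plot"]
token2id = {token: i for i, token in enumerate(vocabulary)}


def sparse_vector(token):
    """One-hot vector with d = |V|: a single 1 at the token's index."""
    vec = [0.0] * len(vocabulary)
    vec[token2id[token]] = 1.0
    return vec


# Dense vectors use d < |V|; random values stand in for learned ones.
rng = random.Random(0)
d = 2
dense_table = {t: [rng.gauss(0, 1) for _ in range(d)] for t in vocabulary}


def dense_vector(token):
    """Look up the d-dimensional dense vector of a token."""
    return dense_table[token]
```

In the sparse case every token owns one coordinate, so the dimension grows with the vocabulary; in the dense case the dimension is fixed in advance and is much smaller than the vocabulary in practice.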
Algorithm
Procedure
\[\mathbf x = \sum_{t \in \mathcal U} \mathbf{v}_t\]
Unit Vector
\[\mathbf x = \frac{\sum_{t \in \mathcal U} \mathbf v_t}{\lVert \sum_{t \in \mathcal U} \mathbf v_t \rVert} \]
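The two formulas above can be sketched in a few lines. The token vectors below are hypothetical, and the sketch assumes that \(\mathcal U\) is the sequence of tokens of the text (repeated tokens are counted each time) and that unknown tokens are skipped.

```python
import math

# Hypothetical token vectors v_t (here one-hot, d = 3).
token_vectors = {
    "good": [1.0, 0.0, 0.0],
    "movie": [0.0, 1.0, 0.0],
    "plot": [0.0, 0.0, 1.0],
}


def text_vector(tokens, vectors, unit=True):
    """Sum the vectors of the tokens; optionally scale to unit length."""
    dim = len(next(iter(vectors.values())))
    x = [0.0] * dim
    for token in tokens:
        if token not in vectors:
            continue  # assumption: tokens without a vector are ignored
        for i, value in enumerate(vectors[token]):
            x[i] += value
    if unit:
        norm = math.sqrt(sum(value * value for value in x))
        if norm > 0:
            x = [value / norm for value in x]
    return x


raw = text_vector(["good", "movie", "movie"], token_vectors, unit=False)
normalized = text_vector(["good", "movie", "movie"], token_vectors)
```

Dividing by the norm removes the effect of text length, so a long and a short text with the same token proportions map to the same vector.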
flowchart LR
Entrada([Text]) --> Norm[Text Normalizer]
Norm --> Seg[Tokenizer]
Seg --> Terminos(...)
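The first two stages of the flowchart can be sketched with the standard library. This is not the microtc implementation; the function names and the specific normalization steps (lowercasing, deleting URLs and user mentions, stripping diacritics and punctuation, splitting on whitespace) are assumptions chosen for illustration.

```python
import re
import unicodedata


def normalize(text):
    """Text normalizer: lowercase, drop URLs, mentions, diacritics, punctuation."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)  # delete URLs
    text = re.sub(r"@\w+", "", text)          # delete user mentions
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFD", text)
    text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    text = re.sub(r"[^\w\s]", "", text)       # delete punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace


def tokenize(text):
    """Tokenizer: split the normalized text into words."""
    return normalize(text).split()


tokens = tokenize("@user El día de mañana, no podré ir! https://t.co/x")
```

Note that this sketch also maps "ñ" to "n", since the tilde is a combining mark after decomposition.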
Common Types
from EvoMSA.utils import Download
from microtc.utils import tweet_iterator
from os.path import isdir, isfile
import pandas as pd
from random import shuffle
URL = 'https://github.com/INGEOTEC/Delitos/releases/download/Datos/delitos.zip'
if not isfile('delitos.zip'):
    Download(URL,
             'delitos.zip')
if not isdir('delitos'):
    !unzip -Pingeotec delitos.zip
Term Frequency - IDF
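text = training_set[3]['text']
tokens = tm.tokenize(text)
vector = []
# np.unique returns each distinct token together with its frequency (tf)
for token, tf in zip(*np.unique(tokens, return_counts=True)):
    # Skip tokens that are not in the vocabulary
    if token not in token2id:
        continue
    # Weight: term frequency times the token's IDF weight (token2beta)
    vector.append((token2id[token], tf * token2beta[token]))
vector[:4]
[(62, 7.906890595608518),
 (59, 5.8479969065549495),
 (35, 1.3219280948873617),
 (38, 2.5224042154522373)]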
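The mappings `token2id` and `token2beta` used above can be built from a tokenized corpus. The following is a self-contained sketch under stated assumptions: the corpus is hypothetical, the helper names `fit_idf` and `tfidf` are invented for illustration, and the IDF is taken to be \(\log_2(N / \text{df}_t)\), which may differ from the weighting used to produce the output shown.

```python
import math
from collections import Counter

# Hypothetical tokenized corpus with N = 3 documents.
corpus = [
    ["police", "arrest", "man"],
    ["coffee", "morning"],
    ["police", "coffee", "police"],
]


def fit_idf(docs):
    """Build token2id and token2beta, assuming beta_t = log2(N / df_t)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = Counter(token for doc in docs for token in set(doc))
    token2id = {token: i for i, token in enumerate(sorted(df))}
    token2beta = {token: math.log2(n_docs / df[token]) for token in df}
    return token2id, token2beta


def tfidf(tokens, token2id, token2beta):
    """Sparse TF-IDF vector as (index, weight) pairs; unknown tokens are skipped."""
    counts = Counter(tokens)
    return [(token2id[token], tf * token2beta[token])
            for token, tf in sorted(counts.items())
            if token in token2id]


token2id, token2beta = fit_idf(corpus)
vector = tfidf(["police", "police", "arrest", "unknown"], token2id, token2beta)
```

A token that appears in few documents gets a large weight, while a token present in every document gets weight \(\log_2 1 = 0\).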
Question
Which of the following representations do you consider to produce a larger vocabulary?
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import recall_score
from sklearn.svm import LinearSVC
# TextModel, OPTION_NONE and OPTION_DELETE are assumed to be imported
# earlier from microtc
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
perf = []
for tr, vs in kfold.split(training_set):
    train = [training_set[i] for i in tr]
    val = [training_set[i] for i in vs]
    # Fit the text representation on the training fold only
    tm = TextModel(token_list=[-1], num_option=OPTION_NONE,
                   usr_option=OPTION_DELETE, url_option=OPTION_DELETE,
                   emo_option=OPTION_NONE, lc=True, del_dup=False,
                   del_punc=True, del_diac=True).fit(train)
    labels = [x['klass'] for x in train]
    m = LinearSVC(dual='auto').fit(tm.transform(train), labels)
    hy = m.predict(tm.transform(val))
    # Macro-averaged recall on the validation fold
    recall = recall_score([x['klass'] for x in val], hy, average='macro')
    perf.append(recall)
np.mean(perf)
0.8235285558223533
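The performance measure used in the loop, macro-averaged recall, can be spelled out by hand. This is a simplified sketch with hypothetical labels: it averages over the classes present in the true labels, whereas `recall_score` also considers classes that appear only in the predictions.

```python
def macro_recall(y_true, y_pred):
    """Unweighted mean of the per-class recall over the classes in y_true."""
    recalls = []
    for klass in sorted(set(y_true)):
        # Positions of the examples that truly belong to this class.
        idx = [i for i, y in enumerate(y_true) if y == klass]
        # How many of them were predicted as this class.
        hits = sum(1 for i in idx if y_pred[i] == klass)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical labels: class 0 has recall 2/3 and class 1 has recall 1.
score = macro_recall([0, 0, 0, 1], [0, 0, 1, 1])
```

Because each class contributes equally to the average, the measure is not dominated by the majority class.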
Personal webpage