Poornima Institute of Engineering & Technology
INFOTEC
GitHub: https://github.com/INGEOTEC
WebPage: https://ingeotec.github.io/
Definition
The aim is to classify documents into a fixed number of predefined categories.
Polarity
El día de mañana no podré ir con ustedes a la librería (Tomorrow I will not be able to go with you to the bookstore)
Negative
Polarity
Positive, Negative, Neutral
Emotion (Multiclass)
Event (Binary)
Gender
Man, Woman, Nonbinary, …
Age
Child, Teen, Adult
Language Variety
Definition
Machine learning (ML) is a subfield of artificial intelligence that focuses on the development and implementation of algorithms capable of learning from data without being explicitly programmed.
Types of ML algorithms
Decision function
\(g(\mathbf x) = -0.78 x_1 + 0.60 x_2 - 0.88\)
Decision function
\(g(\mathbf x) = -0.78 x_1 + 0.60 x_2 - 0.88\)
\(w_0 = -0.88\) (the bias term)
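A minimal sketch of evaluating this linear decision function, using the weights from the slide; the test point is an arbitrary illustration, and the class is read off the sign of \(g(\mathbf x)\).

```python
import numpy as np

# Weights and bias from the decision function on the slide
w = np.array([-0.78, 0.60])
w0 = -0.88

def g(x):
    """Linear decision function g(x) = w . x + w0."""
    return np.dot(w, x) + w0

# The sign of g(x) assigns the class: positive side vs. negative side
x = np.array([-2.0, 1.0])   # hypothetical point
print(g(x))                 # > 0, so x falls on the positive side
```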
Decision function
| | text | klass |
|---|---|---|
| 0 | @josebaazkarraga @joegrn23 Y sin viaje con boc... | 0 |
| 1 | #EstoPasó | La Policía Nacional capturó a Greg... | 1 |
| 2 | Como hago para denunciar un comerciante por ma... | 0 |
| 3 | Un policía muerto y 25 heridos en ataque con g... | 1 |
| 4 | [Video] Maduro: Develamos plan de un grupo fas... | 0 |
| 5 | Asesinan a dos hombres en #Empalme \nSuman cua... | 1 |
| 6 | @Monica_Garcia_G Entornos seguros y sanos dice... | 0 |
| 7 | Recién por la calle de atrás de mi casa le rob... | 1 |
Question
Which of the following tasks does the previous training set belong to?
Problem
The independent variables are texts, which cannot be fed directly to standard ML algorithms.
Solution
Represent each token \(t\) as a vector \(\mathbf v_t \in \mathbb R^d\)
\(d << \lvert \mathcal V \rvert\) (Dense Vector)
\(d = \lvert \mathcal V \rvert\) (Sparse Vector)
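A toy sketch of the two regimes, using a made-up four-word vocabulary: a sparse one-hot vector has dimension \(d = \lvert \mathcal V \rvert\), while a dense embedding has \(d \ll \lvert \mathcal V \rvert\) (random values here, standing in for learned ones).

```python
import numpy as np

# Toy vocabulary; real vocabularies have tens of thousands of types
vocab = ['the', 'cat', 'sat', 'mat']
token2id = {t: i for i, t in enumerate(vocab)}

# Sparse representation: one-hot vector with d = |V|
def one_hot(token):
    v = np.zeros(len(vocab))
    v[token2id[token]] = 1.0
    return v

# Dense representation: d << |V| (random placeholder for a learned embedding)
d = 2
rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(vocab), d))

print(one_hot('cat'))               # dimension 4 = |V|, a single 1
print(embedding[token2id['cat']])   # dimension 2 << |V|
```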
Algorithm
Procedure
\[\mathbf x = \sum_{t \in \mathcal U} \mathbf{v}_t\]
Unit Vector
\[\mathbf x = \frac{\sum_{t \in \mathcal U} \mathbf v_t}{\lVert \sum_{t \in \mathcal U} \mathbf v_t \rVert} \]
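The two formulas above can be sketched as follows: sum the vectors of the tokens in the utterance \(\mathcal U\), then divide by the norm to obtain a unit vector. The token vectors here are random placeholders, not trained embeddings.

```python
import numpy as np

# Placeholder token vectors v_t (random, for illustration only)
rng = np.random.default_rng(1)
vocab = ['good', 'morning', 'everyone']
vectors = {t: rng.normal(size=3) for t in vocab}

def represent(tokens):
    """Sum the token vectors and normalize to unit length."""
    x = np.sum([vectors[t] for t in tokens], axis=0)
    return x / np.linalg.norm(x)

x = represent(['good', 'morning'])
print(x, np.linalg.norm(x))  # the norm is 1 by construction
```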
flowchart LR
Entrada([Text]) --> Norm[Text Normalizer]
Norm --> Seg[Tokenizer]
Seg --> Terminos(...)
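A minimal sketch of the pipeline in the diagram: a text normalizer followed by a tokenizer. This is a simplification for illustration; microtc's `TextModel` implements a much richer version (n-grams, q-grams, user/URL handling).

```python
import re
import unicodedata

def normalize(text):
    # Lowercase and strip diacritics (cf. lc=True, del_diac=True in TextModel)
    text = text.lower()
    return ''.join(c for c in unicodedata.normalize('NFD', text)
                   if unicodedata.category(c) != 'Mn')

def tokenize(text):
    # Word tokens over the normalized text
    return re.findall(r'\w+', normalize(text))

print(tokenize('El día de MAÑANA'))  # ['el', 'dia', 'de', 'manana']
```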
Common Types
from EvoMSA.utils import Download
from microtc.utils import tweet_iterator
from os.path import isdir, isfile
import pandas as pd
from random import shuffle
URL = 'https://github.com/INGEOTEC/Delitos/releases/download/Datos/delitos.zip'
if not isfile('delitos.zip'):
    Download(URL, 'delitos.zip')
if not isdir('delitos'):
    !unzip -Pingeotec delitos.zip
Term Frequency - IDF
text = training_set[3]['text']
tokens = tm.tokenize(text)
vector = []
for token, tf in zip(*np.unique(tokens, return_counts=True)):
    if token not in token2id:
        continue
    vector.append((token2id[token], tf * token2beta[token]))
vector[:4]
[(52, np.float64(7.906890595608518)),
 (61, np.float64(5.8479969065549495)),
 (33, np.float64(1.3219280948873617)),
 (32, np.float64(2.5224042154522373))]
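A sketch of how IDF weights like `token2beta` above could be computed over a toy corpus, assuming the common definition \(\beta_t = \log_2(N / \mathrm{df}_t)\); the corpus and the exact weighting scheme are illustrative assumptions, not the ones microtc uses internally.

```python
import numpy as np

# Toy corpus of tokenized documents (made up for illustration)
corpus = [['robo', 'en', 'la', 'calle'],
          ['policia', 'en', 'la', 'zona'],
          ['robo', 'de', 'auto']]
N = len(corpus)

# Document frequency: in how many documents each token appears
df = {}
for doc in corpus:
    for tok in set(doc):
        df[tok] = df.get(tok, 0) + 1

# IDF weight per token: rarer tokens get larger weights
token2beta = {tok: np.log2(N / n) for tok, n in df.items()}
print(token2beta['robo'])  # log2(3/2) ≈ 0.585
print(token2beta['auto'])  # log2(3/1) ≈ 1.585
```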
Question
Which of the following representations do you consider to produce a larger vocabulary?
from microtc.textmodel import TextModel
from microtc.params import OPTION_NONE, OPTION_DELETE
from sklearn.svm import LinearSVC
from sklearn.model_selection import KFold
from sklearn.metrics import recall_score
import numpy as np
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
perf = []
for tr, vs in kfold.split(training_set):
    train = [training_set[i] for i in tr]
    val = [training_set[i] for i in vs]
    tm = TextModel(token_list=[-1], num_option=OPTION_NONE,
                   usr_option=OPTION_DELETE, url_option=OPTION_DELETE,
                   emo_option=OPTION_NONE, lc=True, del_dup=False,
                   del_punc=True, del_diac=True).fit(train)
    labels = [x['klass'] for x in train]
    m = LinearSVC(dual='auto').fit(tm.transform(train), labels)
    hy = m.predict(tm.transform(val))
    _ = recall_score([x['klass'] for x in val], hy, average='macro')
    perf.append(_)
np.mean(perf)
np.float64(0.8235285558223533)