Poornima Institute of Engineering & Technology
INFOTEC
GitHub: https://github.com/INGEOTEC
WebPage: https://ingeotec.github.io/
Definition
The aim is the classification of documents into a fixed number of predefined categories.
Polarity
El día de mañana no podré ir con ustedes a la librería (Tomorrow I will not be able to go with you to the bookstore)
Negative
Polarity
Positive, Negative, Neutral
Emotion (Multiclass)
Event (Binary)
Gender
Man, Woman, Nonbinary, …
Age
Child, Teen, Adult
Language Variety
Definition
Machine learning (ML) is a subfield of artificial intelligence that focuses on the development and implementation of algorithms capable of learning from data without being explicitly programmed.
Types of ML algorithms
Decision function
\(g(\mathbf x) = -0.78 x_1 + 0.60 x_2 - 0.88\)
Decision function
\(g(\mathbf x) = -0.78 x_1 + 0.60 x_2 - 0.88\)
The constant term \(-0.88\) is the bias \(w_0\).
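To make the decision rule concrete, a minimal sketch that evaluates \(g\) and assigns the class by its sign (the point x below is a hypothetical example):

import numpy as np

# Weights of the linear decision function g(x) = w_1 x_1 + w_2 x_2 + w_0
w = np.array([-0.78, 0.60])
w_0 = -0.88

def g(x):
    """Evaluate the decision function at a two-dimensional point."""
    return np.dot(w, x) + w_0

x = np.array([0.5, 2.0])          # hypothetical point
label = 1 if g(x) > 0 else 0      # class given by the sign of g(x)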
Decision function
| | text | klass |
|---|---|---|
| 0 | Ese señor melgarejo q está incurso en investig... | 0 |
| 1 | Confirmaron que uno de los detenidos sería el ... | 1 |
| 2 | Go pecho en bombe | 0 |
| 3 | #eb Asesinada una sobrina de Ángel María Villa... | 1 |
| 4 | Ojo aquí @iunida https://t.co/VWtvb3g7SY | 0 |
| 5 | VOA- La Policía de Texas reportó que un hombre... | 1 |
| 6 | Cada domingo, con escobas, picos y palas salen... | 0 |
| 7 | La Audiencia de A Coruña inicia hoy el juicio ... | 1 |
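A hedged sketch of how such a view can be produced with pandas; the miniature training_set below is hypothetical and only mirrors the structure (a 'text' and a 'klass' field) of the examples shown above:

import pandas as pd

training_set = [{'text': 'Go pecho en bombe', 'klass': 0},
                {'text': 'Ojo aquí @iunida https://t.co/VWtvb3g7SY', 'klass': 0}]
pd.DataFrame(training_set)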
Question
Which of the following tasks does the previous training set belong to?
Problem
The independent variables are texts, whereas ML algorithms expect numeric vectors.
Solution
Token as vector
\(d \ll \lvert \mathcal V \rvert\) (Dense Vector)
\(d = \lvert \mathcal V \rvert\) (Sparse Vector)
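A minimal sketch of the two representations over a hypothetical five-token vocabulary (the dense dimension and its values are arbitrary choices for illustration):

import numpy as np

# Hypothetical vocabulary V; in practice |V| is much larger
vocabulary = ['el', 'día', 'no', 'podré', 'ir']
token2id = {token: i for i, token in enumerate(vocabulary)}

# Sparse (one-hot) vector: dimension d = |V|, a single nonzero entry
sparse_ir = np.zeros(len(vocabulary))
sparse_ir[token2id['ir']] = 1

# Dense vector: dimension d << |V| (here d = 2, values chosen arbitrarily)
dense_ir = np.array([0.12, -0.83])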
Algorithm
Procedure
\[\mathbf x = \sum_{t \in \mathcal U} \mathbf{v}_t\]
Unit Vector
\[\mathbf x = \frac{\sum_{t \in \mathcal U} \mathbf v_t}{\lVert \sum_{t \in \mathcal U} \mathbf v_t \rVert} \]
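A minimal sketch of the procedure above, assuming each token of the utterance \(\mathcal U\) already has an associated vector \(\mathbf v_t\) (the vectors below are hypothetical):

import numpy as np

# Hypothetical token vectors v_t (e.g., the dense vectors of the previous slide)
vectors = {'no': np.array([0.30, -0.10]),
           'podré': np.array([-0.20, 0.50]),
           'ir': np.array([0.12, -0.83])}
utterance = ['no', 'podré', 'ir']

# x is the sum of the vectors of the tokens in the utterance U
x = np.sum([vectors[t] for t in utterance], axis=0)

# Unit vector: the sum divided by its Euclidean norm
x_unit = x / np.linalg.norm(x)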
Text → Text Normalizer → Tokenizer → …
Common Types
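Assuming the common types referred to here include word n-grams and character q-grams, a minimal sketch using microtc's TextModel (in this library, negative entries of token_list denote word n-grams and positive entries character q-grams):

from microtc.textmodel import TextModel

text = 'no podré ir con ustedes'
words = TextModel(token_list=[-1]).tokenize(text)    # word unigrams
bigrams = TextModel(token_list=[-2]).tokenize(text)  # word bigrams
qgrams = TextModel(token_list=[3]).tokenize(text)    # character 3-grams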
from EvoMSA.utils import Download
from microtc.utils import tweet_iterator
from os.path import isdir, isfile
import pandas as pd
from random import shuffle

# Download the Delitos (crime-related events) dataset released by INGEOTEC
URL = 'https://github.com/INGEOTEC/Delitos/releases/download/Datos/delitos.zip'
if not isfile('delitos.zip'):
    Download(URL, 'delitos.zip')
# Extract the password-protected archive; the ! syntax runs a shell command in a notebook
if not isdir('delitos'):
    !unzip -Pingeotec delitos.zip
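The remaining examples assume a training_set variable; a minimal sketch of how it could be loaded with tweet_iterator (the file name inside delitos.zip is an assumption, adjust it to the actual content of the archive):

from microtc.utils import tweet_iterator

# Each element is a dict with at least the keys 'text' and 'klass'
training_set = list(tweet_iterator('delitos/delitos_ingeotec_Es_train.json'))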
Term Frequency - Inverse Document Frequency (TF-IDF)
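For reference, the TF-IDF weight of token \(t\) in document \(d\) is computed as (the exact IDF normalization used by the library may differ):

\[\textsf{tfidf}(t, d) = \textsf{tf}(t, d) \cdot \log_2 \frac{N}{\textsf{df}(t)}\]

where \(\textsf{tf}(t, d)\) is the frequency of \(t\) in \(d\), \(N\) is the number of training documents, and \(\textsf{df}(t)\) is the number of documents that contain \(t\).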
import numpy as np

# tm is the fitted text model; token2id maps each token to its identifier
# and token2beta to its IDF weight (both built from tm)
text = training_set[3]['text']
tokens = tm.tokenize(text)
vector = []
# Term frequency (tf) of each distinct token times its IDF weight
for token, tf in zip(*np.unique(tokens, return_counts=True)):
    if token not in token2id:
        continue
    vector.append((token2id[token], tf * token2beta[token]))
vector[:4]
[(54, 7.906890595608518),
(58, 5.8479969065549495),
(6, 1.3219280948873617),
(0, 2.5224042154522373)]
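The loop above builds the TF-IDF vector of a single text token by token; for a whole collection, the fitted text model can do the same in one call (a sketch, assuming tm and training_set are the objects used above; in microtc, transform returns a sparse matrix with one row per document):

X = tm.transform(training_set)
X.shape   # (number of documents, vocabulary size)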
Question
Which of the following representations do you think produces a larger vocabulary?
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import recall_score
from sklearn.svm import LinearSVC
from microtc.textmodel import TextModel
from microtc.params import OPTION_NONE, OPTION_DELETE

# 5-fold cross-validation on the training set
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
perf = []
for tr, vs in kfold.split(training_set):
    train = [training_set[i] for i in tr]
    val = [training_set[i] for i in vs]
    # Text model: word unigrams, users and URLs removed, lowercased,
    # punctuation and diacritics removed
    tm = TextModel(token_list=[-1], num_option=OPTION_NONE,
                   usr_option=OPTION_DELETE, url_option=OPTION_DELETE,
                   emo_option=OPTION_NONE, lc=True, del_dup=False,
                   del_punc=True, del_diac=True).fit(train)
    labels = [x['klass'] for x in train]
    # Linear SVM trained on the TF-IDF representation
    m = LinearSVC(dual='auto').fit(tm.transform(train), labels)
    hy = m.predict(tm.transform(val))
    # Macro-recall on the validation fold
    perf.append(recall_score([x['klass'] for x in val], hy, average='macro'))
np.mean(perf)
0.8235285558223533
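Once the cross-validated macro-recall is considered acceptable, a final model can be trained on the whole training set; a minimal sketch that follows the same steps as the loop above (the text passed to predict is a hypothetical example):

# Fit the text model and the linear SVM on all the training data
tm = TextModel(token_list=[-1], num_option=OPTION_NONE,
               usr_option=OPTION_DELETE, url_option=OPTION_DELETE,
               emo_option=OPTION_NONE, lc=True, del_dup=False,
               del_punc=True, del_diac=True).fit(training_set)
labels = [x['klass'] for x in training_set]
m = LinearSVC(dual='auto').fit(tm.transform(training_set), labels)

# Predict the class of a new text ("There was a robbery at the corner store")
m.predict(tm.transform(['Hubo un asalto en la tienda de la esquina']))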