Poornima Institute of Engineering & Technology
INFOTEC
GitHub: https://github.com/INGEOTEC
WebPage: https://ingeotec.github.io/
Definition
The aim is the classification of documents into a fixed number of predefined categories.
Polarity
El día de mañana no podré ir con ustedes a la librería (Tomorrow I will not be able to go with you to the bookstore)
Negative
Polarity
Positive, Negative, Neutral
Emotion (Multiclass)
Event (Binary)
Gender
Man, Woman, Nonbinary, …
Age
Child, Teen, Adult
Language Variety
Definition
Machine learning (ML) is a subfield of artificial intelligence that focuses on the development and implementation of algorithms capable of learning from data without being explicitly programmed.
Types of ML algorithms
Decision function
\(g(\mathbf x) = -0.78 x_1 + 0.60 x_2 - 0.88\)
The constant term \(w_0 = -0.88\) is the bias (intercept) of the decision function.
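The weights of a decision function like this are learned from labelled examples rather than hand-coded. A minimal sketch, using the same LinearSVC classifier that appears later in these slides, on a made-up two-dimensional dataset; the fitted weights will not match the values above:

import numpy as np
from sklearn.svm import LinearSVC

# Toy two-dimensional dataset (hypothetical): two clouds of points,
# labelled 0 and 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.5, size=(50, 2)),
               rng.normal(1, 0.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# The classifier learns g(x) = w_1 x_1 + w_2 x_2 + w_0 from the data.
m = LinearSVC(dual='auto').fit(X, y)
w_1, w_2 = m.coef_[0]
w_0 = m.intercept_[0]

# The sign of the decision function gives the predicted class.
x = np.array([0.5, -0.2])
g = w_1 * x[0] + w_2 * x[1] + w_0
print(g, m.decision_function([x]))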
Decision function
|   | text | klass |
|---|---|---|
| 0 | 🛑 Detengamos a los robots asesinos antes de qu... | 0 |
| 1 | Policías municipales de Querétaro detectaron y... | 1 |
| 2 | @MicaSuarez12 @cosmicvelasco AINARA LA CONCHAB... | 0 |
| 3 | "Un profesor de Jerez ata con una cuerda y amo... | 1 |
| 4 | Del Califa de Dios y Su sirviente AlMahdi al E... | 0 |
| 5 | .@VirguezFranklin @jguaido Detenidos venezolan... | 1 |
| 6 | ECA: Empagliflocina (iSGLT2) vs placebo. 💡\nAq... | 0 |
| 7 | Acusan a una pareja por robar un millón de dól... | 1 |
Question
Which of the following tasks does the previous training set belong to?
Problem
The independent variables are texts, while a classifier requires numeric feature vectors.
Solution
Token as vector
\(d \ll \lvert \mathcal V \rvert\) (Dense Vector)
\(d = \lvert \mathcal V \rvert\) (Sparse Vector)
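A minimal sketch contrasting the two options on a made-up five-token vocabulary; the embedding dimension \(d = 3\) and all vector values are illustrative only:

import numpy as np

# Toy vocabulary (hypothetical); |V| = 5.
vocabulary = ['the', 'robot', 'walks', 'fast', 'today']
token2id = {token: i for i, token in enumerate(vocabulary)}

# Sparse representation: d = |V|; each token is a one-hot vector.
def one_hot(token):
    v = np.zeros(len(vocabulary))
    v[token2id[token]] = 1
    return v

# Dense representation: d << |V|; each token is a low-dimensional vector,
# e.g., a learned embedding (random numbers stand in for learned values).
d = 3
embeddings = np.random.default_rng(0).normal(size=(len(vocabulary), d))

print(one_hot('robot'))               # length |V|, mostly zeros
print(embeddings[token2id['robot']])  # length d, all entries used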
Algorithm
Procedure
\[\mathbf x = \sum_{t \in \mathcal U} \mathbf{v}_t\]
Unit Vector
\[\mathbf x = \frac{\sum_{t \in \mathcal U} \mathbf v_t}{\lVert \sum_{t \in \mathcal U} \mathbf v_t \rVert} \]
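A sketch of the procedure with made-up one-hot token vectors: the text is represented by the sum of the vectors of its tokens, and the unit vector divides that sum by its Euclidean norm:

import numpy as np

# Toy token vectors (hypothetical): one-hot vectors of a 4-token vocabulary.
token2vec = {'good': np.array([1., 0., 0., 0.]),
             'morning': np.array([0., 1., 0., 0.]),
             'very': np.array([0., 0., 1., 0.]),
             'happy': np.array([0., 0., 0., 1.])}

tokens = ['good', 'morning', 'good', 'happy']  # tokens of the text

# x = sum of the vectors of the tokens in the text
x = np.sum([token2vec[t] for t in tokens], axis=0)

# Unit vector: divide the sum by its Euclidean norm
x_unit = x / np.linalg.norm(x)
print(x, x_unit)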
Common Types
from EvoMSA.utils import Download
from microtc.utils import tweet_iterator
from os.path import isdir, isfile
import pandas as pd
from random import shuffle

# Download the Delitos (crimes) dataset if it is not already available
# locally, and unzip the password-protected archive.
URL = 'https://github.com/INGEOTEC/Delitos/releases/download/Datos/delitos.zip'
if not isfile('delitos.zip'):
    Download(URL, 'delitos.zip')
if not isdir('delitos'):
    !unzip -Pingeotec delitos.zip
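The archive contains the labelled tweets in JSON Lines format; a minimal sketch of how the training set used below could be loaded with the imports above. The filename inside delitos/ is an assumption and may differ in the actual release:

# Hypothetical filename; check the contents of delitos/ after unzipping.
fname = 'delitos/delitos_ingeotec_Es_train.json'
training_set = list(tweet_iterator(fname))  # one dict per tweet
shuffle(training_set)
# Each element has at least the keys 'text' and 'klass'.
training_set[0]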
Term Frequency - Inverse Document Frequency (TF-IDF)
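A common formulation of the weighting scheme is shown below; the exact variant used by the library (e.g., the base of the logarithm or additional normalization) may differ. In the snippet that follows, token2beta appears to store the IDF component of each token.

\[\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \log \frac{N}{\text{df}(t)},\]

where \(\text{tf}(t, d)\) is the frequency of token \(t\) in document \(d\), \(N\) is the number of documents in the collection, and \(\text{df}(t)\) is the number of documents that contain \(t\).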
import numpy as np

# Build the TF-IDF vector of one document by hand: token2id maps a token
# to its index in the vocabulary and token2beta to its weight (IDF).
text = training_set[3]['text']
tokens = tm.tokenize(text)
vector = []
for token, tf in zip(*np.unique(tokens, return_counts=True)):
    if token not in token2id:
        continue  # out-of-vocabulary token
    vector.append((token2id[token], tf * token2beta[token]))
vector[:4]
[(55, np.float64(7.906890595608518)),
(59, np.float64(5.8479969065549495)),
(21, np.float64(1.3219280948873617)),
(25, np.float64(2.5224042154522373))]
Question
Which of the following representations do you consider to produce a larger vocabulary?
from sklearn.model_selection import KFold
from sklearn.metrics import recall_score
from sklearn.svm import LinearSVC
from microtc.textmodel import TextModel
from microtc.params import OPTION_NONE, OPTION_DELETE
import numpy as np

# Estimate performance with 5-fold cross-validation: in each fold, fit the
# text model and a linear SVM on the training split and compute macro
# recall on the validation split.
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
perf = []
for tr, vs in kfold.split(training_set):
    train = [training_set[i] for i in tr]
    val = [training_set[i] for i in vs]
    tm = TextModel(token_list=[-1], num_option=OPTION_NONE,
                   usr_option=OPTION_DELETE, url_option=OPTION_DELETE,
                   emo_option=OPTION_NONE, lc=True, del_dup=False,
                   del_punc=True, del_diac=True).fit(train)
    labels = [x['klass'] for x in train]
    m = LinearSVC(dual='auto').fit(tm.transform(train), labels)
    hy = m.predict(tm.transform(val))
    perf.append(recall_score([x['klass'] for x in val], hy,
                             average='macro'))
np.mean(perf)
np.float64(0.8235285558223533)