Sentiment Analysis

Poornima Institute of Engineering & Technology

INFOTEC

Who / Where

INGEOTEC

GitHub: https://github.com/INGEOTEC
WebPage: https://ingeotec.github.io/

Aguascalientes, México

Introduction

Text Classification

Definition

The aim is the classification of documents into a fixed number of predefined categories.

Polarity

El día de mañana no podré ir con ustedes a la librería (Tomorrow I will not be able to go with you to the bookstore)

Negative

Text Classification Tasks

Polarity

Positive, Negative, Neutral

Emotion (Multiclass)

  • Anger, Joy, …
  • Intensity of an emotion

Event (Binary)

  • Violent
  • Crime

Profiling

Gender

Man, Woman, Nonbinary, …

Age

Child, Teen, Adult

Language Variety

  • Spanish: Spain, Cuba, Argentina, México, …
  • English: United States, England, …

Approach

Machine Learning

Definition

Machine learning (ML) is a subfield of artificial intelligence that focuses on the development and implementation of algorithms capable of learning from data without being explicitly programmed.

Types of ML algorithms

  • Unsupervised Learning
  • Supervised Learning
  • Reinforcement Learning

Supervised Learning (Multiclass)

Supervised Learning (Binary)

Supervised Learning (Classification)

Decision function

\(g(\mathbf x) = -0.78 x_1 + 0.60 x_2 - 0.88\)

Supervised Learning (Geometry)

Decision function

\(g(\mathbf x) = -0.78 x_1 + 0.60 x_2 - 0.88\)

Supervised Learning (Geometry 2)

\(w_0\)

  • \(w_0 = -0.88\)
  • \(w_0 = 0.88\)

Supervised Learning (Geometry 3)

Decision function

  • \(g_{svm}(\mathbf x) = -0.78 x_1 + 0.60 x_2 - 0.88\)
  • \(g_{lr}(\mathbf x) = -2.58 x_1 + 0.84 x_2 - 3.06\)
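
To make the geometry concrete, a minimal sketch (using the coefficients shown above) evaluates both decision functions at a point; the predicted class is the sign of the output:

def g_svm(x):
    # w = (-0.78, 0.60), w_0 = -0.88
    return -0.78 * x[0] + 0.60 * x[1] - 0.88

def g_lr(x):
    # w = (-2.58, 0.84), w_0 = -3.06
    return -2.58 * x[0] + 0.84 * x[1] - 3.06

# both classifiers place the point (-1, 1) on the positive side
round(g_svm([-1, 1]), 2), round(g_lr([-1, 1]), 2)
(0.5, 0.36)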

Starting point

Training set

text klass
0 🛑 Detengamos a los robots asesinos antes de qu... 0
1 Policías municipales de Querétaro detectaron y... 1
2 @MicaSuarez12 @cosmicvelasco AINARA LA CONCHAB... 0
3 "Un profesor de Jerez ata con una cuerda y amo... 1
4 Del Califa de Dios y Su sirviente AlMahdi al E... 0
5 .@VirguezFranklin @jguaido Detenidos venezolan... 1
6 ECA: Empagliflocina (iSGLT2) vs placebo. 💡\nAq... 0
7 Acusan a una pareja por robar un millón de dól... 1

Quiz

Question

Which of the following tasks does the previous training set belong to?

  1. Polarity
  2. Emotion identification
  3. Aggressiveness detection
  4. Profiling

Training set (2)

Problem

The independent variables are texts

Solution

  • Represent the texts in a suitable format for the classifier
    • Token as a vector
      • Sparse vector
      • Dense vector
    • Utterance as a vector

Text Representation

Token as Vector

Token as vector

  • The idea is that each token \(t\) is associated with a vector \(\mathbf v_t \in \mathbb R^d\)
  • Let \(\mathcal V\) denote the set of distinct tokens
  • \(d\) corresponds to the dimension of the vector

\(d \ll \lvert \mathcal V \rvert\) (Dense Vector)

  • GloVe
  • Word2vec
  • fastText
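
A minimal sketch of the idea, using made-up 3-dimensional vectors instead of real embeddings (in practice the vectors are learned by GloVe, word2vec, or fastText, and \(d\) is in the hundreds):

import numpy as np
# toy embedding table; the values here are illustrative only
embedding = {'like': np.array([0.1, -0.3, 0.7]),
             'playing': np.array([0.2, 0.5, -0.1])}
embedding['like'].shape
(3,)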

Token as Vector (2)

\(d = \lvert \mathcal V \rvert\) (Sparse Vector)

  • \(\forall_{i \neq j} \mathbf v_i \cdot \mathbf v_j = 0\)
  • \(\mathbf v_i \in \mathbb R^d\)
  • \(\mathbf v_j \in \mathbb R^d\)

Algorithm

  • Sort the vocabulary \(\mathcal V\)
  • Associate the \(i\)-th token with
  • \((\ldots, 0, \overbrace{\beta_i}^i, 0, \ldots)^\intercal\)
  • where \(\beta_i > 0\)
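
A minimal sketch of this algorithm with \(\beta_i = 1\) (one-hot vectors); the vocabulary and variable names are toy examples:

vocab = sorted(['playing', 'like', 'football'])
d = len(vocab)
tok2id = {token: i for i, token in enumerate(vocab)}
# the i-th token maps to a vector with beta_i = 1 at position i
v = [0.0] * d
v[tok2id['like']] = 1.0
v
[0.0, 1.0, 0.0]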

Utterance as Vector

Procedure

\[\mathbf x = \sum_{t \in \mathcal U} \mathbf{v}_t\]

  • where \(\mathcal{U}\) corresponds to all the tokens of the utterance
  • The vector \(\mathbf{v}_t\) is associated with token \(t\)

Unit Vector

\[\mathbf x = \frac{\sum_{t \in \mathcal U} \mathbf v_t}{\lVert \sum_{t \in \mathcal U} \mathbf v_t \rVert} \]
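
Continuing the toy sparse vectors above, a sketch of the procedure with numpy: sum the vectors of the tokens and scale the result to a unit vector.

import numpy as np
tokens = ['like', 'playing', 'like']
x = np.zeros(d)
for t in tokens:
    x[tok2id[t]] += 1.0      # add the one-hot vector of each token
x /= np.linalg.norm(x)       # scale to a unit vector
x
array([0.        , 0.89442719, 0.4472136 ])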

Tokens

flowchart LR
    Entrada([Text]) -->  Norm[Text Normalizer]
    Norm --> Seg[Tokenizer]
    Seg --> Terminos(...)

Text Normalization

  • User
  • URL
  • Entity
  • Case folding
  • Punctuation
  • Diacritic

Diacritic (remove)

import unicodedata
text = 'México'
output = ""
# NFD decomposes each character into base + combining marks
for x in unicodedata.normalize('NFD', text):
    o = ord(x)
    # skip combining diacritical marks (U+0300 to U+036F)
    if 0x0300 <= o <= 0x036F:
        continue
    output += x
output
'Mexico'

Text Normalization

Case folding (lowercase)

text = "México"
output = text.lower()
output
'méxico'

URL (replace)

import re
text = "go http://google.com, and find out"
# replace URLs with a placeholder (the regex also consumes the comma)
output = re.sub(r"https?://\S+", "_url", text)
output
'go _url and find out'
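
User (replace)

Usernames can be handled the same way; a sketch assuming Twitter-style handles:

import re
text = "go with @mgraffg and find out"
output = re.sub(r"@\w+", "_usr", text)
output
'go with _usr and find out'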

Tokenizer

Common Types

  • Words
  • n-grams (Words)
  • q-grams (Characters)
  • skip-grams (example below)

Words

text = 'I like playing football on Saturday'
words = text.split()
words
['I', 'like', 'playing', 'football', 'on', 'Saturday']

Tokenizer (2)

n-grams

text = 'I like playing football on Saturday'
words = text.split()
n = 3
n_grams = []
# slide a window of n consecutive words over the text
for a in zip(*[words[i:] for i in range(n)]):
    n_grams.append("~".join(a))
n_grams
['I~like~playing',
 'like~playing~football',
 'playing~football~on',
 'football~on~Saturday']

q-grams

text = 'I like playing'
q = 4
q_grams = []
# slide a window of q consecutive characters over the text
for a in zip(*[text[i:] for i in range(q)]):
    q_grams.append("".join(a))
q_grams
['I li',
 ' lik',
 'like',
 'ike ',
 'ke p',
 'e pl',
 ' pla',
 'play',
 'layi',
 'ayin',
 'ying']
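
Tokenizer (3)

skip-grams

skip-grams pair words that are not adjacent; a minimal sketch of one common variant (word pairs skipping exactly one token):

text = 'I like playing football'
words = text.split()
# pairs of words separated by exactly one word
skip_grams = []
for a, b in zip(words, words[2:]):
    skip_grams.append("~".join((a, b)))
skip_grams
['I~playing', 'like~football']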

\(\mu\)-TC

TextModel

from microtc import TextModel
from microtc.params import OPTION_DELETE, OPTION_NONE
tm = TextModel(token_list=[-1],
               num_option=OPTION_NONE,
               usr_option=OPTION_DELETE,
               url_option=OPTION_DELETE,
               emo_option=OPTION_NONE,
               lc=True,
               del_dup=False,
               del_punc=True,
               del_diac=True)
text = 'I like playing football with @mgraffg'
tm.tokenize(text)
['i', 'like', 'playing', 'football', 'with']

\(\mu\)-TC (2)

TextModel

tm = TextModel(token_list=[-2, -1, 6],
               num_option=OPTION_NONE,
               usr_option=OPTION_DELETE,
               url_option=OPTION_DELETE,
               emo_option=OPTION_NONE, 
               lc=True, del_dup=False,
               del_punc=True, del_diac=True)
text = 'I like playing...'
tm.tokenize(text)
['i~like',
 'like~playing',
 'i',
 'like',
 'playing',
 'q:~i~lik',
 'q:i~like',
 'q:~like~',
 'q:like~p',
 'q:ike~pl',
 'q:ke~pla',
 'q:e~play',
 'q:~playi',
 'q:playin',
 'q:laying',
 'q:aying~']

Training set

from EvoMSA.utils import Download
from microtc.utils import tweet_iterator
from os.path import isdir, isfile

URL = 'https://github.com/INGEOTEC/Delitos/releases/download/Datos/delitos.zip'
if not isfile('delitos.zip'):
    Download(URL, 'delitos.zip')
if not isdir('delitos'):
    # the archive is password-protected; -P supplies the password
    !unzip -Pingeotec delitos.zip

Utterance as Vector

TextModel

tm = TextModel(token_list=[-1],
               num_option=OPTION_NONE,
               usr_option=OPTION_DELETE,
               url_option=OPTION_DELETE,
               emo_option=OPTION_NONE, 
               lc=True, del_dup=False,
               del_punc=True, del_diac=True)

Tokenizer

from microtc.utils import tweet_iterator
fname = 'delitos/delitos_ingeotec_Es_train.json'
training_set = list(tweet_iterator(fname))
tm.tokenize(training_set[0])[:3]
['este', 'caso', 'tiene']

Vocabulary

from microtc.utils import Counter
voc = Counter()
for tweet in training_set:
    # document frequency: count each token once per text
    tokens = set(tm.tokenize(tweet))
    voc.update(tokens)
voc.most_common(n=3)
[('de', 980), ('en', 804), ('la', 655)]

Utterance as Vector (2)

Inverse Document Frequency (IDF)

import numpy as np

token2id = {}
token2beta = {}
# update_calls is the number of texts used to build voc
N = np.log2(voc.update_calls)
for i, (k, n) in enumerate(voc.items()):
    token2id[k] = i
    # IDF: log2(#texts) - log2(#texts containing token k)
    token2beta[k] = N - np.log2(n)

Term Frequency - IDF

text = training_set[3]['text']
tokens = tm.tokenize(text)
vector = []
# tf is the frequency of each distinct token in the text
for token, tf in zip(*np.unique(tokens, return_counts=True)):
    if token not in token2id:
        continue
    vector.append((token2id[token], tf * token2beta[token]))
vector[:4]
[(55, np.float64(7.906890595608518)),
 (59, np.float64(5.8479969065549495)),
 (21, np.float64(1.3219280948873617)),
 (25, np.float64(2.5224042154522373))]

Utterance as Vector (3)

\(\mu\)-TC

tm.fit(training_set)
<microtc.textmodel.TextModel at 0x7f8b6e9cb220>

Utterance as Vector

text = training_set[3]['text']
tm[text][:4]
[(3523, np.float64(0.08478070043211834)),
 (5114, np.float64(0.07639809274356325)),
 (6569, np.float64(0.35264239129026387)),
 (3340, np.float64(0.1965574235982234))]
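
The weights above follow the TF-IDF scheme described earlier, and the model appears to normalize the representation as in the unit-vector formula; a quick check:

import numpy as np
np.linalg.norm([w for _, w in tm[text]])  # should be (approximately) 1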

Quiz

Question

Which of the following representations do you think produces a larger vocabulary?

A

tmA = TextModel(token_list=[-1, 3],
                num_option=OPTION_NONE,
                usr_option=OPTION_DELETE,
                url_option=OPTION_DELETE,
                emo_option=OPTION_NONE, 
                lc=True, del_dup=False,
                del_punc=True,
                del_diac=True
               ).fit(training_set)

B

tmB = TextModel(token_list=[-1, 6],
                num_option=OPTION_NONE,
                usr_option=OPTION_DELETE,
                url_option=OPTION_DELETE,
                emo_option=OPTION_NONE, 
                lc=True, del_dup=False,
                del_punc=True,
                del_diac=True
               ).fit(training_set)

Text Classification

Procedure

Text as Vectors

X = tm.transform(training_set)

Training a Classifier

from sklearn.svm import LinearSVC
labels = [x['klass'] for x in training_set]
m = LinearSVC(dual='auto').fit(X, labels)

Predict a text

X = tm.transform(['Buenos días']) # good morning
m.predict(X)
array([0])
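
LinearSVC implements a linear decision function like the \(g(\mathbf x)\) of the geometry slides; its sign determines the class, and a negative value corresponds to the prediction 0 obtained above:

# negative decision value -> class 0
m.decision_function(X) < 0
array([ True])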

Performance

import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import recall_score

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
perf = []
for tr, vs in kfold.split(training_set):
    train = [training_set[i] for i in tr]
    val = [training_set[i] for i in vs]
    # fit the representation and the classifier on the training folds
    tm = TextModel(token_list=[-1], num_option=OPTION_NONE,
                   usr_option=OPTION_DELETE, url_option=OPTION_DELETE,
                   emo_option=OPTION_NONE, lc=True, del_dup=False,
                   del_punc=True, del_diac=True).fit(train)
    labels = [x['klass'] for x in train]
    m = LinearSVC(dual='auto').fit(tm.transform(train), labels)
    hy = m.predict(tm.transform(val))
    # macro-averaged recall on the validation fold
    score = recall_score([x['klass'] for x in val], hy, average='macro')
    perf.append(score)
np.mean(perf)
np.float64(0.8235285558223533)

Conclusions

  • Describe a supervised learning approach to tackle text classification.
  • Explain the geometry of linear classifiers.
  • Use a procedure to represent a text as a vector.
  • Measure the performance of a text classifier.

Personal webpage