Sentiment Analysis

Poornima Institute of Engineering & Technology

INFOTEC

Who / Where

INGEOTEC

GitHub: https://github.com/INGEOTEC
WebPage: https://ingeotec.github.io/

Aguascalientes, México

Introduction

Text Classification

Definition

The aim is the classification of documents into a fixed number of predefined categories.

Polarity

El día de mañana no podré ir con ustedes a la librería (Tomorrow I will not be able to go with you to the bookstore)

Negative

Text Classification Tasks

Polarity

Positive, Negative, Neutral

Emotion (Multiclass)

  • Anger, Joy, …
  • Intensity of an emotion

Event (Binary)

  • Violent
  • Crime

Profiling

Gender

Man, Woman, Nonbinary, …

Age

Child, Teen, Adult

Language Variety

  • Spanish: Spain, Cuba, Argentina, México, …
  • English: United States, England, …

Approach

Machine Learning

Definition

Machine learning (ML) is a subfield of artificial intelligence that focuses on the development and implementation of algorithms capable of learning from data without being explicitly programmed.

Types of ML algorithms

  • Unsupervised Learning
  • Supervised Learning
  • Reinforcement Learning

Supervised Learning (Multiclass)

Supervised Learning (Binary)

Supervised Learning (Classification)

Decision function

\(g(\mathbf x) = -0.78 x_1 + 0.60 x_2 - 0.88\)

Supervised Learning (Geometry)

Decision function

\(g(\mathbf x) = -0.78 x_1 + 0.60 x_2 - 0.88\)

Supervised Learning (Geometry 2)

\(w_0\)

  • \(w_0 = -0.88\)
  • \(w_0 = 0.88\)

Supervised Learning (Geometry 3)

Decision function

  • \(g_{svm}(\mathbf x) = -0.78 x_1 + 0.60 x_2 - 0.88\)
  • \(g_{lr}(\mathbf x) = -2.58 x_1 + 0.84 x_2 - 3.06\)
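
To make the geometry concrete, a minimal sketch (using the coefficients shown above) evaluates both decision functions at a point; the predicted class is the sign of the output:

def g_svm(x):
    # w = (-0.78, 0.60), w_0 = -0.88
    return -0.78 * x[0] + 0.60 * x[1] - 0.88

def g_lr(x):
    # w = (-2.58, 0.84), w_0 = -3.06
    return -2.58 * x[0] + 0.84 * x[1] - 3.06

# both classifiers place the point (-1, 1) on the positive side
round(g_svm([-1, 1]), 2), round(g_lr([-1, 1]), 2)
(0.5, 0.36)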

Starting point

Training set

text klass
0 🛑 Detengamos a los robots asesinos antes de qu... 0
1 Policías municipales de Querétaro detectaron y... 1
2 @MicaSuarez12 @cosmicvelasco AINARA LA CONCHAB... 0
3 "Un profesor de Jerez ata con una cuerda y amo... 1
4 Del Califa de Dios y Su sirviente AlMahdi al E... 0
5 .@VirguezFranklin @jguaido Detenidos venezolan... 1
6 ECA: Empagliflocina (iSGLT2) vs placebo. 💡\nAq... 0
7 Acusan a una pareja por robar un millón de dól... 1

Quiz

Question

Which of the following tasks does the previous training set belong to?

  1. Polarity
  2. Emotion identification
  3. Aggressiveness detection
  4. Profiling

Training set (2)

Problem

The independent variables are texts

Solution

  • Represent the texts in a suitable format for the classifier
    • Token as a vector
      • Sparse vector
      • Dense vector
    • Utterance as a vector

Text Representation

Token as Vector

Token as vector

  • The idea is that each token \(t\) is associated with a vector \(\mathbf v_t \in \mathbb R^d\)
  • Let \(\mathcal V\) denote the set of distinct tokens
  • \(d\) corresponds to the dimension of the vector

\(d \ll \lvert \mathcal V \rvert\) (Dense Vector)

  • GloVe
  • Word2vec
  • fastText
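
A minimal sketch of the idea, using made-up 3-dimensional vectors instead of real embeddings (in practice the vectors are learned by GloVe, word2vec, or fastText, and \(d\) is in the hundreds):

import numpy as np
# toy embedding table; the values here are illustrative only
embedding = {'like': np.array([0.1, -0.3, 0.7]),
             'playing': np.array([0.2, 0.5, -0.1])}
embedding['like'].shape
(3,)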

Token as Vector (2)

\(d = \lvert \mathcal V \rvert\) (Sparse Vector)

  • \(\forall_{i \neq j} \mathbf v_i \cdot \mathbf v_j = 0\)
  • \(\mathbf v_i \in \mathbb R^d\)
  • \(\mathbf v_j \in \mathbb R^d\)

Algorithm

  • Sort the vocabulary \(\mathcal V\)
  • Associate the \(i\)-th token with
  • \((\ldots, 0, \overbrace{\beta_i}^i, 0, \ldots)^\intercal\)
  • where \(\beta_i > 0\)
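
A minimal sketch of this algorithm with \(\beta_i = 1\) (one-hot vectors); the vocabulary and variable names are toy examples:

vocab = sorted(['playing', 'like', 'football'])
d = len(vocab)
tok2id = {token: i for i, token in enumerate(vocab)}
# the i-th token maps to a vector with beta_i = 1 at position i
v = [0.0] * d
v[tok2id['like']] = 1.0
v
[0.0, 1.0, 0.0]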

Utterance as Vector

Procedure

\[\mathbf x = \sum_{t \in \mathcal U} \mathbf{v}_t\]

  • where \(\mathcal{U}\) corresponds to all the tokens of the utterance
  • The vector \(\mathbf{v}_t\) is associated with token \(t\)

Unit Vector

\[\mathbf x = \frac{\sum_{t \in \mathcal U} \mathbf v_t}{\lVert \sum_{t \in \mathcal U} \mathbf v_t \rVert} \]
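
Continuing the toy sparse vectors above, a sketch of the procedure with numpy: sum the vectors of the tokens and scale the result to a unit vector.

import numpy as np
tokens = ['like', 'playing', 'like']
x = np.zeros(d)
for t in tokens:
    x[tok2id[t]] += 1.0      # add the one-hot vector of each token
x /= np.linalg.norm(x)       # scale to a unit vector
x
array([0.        , 0.89442719, 0.4472136 ])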

Tokens

flowchart LR
    Entrada([Text]) -->  Norm[Text Normalizer]
    Norm --> Seg[Tokenizer]
    Seg --> Terminos(...)

Text Normalization

  • User
  • URL
  • Entity
  • Case folding
  • Punctuation
  • Diacritic

Diacritic (remove)

import unicodedata
text = 'México'
output = ""
# NFD decomposes each character into base + combining marks
for x in unicodedata.normalize('NFD', text):
    o = ord(x)
    # skip combining diacritical marks (U+0300 to U+036F)
    if 0x0300 <= o <= 0x036F:
        continue
    output += x
output
'Mexico'

Text Normalization

Case folding (lowercase)

text = "México"
output = text.lower()
output
'méxico'

URL (replace)

import re
text = "go http://google.com, and find out"
# replace URLs with a placeholder (the regex also consumes the comma)
output = re.sub(r"https?://\S+", "_url", text)
output
'go _url and find out'
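
User (replace)

Usernames can be handled the same way; a sketch assuming Twitter-style handles:

import re
text = "go with @mgraffg and find out"
output = re.sub(r"@\w+", "_usr", text)
output
'go with _usr and find out'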

Tokenizer

Common Types

  • Words
  • n-grams (Words)
  • q-grams (Characters)
  • skip-grams (example below)

Words

text = 'I like playing football on Saturday'
words = text.split()
words
['I', 'like', 'playing', 'football', 'on', 'Saturday']

Tokenizer (2)

n-grams

text = 'I like playing football on Saturday'
words = text.split()
n = 3
n_grams = []
# slide a window of n consecutive words over the text
for a in zip(*[words[i:] for i in range(n)]):
    n_grams.append("~".join(a))
n_grams
['I~like~playing',
 'like~playing~football',
 'playing~football~on',
 'football~on~Saturday']

q-grams

text = 'I like playing'
q = 4
q_grams = []
# slide a window of q consecutive characters over the text
for a in zip(*[text[i:] for i in range(q)]):
    q_grams.append("".join(a))
q_grams
['I li',
 ' lik',
 'like',
 'ike ',
 'ke p',
 'e pl',
 ' pla',
 'play',
 'layi',
 'ayin',
 'ying']
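
Tokenizer (3)

skip-grams

skip-grams pair words that are not adjacent; a minimal sketch of one common variant (word pairs skipping exactly one token):

text = 'I like playing football'
words = text.split()
# pairs of words separated by exactly one word
skip_grams = []
for a, b in zip(words, words[2:]):
    skip_grams.append("~".join((a, b)))
skip_grams
['I~playing', 'like~football']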

\(\mu\)-TC

TextModel

from microtc import TextModel
from microtc.params import OPTION_DELETE, OPTION_NONE
tm = TextModel(token_list=[-1],
               num_option=OPTION_NONE,
               usr_option=OPTION_DELETE,
               url_option=OPTION_DELETE,
               emo_option=OPTION_NONE,
               lc=True,
               del_dup=False,
               del_punc=True,
               del_diac=True)
text = 'I like playing football with @mgraffg'
tm.tokenize(text)
['i', 'like', 'playing', 'football', 'with']

\(\mu\)-TC (2)

TextModel

tm = TextModel(token_list=[-2, -1, 6],
               num_option=OPTION_NONE,
               usr_option=OPTION_DELETE,
               url_option=OPTION_DELETE,
               emo_option=OPTION_NONE, 
               lc=True, del_dup=False,
               del_punc=True, del_diac=True)
text = 'I like playing...'
tm.tokenize(text)
['i~like',
 'like~playing',
 'i',
 'like',
 'playing',
 'q:~i~lik',
 'q:i~like',
 'q:~like~',
 'q:like~p',
 'q:ike~pl',
 'q:ke~pla',
 'q:e~play',
 'q:~playi',
 'q:playin',
 'q:laying',
 'q:aying~']

Training set

from EvoMSA.utils import Download
from microtc.utils import tweet_iterator
from os.path import isdir, isfile

URL = 'https://github.com/INGEOTEC/Delitos/releases/download/Datos/delitos.zip'
if not isfile('delitos.zip'):
    Download(URL, 'delitos.zip')
if not isdir('delitos'):
    # the archive is password-protected; -P supplies the password
    !unzip -Pingeotec delitos.zip

Utterance as Vector

TextModel

tm = TextModel(token_list=[-1],
               num_option=OPTION_NONE,
               usr_option=OPTION_DELETE,
               url_option=OPTION_DELETE,
               emo_option=OPTION_NONE, 
               lc=True, del_dup=False,
               del_punc=True, del_diac=True)

Tokenizer

from microtc.utils import tweet_iterator
fname = 'delitos/delitos_ingeotec_Es_train.json'
training_set = list(tweet_iterator(fname))
tm.tokenize(training_set[0])[:3]
['este', 'caso', 'tiene']

Vocabulary

from microtc.utils import Counter
voc = Counter()
for tweet in training_set:
    # document frequency: count each token once per text
    tokens = set(tm.tokenize(tweet))
    voc.update(tokens)
voc.most_common(n=3)
[('de', 980), ('en', 804), ('la', 655)]

Utterance as Vector (2)

Inverse Document Frequency (IDF)

import numpy as np

token2id = {}
token2beta = {}
# update_calls is the number of texts used to build voc
N = np.log2(voc.update_calls)
for i, (k, n) in enumerate(voc.items()):
    token2id[k] = i
    # IDF: log2(#texts) - log2(#texts containing token k)
    token2beta[k] = N - np.log2(n)

Term Frequency - IDF

text = training_set[3]['text']
tokens = tm.tokenize(text)
vector = []
# tf is the frequency of each distinct token in the text
for token, tf in zip(*np.unique(tokens, return_counts=True)):
    if token not in token2id:
        continue
    vector.append((token2id[token], tf * token2beta[token]))
vector[:4]
[(55, np.float64(7.906890595608518)),
 (59, np.float64(5.8479969065549495)),
 (21, np.float64(1.3219280948873617)),
 (25, np.float64(2.5224042154522373))]

Utterance as Vector (3)

\(\mu\)-TC

tm.fit(training_set)
<microtc.textmodel.TextModel at 0x7f8b6e9cb220>

Utterance as Vector

text = training_set[3]['text']
tm[text][:4]
[(3523, np.float64(0.08478070043211834)),
 (5114, np.float64(0.07639809274356325)),
 (6569, np.float64(0.35264239129026387)),
 (3340, np.float64(0.1965574235982234))]
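
The weights above follow the TF-IDF scheme described earlier, and the model appears to normalize the representation as in the unit-vector formula; a quick check:

import numpy as np
np.linalg.norm([w for _, w in tm[text]])  # should be (approximately) 1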

Quiz

Question

Which of the following representations do you think produces a larger vocabulary?

A

tmA = TextModel(token_list=[-1, 3],
                num_option=OPTION_NONE,
                usr_option=OPTION_DELETE,
                url_option=OPTION_DELETE,
                emo_option=OPTION_NONE, 
                lc=True, del_dup=False,
                del_punc=True,
                del_diac=True
               ).fit(training_set)

B

tmB = TextModel(token_list=[-1, 6],
                num_option=OPTION_NONE,
                usr_option=OPTION_DELETE,
                url_option=OPTION_DELETE,
                emo_option=OPTION_NONE, 
                lc=True, del_dup=False,
                del_punc=True,
                del_diac=True
               ).fit(training_set)

Text Classification

Procedure

Text as Vectors

X = tm.transform(training_set)

Training a Classifier

from sklearn.svm import LinearSVC
labels = [x['klass'] for x in training_set]
m = LinearSVC(dual='auto').fit(X, labels)

Predict a text

X = tm.transform(['Buenos días']) # good morning
m.predict(X)
array([0])
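
LinearSVC implements a linear decision function like the \(g(\mathbf x)\) of the geometry slides; its sign determines the class, and a negative value corresponds to the prediction 0 obtained above:

# negative decision value -> class 0
m.decision_function(X) < 0
array([ True])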

Performance

import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import recall_score

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
perf = []
for tr, vs in kfold.split(training_set):
    train = [training_set[i] for i in tr]
    val = [training_set[i] for i in vs]
    # fit the representation and the classifier on the training folds
    tm = TextModel(token_list=[-1], num_option=OPTION_NONE,
                   usr_option=OPTION_DELETE, url_option=OPTION_DELETE,
                   emo_option=OPTION_NONE, lc=True, del_dup=False,
                   del_punc=True, del_diac=True).fit(train)
    labels = [x['klass'] for x in train]
    m = LinearSVC(dual='auto').fit(tm.transform(train), labels)
    hy = m.predict(tm.transform(val))
    # macro-averaged recall on the validation fold
    score = recall_score([x['klass'] for x in val], hy, average='macro')
    perf.append(score)
np.mean(perf)
np.float64(0.8235285558223533)

Conclusions

  • Describe a supervised learning approach to tackle text classification.
  • Explain the geometry of linear classifiers.
  • Use a procedure to represent a text as a vector.
  • Measure the performance of a text classifier.

Personal webpage