Text Classification

INFOTEC

CentroGEO

INFOTEC

Sabino Miranda-Jiménez

Who / Where

INGEOTEC

GitHub: https://github.com/INGEOTEC
WebPage: https://ingeotec.github.io/

Aguascalientes, México

Opinion Mining

Opinion Mining

Definition

Study people’s opinions, appraisals, attitudes, and emotions toward entities, individuals, events, and their attributes.

  • Distilling opinions from texts
  • A brand is interested in the customer’s opinions
  • Huge amount of information

Opinion Mining (2)

Formal Definition

  • \(e_i\) - Entity
  • \(a_{ij}\) - Aspect of \(e_i\)
  • \(o_{ijkl}\) - Opinion orientation
  • \(h_k\) - Opinion holder (the source of the opinion)
  • \(t_l\) - Time of the opinion

Entity

Product, service, person, event, organization, or topic

Aspect

Entity’s component or attribute

Opinion Mining Tasks

Tasks

  • Entity extraction
  • Aspect extraction (for the identified entities)
  • Identify opinion source and time
  • Identify opinion orientation

Introduction

Text Classification

Definition

The aim is the classification of documents into a fixed number of predefined categories.

Polarity

El día de mañana no podré ir con ustedes a la librería (Tomorrow I will not be able to go with you to the bookstore)

Negative

Text Classification Tasks

Polarity

Positive, Negative, Neutral

Emotion (Multiclass)

  • Anger, Joy, …
  • Intensity of an emotion

Event (Binary)

  • Violent
  • Crime

Profiling

Gender

Man, Woman, Nonbinary, …

Age

Child, Teen, Adult

Language Variety

  • Spanish: Spain, Cuba, Argentina, México, …
  • English: United States, England, …

Approach

Machine Learning

Definition

Machine learning (ML) is a subfield of artificial intelligence that focuses on the development and implementation of algorithms capable of learning from data without being explicitly programmed.

Types of ML algorithms

  • Unsupervised Learning
  • Supervised Learning
  • Reinforcement Learning

Supervised Learning (Multiclass)

Supervised Learning (Binary)

Supervised Learning (Classification)

Decision function

\(g(\mathbf x) = -0.78 x_1 + 0.60 x_2 - 0.88\)

Supervised Learning (Geometry)

Decision function

\(g(\mathbf x) = -0.78 x_1 + 0.60 x_2 - 0.88\)

Supervised Learning (Geometry 2)

\(w_0\)

  • \(w_0 = -0.88\)
  • \(w_0 = 0.88\)

Supervised Learning (Geometry 3)

Decision function

  • \(g_{svm}(\mathbf x) = -0.78 x_1 + 0.60 x_2 - 0.88\)
  • \(g_{lr}(\mathbf x) = -2.58 x_1 + 0.84 x_2 - 3.06\)
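
The class is given by the sign of the decision function. A minimal sketch, evaluating both classifiers on a hypothetical point:

def g_svm(x):
    return -0.78 * x[0] + 0.60 * x[1] - 0.88

def g_lr(x):
    return -2.58 * x[0] + 0.84 * x[1] - 3.06

x = [-2.0, 1.0]  # hypothetical point
g_svm(x) >= 0, g_lr(x) >= 0
(True, True)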

Starting point

Training set

                                                 text  klass
0   Fallece un ganadero de Peñacerrada atrapado po...      0
1   Carlos 'El Yoyas', imputado por un presunto ma...      1
2   Espacio Municipalista critica la política de a...      0
3   Cuatro detenidos por presunto hurto de ganado ...      1
4   @Wini83 @robersantacruz @juanjoph_73 @DCCoruna...      0
5   OCURRIÓ FRENTE MISMO A SU DOMICILIO \n\nUn jov...      1
6   Leighton Meester y Adam Brody juntos por 2,3 s...      0
7   Brutal asesinato de una patota en Rafael Casti...      1

Quiz

Question

Which of the following tasks does the previous training set belong to?

  1. Polarity
  2. Emotion identification
  3. Aggressiveness detection
  4. Profiling

Training set (2)

Problem

The independent variables are texts

Solution

  • Represent the texts in a suitable format for the classifier
  • Token as a vector
  • Sparse vector
  • Dense vector
  • Utterance as a vector

Text Representation

Token as Vector

Token as vector

  • The idea is that each token \(t\) is associated with a vector \(\mathbf v_t \in \mathbb R^d\)
  • Let \(\mathcal V\) represent the set of distinct tokens (the vocabulary)
  • \(d\) corresponds to the dimension of the vector

\(d \ll \lvert \mathcal V \rvert\) (Dense Vector)

  • GloVe
  • Word2vec
  • fastText
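
A sketch of a dense lookup, assuming the gensim package and its pretrained 'glove-wiki-gigaword-50' vectors (neither is part of these slides):

import gensim.downloader as api

vectors = api.load('glove-wiki-gigaword-50')  # downloaded on first use
vectors['football'].shape  # d = 50, while the vocabulary holds hundreds of thousands of tokens
(50,)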

Token as Vector (2)

\(d = \lvert \mathcal V \rvert\) (Sparse Vector)

  • \(\forall_{i \neq j} \mathbf v_i \cdot \mathbf v_j = 0\)
  • \(\mathbf v_i \in \mathbb R^d\)
  • \(\mathbf v_j \in \mathbb R^d\)

Algorithm

  • Sort the vocabulary \(\mathcal V\)
  • Associate the \(i\)-th token with \((\ldots, 0, \overbrace{\beta_i}^i, 0, \ldots)^\intercal\), where \(\beta_i > 0\)
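
A sketch of this algorithm with a toy vocabulary, taking \(\beta_i = 1\) (any positive value keeps the vectors mutually orthogonal):

import numpy as np

vocabulary = sorted(['like', 'playing', 'football', 'saturday'])
tok2id = {tok: i for i, tok in enumerate(vocabulary)}
d = len(vocabulary)

def token_vector(token, beta=1.0):
    v = np.zeros(d)
    v[tok2id[token]] = beta
    return v

token_vector('playing')
array([0., 0., 1., 0.])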

Utterance as Vector

Procedure

\[\mathbf x = \sum_{t \in \mathcal U} \mathbf{v}_t\]

  • where \(\mathcal{U}\) corresponds to all the tokens of the utterance
  • The vector \(\mathbf{v}_t\) is associated with token \(t\)

Unit Vector

\[\mathbf x = \frac{\sum_{t \in \mathcal U} \mathbf v_t}{\lVert \sum_{t \in \mathcal U} \mathbf v_t \rVert} \]
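
A sketch of both formulas with toy 3-dimensional token vectors (the values are hypothetical; the same code works with sparse vectors):

import numpy as np

token_vec = {'like':     np.array([0.1, 0.3, -0.2]),
             'playing':  np.array([0.0, 0.5,  0.4]),
             'football': np.array([0.7, -0.1, 0.2])}
utterance = ['like', 'playing', 'football']

x = np.sum([token_vec[t] for t in utterance], axis=0)  # plain sum
x_unit = x / np.linalg.norm(x)                         # unit vector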

Sparse Representation

Tokens

flowchart LR
    Entrada([Text]) -->  Norm[Text Normalizer]
    Norm --> Seg[Tokenizer]
    Seg --> Terminos(...)

Text Normalization

  • User
  • URL
  • Entity
  • Case sensitive
  • Punctuation
  • Diacritic

Diacritic (remove)

import unicodedata

text = 'México'
output = ""
for x in unicodedata.normalize('NFD', text):
    o = ord(x)
    # skip combining diacritical marks (U+0300 - U+036F)
    if 0x0300 <= o <= 0x036F:
        continue
    output += x
output
'Mexico'

Text Normalization

Case sensitive

text = "México"
output = text.lower()
output
'méxico'

URL (replace)

import re

text = "go http://google.com, and find out"
output = re.sub(r"https?://\S+", "_url", text)
output
'go _url and find out'
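
User (replace)

The user-mention step of the normalization list can be handled the same way (a sketch; the '_usr' placeholder is a choice made here, not a fixed convention):

import re

text = "thanks @mgraffg for the talk"
output = re.sub(r"@\S+", "_usr", text)
output
'thanks _usr for the talk'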

Tokenizer

Common Types

  • Words
  • n-grams (Words)
  • q-grams (Characters)
  • skip-grams (see the sketch after the q-gram example)

Words

text = 'I like playing football on Saturday'
words = text.split()
words
['I', 'like', 'playing', 'football', 'on', 'Saturday']

Tokenizer (2)

n-grams

text = 'I like playing football on Saturday'
words = text.split()
n = 3
n_grams = []
# zip over shifted copies of the word list to slide a window of n words
for a in zip(*[words[i:] for i in range(n)]):
    n_grams.append("~".join(a))
n_grams
['I~like~playing',
 'like~playing~football',
 'playing~football~on',
 'football~on~Saturday']

q-grams

text = 'I like playing'
q = 4
q_grams = []
# same sliding-window trick, now over characters
for a in zip(*[text[i:] for i in range(q)]):
    q_grams.append("".join(a))
q_grams
['I li',
 ' lik',
 'like',
 'ike ',
 'ke p',
 'e pl',
 ' pla',
 'play',
 'layi',
 'ayin',
 'ying']
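
skip-grams

Word skip-grams, the remaining type in the list above, pair words separated by a gap (a sketch with a gap of one word, joined with '~' as in the n-gram example):

text = 'I like playing football on Saturday'
words = text.split()
k = 1
skip_grams = ["~".join(pair) for pair in zip(words, words[k + 1:])]
skip_grams
['I~playing', 'like~football', 'playing~on', 'football~Saturday']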

\(\mu\)-TC

TextModel

from microtc.textmodel import TextModel
from microtc.params import OPTION_DELETE, OPTION_NONE

tm = TextModel(token_list=[-1],
               num_option=OPTION_NONE,
               usr_option=OPTION_DELETE,
               url_option=OPTION_DELETE,
               emo_option=OPTION_NONE,
               lc=True,
               del_dup=False,
               del_punc=True,
               del_diac=True)
text = 'I like playing football with @mgraffg'
tm.tokenize(text)
['i', 'like', 'playing', 'football', 'with']

\(\mu\)-TC (2)

TextModel

tm = TextModel(token_list=[-2, -1, 6],
               num_option=OPTION_NONE,
               usr_option=OPTION_DELETE,
               url_option=OPTION_DELETE,
               emo_option=OPTION_NONE, 
               lc=True, del_dup=False,
               del_punc=True, del_diac=True)
text = 'I like playing...'
tm.tokenize(text)
['i~like',
 'like~playing',
 'i',
 'like',
 'playing',
 'q:~i~lik',
 'q:i~like',
 'q:~like~',
 'q:like~p',
 'q:ike~pl',
 'q:ke~pla',
 'q:e~play',
 'q:~playi',
 'q:playin',
 'q:laying',
 'q:aying~']

Training set

from os.path import isfile, isdir
# Download is assumed to come from the INGEOTEC toolkit (e.g., EvoMSA.utils)
from EvoMSA.utils import Download

URL = 'https://github.com/INGEOTEC/Delitos/releases/download/Datos/delitos.zip'
if not isfile('delitos.zip'):
    Download(URL, 'delitos.zip')
if not isdir('delitos'):
    !unzip -Pingeotec delitos.zip

Utterance as Vector

TextModel

tm = TextModel(token_list=[-1],
               num_option=OPTION_NONE,
               usr_option=OPTION_DELETE,
               url_option=OPTION_DELETE,
               emo_option=OPTION_NONE, 
               lc=True, del_dup=False,
               del_punc=True, del_diac=True)

Tokenizer

from microtc.utils import tweet_iterator

fname = 'delitos/delitos_ingeotec_Es_train.json'
training_set = list(tweet_iterator(fname))
tm.tokenize(training_set[0])[:3]
['este', 'caso', 'tiene']

Vocabulary

# microtc's Counter also records the number of update calls (used below for the IDF)
from microtc.utils import Counter

voc = Counter()
for text in training_set:
    tokens = set(tm.tokenize(text))
    voc.update(tokens)
voc.most_common(n=3)
[('de', 980), ('en', 804), ('la', 655)]

Utterance as Vector (2)

Inverse Document Frequency (IDF)

import numpy as np

token2id = {}
token2beta = {}
# number of texts seen by the vocabulary (one update call per text)
N = np.log2(voc.update_calls)
for id, (k, n) in enumerate(voc.items()):
    token2id[k] = id
    token2beta[k] = N - np.log2(n)  # IDF: log2(number of texts / document frequency)

Term Frequency - IDF

text = training_set[3]['text']
tokens = tm.tokenize(text)
vector = []
for token, tf in zip(*np.unique(tokens, return_counts=True)):
    if token not in token2id:
        continue
    vector.append((token2id[token], tf * token2beta[token]))  # (id, tf * idf)
vector[:4]
[(62, np.float64(7.906890595608518)),
 (50, np.float64(5.8479969065549495)),
 (31, np.float64(1.3219280948873617)),
 (15, np.float64(2.5224042154522373))]
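
The list of (id, tf-idf) pairs is a sparse vector; a short sketch, reusing token2id and vector from the code above, expands it into a dense array of dimension \(\lvert \mathcal V \rvert\):

dense = np.zeros(len(token2id))
for idx, weight in vector:
    dense[idx] = weight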

Utterance as Vector (3)

\(\mu\)-TC

tm.fit(training_set)
<microtc.textmodel.TextModel at 0x7f41ad8bb640>

Utterance as Vector

text = training_set[3]['text']
tm[text][:4]
[(3523, np.float64(0.08478070043211834)),
 (5114, np.float64(0.07639809274356325)),
 (6569, np.float64(0.35264239129026387)),
 (3340, np.float64(0.1965574235982234))]

Quiz

Question

Which of the following representations do you consider to produce a larger vocabulary?

A

tmA = TextModel(token_list=[-1, 3],
                num_option=OPTION_NONE,
                usr_option=OPTION_DELETE,
                url_option=OPTION_DELETE,
                emo_option=OPTION_NONE, 
                lc=True, del_dup=False,
                del_punc=True,
                del_diac=True
               ).fit(training_set)

B

tmB = TextModel(token_list=[-1, 6],
                num_option=OPTION_NONE,
                usr_option=OPTION_DELETE,
                url_option=OPTION_DELETE,
                emo_option=OPTION_NONE, 
                lc=True, del_dup=False,
                del_punc=True,
                del_diac=True
               ).fit(training_set)
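
One way to verify the answer empirically, assuming the fitted TextModel exposes id2token (it is used later in these slides), is to compare the vocabulary sizes:

len(tmA.id2token), len(tmB.id2token)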

Text Classification

Procedure

Text as Vectors

tm = TextModel(token_list=[-1], num_option=OPTION_NONE,
                usr_option=OPTION_DELETE, url_option=OPTION_DELETE,
                emo_option=OPTION_NONE, lc=True, del_dup=False,
                del_punc=True, del_diac=True
               ).fit(training_set)
X = tm.transform(training_set)

Training a Classifier

from sklearn.svm import LinearSVC

labels = [x['klass'] for x in training_set]
m = LinearSVC(dual='auto').fit(X, labels)

Predict a text

X = tm.transform(['Buenos días']) # good morning
m.predict(X)
array([0])

Performance

Test set

test_set = list(tweet_iterator(fname.replace('_train.', '_test.')))

Prediction

tm = TextModel(token_list=[-2, -1, 3, 4], num_option=OPTION_NONE,
               usr_option=OPTION_DELETE, url_option=OPTION_DELETE,
               emo_option=OPTION_NONE, lc=True, del_dup=False,
               del_punc=True, del_diac=True
              ).fit(training_set)
X = tm.transform(training_set)
labels = np.array([x['klass'] for x in training_set])
m = LinearSVC(dual='auto', class_weight='balanced').fit(X, labels)
hy = m.predict(tm.transform(test_set))

Performance

from sklearn.metrics import recall_score

recall_score([x['klass'] for x in test_set],
             hy, average=None)
array([0.95070423, 0.68421053])
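
A single summary number is the macro average, i.e., the unweighted mean of the per-class recalls reported above:

recall_score([x['klass'] for x in test_set],
             hy, average='macro')  # ≈ 0.817, the mean of the two values above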

Feature Importance

Coefficients

def coef(X, y):
    m = LinearSVC(dual='auto',
                  class_weight='balanced'
                 ).fit(X, y)
    return m.coef_

Normalize Coefficients

# StatisticSamples (bootstrap samples of a statistic) is assumed to come from
# INGEOTEC's CompStats package
stats = StatisticSamples(statistic=coef,
                         num_samples=50,
                         n_jobs=-1)
b_samples = stats(X, labels)
se = np.std(b_samples, axis=0)
se[se==0] = 1
w_norm = m.coef_ / se
w_norm = np.linalg.norm(w_norm, axis=0)

Feature Importance (2)

Wordcloud

from wordcloud import WordCloud
import matplotlib.pyplot as plt

path = './emoji_text.ttf'
items = tm.token_weight.items
tokens = {tm.id2token[id]: w_norm[id] * _w for id, _w in items()
          if w_norm[id] >= 2.0 and np.isfinite(w_norm[id])}
word_cloud = WordCloud(font_path=path,
                       background_color='white'
                      ).generate_from_frequencies(tokens)
plt.imshow(word_cloud, interpolation='bilinear')
plt.tick_params(left=False, right=False, labelleft=False,
                   labelbottom=False, bottom=False)

Feature Importance (2)

(Word cloud generated by the code above.)

Conclusions

  • Describe a supervised learning approach to tackle text classification.
  • Explain the geometry of linear classifiers.
  • Use a procedure to represent a text as a vector.
  • Measure the performance of a text classifier.