Text Classification

INFOTEC

CentroGEO

Sabino Miranda-Jiménez

Who / Where

INGEOTEC

GitHub: https://github.com/INGEOTEC
WebPage: https://ingeotec.github.io/

Aguascalientes, México

Opinion Mining

Opinion Mining

Definition

Study people’s opinions, appraisals, attitudes, and emotions toward entities, individuals, events, and their attributes.

  • Distilling opinions from texts
  • A brand is interested in its customers' opinions
  • Huge amount of information

Opinion Mining (2)

Formal Definition

  • \(e_i\) - Entity
  • \(a_{ij}\) - Aspect of \(e_i\)
  • \(o_{ijkl}\) - Opinion orientation
  • \(h_k\) - Opinion source - Opinion holder
  • \(t_l\) - Time of the opinion

Entity

Product, service, person, event, organization, or topic

Aspect

Entity’s component or attribute
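
As an illustration (a made-up example, not from the slides), consider the review "The battery of this phone is excellent", written by user \(u\) at time \(t\); its components are

\[(e_i, a_{ij}, o_{ijkl}, h_k, t_l) = (\textit{phone}, \textit{battery}, \textit{positive}, u, t)\]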

Opinion Mining Tasks

Tasks

  • Entity extraction
  • Aspect extraction for the identified entities
  • Identify opinion source and time
  • Identify opinion orientation

Introduction

Text Classification

Definition

The aim is the classification of documents into a fixed number of predefined categories.

Polarity

El día de mañana no podré ir con ustedes a la librería (Tomorrow I will not be able to go with you to the bookstore)

Negative

Text Classification Tasks

Polarity

Positive, Negative, Neutral

Emotion (Multiclass)

  • Anger, Joy, …
  • Intensity of an emotion

Event (Binary)

  • Violent
  • Crime

Profiling

Gender

Man, Woman, Nonbinary, …

Age

Child, Teen, Adult

Language Variety

  • Spanish: Spain, Cuba, Argentina, México, …
  • English: United States, England, …

Approach

Machine Learning

Definition

Machine learning (ML) is a subfield of artificial intelligence that focuses on the development and implementation of algorithms capable of learning from data without being explicitly programmed.

Types of ML algorithms

  • Unsupervised Learning
  • Supervised Learning
  • Reinforcement Learning

Supervised Learning (Multiclass)

Supervised Learning (Binary)

Supervised Learning (Classification)

Decision function

\(g(\mathbf x) = -0.78 x_1 + 0.60 x_2 - 0.88\)
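
A minimal sketch (the point is arbitrary and only for illustration) showing that the sign of the decision function gives the predicted class:

import numpy as np

w = np.array([-0.78, 0.60])   # weight vector of g
w_0 = -0.88                   # bias term
x = np.array([1.5, 4.0])      # arbitrary point
g = np.dot(w, x) + w_0        # -0.78 * 1.5 + 0.60 * 4.0 - 0.88 = 0.35
klass = 1 if g > 0 else 0     # positive side of the hyperplane
klass
1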

Supervised Learning (Geometry)

Decision function

\(g(\mathbf x) = -0.78 x_1 + 0.60 x_2 - 0.88\)

Supervised Learning (Geometry 2)

\(w_0\)

  • \(w_0 = -0.88\)
  • \(w_0 = 0.88\)

Supervised Learning (Geometry 3)

Decision function

  • \(g_{svm}(\mathbf x) = -0.78 x_1 + 0.60 x_2 - 0.88\)
  • \(g_{lr}(\mathbf x) = -2.58 x_1 + 0.84 x_2 - 3.06\)

Starting point

Training set

text klass
0 Es el tipico país que por una pelea en un bar ... 0
1 #ElCairo: #PatrickZaki, investigador y defenso... 1
2 Gendarmería busca a jóvenes desaparecidos en V... 0
3 Que busquen sus ladrones que yo no sé de motor... 1
4 @jotagiglio @jaimenievesdiz Verdad y justicia ... 0
5 #LaHistoriaDelDía | Hombre fue capturado por e... 1
6 Un delicioso platillo junto a instalaciones ir... 0
7 Detenido en Barakaldo por abusar durante seis ... 1

Quiz

Question

Which of the following tasks does the previous training set belong to?

  1. Polarity
  2. Emotion identification
  3. Aggressiveness detection
  4. Profiling

Training set (2)

Problem

The independent variables are texts

Solution

  • Represent the texts in a suitable format for the classifier
  • Token as a vector
  • Sparse vector
  • Dense vector
  • Utterance as a vector

Text Representation

Token as Vector

Token as vector

  • The idea is that each token \(t\) is associated with a vector \(\mathbf v_t \in \mathbb R^d\)
  • Let \(\mathcal V\) be the set of distinct tokens
  • \(d\) corresponds to the dimension of the vector

\(d \ll \lvert \mathcal V \rvert\) (Dense Vector)

  • GloVe
  • Word2vec
  • fastText

Token as Vector (2)

\(d = \lvert \mathcal V \rvert\) (Sparse Vector)

  • \(\forall_{i \neq j} \mathbf v_i \cdot \mathbf v_j = 0\)
  • \(\mathbf v_i \in \mathbb R^d\)
  • \(\mathbf v_j \in \mathbb R^d\)

Algorithm

  • Sort the vocabulary \(\mathcal V\)
  • Associate \(i\)-th token to
  • \((\ldots, 0, \overbrace{\beta_i}^i, 0, \ldots)^\intercal\)
  • where \(\beta_i > 0\)
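
A minimal sketch of the algorithm above, on a toy vocabulary with \(\beta_i = 1\) (any positive weight works):

import numpy as np

voc = sorted(['like', 'playing', 'football', 'i'])
token2id = {t: i for i, t in enumerate(voc)}
def token_vector(token, beta=1.0):
    # one-hot vector: beta at the token's position, zero elsewhere
    v = np.zeros(len(voc))
    v[token2id[token]] = beta
    return v
token_vector('playing')
array([0., 0., 0., 1.])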

Utterance as Vector

Procedure

\[\mathbf x = \sum_{t \in \mathcal U} \mathbf{v}_t\]

  • where \(\mathcal{U}\) corresponds to all the tokens of the utterance
  • The vector \(\mathbf{v}_t\) is associated to token \(t\)

Unit Vector

\[\mathbf x = \frac{\sum_{t \in \mathcal U} \mathbf v_t}{\lVert \sum_{t \in \mathcal U} \mathbf v_t \rVert} \]
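
A minimal sketch of both formulas, continuing the toy vocabulary of the previous sketch:

import numpy as np

voc = ['football', 'i', 'like', 'playing']
token2id = {t: i for i, t in enumerate(voc)}
def token_vector(t):
    v = np.zeros(len(voc))
    v[token2id[t]] = 1.0
    return v
utterance = 'i like playing football'.split()
x = np.sum([token_vector(t) for t in utterance], axis=0)
x / np.linalg.norm(x)   # unit-length version
array([0.5, 0.5, 0.5, 0.5])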

Sparse Representation

Tokens

flowchart LR
    Entrada([Text]) -->  Norm[Text Normalizer]
    Norm --> Seg[Tokenizer]
    Seg --> Terminos(...)

Text Normalization

  • User
  • URL
  • Entity
  • Case sensitive
  • Punctuation
  • Diacritic

Diacritic (remove)

import unicodedata

text = 'México'
output = ""
for x in unicodedata.normalize('NFD', text):
    o = ord(x)
    # skip combining diacritical marks (U+0300 to U+036F)
    if 0x300 <= o <= 0x36F:
        continue
    output += x
output
'Mexico'

Text Normalization

Case sensitive

text = "México"
output = text.lower()
output
'méxico'

URL (replace)

text = "go http://google.com, and find out"
output = re.sub(r"https?://\S+", "_url", text)
output
'go _url and find out'
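
User (replace)

The user handles mentioned in the normalization list can be handled analogously; a minimal sketch (the _usr placeholder is an assumption, not the library's convention):

import re

text = "thanks @mgraffg for the example"
output = re.sub(r"@\S+", "_usr", text)
output
'thanks _usr for the example'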

Tokenizer

Common Types

  • Words
  • n-grams (Words)
  • q-grams (Characters)
  • skip-grams

Words

text = 'I like playing football on Saturday'
words = text.split()
words
['I', 'like', 'playing', 'football', 'on', 'Saturday']

Tokenizer (2)

n-grams

text = 'I like playing football on Saturday'
words = text.split()
n = 3
n_grams = []
for a in zip(*[words[i:] for i in range(n)]):
    n_grams.append("~".join(a))
n_grams
['I~like~playing',
 'like~playing~football',
 'playing~football~on',
 'football~on~Saturday']

q-grams

text = 'I like playing'
q = 4
q_grams = []
for a in zip(*[text[i:] for i in range(q)]):
    q_grams.append("".join(a))
q_grams
['I li',
 ' lik',
 'like',
 'ike ',
 'ke p',
 'e pl',
 ' pla',
 'play',
 'layi',
 'ayin',
 'ying']
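
skip-grams

The last tokenizer in the list, skip-grams, pairs words that are not adjacent; a minimal sketch for word pairs that skip one intermediate token (this particular variant is an assumption, not necessarily \(\mu\)-TC's definition):

text = 'I like playing football on Saturday'
words = text.split()
skip_grams = []
# pair each word with the one two positions ahead (skipping one word)
for a, b in zip(words, words[2:]):
    skip_grams.append("~".join([a, b]))
skip_grams
['I~playing', 'like~football', 'playing~on', 'football~Saturday']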

\(\mu\)-TC

TextModel

from microtc.textmodel import TextModel
from microtc.params import OPTION_DELETE, OPTION_NONE

tm = TextModel(token_list=[-1],
               num_option=OPTION_NONE,
               usr_option=OPTION_DELETE,
               url_option=OPTION_DELETE,
               emo_option=OPTION_NONE,
               lc=True,
               del_dup=False,
               del_punc=True,
               del_diac=True)
text = 'I like playing football with @mgraffg'
tm.tokenize(text)
['i', 'like', 'playing', 'football', 'with']

\(\mu\)-TC (2)

TextModel

tm = TextModel(token_list=[-2, -1, 6],
               num_option=OPTION_NONE,
               usr_option=OPTION_DELETE,
               url_option=OPTION_DELETE,
               emo_option=OPTION_NONE, 
               lc=True, del_dup=False,
               del_punc=True, del_diac=True)
text = 'I like playing...'
tm.tokenize(text)
['i~like',
 'like~playing',
 'i',
 'like',
 'playing',
 'q:~i~lik',
 'q:i~like',
 'q:~like~',
 'q:like~p',
 'q:ike~pl',
 'q:ke~pla',
 'q:e~play',
 'q:~playi',
 'q:playin',
 'q:laying',
 'q:aying~']

Training set

from os.path import isfile, isdir

URL = 'https://github.com/INGEOTEC/Delitos/releases/download/Datos/delitos.zip'
if not isfile('delitos.zip'):
  Download(URL, 'delitos.zip')
if not isdir('delitos'):
  !unzip -Pingeotec delitos.zip

Utterance as Vector

TextModel

tm = TextModel(token_list=[-1],
               num_option=OPTION_NONE,
               usr_option=OPTION_DELETE,
               url_option=OPTION_DELETE,
               emo_option=OPTION_NONE, 
               lc=True, del_dup=False,
               del_punc=True, del_diac=True)

Tokenizer

from microtc.utils import tweet_iterator

fname = 'delitos/delitos_ingeotec_Es_train.json'
training_set = list(tweet_iterator(fname))
tm.tokenize(training_set[0])[:3]
['este', 'caso', 'tiene']

Vocabulary

# Counter must record the number of update calls (e.g., microtc.utils.Counter)
voc = Counter()
for text in training_set:
  tokens = set(tm.tokenize(text))
  voc.update(tokens)
voc.most_common(n=3)
[('de', 980), ('en', 803), ('la', 653)]

Utterance as Vector (2)

Inverse Document Frequency (IDF)

import numpy as np

token2id = {}
token2beta = {}
# update_calls equals the number of documents (one update per document)
N = np.log2(voc.update_calls)
for id, (k, n) in enumerate(voc.items()):
  token2id[k] = id
  token2beta[k] = N - np.log2(n)

Term Frequency - IDF

text = training_set[3]['text']
tokens = tm.tokenize(text)
vector = []
for token, tf in zip(*np.unique(tokens, return_counts=True)):
  if token not in token2id:
    continue
  vector.append((token2id[token], tf * token2beta[token]))
vector[:4]
[(51, np.float64(7.906890595608518)),
 (55, np.float64(5.8479969065549495)),
 (25, np.float64(1.3269461696539864)),
 (10, np.float64(2.5277907564370796))]
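
To match the scale of the Unit Vector formula shown earlier (and of the \(\mu\)-TC representation below), the handcrafted TF-IDF weights can be normalized to unit length; a minimal sketch continuing from the vector just built:

norm = np.linalg.norm([w for _, w in vector])
unit_vector = [(id, w / norm) for id, w in vector]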

Utterance as Vector (3)

\(\mu\)-TC

tm.fit(training_set)
<microtc.textmodel.TextModel at 0x7f29114bfd30>

Utterance as Vector

text = training_set[3]['text']
tm[text][:4]
[(3535, np.float64(0.08495635021337841)),
 (5135, np.float64(0.0766897986795882)),
 (6598, np.float64(0.3526199879358128)),
 (3350, np.float64(0.19654493631439243))]

Quiz

Question

Which of the following representations do you consider to produce a larger vocabulary?

A

tmA = TextModel(token_list=[-1, 3],
                num_option=OPTION_NONE,
                usr_option=OPTION_DELETE,
                url_option=OPTION_DELETE,
                emo_option=OPTION_NONE, 
                lc=True, del_dup=False,
                del_punc=True,
                del_diac=True
               ).fit(training_set)

B

tmB = TextModel(token_list=[-1, 6],
                num_option=OPTION_NONE,
                usr_option=OPTION_DELETE,
                url_option=OPTION_DELETE,
                emo_option=OPTION_NONE, 
                lc=True, del_dup=False,
                del_punc=True,
                del_diac=True
               ).fit(training_set)

Text Classification

Procedure

Text as Vectors

tm = TextModel(token_list=[-1], num_option=OPTION_NONE,
                usr_option=OPTION_DELETE, url_option=OPTION_DELETE,
                emo_option=OPTION_NONE, lc=True, del_dup=False,
                del_punc=True, del_diac=True
               ).fit(training_set)
X = tm.transform(training_set)

Training a Classifier

from sklearn.svm import LinearSVC

labels = [x['klass'] for x in training_set]
m = LinearSVC(dual='auto').fit(X, labels)

Predict a text

X = tm.transform(['Buenos días']) # good morning
m.predict(X)
array([0])

Performance

Test set

test_set = list(tweet_iterator(fname.replace('_train.', '_test.')))

Prediction

tm = TextModel(token_list=[-2, -1, 3, 4], num_option=OPTION_NONE,
               usr_option=OPTION_DELETE, url_option=OPTION_DELETE,
               emo_option=OPTION_NONE, lc=True, del_dup=False,
               del_punc=True, del_diac=True
              ).fit(training_set)
X = tm.transform(training_set)
labels = np.array([x['klass'] for x in training_set])
m = LinearSVC(dual='auto', class_weight='balanced').fit(X, labels)
hy = m.predict(tm.transform(test_set))

Performance

from sklearn.metrics import recall_score

recall_score([x['klass'] for x in test_set],
             hy, average=None)
array([0.95070423, 0.68421053])
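
The two values are the recall of class 0 and class 1, respectively; they can be summarized with a macro average, e.g.:

recall_score([x['klass'] for x in test_set],
             hy, average='macro')   # unweighted mean of the two recalls, approx. 0.817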

Feature Importance

Coefficients

def coef(X, y):
    m = LinearSVC(dual='auto',
                  class_weight='balanced'
                 ).fit(X, y)
    return m.coef_

Normalize Coefficients

# bootstrap the coefficients to estimate their standard error
stats = StatisticSamples(statistic=coef,
                         num_samples=50,
                         n_jobs=-1)
b_samples = stats(X, labels)
se = np.std(b_samples, axis=0)
se[se==0] = 1
# standardize each coefficient by its standard error
w_norm = m.coef_ / se
w_norm = np.linalg.norm(w_norm, axis=0)

Feature Importance (2)

Wordcloud

import matplotlib.pyplot as plt
from wordcloud import WordCloud

path = './emoji_text.ttf'
items = tm.token_weight.items
tokens = {tm.id2token[id]: w_norm[id] * _w for id, _w in items()
          if w_norm[id] >= 2.0 and np.isfinite(w_norm[id])}
word_cloud = WordCloud(font_path=path,
                       background_color='white'
                      ).generate_from_frequencies(tokens)
plt.imshow(word_cloud, interpolation='bilinear')
plt.tick_params(left=False, right=False, labelleft=False,
                   labelbottom=False, bottom=False)

Feature Importance (2)

Conclusions

  • Describe a supervised learning approach to tackle text classification.
  • Explain the geometry of linear classifiers.
  • Use a procedure to represent a text as a vector.
  • Measure the performance of a text classifier.