Machine Learning Vienna University May 30,...
Text mining
Machine Learning
Vienna University
May 30, 2014
Text Mining
Text mining (also called text data mining or text analytics) refers to the process of deriving high-quality
information from text:
1 structuring the input text:
   parsing
   addition of some linguistic features
   removal of some linguistic features
   insertion into a database
2 deriving patterns within the structured data
3 evaluation and interpretation of the output.
Text mining tasks:
1 text categorization
2 text clustering
3 concept/entity extraction
4 production of granular taxonomies
5 sentiment analysis
6 document summarization
7 entity relation modeling (i.e., learning relations between named entities)
Information Retrieval
Corpus: $D$ units of text $d_j$, $j = 1, \dots, D$: documents, sub-documents, paragraphs,
sentences or windows of a fixed number of words.
Dictionary $W$: the $N$ different words $\{word_1, word_2, \dots, word_N\}$ that appear in the
corpus.
Information Retrieval (IR) is finding documents in a corpus that satisfy an information
need.
Examples: Web search, E-mail search, searching your computer, corporate knowledge
bases, legal information retrieval
You need a query!
Fundamental Property of Text (Zipf's law)
Zipf's law means that, in a corpus of natural language, very few words
account for the largest proportion of the written text.
Illustration:
1242 tour descriptions with about 1.5 million words and 30 thousand different words. The
left figure shows the frequencies of the fifty most frequent words. The right figure shows
the frequency of words in relation to the number of words.
Pre-processing of Texts
Main components of pre-processing (in the original diagram, blue blocks are obligatory and yellow blocks are applied only if required):
START
Multiple Documents (Corpus)
Removal of irrelevant information
Removal of empty documents
Removal of identical documents
Filtering of English texts
Tokenization
Spellchecking
Removal of repeating characters
Replacement of contractions
Replacement with synonyms
Replacement with antonyms
Change uppercase to lowercase
Removal of stopwords
Lemmatization
Stemming
Replacement with hyperonyms
Removal of unique tokens
Use of SentiWordNet
END
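A few of the steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the stop-word list `STOPWORDS` is a tiny made-up sample, tokenization is a crude regular expression, and steps that need external resources (spellchecking, lemmatization, SentiWordNet) are left out.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "is", "was", "and", "of", "to", "in", "it"}

def preprocess(text):
    """Lowercase, squeeze repeated characters, tokenize, drop stop words."""
    text = text.lower()                          # change uppercase to lowercase
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)   # 3+ repeats become 2: "sooooo" -> "soo"
    tokens = re.findall(r"[a-z]+", text)         # crude tokenization on letters only
    return [t for t in tokens if t not in STOPWORDS]

def clean_corpus(docs):
    """Remove empty and identical documents, then tokens that occur only once."""
    unique_docs = list(dict.fromkeys(d.strip() for d in docs if d.strip()))
    tokenized = [preprocess(d) for d in unique_docs]
    counts = Counter(t for doc in tokenized for t in doc)
    return [[t for t in doc if counts[t] > 1] for doc in tokenized]

docs = ["The tour was sooooo good", "", "A good city tour", "A good city tour"]
cleaned = clean_corpus(docs)
print(cleaned)
```

The order of the steps matters: unique tokens are counted only after duplicates and stop words have been removed, so a word that appears twice only because a document was duplicated is still treated as unique.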
Term-Document Matrices
Element $td_{i,j}$ of the term-document matrix $td$ is 1 if word $i$ occurs in document $j$, and 0
otherwise.
The incidence matrix td is a huge matrix.
In practice, an inverted index is stored instead of the full incidence matrix: each document
is identified by a docID (a document serial number), and for each word $i$ we store a list of
all documents that contain this word.
From matrix $td$ we can construct:
the document-term matrix: $dt = td^{T}$
the term-term matrix: $tt = td \cdot dt$
the document-document matrix: $dd = dt \cdot td$
Quite often the corpus size D is smaller than the dictionary size N, so the document
representation can be more efficient.
The dual description corresponds to the document-representation view of the problem,
and the primal description to the term-representation view.
In the dual, a document is represented as the counts of terms that appear in it. In the
primal, a term is represented as the counts of the documents in which it appears.
$tt_{i,j}$, $i = 1, \dots, N$, $j = 1, \dots, N$, is a symmetric matrix that indicates in how many documents
words $i$ and $j$ co-occur;
$tt_{i,i}$ indicates in how many documents word $i$ occurs.
$dd_{i,j}$ indicates how many words documents $i$ and $j$ have in common.
$dd_{i,i}$ indicates how many different words document $i$ contains.
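The matrices above can be illustrated on a toy corpus. The sketch below assumes the convention that rows of `td` are words and columns are documents; the three documents and the helper names `transpose` and `matmul` are made up for the example.

```python
docs = [["day", "city", "tour"], ["city", "tour"], ["day", "trek"]]
vocab = sorted({w for d in docs for w in d})   # dictionary of N words

# Incidence matrix td (N x D): td[i][j] = 1 if word i occurs in document j.
td = [[1 if w in d else 0 for d in docs] for w in vocab]

def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    # Plain nested-list matrix product: rows of a times columns of b.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

dt = transpose(td)      # document-term matrix (D x N)
tt = matmul(td, dt)     # term-term matrix (N x N): co-occurrence counts
dd = matmul(dt, td)     # document-document matrix (D x D): shared words

# Inverted index: for each word, the list of docIDs that contain it.
inverted_index = {w: [j for j, d in enumerate(docs) if w in d] for w in vocab}

print(tt)
print(dd)
print(inverted_index)
```

For realistic corpora the dense matrices would not fit in memory, which is why the inverted index is the structure that is actually stored.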
The single word probability model (unigram model)
Maximum likelihood probability estimates for each word in the vocabulary:
$p(w_i) = \frac{count(w_i)}{\sum_j count(w_j)}$
where $count(w_i)$ is the frequency of word $w_i$ and $\sum_j count(w_j)$ is the total number of words in the
corpus.
The most frequent words in the tour descriptions after stop-word removal (shown as a word cloud in the original slide):
Word Frequency
day 15541
city 7590
time 5557
visit 4966
tour 4554
one 4311
take 3914
town 3877
local 3587
enjoy 3306
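The maximum likelihood estimate of the unigram model is just a relative frequency. A minimal sketch on a made-up token list (the tour corpus itself is not available here):

```python
from collections import Counter

def unigram_mle(tokens):
    """p(w) = count(w) / total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

tokens = "day city day tour city day".split()
p = unigram_mle(tokens)
print(p)
```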
Word Co-Occurrences
$p_{i,j}$ denotes the joint probability of co-occurrence of words $i$ and $j$, and $p_i$ denotes the
individual probability of occurrence of word $i$. Because $tt$ counts documents, both are normalized by the number of documents $D$:
$p_{i,j} = tt_{i,j}/D, \quad p_i = tt_{i,i}/D$
Pointwise mutual information (PMI):
$I_{i,j} = \log \frac{p_{i,j}}{p_i p_j} = \log \frac{D \cdot tt_{i,j}}{tt_{i,i} \, tt_{j,j}}$
$I_{i,j} = 0$: words $i$, $j$ are statistically independent;
$I_{i,j} > 0$: positive affinity effect, i.e. the words tend to co-occur more often than expected;
$I_{i,j} < 0$: negative affinity or rejection effect, i.e. the words tend to co-occur less often than
expected in the case of statistical independence.
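The three cases can be checked numerically. The sketch below assumes that probabilities are estimated as document counts divided by the number of documents (the `n_docs` argument); the 4 x 4 term-term matrix is a made-up example for four words over four documents.

```python
import math

def pmi(tt, i, j, n_docs):
    """Pointwise mutual information of words i and j from a term-term count matrix."""
    p_ij = tt[i][j] / n_docs
    p_i = tt[i][i] / n_docs
    p_j = tt[j][j] / n_docs
    if p_ij == 0:
        return float("-inf")      # the words never co-occur
    return math.log(p_ij / (p_i * p_j))

# Word 0 and word 1 occur in documents {0, 1}, word 2 in {2, 3}, word 3 in {0, 2}.
tt = [[2, 2, 0, 1],
      [2, 2, 0, 1],
      [0, 0, 2, 1],
      [1, 1, 1, 2]]

print(pmi(tt, 0, 1, 4))   # positive: always together
print(pmi(tt, 0, 3, 4))   # zero: co-occur as often as expected by chance
print(pmi(tt, 0, 2, 4))   # -inf: never together
```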
Basic n-Gram Models
Previously co-occurrences were computed by disregarding the relative position in each
word pair under consideration. Now we will take the position of words into consideration.
As a consequence, the resulting co-occurrence matrix will not be a symmetric one.
We want to compute probabilities for larger conventional units of text such as sentences,
paragraphs and documents.
$p(w) = p(w_1, w_2, \dots, w_m) = p(w_1) \, p(w_2|w_1) \, p(w_3|w_1,w_2) \cdots p(w_m|w_1,w_2,\dots,w_{m-1})$
Assuming the Markov property:
1-gram: $p(w) \approx p(w_1) \, p(w_2) \, p(w_3) \cdots p(w_m)$
2-gram: $p(w) \approx p(w_1) \, p(w_2|w_1) \, p(w_3|w_2) \cdots p(w_m|w_{m-1})$
3-gram: $p(w) \approx p(w_1) \, p(w_2|w_1) \, p(w_3|w_2,w_1) \, p(w_4|w_3,w_2) \cdots p(w_m|w_{m-1},w_{m-2})$
n-gram: $p(w) \approx \prod_i p(w_i|w_{i-1},w_{i-2},\dots,w_{i-n+1})$
Maximum likelihood estimates of these probabilities can be computed easily from a
training corpus.
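For the bigram case, the maximum likelihood estimate is the count of the pair divided by the count of its first word. A minimal sketch on a made-up token sequence; note that the estimates are not symmetric in the two words.

```python
from collections import Counter

def bigram_mle(tokens):
    """p(w2 | w1) = count(w1 w2) / count(w1 in first position of a bigram)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    first_counts = Counter(tokens[:-1])
    return {(w1, w2): c / first_counts[w1] for (w1, w2), c in pair_counts.items()}

tokens = "free day free time free day".split()
p = bigram_mle(tokens)
print(p[("free", "day")], p[("day", "free")])
```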
Accounting for Order
First table: bigrams from the tour descriptions, ranked by raw frequency.
Second table: trigrams from the tour descriptions, ranked by PMI.
Nr. First Word Second Word
1 national park
2 travel time
3 free day
4 nestimated travel
5 machu picchu
6 make way
7 time explore
8 topdeck trip
9 world heritag
10 time hours
First Word Second Word Third Word
day trek time
small village day
free day take
free time drive
relax day best
afternoon tour local
along route day
city take time
day hanoi tour
years day trip
Collocations on Basis of Likelihood Ratios
Likelihood ratio is a number that tells us how much more likely one hypothesis is than the
other.
We examine the following two alternative explanations for the occurrence frequency of a
bigram w1w2:
Hypothesis 1: $P(w_2|w_1) = p = P(w_2|\neg w_1)$
Hypothesis 2: $P(w_2|w_1) = p_1 \neq p_2 = P(w_2|\neg w_1)$
Hypothesis 1 is a formalization of independence (the occurrence of w2 is independent of
the previous occurrence of w1);
Hypothesis 2 is a formalization of dependence which is good evidence for an interesting
collocation.
Assuming a binomial distribution, we obtain the likelihood of each hypothesis from the counts of the bigram $w_1 w_2$ and of the words $w_1$ and $w_2$.
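A sketch of the resulting log likelihood ratio. The slides do not spell out the estimators, so the usual maximum likelihood choices are assumed here: with `c1`, `c2`, `c12` the counts of $w_1$, $w_2$ and the bigram, and `n` the total number of tokens, $p = c_2/n$, $p_1 = c_{12}/c_1$ and $p_2 = (c_2 - c_{12})/(n - c_1)$. The counts in the example are made up.

```python
import math

def log_l(k, n, x):
    """Log of x^k (1-x)^(n-k); the binomial coefficient is omitted because it cancels."""
    # Clamp to avoid log(0) when an estimate is exactly 0 or 1.
    x = min(max(x, 1e-12), 1 - 1e-12)
    return k * math.log(x) + (n - k) * math.log(1 - x)

def log_likelihood_ratio(c1, c2, c12, n):
    """log(L(Hypothesis 1) / L(Hypothesis 2)); values far below 0 favour dependence."""
    p = c2 / n
    p1 = c12 / c1
    p2 = (c2 - c12) / (n - c1)
    return (log_l(c12, c1, p) + log_l(c2 - c12, n - c1, p)
            - log_l(c12, c1, p1) - log_l(c2 - c12, n - c1, p2))

independent = log_likelihood_ratio(c1=100, c2=200, c12=20, n=1000)
dependent = log_likelihood_ratio(c1=100, c2=200, c12=80, n=1000)
print(-2 * independent, -2 * dependent)
```

The quantity $-2 \log \lambda$ is the value usually used to rank candidate collocations: the larger it is, the stronger the evidence for Hypothesis 2.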
Term-Frequency Matrices
We define a term-frequency matrix $tf$ whose element $tf_{i,j}$ is the number of occurrences of word $i$ in
document $j$:
Docs
Terms 127 144 191 194 211
buy 0 0 0 0 0
buyer 0 0 0 0 0
buyers 0 2 0 0 0
calendar 0 0 0 0 0
called 0 0 0 0 0
cambridge 0 1 0 0 0
Matrix $tf$ does not consider the ordering of words in a document, which is why this model is
also called the bag-of-words model.
Term-Document Count Matrices
The relevance of a word does not increase proportionally with its term frequency, so we
use log-frequency weighting:
$w_{i,j} = \begin{cases} 1 + \log_{10} tf(t_i, d_j), & \text{if } tf(t_i, d_j) > 0 \\ 0, & \text{otherwise} \end{cases}$
Score for a document-query pair: sum over the words $i$ that occur in both query $q$ and document $d_j$:
$Score = \sum_{i \in q \cap d_j} (1 + \log_{10} tf(t_i, d_j))$
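A small sketch of the log-frequency weight and the resulting query score, using a made-up document in which "city" occurs 10 times, "tour" once and "day" 100 times:

```python
import math
from collections import Counter

def log_tf_weight(tf):
    """1 + log10(tf) for tf > 0, and 0 otherwise."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def score(query_terms, doc_tokens):
    """Sum of log-frequency weights over words that occur in both query and document."""
    tf = Counter(doc_tokens)
    return sum(log_tf_weight(tf[t]) for t in set(query_terms) if tf[t] > 0)

doc = ["city"] * 10 + ["tour"] + ["day"] * 100
print(log_tf_weight(100))                    # 100 occurrences weigh only 3 times one occurrence
print(score(["city", "tour", "trek"], doc))  # "trek" is not in the document and adds nothing
```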
Term Frequency-Inverse Document Frequency (tf-idf)
Frequent terms are less informative than rare terms, so the document frequency $tt_{i,i}$ (the number of documents that contain word $i$, also written $df_i$) is an inverse measure of the
informativeness of word $i$.
We define the idf (inverse document frequency) of word $i$ by $idf_i = \log_{10}(D/tt_{i,i})$.
We use $\log_{10}(D/df_i)$ instead of $D/df_i$ to dampen the effect of $df_i$.
The tf-idf weight of a word $i$ in document $j$ is the product of its tf weight and its idf weight:
$tf\_idf_{i,j} = (1 + \log_{10} tf_{i,j}) \cdot \log_{10}(D/tt_{i,i})$
It is the best-known weighting scheme in information retrieval.
The tf-idf weights of words are good indicators of importance, and they are easy and fast
to compute.
For the final ranking of documents for a query $q$ we obtain the score:
$Score(q, d_j) = \sum_{i \in q \cap d_j} tf\_idf_{i,j}$
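A minimal sketch of tf-idf weighting and query scoring on a made-up three-document corpus. A word that occurs in every document ("city" below) gets idf 0 and therefore contributes nothing to any score.

```python
import math
from collections import Counter

def tf_idf_matrix(docs):
    """Return one {word: weight} dict per document, weight = (1 + log10 tf) * log10(D / df)."""
    n_docs = len(docs)
    df = Counter(w for d in docs for w in set(d))   # document frequency of each word
    weights = []
    for d in docs:
        tf = Counter(d)
        weights.append({w: (1 + math.log10(c)) * math.log10(n_docs / df[w])
                        for w, c in tf.items()})
    return weights

def score(query, doc_weights):
    """Sum of tf-idf weights over the query words present in the document."""
    return sum(doc_weights.get(t, 0.0) for t in set(query))

docs = [["city", "tour", "tour"], ["city", "trek"], ["city", "day"]]
weights = tf_idf_matrix(docs)
ranking = sorted(range(len(docs)), key=lambda j: score(["tour", "city"], weights[j]), reverse=True)
print(ranking)
```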
TF-IDF Cosine Score
How similar are our documents to each other? Once we are operating under the vector
space model, we can compute different types of distances between word vectors:
Euclidean, Hamming, Jaccard, cosine, and so forth.
Since we have represented documents as tf-idf weighted word vectors, we can compute the cosine of the angle
between two document vectors and treat it as an estimate of their similarity.
The tf-idf cosine similarity between two documents $d_i$ and $d_j$ is:
$CosineSimilarity(d_i, d_j) = \frac{(d_i, d_j)}{\|d_i\| \, \|d_j\|}$
with
$(d_i, d_j) = tf\_idf_{1,i} \, tf\_idf_{1,j} + tf\_idf_{2,i} \, tf\_idf_{2,j} + \cdots + tf\_idf_{N,i} \, tf\_idf_{N,j}$
$\|d_i\| = \sqrt{tf\_idf_{1,i}^2 + tf\_idf_{2,i}^2 + \cdots + tf\_idf_{N,i}^2}$
$\|d_j\| = \sqrt{tf\_idf_{1,j}^2 + tf\_idf_{2,j}^2 + \cdots + tf\_idf_{N,j}^2}$
With the tf-idf cosine score we can find documents that are similar to each other.
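The cosine score can be sketched for sparse vectors stored as dictionaries. The weights below are made-up numbers standing in for tf-idf values.

```python
import math

def cosine_similarity(u, v):
    """u, v: dicts mapping word -> weight; words missing from a dict count as 0."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0                      # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)

d1 = {"tour": 3.0, "city": 4.0}
d2 = {"tour": 6.0, "city": 8.0}         # same direction as d1, twice the length
d3 = {"trek": 1.0}                      # no word in common with d1

print(cosine_similarity(d1, d2), cosine_similarity(d1, d3))
```

Because the score depends only on the direction of the vectors, a long and a short document with the same word proportions are treated as identical.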
Clustering
In some pattern recognition problems, the training data consists of a set of input vectors x
without any corresponding target values. The goal in such unsupervised learning
problems may be to discover groups of similar examples within the data; this task is
called clustering.
Classification
In contrast to clustering, where groups are unknown at the beginning, classification tries
to put specific documents into groups known in advance.
Typical real-world examples are spam classification of e-mails or classifying news
articles into topics.
In the following, we give two examples:
1 a very simple classifier: k-Nearest Neighbors (k-NN).
In k-NN classification an object is classified by a majority vote of its neighbors, with
the object being assigned to the class most common among its k nearest
neighbors (k is a positive integer, typically small).
2 a more advanced method: Support Vector Machines.
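The majority vote of k-NN fits in a few lines. The sketch below assumes Euclidean distance and made-up two-dimensional feature vectors (for instance, counts of two indicator words per e-mail); neither choice comes from the slides.

```python
import math
from collections import Counter

def knn_classify(x, training, k=3):
    """training: list of (vector, label) pairs; vectors are equal-length lists of numbers."""
    by_distance = sorted(training, key=lambda item: math.dist(x, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

training = [([5, 0], "spam"), ([4, 1], "spam"),
            ([0, 3], "ham"), ([1, 4], "ham"), ([0, 5], "ham")]

print(knn_classify([4, 0], training))
print(knn_classify([0, 4], training))
```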
Non-negative matrix factorization (NMF)
A term-document matrix tf is factored into a term-feature and a feature-document matrix:
WH = tf
The features are derived from the contents of the documents, and the feature-document
matrix describes data clusters of related documents.
Matrix multiplication can be viewed as forming linear combinations of the column vectors of $W$
with coefficients supplied by the cell values of $H$. Each column of $tf$ can be computed as
follows:
$tf_i = \sum_{j=1}^{K} H_{ji} w_j$
$tf_i$ is the $i$-th column vector of the product matrix $tf$
$H_{ji}$ is the cell value in the $j$-th row and $i$-th column of the matrix $H$
$w_j$ is the $j$-th column of the matrix $W$
$K$ is the number of features (the number of columns of $W$)
When multiplying matrices, the dimensions of the factor matrices may be significantly
lower than those of the product matrix and it’s this property that forms the basis of NMF.
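The column-wise view of the product can be verified on small made-up non-negative matrices. This sketch only checks the identity for given factors; it does not compute an NMF.

```python
W = [[1.0, 0.0],
     [0.5, 2.0],
     [0.0, 1.0]]          # term-feature matrix (3 terms x 2 features)
H = [[2.0, 0.0, 1.0],
     [1.0, 3.0, 0.0]]     # feature-document matrix (2 features x 3 documents)

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def column_as_combination(W, H, i):
    """Column i of W H, built as sum over features j of H[j][i] * (column j of W)."""
    n_terms, n_features = len(W), len(H)
    return [sum(H[j][i] * W[t][j] for j in range(n_features)) for t in range(n_terms)]

tf = matmul(W, H)
for i in range(len(H[0])):
    print([row[i] for row in tf], column_as_combination(W, H, i))
```

Here the two factors hold 6 + 6 numbers against 9 in the product; for large term-document matrices and few features the saving is far larger.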
Statistical Bag-of-Words
We assume statistical independence among word occurrences.
The general class of bag-of-words models is known as topic models. The basic idea of topic
models is that texts have a higher-order (latent semantic) structure which, however,
is obscured by word usage (e.g. through the use of synonyms or polysemy). By using
conceptual indices that are derived statistically via a truncated singular value
decomposition (a two-mode factor analysis) over a given document-term matrix, this
variability problem can be overcome.
We use statistical inference methods to train this type of model. Consider, for
instance, the following decomposition of the probability $p(d)$ of a given document $d$:
$p(d) = \sum_z p(d, z) = \sum_z p(z) \, p(d|z)$
We represent the document probability as the result of marginalizing the joint probability
distribution p(d, z) over a hidden discrete variable z. This hidden or latent variable is
commonly referred to as the topic variable.
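Assuming that the words $w_n$ of the document are independent given the topic, this becomes:
$p(d) \approx \sum_z p(z) \prod_n p(w_n|z)$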
Statistical Bag-of-Words
We determine $p(z)$ and $p(w|z)$ from a given set of data with the EM algorithm.
Initial step. Choose a number of topics and randomly generate some initial values for
$p(z)$ and $p(w|z)$.
1. step (E-step). We compute $p(z|d)$ from the topic probabilities $p(z)$ and the
conditional probabilities of the words given the topics $p(w|z)$:
$p(z|d) = \frac{1}{\gamma} \, p(d|z) \, p(z) \approx \frac{1}{\gamma} \, p(z) \prod_n p(w_n|z)$
where $1/\gamma$ is a normalization factor that enforces the condition $\sum_z p(z|d) = 1$.
2. step (M-step). We estimate new values for both $p(z)$ and $p(w|z)$ from the $p(z|d)$
probabilities obtained above and the word-per-document occurrences that are counted
over the dataset:
$p(w|z) = \frac{1}{\xi} \sum_d c(d,w) \, p(z|d)$
$p(z) = \frac{1}{\rho} \sum_d \sum_w c(d,w) \, p(z|d)$
where $c(d,w)$ is the number of times vocabulary
word $w$ occurs in document $d$ of the collection, and $1/\xi$ and $1/\rho$ are normalization
factors that enforce the conditions $\sum_w p(w|z) = 1$ and $\sum_z p(z) = 1$.
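The two steps can be sketched in plain Python. Several details are assumptions that go beyond the slides: the E-step works in log space to avoid underflow, a small constant `eps` is added in the M-step so that no probability becomes exactly zero, and the number of iterations is fixed. The four toy documents are made up.

```python
import math
import random

def em_mixture_of_unigrams(docs, n_topics, n_iter=50, seed=0, eps=1e-9):
    """docs: list of {word: count}. Returns (p_z, p_w_z, p_z_d)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})

    def normalize(values):
        total = sum(values)
        return [v / total for v in values]

    # Initial step: random initial values for p(z) and p(w|z).
    p_z = normalize([rng.random() + 0.1 for _ in range(n_topics)])
    p_w_z = [dict(zip(vocab, normalize([rng.random() + 0.1 for _ in vocab])))
             for _ in range(n_topics)]

    for _ in range(n_iter):
        # E-step: p(z|d) proportional to p(z) * prod_n p(w_n|z), computed via logs.
        p_z_d = []
        for d in docs:
            log_post = [math.log(p_z[z])
                        + sum(c * math.log(p_w_z[z][w]) for w, c in d.items())
                        for z in range(n_topics)]
            top = max(log_post)
            p_z_d.append(normalize([math.exp(v - top) for v in log_post]))
        # M-step: p(w|z) from sum_d c(d,w) p(z|d), p(z) from sum_d sum_w c(d,w) p(z|d).
        p_w_z = [dict(zip(vocab, normalize(
                     [eps + sum(d.get(w, 0) * p_z_d[i][z] for i, d in enumerate(docs))
                      for w in vocab])))
                 for z in range(n_topics)]
        p_z = normalize([eps + sum(sum(d.values()) * p_z_d[i][z]
                                   for i, d in enumerate(docs))
                         for z in range(n_topics)])
    return p_z, p_w_z, p_z_d

docs = [{"beach": 3, "sun": 2}, {"beach": 2, "sun": 3},
        {"snow": 3, "ski": 2}, {"snow": 2, "ski": 3}]
p_z, p_w_z, p_z_d = em_mixture_of_unigrams(docs, n_topics=2)
print(p_z)
print(p_z_d)
```

EM converges only to a local optimum, so the result depends on the random initialization; in practice the procedure is usually restarted from several seeds.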
Latent Dirichlet Allocation (LDA)
Dirichlet distribution:
$f(x_1, \dots, x_{K-1}; \alpha_1, \dots, \alpha_K) = \frac{1}{B(\alpha)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}$
on the open $(K-1)$-dimensional simplex defined by:
$x_1, \dots, x_{K-1} > 0$ (1)
$x_1 + \cdots + x_{K-1} < 1$ (2)
$x_K = 1 - x_1 - \cdots - x_{K-1}$ (3)
and zero elsewhere.
The normalizing constant is the multinomial Beta function, which can be expressed in
terms of the gamma function:
$B(\alpha) = \frac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^{K} \alpha_i\right)}, \quad \alpha = (\alpha_1, \dots, \alpha_K).$
Multinomial distribution: the probability mass function can be expressed using the
gamma function as:
$f(x_1, \dots, x_k; p_1, \dots, p_k) = \frac{\Gamma\left(\sum_i x_i + 1\right)}{\prod_i \Gamma(x_i + 1)} \prod_{i=1}^{k} p_i^{x_i}.$
Latent Dirichlet Allocation (LDA)
LDA consists of the following three steps.
Step 1: The term distribution $p(w|z)$ is determined for each topic by
$p(w|z) \sim Dirichlet(\beta)$. The dimensions of the parameters correspond to the number of topics $k$.
Step 2: The proportions $p(z)$ of the topic distribution for the document $d$ are determined
by $p(z) \sim Dirichlet(\alpha)$.
Step 3: For each of the $N$ words $w_i$:
1 Choose a topic $z_i \sim Multinomial(\theta)$, where $\theta$ denotes the topic proportions $p(z)$ from Step 2.
2 Choose a word $w_i$ from a multinomial probability distribution conditioned on the
topic $z_i$: $p(w_i | z_i, \beta)$
β is the term distribution of topics and contains the probability of a word occurring in a
given topic.
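The generative process can be simulated with the standard library. The sketch below draws Dirichlet samples by normalizing independent Gamma draws; the symmetric hyperparameter values, the vocabulary and the function names are assumptions made for the example. It only generates data and does not fit an LDA model.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from Dirichlet(alphas) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [x / total for x in draws]

def generate_document(vocab, n_topics, n_words, alpha=0.5, beta=0.5, seed=0):
    rng = random.Random(seed)
    # Step 1: one term distribution per topic, each drawn from a Dirichlet.
    topic_word = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(n_topics)]
    # Step 2: topic proportions of the document, drawn from a Dirichlet.
    theta = sample_dirichlet([alpha] * n_topics, rng)
    words = []
    for _ in range(n_words):
        z = rng.choices(range(n_topics), weights=theta)[0]   # Step 3.1: choose a topic
        w = rng.choices(vocab, weights=topic_word[z])[0]     # Step 3.2: choose a word
        words.append(w)
    return theta, words

vocab = ["beach", "sun", "snow", "ski"]
theta, words = generate_document(vocab, n_topics=2, n_words=8)
print(theta)
print(words)
```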
Information Extraction
Information Extraction (IE) systems analyse unrestricted text in order to extract
information about pre-specified types of events, entities or relationships.
Part-of-Speech (POS) tagging
Special POS selections: select only JJ (Adjective) or only NN (Noun, singular or
mass) or assemble consecutive nouns
Noun Phrase (NP) chunking and chinking: write a small grammar with regular
expressions that captures a part of a sentence, for example an individual noun phrase
(a chunk)
Named Entity Recognition (NER): identify named entities and classify them into
a set of predefined types such as ORGANIZATION, PERSON, LOCATION, DATE,
TIME, MONEY, PERCENT, FACILITY, GPE
Entity Feature
Relation Extraction
Figure: POS parse tree of the tagged sentence It/PRP was/VBD an/DT amazing/NN experience/NN !/. with an NP chunk
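Noun phrase chunking over an already tagged sentence can be sketched without any NLP library. The chunk pattern used here (an optional determiner, any number of adjectives, then one or more nouns) and the hand-assigned tags of the example sentence are assumptions for illustration; a real system would obtain the tags from a POS tagger.

```python
def np_chunks(tagged):
    """Find chunks matching: optional DT, any number of JJ*, one or more NN*."""
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":                          # optional determiner
            j += 1
        while j < n and tagged[j][1].startswith("JJ"):    # adjectives
            j += 1
        k = j
        while k < n and tagged[k][1].startswith("NN"):    # nouns (NN, NNS, NNP, ...)
            k += 1
        if k > j:                                         # at least one noun was found
            chunks.append([word for word, _ in tagged[i:k]])
            i = k
        else:
            i += 1
    return chunks

# Hand-tagged example sentence (tags assigned manually for this sketch).
tagged = [("It", "PRP"), ("was", "VBD"), ("an", "DT"),
          ("amazing", "JJ"), ("experience", "NN"), ("!", ".")]
print(np_chunks(tagged))
```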
Sentiment analysis
Sentiment can be positive or negative, and sentiment analysis can be applied, for example, to product search.
Polarity estimation: in sentiment analysis we learn how the property, subjectivity or
sentiment of a sentence can be deduced from the words occurring in the sentence. This involves two tasks:
1 to identify the product property mentioned in the given sentence,
2 to identify the polarity of the opinion (i.e. the sentiment).
The identification of polarity can again be divided into
1 identifying subjective and objective sentences and
2 identifying the polarity as neutral, positive or negative
Semantic Orientation (SO)
For sentiment analysis we can calculate the Semantic Orientation (SO) of the documents in a
corpus.
The SO measure of each individual phrase is calculated from its association with the
positive reference word "excellent" and the negative reference word "poor", and is given by
$SO(phrase) = PMI(phrase, \text{``excellent''}) - PMI(phrase, \text{``poor''})$
As an approximate sentiment analysis, we can treat the whole text as one long
sentence and calculate the SO measure for the dictionary.
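A minimal sketch of the SO measure. It assumes that PMI is estimated from document-level co-occurrence counts, that phrases are single tokens, and that a small `smoothing` constant is added to every count to avoid division by zero; the four toy documents are made up.

```python
import math

def doc_count(docs, *words):
    """Number of documents (sets of tokens) that contain all the given words."""
    return sum(1 for d in docs if all(w in d for w in words))

def pmi(docs, a, b, smoothing=0.01):
    n = len(docs)
    p_ab = (doc_count(docs, a, b) + smoothing) / n
    p_a = (doc_count(docs, a) + smoothing) / n
    p_b = (doc_count(docs, b) + smoothing) / n
    return math.log2(p_ab / (p_a * p_b))

def semantic_orientation(docs, phrase):
    """SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor")."""
    return pmi(docs, phrase, "excellent") - pmi(docs, phrase, "poor")

docs = [{"excellent", "friendly", "guide"}, {"excellent", "friendly"},
        {"poor", "dirty", "hotel"}, {"poor", "dirty"}]

print(semantic_orientation(docs, "friendly"))   # positive orientation
print(semantic_orientation(docs, "dirty"))      # negative orientation
```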