DMTM 2015 - 18 Text Mining Part 2

Prof. Pier Luca Lanzi
Text Mining – Part 2
Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)

Transcript of DMTM 2015 - 18 Text Mining Part 2

Page 1: DMTM 2015 - 18 Text Mining Part 2

Prof. Pier Luca Lanzi

Text Mining – Part 2
Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)

Page 2: DMTM 2015 - 18 Text Mining Part 2


Association

Page 3: DMTM 2015 - 18 Text Mining Part 2


Keyword-Based Association Analysis

•  Aims to discover sets of keywords that occur frequently together in the documents

•  Relies on the usual techniques for mining association and correlation rules

•  Each document is considered as a transaction of type

{document id, {set of keywords}}
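
As an illustration of this representation (a minimal sketch, not from the slides; the toy documents and the min_support threshold are invented), each document can be turned into a transaction and frequent keyword pairs counted directly:

from itertools import combinations
from collections import Counter

# Hypothetical toy corpus: each document becomes a transaction,
# i.e., the set of keywords it contains.
docs = [
    "stanford university research exchange",
    "dollars shares exchange market",
    "stanford university dollars shares",
]
transactions = [set(d.split()) for d in docs]

min_support = 2  # keep keyword pairs that occur in at least 2 documents

# Count co-occurring keyword pairs across all transactions.
pair_counts = Counter()
for keywords in transactions:
    for pair in combinations(sorted(keywords), 2):
        pair_counts[pair] += 1

frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent_pairs)  # e.g., {('stanford', 'university'): 2, ('dollars', 'shares'): 2, ...}

In practice an Apriori- or FP-growth-style miner would replace the explicit pair counting, but the transaction view of the documents is the same.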

Page 4: DMTM 2015 - 18 Text Mining Part 2


Why Mining Associations?

•  Association mining may discover sets of consecutive or closely-located keywords, called terms or phrases
§ Compound (e.g., {Stanford, University})
§ Noncompound (e.g., {dollars, shares, exchange})

•  Once the most frequent terms have been discovered, term-level mining can be applied more effectively than mining at the single-word level

•  They are useful for improving accuracy of many NLP tasks
§ POS tagging, parsing, entity recognition, acronym expansion
§ Grammar learning


Page 5: DMTM 2015 - 18 Text Mining Part 2


Why Mining Associations?

•  They are useful for improving accuracy of many NLP tasks
§ POS tagging, parsing, entity recognition, acronym expansion
§ Grammar learning

•  They are directly useful for many applications in text retrieval and mining
§ Text retrieval (e.g., use word associations to suggest a variation of a query)
§ Automatic construction of a topic map for browsing: words as nodes and associations as edges
§ Compare and summarize opinions (e.g., what words are most strongly associated with “battery” in positive and negative reviews about iPhone 6, respectively?)


Page 6: DMTM 2015 - 18 Text Mining Part 2


Basic Word Relations: Paradigmatic vs. Syntagmatic

•  Paradigmatic
§ A & B have paradigmatic relation if they can be substituted for each other (i.e., A & B are in the same class)
§ For example, “cat” and “dog”, “Monday” and “Tuesday”

•  Syntagmatic
§ A & B have syntagmatic relation if they can be combined with each other (i.e., A & B are related semantically)
§ For example, “cat” and “sit”, “car” and “drive”

•  These two basic and complementary relations can be generalized to describe relations of any items in a language


Page 7: DMTM 2015 - 18 Text Mining Part 2


Mining Paradigmatic Relations

Page 8: DMTM 2015 - 18 Text Mining Part 2


Mining Syntagmatic Relations

Page 9: DMTM 2015 - 18 Text Mining Part 2


How to Mine Paradigmatic and Syntagmatic Relations?

•  Paradigmatic
§ Represent each word by its context
§ Compute context similarity
§ Words with high context similarity likely have paradigmatic relation

•  Syntagmatic
§ Count how many times two words occur together in a context (e.g., sentence or paragraph)
§ Compare their co-occurrences with their individual occurrences
§ Words with high co-occurrences but relatively low individual occurrences likely have syntagmatic relation

•  Paradigmatically related words tend to have syntagmatic relation with the same word, thus we can join the discovery of the two relations


Page 10: DMTM 2015 - 18 Text Mining Part 2


Mining Paradigmatic Relations

Page 11: DMTM 2015 - 18 Text Mining Part 2


Mining Paradigmatic Relations

Page 12: DMTM 2015 - 18 Text Mining Part 2


Sim(“cat”, “dog”) =
Sim(Left1(“cat”), Left1(“dog”)) +
Sim(Right1(“cat”), Right1(“dog”)) +
…
+ Sim(Window8(“cat”), Window8(“dog”)) = ?

High value of Sim(word1, word2) implies that word1 and word2 are paradigmatically related
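
As a rough illustration of this idea (a sketch of mine, not the course's code: the toy sentences and the single symmetric window stand in for the Left1/Right1/Window8 contexts above), context bags can be compared with cosine similarity:

from collections import Counter
import math

# Hypothetical toy corpus; in practice this would be a large collection of sentences.
sentences = [
    "my cat eats fish on saturday",
    "my dog eats meat on sunday",
    "the cat sat on the mat",
    "the dog sat on the rug",
]

def context_vector(word, window=2):
    """Bag of the words appearing within +/- window positions of `word`."""
    ctx = Counter()
    for s in sentences:
        tokens = s.split()
        for i, t in enumerate(tokens):
            if t == word:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        ctx[tokens[j]] += 1
    return ctx

def cosine(c1, c2):
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# A high context similarity suggests a paradigmatic relation.
print(cosine(context_vector("cat"), context_vector("dog")))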

Page 13: DMTM 2015 - 18 Text Mining Part 2


Adapting BM25 to Paradigmatic Relation Mining

Page 14: DMTM 2015 - 18 Text Mining Part 2


Mining Syntagmatic Relations

Page 15: DMTM 2015 - 18 Text Mining Part 2


Mining Syntagmatic Relations

Mining Paradigmatic Relations

Page 16: DMTM 2015 - 18 Text Mining Part 2


Predicting Words

•  We want to predict whether the word W is present or absent in a text segment

•  Some words are easier to predict than others
§ W = “the” (easy because frequent)
§ W = “unicorn” (easy because rare)
§ W = “meat” (difficult because neither frequent nor rare)

•  Formal definition
§ The binary random variable Xw is 1 if w is present, 0 otherwise
§ p(Xw=1) + p(Xw=0) = 1

•  The more random Xw is, the more difficult it is to predict


How can we measure the “randomness” of Xw?
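
The standard measure, which the next slide builds on, is the entropy of Xw (recalled here for reference; base-2 logarithm as usual):

H(X_w) = - p(X_w = 1) \log_2 p(X_w = 1) - p(X_w = 0) \log_2 p(X_w = 0)

It is 0 when w is always present or always absent, and maximal (1 bit) when p(X_w = 1) = 0.5.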

Page 17: DMTM 2015 - 18 Text Mining Part 2


Conditional Entropy

•  The entropy of Xw measures the randomness of predicting w
§ The lower the entropy, the easier to predict
§ The higher the entropy, the more difficult to predict

•  Suppose now that “eat” is present in the segment and that we need to predict the presence of “meat”

•  Questions
§ Does the presence of “eat” help predict the presence of “meat”?
§ Does it reduce the uncertainty about “meat”?
§ What if “eat” is not present?


Page 18: DMTM 2015 - 18 Text Mining Part 2


Complete Formula
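
The formula itself did not survive the transcript; the conditional entropy it refers to has the standard form (a reconstruction for reference):

H(X_{meat} \mid X_{eat}) = \sum_{u \in \{0,1\}} p(X_{eat} = u) \, H(X_{meat} \mid X_{eat} = u)
                         = - \sum_{u \in \{0,1\}} \sum_{v \in \{0,1\}} p(X_{eat} = u, X_{meat} = v) \log_2 p(X_{meat} = v \mid X_{eat} = u)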

Page 19: DMTM 2015 - 18 Text Mining Part 2


In general, for any discrete random variable we have that H(X) ≥ H(X|Y)

What is the minimum possible value of H(X|Y)?

Which one is smaller? H(Xmeat|Xthe) or H(Xmeat|Xeat)?

Page 20: DMTM 2015 - 18 Text Mining Part 2


Mining Syntagmatic Relations using Entropy

•  For each word W1
§ For every other word W2, compute H(XW1|XW2)
§ Sort all the candidates in ascending order of H(XW1|XW2)
§ Take the top-ranked candidate words as words that have potential syntagmatic relations with W1
§ Need to use a threshold for each W1

•  However, while H(XW1|XW2) and H(XW1|XW3) are comparable, H(XW1|XW2) and H(XW3|XW2) aren’t!
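
A minimal sketch of this procedure (mine, not the slides': the toy sentences are invented and sentence-level presence/absence is used as the context):

import math
from itertools import product

sentences = [
    "people eat meat every day",
    "cats eat fish",
    "the dog ate the meat",
    "we eat vegetables and meat",
]
segments = [set(s.split()) for s in sentences]
N = len(segments)

def p(word, present):
    return sum((word in seg) == present for seg in segments) / N

def p_joint(w1, v1, w2, v2):
    return sum(((w1 in seg) == v1) and ((w2 in seg) == v2) for seg in segments) / N

def conditional_entropy(w1, w2):
    """Estimate H(X_w1 | X_w2) from the segments."""
    h = 0.0
    for v2, v1 in product([True, False], repeat=2):
        pj = p_joint(w1, v1, w2, v2)
        if pj > 0:  # zero-probability terms contribute nothing
            h -= pj * math.log2(pj / p(w2, v2))
    return h

# A lower H(X_w1 | X_w2) means w2 tells us more about w1 (candidate syntagmatic pair).
print(conditional_entropy("meat", "eat"), conditional_entropy("meat", "the"))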


Page 21: DMTM 2015 - 18 Text Mining Part 2


Text Classification

Page 22: DMTM 2015 - 18 Text Mining Part 2


Document classification

•  Solves the problem of automatically labeling text documents on the basis of
§ Topic
§ Style
§ Purpose

•  Usual classification techniques can be used to learn from a training set of manually labeled documents

•  Which features? There can be thousands of keywords…

•  Major approaches
§ Similarity-based
§ Dimensionality reduction
§ Naïve Bayes text classifiers


Page 23: DMTM 2015 - 18 Text Mining Part 2


Similarity-based Text Classifiers

•  Exploits Information Retrieval and the k-nearest-neighbor classifier
§ For a new document to classify, the k most similar documents in the training set are retrieved
§ Documents are classified on the basis of the class distribution among the k retrieved documents, using a majority vote or a weighted vote
§ Tuning k is very important to achieve good performance

•  Limitations
§ Space overhead to store all the documents in the training set
§ Time overhead to retrieve the similar documents
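
A minimal sklearn-based sketch of this scheme (an illustration, not the course's own code; the documents, labels, and k = 3 are invented for the example):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical labeled training documents.
train_docs = [
    "the team won the championship game",
    "the striker scored a late goal",
    "the government passed a new tax law",
    "parliament debated the budget proposal",
]
train_labels = ["sport", "sport", "politics", "politics"]

# TF-IDF representation + k-NN with cosine distance and a distance-weighted vote.
clf = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    KNeighborsClassifier(n_neighbors=3, metric="cosine", weights="distance"),
)
clf.fit(train_docs, train_labels)

print(clf.predict(["parliament passed a new law on the budget"]))  # prints ['politics']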


Page 24: DMTM 2015 - 18 Text Mining Part 2


Dimensionality Reduction for Text Classification

•  As in the Vector Space Model, the goal is to reduce the number of features used to represent text

•  Usual dimensionality reduction approaches in Information Retrieval are based on the distribution of keywords over the whole document database

•  In text classification it is also important to consider the correlation between keywords and classes
§ Rare keywords have a high TF-IDF but might be uniformly distributed among classes
§ LSA and LPA do not take the class distributions into account

•  Usual classification techniques, such as SVMs and Naïve Bayes classifiers, can then be applied on the reduced feature space
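
As one concrete illustration of class-aware feature reduction (my example, not the slides': it uses a chi-squared score between keywords and class labels rather than the methods named above; the corpus and k are invented):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical labeled corpus (1 = spam, 0 = ham).
docs = [
    "cheap pills buy now limited offer",
    "meeting agenda attached see you monday",
    "win money now click the link",
    "project report draft for review",
]
labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Keep the k keywords most correlated with the class labels (chi-squared test).
selector = SelectKBest(chi2, k=5)
X_reduced = selector.fit_transform(X, labels)

selected = [w for w, keep in zip(vectorizer.get_feature_names_out(), selector.get_support()) if keep]
print(selected)  # the keywords retained as features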


Page 25: DMTM 2015 - 18 Text Mining Part 2


Naïve Bayes for Text Classification

•  Document categories {C1, …, Cn}
•  Document to classify D
•  Probabilistic model:

P(Ci | D) = P(D | Ci) P(Ci) / P(D)

•  We choose the class C* such that C* = argmaxi P(Ci | D)

•  Issues
§ Which features?
§ How to compute the probabilities?


Page 26: DMTM 2015 - 18 Text Mining Part 2


Naïve Bayes for Text Classification

•  Features can be simply defined as the words in the document

•  Let ai be a keyword in the doc and wj a word in the vocabulary; we get:
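
The formula on the original slide is missing from the transcript; in the standard Naïve Bayes formulation it reads (a reconstruction, with a_k denoting the keyword in position k of D):

C^* = \arg\max_{C_i} \; P(C_i) \prod_{k=1}^{|D|} P(a_k \mid C_i)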

Example

H={like,dislike} D= “Our approach to representing arbitrary text documents is disturbingly simple”


Page 27: DMTM 2015 - 18 Text Mining Part 2


Naïve Bayes for Text Classification

•  Features can be simply defined as the words in the document

•  Let ai be a keyword in the doc and wj a word in the vocabulary

•  Assumptions
§ Keyword distributions are inter-independent
§ Keyword distributions are order-independent


Page 28: DMTM 2015 - 18 Text Mining Part 2


Naïve Bayes for Text Classification

•  How to compute the probabilities?
§ Simply counting the occurrences may lead to wrong results when probabilities are small

•  M-estimate approach adapted for text:
§ Nc is the total number of word positions in documents of class C
§ Nc,k is the number of occurrences of wk in documents of class C
§ |Vocabulary| is the number of distinct words in the training set
§ Uniform priors are assumed
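
The m-estimate formula itself is missing from the transcript; with the quantities defined above and uniform priors it takes the usual form:

P(w_k \mid C) = \frac{N_{C,k} + 1}{N_C + |Vocabulary|}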


Page 29: DMTM 2015 - 18 Text Mining Part 2


Naïve Bayes for Text Classification

•  Final classification is performed as

•  Despite its simplicity, the Naïve Bayes classifier works very well in practice

•  Applications
§ Newsgroup post classification
§ NewsWeeder (news recommender)
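
For illustration (a sketch of mine, not the course's code), the pipeline of word-count features, Laplace-smoothed estimates as in the previous slide, and the argmax rule corresponds to a multinomial Naïve Bayes classifier; the documents and labels below are invented:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data for a like/dislike classifier.
train_docs = [
    "a wonderful movie with a brilliant cast",
    "I really liked the plot and the acting",
    "a boring film with terrible dialogue",
    "I disliked the slow and predictable story",
]
train_labels = ["like", "like", "dislike", "dislike"]

# alpha=1.0 is Laplace smoothing, matching the (Nc,k + 1) / (Nc + |Vocabulary|) estimate.
clf = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
clf.fit(train_docs, train_labels)

print(clf.predict(["a brilliant and wonderful story"]))  # prints ['like']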

Page 30: DMTM 2015 - 18 Text Mining Part 2


Clustering Text

Page 31: DMTM 2015 - 18 Text Mining Part 2


Text Clustering

•  It involves the same steps as other text mining algorithms to preprocess the text data into an adequate representation
§ Tokenizing and stemming each synopsis
§ Transforming the corpus into vector space using TF-IDF
§ Calculating the cosine distance between each pair of documents as a measure of similarity
§ Clustering the documents using the similarity measure

•  Everything works as with the usual data, except that text requires rather lengthy preprocessing.
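
A compact sketch of these steps (illustrative only; the synopses, the number of clusters, and the choice of k-means are mine, and stemming is omitted):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_distances

# Hypothetical corpus of short synopses.
synopses = [
    "a detective investigates a murder in a small town",
    "a cop hunts a serial killer across the city",
    "two friends take a comic road trip across the country",
    "a hilarious family vacation goes completely wrong",
]

# TF-IDF vector space (tokenization is handled by the vectorizer).
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(synopses)

# Pairwise cosine distances between documents.
dist = cosine_distances(X)
print(dist.round(2))

# Cluster the documents (k-means on the TF-IDF vectors; k=2 chosen for the toy example).
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(labels)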


Page 32: DMTM 2015 - 18 Text Mining Part 2


Page 33: DMTM 2015 - 18 Text Mining Part 2


Weka DEMO

•  Movie Review Data
§ http://www.cs.cornell.edu/people/pabo/movie-review-data/

•  Load Text Data in Weka (CLI)


Page 34: DMTM 2015 - 18 Text Mining Part 2


Weka DEMO (2)

•  From text to word vectors: StringToWordVector
•  Naïve Bayes Classifier
•  Stoplist
§ https://code.google.com/p/stop-words/

•  Attribute Selection
