
1

ADB 2011 – Text Mining

Bettina Berendt, K.U.Leuven


2

Agenda

A basic concept: Texts as feature vectors (so we can apply the algorithms we know)


Text classification

Other approaches to opinion mining

Further examples from mining news, blogs and other social media

Some notes about text preprocessing


3

Agenda

A basic concept: Texts as feature vectors (so we can apply the algorithms we know)


Text classification

Other approaches to opinion mining

Further examples from mining news, blogs and other social media

Some notes about text preprocessing


4

The goal: text representation in the usual “feature” model

Basic idea:

Keywords are extracted from texts.

These keywords describe the (usually) topical content of Web pages and other text contributions.

Based on the vector space model of document collections:

Each unique word in a corpus of Web pages = one dimension

Each page(view) is a vector with non-zero weight for each word in that page(view), zero weight for other words

Words become “features” (in a data-mining sense)
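As an aside (not from the slides): a minimal Python sketch of this idea, building raw term-frequency vectors over an invented toy corpus.

# Minimal vector-space model: each unique word in the corpus is one
# dimension; each text becomes a vector of raw term frequencies.
texts = {
    "A": "nova galaxy heat nova",
    "B": "film actor role film film",
}

# Global dictionary: one dimension per unique word in the corpus.
vocabulary = sorted({word for text in texts.values() for word in text.split()})

# Non-zero weight for each word in the text, zero weight for other words.
vectors = {
    doc_id: [text.split().count(word) for word in vocabulary]
    for doc_id, text in texts.items()
}

print(vocabulary)    # ['actor', 'film', 'galaxy', 'heat', 'nova', 'role']
print(vectors["A"])  # [0, 0, 1, 1, 2, 0]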

5

How to get there

Feature representation for texts:

Each text p is represented as a k-dimensional feature vector, where k is the total number of features extracted from the site into a global dictionary.

The feature vectors obtained are organized into an inverted file structure containing a dictionary of all extracted features and posting files for the pageviews.

Conceptually, the inverted file structure represents a document-feature matrix, where each row is the feature vector for a page and each column is a feature.

6

Document Representation as Vectors

Features: nova, galaxy, heat, actor, film, role, diet

Each row below is a document vector (document IDs A–F); only the non-zero feature weights are shown:

A  1.0  0.5  0.3
B  0.5  1.0
C  0.4  1.0  0.8  0.7
D  0.9  1.0  0.5
E  0.5  0.7  0.9
F  0.6  1.0  0.3  0.2  0.8

Starting point is the raw term frequency as term weights.

Other weighting schemes can generally be obtained by applying various transformations to the document vectors.
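One such transformation, sketched in Python: the textbook TF.IDF reweighting (variants differ between systems; this formulation is an illustration, not the deck's prescribed formula).

import math

def tf_idf(doc_term_counts):
    """Reweight raw term frequencies: weight(term, doc) = tf * log(N / df).

    doc_term_counts: {doc_id: {term: raw term frequency}}
    """
    n_docs = len(doc_term_counts)
    # Document frequency: in how many documents does each term occur?
    df = {}
    for counts in doc_term_counts.values():
        for term in counts:
            df[term] = df.get(term, 0) + 1
    return {
        doc_id: {term: tf * math.log(n_docs / df[term])
                 for term, tf in counts.items()}
        for doc_id, counts in doc_term_counts.items()
    }

weights = tf_idf({"A": {"nova": 2, "heat": 1}, "B": {"nova": 1, "film": 3}})
# "nova" occurs in every document, so log(2/2) = 0: it gets zero weight.
print(weights["A"])  # {'nova': 0.0, 'heat': 0.693...}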

7

8

9

10

Agenda

A basic concept: Texts as feature vectors (so we can apply the algorithms we know)


Text classification

Other approaches to opinion mining

Further examples from mining news, blogs and other social media

Some notes about text preprocessing


11

The idea of text mining ...

... is to go beyond frequency-counting

... is to go beyond the search-for-documents framework

... is to find patterns (of meaning) within and across documents

(yes, there is text mining behind some of the things the above tools do!)

12

The steps of text mining

1. Application understanding

2. Corpus generation

3. Data understanding

4. Text preprocessing

5. Search for patterns / modelling

Topical analysis

Sentiment analysis / opinion mining

6. Evaluation

7. Deployment

13

Application understanding; Corpus generation

What is the question?

What is the context?

What could be interesting sources, and where can they be found?

Crawl

Use a search engine and/or archive:

Google Blog Search

Technorati

Blogdigger

...

14

Preprocessing (1)

Data cleaning

Goal: get clean ASCII text

Remove HTML markup*, pictures, advertisements, ...

Automate this: wrapper induction

* Note: HTML markup may carry information too (e.g., <b> or <h1> marks something important), which can be extracted! (Depends on the application)
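A minimal sketch of this cleaning step with Python's standard html.parser; the sample snippet is invented, and real pipelines add boilerplate/advertisement removal (e.g., learned wrappers).

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Strip HTML markup, keeping only the visible text."""
    SKIP = {"script", "style"}  # tags whose content is never visible text

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

parser = TextExtractor()
parser.feed("<h1>Camera review</h1><script>ads();</script><p>Great <b>pictures</b>.</p>")
print(" ".join(parser.parts))  # Camera review Great  pictures .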

15

Preprocessing (2)

Further text preprocessing

Goal: get processable lexical / syntactical units

Tokenize (find word boundaries)

Lemmatize / stem (e.g., lemmatize: buyers, buyer → buyer; stem: buyer, buying, ... → buy)

Remove stopwords

Find named entities (people, places, companies, ...); filtering

Resolve polysemy and homonymy: word sense disambiguation; “synonym unification“

Part-of-speech tagging; filtering of nouns, verbs, adjectives, ...

...

Most steps are optional and application-dependent!

Many steps are language-dependent; coverage of non-English varies.

Free and/or open-source tools or Web APIs exist for most steps.
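To illustrate tokenization, stopword removal and stemming, a sketch using NLTK as one example of such a free tool (assumes the 'punkt' and 'stopwords' resources have been fetched with nltk.download()):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

text = "The buyers were buying books for the concert."

tokens = nltk.word_tokenize(text.lower())  # tokenize: find word boundaries
stop = set(stopwords.words("english"))
content = [t for t in tokens if t.isalpha() and t not in stop]  # drop stopwords, punctuation
stems = [PorterStemmer().stem(t) for t in content]              # stem each remaining token

print(content)  # ['buyers', 'buying', 'books', 'concert']
print(stems)    # ['buyer', 'buy', 'book', 'concert']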

16

Preprocessing (3)

Creation of text representation

Goal: a representation that the modelling algorithm can work on

Most common forms: A text as

a set or (more usually) bag of words / vector-space representation: term-document matrix with weights reflecting occurrence, importance, ...

a sequence of words

a tree (parse trees)
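In practice the bag-of-words / term-document matrix is rarely built by hand; a sketch with scikit-learn as one possible library (get_feature_names_out assumes a recent version):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The pictures coming out of this camera are amazing.",
    "The pictures come out hazy if your hands shake.",
]

vectorizer = TfidfVectorizer(stop_words="english")
term_doc_matrix = vectorizer.fit_transform(docs)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # the global dictionary (features)
print(term_doc_matrix.shape)               # (2 documents, k features)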

17

An important part of preprocessing: Named-entity recognition (1)

18

An important part of preprocessing: Named-entity recognition (2)

Technique: Lexica, heuristic rules, syntax parsing

Re-use lexica and/or develop your own

configurable tools such as GATE

A challenge: multi-document named-entity recognition

See proposal in Subašić & Berendt (Proc. ICDM 2008)

19

The simplest form of content analysis is based on NER

Berendt, Schlegel & Koch. In Zerfaß et al. (Eds.), Kommunikation, Partizipation und Wirkungen im Social Web, 2008

20

Agenda

A basic concept: Texts as feature vectors (so we can apply the algorithms we know)


Text classification

Other approaches to opinion mining

Further examples from mining news, blogs and other social media

Some notes about text preprocessing


21

Note

Text classification was first done by topic (you'll do this in the exercise session), but the class could be anything.

In the following example, we'll use a sentiment class and thereby enter the area of sentiment/opinion mining (at the document level).

22

What makes people happy?

23

Happiness in the blogosphere

24

Well kids, I had an awesome birthday thanks to you. =D Just wanted to so thank you for coming and thanks for the gifts and junk. =) I have many pictures and I will post them later. hearts

current mood: happy

Home alone for too many hours, all week long ... screaming child, headache, tears that just won’t let themselves loose.... and now I’ve lost my wedding band. I hate this.

current mood: sad

What are the characteristic words of these two moods?

[Mihalcea, R. & Liu, H. (2006). In Proc. AAAI Spring Symposium CAAW.]

Slides based on Rada Mihalcea's presentation.

25

Data, data preparation and learning

LiveJournal.com – optional mood annotation

10,000 blog posts:

5,000 happy entries / 5,000 sad entries

average size 175 words / entry

post-processing – remove SGML tags, tokenization, part-of-speech tagging

quality of automatic “mood separation”

naïve Bayes text classifier, five-fold cross-validation

Accuracy: 79.13% (>> 50% baseline)
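The shape of this experiment, sketched with scikit-learn (this is not the authors' code, and the ten posts below are invented stand-ins for the 10,000 LiveJournal entries):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

posts = [
    "I had an awesome birthday yay",   # "happy" examples
    "lovely concert with cute friends",
    "shopping and lunch were so cool",
    "great books and an awesome film",
    "yay what a lovely day",
    "I cried and cried all night",     # "sad" examples
    "so lonely and upset today",
    "tears and a terrible headache",
    "sad news my cat died",
    "goodbye I am hurt and crying",
]
labels = ["happy"] * 5 + ["sad"] * 5

# Bag-of-words features + naïve Bayes, evaluated with five-fold cross-validation.
model = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(model, posts, labels, cv=5)
print(scores.mean())  # on the real corpus this was 0.7913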

26

Results: Corpus-derived happiness factors

Highest happiness factors:

yay 86.67
shopping 79.56
awesome 79.71
birthday 78.37
lovely 77.39
concert 74.85
cool 73.72
cute 73.20
lunch 73.02
books 73.02

Lowest happiness factors:

goodbye 18.81
hurt 17.39
tears 14.35
cried 11.39
upset 11.12
sad 11.11
cry 10.56
died 10.07
lonely 9.50
crying 5.50

happiness factor of a word = the number of occurrences in the happy blogposts / the total frequency in the corpus (the values above are percentages)
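A worked example of the formula (the occurrence counts are invented; only the resulting percentages match the table above):

def happiness_factor(occurrences_in_happy_posts, total_corpus_frequency):
    """Share of a word's corpus occurrences that fall in happy posts, in %."""
    return 100.0 * occurrences_in_happy_posts / total_corpus_frequency

print(round(happiness_factor(26, 30), 2))   # e.g., "yay":     86.67
print(round(happiness_factor(19, 101), 2))  # e.g., "goodbye": 18.81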

27

Agenda

A basic concept: Texts as feature vectors (so we can apply the algorithms we know)


Text classification

Other approaches to opinion mining

Further examples from mining news, blogs and other social media

Some notes about text preprocessing


28

Opinion at the “product-feature” level: Feature-based Summary (Hu and Liu, Proc. SIGKDD’04)

GREAT Camera., Jun 3, 2004

Reviewer: jprice174 from Atlanta, Ga.

I did a lot of research last year before I bought this camera... It kinda hurt to leave behind my beloved nikon 35mm SLR, but I was going to Italy, and I needed something smaller, and digital.

The pictures coming out of this camera are amazing. The 'auto' feature takes great pictures most of the time. And with digital, you're not wasting film if the picture doesn't come out. …….

Feature 1: picture

Positive: 12

The pictures coming out of this camera are amazing.

Overall this is a good camera with a really good picture clarity.

Negative: 2

The pictures come out hazy if your hands shake even for a moment during the entire process of taking a picture.

Focusing on a display rack about 20 feet away in a brightly lit room during day time, pictures produced by this camera were blurry and in a shade of orange.

Feature 2: battery life

Source: Product reviews similar to blogs, but (more) clearly product-related
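A toy sketch of the summary structure such a system produces; the (feature, polarity, sentence) triples below are hypothetical, and the hard part, extracting them from raw reviews, is not shown:

from collections import defaultdict

# Hypothetical opinion triples, as a feature-extraction step might emit them.
opinions = [
    ("picture", "positive", "The pictures coming out of this camera are amazing."),
    ("picture", "negative", "The pictures come out hazy if your hands shake."),
    ("battery life", "positive", "The battery lasts a full day of shooting."),
]

# Group opinion sentences per product feature and polarity, then count.
summary = defaultdict(lambda: defaultdict(list))
for feature, polarity, sentence in opinions:
    summary[feature][polarity].append(sentence)

for feature, by_polarity in summary.items():
    print("Feature:", feature)
    for polarity, sentences in by_polarity.items():
        print(" ", polarity + ":", len(sentences))
        for s in sentences:
            print("   ", s)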

29

SentiStrength

http://gplsi.dlsi.ua.es/congresos/wassa2010/authors/6%20Emotion%20detection.ppt

30

Agenda

A basic concept: Texts as feature vectors (so we can apply the algorithms we know)


Text classification

Other approaches to opinion mining

Further examples from mining news, blogs and other social media

Some notes about text preprocessing


31

More about named entities: co-occurrence

Feldman et al., Proc. ICDM 2007

Source: Discussion boards similar to blogs, but (more) clearly communication-related

32

Co-occurrence of brands and attributes

Feldman et al., Proc. ICDM 2007

33

Capturing “online buzz“: Bursty communication activities

34

Comparing search volume, news and blogs

35

More advanced text modelling: Summarization of time-indexed documents

Recall “Michelle Obama“

Google Trends, BlogPulse etc. associate documents / document sets with “bursts“

But: this means the user has to read the documents!

Can we do better and create a concise summary of what was discussed in that period?

Can we allow the user to ask for as much detail as s/he is interested in?

36

Yes – with STORIES (Subašić & Berendt, Proc. ICDM 2008)

37

Salient story elements

1. Identify content-bearing terms (e.g., the 150 top-TF.IDF terms over the whole corpus)

2. Split the whole corpus T by atomic time period (e.g., week)

3. For each time period t (atomic or moving-average), compute the weights for that period's corpus:

weight = support of the co-occurrence of two content-bearing terms w1, w2 in t = (# articles from t containing both w1 and w2 within a window) / (# all articles in t)

4. Threshold:

number of occurrences of co-occurrence(w1, w2) in t ≥ θ1 (e.g., 5)

time-relevance TR of co-occurrence(w1, w2) = support(co-occurrence(w1, w2)) in t / support(co-occurrence(w1, w2)) in T ≥ θ2 (e.g., 2)

Thresholds are set dynamically + interactively by the user.

5. Story elements = relationships = all these edges

Story basics = terms = all nodes connected by these edges
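A sketch of steps 3 and 4 in Python, under assumed data structures (each article is represented as the set of its content-bearing terms, and the period's articles are a subset of the whole corpus T; co-occurrence within a window is simplified to co-occurrence within the whole article):

from itertools import combinations

def cooccurrence_support(articles, terms):
    """articles: list of term sets. Returns {(w1, w2): support in this set}."""
    support = {}
    for pair in combinations(sorted(terms), 2):
        hits = sum(1 for a in articles if pair[0] in a and pair[1] in a)
        if hits:
            support[pair] = hits / len(articles)
    return support

def salient_story_elements(period_articles, all_articles, terms, theta1=5, theta2=2):
    """Edges (w1, w2) passing both thresholds in period t of corpus T."""
    support_t = cooccurrence_support(period_articles, terms)
    support_T = cooccurrence_support(all_articles, terms)
    return [
        pair for pair, sup in support_t.items()
        if sup * len(period_articles) >= theta1  # raw co-occurrence count >= theta1
        and sup / support_T[pair] >= theta2      # time-relevance TR >= theta2
    ]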

38

Salient story stages, and story evolution

6. Story stage = the story graph made of basics and elements in t

7. Story evolution = how story stages evolve over the time periods t in T

39

An event: a missing child

40

A central figure emerges in the police investigations

41

Uncovering more details

42

Uncovering more details

43

An eventless time

44

The story and the underlying documents

45

Navigating between documents; relating different source types to one another

(Berendt & Trümper, in press)

46

Literature and other sources

A good textbook on Text Mining:

Feldman, R. & Sanger, J. (2007). The Text Mining Handbook. Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press.

A good introduction (even if a bit old), including a good overview of preprocessing issues:

Baldi, P., Frasconi, P., & Smyth, P. (2003). Modeling the Internet and the Web. Probabilistic Methods and Algorithms. Wiley. Chapter 4: http://media.wiley.com/product_data/excerpt/61/04708490/0470849061.pdf

p. 29: Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., & Kappas, A. (2010). Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology, 61(12), 2544–2558. http://www.scit.wlv.ac.uk/~cm1993/papers/SentiStrengthPreprint.doc

More papers and materials are here: http://sentistrength.wlv.ac.uk/

Individual references:

pp. 36ff.: Subašić, I. & Berendt, B. (2008). Web mining for understanding stories through graph visualisation. In Proc. of the 2008 Eighth IEEE International Conference on Data Mining (pp. 570–579). Los Alamitos, CA: IEEE Computer Society Press.

p. 19: Berendt, B., Schlegel, M., & Koch, R. (2008). Die deutschsprachige Blogosphäre: Reifegrad, Politisierung, Themen und Bezug zu Nachrichtenmedien. In A. Zerfaß, M. Welker, & J. Schmidt (Eds.), Kommunikation, Partizipation und Wirkungen im Social Web (Band 2: Strategien und Anwendungen: Perspektiven für Wirtschaft, Politik, Publizistik) (pp. 72–96). Köln, Germany: Herbert von Halem Verlag.

pp. 31f.: Feldman, R., Fresko, M., Goldenberg, J., Netzer, O., & Ungar, L. H. (2007). Extracting product comparisons from discussion boards. In Proc. ICDM 2007 (pp. 469–474). IEEE Computer Society. http://ieeexplore.ieee.org/iel5/4470209/4470210/04470275.pdf?arnumber=4470275

p. 45: Berendt, B. & Trümper, D. (2009). Semantics-based analysis and navigation of heterogeneous text corpora: The porpoise news and blogs engine. In I.-H. Ting & H.-J. Wu (Eds.), Web Mining Applications in E-commerce and E-services (pp. 45–64). Berlin etc.: Springer, Studies in Computational Intelligence, Vol. 172. http://www.cs.kuleuven.be/~berendt/Papers/berendt_truemper_2009.pdf

p. 28: Hu, M. & Liu, B. (2004). Mining and summarizing customer reviews. In Proc. SIGKDD’04 (pp. 168–177). http://portal.acm.org/citation.cfm?doid=1014052.1014073

pp. 22ff.: Mihalcea, R. & Liu, H. (2006). A corpus-based approach to finding happiness. In Proc. AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.79.6759

See http://wiki.esi.ac.uk/Current_Approaches_to_Data_Mining_Blogs for more articles on the subject.