2009 © Qiaozhu Mei University of Illinois at Urbana-Champaign
Contextual Text Mining
Qiaozhu Mei ([email protected])
University of Illinois at Urbana-Champaign
Knowledge Discovery from Text
Text Mining System
Trend of Text Content
Content Type               Amount / day
Published Content          3-4G
Professional web content   ~2G
User generated content     8-10G
Private text content       ~3T
- Ramakrishnan and Tomkins 2007
Text on the Web (Unconfirmed)
[Figure: per-site posting volumes (~750k/day, ~3M/day, ~150k/day) and total item counts (1M, 6M, 10B, ~100B) for various web services]
Where to Start? Where to Go?
Gold?
Context Information in Text
Author
Time
Source
Author’s occupation
Language
Social Network
Check Lap Kok, HK
self designer, publisher, editor …
3:53 AM Jan 28th
From Ping.fm
Location
Sentiment
Rich Context in Text
6
102M blogs
~3M msgs /day
~150k bookmarks /day
~300M words/month
~2M users
5M users 500M URLs
8M contributors 100+ languages
750K posts/day
100M users > 1M groups
73 years, ~400k authors, ~4k sources
1B queries? Per hour? Per IP?
Text + Context = ?
7
Context = Guidance: "I Have A Guide!"
Query + User = Personalized Search
MSR
Modern System Research
Medical simulation
Montessori School of Raleigh
Mountain Safety Research
MSR Racing
Wikipedia definitions
Metropolis Street Racer
Molten salt reactor
Mars sample return
Magnetic Stripe Reader
How much can personalization help?
If you know me, you should give me Microsoft Research…
Customer Review + Brand = Comparative Product Summary. Can we compare products?

Common Themes   IBM Laptop Reviews    APPLE Laptop Reviews   DELL Laptop Reviews
Battery Life    Long, 4-3 hrs         Medium, 3-2 hrs        Short, 2-1 hrs
Hard disk       Large, 80-100 GB      Small, 5-10 GB         Medium, 20-50 GB
Speed           Slow, 100-200 MHz     Very Fast, 3-4 GHz     Moderate, 1-2 GHz
Hot Topics in SIGMOD
Literature + Time = Topic Trends
What’s hot in literature?
One Week Later
Blogs + Time & Location = Spatiotemporal Topic Diffusion
How does discussion spread?
Tom Hanks, who is my favorite movie star act the leading role.
protesting... will lose your faith by watching the movie.
a good book to past time.
... so sick of people making such a big deal about a fiction book
The Da Vinci Code
Blogs + Sentiment = Faceted Opinion Summary
What is good and what is bad?
Information retrieval
Machine learning
Data mining
Coauthor Network
Publications + Social Network =Topical Community
Who works together on what?
• Query log + User = Personalized Search
• Literature + Time = Topic Trends
• Review + Brand = Comparative Opinion
• Blog + Time & Location = Spatiotemporal Topic Diffusion
• Blog + Sentiment = Faceted Opinion Summary
• Publications + Social Network = Topical Community
Text + Context = Contextual Text Mining
…..
A General Solution for All
Contextual Text Mining
• Generative Model of Text
• Modeling Simple Context
• Modeling Implicit Context
• Modeling Complex Context
• Applications of Contextual Text Mining
Generative Model of Text
P(word | Model)

Observed text: "the.. movie.. harry.. potter is.. based.. on.. j..k..rowling"

Word distribution (Model):
the      0.1
is       0.07
harry    0.05
potter   0.04
movie    0.04
plot     0.02
time     0.01
rowling  0.01

Generation: sample words from the model (e.g., "the", "harry", "potter", "movie", "is", …)
Inference, Estimation: estimate the model from the observed text
Contextualized Models
P(word | Model, Context)

Example contexts: Year = 1998, Year = 2008, Location = US, Location = China, Source = official, Sentiment = +

Text such as "harry potter is … book …" yields different contextualized models, e.g.:

book     0.15        movie     0.18
harry    0.10        harry     0.09
potter   0.08        potter    0.08
rowling  0.05        director  0.04

Generation:
• How to select contexts?
• How to model the relations of contexts?
Inference:
• How to estimate contextual models?
• How to reveal contextual patterns?
Topics in Text
• Topic (Theme) = the subject of a discourse
• A topic covers multiple documents
• A document has multiple topics
• Topic = a soft cluster of documents
• Topic = a multinomial distribution over words

Many text mining tasks:
• Extracting topics from text
• Revealing contextual topic patterns

Example topic "Web Search":
search   0.2
engine   0.15
query    0.08
user     0.07
ranking  0.06
…
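A topic as a multinomial distribution over words can be sketched directly. This is a minimal illustration (not code from the talk), using the "Web Search" topic above with the remaining probability mass lumped into a placeholder token:

```python
import random

# "Web Search" topic from the slide; the "..." mass lumped into one token.
web_search = {
    "search": 0.2, "engine": 0.15, "query": 0.08,
    "user": 0.07, "ranking": 0.06, "<other>": 0.44,
}

def sample_word(topic, rng=random):
    """Draw one word from the topic's word distribution."""
    r, cum = rng.random(), 0.0
    for w, p in topic.items():
        cum += p
        if r < cum:
            return w
    return w  # guard against floating-point shortfall

print(sample_word(web_search))
```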
Probabilistic Topic Models
Topic 1 (Apple iPod):        Topic 2 (Harry Potter):
ipod      0.15               movie    0.10
nano      0.08               harry    0.09
music     0.05               potter   0.05
download  0.02               actress  0.04
apple     0.01               music    0.02

P(w) = Σ_{i=1..K} P(z = i) · P(w | Topic_i)

Document: "I downloaded the music of the movie harry potter to my ipod nano"
e.g., "ipod" is generated from Topic 1 (0.15) and "harry" from Topic 2 (0.09).
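The mixture formula can be sketched in a few lines, using the two topic snippets from this slide and assumed equal topic weights P(z) (the weights are not given on the slide):

```python
# P(w) = sum_{i=1..K} P(z = i) * P(w | Topic_i)
topic_ipod = {"ipod": 0.15, "nano": 0.08, "music": 0.05,
              "download": 0.02, "apple": 0.01}
topic_potter = {"movie": 0.10, "harry": 0.09, "potter": 0.05,
                "actress": 0.04, "music": 0.02}
p_z = [0.5, 0.5]  # assumed equal topic weights

def p_word(w):
    """Mixture probability of word w under the two topics."""
    return sum(pz * t.get(w, 0.0)
               for pz, t in zip(p_z, (topic_ipod, topic_potter)))

print(p_word("music"))  # 0.5*0.05 + 0.5*0.02 = 0.035
```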
Parameter Estimation
• Maximizing data likelihood:
• Parameter Estimation using EM algorithm
Model* = argmax_Model log P(Data | Model)

The topic word distributions are initially unknown, e.g.:
Topic 1: ipod 0.15, nano 0.08, music 0.05, download 0.02, apple 0.01
Topic 2: movie 0.10, harry 0.09, potter 0.05, actress 0.04, music 0.02

They are estimated from documents such as "I downloaded the music of the movie harry potter to my ipod nano". EM alternates two steps over the documents:
• Guess the affiliation: softly assign each word occurrence to a topic
• Estimate the params: re-estimate the distributions from the resulting pseudo-counts
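The guess/estimate loop can be sketched as a tiny EM iteration. This is a simplification (assumed, not from the talk): the topic word distributions are held fixed and only the mixing weights are re-estimated from the pseudo-counts:

```python
doc = ("i downloaded the music of the movie "
       "harry potter to my ipod nano").split()
topics = [
    {"ipod": 0.15, "nano": 0.08, "music": 0.05, "download": 0.02},
    {"movie": 0.10, "harry": 0.09, "potter": 0.05, "music": 0.02},
]
EPS = 1e-6       # floor for words unseen in a topic
pi = [0.5, 0.5]  # initial topic weights

for _ in range(50):
    # E-step: "guess the affiliation" -> posterior P(z | w) for each word
    counts = [0.0, 0.0]
    for w in doc:
        post = [pi[k] * topics[k].get(w, EPS) for k in range(2)]
        s = sum(post)
        for k in range(2):
            counts[k] += post[k] / s   # accumulate pseudo-counts
    # M-step: "estimate the params" from the pseudo-counts
    pi = [c / sum(counts) for c in counts]

print(pi)
```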
How Context Affects Topics
• Topics in science literature: 16th century vs. 21st century
• When do a computer scientist and a gardener use "tree, root, prune" in text?
• What does "tree" mean in "algorithm"?
• In Europe, "football" appears a lot in a soccer report. What about in the US?

Text is generated according to the context!
Simple Contextual Topic Model
P(w) = Σ_{j=1..C} P(c_j) · Σ_{i=1..K} P(z = i | Context_j) · P(w | Topic_i, Context_j)

Topics: Topic 1 (Apple iPod), Topic 2 (Harry Potter)
Contexts: Context 1 = 2004, Context 2 = 2007

Document: "I downloaded the music of the movie harry potter to my iphone"
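A sketch of this contextualized mixture with invented numbers (the slide gives the structure, not the values):

```python
# P(w) = sum_j P(c_j) * sum_i P(z=i | Context_j) * P(w | Topic_i, Context_j)
contexts = [
    # (P(c_j), P(z | c_j), contextualized topic word distributions)
    (0.5, [0.7, 0.3], [{"ipod": 0.20}, {"potter": 0.10}]),    # Context 1: 2004
    (0.5, [0.4, 0.6], [{"iphone": 0.20}, {"potter": 0.15}]),  # Context 2: 2007
]

def p_word(w):
    """Contextual mixture probability of word w."""
    return sum(pc * sum(pz * t.get(w, 0.0) for pz, t in zip(p_z_c, ts))
               for pc, p_z_c, ts in contexts)

print(p_word("potter"))  # 0.5*0.3*0.10 + 0.5*0.6*0.15 = 0.06
```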
Contextual Topic Patterns
• Compare contextualized versions of topics: contextual topic patterns
• Contextual topic patterns = conditional distributions (z: topic; c: context; w: word)
• P(z = i | c) (or P(c | z = j)): strength of topics in a context
• P(w | c, z = i): content variation of topics
Example: Topic Life Cycles (Mei and Zhai KDD’05)
[Figure: normalized strength of theme over time (1999-2004) for themes: Biology Data, Web Information, Time Series, Classification, Association Rule, Clustering, Business]

Context = time
Comparing P(c | z)
Example: Spatiotemporal Theme Pattern (Mei et al. WWW’06)
About Government Response in Hurricane Katrina:
• Week 1: The theme is the strongest along the Gulf of Mexico
• Week 2: The discussion moves towards the north and west
• Week 3: The theme distributes more uniformly over the states
• Week 4: The theme is again strong along the east coast and the Gulf of Mexico
• Week 5: The theme fades out in most states

Context = time & location
Comparing P(z | c)
Example: Evolutionary Topic Graph (Mei and Zhai KDD’05)
KDD, 1999-2004. Example topic word distributions along the timeline:

decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, …
SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, …
classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007, …
web 0.009, classification 0.007, features 0.006, topic 0.005, …
mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005, …
topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005, …
information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, …

Context = time
Comparing P(w | z, c)
Example: Event Impact Analysis (Mei and Zhai KDD'06)

Theme: retrieval models (SIGIR papers)
term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, model 0.0310, probabilistic 0.0188, document 0.0173, …

Before the events:
vector 0.0514, concept 0.0298, model 0.0291, space 0.0236, boolean 0.0151, function 0.0123, …

After 1992 (starting of the TREC conferences):
probabilist 0.0778, model 0.0432, logic 0.0404, boolean 0.0281, algebra 0.0200, estimate 0.0119, weight 0.0111, …
xml 0.0678, email 0.0197, model 0.0191, collect 0.0187, judgment 0.0102, rank 0.0097, …

After 1998 (publication of the paper "A language modeling approach to information retrieval"):
model 0.1687, language 0.0753, estimate 0.0520, parameter 0.0281, distribution 0.0268, smooth 0.0198, likelihood 0.0059, …

Context = event
Comparing P(w | z, c)
Implicit Context in Text
• Some contexts are hidden: sentiments; intents; impact; etc.
• Document contexts: we don't know them for sure; need to infer the affiliation from the data
• Train a model M for each implicit context
• Provide M to the topic model as guidance
Modeling Implicit Context
Topics: Topic 1 (Apple iPod), Topic 2 (Harry Potter)
Implicit contexts (sentiments):

Negative:            Positive:
hate     0.21        good     0.10
awful    0.03        like     0.05
disgust  0.01        perfect  0.02

Document: "I like the song of the movie on my ipod, perfect, but hate the accent"
Semi-supervised Topic Model(Mei et al. WWW’07)
Maximum Likelihood Estimation (MLE):
Λ* = argmax_Λ log P(D | Λ)

Add Dirichlet priors, i.e., guidance from the user (e.g., prior words r1 = "love, great" and r2 = "hate, awful" attached to topics θ1 … θk of documents d1 … dk) →

Maximum A Posteriori (MAP) Estimation:
Λ* = argmax_Λ log( P(D | Λ) P(Λ) )

Similar to adding pseudo-counts to the observation.
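The "pseudo-counts" remark can be made concrete: under a Dirichlet prior, the MAP estimate of a multinomial is the observed counts plus prior pseudo-counts, renormalized. The counts and prior below are illustrative:

```python
counts = {"love": 3, "great": 1, "hate": 2}   # observed word counts
prior = {"love": 4.0, "great": 1.0}           # user guidance as pseudo-counts

def mle(counts):
    """Maximum likelihood estimate: relative frequencies."""
    n = sum(counts.values())
    return {w: c / n for w, c in counts.items()}

def map_estimate(counts, prior):
    """MAP estimate under a Dirichlet prior = counts + pseudo-counts."""
    n = sum(counts.values()) + sum(prior.values())
    return {w: (c + prior.get(w, 0.0)) / n for w, c in counts.items()}

print(mle(counts)["love"], map_estimate(counts, prior)["love"])
```

The prior pulls the estimate of "love" from 3/6 = 0.5 up to (3+4)/11 ≈ 0.64, exactly as if four extra "love" occurrences had been observed.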
Example: Faceted Opinion Summarization (Mei et al. WWW’07)
Topic 1: Movie
• Neutral: "... Ron Howards selection of Tom Hanks to play Robert Langdon." / "Directed by: Ron Howard Writing credits: Akiva Goldsman ..." / "After watching the movie I went online and some research on ..."
• Positive: "Tom Hanks stars in the movie, who can be mad at that?" / "Tom Hanks, who is my favorite movie star act the leading role." / "Anybody is interested in it?"
• Negative: "But the movie might get delayed, and even killed off if he loses." / "protesting ... will lose your faith by ... watching the movie." / "... so sick of people making such a big deal about a FICTION book and movie."

Topic 2: Book
• "I remembered when i first read the book, I finished the book in two days."
• "Awesome book. ... so sick of people making such a big deal about a FICTION book and movie."
• "I'm reading 'Da Vinci Code' now."
• …
• "So still a good book to past time."
• "This controversy book cause lots conflict in west society."

Context = topic & sentiment
Results: Sentiment Dynamics
Facet: the book “ the da vinci code”. ( Bursts during the movie, Pos > Neg )
Facet: the impact on religious beliefs. ( Bursts during the movie, Neg > Pos )
Results: Topic with User’s Guidance
• Topics for iPod:
No Prior                               With Prior
Battery, nano  Marketing   Ads, spam   Nano    Battery
battery        apple       free        nano    battery
shuffle        microsoft   sign        color   shuffle
charge         market      offer       thin    charge
nano           zune        freepay     hold    usb
dock           device      complete    model   hour
itune          company     virus       4gb     mini
usb            consumer    freeipod    dock    life
hour           sale        trial       inch    rechargable

Guidance from the user: "I know two topics should look like this."
Complex Context in Text
• Complex context = structure of contexts
• Many contexts have latent structure: time; location; social network
• Why model context structure?
  – Reveal novel contextual patterns
  – Regularize contextual models
  – Alleviate data sparseness: smoothing
Modeling Complex Context
Topics: Topic 1, Topic 2; contexts A and B (within Context 1) are closely related.

Two intuitions:
• Regularization: Model(A) and Model(B) should be similar
• Smoothing: look at B if A doesn't have enough data

O(C) = Likelihood + Regularization
Applications of Contextual Text Mining
• Personalized Search: personalization with backoff
• Social Network Analysis (for schools): finding topical communities
• Information Retrieval (for industry labs): smoothing language models
Application I: Personalized Search
Personalization with Backoff (Mei and Church WSDM’08)
• Ambiguous query: MSG
  – Madison Square Garden
  – Monosodium Glutamate
• Disambiguate based on the user's prior clicks
• We don't have enough data for everyone!
  – Back off to classes of users
• Proof of concept:
  – Context = segments defined by IP addresses
  – Other market segmentation (demographics)
Apply Contextual Text Mining to Personalized Search
• The text data: query logs
• The generative model: P(Url | Query)
• The context: users (IP addresses)
• The contextual model: P(Url | Query, IP)
• The structure of context: hierarchical structure of IP addresses
Evaluation Metric: Entropy (H)
H(X) = −Σ_{x∈X} p(x) log p(x)

• Difficulty of encoding information (a distribution)
  – Size of search space; difficulty of a task
• H = 20 bits ⇔ 1 million items distributed uniformly
• Powerful tool for sizing challenges and opportunities
  – How hard is search?
  – How much does personalization help?
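The entropy formula above, sketched in a few lines; a uniform distribution over 2^20 (about a million) items has exactly 20 bits of entropy, matching the slide's rule of thumb:

```python
import math

def entropy_bits(probs):
    """H(X) = -sum_x p(x) * log2 p(x), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

n = 2 ** 20  # ~1 million items
print(entropy_bits([1.0 / n] * n))  # 20.0 bits for a uniform distribution
```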
How Hard Is Search?
• Traditional search: H(URL | Query) = 2.8 (= 23.9 − 21.1)
• Personalized search: H(URL | Query, IP) = 1.2 (= 27.2 − 26.0)

Entropy (H):
Query          21.1
URL            22.1
IP             22.1
All But IP     23.9
All But URL    26.0
All But Query  27.1
All Three      27.2

Personalization cuts H in half!
Context = First k bytes of IP
Back off along the IP hierarchy:

P(Url | Q, IP4)   156.111.188.243
P(Url | Q, IP3)   156.111.188.*
P(Url | Q, IP2)   156.111.*.*
P(Url | Q, IP1)   156.*.*.*
P(Url | Q, IP0)   *.*.*.*

Full personalization: every context has a different model: sparse data!
No personalization: all contexts share the same model.
Personalization with backoff: similar contexts have similar models.
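The backoff can be sketched as a linear interpolation over the prefix levels; the λ values and per-level probabilities below are illustrative, not the EM-estimated weights from the talk:

```python
# lambda_4 (full IP) ... lambda_0 (all users); the weights must sum to 1
lambdas = [0.05, 0.10, 0.20, 0.30, 0.35]
# P(Url | Q, IP_i) for one (url, query) pair, most specific prefix first
per_level = [0.9, 0.8, 0.5, 0.3, 0.1]

def backoff_prob(lambdas, per_level):
    """P(Url|Q, IP) = sum_i lambda_i * P(Url|Q, IP_i)."""
    assert abs(sum(lambdas) - 1.0) < 1e-9
    return sum(l * p for l, p in zip(lambdas, per_level))

print(backoff_prob(lambdas, per_level))  # 0.35
```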
Backing Off by IP
P(Url | Q, IP) = Σ_{i=0..4} λ_i · P(Url | Q, IP_i)

• λs estimated with EM
• A little bit of personalization is better than too much, or too little
  – λ4 (full IP): sparse data
  – λ0 (no personalization): missed opportunity

[Figure: estimated λ weights, roughly 0 to 0.3 each; λ4: weights for the first 4 bytes of IP, λ3: first 3 bytes, λ2: first 2 bytes, …]
Context = Market Segmentation

• Traditional goal of marketing:
  – Segment customers (e.g., business vs. consumer)
  – By need & value proposition
    • Need: segments ask different questions at different times
    • Value: different advertising opportunities
• Segmentation variables:
  – Queries, URL clicks, IP addresses
  – Geography & demographics (age, gender, income)
  – Time of day & day of week
Business Days v. Weekends:More Clicks and Easier Queries
[Figure: total clicks (3M-9M per day) and H(Url | IP, Q) (1.00-1.20) over Jan 2006 (the 1st is a Sunday): business days show more clicks and easier, lower-entropy queries]
Harder Queries at TV Time
Application II: Information Retrieval
Application: Text Retrieval
Document d: a text mining paper

Doc Language Model (LM) θ_d: p(w|d):
text        4/100 = 0.04
mining      3/100 = 0.03
clustering  1/100 = 0.01
…
data      = 0
computing = 0

Query q: "data mining"

Query Language Model θ_q: p(w|q):
data    ½ = 0.5
mining  ½ = 0.5

Smoothed query model p(w|q'):
data        0.4
mining      0.4
clustering  0.1
…

Smoothed Doc LM θ_d': p(w|d'):
text        0.039
mining      0.028
clustering  0.01
…
data      = 0.001
computing = 0.0005

Similarity function:
D(θ_q || θ_d') = Σ_{w∈V} p(w|θ_q) log [ p(w|θ_q) / p(w|θ_d') ]
Smoothing a Document Language Model
Retrieval performance ← accurate estimation of the LM ← smoothing of the LM

P_MLE(w|d):                Smoothed p(w|d):          More accurate p(w|d):
text 4/100 = 0.04          text = 0.039              text = 0.038
mining 3/100 = 0.03        mining = 0.028            mining = 0.026
Assoc. 1/100 = 0.01        Assoc. = 0.009            Assoc. = 0.008
clustering 1/100 = 0.01    clustering = 0.01         clustering = 0.01
data = 0                   data = 0.001              data = 0.002
computing = 0              computing = 0.0005        computing = 0.001

• Assign non-zero probability to unseen words
• Estimate a more accurate distribution from sparse data

P(w|d) = (1−λ) P_MLE(w|d) + λ P(w|collection)
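The interpolation above in a few lines; with λ = 0.1 (an assumed value) the unseen word "data" gets exactly the non-zero probability shown on the slide:

```python
doc_mle = {"text": 0.04, "mining": 0.03, "clustering": 0.01}
collection = {"text": 0.02, "mining": 0.01, "data": 0.01, "computing": 0.005}
LAM = 0.1  # interpolation weight, illustrative

def smoothed(w):
    """P(w|d) = (1 - lam) * P_MLE(w|d) + lam * P(w|collection)."""
    return (1 - LAM) * doc_mle.get(w, 0.0) + LAM * collection.get(w, 0.0)

print(smoothed("data"))  # 0.1 * 0.01 = 0.001, no longer zero
```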
Apply Contextual Text Mining to Smoothing Language Models
• The text data: collection of documents
• The generative model: P(word)
• The context: document
• The contextual model: P(w|d)
• The structure of context: graph structure of documents
• Goal: use the graph of documents to estimate a good P(w|d)
Traditional Document Smoothing in Information Retrieval
Estimate a reference language model θ_ref, then interpolate the MLE with it:

P(w|d) = λ P_MLE(w|d) + (1−λ) P(w|θ_ref)

θ_ref can be estimated from:
• the whole collection (corpus) [Ponte & Croft 98]
• clusters of documents [Liu & Croft 04]
• nearest neighbors of d [Kurland & Lee 04]
Graph-based Smoothing for Language Models in Retrieval (Mei et al. SIGIR 2008)

A novel and general view of smoothing:
• Collection = graph of documents (can also be a word graph)
• P(w|d) = a surface on top of the graph: each document is a vertex (its projection on a plane), with the values P(w|d1), P(w|d2), … above it
• Smoothed LM = smoothed surface!
The General Objective of Smoothing
O(C) = (1−λ) Σ_{u∈V} w(u) (f_u − f̃_u)² + λ Σ_{(u,v)∈E} w(u,v) (f_u − f_v)²

• Σ_{u∈V} w(u)(f_u − f̃_u)²: fidelity to the MLE f̃_u
• Σ_{(u,v)∈E} w(u,v)(f_u − f_v)²: smoothness of the surface
• w(u): importance of vertices
• w(u,v): weights of edges (1/dist.)
Smoothing Language Models using a Document Graph
Construct a kNN graph of documents; w(u) = Deg(u); w(u,v) = cosine similarity; f_u = p(w|d_u).

Document language model:
P(w|d_u) = (1−λ) P_MLE(w|d_u) + λ Σ_{v∈V} [ w(u,v) / Deg(u) ] P(w|d_v)

plus additional Dirichlet smoothing.
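One step of this graph-based smoothing on a toy three-document chain, using the neighbors' MLE models on the right-hand side (the talk's formula is a fixed point; a single iteration is shown here, and the graph, weights, and λ are illustrative):

```python
edges = {(0, 1): 1.0, (1, 2): 0.5}   # symmetric edge weights w(u,v)
p_mle = [0.04, 0.00, 0.02]           # P_MLE(w|d_u) for one word w
LAM = 0.5

def neighbors(u):
    """Yield (v, w(u,v)) for all neighbors of u."""
    for (a, b), w in edges.items():
        if a == u:
            yield b, w
        elif b == u:
            yield a, w

def deg(u):
    return sum(w for _, w in neighbors(u))

def smoothed(u):
    """(1-lam) P_MLE(w|d_u) + lam * sum_v [w(u,v)/Deg(u)] P_MLE(w|d_v)."""
    nb = sum(w * p_mle[v] for v, w in neighbors(u))
    return (1 - LAM) * p_mle[u] + LAM * nb / deg(u)

print([round(smoothed(u), 4) for u in range(3)])  # [0.02, 0.0167, 0.01]
```

Document 1, which never contains the word, now borrows probability mass from its two neighbors.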
Effectiveness of the Framework
Data Sets  Dirichlet  DMDG                DMWG †              DSDG                QMWG
AP88-90    0.217      0.254 *** (+17.1%)  0.252 *** (+16.1%)  0.239 *** (+10.1%)  0.239 (+10.1%)
LA         0.247      0.258 ** (+4.5%)    0.257 ** (+4.5%)    0.251 ** (+1.6%)    0.247
SJMN       0.204      0.231 *** (+13.2%)  0.229 *** (+12.3%)  0.225 *** (+10.3%)  0.219 (+7.4%)
TREC8      0.257      0.271 *** (+5.4%)   0.271 ** (+5.4%)    0.261 (+1.6%)       0.260 (+1.2%)

† DMWG: reranking the top 3000 results; this usually yields lower performance than ranking all documents.
Wilcoxon test: *, **, *** mean significance levels 0.1, 0.05, 0.01.

Graph-based smoothing >> baseline
Smoothing Doc LM >> relevance score >> Query LM
Intuitive Interpretation – Smoothing using Document Graph
Writing a word w in a document = a random walk on the document Markov chain with two absorbing states; write down w on reaching state "1".

From node u:
• absorb into "1" with probability (1−λ) P_ML(w|d_u)
• absorb into "0" with probability (1−λ)(1 − P_ML(w|d_u))
• move to neighbor v with probability λ · w(u,v) / Deg(u)

P(w|d_u) = absorption probability into the "1" state: act as neighbors do.
Application III: Social Network Analysis
Topical Community Analysis
Example communities: physicist, physics, scientist, theory, gravitation, … vs. writer, novel, best-sell, book, language, film, …

Computer Science Literature = Information Retrieval + Data Mining + Machine Learning + …
or = Domain Review + Algorithm + Evaluation + … ?

• Topic modeling to help community extraction
• Network analysis to help topic extraction
Apply Contextual Text Mining to Topical Community Analysis
• The text data: publications of researchers
• The generative model: topic model
• The context: author
• The contextual model: author-topic model
• The structure of context: social network (coauthor network of researchers)
Intuitions
• People working on the same topic belong to the same "topical community"
• Good community: coherent topic + well connected
• A topic is semantically coherent if people working on this topic also collaborate a lot
• Example: if my neighbors in the coauthor network work on IR, am I more likely to be an IR person or a compiler person?
• Intuition: my topics are similar to my neighbors'
Social Network Context for Topic Modeling
• Context = author
• Coauthors = similar contexts (e.g., in the coauthor network)
• Intuition: I work on topics similar to my neighbors'
• Smooth the topic distributions P(θ_j | author) over the network
Topic Modeling with Network Regularization (NetPLSA)
• Basic assumption (e.g., coauthor graph): related authors work on similar topics

O(C, G) = (1−λ) Σ_d Σ_w c(w,d) log Σ_{j=1..k} p(θ_j|d) p(w|θ_j)
          − λ · ½ Σ_{(u,v)∈E} w(u,v) Σ_{j=1..k} ( p(θ_j|u) − p(θ_j|v) )²

• First term: the PLSA log-likelihood; p(θ_j|d) is the topic distribution of a document; λ trades off topic fit against smoothness
• Second term: a graph harmonic regularizer, a generalization of [Zhu '03]; w(u,v) is the importance (weight) of an edge, and the squared difference measures how topic distributions differ on neighboring vertices
• The regularizer can be written as Σ_j f_jᵀ Δ f_j, where f_{j,u} = p(θ_j|u) and Δ is the graph Laplacian
Topics & Communities without Regularization
Topic 1           Topic 2          Topic 3         Topic 4
term 0.02         peer 0.02        visual 0.02     interface 0.02
question 0.02     patterns 0.01    analog 0.02     towards 0.02
protein 0.01      mining 0.01      neurons 0.02    browsing 0.02
training 0.01     clusters 0.01    vlsi 0.01       xml 0.01
weighting 0.01    stream 0.01      motion 0.01     generation 0.01
multiple 0.01     frequent 0.01    chip 0.01       design 0.01
recognition 0.01  e 0.01           natural 0.01    engine 0.01
relations 0.01    page 0.01        cortex 0.01     service 0.01
library 0.01      gene 0.01        spike 0.01      social 0.01

?                 ?                ?               ?
Noisy community assignment
Topics & Communities with Regularization
Topic 1            Topic 2            Topic 3            Topic 4
retrieval 0.13     mining 0.11        neural 0.06        web 0.05
information 0.05   data 0.06          learning 0.02      services 0.03
document 0.03      discovery 0.03     networks 0.02      semantic 0.03
query 0.03         databases 0.02     recognition 0.02   services 0.03
text 0.03          rules 0.02         analog 0.01        peer 0.02
search 0.03        association 0.02   vlsi 0.01          ontologies 0.02
evaluation 0.02    patterns 0.02      neurons 0.01       rdf 0.02
user 0.02          frequent 0.01      gaussian 0.01      management 0.01
relevance 0.02     streams 0.01       network 0.01       ontology 0.01

Information        Data mining        Machine            Web
Retrieval                             learning

Coherent community assignment
Topic Modeling and SNA Improve Each Other
Methods   Cut Edge Weights  Ratio Cut / Norm. Cut  Community Sizes (1-4)
PLSA      4831              2.14 / 1.25            2280, 2178, 2326, 2257
NetPLSA   662               0.29 / 0.13            2636, 1989, 3069, 1347
NCut      855               0.23 / 0.12            2699, 6323, 8, 11

(For cut edge weights and cut values, the smaller the better.)
NCut: spectral clustering with normalized cut (J. Shi et al. 2000), pure network-based community finding.

• Network regularization helps extract coherent communities (the network assures the focus of topics)
• Topic modeling helps balance communities (text implicitly bridges authors)
Smoothed Topic Map
Map a topic on the network (e.g., using p(θ|a)): PLSA (topic: "information retrieval") vs. NetPLSA. The smoothed NetPLSA map separates core contributors, intermediate authors, and irrelevant authors.
Summary of My Talk
• Text + Context = Contextual Text Mining: a new paradigm of text mining
• A novel framework for contextual text mining: probabilistic topic models, contextualized by simple context, implicit context, and complex context
• Applications of contextual text mining
A Roadmap of My Work
Information Retrieval& Web Search
Contextual Text Mining
KDD 05
KDD 06b
WWW 06
WWW 07
WWW 08
Contextual TopicModels
KDD 06a
SIGIR 07
SIGIR 08
KDD 07
WSDM 08
CIKM 08
ACL 08
Research Discipline
Text InformationManagement
Text Mining InformationRetrieval
Data Mining
Natural LanguageProcessing
Database
Bioinformatics
Machine Learning
Applied Statistics
Social Networks
Information Science
End Note