2009 © Qiaozhu Mei University of Illinois at Urbana-Champaign
Contextual Text Mining
Qiaozhu Mei ([email protected])
University of Illinois at Urbana-Champaign
Knowledge Discovery from Text
Text Mining System
Trend of Text Content
Content Type               Amount / day
Published Content          3-4G
Professional web content   ~2G
User generated content     8-10G
Private text content       ~3T
- Ramakrishnan and Tomkins 2007
Text on the Web (Unconfirmed)
[Figure: per-site posting volumes (~750k/day, ~3M/day, ~150k/day) and total item counts (1M, 6M, 10B, ~100B) for various web services]
Where to Start? Where to Go?
Gold?
Context Information in Text
Author
Time
Source
Author’s occupation
Language
Social Network
Check Lap Kok, HK
self designer, publisher, editor …
3:53 AM Jan 28th
From Ping.fm
Location
Sentiment
Rich Context in Text
6
102M blogs
~3M msgs /day
~150k bookmarks /day
~300M words/month
~2M users
5M users 500M URLs
8M contributors 100+ languages
750K posts/day
100M users > 1M groups
73 years, ~400k authors, ~4k sources
1B queries? Per hour? Per IP?
Text + Context = ?
7
Context = Guidance: "I Have A Guide!"
Query + User = Personalized Search
MSR
Modern System Research
Medical simulation
Montessori School of Raleigh
Mountain Safety Research
MSR Racing
Wikipedia definitions
Metropolis Street Racer
Molten salt reactor
Mars sample return
Magnetic Stripe Reader
How much can personalization help?
If you know me, you should give me Microsoft Research…
Customer Review + Brand = Comparative Product Summary. Can we compare products?

Common Themes   IBM Laptop Reviews    APPLE Laptop Reviews   DELL Laptop Reviews
Battery Life    Long, 4-3 hrs         Medium, 3-2 hrs        Short, 2-1 hrs
Hard disk       Large, 80-100 GB      Small, 5-10 GB         Medium, 20-50 GB
Speed           Slow, 100-200 MHz     Very Fast, 3-4 GHz     Moderate, 1-2 GHz
Hot Topics in SIGMOD
Literature + Time = Topic Trends
What’s hot in literature?
One Week Later
Blogs + Time & Location = Spatiotemporal Topic Diffusion
How does discussion spread?
Tom Hanks, who is my favorite movie star act the leading role.
protesting... will lose your faith by watching the movie.
a good book to past time.
... so sick of people making such a big deal about a fiction book
The Da Vinci Code
Blogs + Sentiment = Faceted Opinion Summary
What is good and what is bad?
Information retrieval
Machine learning
Data mining
Coauthor Network
Publications + Social Network =Topical Community
Who works together on what?
• Query log + User = Personalized Search
• Literature + Time = Topic Trends
• Review + Brand = Comparative Opinion
• Blog + Time & Location = Spatiotemporal Topic Diffusion
• Blog + Sentiment = Faceted Opinion Summary
• Publications + Social Network = Topical Community
Text + Context = Contextual Text Mining
…..
A General Solution for All
Contextual Text Mining
• Generative Model of Text
• Modeling Simple Context
• Modeling Implicit Context
• Modeling Complex Context
• Applications of Contextual Text Mining
Generative Model of Text
P(word | Model)

Observed text: "the.. movie.. harry.. potter is.. based.. on.. j..k..rowling"

Word distribution (Model):
the      0.1
is       0.07
harry    0.05
potter   0.04
movie    0.04
plot     0.02
time     0.01
rowling  0.01

Generation: sample words from the model (e.g., "the", "harry", "potter", "movie", "is", …)
Inference, Estimation: estimate the model from the observed text
Contextualized Models
P(word | Model, Context)

Example contexts: Year = 1998, Year = 2008, Location = US, Location = China, Source = official, Sentiment = +

Text such as "harry potter is … book …" yields different contextualized models, e.g.:

book     0.15        movie     0.18
harry    0.10        harry     0.09
potter   0.08        potter    0.08
rowling  0.05        director  0.04

Generation:
• How to select contexts?
• How to model the relations of contexts?
Inference:
• How to estimate contextual models?
• How to reveal contextual patterns?
Topics in Text
• Topic (Theme) = the subject of a discourse
• A topic covers multiple documents
• A document has multiple topics
• Topic = a soft cluster of documents
• Topic = a multinomial distribution over words

Many text mining tasks:
• Extracting topics from text
• Revealing contextual topic patterns

Example topic "Web Search":
search   0.2
engine   0.15
query    0.08
user     0.07
ranking  0.06
…
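A topic as a multinomial distribution over words can be sketched directly. This is a minimal illustration (not code from the talk), using the "Web Search" topic above with the remaining probability mass lumped into a placeholder token:

```python
import random

# "Web Search" topic from the slide; the "..." mass lumped into one token.
web_search = {
    "search": 0.2, "engine": 0.15, "query": 0.08,
    "user": 0.07, "ranking": 0.06, "<other>": 0.44,
}

def sample_word(topic, rng=random):
    """Draw one word from the topic's word distribution."""
    r, cum = rng.random(), 0.0
    for w, p in topic.items():
        cum += p
        if r < cum:
            return w
    return w  # guard against floating-point shortfall

print(sample_word(web_search))
```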
Probabilistic Topic Models
Topic 1 (Apple iPod):        Topic 2 (Harry Potter):
ipod      0.15               movie    0.10
nano      0.08               harry    0.09
music     0.05               potter   0.05
download  0.02               actress  0.04
apple     0.01               music    0.02

P(w) = Σ_{i=1..K} P(z = i) · P(w | Topic_i)

Document: "I downloaded the music of the movie harry potter to my ipod nano"
e.g., "ipod" is generated from Topic 1 (0.15) and "harry" from Topic 2 (0.09).
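The mixture formula can be sketched in a few lines, using the two topic snippets from this slide and assumed equal topic weights P(z) (the weights are not given on the slide):

```python
# P(w) = sum_{i=1..K} P(z = i) * P(w | Topic_i)
topic_ipod = {"ipod": 0.15, "nano": 0.08, "music": 0.05,
              "download": 0.02, "apple": 0.01}
topic_potter = {"movie": 0.10, "harry": 0.09, "potter": 0.05,
                "actress": 0.04, "music": 0.02}
p_z = [0.5, 0.5]  # assumed equal topic weights

def p_word(w):
    """Mixture probability of word w under the two topics."""
    return sum(pz * t.get(w, 0.0)
               for pz, t in zip(p_z, (topic_ipod, topic_potter)))

print(p_word("music"))  # 0.5*0.05 + 0.5*0.02 = 0.035
```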
Parameter Estimation
• Maximizing data likelihood:
• Parameter Estimation using EM algorithm
Model* = argmax_Model log P(Data | Model)

The topic word distributions are initially unknown, e.g.:
Topic 1: ipod 0.15, nano 0.08, music 0.05, download 0.02, apple 0.01
Topic 2: movie 0.10, harry 0.09, potter 0.05, actress 0.04, music 0.02

They are estimated from documents such as "I downloaded the music of the movie harry potter to my ipod nano". EM alternates two steps over the documents:
• Guess the affiliation: softly assign each word occurrence to a topic
• Estimate the params: re-estimate the distributions from the resulting pseudo-counts
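The guess/estimate loop can be sketched as a tiny EM iteration. This is a simplification (assumed, not from the talk): the topic word distributions are held fixed and only the mixing weights are re-estimated from the pseudo-counts:

```python
doc = ("i downloaded the music of the movie "
       "harry potter to my ipod nano").split()
topics = [
    {"ipod": 0.15, "nano": 0.08, "music": 0.05, "download": 0.02},
    {"movie": 0.10, "harry": 0.09, "potter": 0.05, "music": 0.02},
]
EPS = 1e-6       # floor for words unseen in a topic
pi = [0.5, 0.5]  # initial topic weights

for _ in range(50):
    # E-step: "guess the affiliation" -> posterior P(z | w) for each word
    counts = [0.0, 0.0]
    for w in doc:
        post = [pi[k] * topics[k].get(w, EPS) for k in range(2)]
        s = sum(post)
        for k in range(2):
            counts[k] += post[k] / s   # accumulate pseudo-counts
    # M-step: "estimate the params" from the pseudo-counts
    pi = [c / sum(counts) for c in counts]

print(pi)
```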
How Context Affects Topics
• Topics in science literature: 16th century vs. 21st century
• When do a computer scientist and a gardener use "tree, root, prune" in text?
• What does "tree" mean in "algorithm"?
• In Europe, "football" appears a lot in a soccer report. What about in the US?

Text is generated according to the context!
Simple Contextual Topic Model
P(w) = Σ_{j=1..C} P(c_j) · Σ_{i=1..K} P(z = i | Context_j) · P(w | Topic_i, Context_j)

Topics: Topic 1 (Apple iPod), Topic 2 (Harry Potter)
Contexts: Context 1 = 2004, Context 2 = 2007

Document: "I downloaded the music of the movie harry potter to my iphone"
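A sketch of this contextualized mixture with invented numbers (the slide gives the structure, not the values):

```python
# P(w) = sum_j P(c_j) * sum_i P(z=i | Context_j) * P(w | Topic_i, Context_j)
contexts = [
    # (P(c_j), P(z | c_j), contextualized topic word distributions)
    (0.5, [0.7, 0.3], [{"ipod": 0.20}, {"potter": 0.10}]),    # Context 1: 2004
    (0.5, [0.4, 0.6], [{"iphone": 0.20}, {"potter": 0.15}]),  # Context 2: 2007
]

def p_word(w):
    """Contextual mixture probability of word w."""
    return sum(pc * sum(pz * t.get(w, 0.0) for pz, t in zip(p_z_c, ts))
               for pc, p_z_c, ts in contexts)

print(p_word("potter"))  # 0.5*0.3*0.10 + 0.5*0.6*0.15 = 0.06
```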
Contextual Topic Patterns
• Compare contextualized versions of topics: contextual topic patterns
• Contextual topic patterns = conditional distributions (z: topic; c: context; w: word)
• P(z = i | c) (or P(c | z = j)): strength of topics in a context
• P(w | c, z = i): content variation of topics
Example: Topic Life Cycles (Mei and Zhai KDD’05)
[Figure: normalized strength of theme over time (1999-2004) for themes: Biology Data, Web Information, Time Series, Classification, Association Rule, Clustering, Business]

Context = time
Comparing P(c | z)
Example: Spatiotemporal Theme Pattern (Mei et al. WWW’06)
About Government Response in Hurricane Katrina:
• Week 1: The theme is the strongest along the Gulf of Mexico
• Week 2: The discussion moves towards the north and west
• Week 3: The theme distributes more uniformly over the states
• Week 4: The theme is again strong along the east coast and the Gulf of Mexico
• Week 5: The theme fades out in most states

Context = time & location
Comparing P(z | c)
Example: Evolutionary Topic Graph (Mei and Zhai KDD’05)
KDD, 1999-2004. Example topic word distributions along the timeline:

decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, …
SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, …
classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007, …
web 0.009, classification 0.007, features 0.006, topic 0.005, …
mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005, …
topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005, …
information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, …

Context = time
Comparing P(w | z, c)
Example: Event Impact Analysis (Mei and Zhai KDD'06)

Theme: retrieval models (SIGIR papers)
term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, model 0.0310, probabilistic 0.0188, document 0.0173, …

Before the events:
vector 0.0514, concept 0.0298, model 0.0291, space 0.0236, boolean 0.0151, function 0.0123, …

After 1992 (starting of the TREC conferences):
probabilist 0.0778, model 0.0432, logic 0.0404, boolean 0.0281, algebra 0.0200, estimate 0.0119, weight 0.0111, …
xml 0.0678, email 0.0197, model 0.0191, collect 0.0187, judgment 0.0102, rank 0.0097, …

After 1998 (publication of the paper "A language modeling approach to information retrieval"):
model 0.1687, language 0.0753, estimate 0.0520, parameter 0.0281, distribution 0.0268, smooth 0.0198, likelihood 0.0059, …

Context = event
Comparing P(w | z, c)
Implicit Context in Text
• Some contexts are hidden: sentiments; intents; impact; etc.
• Document contexts: we don't know them for sure; need to infer the affiliation from the data
• Train a model M for each implicit context
• Provide M to the topic model as guidance
Modeling Implicit Context
Topics: Topic 1 (Apple iPod), Topic 2 (Harry Potter)
Implicit contexts (sentiments):

Negative:            Positive:
hate     0.21        good     0.10
awful    0.03        like     0.05
disgust  0.01        perfect  0.02

Document: "I like the song of the movie on my ipod, perfect, but hate the accent"
Semi-supervised Topic Model(Mei et al. WWW’07)
Maximum Likelihood Estimation (MLE):
Λ* = argmax_Λ log P(D | Λ)

Add Dirichlet priors, i.e., guidance from the user (e.g., prior words r1 = "love, great" and r2 = "hate, awful" attached to topics θ1 … θk of documents d1 … dk) →

Maximum A Posteriori (MAP) Estimation:
Λ* = argmax_Λ log( P(D | Λ) P(Λ) )

Similar to adding pseudo-counts to the observation.
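The "pseudo-counts" remark can be made concrete: under a Dirichlet prior, the MAP estimate of a multinomial is the observed counts plus prior pseudo-counts, renormalized. The counts and prior below are illustrative:

```python
counts = {"love": 3, "great": 1, "hate": 2}   # observed word counts
prior = {"love": 4.0, "great": 1.0}           # user guidance as pseudo-counts

def mle(counts):
    """Maximum likelihood estimate: relative frequencies."""
    n = sum(counts.values())
    return {w: c / n for w, c in counts.items()}

def map_estimate(counts, prior):
    """MAP estimate under a Dirichlet prior = counts + pseudo-counts."""
    n = sum(counts.values()) + sum(prior.values())
    return {w: (c + prior.get(w, 0.0)) / n for w, c in counts.items()}

print(mle(counts)["love"], map_estimate(counts, prior)["love"])
```

The prior pulls the estimate of "love" from 3/6 = 0.5 up to (3+4)/11 ≈ 0.64, exactly as if four extra "love" occurrences had been observed.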
Example: Faceted Opinion Summarization (Mei et al. WWW’07)
Topic 1: Movie
• Neutral: "... Ron Howards selection of Tom Hanks to play Robert Langdon." / "Directed by: Ron Howard Writing credits: Akiva Goldsman ..." / "After watching the movie I went online and some research on ..."
• Positive: "Tom Hanks stars in the movie, who can be mad at that?" / "Tom Hanks, who is my favorite movie star act the leading role." / "Anybody is interested in it?"
• Negative: "But the movie might get delayed, and even killed off if he loses." / "protesting ... will lose your faith by ... watching the movie." / "... so sick of people making such a big deal about a FICTION book and movie."

Topic 2: Book
• "I remembered when i first read the book, I finished the book in two days."
• "Awesome book. ... so sick of people making such a big deal about a FICTION book and movie."
• "I'm reading 'Da Vinci Code' now."
• …
• "So still a good book to past time."
• "This controversy book cause lots conflict in west society."

Context = topic & sentiment
Results: Sentiment Dynamics
Facet: the book “ the da vinci code”. ( Bursts during the movie, Pos > Neg )
Facet: the impact on religious beliefs. ( Bursts during the movie, Neg > Pos )
Results: Topic with User’s Guidance
• Topics for iPod:
No Prior                               With Prior
Battery, nano  Marketing   Ads, spam   Nano    Battery
battery        apple       free        nano    battery
shuffle        microsoft   sign        color   shuffle
charge         market      offer       thin    charge
nano           zune        freepay     hold    usb
dock           device      complete    model   hour
itune          company     virus       4gb     mini
usb            consumer    freeipod    dock    life
hour           sale        trial       inch    rechargable

Guidance from the user: "I know two topics should look like this."
Complex Context in Text
• Complex context = structure of contexts
• Many contexts have latent structure: time; location; social network
• Why model context structure?
  – Reveal novel contextual patterns
  – Regularize contextual models
  – Alleviate data sparseness: smoothing
Modeling Complex Context
Topics: Topic 1, Topic 2; contexts A and B (within Context 1) are closely related.

Two intuitions:
• Regularization: Model(A) and Model(B) should be similar
• Smoothing: look at B if A doesn't have enough data

O(C) = Likelihood + Regularization
Applications of Contextual Text Mining
• Personalized Search: personalization with backoff
• Social Network Analysis (for schools): finding topical communities
• Information Retrieval (for industry labs): smoothing language models
Application I: Personalized Search
Personalization with Backoff (Mei and Church WSDM’08)
• Ambiguous query: MSG
  – Madison Square Garden
  – Monosodium Glutamate
• Disambiguate based on the user's prior clicks
• We don't have enough data for everyone!
  – Back off to classes of users
• Proof of concept:
  – Context = segments defined by IP addresses
  – Other market segmentation (demographics)
Apply Contextual Text Mining to Personalized Search
• The text data: query logs
• The generative model: P(Url | Query)
• The context: users (IP addresses)
• The contextual model: P(Url | Query, IP)
• The structure of context: hierarchical structure of IP addresses
Evaluation Metric: Entropy (H)
H(X) = −Σ_{x∈X} p(x) log p(x)

• Difficulty of encoding information (a distribution)
  – Size of search space; difficulty of a task
• H = 20 bits ⇔ 1 million items distributed uniformly
• Powerful tool for sizing challenges and opportunities
  – How hard is search?
  – How much does personalization help?
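The entropy formula above, sketched in a few lines; a uniform distribution over 2^20 (about a million) items has exactly 20 bits of entropy, matching the slide's rule of thumb:

```python
import math

def entropy_bits(probs):
    """H(X) = -sum_x p(x) * log2 p(x), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

n = 2 ** 20  # ~1 million items
print(entropy_bits([1.0 / n] * n))  # 20.0 bits for a uniform distribution
```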
How Hard Is Search?
• Traditional search: H(URL | Query) = 2.8 (= 23.9 − 21.1)
• Personalized search: H(URL | Query, IP) = 1.2 (= 27.2 − 26.0)

Entropy (H):
Query          21.1
URL            22.1
IP             22.1
All But IP     23.9
All But URL    26.0
All But Query  27.1
All Three      27.2

Personalization cuts H in half!
Context = First k bytes of IP
Back off along the IP hierarchy:

P(Url | Q, IP4)   156.111.188.243
P(Url | Q, IP3)   156.111.188.*
P(Url | Q, IP2)   156.111.*.*
P(Url | Q, IP1)   156.*.*.*
P(Url | Q, IP0)   *.*.*.*

Full personalization: every context has a different model: sparse data!
No personalization: all contexts share the same model.
Personalization with backoff: similar contexts have similar models.
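The backoff can be sketched as a linear interpolation over the prefix levels; the λ values and per-level probabilities below are illustrative, not the EM-estimated weights from the talk:

```python
# lambda_4 (full IP) ... lambda_0 (all users); the weights must sum to 1
lambdas = [0.05, 0.10, 0.20, 0.30, 0.35]
# P(Url | Q, IP_i) for one (url, query) pair, most specific prefix first
per_level = [0.9, 0.8, 0.5, 0.3, 0.1]

def backoff_prob(lambdas, per_level):
    """P(Url|Q, IP) = sum_i lambda_i * P(Url|Q, IP_i)."""
    assert abs(sum(lambdas) - 1.0) < 1e-9
    return sum(l * p for l, p in zip(lambdas, per_level))

print(backoff_prob(lambdas, per_level))  # 0.35
```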
Backing Off by IP
P(Url | Q, IP) = Σ_{i=0..4} λ_i · P(Url | Q, IP_i)

• λs estimated with EM
• A little bit of personalization is better than too much, or too little
  – λ4 (full IP): sparse data
  – λ0 (no personalization): missed opportunity

[Figure: estimated λ weights, roughly 0 to 0.3 each; λ4: weights for the first 4 bytes of IP, λ3: first 3 bytes, λ2: first 2 bytes, …]
Context = Market Segmentation

• Traditional goal of marketing:
  – Segment customers (e.g., business vs. consumer)
  – By need & value proposition
    • Need: segments ask different questions at different times
    • Value: different advertising opportunities
• Segmentation variables:
  – Queries, URL clicks, IP addresses
  – Geography & demographics (age, gender, income)
  – Time of day & day of week
Business Days v. Weekends:More Clicks and Easier Queries
[Figure: total clicks (3M-9M per day) and H(Url | IP, Q) (1.00-1.20) over Jan 2006 (the 1st is a Sunday): business days show more clicks and easier, lower-entropy queries]
Harder Queries at TV Time
Application II: Information Retrieval
Application: Text Retrieval
Document d: a text mining paper

Doc Language Model (LM) θ_d: p(w|d):
text        4/100 = 0.04
mining      3/100 = 0.03
clustering  1/100 = 0.01
…
data      = 0
computing = 0

Query q: "data mining"

Query Language Model θ_q: p(w|q):
data    ½ = 0.5
mining  ½ = 0.5

Smoothed query model p(w|q'):
data        0.4
mining      0.4
clustering  0.1
…

Smoothed Doc LM θ_d': p(w|d'):
text        0.039
mining      0.028
clustering  0.01
…
data      = 0.001
computing = 0.0005

Similarity function:
D(θ_q || θ_d') = Σ_{w∈V} p(w|θ_q) log [ p(w|θ_q) / p(w|θ_d') ]
Smoothing a Document Language Model
Retrieval performance ← accurate estimation of the LM ← smoothing of the LM

P_MLE(w|d):                Smoothed p(w|d):          More accurate p(w|d):
text 4/100 = 0.04          text = 0.039              text = 0.038
mining 3/100 = 0.03        mining = 0.028            mining = 0.026
Assoc. 1/100 = 0.01        Assoc. = 0.009            Assoc. = 0.008
clustering 1/100 = 0.01    clustering = 0.01         clustering = 0.01
data = 0                   data = 0.001              data = 0.002
computing = 0              computing = 0.0005        computing = 0.001

• Assign non-zero probability to unseen words
• Estimate a more accurate distribution from sparse data

P(w|d) = (1−λ) P_MLE(w|d) + λ P(w|collection)
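The interpolation above in a few lines; with λ = 0.1 (an assumed value) the unseen word "data" gets exactly the non-zero probability shown on the slide:

```python
doc_mle = {"text": 0.04, "mining": 0.03, "clustering": 0.01}
collection = {"text": 0.02, "mining": 0.01, "data": 0.01, "computing": 0.005}
LAM = 0.1  # interpolation weight, illustrative

def smoothed(w):
    """P(w|d) = (1 - lam) * P_MLE(w|d) + lam * P(w|collection)."""
    return (1 - LAM) * doc_mle.get(w, 0.0) + LAM * collection.get(w, 0.0)

print(smoothed("data"))  # 0.1 * 0.01 = 0.001, no longer zero
```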
Apply Contextual Text Mining to Smoothing Language Models
• The text data: collection of documents
• The generative model: P(word)
• The context: document
• The contextual model: P(w|d)
• The structure of context: graph structure of documents
• Goal: use the graph of documents to estimate a good P(w|d)
Traditional Document Smoothing in Information Retrieval
Estimate a reference language model θ_ref, then interpolate the MLE with it:

P(w|d) = λ P_MLE(w|d) + (1−λ) P(w|θ_ref)

θ_ref can be estimated from:
• the whole collection (corpus) [Ponte & Croft 98]
• clusters of documents [Liu & Croft 04]
• nearest neighbors of d [Kurland & Lee 04]
Graph-based Smoothing for Language Models in Retrieval (Mei et al. SIGIR 2008)

A novel and general view of smoothing:
• Collection = graph of documents (can also be a word graph)
• P(w|d) = a surface on top of the graph: each document is a vertex (its projection on a plane), with the values P(w|d1), P(w|d2), … above it
• Smoothed LM = smoothed surface!
The General Objective of Smoothing
O(C) = (1−λ) Σ_{u∈V} w(u) (f_u − f̃_u)² + λ Σ_{(u,v)∈E} w(u,v) (f_u − f_v)²

• Σ_{u∈V} w(u)(f_u − f̃_u)²: fidelity to the MLE f̃_u
• Σ_{(u,v)∈E} w(u,v)(f_u − f_v)²: smoothness of the surface
• w(u): importance of vertices
• w(u,v): weights of edges (1/dist.)
Smoothing Language Models using a Document Graph
Construct a kNN graph of documents; w(u) = Deg(u); w(u,v) = cosine similarity; f_u = p(w|d_u).

Document language model:
P(w|d_u) = (1−λ) P_MLE(w|d_u) + λ Σ_{v∈V} [ w(u,v) / Deg(u) ] P(w|d_v)

plus additional Dirichlet smoothing.
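One step of this graph-based smoothing on a toy three-document chain, using the neighbors' MLE models on the right-hand side (the talk's formula is a fixed point; a single iteration is shown here, and the graph, weights, and λ are illustrative):

```python
edges = {(0, 1): 1.0, (1, 2): 0.5}   # symmetric edge weights w(u,v)
p_mle = [0.04, 0.00, 0.02]           # P_MLE(w|d_u) for one word w
LAM = 0.5

def neighbors(u):
    """Yield (v, w(u,v)) for all neighbors of u."""
    for (a, b), w in edges.items():
        if a == u:
            yield b, w
        elif b == u:
            yield a, w

def deg(u):
    return sum(w for _, w in neighbors(u))

def smoothed(u):
    """(1-lam) P_MLE(w|d_u) + lam * sum_v [w(u,v)/Deg(u)] P_MLE(w|d_v)."""
    nb = sum(w * p_mle[v] for v, w in neighbors(u))
    return (1 - LAM) * p_mle[u] + LAM * nb / deg(u)

print([round(smoothed(u), 4) for u in range(3)])  # [0.02, 0.0167, 0.01]
```

Document 1, which never contains the word, now borrows probability mass from its two neighbors.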
Effectiveness of the Framework
Data Sets  Dirichlet  DMDG                DMWG †              DSDG                QMWG
AP88-90    0.217      0.254 *** (+17.1%)  0.252 *** (+16.1%)  0.239 *** (+10.1%)  0.239 (+10.1%)
LA         0.247      0.258 ** (+4.5%)    0.257 ** (+4.5%)    0.251 ** (+1.6%)    0.247
SJMN       0.204      0.231 *** (+13.2%)  0.229 *** (+12.3%)  0.225 *** (+10.3%)  0.219 (+7.4%)
TREC8      0.257      0.271 *** (+5.4%)   0.271 ** (+5.4%)    0.261 (+1.6%)       0.260 (+1.2%)

† DMWG: reranking the top 3000 results; this usually yields lower performance than ranking all documents.
Wilcoxon test: *, **, *** mean significance levels 0.1, 0.05, 0.01.

Graph-based smoothing >> baseline
Smoothing Doc LM >> relevance score >> Query LM
Intuitive Interpretation – Smoothing using Document Graph
Writing a word w in a document = a random walk on the document Markov chain with two absorbing states; write down w on reaching state "1".

From node u:
• absorb into "1" with probability (1−λ) P_ML(w|d_u)
• absorb into "0" with probability (1−λ)(1 − P_ML(w|d_u))
• move to neighbor v with probability λ · w(u,v) / Deg(u)

P(w|d_u) = absorption probability into the "1" state: act as neighbors do.
Application III: Social Network Analysis
Topical Community Analysis
Example communities: physicist, physics, scientist, theory, gravitation, … vs. writer, novel, best-sell, book, language, film, …

Computer Science Literature = Information Retrieval + Data Mining + Machine Learning + …
or = Domain Review + Algorithm + Evaluation + … ?

• Topic modeling to help community extraction
• Network analysis to help topic extraction
Apply Contextual Text Mining to Topical Community Analysis
• The text data: publications of researchers
• The generative model: topic model
• The context: author
• The contextual model: author-topic model
• The structure of context: social network (coauthor network of researchers)
Intuitions
• People working on the same topic belong to the same "topical community"
• Good community: coherent topic + well connected
• A topic is semantically coherent if people working on this topic also collaborate a lot
• Example: if my neighbors in the coauthor network work on IR, am I more likely to be an IR person or a compiler person?
• Intuition: my topics are similar to my neighbors'
Social Network Context for Topic Modeling
• Context = author
• Coauthors = similar contexts (e.g., in the coauthor network)
• Intuition: I work on topics similar to my neighbors'
• Smooth the topic distributions P(θ_j | author) over the network
Topic Modeling with Network Regularization (NetPLSA)
• Basic assumption (e.g., coauthor graph): related authors work on similar topics

O(C, G) = (1−λ) Σ_d Σ_w c(w,d) log Σ_{j=1..k} p(θ_j|d) p(w|θ_j)
          − λ · ½ Σ_{(u,v)∈E} w(u,v) Σ_{j=1..k} ( p(θ_j|u) − p(θ_j|v) )²

• First term: the PLSA log-likelihood; p(θ_j|d) is the topic distribution of a document; λ trades off topic fit against smoothness
• Second term: a graph harmonic regularizer, a generalization of [Zhu '03]; w(u,v) is the importance (weight) of an edge, and the squared difference measures how topic distributions differ on neighboring vertices
• The regularizer can be written as Σ_j f_jᵀ Δ f_j, where f_{j,u} = p(θ_j|u) and Δ is the graph Laplacian
Topics & Communities without Regularization
Topic 1           Topic 2          Topic 3         Topic 4
term 0.02         peer 0.02        visual 0.02     interface 0.02
question 0.02     patterns 0.01    analog 0.02     towards 0.02
protein 0.01      mining 0.01      neurons 0.02    browsing 0.02
training 0.01     clusters 0.01    vlsi 0.01       xml 0.01
weighting 0.01    stream 0.01      motion 0.01     generation 0.01
multiple 0.01     frequent 0.01    chip 0.01       design 0.01
recognition 0.01  e 0.01           natural 0.01    engine 0.01
relations 0.01    page 0.01        cortex 0.01     service 0.01
library 0.01      gene 0.01        spike 0.01      social 0.01

?                 ?                ?               ?
Noisy community assignment
Topics & Communities with Regularization
Topic 1            Topic 2            Topic 3            Topic 4
retrieval 0.13     mining 0.11        neural 0.06        web 0.05
information 0.05   data 0.06          learning 0.02      services 0.03
document 0.03      discovery 0.03     networks 0.02      semantic 0.03
query 0.03         databases 0.02     recognition 0.02   services 0.03
text 0.03          rules 0.02         analog 0.01        peer 0.02
search 0.03        association 0.02   vlsi 0.01          ontologies 0.02
evaluation 0.02    patterns 0.02      neurons 0.01       rdf 0.02
user 0.02          frequent 0.01      gaussian 0.01      management 0.01
relevance 0.02     streams 0.01       network 0.01       ontology 0.01

Information        Data mining        Machine            Web
Retrieval                             learning

Coherent community assignment
Topic Modeling and SNA Improve Each Other
Methods   Cut Edge Weights  Ratio Cut / Norm. Cut  Community Sizes (1-4)
PLSA      4831              2.14 / 1.25            2280, 2178, 2326, 2257
NetPLSA   662               0.29 / 0.13            2636, 1989, 3069, 1347
NCut      855               0.23 / 0.12            2699, 6323, 8, 11

(For cut edge weights and cut values, the smaller the better.)
NCut: spectral clustering with normalized cut (J. Shi et al. 2000), pure network-based community finding.

• Network regularization helps extract coherent communities (the network assures the focus of topics)
• Topic modeling helps balance communities (text implicitly bridges authors)
Smoothed Topic Map
Map a topic on the network (e.g., using p(θ|a)): PLSA (topic: "information retrieval") vs. NetPLSA. The smoothed NetPLSA map separates core contributors, intermediate authors, and irrelevant authors.
Summary of My Talk
• Text + Context = Contextual Text Mining: a new paradigm of text mining
• A novel framework for contextual text mining: probabilistic topic models, contextualized by simple context, implicit context, and complex context
• Applications of contextual text mining
A Roadmap of My Work
Information Retrieval& Web Search
Contextual Text Mining
KDD 05
KDD 06b
WWW 06
WWW 07
WWW 08
Contextual TopicModels
KDD 06a
SIGIR 07
SIGIR 08
KDD 07
WSDM 08
CIKM 08
ACL 08
Research Discipline
Text InformationManagement
Text Mining InformationRetrieval
Data Mining
Natural LanguageProcessing
Database
Bioinformatics
Machine Learning
Applied Statistics
Social Networks
Information Science
End Note