Post on 09-Jan-2016
PET: A Statistical Model for Popular Events Tracking in Social Communities
Cindy Xide Lin¹, Bo Zhao¹, Qiaozhu Mei², Jiawei Han¹
¹University of Illinois at Urbana-Champaign, ²University of Michigan
KDD 2010
2010. 09. 16.
Summarized and Presented by Sang-il Song, IDS Lab., Seoul National University
Copyright 2010 by CEBT
Contents
Introduction
Concept Definition
Problem Definition
Model
Interest model
Topic model
Experiment
Data Collection
Baseline and Gold standard
Analysis on Popularity Trend
Analysis on Content Evolution
Conclusions & Discussions
Introduction
Boom of online communities
e.g., Facebook, Blogger, Twitter, …
Facilitates the information creation, sharing and diffusion.
A popular topic or event can spread much faster.
Need to track the diffusion and evolution of a popular event
Hot topics emerge, prevail and die
It is desirable to monitor whether people like, what they like, and how their interests change over time
e.g., Who is still interested in watching Avatar 50 days after its release date?
Introduction
Tracking the evolution of a popular topic is challenging
Diffusion of an event is vague
e.g., You don’t know whether I am interested in an event,
and even if you do, you don’t know from whom I got this interest.
Fortunately, a large volume of text data is generated from the social communities.
Besides communicating with friends, a web user also constantly generates text content such as blog posts.
The result is a network structure and a text collection that evolve simultaneously and interrelatedly.
Goal
Tracking a popular event in a time-variant social community
A stream of text information
A stream of network structures
Modeling the interest of users
Modeling the change of topic
Concept Definition: Network Stream
Gk: the snapshot of the network at time tk
G = {G1, G2, …, Gn}
[Figure: an example snapshot Gk with six nodes, v1 to v6]
Concept Definition: Document Stream
Document Collection Stream D = {D1, D2, …, DT}
Document collection Dk = {dk,1, dk,2, …, dk,N}
[Figure: each node v1 to v6 of Gk is attached to a document dk,i, shown as a word sequence such as w1, w2, w3, w1, …]
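One way to picture the two input streams in code is sketched below. This is only an illustration, not the authors' implementation: the type aliases `Snapshot` and `Collection`, the helper `nodes`, the edge weights and the word tokens are all hypothetical and chosen for the example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation (an assumption, not from the paper):
# a snapshot G_k maps a directed edge (i, j) to a non-negative weight, and
# a collection D_k maps a node index i to its document d_{k,i} as a word list.
Snapshot = Dict[Tuple[int, int], float]
Collection = Dict[int, List[str]]

# Toy streams with two time points and three nodes.
G: List[Snapshot] = [
    {(1, 2): 1.0, (2, 3): 0.5},
    {(1, 2): 2.0, (2, 3): 0.5, (3, 1): 1.0},
]
D: List[Collection] = [
    {1: ["w1", "w2", "w3"], 2: ["w2", "w2"], 3: ["w4", "w1"]},
    {1: ["w1", "w1"], 2: ["w2", "w3"], 3: ["w1"]},
]


def nodes(snapshot: Snapshot) -> Set[int]:
    """Return every node that appears in at least one edge of the snapshot."""
    return {v for edge in snapshot for v in edge}


# The two streams are aligned: one snapshot and one collection per time point.
assert len(G) == len(D)
```

Keeping the streams as two parallel lists makes the pairing (Gk, Dk) at each time point explicit, which is what the model conditions on.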
Concept Definition: Topic and Event
Topic
A topic θ is a multinomial distribution over words, {p(w|θ)}w∈W
A topic has different versions over time; the version at time tk is denoted θk
Event
A stream of topics ΘE = {θ0E, θ1E, θ2E, …, θTE}
θ0E is the primitive topic of the event
θkE corresponds to the version of θ0E at time tk
– Indicates the major aspects of the event in network Gk
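To make "a topic is a multinomial distribution over words" concrete, here is a minimal sketch that builds such a distribution from a bag of words. Using a maximum-likelihood estimate from raw counts is an assumption made for illustration; the slides do not say how the primitive topic is obtained, and the function name `estimate_topic` and the sample words are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def estimate_topic(words: List[str]) -> Dict[str, float]:
    """Maximum-likelihood multinomial p(w | theta) from a list of word tokens."""
    counts = Counter(words)
    total = sum(counts.values())
    # Each probability is the relative frequency of the word in the sample.
    return {w: c / total for w, c in counts.items()}


# Hypothetical sample of words about an event.
theta = estimate_topic(["avatar", "movie", "avatar", "3d"])
```

An event is then a list of such dictionaries, one version per time point.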
Concept Definition: Interest
Interest
hk(i): the level of interest that node vi in Gk has in the particular event at time tk
A real value between 0 and 1
Hk = {hk(1), hk(2), …, hk(N)}
Problem: Popular Event Tracking
Inputs
Network Stream G
Document Stream D
Primitive topic of an event θ0
Task1: Popularity Tracking
Inferring the latent stream of interests (Hk)
– providing much richer information about how the interest e
Task2: Topic Tracking
Inferring the latent stream of topics about the event ΘE
– Keeping track of the new development about the event,
– Understanding event evolution
Intuitions
Observation 1. Interest and Connections
The behavior of each individual is usually influenced by his or her friends.
Observation 2. Interest and History
The behavior of each individual should be generally consistent over time.
Events should not change dramatically.
Observation 3. Content and Interest
When an individual has a higher level of interest in an event, the content she generates should be more likely to be related to the event.
The General Model
Current interest and topic depend on
Current network
Current documents
Previous history (Markovian simplification)
Formal representation
P(Hk, Θk | Gk, Dk, Hk-1)
Assumption
How to model P(Hk, Θk | Gk, Dk, Hk-1)?
Assumption 1.
Given the current network structure Gk and the previous interest status Hk-1,
the current interest status Hk is independent of the document collection Dk
Hk ⊥ Dk | Gk, Hk-1
People first become interested in the event and therefore generate discussion about it
Assumption 2.
Given the current interest status Hk and the document collection Dk,
the current topic model Θk is independent of Gk and Hk-1
Θk ⊥ Gk, Hk-1 | Hk, Dk
Once the author has developed an interest in the event, the content she writes will depend only on the event itself and her level of interest
P( Hk, Θk | Gk, Dk, Hk-1 ) = P(Hk | Gk, Hk-1) P(Θk|Hk, Dk)
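The factorization follows from the chain rule together with the two assumptions. The intermediate step, written out here for convenience in LaTeX using the same symbols as the slide, is:

```latex
\begin{align*}
P(H_k, \Theta_k \mid G_k, D_k, H_{k-1})
  &= P(H_k \mid G_k, D_k, H_{k-1})\,
     P(\Theta_k \mid H_k, G_k, D_k, H_{k-1})
     && \text{(chain rule)} \\
  &= P(H_k \mid G_k, H_{k-1})\,
     P(\Theta_k \mid H_k, D_k)
     && \text{(Assumptions 1 and 2)}
\end{align*}
```

Assumption 1 removes Dk from the first factor, and Assumption 2 removes Gk and Hk-1 from the second.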
Interest Model
Gibbs random field
Great use in studying natural processes
(Gibbs distribution)
cf. The Gaussian distribution is a special member of the Gibbs distribution family
P (Hk | Gk, Hk-1)
h’(k) is the weighted sum of the friends’ interest
The first part is the transition energy of node i
The last part represents the neighbors’ expectation
[Figure: a node whose three neighbors contribute the products 1*0.2, 0.3*0.8 and 0.2*0.1]
h’ = 1*0.2 + 0.3*0.8 + 0.2*0.1 = 0.46
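The weighted sum in the example can be reproduced with a few lines of Python. This is a sketch only: the function name `neighbor_expectation` is hypothetical, and which factor of each product is the edge weight and which is the neighbor's interest is an assumption, since the figure did not survive extraction.

```python
from typing import Sequence


def neighbor_expectation(weights: Sequence[float],
                         interests: Sequence[float]) -> float:
    """Weighted sum of neighbors' interest levels (no normalization applied)."""
    if len(weights) != len(interests):
        raise ValueError("weights and interests must have the same length")
    return sum(w * h for w, h in zip(weights, interests))


# Pairing taken from the slide's arithmetic; treating the first factor of each
# product as the edge weight is an assumption.
h_prime = neighbor_expectation([1.0, 0.3, 0.2], [0.2, 0.8, 0.1])
print(round(h_prime, 2))  # 0.46
```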
Topic Model
Each document is assumed to be generated by a two-component multinomial mixture model
Background model: θkB
– Modeling Common words
Latent event topic model: θkE
– Modeling discriminative and meaningful words
The probability of generating a word
P(Θk|Hk, Dk)
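The formula for the word probability is not present in the text above, so the sketch below shows only the general shape of a two-component mixture. It assumes that the mixing weight is the author's interest level hk(i), which is suggested by Observation 3 but not confirmed by the slide. The function name `word_probability` and the toy distributions are hypothetical.

```python
from typing import Dict


def word_probability(word: str, interest: float,
                     event_topic: Dict[str, float],
                     background: Dict[str, float]) -> float:
    """p(w | d) under a two-component multinomial mixture.

    ASSUMPTION: the mixing weight is the author's interest level h_k(i);
    the slide does not show the actual formula.
    """
    if not 0.0 <= interest <= 1.0:
        raise ValueError("interest must lie in [0, 1]")
    p_event = event_topic.get(word, 0.0)      # p(w | theta_k^E)
    p_background = background.get(word, 0.0)  # p(w | theta_k^B)
    return interest * p_event + (1.0 - interest) * p_background


# Toy distributions, invented for illustration.
theta_E = {"avatar": 0.6, "pandora": 0.4}
theta_B = {"the": 0.7, "avatar": 0.1, "movie": 0.2}
p = word_probability("avatar", 0.5, theta_E, theta_B)
```

Under this assumption an uninterested author (interest 0) writes purely from the background model, and a fully interested one writes purely from the event topic.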
Twitter Data collection
Selecting 5000 users with follower-followee relationships
Considering each day as a time point (tk: the kth day)
Document dk,i is obtained by concatenating the tweets displayed by user i on day k
The weight of the relationship between two users equals the number of tweets displayed by user i by following user j during the period from tk-30 to tk.
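The edge-weight rule can be sketched as a count over a sliding 30-day window. The log format, the function name `edge_weight` and the sample records are hypothetical, and treating both ends of the window as inclusive is an assumption.

```python
from typing import Iterable, Tuple

# Hypothetical log format: (day, viewer, author) means that on `day` a tweet
# by `author` was displayed by `viewer` because viewer follows author.
Display = Tuple[int, int, int]


def edge_weight(log: Iterable[Display], viewer: int, author: int,
                k: int, window: int = 30) -> int:
    """Count displays of author's tweets by viewer on days k-window .. k.

    ASSUMPTION: both ends of the window are inclusive.
    """
    return sum(1 for day, v, a in log
               if v == viewer and a == author and k - window <= day <= k)


# Invented sample records: user 7 follows user 9, as does user 8.
log = [(1, 7, 9), (20, 7, 9), (40, 7, 9), (41, 7, 9), (41, 8, 9)]
print(edge_weight(log, viewer=7, author=9, k=45))  # 3
```

Recomputing the count at every day k yields a weight that changes over time, which is what makes the network a stream of snapshots.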
Baseline and Gold standard
BOM: the daily box office earnings extracted from Box Office Mojo
The box office earning is a trustworthy criterion to reflect the movie’s popularity
GInt: Google Insight
PET
PET-: a special version of PET with the network structure removed
JonK / Cont
Analysis on Popularity Trend
PET always has the best performance
Historical, textual and structural information are reflected well
PET- cannot respond sufficiently to sudden changes
Analysis on Content Evolution
Conclusion & Discussion
Propose the novel problem of Popular Event Tracking
Propose popular event tracking model, PET
Unified probabilistic framework to model different factors
Covers classical models
Experimental studies show that PET outperforms existing models
PET is not a good framework for tracking interest
More accurate data exist, such as Google Insight.
Tracking topic change is a novel problem.
PET detects and tracks topic evolution well.