A Unified Model for Stable and Temporal Topic Detection from Social
Media Data
Hongzhi Yin†, Bin Cui†, Hua Lu‡, Yuxin Huang† and Junjie Yao†
†Peking University, ‡Aalborg University
Outline
- Motivation
- Problem Formulation
- A Basic Solution: A User-Temporal Mixture Model
- Enhancement of the Basic Solution: Regularization Technique; Burst-Weighted Boosting
- Experiments
- Q/A
Motivation (Cont.)
Two different types of topics are mixed together on social media platforms such as Twitter, Weibo and Delicious:
- Temporal topics are temporally coherent, meaningful themes. They are time-sensitive and often concern popular real-life events or hot spots, i.e., breaking events in the real world.
- Stable topics reflect users' regular interests and their daily routine discussions, e.g., their moods and statuses.
One Example in Twitter
Temporal topic: dead pigs in Shanghai
Stable topic: big data
Another Example in Twitter
Temporal topic: Independence Day
Stable topic: animal adoption
We can tell the difference between temporal and stable topics from their temporal distributions and their description words.
Problem Formulation
A user-time-associated document d is a text document associated with a time stamp and a user.
A temporal topic is a temporally coherent theme: words that emerge close together in the time dimension are clustered into the same topic. For example, given a collection of user-time-associated tweets, the desired temporal topics are the events happening at different times.
Formally, a temporal/stable topic z is represented by a word distribution p(w | z), where Σ_w p(w | z) = 1.
Problem Formulation (Cont.)
A topic distribution in the time dimension is the distribution of topics given a specific time interval. Formally, p(z | t) is the probability of temporal topic z given time interval t.
A topic distribution in the user space is the distribution of topics given a specific user. Formally, p(z | u) is the probability of stable topic z given user u.
Problem Formulation (Cont.)
A user-time-keyword matrix M is a hyper-matrix whose three dimensions refer to user, time and keyword. A cell M[u, t, w] stores the frequency of word w generated by user u within time interval t.
Given a collection of user-time-associated documents C, we first build the matrix M and then address two tasks:
- Task 1: detecting temporal topics
- Task 2: extracting stable topics
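As an illustration, the construction of M can be sketched in a few lines of Python. The function name, the one-day time bucketing and the toy documents below are our own, not from the paper:

```python
from collections import Counter, defaultdict

def build_matrix(docs, interval_seconds=86400):
    """Build a sparse M where M[(u, t)][w] is the frequency of word w
    generated by user u within time interval t (here: one-day buckets)."""
    M = defaultdict(Counter)
    for user, ts, text in docs:
        t = int(ts // interval_seconds)  # map the time stamp to an interval
        M[(user, t)].update(text.lower().split())
    return M

# Toy user-time-associated documents: (user, time stamp in seconds, text)
docs = [
    ("alice", 0, "big data big models"),
    ("alice", 100, "data pipelines"),
    ("bob", 90000, "election news"),
]
M = build_matrix(docs)
# M[("alice", 0)]["big"] == 2 and M[("bob", 1)]["election"] == 1
```

A dict-of-Counters keeps M sparse, which matters in practice since most (user, interval, word) cells are zero.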
Problem Formulation (Cont.)
Task 1: detecting a set of temporal topics that are event-driven.
- Detecting bursty events, such as disasters (e.g., earthquakes), politics (e.g., elections) and public events (e.g., the Olympics)
- Analyzing topic trends
Task 2: extracting a set of stable topics that are interest-driven.
- Finding users' intrinsic interests and better modeling user preferences
A User-Time Mixture Model: Main Insights
To find both temporal and stable topics in a unified manner, we propose a topic model that simultaneously captures two observations:
- Words generated around the same time are more likely to belong to the same event-driven temporal topic.
- Words generated by the same user are more likely to belong to the same interest-driven stable topic.
The former helps find event-driven temporal topics, while the latter helps identify interest-driven stable topics.
Combining User and Time Information
We assume that when a user u generates a word w at time t, he/she is probably influenced by two factors: the breaking news/events occurring at time t, and his/her intrinsic interests.
Breaking events are modeled by temporal topics; user intrinsic interests are modeled by stable topics.
The likelihood that user u generates word w at time t is:

p(w | u, t) = λ_t Σ_z p(z | t) p(w | z) + λ_u Σ_z p(z | u) p(w | z),  with λ_t + λ_u = 1.

Parameters λ_t and λ_u are mixing weights controlling the choice of motivation factor; they also denote the proportions of temporal and stable topics in the dataset. It is worth mentioning that they are learned automatically rather than being fixed.
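Numerically, this mixture is a weighted sum of two topic-mixture terms. A minimal NumPy sketch, where the array names and shapes are our own convention:

```python
import numpy as np

def word_likelihood(w, u, t, phi, theta_time, theta_user, lam_t, lam_u):
    """p(w | u, t) = lam_t * sum_z p(z|t) p(w|z) + lam_u * sum_z p(z|u) p(w|z).

    phi:        (K, V) topic-word distributions p(w | z)
    theta_time: (T, K) temporal topic distributions p(z | t)
    theta_user: (U, K) stable topic distributions p(z | u)
    lam_t, lam_u: mixing weights with lam_t + lam_u = 1
    """
    temporal = lam_t * np.dot(theta_time[t], phi[:, w])  # temporal factor
    stable = lam_u * np.dot(theta_user[u], phi[:, w])    # stable factor
    return temporal + stable
```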
Parameter Estimation
The log-likelihood of the whole user-time-associated document collection C is

L(C) = Σ_u Σ_t Σ_w M[u, t, w] log p(w | u, t).

We use the EM algorithm to estimate the parameters:
- E-step: compute the expectation Q of the complete log-likelihood.
- M-step: maximize Q; the updates have closed-form solutions.
Please refer to Section 4.2 of the paper for the details of the EM algorithm.
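One EM iteration for the basic (unregularized) model can be sketched compactly with NumPy. The tensor layout and variable names are ours, and for clarity the sketch trades memory for brevity by materializing the full responsibility tensors:

```python
import numpy as np

def em_step(M, phi, th_t, th_u, lam, eps=1e-12):
    """One EM iteration for the basic user-time mixture model (sketch).

    M:    (U, T, V) user-time-keyword count tensor
    phi:  (K, V) p(w|z); th_t: (T, K) p(z|t); th_u: (U, K) p(z|u)
    lam:  weight of the temporal factor (the stable weight is 1 - lam)
    """
    # E-step: posterior over (factor, topic) for every (u, t, w) cell
    pt = lam * th_t[None, :, None, :] * phi.T[None, None, :, :]        # temporal
    pu = (1 - lam) * th_u[:, None, None, :] * phi.T[None, None, :, :]  # stable
    norm = pt.sum(-1, keepdims=True) + pu.sum(-1, keepdims=True) + eps
    rt, ru = pt / norm, pu / norm
    # M-step: closed-form updates from expected counts
    ct, cu = M[..., None] * rt, M[..., None] * ru
    phi_new = (ct.sum((0, 1)) + cu.sum((0, 1))).T
    phi_new /= phi_new.sum(1, keepdims=True) + eps
    th_t_new = ct.sum((0, 2))
    th_t_new /= th_t_new.sum(1, keepdims=True) + eps
    th_u_new = cu.sum((1, 2))
    th_u_new /= th_u_new.sum(1, keepdims=True) + eps
    lam_new = ct.sum() / (M.sum() + eps)  # mixing weight is re-estimated too
    return phi_new, th_t_new, th_u_new, lam_new
```

Iterating this step until the log-likelihood converges yields the temporal topics (via th_t) and stable topics (via th_u) simultaneously.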
Spatial Regularization
Intuitions:
- If two users are connected in the social network, they are more likely to share the same or similar interests/topics.
- A topic is interest-coherent if the people who are interested in it are also close in the network space.
[Figure: a user connected to three "DB" users in the social network. More likely to be a DB person or an IR person? Intuition: users' interests are similar to their neighbors'.]
Spatial Regularization (Cont.)
Topic Model with Spatial Regularization: a regularized data likelihood is defined as

L_R(C) = (1 − γ) · L(C) − γ · R(C),  R(C) = (1/2) Σ_{(u,v)∈E} Σ_z ( p(z | u) − p(z | v) )²,

where E is the edge set of the social network and γ controls the regularization strength. The spatial regularizer plays the role of spatial smoothing for user interests.
Parameter Estimation
We again use the EM algorithm, now on the regularized complete log-likelihood:
- E-step: compute the expectation.
- M-step: maximize, using the Newton-Raphson method.
Smoothing via the spatial regularizer: in each iteration, a user's interests are smoothed by his/her spatial neighbors.
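The effect of this smoothing step can be illustrated with a simplified stand-in: instead of the Newton-Raphson update, each user's topic distribution is pulled toward the average of his/her neighbors. The function name and the `gamma` strength are hypothetical:

```python
import numpy as np

def smooth_user_interests(theta_user, neighbors, gamma=0.3):
    """Pull each row p(z | u) toward the mean of the user's neighbors.

    theta_user: (U, K) stable topic distributions; neighbors: dict u -> [v, ...].
    A simplified illustration of spatial smoothing, not the exact M-step
    used in the paper.
    """
    smoothed = theta_user.copy()
    for u, nbrs in neighbors.items():
        if not nbrs:
            continue  # isolated users keep their own distribution
        avg = theta_user[nbrs].mean(axis=0)  # neighbors' average interests
        smoothed[u] = (1 - gamma) * theta_user[u] + gamma * avg
    return smoothed / smoothed.sum(axis=1, keepdims=True)
```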
Insights
In topic models, words with a high occurrence rate, i.e., popular words, are likely to appear at the top positions of each discovered topic.
These popular words are mostly general words denoting abstract concepts. In stable topics, they can illustrate the domain of a topic at first glimpse.
However, in temporal topics, words with a notable bursty feature are superior at expressing temporal information, since users are more interested in bursty words than in abstract concepts when browsing temporal topics.
Example: Michael Jackson’s Death
In this temporal topic, we expect the bursty words "mj", "michael jackson" and "moonwalk" to become the dominant words, rather than the general words "world", "news" and "death".
However, the general words cannot simply be removed as stop words, since they help illustrate the stable topics.
Burst-Weighted Boosting
We implement a bursty boosting step to escalate the probability of bursty words while detecting temporal topics:
- We first compute the bursty degree of each word in each time interval (Yao et al., ICDE 2010).
- A boosting step is then taken after every few EM iterations.
In this step, a word w has its generation probability boosted in a temporal topic only if w's bursty period overlaps with that of the topic.
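A hedged sketch of such a boosting step; the multiplicative form and the `alpha` strength are our illustration, and the paper's exact boosting formula may differ:

```python
import numpy as np

def boost_bursty_words(phi_z, bursty_degree, topic_period, word_periods, alpha=0.5):
    """Boost bursty words in one temporal topic's word distribution.

    phi_z: (V,) word distribution p(w | z); bursty_degree: (V,) burstiness scores.
    topic_period / word_periods: (start, end) interval tuples; a word is
    boosted only if its bursty period overlaps the topic's period.
    """
    boosted = phi_z.copy()
    ts, te = topic_period
    for w, (ws, we) in enumerate(word_periods):
        if ws <= te and we >= ts:  # the two periods overlap
            boosted[w] *= 1.0 + alpha * bursty_degree[w]
    return boosted / boosted.sum()  # renormalize to a distribution
```

Renormalizing after the multiplicative boost keeps phi_z a valid distribution, so the subsequent EM iterations can proceed unchanged.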
Data Sets
- Twitter data set (Mar. 2009 to Oct. 2009)
- Delicious data set (Feb. 2008 to Dec. 2009)
- Sina Weibo (2011)
Data Sets (Cont.)
Twitter: people on this platform often discuss social events and their daily lives. The data set contains 9,884,640 tweets posted by 456,024 users from Mar. 2009 to Oct. 2009. Each user published at least 200 posts. We first removed all stop words.
Delicious: Delicious is a collaborative tagging system on which users can upload and tag web pages. We collected 200,000 users and their tagging behaviors from Feb. 2008 to Dec. 2009. The data set contains 7,103,622 tags. Topics on technology and electronics cover more than half of the tags; breaking news also co-exists.
Compared Methods
Our models:
- BUT: the basic model
- EUTS: the model enhanced with spatial regularization
- EUTB: the model enhanced with both spatial regularization and burst-weighted boosting
Baselines:
- PLSA model on time slices (Mei et al., KDD'05)
- Individual detection method (Wang et al., KDD'07)
- Topics over Time model (TOT) (Wang et al., KDD'06)
- TimeUserLDA (Diao et al., ACL'12)
Time Stamp Prediction Comparison
[Figure: bar chart (y-axis 0 to 0.8) comparing time stamp prediction performance of EUTB, EUTS, BUT, TOT, TimeUserLDA and Individual Detection.]
Topic Quality Comparison
Rating scale:
- Excellent: a nicely presented temporal topic
- Good: a topic containing bursty features
- Poor: a topic without obvious bursty features
Stable Topics Detected in Delicious
T 10: windows 0.049, tools 0.048, freeware 0.038, firefox 0.038, (word missing) 0.029, security 0.028
T 16: resources 0.034, education 0.031, interactive 0.020, teaching 0.020, science 0.019, tools 0.015
T 27: news 0.107, latest 0.102, current 0.099, world 0.094, events 0.084, newspaper 0.084
T 55: u.s. 0.096, news 0.081, politics 0.076, democrats 0.068, international 0.064, obama 0.061
T 8: programming 0.028, python 0.019, ruby 0.016, javascript 0.015, software 0.014, tutorial 0.011
T 33: food 0.034, recipe 0.033, cooking 0.030, dessert 0.026, shopping 0.021, home 0.016
Temporal Topics Detected in Delicious
T77 (1.12–1.31): obama 0.144, inauguration 0.106, bush 0.059, president 0.021, gaza 0.017, whitehouse 0.012
T78 (6.15–6.27): moon 0.090, space 0.068, apollo11 0.032, apollo 0.023, nasa 0.018, competition 0.015
T87 (4.24–5.6): flu 0.158, swineflu 0.078, pandemic 0.062, swine 0.050, health 0.020, disease 0.010
T89 (5.27–6.6): google 0.061, googlewave 0.059, wave 0.042, bing 0.040, apps 0.040, realtime 0.038
Stable Topics Detected in Twitter
T 5: free 0.020, market 0.011, money 0.010, people 0.007, check 0.007, help 0.006
T 6: free 0.007, iphone 0.006, video 0.006, photo 0.006, camera 0.004, apple 0.004
T 11: day 0.104, travel 0.009, hotel 0.008, check 0.006, site 0.004, golf 0.004
T 53: assassin 0.039, attempt 0.034, wound 0.024, level 0.020, reach 0.016, account 0.010
T 39: god 0.015, day 0.013, follow 0.010, free 0.009, look 0.008, check 0.006
T 22: teeth 0.035, white 0.027, mom 0.027, yellow 0.023, trick 0.022, free 0.021
Temporal Topics Detected in Twitter
T63 (7.6–7.15): july 0.012, free 0.010, summer 0.008, live 0.007, potter 0.006, harry 0.006
T86 (7.1–7.6): july 0.035, happy 0.020, day 0.016, firework 0.009, independ 0.006, celebrate 0.005
T66 (10.7–10.15): free 0.012, nobel 0.012, prize 0.011, peace 0.008, win 0.008, obama 0.008
T70 (6.24–6.30): michael 0.038, jackson 0.036, rip 0.007, farrah 0.007, dead 0.005, sad 0.005