Finding bursty topics from microblogs
FINDING BURSTY TOPICS FROM MICROBLOGS
Qiming Diao, Jing Jiang, Feida Zhu, Ee-Peng Lim
Living Analytics Research Centre, School of Information Systems, Singapore Management University
Abstract
To find topics that have bursty patterns on microblogs.
Two observations:
1. Posts published around the same time are more likely to have the same topic.
2. Posts published by the same user are more likely to have the same topic.
Introduction
Retrospective bursty event detection:
Burst detection: state machine
Topic discovery: LDA
Two assumptions:
1. If a post is about a global event, it is likely to follow a global topic distribution that is time-dependent.
2. If a post is about a personal topic, it is likely to follow a personal topic distribution that is more or less stable over time.
Method
Preliminaries
Notation: d_i (a post), u_i (its user), t_i (its timestamp), w_{i,j} (its j-th word).
A bursty topic b is a word distribution coupled with a bursty interval, denoted as (ϕ_b, t_s^b, t_e^b).
Our task: to find meaningful bursty topics from the input text stream.
Our method: a topic discovery step and a burst detection step.
Our Topic Model
Assume:
1. There are C (latent) topics in the text stream, where each topic c has a word distribution ϕ_c.
2. There is a background word distribution ϕ_B.
3. A single post is most likely to be about a single topic.
4. There is a global topic distribution θ_t for each time point t.
Because our focus is to find popular global events, we need to separate out “personal” posts.
To do so, the model keeps a time-independent topic distribution η_u for each user u to capture her long-term topical interests.
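The assumptions above can be read as a generative story for a single post. The following is a minimal sketch of that story, not the authors' implementation: the switch probabilities `p_global` and `p_background`, their default values, and the function name `sample_post` are illustrative assumptions that do not appear on the slides.

```python
import random


def sample_post(theta_t, eta_u, phi, phi_B, n_words,
                p_global=0.5, p_background=0.3, rng=random):
    """Sample one post: (is_global, topic index, list of word indices).

    theta_t : global topic distribution for the post's time point
    eta_u   : personal topic distribution of the post's user
    phi     : phi[c] is the word distribution of topic c
    phi_B   : background word distribution
    p_global, p_background : assumed switch probabilities (hypothetical)
    """
    # Decide whether the post is about a global event or a personal topic.
    is_global = rng.random() < p_global
    topic_dist = theta_t if is_global else eta_u

    # A single post gets a single topic.
    c = rng.choices(range(len(topic_dist)), weights=topic_dist)[0]

    # Each word comes either from the background or from the post's topic.
    vocab = range(len(phi_B))
    words = []
    for _ in range(n_words):
        if rng.random() < p_background:
            words.append(rng.choices(vocab, weights=phi_B)[0])
        else:
            words.append(rng.choices(vocab, weights=phi[c])[0])
    return is_global, c, words
```

Passing a seeded `random.Random` instance as `rng` makes the sample reproducible.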
Learning
Gibbs sampling, with counters:
M(0), M(1), M(.)
M(c), M(.)
M(c), M(.)
E(v), E(.)
M(v), M(.)
M(w_{i,j}), M(w_{i,j}), M(.)
Burst Detection
Assume: a series of counts (m^c_1, m^c_2, ..., m^c_T) represents the intensity of topic c at different time points.
These counts are generated by two Poisson distributions corresponding to a bursty state and a normal state.
σ_0 = 0.9 and σ_1 = 0.6 for all topics.
Finally, a burst is marked by a consecutive subsequence of bursty states.
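A two-state Poisson model of this kind can be decoded with the Viterbi algorithm. The sketch below is written under stated assumptions rather than taken from the authors' code: it reads σ_0 and σ_1 as the probabilities of staying in the normal and the bursty state, uses a uniform prior over the initial state, sets the normal rate to the mean count, and sets the bursty rate to `burst_ratio` times that (the factor 3 is an assumed value). The name `detect_bursts` is hypothetical.

```python
import math


def poisson_logpmf(k, mu):
    """Log-probability of count k under a Poisson distribution with rate mu."""
    return k * math.log(mu) - mu - math.lgamma(k + 1)


def detect_bursts(counts, sigma0=0.9, sigma1=0.6, burst_ratio=3.0):
    """Return (states, bursts); state 1 is bursty, bursts are (start, end) pairs."""
    mu0 = max(sum(counts) / len(counts), 1e-9)   # assumed normal rate
    mu = (mu0, burst_ratio * mu0)                # assumed bursty rate
    log_trans = [[math.log(sigma0), math.log(1 - sigma0)],
                 [math.log(1 - sigma1), math.log(sigma1)]]

    # Viterbi forward pass with back-pointers.
    score = [math.log(0.5) + poisson_logpmf(counts[0], mu[s]) for s in (0, 1)]
    back = []
    for m in counts[1:]:
        new_score, ptr = [], []
        for s in (0, 1):
            prev = max((0, 1), key=lambda p: score[p] + log_trans[p][s])
            ptr.append(prev)
            new_score.append(score[prev] + log_trans[prev][s]
                             + poisson_logpmf(m, mu[s]))
        score = new_score
        back.append(ptr)

    # Backtrack the most likely state sequence.
    state = max((0, 1), key=lambda s: score[s])
    states = [state]
    for ptr in reversed(back):
        state = ptr[state]
        states.append(state)
    states.reverse()

    # A burst is a maximal run of consecutive bursty states.
    bursts, start = [], None
    for t, s in enumerate(states):
        if s == 1 and start is None:
            start = t
        elif s == 0 and start is not None:
            bursts.append((start, t - 1))
            start = None
    if start is not None:
        bursts.append((start, len(states) - 1))
    return states, bursts
```

For example, `detect_bursts([5] * 10 + [40] * 3 + [5] * 10)` marks time points 10 through 12 as a single burst.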
Experiments
Data Set
We sampled 2892 users from this dataset and extracted their tweets between September 1 and November 30, 2011 (91 days in total).
The final dataset has 3,967,927 tweets and 24,280,638 tokens.
Ground Truth Generation
We took the top-30 bursty topics from each model and asked two human judges to judge their quality by assigning a score of either 0 or 1.
Evaluation
We set the number of topics C to 80, α to 50/C and β to 0.01. Each model was run for 500 iterations of Gibbs sampling.
Sample Results and Discussions
Two case studies to demonstrate the effectiveness of our model:
Effectiveness of Temporal Models: Both TimeLDA and TimeUserLDA tend to group posts published on the same day into the same topic.
Effectiveness of User Models: It is important to filter out users’ “personal” posts in order to find meaningful global events.
Conclusions
A new topic model that considers both the temporal information of microblog posts and users’ personal interests.
A Poisson-based state machine to identify bursty periods from the topics discovered by our model.
TM-LDA: EFFICIENT ONLINE MODELING OF THE LATENT TOPIC TRANSITIONS IN SOCIAL MEDIA
ABSTRACT
TM-LDA learns the transition parameters among topics by minimizing the prediction error on topic distribution in subsequent postings.
We develop an efficient updating algorithm to adjust transition parameters, as new documents stream in.
Challenges:
1. to model and analyze latent topics in social textual data;
2. to adaptively update the models as the massive social content streams in;
3. to facilitate temporal-aware applications of social media.
CONTRIBUTIONS
First, we propose a novel temporally-aware topic language model, TM-LDA, which captures the latent topic transitions in temporally-sequenced documents.
Second, we design an efficient algorithm to update TM-LDA which enables it to be performed on large scale data.
Finally, we evaluate TM-LDA against the static topic modeling method (LDA).
METHODOLOGY
TM-LDA Algorithm
If we define the space of topic distributions as X = { x ∈ R^n_+ : ||x||_1 = 1 }, TM-LDA can be considered as a function f : X → X.
the prediction error
TM-LDA is modeled as a non-linear mapping:
Error Function of TM-LDA:
Iterative Minimization of the Error Function
Direct Minimization of the Error Function
TM-LDA for Twitter Stream
Let A = D(1, m) and B = D(2, m+1).
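One way to read this setup: row i of A holds the topic distribution of a post and row i of B holds that of the post that follows it, so a transition matrix T can be fitted by least squares on ||AT − B||. The sketch below assumes that reading, and also assumes that a prediction xT is renormalized to have L1 norm 1 so that it stays in X. The helper names `fit_transition` and `predict_next` are hypothetical.

```python
import numpy as np


def fit_transition(D):
    """Least-squares transition matrix from a sequence of topic distributions.

    D has one row per post, in temporal order. A is rows 1..m and
    B is rows 2..m+1, so each row of B follows the same row of A.
    """
    A, B = D[:-1], D[1:]
    T, *_ = np.linalg.lstsq(A, B, rcond=None)  # minimizes ||A T - B||_F
    return T


def predict_next(x, T):
    """Map a topic distribution x to the predicted next one, renormalized."""
    y = x @ T
    return y / np.abs(y).sum()
```

When the sequence really is generated by a fixed row-stochastic matrix, `fit_transition` recovers that matrix up to numerical error, provided A has full column rank.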
UPDATING TRANSITION PARAMETERS
Updating Transition Parameters with Sherman-Morrison-Woodbury Formula
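The slides name the formula without showing how it is applied. A possible sketch, assuming T is the least-squares solution T = (A′A)^{-1} A′B and that new rows U (appended to A) and V (appended to B) arrive in small batches: the Woodbury identity then updates the inverse (A′A)^{-1} without refactoring the full matrix. The function name `smw_update` and the choice to cache A′B are assumptions made for illustration.

```python
import numpy as np


def smw_update(P, AtB, U, V):
    """Update the least-squares solution after appending rows U to A and V to B.

    P   : current inverse of A.T @ A            (n x n)
    AtB : current A.T @ B                       (n x n)
    U,V : the k new rows for A and B            (k x n each)
    Returns the updated P, the updated A.T @ B, and the new T.
    """
    k = U.shape[0]
    # Woodbury: (A'A + U'U)^-1 = P - P U' (I + U P U')^-1 U P
    S = np.eye(k) + U @ P @ U.T
    P_new = P - P @ U.T @ np.linalg.solve(S, U @ P)
    AtB_new = AtB + U.T @ V
    return P_new, AtB_new, P_new @ AtB_new
```

Only a k × k system is solved per update, which is the point of the formula when k is much smaller than the number of rows already seen.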
Updating Transition Parameters with QR-factorization
Suppose the QR-factorization of matrix A is A = QR, where Q′Q = I and R is an upper triangular matrix. Then RT = Q′B.
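Because R is upper triangular, the system RT = Q′B can be solved by back substitution. The following is a small sketch of that step using NumPy's reduced QR factorization; the function names are hypothetical, and the code assumes A has full column rank so that R has a nonzero diagonal.

```python
import numpy as np


def solve_upper(R, Y):
    """Solve R T = Y for T by back substitution (R upper triangular)."""
    n = R.shape[0]
    T = np.zeros_like(Y, dtype=float)
    for i in range(n - 1, -1, -1):
        # Subtract the already-solved rows, then divide by the pivot.
        T[i] = (Y[i] - R[i, i + 1:] @ T[i + 1:]) / R[i, i]
    return T


def transition_via_qr(A, B):
    """Least-squares T for A T ≈ B using A = Q R and R T = Q' B."""
    Q, R = np.linalg.qr(A)  # reduced factorization: Q' Q = I
    return solve_upper(R, Q.T @ B)
```

The result agrees with a generic least-squares solver on the same A and B.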
EXPERIMENTS
Dataset
Using Perplexity as Evaluation Metric
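The slides do not give the formula used. As a sketch, the standard definition of perplexity is the exponential of the negative average log-likelihood per token, with the probability of word w in document d taken as Σ_k θ_{d,k} ϕ_{k,w}. The function name and argument layout below are assumptions.

```python
import math


def perplexity(docs, thetas, phi):
    """Perplexity of held-out documents under a topic model.

    docs   : list of documents, each a list of word indices
    thetas : thetas[d][k] is the weight of topic k in document d
    phi    : phi[k][w] is the probability of word w under topic k
    """
    log_lik, n_tokens = 0.0, 0
    for words, theta in zip(docs, thetas):
        for w in words:
            # Mixture probability of the word under the document's topics.
            p = sum(theta[k] * phi[k][w] for k in range(len(theta)))
            log_lik += math.log(p)
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)
```

Lower values mean the predicted topic distribution explains the observed words better; a model that spreads probability uniformly over a vocabulary of size V has perplexity V.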
Predicting Future Tweets
TM-LDA first trains LDA on 7 days of historical tweets and computes the transition parameter matrix accordingly. Then, for each new tweet generated on the 8th day, it predicts the topic distribution of the following tweet.
Estimated Topic Distributions of “Future” Tweets: the topic distribution of the tweet b.
LDA Topic Distributions of “Future” Tweets: the inferred topic distribution of the tweet b.
LDA Topic Distributions of “Previous” Tweets: the inferred topic distribution of the tweet a.
Efficiency of Updating Transition Parameters
Properties of Transition Parameters
T is a square matrix where the size of T is determined by the number of topics trained in LDA.
The row sum of T is always 1, which means that the overall weight emitted from a topic is 1.
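The slides state the row-sum property without proof. Assuming T is the least-squares solution T = (A′A)^{-1} A′B and every row of A and B is a topic distribution (so each row sums to 1), it follows in one line. Here 1_n and 1_m denote all-ones vectors of length n (number of topics) and m (number of rows of A and B).

```latex
A\mathbf{1}_n = \mathbf{1}_m,\quad B\mathbf{1}_n = \mathbf{1}_m
\;\Longrightarrow\;
T\mathbf{1}_n
  = (A^{\top}A)^{-1}A^{\top}B\,\mathbf{1}_n
  = (A^{\top}A)^{-1}A^{\top}\mathbf{1}_m
  = (A^{\top}A)^{-1}A^{\top}A\,\mathbf{1}_n
  = \mathbf{1}_n .
```

Individual entries of T are not constrained to be non-negative by this argument; only the row sums are fixed.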
APPLYING TM-LDA FOR TREND ANALYSIS AND SENSEMAKING
Changing Topic Transitions over Time
Various Topic Transition Patterns by Cities
CONCLUSIONS
a novel temporally-aware language model, TM-LDA, for efficiently modeling streams of social text such as a Twitter stream for an author
an efficient model updating algorithm for TM-LDA