Finding bursty topics from microblogs

Qiming Diao, Jing Jiang, Feida Zhu, Ee-Peng Lim

Living Analytics Research Centre, School of Information Systems, Singapore Management University

Abstract

Goal: to find topics that have bursty patterns on microblogs.

Two observations:

1. Posts published around the same time are more likely to have the same topic.

2. Posts published by the same user are more likely to have the same topic.

Introduction

Retrospective bursty event detection:

Burst detection: state machine

Topic discovery: LDA

Two assumptions:

1. If a post is about a global event, it is likely to follow a global topic distribution that is time-dependent.

2. If a post is about a personal topic, it is likely to follow a personal topic distribution that is more or less stable over time.

Method

Preliminaries

Each post d_i has an author u_i, a timestamp t_i, and words w_{i,j}.

A bursty topic b is defined as a word distribution coupled with a bursty interval, denoted as (ϕ^b, t^b_s, t^b_e).

Our task: to find meaningful bursty topics from the input text stream.

Our method: a topic discovery step and a burst detection step.

Our Topic Model

Assume:

1. C (latent) topics in the text stream, where each topic c has a word distribution ϕc.

2. A background word distribution ϕB.

3. A single post is most likely to be about a single topic.

4. A global topic distribution θt for each time point t.

Because our focus is to find popular global events, we need to separate out users' "personal" posts.

A time-independent topic distribution ηu for each user u captures that user's long-term topical interests.
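The assumptions above can be read as a generative story for a single post. The following is a minimal sketch under stated assumptions, not the authors' implementation: the switch probability `pi_u` (chance that a post is personal rather than global), the per-word background probability `rho`, the Dirichlet hyperparameters, the toy sizes, and every function name are hypothetical and chosen only for illustration.

```python
import random

def dirichlet(alpha, size, rng):
    """Draw a probability vector from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(probs, rng):
    """Sample an index according to a probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def make_model(C, V, num_users, num_days, rng):
    """Toy parameters; all values here are assumptions for illustration."""
    return {
        "phi": [dirichlet(0.5, V, rng) for _ in range(C)],           # topic word distributions
        "phi_B": dirichlet(0.5, V, rng),                             # background word distribution
        "theta": [dirichlet(0.5, C, rng) for _ in range(num_days)],  # global, per time point
        "eta": [dirichlet(0.5, C, rng) for _ in range(num_users)],   # personal, per user
        "pi_u": 0.4,   # assumed probability that a post is personal
        "rho": 0.3,    # assumed probability that a word is background
    }

def generate_post(user, day, model, rng, length=8):
    """Generate one post: choose personal vs. global, then ONE topic, then words."""
    personal = rng.random() < model["pi_u"]
    topic_dist = model["eta"][user] if personal else model["theta"][day]
    topic = sample_index(topic_dist, rng)  # a single topic for the whole post
    words = []
    for _ in range(length):
        if rng.random() < model["rho"]:
            words.append(sample_index(model["phi_B"], rng))
        else:
            words.append(sample_index(model["phi"][topic], rng))
    return personal, topic, words

rng = random.Random(0)
model = make_model(C=5, V=20, num_users=3, num_days=7, rng=rng)
personal, topic, words = generate_post(user=1, day=2, model=model, rng=rng)
```

The design point the sketch tries to show is that only posts routed through θt contribute to the time-dependent topic counts, which is what lets personal posts be filtered out before burst detection.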

Learning

Gibbs sampling, with update equations expressed through the count statistics M(0), M(1), M(.); M(c), M(.); M(c), M(.); E(v), E(.); M(v), M(.).

Learning

M(w_{i,j}), M(w_{i,j}), M(.)

Burst Detection

Assume:

A series of counts (m^c_1, m^c_2, ..., m^c_T) represents the intensity of topic c at different time points.

These counts are generated by two Poisson distributions corresponding to a bursty state and a normal state.

Burst Detection

σ0 = 0.9 and σ1 = 0.6 for all topics.

Finally, a burst is marked by a consecutive subsequence of bursty states.
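The two-state idea can be sketched as a small Viterbi decoder over Poisson emissions. This is a hedged illustration rather than the authors' procedure. It assumes that σ0 and σ1 are the probabilities of staying in the normal and bursty state respectively, that the normal rate is the mean count, that the bursty rate is a multiple of it (`burst_factor=3.0` is an assumption), and that the initial state is uniform. The function names and the sample counts are hypothetical.

```python
import math

def poisson_logpmf(k, mu):
    """Log-probability of count k under a Poisson distribution with rate mu."""
    return k * math.log(mu) - mu - math.lgamma(k + 1)

def detect_bursts(counts, sigma0=0.9, sigma1=0.6, burst_factor=3.0):
    """Return (states, bursts): 0 = normal, 1 = bursty; bursts are inclusive (start, end)."""
    mu0 = max(sum(counts) / len(counts), 1e-9)  # assumed normal-state rate
    rates = (mu0, burst_factor * mu0)           # assumed bursty-state rate
    stay = (sigma0, sigma1)                     # assumed self-transition probabilities

    def log_trans(prev, cur):
        p = stay[prev] if prev == cur else 1.0 - stay[prev]
        return math.log(p)

    # Viterbi forward pass with a uniform initial state distribution.
    score = [math.log(0.5) + poisson_logpmf(counts[0], r) for r in rates]
    back = []
    for m in counts[1:]:
        new_score, ptr = [], []
        for cur in (0, 1):
            cands = [score[prev] + log_trans(prev, cur) for prev in (0, 1)]
            best = 0 if cands[0] >= cands[1] else 1
            ptr.append(best)
            new_score.append(cands[best] + poisson_logpmf(m, rates[cur]))
        score = new_score
        back.append(ptr)

    # Backtrack the most likely state sequence.
    state = 0 if score[0] >= score[1] else 1
    states = [state]
    for ptr in reversed(back):
        state = ptr[state]
        states.append(state)
    states.reverse()

    # A burst is a maximal run of consecutive bursty states.
    bursts, start = [], None
    for t, s in enumerate(states):
        if s == 1 and start is None:
            start = t
        elif s == 0 and start is not None:
            bursts.append((start, t - 1))
            start = None
    if start is not None:
        bursts.append((start, len(states) - 1))
    return states, bursts

daily_counts = [5, 4, 6, 5, 40, 45, 42, 5, 4, 6]  # made-up topic intensities
states, bursts = detect_bursts(daily_counts)
```

With these made-up counts the decoder marks days 4 to 6 (zero-based) as one burst, since the high counts are far more likely under the bursty rate than the cost of switching states.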

Experiments

Data Set

We sampled 2,892 users from this dataset and extracted their tweets between September 1 and November 30, 2011 (91 days in total).

The final dataset has 3,967,927 tweets and 24,280,638 tokens.

Ground Truth Generation

We took the top-30 bursty topics from each model and asked two human judges to judge their quality by assigning a score of either 0 or 1.

Evaluation

We set the number of topics C to 80, α to 50/C and β to 0.01. Each model was run for 500 iterations of Gibbs sampling.

Sample Results and Discussions

Two case studies to demonstrate the effectiveness of our model

Effectiveness of Temporal Models: Both TimeLDA and TimeUserLDA tend to group posts published on the same day into the same topic.

Effectiveness of User Models: it is important to filter out users’ “personal” posts in order to find meaningful global events.

Conclusions

A new topic model that considers both the temporal information of microblog posts and users’ personal interests.

A Poisson-based state machine to identify bursty periods from the topics discovered by our model.

TM-LDA: EFFICIENT ONLINE MODELING OF THE LATENT TOPIC TRANSITIONS IN SOCIAL MEDIA

ABSTRACT

TM-LDA learns the transition parameters among topics by minimizing the prediction error on topic distribution in subsequent postings.

We develop an efficient updating algorithm to adjust transition parameters as new documents stream in.

Challenges:

1. to model and analyze latent topics in social textual data;

2. to adaptively update the models as the massive social content streams in;

3. to facilitate temporal-aware applications of social media.

CONTRIBUTIONS

First, we propose a novel temporally-aware topic language model, TM-LDA, which captures the latent topic transitions in temporally-sequenced documents.

Second, we design an efficient algorithm to update TM-LDA which enables it to be performed on large scale data.

Finally, we evaluate TM-LDA against the static topic modeling method (LDA).

METHODOLOGY

TM-LDA Algorithm

If we define the space of topic distributions as X = { x ∈ R^n_+ : ||x||_1 = 1 }, TM-LDA can be considered as a function f : X → X.

the prediction error

TM-LDA is modeled as a non-linear mapping:

Error Function of TM-LDA:

Iterative Minimization of the Error Function

Direct Minimization of the Error Function

TM-LDA for Twitter Stream

Let A = D(1; m) and B = D(2; m+1).

UPDATING TRANSITION PARAMETERS

Updating Transition Parameters with Sherman-Morrison-Woodbury Formula

Updating Transition Parameters with QR-factorization

Suppose the QR-factorization of matrix A is A = QR, where Q′Q = I and R is an upper triangular matrix. The transition matrix T then satisfies RT = Q′B, which can be solved by back substitution.
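To make the QR route concrete, here is a minimal pure-Python sketch that factors A with modified Gram-Schmidt and then solves RT = Q′B by back substitution. It assumes A has full column rank. The helper names (`qr_gram_schmidt`, `solve_transition`, `matmul`) and the toy matrices are hypothetical, and a production system would normally rely on a numerical library rather than hand-written loops.

```python
def matmul(X, Y):
    """Plain matrix product of two lists-of-rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def qr_gram_schmidt(A):
    """Thin QR (A = QR) via modified Gram-Schmidt; assumes full column rank."""
    m, n = len(A), len(A[0])
    Q = [[0.0] * n for _ in range(m)]
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = [A[i][j] for i in range(m)]
        for k in range(j):
            # Remove the component of v along the already-built column k of Q.
            R[k][j] = sum(Q[i][k] * v[i] for i in range(m))
            v = [v[i] - R[k][j] * Q[i][k] for i in range(m)]
        R[j][j] = sum(x * x for x in v) ** 0.5
        for i in range(m):
            Q[i][j] = v[i] / R[j][j]
    return Q, R

def solve_transition(A, B):
    """Least-squares T for A T ~ B, obtained from R T = Q'B."""
    Q, R = qr_gram_schmidt(A)
    m, n, p = len(A), len(A[0]), len(B[0])
    QtB = [[sum(Q[i][r] * B[i][c] for i in range(m)) for c in range(p)]
           for r in range(n)]
    T = [[0.0] * p for _ in range(n)]
    for c in range(p):
        for r in range(n - 1, -1, -1):  # back substitution, bottom row first
            s = QtB[r][c] - sum(R[r][k] * T[k][c] for k in range(r + 1, n))
            T[r][c] = s / R[r][r]
    return T

# Toy check: rows of A are "previous" topic distributions, rows of B the "next" ones.
T_true = [[0.8, 0.2], [0.3, 0.7]]
A = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
B = matmul(A, T_true)
T_hat = solve_transition(A, B)
```

Because B is constructed exactly as A times `T_true`, the recovered `T_hat` matches `T_true` up to floating-point error. On real data the system is overdetermined and the same steps give the least-squares fit.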

EXPERIMENTS

Dataset

Using Perplexity as Evaluation Metric

Predicting Future Tweets

TM-LDA first trains LDA on 7-day historical tweets and computes the transition parameter matrix accordingly. Then for each new tweet generated on the 8th day, it predicts the topic distribution of the following tweet.

Estimated Topic Distributions of "Future" Tweets: the topic distribution of the tweet b.

LDA Topic Distributions of "Future" Tweets: the inferred topic distribution of the tweet b.

LDA Topic Distributions of "Previous" Tweets: the inferred topic distribution of the tweet a.

Efficiency of Updating Transition Parameters

Properties of Transition Parameters

T is a square matrix where the size of T is determined by the number of topics trained in LDA.

The row sum of T is always 1, which means that the overall weight emitted from a topic is 1.
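One consequence of the row-sum property is that multiplying a topic distribution by T preserves its total mass. The sketch below illustrates this under the assumption that a prediction is formed as the vector-matrix product xT. The function name `apply_transition` and the numbers in `T_example` are made up for illustration and are not taken from the experiments.

```python
def apply_transition(x, T):
    """Return the row vector x multiplied by the square matrix T."""
    return [sum(x[i] * T[i][j] for i in range(len(x))) for j in range(len(T[0]))]

# Hypothetical 3-topic transition matrix whose rows each sum to 1.
T_example = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
x = [0.5, 0.3, 0.2]                 # topic distribution of the "previous" tweet
predicted = apply_transition(x, T_example)
total = sum(predicted)              # stays at 1 because every row of T sums to 1
```

Row sums of 1 guarantee only that the entries of the product add up to 1. They do not by themselves guarantee that every entry is non-negative when T is fitted by least squares.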

APPLYING TM-LDA FOR TREND ANALYSIS AND SENSEMAKING

Changing Topic Transitions over Time

Various Topic Transition Patterns by Cities

CONCLUSIONS

A novel temporally-aware language model, TM-LDA, for efficiently modeling streams of social text such as a Twitter stream for an author

an efficient model updating algorithm for TM-LDA
