Uncovering Social Links Through
Stochastic Point Processes
Rui Zhang (u5963436)
Dr Marian-Andrei Rizoiu
Research School of Computer Science
The Australian National University
COMP6470 Final Presentation
May, 2017

The Problem
Figure, three panels:
(a) Twitter: Tweets 1 to 9
(b) Real retweet network (tree structure)
1. How tweets diffuse
2. Which user is important in the diffusion
(c) Retweet network from the Twitter API (star structure): wrong diffusion structure

The Problem
Purpose: Infer the real parent-offspring relationship between tweets using only one cascade
Existing methods:
Probability distribution
  Description: Predict links based on probabilities
  Shortcomings: Needs cascades for optimizing the parameters of the distribution
NETINF [Gomez-Rodriguez et al KDD’10]
  Description: Choose the links that improve the log-likelihood most significantly
  Shortcomings: Needs cascades for training and for prediction
Sometimes only one cascade occurs, and no further cascades are available for training and improving prediction.

Contents of this Presentation
• Modeling Retweet Cascades with Hawkes Point Processes
• Optimization by Expectation Maximization Algorithm
• Constructing the Twitter Dataset
• Evaluation and Results

Introduction to Hawkes Point Processes
Point Processes: describe events occurring at random locations and/or times.
Hawkes Point Processes [Hawkes Biometrika’71]: occurring events increase the likelihood of occurrence of future events (self-exciting).
Applications of Hawkes Point Processes:
(a) Modeling earthquake aftershocks
(b) Modeling trade

Branching Structure and Hidden Vars
Figure: the root tweet by user $u_1$, shown as event $(t_1, m_1)$ on an occurring-time axis.
Assumption (self-exciting): retweets in a cascade occur randomly, and the occurrence of retweets is likely to cause more retweets
$t$: occurring time
$m$: user influence (the number of followers)

Branching Structure and Hidden Vars
Figure: observed event sequence on an occurring-time axis, with events $(t_1, m_1)$ to $(t_5, m_5)$ by users $u_1$ to $u_5$ and edges labelled $p_{21}$, $p_{41}$, $p_{32}$ (highlighted) and $p_{54}$.
Assumption: retweets in a cascade occur randomly, and the occurrence of retweets is likely to cause more retweets
$p_{ji}$: P(the $j$th retweet is caused by the $i$th retweet)

Modeling Retweet Cascades
Model: Hawkes Point Processes with Power-law Triggering Kernel [Mishra et al CIKM’16]
$\lambda(t) = \sum_{t_i < t} \phi_{m_i}(t - t_i)$
$\phi_{m_i}(t - t_i) = \kappa m_i^{\beta} (t - t_i + c)^{-(1+\theta)}$
Optimize model parameters $(\kappa, \beta, c, \theta)$ and hidden variables $p_{ji}$
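A minimal sketch of the intensity above, assuming each tweet is stored as a (time, number of followers) pair. The function names, the toy cascade and the parameter values are illustrative assumptions, not taken from the presentation.

```python
def kernel(tau, m, kappa, beta, c, theta):
    """Power-law kernel: kappa * m**beta * (tau + c)**(-(1 + theta))."""
    return kappa * m ** beta * (tau + c) ** (-(1.0 + theta))


def intensity(t, events, kappa, beta, c, theta):
    """lambda(t): sum of the kernels of all events that occurred strictly before t."""
    return sum(kernel(t - ti, mi, kappa, beta, c, theta)
               for ti, mi in events if ti < t)


# Hypothetical toy cascade: (time, number of followers) per tweet.
events = [(0.0, 1000.0), (1.0, 50.0), (2.5, 200.0)]
# Illustrative parameter values only.
params = {"kappa": 0.025, "beta": 0.51, "c": 0.001, "theta": 0.2}

rate_soon = intensity(3.0, events, **params)
rate_late = intensity(100.0, events, **params)
print(rate_soon, rate_late)  # the rate decays as the cascade ages
```

Each past tweet adds its own decaying contribution, scaled by the follower count raised to the power beta, which is how the self-exciting assumption enters the model.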

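Optimization by Expectation Maximization Algorithm
Expectation Maximization (EM) Algorithm (H.EM):
1. An iterative algorithm
2. Alternates between E step and M step
E step, $\{p_{ji}\} \leftarrow (\kappa, \beta, c, \theta)$:
$p_{ji} = \frac{\phi_{m_i}(t_j - t_i)}{\lambda(t_j)}, \quad i = 1, 2, \ldots, j-1, \quad j = 2, \ldots, n$
M step, $(\kappa_{old}, \beta_{old}, c_{old}, \theta_{old}, \{p_{ji}\}) \rightarrow (\kappa, \beta, c, \theta)$:
$(\kappa, \beta, c, \theta) = \arg\max_{(\kappa,\beta,c,\theta)} \sum_{j=2}^{n} \sum_{i=1}^{j-1} p_{ji} \log \phi_{m_i}(t_j - t_i) - \int_{t_1}^{t_n} \lambda(t)\,dt$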
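One EM iteration can be sketched as follows, taking $p_{ji}$ as the probability that retweet $j$ was caused by an earlier tweet $i$. This is a sketch under assumptions: events are (time, followers) pairs sorted by strictly increasing time, indices are 0-based, and the function names are hypothetical. The integral term uses the closed form obtained by integrating the power-law kernel, which requires theta > 0. The maximization itself is left to a numerical optimizer of the reader's choice.

```python
import math


def kernel(tau, m, kappa, beta, c, theta):
    """Power-law kernel: kappa * m**beta * (tau + c)**(-(1 + theta))."""
    return kappa * m ** beta * (tau + c) ** (-(1.0 + theta))


def e_step(events, kappa, beta, c, theta):
    """p[j][i] = phi_{m_i}(t_j - t_i) / lambda(t_j) for every earlier event i < j."""
    p = []
    for j, (tj, _) in enumerate(events):
        weights = [kernel(tj - ti, mi, kappa, beta, c, theta)
                   for ti, mi in events[:j]]
        total = sum(weights)  # this is lambda(t_j)
        p.append([w / total for w in weights])  # empty list for the root tweet
    return p


def m_step_objective(p, events, kappa, beta, c, theta):
    """Expected log-likelihood that the M step maximizes, with p held fixed."""
    t_end = events[-1][0]
    q = 0.0
    for j, (tj, _) in enumerate(events):
        for i, (ti, mi) in enumerate(events[:j]):
            q += p[j][i] * math.log(kernel(tj - ti, mi, kappa, beta, c, theta))
    # Closed-form integral of each event's kernel from t_i to t_end.
    for ti, mi in events:
        q -= kappa * mi ** beta * (c ** -theta - (t_end - ti + c) ** -theta) / theta
    return q


# Hypothetical toy cascade and illustrative parameter values.
events = [(0.0, 1000.0), (1.0, 50.0), (2.5, 200.0)]
params = {"kappa": 0.025, "beta": 0.51, "c": 0.001, "theta": 0.2}
p = e_step(events, **params)
q = m_step_objective(p, events, **params)
```

Each row of `p` sums to one because the denominator is the total intensity at the retweet time, so every retweet is fully attributed to its earlier candidates.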

Constructing the Twitter Dataset
Source: Sydney Morning Herald (start: 14th Feb)
Diagram components: Twitter Crawler, Twitter API, Retweet Cascades, Twitter Users in Cascades, Friend Networks (labelled "Simultaneously")

Constructing the Twitter Dataset
Statistics on Current Data
Item                                                   Quantity
Cascades                                               68040
Tweets in cascades                                     259186
Users in cascades                                      61174
Cascades with more than 50 retweets ($C_{50}$)         274
Users in $C_{50}$                                      16125
Tweets in $C_{50}$                                     33539
Downloaded friends of users in $C_{50}$                16051

Evaluation and Results
Calculate optimal parameters on synthetic data
Baseline: maximizing observed log-likelihood (MLL) of the same Point Process Models [Mishra et al CIKM’16]
$(\kappa, \beta, c, \theta) = \arg\max_{(\kappa,\beta,c,\theta)} \sum_{i=2}^{n} \log \lambda(t_i) - \int_{t_1}^{t_n} \lambda(t)\,dt$
Data: 10 cascades (20 experiments with different initial parameters on each cascade)
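The observed log-likelihood used by the MLL baseline can be evaluated directly for the power-law kernel. The sketch below is an assumption-laden illustration: events are (time, followers) pairs sorted by strictly increasing time, the function names are hypothetical, and the integral is computed with the closed form of the kernel's antiderivative (valid for theta > 0).

```python
import math


def kernel(tau, m, kappa, beta, c, theta):
    """Power-law kernel: kappa * m**beta * (tau + c)**(-(1 + theta))."""
    return kappa * m ** beta * (tau + c) ** (-(1.0 + theta))


def log_likelihood(events, kappa, beta, c, theta):
    """Sum of log lambda(t_j) over retweets minus the integral of lambda over [t_1, t_n]."""
    t_end = events[-1][0]
    ll = 0.0
    # Log-intensity at every retweet (the root tweet, index 0, is skipped).
    for j in range(1, len(events)):
        tj = events[j][0]
        lam = sum(kernel(tj - ti, mi, kappa, beta, c, theta)
                  for ti, mi in events[:j])
        ll += math.log(lam)
    # Integral of lambda: each event contributes the integral of its own kernel.
    for ti, mi in events:
        ll -= kappa * mi ** beta * (c ** -theta - (t_end - ti + c) ** -theta) / theta
    return ll


# Tiny hand-checkable case: two events, all parameters set to 1.
ll = log_likelihood([(0.0, 1.0), (1.0, 1.0)], kappa=1.0, beta=1.0, c=1.0, theta=1.0)
print(ll)
```

In the two-event case the intensity at the retweet is 0.25 and the integral is 0.5, so the result is log(0.25) - 0.5. A numerical optimizer would maximize this function over the four parameters.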

Evaluation and Results
Calculate optimal parameters on synthetic data
Figure: estimated parameters for H.EM and MLL in four panels (visible y-axis ticks in brackets):
$c$ (optimal: 0.001) [0.001620, 0.001635]
$\theta$ (optimal: 0.2) [0.29, 0.31]
$\kappa$ (optimal: 0.025) [0.0174, 0.0178, 0.0182]
$\beta$ (optimal: 0.51) [0.60, 0.64]

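Performance Measures
Figure: candidate parents $u_1$ to $u_4$ of the retweet by $u_5$ on a time axis; their probabilities $p_{51}$, $p_{52}$, $p_{53}$, $p_{54}$ are placed on a probability axis from 0 to 1 and labelled True or False; a Friend Networks label also appears.
ROC curve: Area Under Curve (AUC)
Accuracy: the candidate with the highest probability is taken as an edge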
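Both measures can be computed from a table of link probabilities once ground-truth parents are known. The sketch below makes several assumptions that the slides do not confirm: `p[j][i]` holds the probability that retweet `j` was caused by earlier tweet `i` (0-based), `parents[j]` is the true parent index, all candidate pairs of a cascade are pooled into a single ROC curve, and ties count as half a win. The example values are made up.

```python
def accuracy(p, parents):
    """Fraction of retweets whose highest-probability candidate is the true parent."""
    hits = 0
    for j in range(1, len(p)):
        best = max(range(len(p[j])), key=p[j].__getitem__)
        hits += best == parents[j]
    return hits / (len(p) - 1)


def auc(p, parents):
    """Area under the ROC curve via pairwise comparison of true and false links."""
    pos, neg = [], []
    for j in range(1, len(p)):
        for i, score in enumerate(p[j]):
            (pos if i == parents[j] else neg).append(score)
    # Probability that a true link outscores a false link (ties count 0.5).
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical probabilities for a 4-tweet cascade and its true parents.
p = [[], [1.0], [0.3, 0.7], [0.6, 0.3, 0.1]]
parents = [None, 0, 1, 1]
acc = accuracy(p, parents)
area = auc(p, parents)
print(acc, area)
```

Accuracy only looks at the top-ranked candidate, while AUC rewards placing the true parent above false candidates even when it is not ranked first.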

Evaluation and Results
Compare with Seven Methods on Real Data
Baselines and descriptions:
H.MLL_PL (power-law kernel) and H.MLL_EXP (exponential kernel): infer $p_{ji}$ after optimizing the log-likelihood (H.EM infers it during optimization); do not need training
Exponential distribution (E), Power-law distribution (PL), Rayleigh distribution (R) and Social Exponential (SE): directly calculate probabilities of links without iterations; need training
E: $p_{ji} = (\alpha - 1) e^{-\alpha (t_j - t_i)}$
PL: $p_{ji} = (\alpha - 1) (t_j - t_i)^{-\alpha}$
R: $p_{ji} = \alpha (t_j - t_i) e^{-0.5 \alpha (t_j - t_i)^2}$
SE: $p_{ji} = \frac{m_i}{\sum_{k=1}^{i} m_k} e^{-\alpha (t_j - t_i)}$
NETINF: select edges increasing the log-likelihood most significantly; needs training
274 cascades: 254 for testing and 20 for training E, PL, R, SE (mean AUC) and NETINF (mean Accuracy)
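The three time-only baselines (E, PL, R) score a candidate link from the time gap alone. The following sketch assumes strictly increasing tweet times and a value of alpha already chosen on the training cascades; the function names and the toy times are hypothetical.

```python
import math


def exponential(dt, alpha):
    """E baseline: (alpha - 1) * exp(-alpha * dt)."""
    return (alpha - 1.0) * math.exp(-alpha * dt)


def power_law(dt, alpha):
    """PL baseline: (alpha - 1) * dt**(-alpha); requires dt > 0."""
    return (alpha - 1.0) * dt ** (-alpha)


def rayleigh(dt, alpha):
    """R baseline: alpha * dt * exp(-0.5 * alpha * dt**2)."""
    return alpha * dt * math.exp(-0.5 * alpha * dt ** 2)


def best_parent(times, j, score, alpha):
    """Index of the earlier tweet with the highest score for retweet j (0-based)."""
    return max(range(j), key=lambda i: score(times[j] - times[i], alpha))


# Hypothetical tweet times for one small cascade.
times = [0.0, 1.0, 1.1]
print(best_parent(times, 2, exponential, 2.0))  # most recent tweet scores highest
print(best_parent(times, 2, rayleigh, 1.0))     # very short gaps are penalised
```

With these times the exponential score favours the most recent tweet, while the Rayleigh score is close to zero for the 0.1 gap and prefers the older tweet, which shows how the choice of distribution changes the inferred structure.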

Evaluation and Results
Compare with Seven Methods on Real Data
Method       Mean AUC
H.EM         0.832
SE           0.872
H.MLL_PL     0.83
EXP          0.726
H.MLL_EXP    0.726
PL           0.714
R            0.728
NETINF       NA
Figure: Compare H.EM with baselines (AUC); y-axis: AUC; x-axis: H.EM, SE, H.MLL_PL, EXP, H.MLL_EXP, PL, R.

Evaluation and Results
Compare with Seven Methods on Real Data
Method       Mean Accuracy
H.EM         0.506
SE           0.556
H.MLL_PL     0.468
EXP          0.185
H.MLL_EXP    0.187
PL           0.186
R            0.567
NETINF       0.249
Figure: Compare H.EM with baselines (Accuracy); y-axis: Accuracy; x-axis: H.EM, SE, H.MLL_PL, EXP, H.MLL_EXP, PL, R, NETINF.
1. Our method does not need training
2. Inferring $p_{ji}$ during optimization improves performance

Summary
• Modeling by Hawkes Point Processes with Power-law Kernel
• Branching structure of Hawkes used to retrieve the parenthood relation between retweets
• Inferring $p_{ji}$ during optimization is important
• Applied to retrieving the true retweet relations in Twitter cascades
The Way Ahead
• Experiments on more cascades with different themes
• Try more competitive triggering kernels
Thank You!

References
• Gomez Rodriguez, M., Leskovec, J., & Krause, A. (2010). Inferring networks of diffusion and influence. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 1019-1028). ACM.
• Mishra, S., Rizoiu, M.-A., & Xie, L. (2016). Feature driven and point process approaches for popularity prediction. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management (pp. 1069-1078). ACM.