Post on 16-Dec-2015
StaticGreedy: Solving the Scalability-Accuracy Dilemma in Influence Maximization
Suqi Cheng
Research Center of Web Data Sciences & Engineering
Institute of Computing Technology, Chinese Academy of Sciences
chengsuqi@ict.ac.cn, chengsuqi@gmail.com
http://www.nascgroup.org/~chengsuqi
Authors: Suqi Cheng, Huawei Shen, Junming Huang, Guoqing Zhang, Xueqi Cheng
3
Information Cascade
• An action or idea is adopted by one person after another due to social influence
  – it cascades through social relationships
• Main applications
  – Word-of-mouth marketing
  – Outbreak detection
  – Popularity prediction
social network
4
Word-of-Mouth Marketing
• To promote a product by seeding a few users; users adopting the product will recommend it
• Advantages: efficient; cost-effective
[Figure: the company offers a free product or discount to seed users, who then influence follow-up activated users]
How to select the optimal seed users?
5
Influence Maximization for Viral Marketing
• Objective function
  – Influence spread I(S): expected number of activated (influenced/adopted) nodes
  – Maximize I(S)
• Input:
  – A social influence graph G=(V, E)
  – An information cascade model
  – An integer k, |S| ≤ k
• Output: A seed set S
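The input and output above can be condensed into a single optimization problem. The symbol $S^{*}$ for the optimal seed set is notation introduced here for illustration; it does not appear on the slide.

```latex
S^{*} = \arg\max_{S \subseteq V,\; |S| \le k} I(S)
```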
6
Information Cascade Model
• Independent cascade (IC) model
  – each edge (u, v) has a propagation probability p(u, v)
  – each newly activated node u independently activates its out-neighbor v with probability p(u, v)
  – a discrete-time model
• Influence spread estimation on the IC model
  – Monte Carlo simulation
  – Heuristic methods
[Figure: social influence graph with a propagation probability on each edge]
[Leskovec, 2008]
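A minimal sketch of Monte Carlo estimation under the IC model described on this slide. The adjacency-list layout, the function names `simulate_ic` and `estimate_spread`, and the toy graph are assumptions made for illustration, not code from the authors.

```python
import random

def simulate_ic(graph, seeds, rng):
    """Run one independent-cascade simulation; return the number of activated nodes.

    graph maps each node u to a list of (v, p) pairs, where p plays the role of p(u, v).
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:  # one loop iteration per discrete time step
        next_frontier = []
        for u in frontier:
            for v, p in graph.get(u, []):
                # A newly activated u gets exactly one chance to activate v.
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(active)

def estimate_spread(graph, seeds, runs=1000, seed=0):
    """Average the cascade size over `runs` independent simulations."""
    rng = random.Random(seed)
    return sum(simulate_ic(graph, seeds, rng) for _ in range(runs)) / runs

# Hypothetical three-node graph, only for demonstration.
toy_graph = {"a": [("b", 0.5), ("c", 0.1)], "b": [("c", 0.4)], "c": []}
print(estimate_spread(toy_graph, ["a"]))
```

Each call to `estimate_spread` evaluates a single seed set, which is why simulation-based greedy selection needs many such calls.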
7
Difficulties in Influence Maximization
Difficulty 1: The influence maximization problem is NP-hard. [Kempe, KDD’03]
Existing solutions
• Greedy approximation algorithm [Kempe, KDD’03]: accurate but inefficient
  – iteratively selects the node with the largest marginal influence spread
  – (1-1/e-ε)-approximation, guaranteed by the submodularity and monotonicity of the influence spread function
• Heuristics (Degree, PageRank, Betweenness): efficient but inaccurate
8
Difficulties in Influence Maximization
Difficulty 2: Computing the influence spread exactly is #P-hard. [Chen, KDD’10]
Existing solutions
• Heuristic methods: efficient but inaccurate
  – DegreeDiscount [Chen, KDD’09]
  – CGA [Wang, KDD’10]
  – PMIA [Chen, KDD’10]
  – IRIE [Jung, ICDM’12]
• Monte Carlo simulation: accurate but time-consuming
  – CELF optimization [Leskovec, KDD’07]
  – NewGreedy [Chen, KDD’09]
  – CELF++ optimization [Goyal, WWW’11]
A scalability-accuracy dilemma!
9
Our work
• Objective: to propose an influence maximization algorithm that solves the scalability-accuracy dilemma

Category                Algorithm                               Accuracy        Scalability
Approximate algorithms  Greedy [Kempe, KDD’03]                  guaranteed      low
                        GreedyCELF [Leskovec, KDD’07]           guaranteed      low
                        GreedyCELF++ [Goyal, WWW’11]            guaranteed      low
                        NewGreedy/MixedGreedy [Chen, KDD’09]    guaranteed      low
                        StaticGreedy [Cheng, CIKM’13]           guaranteed      high
Heuristics              Degree                                  not guaranteed  high
                        PageRank [Page, 1999]                   not guaranteed  high
                        DegreeDiscount [Chen, KDD’09]           not guaranteed  high
                        PMIA [Chen, KDD’10]                     not guaranteed  high
                        IRIE [Jung, ICDM’12]                    not guaranteed  high
                        SP1M [Kimura, PKDD’06]                  not guaranteed  relatively low
10
Preliminaries-1
• Social influence graph: G=(V, E), n=|V|, m=|E|
• Influence spread: I(S)
• Marginal influence spread: $M(v|S) = I(S \cup \{v\}) - I(S)$
• Greedy approximation algorithm
  – iteratively selects the node with the largest marginal influence spread
  – provides a (1-1/e-ε)-approximation
  – relies on influence spread estimation
• Properties of I(S) under the independent cascade model, which guarantee the approximation ratio
  – submodularity: $I(S \cup \{v\}) - I(S) \ge I(T \cup \{v\}) - I(T)$ for all $v \in V$ and $S \subseteq T \subseteq V$
  – monotonicity: $I(S \cup \{v\}) \ge I(S)$
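The greedy selection rule can be sketched with any spread oracle. In the sketch below, `greedy_select` and the `coverage` function are hypothetical names; `coverage` is a deterministic stand-in for I(S), chosen because a coverage function is monotone and submodular, so the example needs no random simulation.

```python
def greedy_select(nodes, k, spread):
    """Pick up to k nodes, each time adding the node with the largest marginal gain."""
    seeds = []
    for _ in range(k):
        base = spread(seeds)
        best, best_gain = None, float("-inf")
        for v in nodes:
            if v in seeds:
                continue
            gain = spread(seeds + [v]) - base  # M(v|S) = I(S ∪ {v}) - I(S)
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:
            break
        seeds.append(best)
    return seeds

# Hypothetical stand-in for I(S): the number of items covered by the chosen sets.
reach = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}

def coverage(seeds):
    covered = set()
    for s in seeds:
        covered |= reach[s]
    return len(covered)

print(greedy_select(list(reach), 2, coverage))  # ['c', 'a']
```

Here "c" is chosen first (gain 4), then "a" (gain 3), since "b" and "d" each add only one new item.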
11
Preliminaries-2
• Monte Carlo simulation for influence spread estimation
  – approximates the true value of the influence spread by averaging over realizations
• Two equivalent methods
  – simulation: each instance models the information cascade process; advantage: relatively low time complexity; disadvantage: estimates one seed set at a time
  – snapshot [Chen, KDD’09]: each instance removes each edge (u, v) from G with probability 1-p(u, v); advantage: can estimate any seed set simultaneously; disadvantage: relatively high time complexity
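A minimal sketch of the snapshot method, assuming edges are given as hypothetical (u, v, p) triples. Each snapshot keeps an edge with probability p(u, v), and the spread of a seed set in a snapshot is the number of nodes reachable from it. The function names are illustrative.

```python
import random
from collections import deque

def sample_snapshot(edges, rng):
    """Keep each edge (u, v, p) with probability p, i.e. remove it with probability 1 - p."""
    snapshot = {}
    for u, v, p in edges:
        if rng.random() < p:
            snapshot.setdefault(u, []).append(v)
    return snapshot

def reachable(snapshot, seeds):
    """Breadth-first search from all seeds at once."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        for v in snapshot.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def snapshot_spread(snapshots, seeds):
    """Average number of reachable nodes over the given snapshots."""
    return sum(len(reachable(g, seeds)) for g in snapshots) / len(snapshots)

rng = random.Random(0)
edges = [("a", "b", 0.5), ("b", "c", 0.4), ("a", "c", 0.1)]  # toy data
snapshots = [sample_snapshot(edges, rng) for _ in range(200)]
for seeds in (["a"], ["b"], ["a", "b"]):
    print(seeds, snapshot_spread(snapshots, seeds))
```

The same list of snapshots serves every seed set in the loop, which is the "can estimate any seed set simultaneously" advantage noted above.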
12
Motivation
• In existing greedy algorithms
  – there is a risk that the estimated influence spread function is neither submodular nor monotone
  – this is caused by using different Monte Carlo simulation results across different influence spread estimations
  – a very large value of R is therefore required, e.g. R=20000 (R: number of Monte Carlo simulations per estimation)
• Example: iteration 1 is estimated on snapshot 1 and iteration 2 on snapshot 2 of the same influence graph
  $I(S_0 \cup \{v_4\}) - I(S_0) = I(\{v_4\}) - I(\emptyset) = 1$
  $I(S_1 \cup \{v_4\}) - I(S_1) = I(\{v_2, v_4\}) - I(\{v_2\}) = 3$
  The marginal gain of v4 grows from 1 to 3 as the seed set grows: submodularity is broken!
13
StaticGreedy algorithm
• Core idea: always use the same snapshots for influence spread estimation
  – the influence spread function is submodular and monotone
  – a small value of R is required, e.g. R=100
Part 1: Generate R static snapshots
Part 2: Greedy selection
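The two parts can be sketched as follows. This is an illustrative reading of the slide, not the authors' implementation: the function names, the (u, v, p) edge triples, and the choice to cache every reachable set R(v) per snapshot are all assumptions, and the caching is only practical for small graphs.

```python
import random
from collections import deque

def sample_snapshot(edges, rng):
    """Keep each edge (u, v, p) with probability p."""
    snapshot = {}
    for u, v, p in edges:
        if rng.random() < p:
            snapshot.setdefault(u, []).append(v)
    return snapshot

def reach_set(snapshot, start):
    """Nodes reachable from start in one snapshot (including start)."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in snapshot.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def static_greedy(nodes, edges, k, R=100, seed=0):
    rng = random.Random(seed)
    # Part 1: generate R static snapshots once and cache R(v) in each of them.
    reach = []
    for _ in range(R):
        snap = sample_snapshot(edges, rng)
        reach.append({v: reach_set(snap, v) for v in nodes})
    # Part 2: greedy selection, always evaluated on the same R snapshots.
    seeds = []
    covered = [set() for _ in range(R)]
    for _ in range(k):
        best, best_gain = None, -1.0
        for v in nodes:
            if v in seeds:
                continue
            # Marginal gain: newly reached nodes, averaged over the snapshots.
            gain = sum(len(reach[i][v] - covered[i]) for i in range(R)) / R
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:
            break
        seeds.append(best)
        for i in range(R):
            covered[i] |= reach[i][best]
    return seeds

nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b", 1.0), ("a", "c", 1.0), ("d", "e", 1.0)]  # toy data
print(static_greedy(nodes, edges, k=2, R=5))
```

Because every marginal gain is measured on the same fixed snapshots, the estimated spread is a coverage function, which is monotone and submodular by construction.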
14
Performance analysis: Convergence rate
• provide (1-1/e-ε)-approximation with a small value of R
$d_{R,k} = \frac{I(S_k^*) - I(S_{R,k})}{I(S_k^*)}$
[Figure: $d_{R,k}$ versus $\log R$, seed set size = 50]
NetHEPT: a benchmark network
uniform independent cascade (UIC) model: p(u, v) = p = 0.01
weighted independent cascade (WIC) model: p(u, v) = 1/(# of in-neighbors of v)
15
Performance analysis: Scalability
$R_{\min} = \min\{R \mid d_{R,k} \le 0.005\}$
[Figure, panel "Minimal R required": $\log R_{\min}$ versus seed set size; R is significantly reduced]
[Figure, panel "Running time": log running time (sec) versus seed set size; running time is significantly reduced]
Reductions marked on the plots: ≈10^3 times and ≈10^2 times
16
Performance analysis: Complexity
$R' \approx 10^{-2} R$
$m' = \sum_{(u,v) \in E} p_{u,v} \le m$
n: number of nodes in social influence graph
m: number of edges in social influence graph
m’: expected number of edges in a snapshot
17
Speed up StaticGreedy
• A dynamic update strategy
  – calculates the marginal gain in an efficient incremental manner
  – at each step t, for each snapshot: $M(v) \leftarrow M(v) - |R(v) \cap R(v_t^*)|$, $R(v) \leftarrow R(v) - (R(v) \cap R(v_t^*))$
  – trades space for time
[Figure: initial snapshot over nodes v1 to v8]
Initial marginal gains: M(v1)=4, M(v2)=3, M(v3)=2, M(v4)=1, M(v5)=1, M(v6)=1, M(v7)=2, M(v8)=1
R(v): reachable nodes from v in the snapshot
18
Speed up StaticGreedy
• Example: after selecting v* = v1, the marginal gains are updated directly in the snapshot
  – before: M(v1)=4, M(v2)=3, M(v3)=2, M(v4)=1, M(v5)=1, M(v6)=1, M(v7)=2, M(v8)=1
  – after: M(v1)=0, M(v2)=2, M(v3)=0, M(v4)=0, M(v5)=1, M(v6)=0, M(v7)=2, M(v8)=1
  – decrements: v1 -4, v2 -1, v3 -2, v4 -1, v6 -1
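The update rule can be checked against the M(v) values listed on the slide. The edges of the slide's snapshot are not recoverable from the text, so the edge list below is a hypothetical one chosen only because its reachable-set sizes match the listed initial values; `select_and_update` is likewise an illustrative name.

```python
from collections import deque

# Hypothetical snapshot edges, chosen so that |R(v)| matches the slide's initial M(v).
snapshot = {"v1": ["v3", "v4"], "v2": ["v4", "v5"], "v3": ["v6"], "v7": ["v8"]}
nodes = [f"v{i}" for i in range(1, 9)]

def reach_set(graph, start):
    """Nodes reachable from start in the snapshot (including start)."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

R = {v: reach_set(snapshot, v) for v in nodes}
M = {v: len(R[v]) for v in nodes}

def select_and_update(R, M, chosen):
    """Apply M(v) <- M(v) - |R(v) ∩ R(v*)| and R(v) <- R(v) - (R(v) ∩ R(v*))."""
    newly_covered = set(R[chosen])  # copy, because R[chosen] itself is updated below
    for v in R:
        overlap = R[v] & newly_covered
        M[v] -= len(overlap)
        R[v] -= overlap

print(M)
select_and_update(R, M, "v1")
print(M)
```

Storing R(v) for every node is the space cost of the strategy; in exchange, each later marginal gain is read directly from M instead of being recomputed by a graph traversal.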
19
Experiments: setup
• Algorithms:
  – Our algorithms: StaticGreedyCELF, StaticGreedyDU
  – Baselines: CELFGreedy, SP1M, PMIA, Degree, DegreeDiscount
• Tested datasets
• Independent cascade models
  – uniform independent cascade (UIC) model: p(u, v) = p = 0.01
  – weighted independent cascade (WIC) model: p(u, v) = 1/(# of in-neighbors of v)
• Metrics: influence spread, running time
20
Experiments: influence spread
• StaticGreedy achieves a higher influence spread than the heuristic baselines
[Figure: influence spread on NetPHY and DBLP, each under the UIC model and the WIC model]
21
Experiments: running time
• StaticGreedy runs >10^3 times faster than CELFGreedy
• StaticGreedy has comparable scalability to state-of-the-art heuristics
• StaticGreedyDU always runs faster than StaticGreedyCELF
[Figure: log running time (sec) under the UIC model and the WIC model]
22
Conclusion
• Essential reason for the inefficiency of existing greedy algorithms
  – a risk of unguaranteed submodularity and monotonicity
  – caused by different Monte Carlo simulations across different estimations
  – a very large value of R is required → guaranteed accuracy + inefficiency
• StaticGreedy algorithm
  – guaranteed submodularity and monotonicity
  – uses the same Monte Carlo simulations across different estimations
  – a small value of R is required → guaranteed accuracy + high scalability
  – runs >10^3 times faster than conventional greedy algorithms
• A dynamic update strategy to speed up StaticGreedy
  – about 10 times faster