John C.S. Lui
Computer Science & Engineering Dept.
The Chinese University of Hong Kong
EXPLORING LARGE GRAPHS: FROM RANDOM WALK TO CYBER-INSURANCE
Motivation
sample of Twitter network
measure characteristics of networks in the wild
measurement distortions
“World Map” in 1459: the Fra Mauro world map (source: Wikipedia)
◦ proved incomplete (Columbus et al., 1492; Australia, 17th century)
◦ wrong proportions (Africa & Asia)
Outline
methods to sample graphs (e.g., online social networks)
◦ uniform vertex sampling vs. uniform edge sampling
◦ random walks
◦ Frontier sampling random walk
results
Sampling graphs
random sampling (uniform & independent)
◦ vertex sampling
◦ edge sampling
crawling
◦ BFS sampling
◦ random walk sampling
Independent sampling
uniform vertex sampling
◦ θi: fraction of vertices with degree i
◦ vertex with degree i is sampled with probability θi
uniform edge sampling
◦ πi: probability that a vertex with degree i is sampled
◦ πi = θi × i / <average degree>
estimating θi from πi (uniform edge): trivial to remove bias
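The bias removal mentioned above amounts to dividing πi by i and renormalizing. The following is a minimal sketch; the function names and the example degree distribution are hypothetical, and it assumes every vertex has degree at least 1.

```python
from fractions import Fraction


def edge_sampling_distribution(theta):
    """pi_i = theta_i * i / <average degree>; theta maps degree -> fraction."""
    avg_degree = sum(i * t for i, t in theta.items())
    return {i: t * i / avg_degree for i, t in theta.items()}


def remove_bias(pi):
    """Recover theta from pi: theta_i is proportional to pi_i / i."""
    unnormalized = {i: p / i for i, p in pi.items()}
    total = sum(unnormalized.values())
    return {i: w / total for i, w in unnormalized.items()}


# Hypothetical distribution: 3/4 of the vertices have degree 1, 1/4 have degree 3.
theta = {1: Fraction(3, 4), 3: Fraction(1, 4)}
pi = edge_sampling_distribution(theta)
recovered = remove_bias(pi)
```

In this example edge sampling sees both degrees equally often, although degree-1 vertices are three times as common; dividing by the degree undoes that.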
Random sampling: accuracy of estimates
estimate: θi, the fraction of vertices with degree i
budget: B samples
accuracy metric: Normalized root Mean Squared Error
methods compared: uniform vertex, uniform edge
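As an illustration of the accuracy metric, the sketch below computes the normalized root mean squared error, sqrt(E[(estimate − θi)²]) / θi, for uniform vertex sampling. With B independent samples the estimate is a binomial proportion, so the metric equals sqrt((1 − θi) / (B θi)). This is the standard binomial result and not necessarily the expression shown on the slide; all names and parameter values are assumptions.

```python
import math
import random


def nrmse(estimates, truth):
    """Normalized root mean squared error of a list of estimates."""
    mse = sum((e - truth) ** 2 for e in estimates) / len(estimates)
    return math.sqrt(mse) / truth


def nrmse_uniform_vertex(theta_i, budget):
    """Closed form for B independent uniform vertex samples (binomial proportion)."""
    return math.sqrt((1.0 - theta_i) / (budget * theta_i))


rng = random.Random(0)
theta_i, budget, runs = 0.2, 100, 2000  # hypothetical values

# Each run draws `budget` vertices and records the fraction that has degree i.
estimates = [
    sum(rng.random() < theta_i for _ in range(budget)) / budget
    for _ in range(runs)
]
empirical = nrmse(estimates, theta_i)
analytic = nrmse_uniform_vertex(theta_i, budget)
```

The closed form shows why uniform vertex sampling struggles in the tail: when θi is small, the error grows like 1 / sqrt(B θi).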
Independent sampling: uniform vertex vs. uniform edge
Flickr graph (1.7M vertices, 22M edges)
sampling budget: B = |V|/100 samples
[Plot: x-axis = vertex degree, with the avg. degree marked; curves: uniform edge, uniform vertex]
◦ uniform vertex: head GOOD, tail BAD
◦ uniform edge: head BAD, tail GOOD
pros & cons
uniform vertex
pros:
◦ independent sampling
◦ OSN needs numeric user IDs, e.g., Livejournal, Flickr, MySpace, Facebook, ...
cons:
◦ resource intensive (sparse user ID space)
◦ difficult to sample large degree vertices
uniform edge
pros:
◦ independent sampling
◦ easy to sample large degree vertices
cons:
◦ no public OSN interface to sample edges
◦ difficult to sample small degree vertices
random walk (RW) [crawling]
start at v
randomly select a neighbor of v, move there, and repeat until B samples are collected
◦ vertices can be sampled multiple times
◦ often (resource-wise) cheaper than uniform vertex sampling
◦ graph should be connected
multiple RWs: m independent walkers, each capturing B/m samples
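The walk described above can be sketched as follows. The adjacency-list representation, the function names, and the small example graph are assumptions made for illustration.

```python
import random


def random_walk(adj, start, budget, rng):
    """Collect `budget` samples by repeatedly moving to a uniformly chosen neighbor."""
    samples = []
    v = start
    for _ in range(budget):
        v = rng.choice(adj[v])  # vertices may be visited (sampled) many times
        samples.append(v)
    return samples


def multiple_random_walks(adj, starts, budget, rng):
    """m independent walkers, each collecting budget // m samples."""
    per_walker = budget // len(starts)
    return [random_walk(adj, s, per_walker, rng) for s in starts]


# Hypothetical connected graph: vertex -> list of neighbors.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
rng = random.Random(1)
walk = random_walk(adj, "a", 20, rng)
walks = multiple_random_walks(adj, ["a", "d"], 20, rng)
```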
RW degree distribution estimation
θi: fraction of vertices with degree i
P[sampled degree = i] → πi
in steady state the RW samples edges uniformly (only if the graph is connected)
RW = uniform edge sampling without independence
[Plot: RW sampling. CCDF P[X > x] vs. x = degree (log-scale); curves: πi (distribution observed by RW) and θi (true distribution)]
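Because a stationary RW observes degree i with probability πi, which is proportional to i × θi, weighting each sample by 1/degree gives an estimate of θi. The sketch below shows this reweighting; the function name is hypothetical, and the example uses a star graph with three leaves, where the walk alternates between the hub (degree 3) and a leaf (degree 1).

```python
from collections import defaultdict


def estimate_degree_distribution(sampled_degrees):
    """Estimate theta_i from RW samples by weighting each sample with 1/degree."""
    weights = defaultdict(float)
    for d in sampled_degrees:
        weights[d] += 1.0 / d
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}


# Degrees seen by a walk on a star with three leaves: hub, leaf, hub, leaf, ...
sampled = [3, 1] * 50
theta_hat = estimate_degree_distribution(sampled)
```

The walk sees degree 3 in half of its samples, yet the reweighted estimate returns the true fractions: 3/4 of the vertices have degree 1 and 1/4 have degree 3.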
pros & cons
uniform vertex
pros:
◦ independent sampling
◦ supported by OSNs with numeric user IDs: Livejournal, Flickr, MySpace, Facebook, ...
cons:
◦ resource intensive (sparse user ID space)
◦ difficult to sample large degree vertices
RW
pros:
◦ easy to sample high degree vertices
◦ resource-wise cheap
cons:
◦ graph must be connected
◦ large estimation errors when the graph is loosely connected
◦ should start in steady state (discard transient samples, but the transient length is unknown)
Hybrid sampling?
[Figure: a graph with two subgraphs, A and B]
uniform vertex samples both A and B subgraphs, but is expensive
RW samples either A or B, but is cheap
design a RW that in steady state samples edges uniformly (importance sampling)
&
initialize steady state w/ uniform vertex sampling
puzzle: in steady state we want to sample vertices proportionally to degree, yet we want to start with uniformly sampled vertices
Need to think in multiple dimensions (multiple walkers)
Frontier Sampling (FS): multiple dependent walkers
B: sampling budget
Let S = {v1, v2, … , vm} be a set of m vertices
(1) select vr ∈ S w.p. proportional to deg(vr)
(2) walk one step from vr to a uniformly chosen neighbor
(3) add the walked edge to E’ and replace vr in S with that neighbor
(4) return to (1) (until m + |E’| = B)
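Steps (1) to (4) can be sketched as below. The adjacency-list representation, the function name, and the example graph are assumptions; the m initial vertices are drawn uniformly at random, in line with the idea of initializing with uniform vertex sampling.

```python
import random


def frontier_sampling(adj, initial_vertices, budget, rng):
    """Frontier Sampling sketch: m dependent walkers sharing one sampling budget."""
    frontier = list(initial_vertices)  # the set S of current walker positions
    sampled_edges = []                 # E'
    m = len(frontier)
    while m + len(sampled_edges) < budget:
        # (1) pick a walker with probability proportional to its vertex degree
        degrees = [len(adj[v]) for v in frontier]
        r = rng.choices(range(m), weights=degrees, k=1)[0]
        # (2) walk one step to a uniformly chosen neighbor
        u = frontier[r]
        v = rng.choice(adj[u])
        # (3) record the walked edge and move the walker
        sampled_edges.append((u, v))
        frontier[r] = v
    return sampled_edges


# Hypothetical graph: vertex -> list of neighbors.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
rng = random.Random(7)
m, budget = 3, 30
start = [rng.choice(list(adj)) for _ in range(m)]  # uniform vertex sampling
edges = frontier_sampling(adj, start, budget, rng)
```

The m initial vertices count against the budget, so the loop stops after B − m edges have been collected.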
FS: an m-dimensional RW
G^m = m-th Cartesian power of G
Frontier sampling: random walk on G^m
[Figure: a graph G with vertices u, j, k, and G^2 with vertices (u,u), (u,j), (u,k), (j,u), (j,j), (j,k), (k,u), (k,j), (k,k)]
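To make the Cartesian power concrete, the sketch below builds G^m for a small graph. A state of G^m is an m-tuple of vertices of G, and two states are adjacent when they differ in exactly one coordinate by an edge of G. The shape of G (a path u, j, k) is an assumption, since the edges of the figure are not recoverable here.

```python
from itertools import product


def cartesian_power(adj, m):
    """Adjacency lists of the m-th Cartesian power of the graph `adj`."""
    power = {}
    for state in product(adj, repeat=m):
        neighbors = []
        for r in range(m):
            for w in adj[state[r]]:
                # move only coordinate r along an edge of G
                neighbors.append(state[:r] + (w,) + state[r + 1:])
        power[state] = neighbors
    return power


adj = {"u": ["j"], "j": ["u", "k"], "k": ["j"]}  # assumed path u - j - k
g2 = cartesian_power(adj, 2)
```

A state (v1, ..., vm) has deg(v1) + ... + deg(vm) neighbors, so a plain RW step on G^m moves coordinate r with probability proportional to deg(vr) and then picks a uniform neighbor of vr, which is exactly one FS step.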
FS property
FS state at step k: Sk = (v1, v2, ... , vm)
FS state at step k+1: Sk+1 = (v1, u2, ... , vm)
◦ v2, u2 chosen proportional to their degrees
when in steady state: samples edges uniformly (like a RW)
m → ∞: the number of walkers at each v ∈ V is uniformly distributed (uniform vertex sampling, i.e., the uniform vertex distribution)
sample paths of θ1 estimates (Flickr graph)
Flickr: 1.7M vertices, 22M edges
Plot the evolution of the θ1 estimate as a function of n, where n = number of steps
4 sample paths = 4 curves
GAB graph
2 Albert-Barabasi graphs (5×10^5 vertices) w/ avg. deg. 2 and 10, connected by 1 edge
[Figure: subgraphs AB2 and AB10 joined by 1 edge]
From distribution to Cyber-Insurance
Outline
◦ Motivation
◦ Model of strategic investment
◦ Effect of cyber-insurance market on strategic investment behavior
◦ Performance Evaluation
◦ Summary & Lessons Learnt
Motivation
Technical measures of security are abundant
◦ Antivirus software, firewalls, intrusion detection, …
Ineffective
◦ New viruses, worms, or new forms of attack
◦ Carelessness or controllability of administrators, etc.
Loss due to lack of network security is still big!
◦ AT&T’s chief security officer: cyber-criminals’ annual profit exceeds $1 trillion, about 7% of US GDP.
◦ In 2009, total reported losses due to payment fraud in the US were $641 million.
Why?
Motivation
Virus spreading makes security interdependent
◦ The interactions of nodes form a graph
◦ The investment of nodes can influence each other
Example of virus spreading: invest in security or not?
First goal: model strategic security investment behavior
Motivation
Security risks cannot be completely eliminated through technical measures
Cyber-insurance, offered by companies (e.g., AIG), can be used to deal with the residual risk.
However, the cyber-insurance market is developing slowly (estimated at $450 million)
Second goal: study the influence of cyber-insurance on strategic security investment
◦ Understand what we need to (re-)engineer so as to activate this new business
Model
Model: epidemic model
Combines epidemic theory and game theory
◦ Epidemic model: spreading of virus
◦ Investment model: decision on security investment
Epidemic model:
◦ Use a graph to denote the interaction relationship
◦ State of node i: healthy or infected
◦ Each infected node contaminates its neighbors with some probability (bond percolation process on G)
◦ Initial state of node i: denotes whether the node is attacked
◦ The final state is given by a recursive equation
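One way to read the bond percolation process is: keep each edge of G independently with the contagion probability, then infect every node reachable from an initially attacked node over the kept edges. The sketch below follows that reading. The symbol q for the contagion probability, the function name, and the example graph are assumptions, and vertex labels must be orderable so that each undirected edge is drawn once.

```python
import random
from collections import deque


def final_infected(adj, attacked, q, rng):
    """Final infected set under bond percolation with contagion probability q."""
    # Keep each undirected edge independently with probability q.
    open_adj = {v: [] for v in adj}
    for u in adj:
        for v in adj[u]:
            if u < v and rng.random() < q:
                open_adj[u].append(v)
                open_adj[v].append(u)
    # Everything reachable from an attacked node over kept edges ends up infected.
    infected = set(attacked)
    queue = deque(attacked)
    while queue:
        u = queue.popleft()
        for v in open_adj[u]:
            if v not in infected:
                infected.add(v)
                queue.append(v)
    return infected


# Hypothetical graph: a path 0-1-2-3 and a separate pair 4-5.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
rng = random.Random(3)
everything = final_infected(adj, {0}, 1.0, rng)
nothing_spreads = final_infected(adj, {0}, 0.0, rng)
```

With q = 1 the whole component of the attacked node is infected; with q = 0 only the attacked node is.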
Model: investment model
Investment model (nodes are risk averse)
◦ Increasing, concave utility function
◦ Assumption: binary action (invest in security or not); each action has its own probability of being infected initially
◦ Utility of no investment and utility with investment depend on:
  - the infection probability, determined by the epidemic model (virus spreading process)
  - the loss of getting infected
  - the cost of the secure measure (with investment only)
Model: Bayesian network game
Bayesian network game (BNG):
◦ Practical situation: nodes have limited info on the graph
◦ Assumption: minimum common information, the degree distribution of nodes
◦ It defines a Bayesian network game
◦ The analysis of BNG is more tractable
Nodes are classified into types according to degree
◦ Loss distribution CDF
◦ Same cost of security investment
Analysis
Nodes will invest if and only if
Problem: how does node i estimate the quantities in this condition with incomplete graph info?
◦ By assuming the topology is a random graph with the given degree distribution
◦ Make use of the local mean field technique
Analysis
The structure of a random graph is locally tree-like
The prob. of a node with degree k getting infected is expressed through the prob. that a neighbor is infected
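A common form of the local mean field computation for bond percolation on a locally tree-like random graph is sketched below. It is a textbook-style version under stated assumptions and not necessarily the expression on the slide: p is the probability of being infected initially, q the contagion probability, and y the probability that a neighbor is infected through its other neighbors. The neighbor probability solves y = 1 − (1 − p) Σk (k fk / <k>) (1 − q y)^(k−1), and a degree-k node is infected with probability 1 − (1 − p)(1 − q y)^k.

```python
def neighbor_infection_prob(degree_dist, p, q, iterations=200):
    """Fixed-point iteration for y, the probability that a neighbor is infected."""
    avg_degree = sum(k * f for k, f in degree_dist.items())
    y = 0.0
    for _ in range(iterations):
        y = 1.0 - (1.0 - p) * sum(
            (k * f / avg_degree) * (1.0 - q * y) ** (k - 1)
            for k, f in degree_dist.items()
        )
    return y


def infection_prob(k, p, q, y):
    """Probability that a node with degree k ends up infected."""
    return 1.0 - (1.0 - p) * (1.0 - q * y) ** k


# Hypothetical degree distribution: half the nodes have degree 2, half degree 4.
degree_dist = {2: 0.5, 4: 0.5}
y = neighbor_infection_prob(degree_dist, p=0.1, q=0.5)
low = infection_prob(2, 0.1, 0.5, y)
high = infection_prob(4, 0.1, 0.5, y)
```

The neighbor's degree is weighted by k fk / <k> because a neighbor reached over an edge is degree-biased, the same effect as in uniform edge sampling.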
Analysis
Prob. reduction of getting infected by taking action
Consider the fraction of nodes with degree k taking action (the adoption fraction)
◦ the prob. that a neighbor is infected is a decreasing function of the adoption fraction
◦ as a result, the prob. reduction is an increasing function of the adoption fraction
This reflects the positive externality effect
◦ The value of action increases as other nodes take action
◦ Nodes with higher degree are more sensitive to the externality effect
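The externality can be illustrated with a small calculation. Assume a degree-k node is infected with probability 1 − (1 − p)(1 − q y)^k, where p is its probability of initial infection, q the contagion probability, and y the probability that a neighbor is infected. Taking action lowers p from p_no_action to p_action, so the reduction is (p_no_action − p_action)(1 − q y)^k. This form and all names are assumptions, not the slide's own formula.

```python
def risk_reduction(k, p_no_action, p_action, q, y):
    """Drop in infection probability for a degree-k node that takes action."""
    return (p_no_action - p_action) * (1.0 - q * y) ** k


q, p_no_action, p_action = 0.5, 0.3, 0.1  # hypothetical values

# When more nodes take action, neighbors are infected less often (y drops).
few_adopters = {k: risk_reduction(k, p_no_action, p_action, q, 0.4) for k in (2, 10)}
many_adopters = {k: risk_reduction(k, p_no_action, p_action, q, 0.2) for k in (2, 10)}

# Relative gain from the others' actions, per degree.
gain = {k: many_adopters[k] / few_adopters[k] for k in (2, 10)}
```

In this sketch the gain is ((1 − 0.1) / (1 − 0.2))^k, which grows with k: the value of action rises when others act, and it rises more for high-degree nodes.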
Self-fulfilling expectation equilibrium
Final adoption fraction
◦ Nodes with degree k will take the secure measure if their loss is greater than a threshold; the threshold and the adoption fraction are given by fixed point equations
Theorem: the fixed point equations have at least one equilibrium
Cyber-insurance
Cyber-insurance
Two main issues with insurance
◦ Moral hazard: a user with insurance will invest less in security
◦ Adverse selection: happens when the insurance provider cannot observe the protection of nodes
Model of insurance market
◦ The insurance provider offers insurance at a given price
◦ The user pays a premium and is compensated if it gets infected
◦ Assumption: competitive insurance market
Cyber-insurance
Model of insurance market
◦ Utility of buying a given amount of insurance
◦ The user will choose an insurance amount equal to the loss, to maximize its utility
◦ Full insurance coverage
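The full-coverage result can be checked numerically under a standard reading of a competitive market: the premium is actuarially fair, i.e., the price per unit of coverage equals the infection probability. The sketch below assumes that pricing, a square-root utility, and hypothetical values for wealth, loss, and infection probability; none of these come from the slides.

```python
import math


def expected_utility(coverage, wealth, loss, p, u=math.sqrt):
    """Expected utility when buying `coverage` at the fair price p per unit."""
    premium = p * coverage
    not_infected = u(wealth - premium)
    infected = u(wealth - premium - loss + coverage)
    return (1 - p) * not_infected + p * infected


wealth, loss, p = 100.0, 40.0, 0.3  # hypothetical values

# Search coverage levels from 0 to 1.5 times the loss.
grid = [loss * i / 100 for i in range(151)]
best = max(grid, key=lambda x: expected_utility(x, wealth, loss, p))
```

With a concave utility and a fair price, the maximizer equalizes wealth across the infected and healthy states, which happens exactly when coverage equals the loss.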
Effect of cyber-insurance
With an insurance market, the user will choose to invest iff
Without an insurance market, the user will choose to invest iff
For insurance to be a positive incentive,
Define , it
Effect of cyber-insurance
Condition of cyber-insurance to be an incentive: greater than
is bounded
Condition of
Effect of cyber-insurance
Effect on nodes with different degree
For
Thus
The insurance market is more likely to be an incentive for nodes with higher degree
Simulation & numerical results
Simulation & numerical results
Verifying local mean field on random graphs
Simulation & numerical results
Positive externality
Summary
Random walk on large graphs
Proposed a model, combining epidemic theory and game theory, to study strategic investment behavior
◦ Bayesian network game
◦ Positive externality effect
Studied the effect of cyber-insurance
◦ Positive incentive: initial secure condition is bad, while protection is bounded
Thank you!
Q&A