Bayesian anomaly detection for networks and Big Data


Transcript of Bayesian anomaly detection for networks and Big Data

Page 1: Bayesian anomaly detection for networks and Big Data

Bayesian anomaly detection for networks and Big Data

Patrick Rubin-Delanchy

University of Bristol & Heilbronn Institute for Mathematical Research

Joint work with

Nicholas A Heard (Imperial College London) and Daniel J Lawson (University of Bristol)

Oxford computational statistics and machine learning reading group, 24th April 2015

Page 2: Bayesian anomaly detection for networks and Big Data

Examples of large-scale network data

World Wide Web (seemingly between 100 million and 1 billion websites)

Digital economy: worth $4 trillion by 2016 (BCG, 2012)

Social and communication networks (almost 300 million Twitter users)

Computer networks and the “Internet of Things”


Page 3: Bayesian anomaly detection for networks and Big Data

Research opportunities of Big Data

Commercial/marketing benefits (funded by?)

Understanding and influencing human society (iffy)

Artificial intelligence (sci-fi?)

Cyber-security


Page 4: Bayesian anomaly detection for networks and Big Data

The cyber-security application

Global cost of cyber-crime is estimated at $400 billion (CSIS, 2014)

The botnet behind a third of the spam sent in 2010 earned about $2.7 million; spam prevention cost more than $1 billion (Anderson et al., 2013)

UK National Security Strategy — Priority Risks (Cabinet Office, 2010)

International terrorism affecting the UK or its interests, including a chemical, biological, radiological or nuclear attack by terrorists; and/or a significant increase in the levels of terrorism relating to Northern Ireland.

Hostile attacks upon UK cyber space by other states and large scale cyber crime.

A major accident or natural hazard which requires a national response, such as severe coastal flooding affecting three or more regions of the UK, or an influenza pandemic.

An international military crisis between states, drawing in the UK, and its allies as well as other states and non-state actors.

Page 5: Bayesian anomaly detection for networks and Big Data

Statistics and cyber-security

Network flow data at Los Alamos National Laboratory: ∼ 30 GB/day

Typical attack pattern:

A. Opportunistic infection

B. Network traversal

C. Data exfiltration

Figure: Network traversal. Source: Neil et al. (2013)

Page 6: Bayesian anomaly detection for networks and Big Data

Posterior predictive p-value

The posterior predictive p-value is (Meng, 1994; Gelman et al., 1996, Eq. 2.8, Eq. 7)

P = P{f (D∗, θ) ≥ f (D, θ) | D}, (1)

where θ represents the model parameters, D is the observed dataset, D∗ is a hypothetical replicated dataset generated from the model with parameters θ, and P(· | D) is the joint posterior distribution of (θ, D∗) given D.

In words: if a new dataset were generated from the same model and parameters, what is the probability that the new discrepancy would be as large?
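For concreteness, one example of a discrepancy discussed by Gelman et al. (1996) is the χ² measure

f(D, θ) = Σ_i {d_i − E(d_i | θ)}² / Var(d_i | θ),   D = (d_1, ..., d_N),

which compares each observation with its expectation under θ; any function of the data and parameters can play the role of f.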

Some relevant discussion: Guttman (1967), Box (1980), Rubin (1984), Bayarri and Berger (2000), Hjort et al. (2006)

Posterior predictive p-values in practice: Huelsenbeck et al. (2001), Sinharay and Stern (2003), Thornton and Andolfatto (2006), Steinbakk and Storvik (2009)

Page 7: Bayesian anomaly detection for networks and Big Data

Estimation

Posterior sample: θ_1, ..., θ_n

Estimate I:

1. Simulate data D∗_1, ..., D∗_n.

2. Estimate

P̂ = (1/n) Σ_{i=1}^{n} I{f(D∗_i, θ_i) ≥ f(D, θ_i)}.

Estimate II:

1. Calculate p-values

Q_i = P{f(D∗, θ_i) ≥ f(D, θ_i)},

for i = 1, ..., n.

2. Estimate

P̂ = (1/n) Σ_{i=1}^{n} Q_i.
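As a concrete illustration, here is a minimal sketch of both estimators on a toy conjugate model; the normal likelihood, vague normal prior and max-based discrepancy are illustrative assumptions, not the models used later in the talk.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy setting (an assumption for illustration): d_1, ..., d_N iid Normal(mu, 1),
# prior mu ~ Normal(0, 100), discrepancy f(D, theta) = max(D).
N = 20
D = rng.normal(1.0, 1.0, size=N)

# Conjugate posterior for mu.
post_var = 1.0 / (N + 1.0 / 100.0)
post_mean = post_var * D.sum()

n = 5000
theta = rng.normal(post_mean, np.sqrt(post_var), size=n)  # posterior sample

# Estimate I: one replicated dataset per posterior draw, then average indicators.
D_rep = rng.normal(theta[:, None], 1.0, size=(n, N))
P_I = np.mean(D_rep.max(axis=1) >= D.max())

# Estimate II: Q_i = P{f(D*, theta_i) >= f(D, theta_i)} is available in closed
# form here, since the max of N iid Normal(theta_i, 1) has cdf Phi(x - theta_i)^N.
Q = 1.0 - norm.cdf(D.max() - theta) ** N
P_II = Q.mean()

print(P_I, P_II)  # two estimates of the same posterior predictive p-value
```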



Page 9: Bayesian anomaly detection for networks and Big Data

The two main points of this talk are:

1. Posterior predictive p-values are not just Bayesian model-checking devices; they can be used for hypothesis testing and anomaly detection.

2. Posterior predictive p-values are not uniformly distributed under the null hypothesis, but do come from a special family of distributions.

Page 10: Bayesian anomaly detection for networks and Big Data

Recall: P = P{f(D∗, θ) ≥ f(D, θ) | D}.

How should we interpret P? For example:

Is P = 40% ‘good’?

If P = 10⁻⁶, is there cause for alarm?

What if 500 tests are performed, and min(P_i) = 10⁻⁶?

The problem is: P is not uniformly distributed.

Page 11: Bayesian anomaly detection for networks and Big Data

Example: testing for dependence between point processes

A critical network question: does one point process ‘cause’ another? Possible applications:

Characterizing information flow/network behaviour

Detecting tunnelling on a computer network

Finding ‘coordinated’ activity, e.g. discovering botnets

Sports, finance, neurophysiology, etc.

One statement of the problem: Let A and B be two point processes and consider the hypothesis test

H_0: A and B are independent

H_1: B has an increased rate following A events


Page 13: Bayesian anomaly detection for networks and Big Data

[Figure: event times of the two point processes, A and B]

Page 15: Bayesian anomaly detection for networks and Big Data

[Figure: A and B event times with the null intensity r_0(t)]

Null hypothesis: B is a non-homogeneous Poisson process with intensity r_0(t)

Page 16: Bayesian anomaly detection for networks and Big Data

[Figure: area under r_0(t) up to the first response time]

Let b_1 = area until first response time

Page 17: Bayesian anomaly detection for networks and Big Data

[Figure: area under r_0(t) up to the second response time]

Let b_2 = area until second response time

Page 18: Bayesian anomaly detection for networks and Big Data

[Figure: area under r_0(t) up to the third response time]

Let b_3 = area until third response time

Page 19: Bayesian anomaly detection for networks and Big Data

Lemma

Under H_0, the areas b_1, b_2, ... are the event times of a homogeneous Poisson process.
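A quick simulation check of the lemma, in the spirit of the time-rescaling theorem; the intensity r_0(t), the thinning construction and measuring the areas from time zero are illustrative assumptions rather than details taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical null intensity r_0(t); any non-negative function works.
r0 = lambda t: 2.0 + np.sin(t)
T = 200.0
r0_max = 3.0                      # upper bound on r0, used for thinning

# Simulate B as a non-homogeneous Poisson process by thinning.
cand = np.cumsum(rng.exponential(1.0 / r0_max, size=2000))
cand = cand[cand < T]
B = cand[rng.random(cand.size) < r0(cand) / r0_max]

# Area under r0 up to each event time (the b_i of the slides), by numerical integration.
grid = np.linspace(0.0, T, 200001)
cum = np.concatenate(([0.0], np.cumsum(r0(grid[:-1]) * np.diff(grid))))
b = np.interp(B, grid, cum)

# Under H0 the b_i are event times of a unit-rate homogeneous Poisson process,
# so their increments should look like iid Exponential(1) draws.
gaps = np.diff(np.concatenate(([0.0], b)))
print(gaps.mean(), gaps.var())    # both should be close to 1
```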


Page 20: Bayesian anomaly detection for networks and Big Data

[Figure: A and B event times with the alternative intensity r_1(t)]

Alternative hypothesis: B has intensity r_1(t) = r_0(t) f(t − a(t)), where f(x) ∝ exp(−βx) and a(t) is the closest A event to t.


Page 25: Bayesian anomaly detection for networks and Big Data

[Figure: A and B event times with the alternative intensity r_1(t)]

Alternative hypothesis: B has intensity r_1(t) = r_0(t) f(t − a(t)), where f is a step function with one step.


Page 27: Bayesian anomaly detection for networks and Big Data

[Figure: B̃ against f(t̃), on the transformed timescale]

Alternative hypothesis: B has intensity r_1(t) = r_0(t) f(t − a(t)), where f is a decreasing function (f is the maximum likelihood fit).

Page 28: Bayesian anomaly detection for networks and Big Data

[Figure: A and B event times with the fitted intensity r_1(t)]

Alternative hypothesis: B has intensity r_1(t) = r_0(t) f(t − a(t)), where f is a decreasing function (r_1 is the maximum likelihood fit).

Page 29: Bayesian anomaly detection for networks and Big Data

Theorem

Under H_0, f(0) is distributed as n/(UT), where U is a uniform random variable over [0, 1] and T is the length of the observation period.

p-value for A causing B: n/{T f(0)}

In previous example: p ≈ 0.15 (not significant)
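A minimal helper based on the theorem above; interpreting n as the number of B events during the observation window is our assumption (the slides do not spell out what n counts).

```python
def dependence_pvalue(f0_hat, n, T):
    """p-value for 'A causing B' from the theorem above: under H0,
    f(0) is distributed as n/(UT) with U ~ Uniform(0, 1), so
    P{f(0) >= f0_hat} = n / (T * f0_hat), capped at 1.  Here n is assumed
    to be the number of B events (not stated explicitly in the slides)
    and T is the length of the observation period."""
    return min(1.0, n / (T * f0_hat))

# e.g. dependence_pvalue(f0_hat=1.5, n=30, T=40.0) == 0.5
```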


Page 30: Bayesian anomaly detection for networks and Big Data

Daily emailing behaviour of an individual in the Enron dataset:

[Figure: email times, hour of day (0-24) against day of year (1-365)]

Page 31: Bayesian anomaly detection for networks and Big Data

Bayesian intensity estimation for an individual of interest (change-point model for daily binned data, and density estimation for within-day behaviour)

[Figure: estimated intensity λ(t) against time]

Page 32: Bayesian anomaly detection for networks and Big Data

‘Information flow’ through JD in the Enron data. Over 2001, 12 individuals contact and are contacted back by JD. Posterior predictive p-values are calculated for each pair i → JD, JD → k, for i, k = 1, ..., 12. Full black means p < 0.0001, half-black means p ≤ 0.05, white means not significant.

[Figure: 12 × 12 grid of the resulting p-values]

Page 33: Bayesian anomaly detection for networks and Big Data

7 → JD, JD → 7: only one email on each edge; they are sent about one month apart, and appear to be unrelated judging by their subject-lines. The p-value is computed to be about 0.2.

10 → JD, JD → 10: 14 emails from 10 to JD and 9 from JD to 10, the most coincidental email times falling in July, about 3 1/2 hours from each other. The p-value is 0.07. If time is not transformed the raw p-value is 0.0035, suggesting a significant interaction. However, upon inspecting the subject-lines of 10 → JD and JD → 10 it is not in fact obvious that there is consistent reciprocation. For example, the subject-lines of the two most coincidental emails are “FW: Enron Complaint” and “Dunn hearing link?”, which are not obviously related.

Page 34: Bayesian anomaly detection for networks and Big Data

Based on the subject-lines of the most coincidental email times, the three black circles are correct detections.

6 → JD, JD → 5: the p-value is around 2 × 10⁻⁵. The most coincidental email times are two hours and twenty minutes apart, and have the same subject-line, “Re: FW: SoCalGas Capacity”.

9 → JD, JD → 3: the p-value is around 6 × 10⁻⁸. The most coincidental email times are 50 minutes apart, and have the subject-lines “California Update–Legislative Push Underway” and “Re: California Update–Legislative Push Underway”.

12 → JD, JD → 11: the p-value is around 2 × 10⁻⁵. The most coincidental email times are 30 minutes apart and have the subject-lines “RE: CA Unbundling” and “CA Unbundling”.

Page 35: Bayesian anomaly detection for networks and Big Data

The notion that PPPs are ‘conservative’

General consensus is that P is conservative (too large), i.e. under-represents evidence against the model, see e.g. discussion of Gelman et al. (1996).

In fact, conservative isn't quite the right word. Instead, the p-values seem to concentrate around 1/2. For example, let D ∼ normal(µ, 1), with a vague prior on µ, and let f(D, θ) = D. Then,

P = P{f (D∗, θ) ≥ f (D, θ) | D} ≈ 1/2.
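A minimal simulation of this example, taking the vague prior on µ to be flat so that µ | D ∼ normal(D, 1):

```python
import numpy as np

rng = np.random.default_rng(2)

# D ~ normal(mu, 1), flat (vague) prior on mu, discrepancy f(D, theta) = D.
# Posterior: mu | D ~ normal(D, 1); replicate: D* | mu ~ normal(mu, 1),
# so D* | D ~ normal(D, 2) and P = P(D* >= D | D) = 1/2 exactly.
D = 3.7                                # any observed value gives the same answer
mu = rng.normal(D, 1.0, size=100_000)  # posterior draws
D_rep = rng.normal(mu, 1.0)            # replicated data
print(np.mean(D_rep >= D))             # Monte Carlo estimate, about 0.5
```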


Page 36: Bayesian anomaly detection for networks and Big Data

Meng’s result

In the last pages of Meng (1994) we find:

In our notation: suppose θ and D are drawn from the prior and model respectively, and f(D, θ) is absolutely continuous. Then, (i) E(P) = E(U) = 1/2 and (ii) for any convex function h,

E{h(P)} ≤ E{h(U)},

where U is a uniform random variable on [0, 1].


Page 37: Bayesian anomaly detection for networks and Big Data

A consequence

Meng (1994) next finds: under the same conditions, P(P ≤ α) ≤ 2α for every α ∈ [0, 1] (the ‘2α bound’ referred to below).

Page 38: Bayesian anomaly detection for networks and Big Data

The convex order

Let X and Y be two random variables with probability measures µ and ν respectively. We say that µ (respectively, X) is less variable than ν (respectively, Y) in the convex order, denoted µ ≤_cx ν (or X ≤_cx Y), if, for any convex function h,

E{h(X)} ≤ E{h(Y)},

whenever the expectations exist (Shaked and Shanthikumar, 2007). A probability measure P, and a random variable distributed as P, is sub-uniform if P ≤_cx U, where U is a uniform distribution on [0, 1].

Posterior predictive p-values have a sub-uniform distribution.

Page 39: Bayesian anomaly detection for networks and Big Data

Meng’s findings could seem quite coarse:

1. The 2α bound suggests that P could be liberal, whereas we had the impression they were conservative. Can the bound be improved?

2. The theorem seems to be describing a huge space of distributions. Can it somehow be reduced?

Our results:

1. It is possible to construct a posterior predictive p-value with any sub-uniform distribution.

2. Some sub-uniform distributions achieve the 2α bound.

3. Therefore, some posterior predictive p-values achieve the 2α bound. In fact, we can construct simple examples where this happens.


Page 41: Bayesian anomaly detection for networks and Big Data

Theorem (Strassen’s theorem)

For two probability measures µ and ν on the real line the following conditions are equivalent:

1. µ ≤_cx ν;

2. there are random variables X and Y with marginal distributions µ and ν respectively such that E(Y | X) = X.

Theorem due to Strassen (1965) (see also references therein); this version due to Müller and Rüschendorf (2001).

Page 42: Bayesian anomaly detection for networks and Big Data

Theorem

Let µ and ν be two probability measures on the real line where ν is absolutely continuous. The following conditions are equivalent:

1. µ ≤_cx ν;

2. there exist random variables X and Y with marginal distributions µ and ν respectively such that E(Y | X) = X and the random variable Y | X is either singular, i.e. Y = X, or absolutely continuous, with µ-probability one.

Theorem (Posterior predictive p-values and the convex order)

P is a sub-uniform probability measure if and only if there exist random variables P, D, θ and an absolutely continuous discrepancy f(D, θ) such that

P = P{f(D∗, θ) ≥ f(D, θ) | D},

where P has measure P, D∗ is a replicate of D conditional on θ, and P(· | D) is the joint posterior distribution of (θ, D∗) given D.


Page 44: Bayesian anomaly detection for networks and Big Data

Exploring the set of sub-uniform distributions

The integrated distribution function of a random variable X with distribution function F_X is

ψ_X(x) = ∫_{−∞}^{x} F_X(t) dt.

We have (Müller and Rüschendorf, 2001):

1. ψ_X is non-decreasing and convex;

2. its right derivative ψ⁺_X(x) exists and 0 ≤ ψ⁺_X(x) ≤ 1;

3. lim_{x→−∞} ψ_X(x) = 0 and lim_{x→∞} {x − ψ_X(x)} = E(X).

Furthermore, for any function ψ satisfying these properties, there is a random variable X such that ψ is the integrated distribution function of X. The right derivative of ψ is the distribution function of X, F_X(x) = ψ⁺(x).

Let Y be another random variable with integrated distribution function ψ_Y. Then X ≤_cx Y if and only if ψ_X ≤ ψ_Y and lim_{x→∞} {ψ_Y(x) − ψ_X(x)} = 0.
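A numerical sketch of this characterization for the Beta(2,2) example used in the figures below: its integrated distribution function sits below that of the uniform, with the same limit, so Beta(2,2) ≤_cx Uniform[0, 1].

```python
import numpy as np

# Check that Beta(2,2) is sub-uniform by comparing integrated distribution
# functions on a grid (an illustrative numerical sketch).
x = np.linspace(0.0, 1.0, 1001)

F_unif = x                          # cdf of Uniform[0, 1]
F_beta = 3 * x**2 - 2 * x**3        # cdf of Beta(2, 2)

dx = x[1] - x[0]
psi_unif = np.cumsum(F_unif) * dx   # psi_X(x) = integral of F_X up to x
psi_beta = np.cumsum(F_beta) * dx

# Sub-uniformity: psi_beta <= psi_unif everywhere, with equal limits
# (both means are 1/2, so the gap vanishes at x = 1).
print(np.all(psi_beta <= psi_unif + 1e-9))   # True
print(psi_unif[-1], psi_beta[-1])            # both close to 1/2
```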



Page 48: Bayesian anomaly detection for networks and Big Data

Figure: Uniform distribution (integrated distribution function ψ(t), left; density, right)

Page 49: Bayesian anomaly detection for networks and Big Data

Figure: Some sub-uniform distributions (integrated distribution function ψ(t), left; density, right). Blue: uniform, green: Beta(2,2)

Page 50: Bayesian anomaly detection for networks and Big Data

Figure: Some sub-uniform distributions (integrated distribution function ψ(t), left; density, right). Blue: uniform, green: Beta(2,2), red: a distribution with maximal mass at 0.3

Page 51: Bayesian anomaly detection for networks and Big Data

Figure: Some constructive examples. Panels: a) Model 1 (X(0), X(1), θ = 0, θ = 1, 1 − 2α); b) Model 1 discrepancy, f(x, 0) and f(x, 1); c) Model 2, densities p(x | θ = 0) and p(x | θ = 1); d) Model 2 discrepancy, f(x, 0) and f(x, 1)

Page 52: Bayesian anomaly detection for networks and Big Data

Application: combining p-values I

Suppose P_1, ..., P_m are independent posterior predictive p-values. How can I combine them into one piece of evidence?

With ordinary p-values (i.e. independent uniform random variables):

min(P_i) ∼ Beta(1, m)

−2 Σ_{i=1}^{m} log(P_i) ∼ χ²_{2m} (Fisher's method)
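For reference, a short sketch of these two standard combinations for genuinely uniform p-values (the example p-values are arbitrary); the sub-uniform adjustments follow on the next slides.

```python
import numpy as np
from scipy.stats import beta, chi2

# Combinations for independent *uniform* p-values:
# min(P_i) ~ Beta(1, m) and -2 * sum(log P_i) ~ chi-squared with 2m d.o.f.
def combine_min(pvals):
    m = len(pvals)
    return beta.cdf(min(pvals), 1, m)        # = 1 - (1 - min p)^m

def combine_fisher(pvals):
    m = len(pvals)
    stat = -2.0 * np.sum(np.log(pvals))
    return chi2.sf(stat, df=2 * m)

pvals = [0.03, 0.40, 0.77, 0.12, 0.55]       # illustrative values
print(combine_min(pvals), combine_fisher(pvals))
```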


Page 53: Bayesian anomaly detection for networks and Big Data

Combining p-values II

Lemma

Let P_1, ..., P_m and U_1, ..., U_m each be sequences of independent sub-uniform and uniform random variables on [0, 1], respectively. For x ∈ [0, 1], let

q = 1 − (1 − x)^m = P{min(U_i) ≤ x}.

Then

P{min(P_i) ≤ x} ≤ 1 − (1 − 2x)^m = 1 − {2(1 − q)^{1/m} − 1}^m,

which is no larger than 2q and tends to 2q − q² as m → ∞. Furthermore, this bound is achievable if the P_i are independent and identically distributed.
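A sketch of how the lemma can be used in practice: a conservative combined p-value for the minimum of m independent sub-uniform (e.g. posterior predictive) p-values; the helper name is ours.

```python
def min_subuniform_pvalue(pvals):
    """Conservative combined p-value for min(P_i) when the P_i are independent
    sub-uniform (e.g. posterior predictive) p-values, using the lemma's bound
    P{min(P_i) <= x} <= 1 - (1 - 2x)^m."""
    m = len(pvals)
    x = min(pvals)
    if x >= 0.5:
        return 1.0                       # the bound is vacuous beyond x = 1/2
    return 1.0 - (1.0 - 2.0 * x) ** m

# Compare with the uniform-theory value 1 - (1 - x)^m: the adjusted value is
# larger, i.e. more conservative, as expected.
print(min_subuniform_pvalue([1e-4, 0.2, 0.6]))
```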


Page 54: Bayesian anomaly detection for networks and Big Data

Combining p-values III

Lemma (Fisher’s method is asymptotically conservative)

Let P_1, ..., P_m and U_1, ..., U_m each be sequences of independent and identically distributed sub-uniform and uniform random variables on [0, 1] respectively. For α ∈ (0, 1], let t_{α,m} be the critical value defined by

P(−2 Σ_{i=1}^{m} log(U_i) ≥ t_{α,m}) = α.

Then there exists n ∈ N such that

P(−2 Σ_{i=1}^{m} log(P_i) ≥ t_{α,m}) ≤ α,

for any m ≥ n.

Page 55: Bayesian anomaly detection for networks and Big Data

Combining p-values IV

Lemma (Sums of posterior predictive p-values)

Let P_1, ..., P_m denote m independent sub-uniform random variables with mean P̄ = m⁻¹ Σ_{i=1}^{m} P_i. Then, for 0 ≤ t ≤ 1/2,

P(1/2 − P̄ ≥ t) ≤ exp(−6mt²).

Note Hoeffding's inequality (Hoeffding, 1963):

P(1/2 − P̄ ≥ t) ≤ exp(−2mt²).
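Similarly, a sketch of turning the average of m independent sub-uniform p-values into a single conservative p-value via the lemma's bound; the helper name is ours.

```python
import math

def mean_subuniform_pvalue(pvals):
    """Conservative p-value based on the lemma's tail bound
    P(1/2 - Pbar >= t) <= exp(-6 m t^2) for independent sub-uniform P_i."""
    m = len(pvals)
    pbar = sum(pvals) / m
    t = 0.5 - pbar
    if t <= 0:
        return 1.0                       # average not below 1/2: no evidence
    return math.exp(-6.0 * m * t * t)

print(mean_subuniform_pvalue([0.01, 0.05, 0.10, 0.02]))
```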


Page 56: Bayesian anomaly detection for networks and Big Data

Some final points

1. Other attempts at combining Bayesian inference with hypothesis tests completely break down when the posterior contracts.

2. We can also characterize estimated posterior predictive p-values from MCMC samples (e.g. of two schemes proposed by Gelman et al. (1996), only one produces a sub-uniform estimate), as well as other types of p-values (e.g. discrete p-values).

3. The convex order seems to have a deeper relevance to statistical inference and Monte Carlo, see e.g. Andrieu and Vihola (2014), Goldstein et al. (2011).

Page 57: Bayesian anomaly detection for networks and Big Data

Anderson, R., Barton, C., Böhme, R., Clayton, R., Van Eeten, M. J., Levi, M., Moore, T., and Savage, S. (2013). Measuring the cost of cybercrime. In The Economics of Information Security and Privacy, pages 265–300. Springer.

Andrieu, C. and Vihola, M. (2014). Establishing some order amongst exact approximations of MCMCs. arXiv preprint arXiv:1404.6909.

Bayarri, M. and Berger, J. O. (2000). P values for composite null models. Journal of the American Statistical Association, 95(452):1127–1142.

Box, G. E. (1980). Sampling and Bayes' inference in scientific modelling and robustness. Journal of the Royal Statistical Society, Series A (General), pages 383–430.

Cabinet Office and National Security and Intelligence (2010). A Strong Britain in an Age of Uncertainty: The National Security Strategy.

Center for Strategic and International Studies (2014). Net Losses: Estimating the Global Cost of Cybercrime (Economic Impact of Cybercrime II).

Gelman, A., Meng, X.-L., and Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6(4):733–760.

Goldstein, L., Rinott, Y., and Scarsini, M. (2011). Stochastic comparisons of stratified sampling techniques for some Monte Carlo estimators. Bernoulli, 17(2):592–608.

Guttman, I. (1967). The use of the concept of a future observation in goodness-of-fit problems. Journal of the Royal Statistical Society, Series B (Methodological), pages 83–100.

Hjort, N. L., Dahl, F. A., and Steinbakk, G. H. (2006). Post-processing posterior predictive p values. Journal of the American Statistical Association, 101(475):1157–1174.

Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30.

Huelsenbeck, J. P., Ronquist, F., Nielsen, R., and Bollback, J. P. (2001). Bayesian inference of phylogeny and its impact on evolutionary biology. Science, 294(5550):2310–2314.

Meng, X.-L. (1994). Posterior predictive p-values. The Annals of Statistics, 22(3):1142–1160.

Müller, A. and Rüschendorf, L. (2001). On the optimal stopping values induced by general dependence structures. Journal of Applied Probability, 38(3):672–684.

Neil, J., Hash, C., Brugh, A., Fisk, M., and Storlie, C. B. (2013). Scan statistics for the online detection of locally anomalous subgraphs. Technometrics, 55(4):403–414.

Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. The Annals of Statistics, 12(4):1151–1172.

Rubin-Delanchy, P. and Heard, N. A. (2014). A test for dependence between two point processes on the real line. arXiv preprint arXiv:1408.3845.

Rubin-Delanchy, P. and Lawson, D. J. (2014). Posterior predictive p-values and the convex order. arXiv preprint arXiv:1412.3442.

Shaked, M. and Shanthikumar, J. G. (2007). Stochastic Orders. Springer.

Sinharay, S. and Stern, H. S. (2003). Posterior predictive model checking in hierarchical models. Journal of Statistical Planning and Inference, 111(1):209–221.

Steinbakk, G. H. and Storvik, G. O. (2009). Posterior predictive p-values in Bayesian hierarchical models. Scandinavian Journal of Statistics, 36(2):320–336.

Strassen, V. (1965). The existence of probability measures with given marginals. The Annals of Mathematical Statistics, 36(2):423–439.

The Boston Consulting Group (2012). The Internet Economy in the G-20.

Thornton, K. and Andolfatto, P. (2006). Approximate Bayesian inference reveals evidence for a recent, severe bottleneck in a Netherlands population of Drosophila melanogaster. Genetics, 172(3):1607–1619.