Bayesian anomaly detection for networks and Big Data


Transcript of Bayesian anomaly detection for networks and Big Data

Page 1: Bayesian anomaly detection for networks and Big Data

Bayesian anomaly detection for networks and Big Data

Patrick Rubin-Delanchy

University of Bristol & Heilbronn Institute for Mathematical Research

Joint work with

Nicholas A Heard (Imperial College London) and Daniel J Lawson (University of Bristol)

Oxford computational statistics and machine learning reading group, 24th April 2015

Page 2: Bayesian anomaly detection for networks and Big Data

Examples of large-scale network data

World Wide Web (seemingly between 100 million and 1 billion websites)

Digital economy: worth $4 trillion by 2016 (BCG, 2012)

Social and communication networks (almost 300 million Twitter users)

Computer networks and the “Internet of Things”


Page 3: Bayesian anomaly detection for networks and Big Data

Research opportunities of Big Data

Commercial/marketing benefits (funded by?)

Understanding and influencing human society (iffy)

Artificial intelligence (sci-fi?)

Cyber-security


Page 4: Bayesian anomaly detection for networks and Big Data

The cyber-security application

Global cost of cyber-crime is estimated at $400 billion (CSIS, 2014)

The botnet behind a third of the spam sent in 2010 earned about $2.7 million; spam prevention cost more than $1 billion (Anderson et al., 2013)

UK National Security Strategy — Priority Risks (Cabinet Office, 2010)

International terrorism affecting the UK or its interests, including a chemical, biological, radiological or nuclear attack by terrorists; and/or a significant increase in the levels of terrorism relating to Northern Ireland.

Hostile attacks upon UK cyber space by other states and large scale cyber crime.

A major accident or natural hazard which requires a national response, such as severe coastal flooding affecting three or more regions of the UK, or an influenza pandemic.

An international military crisis between states, drawing in the UK, and its allies as well as other states and non-state actors.

Page 5: Bayesian anomaly detection for networks and Big Data

Statistics and cyber-security

Network flow data at Los Alamos National Laboratory: ∼ 30 GB/day

Typical attack pattern:

A. Opportunistic infection

B. Network traversal

C. Data exfiltration

Figure: Network traversal. Source: Neil et al. (2013)

Page 6: Bayesian anomaly detection for networks and Big Data

Posterior predictive p-value

The posterior predictive p-value is (Meng, 1994; Gelman et al., 1996, Eq. 2.8, Eq. 7)

P = P{f (D∗, θ) ≥ f (D, θ) | D}, (1)

where θ represents the model parameters, D is the observed dataset, D∗ is a hypothetical replicated dataset generated from the model with parameters θ, and P(· | D) is the joint posterior distribution of (θ, D∗) given D.

In words: if a new dataset were generated from the same model and parameters, what is the probability that the new discrepancy would be as large?
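For concreteness, one example of a discrepancy discussed by Gelman et al. (1996) is the χ² measure

f(D, θ) = Σ_i {d_i − E(d_i | θ)}² / Var(d_i | θ),   D = (d_1, ..., d_N),

which compares each observation with its expectation under θ; any function of the data and parameters can play the role of f.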

Some relevant discussion: Guttman (1967), Box (1980), Rubin (1984), Bayarri and Berger (2000), Hjort et al. (2006)

Posterior predictive p-values in practice: Huelsenbeck et al. (2001), Sinharay and Stern (2003), Thornton and Andolfatto (2006), Steinbakk and Storvik (2009)

Page 7: Bayesian anomaly detection for networks and Big Data

Estimation

Posterior sample: θ_1, ..., θ_n

Estimate I:

1. Simulate data D∗_1, ..., D∗_n.

2. Estimate

P̂ = (1/n) Σ_{i=1}^{n} I{f(D∗_i, θ_i) ≥ f(D, θ_i)}.

Estimate II:

1. Calculate p-values

Q_i = P{f(D∗, θ_i) ≥ f(D, θ_i)},

for i = 1, ..., n.

2. Estimate

P̂ = (1/n) Σ_{i=1}^{n} Q_i.
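As a concrete illustration, here is a minimal sketch of both estimators on a toy conjugate model; the normal likelihood, vague normal prior and max-based discrepancy are illustrative assumptions, not the models used later in the talk.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy setting (an assumption for illustration): d_1, ..., d_N iid Normal(mu, 1),
# prior mu ~ Normal(0, 100), discrepancy f(D, theta) = max(D).
N = 20
D = rng.normal(1.0, 1.0, size=N)

# Conjugate posterior for mu.
post_var = 1.0 / (N + 1.0 / 100.0)
post_mean = post_var * D.sum()

n = 5000
theta = rng.normal(post_mean, np.sqrt(post_var), size=n)  # posterior sample

# Estimate I: one replicated dataset per posterior draw, then average indicators.
D_rep = rng.normal(theta[:, None], 1.0, size=(n, N))
P_I = np.mean(D_rep.max(axis=1) >= D.max())

# Estimate II: Q_i = P{f(D*, theta_i) >= f(D, theta_i)} is available in closed
# form here, since the max of N iid Normal(theta_i, 1) has cdf Phi(x - theta_i)^N.
Q = 1.0 - norm.cdf(D.max() - theta) ** N
P_II = Q.mean()

print(P_I, P_II)  # two estimates of the same posterior predictive p-value
```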



Page 9: Bayesian anomaly detection for networks and Big Data

The two main points of this talk are:

1. Posterior predictive p-values are not just Bayesian model-checking devices; they can be used for hypothesis testing and anomaly detection.

2. Posterior predictive p-values are not uniformly distributed under the null hypothesis, but do come from a special family of distributions.

Page 10: Bayesian anomaly detection for networks and Big Data

Recall: P = P{f(D∗, θ) ≥ f(D, θ) | D}.

How should we interpret P? For example:

Is P = 40% ‘good’?

If P = 10⁻⁶, is there cause for alarm?

What if 500 tests are performed, and min(P_i) = 10⁻⁶?

The problem is: P is not uniformly distributed.

Page 11: Bayesian anomaly detection for networks and Big Data

Example: testing for dependence between point processes

A critical network question: does one point process ‘cause’ another? Possible applications:

Characterizing information flow/network behaviour

Detecting tunnelling on a computer network

Finding ‘coordinated’ activity, e.g. discovering botnets

Sports, finance, neurophysiology, etc.

One statement of the problem: Let A and B be two point processes and consider the hypothesis test

H_0: A and B are independent

H_1: B has an increased rate following A events


Page 13: Bayesian anomaly detection for networks and Big Data

[Figure: event times of the two point processes, A and B]

Page 15: Bayesian anomaly detection for networks and Big Data

[Figure: A and B event times with the null intensity r_0(t)]

Null hypothesis: B is a non-homogeneous Poisson process with intensity r_0(t)

Page 16: Bayesian anomaly detection for networks and Big Data

[Figure: area under r_0(t) up to the first response time]

Let b_1 = area until first response time

Page 17: Bayesian anomaly detection for networks and Big Data

[Figure: area under r_0(t) up to the second response time]

Let b_2 = area until second response time

Page 18: Bayesian anomaly detection for networks and Big Data

[Figure: area under r_0(t) up to the third response time]

Let b_3 = area until third response time

Page 19: Bayesian anomaly detection for networks and Big Data

Lemma

Under H_0, the areas b_1, b_2, ... are the event times of a homogeneous Poisson process.
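A quick simulation check of the lemma, in the spirit of the time-rescaling theorem; the intensity r_0(t), the thinning construction and measuring the areas from time zero are illustrative assumptions rather than details taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical null intensity r_0(t); any non-negative function works.
r0 = lambda t: 2.0 + np.sin(t)
T = 200.0
r0_max = 3.0                      # upper bound on r0, used for thinning

# Simulate B as a non-homogeneous Poisson process by thinning.
cand = np.cumsum(rng.exponential(1.0 / r0_max, size=2000))
cand = cand[cand < T]
B = cand[rng.random(cand.size) < r0(cand) / r0_max]

# Area under r0 up to each event time (the b_i of the slides), by numerical integration.
grid = np.linspace(0.0, T, 200001)
cum = np.concatenate(([0.0], np.cumsum(r0(grid[:-1]) * np.diff(grid))))
b = np.interp(B, grid, cum)

# Under H0 the b_i are event times of a unit-rate homogeneous Poisson process,
# so their increments should look like iid Exponential(1) draws.
gaps = np.diff(np.concatenate(([0.0], b)))
print(gaps.mean(), gaps.var())    # both should be close to 1
```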


Page 20: Bayesian anomaly detection for networks and Big Data

[Figure: A and B event times with the alternative intensity r_1(t)]

Alternative hypothesis: B has intensity r_1(t) = r_0(t) f(t − a(t)), where f(x) ∝ exp(−βx) and a(t) is the closest A event to t.


Page 25: Bayesian anomaly detection for networks and Big Data

[Figure: A and B event times with the alternative intensity r_1(t)]

Alternative hypothesis: B has intensity r_1(t) = r_0(t) f(t − a(t)), where f is a step function with one step.


Page 27: Bayesian anomaly detection for networks and Big Data

[Figure: B̃ against f(t̃), on the transformed timescale]

Alternative hypothesis: B has intensity r_1(t) = r_0(t) f(t − a(t)), where f is a decreasing function (f is the maximum likelihood fit).

Page 28: Bayesian anomaly detection for networks and Big Data

[Figure: A and B event times with the fitted intensity r_1(t)]

Alternative hypothesis: B has intensity r_1(t) = r_0(t) f(t − a(t)), where f is a decreasing function (r_1 is the maximum likelihood fit).

Page 29: Bayesian anomaly detection for networks and Big Data

Theorem

Under H_0, f(0) is distributed as n/(UT), where U is a uniform random variable over [0, 1] and T is the length of the observation period.

p-value for A causing B: n/{T f(0)}

In previous example: p ≈ 0.15 (not significant)
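A minimal helper based on the theorem above; interpreting n as the number of B events during the observation window is our assumption (the slides do not spell out what n counts).

```python
def dependence_pvalue(f0_hat, n, T):
    """p-value for 'A causing B' from the theorem above: under H0,
    f(0) is distributed as n/(UT) with U ~ Uniform(0, 1), so
    P{f(0) >= f0_hat} = n / (T * f0_hat), capped at 1.  Here n is assumed
    to be the number of B events (not stated explicitly in the slides)
    and T is the length of the observation period."""
    return min(1.0, n / (T * f0_hat))

# e.g. dependence_pvalue(f0_hat=1.5, n=30, T=40.0) == 0.5
```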


Page 30: Bayesian anomaly detection for networks and Big Data

Daily emailing behaviour of an individual in the Enron dataset:

[Figure: email times, hour of day (0-24) against day of year (1-365)]

Page 31: Bayesian anomaly detection for networks and Big Data

Bayesian intensity estimation for an individual of interest (change-point model for daily binned data, and density estimation for within-day behaviour)

[Figure: estimated intensity λ(t) against time]

Page 32: Bayesian anomaly detection for networks and Big Data

‘Information flow’ through JD in the Enron data. Over 2001, 12 individuals contact and are contacted back by JD. Posterior predictive p-values are calculated for each pair i → JD, JD → k, for i, k = 1, ..., 12. Full black means p < 0.0001, half-black means p ≤ 0.05, white means not significant.

[Figure: 12 × 12 grid of the resulting p-values]

Page 33: Bayesian anomaly detection for networks and Big Data

7 → JD, JD → 7: only one email on each edge; they are sent about one month apart, and appear to be unrelated judging by their subject-lines. The p-value is computed to be about 0.2.

10 → JD, JD → 10: 14 emails from 10 to JD and 9 from JD to 10, the most coincidental email times falling in July, about 3 1/2 hours from each other. The p-value is 0.07. If time is not transformed the raw p-value is 0.0035, suggesting a significant interaction. However, upon inspecting the subject-lines of 10 → JD and JD → 10 it is not in fact obvious that there is consistent reciprocation. For example, the subject-lines of the two most coincidental emails are “FW: Enron Complaint” and “Dunn hearing link?”, which are not obviously related.

Page 34: Bayesian anomaly detection for networks and Big Data

Based on the subject-lines of the most coincidental email times, the three black circles are correct detections.

6 → JD, JD → 5: the p-value is around 2 × 10⁻⁵. The most coincidental email times are two hours and twenty minutes apart, and have the same subject-line, “Re: FW: SoCalGas Capacity”.

9 → JD, JD → 3: the p-value is around 6 × 10⁻⁸. The most coincidental email times are 50 minutes apart, and have the subject-lines “California Update–Legislative Push Underway” and “Re: California Update–Legislative Push Underway”.

12 → JD, JD → 11: the p-value is around 2 × 10⁻⁵. The most coincidental email times are 30 minutes apart and have the subject-lines “RE: CA Unbundling” and “CA Unbundling”.

Page 35: Bayesian anomaly detection for networks and Big Data

The notion that PPPs are ‘conservative’

General consensus is that P is conservative (too large), i.e. under-represents evidence against the model, see e.g. discussion of Gelman et al. (1996).

In fact, conservative isn't quite the right word. Instead, the p-values seem to concentrate around 1/2. For example, let D ∼ normal(µ, 1), with a vague prior on µ, and let f(D, θ) = D. Then,

P = P{f (D∗, θ) ≥ f (D, θ) | D} ≈ 1/2.
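A minimal simulation of this example, taking the vague prior on µ to be flat so that µ | D ∼ normal(D, 1):

```python
import numpy as np

rng = np.random.default_rng(2)

# D ~ normal(mu, 1), flat (vague) prior on mu, discrepancy f(D, theta) = D.
# Posterior: mu | D ~ normal(D, 1); replicate: D* | mu ~ normal(mu, 1),
# so D* | D ~ normal(D, 2) and P = P(D* >= D | D) = 1/2 exactly.
D = 3.7                                # any observed value gives the same answer
mu = rng.normal(D, 1.0, size=100_000)  # posterior draws
D_rep = rng.normal(mu, 1.0)            # replicated data
print(np.mean(D_rep >= D))             # Monte Carlo estimate, about 0.5
```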


Page 36: Bayesian anomaly detection for networks and Big Data

Meng’s result

In the last pages of Meng (1994) we find:

In our notation: suppose θ and D are drawn from the prior and model respectively, and f(D, θ) is absolutely continuous. Then, (i) E(P) = E(U) = 1/2 and (ii) for any convex function h,

E{h(P)} ≤ E{h(U)},

where U is a uniform random variable on [0, 1].


Page 37: Bayesian anomaly detection for networks and Big Data

A consequence

Meng (1994) next finds: under the same conditions, P(P ≤ α) ≤ 2α for every α ∈ [0, 1] (the ‘2α bound’ referred to below).

Page 38: Bayesian anomaly detection for networks and Big Data

The convex order

Let X and Y be two random variables with probability measures µ and ν respectively. We say that µ (respectively, X) is less variable than ν (respectively, Y) in the convex order, denoted µ ≤_cx ν (or X ≤_cx Y), if, for any convex function h,

E{h(X)} ≤ E{h(Y)},

whenever the expectations exist (Shaked and Shanthikumar, 2007). A probability measure P, and a random variable distributed as P, is sub-uniform if P ≤_cx U, where U is a uniform distribution on [0, 1].

Posterior predictive p-values have a sub-uniform distribution.

Page 39: Bayesian anomaly detection for networks and Big Data

Meng’s findings could seem quite coarse:

1. The 2α bound suggests that P could be liberal, whereas we had the impression they were conservative. Can the bound be improved?

2. The theorem seems to be describing a huge space of distributions. Can it somehow be reduced?

Our results:

1. It is possible to construct a posterior predictive p-value with any sub-uniform distribution.

2. Some sub-uniform distributions achieve the 2α bound.

3. Therefore, some posterior predictive p-values achieve the 2α bound. In fact, we can construct simple examples where this happens.


Page 41: Bayesian anomaly detection for networks and Big Data

Theorem (Strassen’s theorem)

For two probability measures µ and ν on the real line the following conditions are equivalent:

1. µ ≤_cx ν;

2. there are random variables X and Y with marginal distributions µ and ν respectively such that E(Y | X) = X.

Theorem due to Strassen (1965) (see also references therein); this version due to Müller and Rüschendorf (2001).

Page 42: Bayesian anomaly detection for networks and Big Data

Theorem

Let µ and ν be two probability measures on the real line where ν is absolutely continuous. The following conditions are equivalent:

1. µ ≤_cx ν;

2. there exist random variables X and Y with marginal distributions µ and ν respectively such that E(Y | X) = X and the random variable Y | X is either singular, i.e. Y = X, or absolutely continuous, with µ-probability one.

Theorem (Posterior predictive p-values and the convex order)

P is a sub-uniform probability measure if and only if there exist random variables P, D, θ and an absolutely continuous discrepancy f(D, θ) such that

P = P{f(D∗, θ) ≥ f(D, θ) | D},

where P has measure P, D∗ is a replicate of D conditional on θ, and P(· | D) is the joint posterior distribution of (θ, D∗) given D.


Page 44: Bayesian anomaly detection for networks and Big Data

Exploring the set of sub-uniform distributions

The integrated distribution function of a random variable X with distribution function F_X is

ψ_X(x) = ∫_{−∞}^{x} F_X(t) dt.

We have (Müller and Rüschendorf, 2001):

1. ψ_X is non-decreasing and convex;

2. its right derivative ψ⁺_X(x) exists and 0 ≤ ψ⁺_X(x) ≤ 1;

3. lim_{x→−∞} ψ_X(x) = 0 and lim_{x→∞} {x − ψ_X(x)} = E(X).

Furthermore, for any function ψ satisfying these properties, there is a random variable X such that ψ is the integrated distribution function of X. The right derivative of ψ is the distribution function of X, F_X(x) = ψ⁺(x).

Let Y be another random variable with integrated distribution function ψ_Y. Then X ≤_cx Y if and only if ψ_X ≤ ψ_Y and lim_{x→∞} {ψ_Y(x) − ψ_X(x)} = 0.
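A numerical sketch of this characterization for the Beta(2,2) example used in the figures below: its integrated distribution function sits below that of the uniform, with the same limit, so Beta(2,2) ≤_cx Uniform[0, 1].

```python
import numpy as np

# Check that Beta(2,2) is sub-uniform by comparing integrated distribution
# functions on a grid (an illustrative numerical sketch).
x = np.linspace(0.0, 1.0, 1001)

F_unif = x                          # cdf of Uniform[0, 1]
F_beta = 3 * x**2 - 2 * x**3        # cdf of Beta(2, 2)

dx = x[1] - x[0]
psi_unif = np.cumsum(F_unif) * dx   # psi_X(x) = integral of F_X up to x
psi_beta = np.cumsum(F_beta) * dx

# Sub-uniformity: psi_beta <= psi_unif everywhere, with equal limits
# (both means are 1/2, so the gap vanishes at x = 1).
print(np.all(psi_beta <= psi_unif + 1e-9))   # True
print(psi_unif[-1], psi_beta[-1])            # both close to 1/2
```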



Page 48: Bayesian anomaly detection for networks and Big Data

Figure: Uniform distribution (integrated distribution function ψ(t), left; density, right)

Page 49: Bayesian anomaly detection for networks and Big Data

Figure: Some sub-uniform distributions (integrated distribution function ψ(t), left; density, right). Blue: uniform, green: Beta(2,2)

Page 50: Bayesian anomaly detection for networks and Big Data

Figure: Some sub-uniform distributions (integrated distribution function ψ(t), left; density, right). Blue: uniform, green: Beta(2,2), red: a distribution with maximal mass at 0.3

Page 51: Bayesian anomaly detection for networks and Big Data

Figure: Some constructive examples. Panels: a) Model 1 (X(0), X(1), θ = 0, θ = 1, 1 − 2α); b) Model 1 discrepancy, f(x, 0) and f(x, 1); c) Model 2, densities p(x | θ = 0) and p(x | θ = 1); d) Model 2 discrepancy, f(x, 0) and f(x, 1)

Page 52: Bayesian anomaly detection for networks and Big Data

Application: combining p-values I

Suppose P_1, ..., P_m are independent posterior predictive p-values. How can I combine them into one piece of evidence?

With ordinary p-values (i.e. independent uniform random variables):

min(P_i) ∼ Beta(1, m)

−2 Σ_{i=1}^{m} log(P_i) ∼ χ²_{2m} (Fisher's method)
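For reference, a short sketch of these two standard combinations for genuinely uniform p-values (the example p-values are arbitrary); the sub-uniform adjustments follow on the next slides.

```python
import numpy as np
from scipy.stats import beta, chi2

# Combinations for independent *uniform* p-values:
# min(P_i) ~ Beta(1, m) and -2 * sum(log P_i) ~ chi-squared with 2m d.o.f.
def combine_min(pvals):
    m = len(pvals)
    return beta.cdf(min(pvals), 1, m)        # = 1 - (1 - min p)^m

def combine_fisher(pvals):
    m = len(pvals)
    stat = -2.0 * np.sum(np.log(pvals))
    return chi2.sf(stat, df=2 * m)

pvals = [0.03, 0.40, 0.77, 0.12, 0.55]       # illustrative values
print(combine_min(pvals), combine_fisher(pvals))
```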


Page 53: Bayesian anomaly detection for networks and Big Data

Combining p-values II

Lemma

Let P_1, ..., P_m and U_1, ..., U_m each be sequences of independent sub-uniform and uniform random variables on [0, 1], respectively. For x ∈ [0, 1], let

q = 1 − (1 − x)^m = P{min(U_i) ≤ x}.

Then

P{min(P_i) ≤ x} ≤ 1 − (1 − 2x)^m = 1 − {2(1 − q)^{1/m} − 1}^m,

which is no larger than 2q and tends to 2q − q² as m → ∞. Furthermore, this bound is achievable if the P_i are independent and identically distributed.
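A sketch of how the lemma can be used in practice: a conservative combined p-value for the minimum of m independent sub-uniform (e.g. posterior predictive) p-values; the helper name is ours.

```python
def min_subuniform_pvalue(pvals):
    """Conservative combined p-value for min(P_i) when the P_i are independent
    sub-uniform (e.g. posterior predictive) p-values, using the lemma's bound
    P{min(P_i) <= x} <= 1 - (1 - 2x)^m."""
    m = len(pvals)
    x = min(pvals)
    if x >= 0.5:
        return 1.0                       # the bound is vacuous beyond x = 1/2
    return 1.0 - (1.0 - 2.0 * x) ** m

# Compare with the uniform-theory value 1 - (1 - x)^m: the adjusted value is
# larger, i.e. more conservative, as expected.
print(min_subuniform_pvalue([1e-4, 0.2, 0.6]))
```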


Page 54: Bayesian anomaly detection for networks and Big Data

Combining p-values III

Lemma (Fisher’s method is asymptotically conservative)

Let P_1, ..., P_m and U_1, ..., U_m each be sequences of independent and identically distributed sub-uniform and uniform random variables on [0, 1] respectively. For α ∈ (0, 1], let t_{α,m} be the critical value defined by

P(−2 Σ_{i=1}^{m} log(U_i) ≥ t_{α,m}) = α.

Then there exists n ∈ N such that

P(−2 Σ_{i=1}^{m} log(P_i) ≥ t_{α,m}) ≤ α,

for any m ≥ n.

Page 55: Bayesian anomaly detection for networks and Big Data

Combining p-values IV

Lemma (Sums of posterior predictive p-values)

Let P_1, ..., P_m denote m independent sub-uniform random variables with mean P̄ = m⁻¹ Σ_{i=1}^{m} P_i. Then, for 0 ≤ t ≤ 1/2,

P(1/2 − P̄ ≥ t) ≤ exp(−6mt²).

Note Hoeffding's inequality (Hoeffding, 1963):

P(1/2 − P̄ ≥ t) ≤ exp(−2mt²).
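Similarly, a sketch of turning the average of m independent sub-uniform p-values into a single conservative p-value via the lemma's bound; the helper name is ours.

```python
import math

def mean_subuniform_pvalue(pvals):
    """Conservative p-value based on the lemma's tail bound
    P(1/2 - Pbar >= t) <= exp(-6 m t^2) for independent sub-uniform P_i."""
    m = len(pvals)
    pbar = sum(pvals) / m
    t = 0.5 - pbar
    if t <= 0:
        return 1.0                       # average not below 1/2: no evidence
    return math.exp(-6.0 * m * t * t)

print(mean_subuniform_pvalue([0.01, 0.05, 0.10, 0.02]))
```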


Page 56: Bayesian anomaly detection for networks and Big Data

Some final points

1. Other attempts at combining Bayesian inference with hypothesis tests completely break down when the posterior contracts.

2. We can also characterize estimated posterior predictive p-values from MCMC samples (e.g. of two schemes proposed by Gelman et al. (1996), only one produces a sub-uniform estimate), as well as other types of p-values (e.g. discrete p-values).

3. The convex order seems to have a deeper relevance to statistical inference and Monte Carlo, see e.g. Andrieu and Vihola (2014), Goldstein et al. (2011).

Page 57: Bayesian anomaly detection for networks and Big Data

Anderson, R., Barton, C., Böhme, R., Clayton, R., Van Eeten, M. J., Levi, M., Moore, T., and Savage, S. (2013). Measuring the cost of cybercrime. In The Economics of Information Security and Privacy, pages 265–300. Springer.

Andrieu, C. and Vihola, M. (2014). Establishing some order amongst exact approximations of MCMCs. arXiv preprint arXiv:1404.6909.

Bayarri, M. and Berger, J. O. (2000). P values for composite null models. Journal of the American Statistical Association, 95(452):1127–1142.

Box, G. E. (1980). Sampling and Bayes' inference in scientific modelling and robustness. Journal of the Royal Statistical Society, Series A (General), pages 383–430.

Cabinet Office and National Security and Intelligence (2010). A Strong Britain in an Age of Uncertainty: The National Security Strategy.

Center for Strategic and International Studies (2014). Net Losses: Estimating the Global Cost of Cybercrime (Economic Impact of Cybercrime II).

Gelman, A., Meng, X.-L., and Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6(4):733–760.

Goldstein, L., Rinott, Y., and Scarsini, M. (2011). Stochastic comparisons of stratified sampling techniques for some Monte Carlo estimators. Bernoulli, 17(2):592–608.

Guttman, I. (1967). The use of the concept of a future observation in goodness-of-fit problems. Journal of the Royal Statistical Society, Series B (Methodological), pages 83–100.

Hjort, N. L., Dahl, F. A., and Steinbakk, G. H. (2006). Post-processing posterior predictive p values. Journal of the American Statistical Association, 101(475):1157–1174.

Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30.

Huelsenbeck, J. P., Ronquist, F., Nielsen, R., and Bollback, J. P. (2001). Bayesian inference of phylogeny and its impact on evolutionary biology. Science, 294(5550):2310–2314.

Meng, X.-L. (1994). Posterior predictive p-values. The Annals of Statistics, 22(3):1142–1160.

Müller, A. and Rüschendorf, L. (2001). On the optimal stopping values induced by general dependence structures. Journal of Applied Probability, 38(3):672–684.

Neil, J., Hash, C., Brugh, A., Fisk, M., and Storlie, C. B. (2013). Scan statistics for the online detection of locally anomalous subgraphs. Technometrics, 55(4):403–414.

Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. The Annals of Statistics, 12(4):1151–1172.

Rubin-Delanchy, P. and Heard, N. A. (2014). A test for dependence between two point processes on the real line. arXiv preprint arXiv:1408.3845.

Rubin-Delanchy, P. and Lawson, D. J. (2014). Posterior predictive p-values and the convex order. arXiv preprint arXiv:1412.3442.

Shaked, M. and Shanthikumar, J. G. (2007). Stochastic Orders. Springer.

Sinharay, S. and Stern, H. S. (2003). Posterior predictive model checking in hierarchical models. Journal of Statistical Planning and Inference, 111(1):209–221.

Steinbakk, G. H. and Storvik, G. O. (2009). Posterior predictive p-values in Bayesian hierarchical models. Scandinavian Journal of Statistics, 36(2):320–336.

Strassen, V. (1965). The existence of probability measures with given marginals. The Annals of Mathematical Statistics, 36(2):423–439.

The Boston Consulting Group (2012). The Internet Economy in the G-20.

Thornton, K. and Andolfatto, P. (2006). Approximate Bayesian inference reveals evidence for a recent, severe bottleneck in a Netherlands population of Drosophila melanogaster. Genetics, 172(3):1607–1619.