Realistic Synthetic Generation Allows Secure Development

Post on 28-Jul-2015


© 2014 MapR Technologies

Realistic Synthetic Data Allows Secure Development

Ted Dunning

June 11, 2015

Who am I?

Ted Dunning, Chief Applications Architect MapR Technologies

Email tdunning@mapr.com tdunning@apache.org

Twitter @Ted_Dunning

Short Books by Ted Dunning & Ellen Friedman

• Published by O’Reilly in 2014 and 2015
• For sale from Amazon or O’Reilly
• Free e-books currently available courtesy of MapR

http://bit.ly/ebook-real-world-hadoop

http://bit.ly/mapr-tsdb-ebook

http://bit.ly/ebook-anomaly

http://bit.ly/recommendation-ebook

The basic idea

Anomaly Detection and Fraud Analytics

• Financial customer wants to identify zero-day attacks

• And advanced persistent threats

• By sophisticated adversaries who don’t use known vectors

• Must keep logs and other data secret
– But must also collaborate on detection algorithms

Secure Development is Hard

Outside collaborators are outside the security perimeter

They can’t see the data and they can’t tune new algorithms to fit reality

How To Make Realistic Data

Parametric Simulation

Parametric matching of failure signatures allows emulation of complex data properties

Matching on KPIs and failure modes guarantees practical fidelity

Do’s and Don’ts

• Do match the KPIs and failure modes
– Speed
– Score distribution
– False positive rates versus score

• Don’t try to match the actual data distribution precisely
– Good enough is good enough and we want to imitate failures, not create new life forms
– Probably impossible to do precisely
– Even if possible, it is vastly harder to match distributions
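
A minimal sketch of what "matching the KPIs" could look like in practice: compare the false positive rate versus score threshold, and a few score quantiles, between the real and the synthetic data. The helper `fpr_by_threshold`, the column names, and both simulated data sets are assumptions made for illustration; in a real setting the "real" scores would stay inside the security perimeter and only summary curves like these would be shared.

```r
# Hypothetical helper: false positive rate at each score threshold.
# `score` is a detector score, `is_attack` is the ground-truth label.
fpr_by_threshold <- function(score, is_attack, thresholds) {
  negatives <- score[!is_attack]
  sapply(thresholds, function(t) mean(negatives > t))
}

set.seed(1)
n <- 10000
# Stand-ins for the two data sets (both simulated here).
real_attack  <- runif(n) < 0.01
real_score   <- rnorm(n, mean = ifelse(real_attack, 2, 0))
synth_attack <- runif(n) < 0.01
synth_score  <- rnorm(n, mean = ifelse(synth_attack, 2, 0))

# KPI 1: false positive rate versus score threshold.
thresholds <- seq(0, 3, by = 0.5)
kpi <- data.frame(
  threshold = thresholds,
  real_fpr  = fpr_by_threshold(real_score, real_attack, thresholds),
  synth_fpr = fpr_by_threshold(synth_score, synth_attack, thresholds)
)
print(kpi)

# KPI 2: score distribution, compared by quantiles rather than point by point.
probs <- c(0.5, 0.9, 0.99)
print(rbind(real  = quantile(real_score, probs),
            synth = quantile(synth_score, probs)))
```

If the two curves and the quantiles agree to within the tolerance the application cares about, the synthetic data is "good enough" in the sense of the list above, even though the underlying distributions are not matched exactly.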

Methods for Generating Numbers

• Well-known distributions
– Uniform, normal, gamma, Poisson
– Truncations

• Cumulations
– Random walk v1

• Mixture distributions

• Hyper-parameters
– Random walk v2
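
The first bullet can be sketched with base R samplers. The parameter values and the rejection-based helper `rtrunc_norm` are assumptions chosen for illustration, not something taken from the talk.

```r
set.seed(42)
n <- 10000

u <- runif(n, min = 0, max = 1)        # uniform
z <- rnorm(n, mean = 0, sd = 1)        # normal
g <- rgamma(n, shape = 2, rate = 1)    # gamma
k <- rpois(n, lambda = 3)              # Poisson

# Truncation by rejection: keep drawing until n values fall inside [lo, hi].
# Hypothetical helper; fine when the interval has non-negligible probability.
rtrunc_norm <- function(n, lo, hi, mean = 0, sd = 1) {
  out <- numeric(0)
  while (length(out) < n) {
    draw <- rnorm(n, mean, sd)
    out <- c(out, draw[draw >= lo & draw <= hi])
  }
  out[seq_len(n)]
}
tz <- rtrunc_norm(n, lo = -1, hi = 2)
```

Rejection is the simplest way to truncate; it becomes slow if the interval covers very little of the distribution, in which case inverting the CDF is the usual alternative.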

Normal

data = data.frame(x=rnorm(10000), y=rnorm(10000))

Mixture of Normals
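
A minimal sketch of sampling from a mixture of normals: first pick a component for each point, then draw from that component. The weights, means and standard deviations below are assumed values for illustration only.

```r
set.seed(42)
n <- 10000

# Assumed component parameters, chosen only for illustration.
weights <- c(0.6, 0.3, 0.1)
means   <- c(-2, 1, 5)
sds     <- c(0.5, 1, 2)

# Step 1: pick a component index for every point.
component <- sample(length(weights), n, replace = TRUE, prob = weights)
# Step 2: draw each point from its own component.
x <- rnorm(n, mean = means[component], sd = sds[component])
```

Plotting `hist(x, breaks = 100)` should show three overlapping bumps whose areas follow `weights`.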

Random Walk

y = cumsum(rnorm(10000))

Pick Mean from Multinomial
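
One possible reading of this title, sketched under assumptions: a single multinomial draw decides how many points use each candidate mean, and each point is then drawn around its assigned mean. The candidate means and probabilities are made up for illustration.

```r
set.seed(7)
n <- 10000
means <- c(-3, 0, 4)          # assumed candidate means
prob  <- c(0.2, 0.5, 0.3)     # assumed multinomial probabilities

# One multinomial draw gives how many of the n points use each mean.
counts <- as.vector(rmultinom(1, size = n, prob = prob))
# Expand the counts into one mean per point, shuffled so means are interleaved.
mu <- sample(rep(means, times = counts))
# Draw each point around its own mean.
y <- rnorm(n, mean = mu, sd = 1)
```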

Random Walk with Variable Standard Deviation

y = cumsum(rt(10000, df=0.9))
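
The link between the title and the `rt` call is that a Student's t variate is a normal variate whose standard deviation is itself random. A sketch that makes this hyper-parameter step explicit (the variable names are illustrative):

```r
set.seed(3)
n  <- 10000
df <- 0.9

# Hyper-parameter step: each increment gets its own standard deviation.
sd_i <- sqrt(df / rchisq(n, df = df))
# Conditional on sd_i the increment is normal; marginally it is Student's t
# with `df` degrees of freedom, the same distribution as rt(n, df = df).
steps <- rnorm(n, mean = 0, sd = sd_i)
y <- cumsum(steps)
```

With `df` below 1 the increments are extremely heavy-tailed, so the walk is dominated by a few very large jumps, which is useful for imitating bursty or failure-like behavior.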

Methods for Generating Symbols

• Symbols are really just integers with a dictionary

• Well-known distributions
– Multinomial
– Dirichlet processes
– Rich-get-richer, Pitman-Yor

• Mixture distributions

• Hyper-parameters

• Lookup tables!!!
– Simple tables
– Data table joins for correlated components
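
A sketch of the "integers with a dictionary" idea, using a sequential (Chinese-restaurant style) sampler for a Pitman-Yor process. The function `rpitman_yor`, its default parameters and the `host-` dictionary are assumptions for illustration; this is not how any particular tool implements it.

```r
# Sequentially sample n symbol ids from a Pitman-Yor process.
# discount = 0 reduces to a Dirichlet process (plain rich-get-richer).
rpitman_yor <- function(n, alpha = 1, discount = 0.5) {
  counts <- integer(0)          # how often each id has been seen so far
  ids <- integer(n)
  for (i in seq_len(n)) {
    k <- length(counts)
    # Existing id j has weight (counts[j] - discount);
    # a brand-new id has weight (alpha + discount * k).
    w <- c(counts - discount, alpha + discount * k)
    pick <- sample.int(k + 1, 1, prob = w)
    if (pick > k) {
      counts <- c(counts, 1L)
    } else {
      counts[pick] <- counts[pick] + 1L
    }
    ids[i] <- pick
  }
  ids
}

set.seed(11)
ids <- rpitman_yor(5000)

# Hypothetical dictionary: the symbol is just a lookup on the integer id.
dictionary <- paste0("host-", seq_len(max(ids)))
symbols <- dictionary[ids]
print(head(sort(table(symbols), decreasing = TRUE)))
```

A few symbols end up very common while new ones keep appearing at a slowing rate, which is the long-tailed shape typical of host names, user agents and similar fields.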

Skewed Integers

count  value
207    3
203    0
198    7
196    4
195    12
193    10
189    2
187    1
185    13
179    6
178    9
177    5
177    25
174    21
173    8
173    14
170    18

[{"name": "x", "class": "int", "skew": 1}]

Methods for Generating Behaviors

• Use structured data!
– Generate user meta-data
– Generate list of transactions

• Only flatten if necessary

• See Apache Drill for post-processing
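
The structure described above can be sketched as a list of users, each carrying its own meta-data and its own list of transactions, with flattening as a separate, optional step. The function `make_user` and every field name (`zip`, `risk`, `t`, `amount`) are assumptions for illustration.

```r
set.seed(5)

# Hypothetical generator for one user: meta-data plus nested transactions.
make_user <- function(id) {
  n_tx <- rpois(1, lambda = 3) + 1                  # at least one transaction
  list(
    id = id,
    meta = list(
      zip  = sprintf("%05d", sample.int(99999, 1)), # assumed field
      risk = rgamma(1, shape = 2)                   # assumed field
    ),
    transactions = data.frame(
      t      = sort(runif(n_tx, 0, 86400)),         # seconds within one day
      amount = round(rexp(n_tx, rate = 1 / 50), 2)
    )
  )
}
users <- lapply(1:100, make_user)

# Flatten only when a downstream tool needs a flat table:
# one row per transaction, user fields repeated on each row.
flat <- do.call(rbind, lapply(users, function(u) {
  data.frame(user_id = u$id, risk = u$meta$risk, u$transactions)
}))
```

Keeping the nested form as the primary output means per-user properties are generated once and stay consistent across all of that user's transactions.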

Methods for Generating Databases

• Use integers (see previous) as foreign keys
• Normalized form implies (approximate) independence of tables
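
A sketch of two independently generated tables linked by a skewed integer foreign key. The table names, columns and the 1/rank key weighting are assumptions for illustration.

```r
set.seed(9)
n_customers <- 1000
n_orders    <- 5000

# Parent table: generated on its own.
customers <- data.frame(
  customer_id = seq_len(n_customers),
  segment     = sample(c("retail", "business"), n_customers, replace = TRUE)
)

# Skewed integer keys: low ids are drawn far more often than high ids.
# The 1/rank weighting is an assumption made for this example.
key_weights <- 1 / seq_len(n_customers)

# Child table: generated independently, linked only through the integer key.
orders <- data.frame(
  order_id    = seq_len(n_orders),
  customer_id = sample.int(n_customers, n_orders, replace = TRUE,
                           prob = key_weights),
  amount      = round(rexp(n_orders, rate = 1 / 40), 2)
)

# Every foreign key resolves, so the join keeps all orders.
joined <- merge(orders, customers, by = "customer_id")
```

Because each table is generated without looking at the other, the only dependence between them is the key itself, which is the approximate independence the second bullet refers to.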

Go get log-synth

https://github.com/tdunning/log-synth

A worked example...

Simulation Setup

Questions?

Thank You

Ted Dunning, Chief Application Architect, MapR Technologies

tdunning@mapr.com
tdunning@apache.org

@mapr, maprtech, mapr-technologies