DLP Systems: Models, Architecture and Algorithms


Transcript of DLP Systems: Models, Architecture and Algorithms

Page 1: DLP Systems: Models, Architecture and Algorithms

Copyright 2011 Trend Micro Inc. Classification 8/2/2013 1

DLP Systems: Models, Architecture and Algorithms

Liwei Ren, Ph.D, Sr. Architect

Data Security Research, Trend Micro™

May, 2013, UCSC, Santa Cruz, CA

Page 2: DLP Systems: Models, Architecture and Algorithms

Backgrounds:

• Liwei Ren, Data Security Research, Trend Micro™

– Research interests:

• DLP, differential compression, data de-duplication, file transfer protocols, database security, and practical algorithms.

– Education:

• MS/BS in mathematics, Tsinghua University, Beijing

• Ph.D. in mathematics, MS in information science, University of Pittsburgh

– Relevant work for this talk:

• Provilla, Inc.: a startup focusing on endpoint-based DLP products and solutions. It was co-founded by Liwei and acquired by Trend Micro a few years ago.

• Patents: Liwei holds 10+ patents for DLP, mostly for DLP content inspection techniques.

• Trend Micro™

– Global security software company with headquarters in Tokyo and R&D centers in Nanjing, Taipei and Silicon Valley.

– One of the top 3 anti-malware vendors

– Pioneer in cloud security

– DLP vendor via the Provilla™ acquisition

Page 3: DLP Systems: Models, Architecture and Algorithms


Agenda

• What is Data Loss Prevention (DLP) ?

• Concepts, Models, Architecture

• Content Inspection Problems

• Practical Algorithms for DLP

• Summary

• References

• Q&A

Page 4: DLP Systems: Models, Architecture and Algorithms

What Is Data Loss Prevention?

• What is Data Loss Prevention?

– Data loss prevention (DLP) is a data security technology that detects data breach incidents in a timely manner and prevents them by monitoring data-in-use (endpoints), data-in-motion (network traffic), and data-at-rest (data storage) in an organization's network.

– Also known as Data Leak Prevention (DLP), Information Leak Prevention (ILP), or Information Leak Detection and Prevention (ILDP).

Page 5: DLP Systems: Models, Architecture and Algorithms

What Is Data Loss Prevention?

• A few elements of a DLP system:

– WHAT data to protect?

– WHO leaks data?

– HOW is the data leaked?

– WHERE to protect data?

– WHAT actions to take?

Page 6: DLP Systems: Models, Architecture and Algorithms

Concepts, Models and Architecture

• WHAT data to protect?


• WHO causes data leaks?

External Hackers

Page 7: DLP Systems: Models, Architecture and Algorithms


Concepts, Models and Architecture

Three Data States:

Page 8: DLP Systems: Models, Architecture and Algorithms

Concepts, Models and Architecture

• Data-in-use:


• Data-in-motion:

Page 9: DLP Systems: Models, Architecture and Algorithms


Concepts, Models and Architecture

• Data-at-rest at risk:

Page 10: DLP Systems: Models, Architecture and Algorithms

Concepts, Models and Architecture

• DLP for data-in-use and data-in-motion:


• A conceptual view!

Page 11: DLP Systems: Models, Architecture and Algorithms


Concepts, Models and Architecture

• DLP for data-in-use and data-in-motion:


• A technical view!

Page 12: DLP Systems: Models, Architecture and Algorithms


Concepts, Models and Architecture

• DLP model for data-in-use and data-in-motion:

– If DATA flows from SOURCE to DESTINATION via CHANNEL, the system takes ACTIONs.

– DATA specifies what the confidential data is.

– SOURCE can be a user, an endpoint, an email address, or a group of them.

– DESTINATION can be an endpoint, an email address, a group of them, or simply the external world.

– CHANNEL indicates the data leak channel, such as USB, email, or network protocols.

– ACTION is the action that the DLP system needs to take when an incident occurs.
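The model above can be sketched as a small rule evaluator. This is only an illustration under stated assumptions: the names `Event`, `Rule`, `evaluate` and the `"*"` wildcard are hypothetical and do not come from the slides or from any product.

```python
from dataclasses import dataclass
from typing import Callable, List, Set

ANY = "*"  # wildcard used by this sketch only; not part of the slides


@dataclass(frozen=True)
class Event:
    """One observed data flow: DATA moving from SOURCE to DESTINATION via CHANNEL."""
    data: str
    source: str
    destination: str
    channel: str


@dataclass
class Rule:
    """A policy rule: which data, sources, destinations and channels trigger which actions."""
    is_sensitive: Callable[[str], bool]  # stands in for content inspection of DATA
    sources: Set[str]
    destinations: Set[str]
    channels: Set[str]
    actions: List[str]

    def matches(self, event: Event) -> bool:
        def allowed(value: str, allowed_values: Set[str]) -> bool:
            return ANY in allowed_values or value in allowed_values

        return (
            self.is_sensitive(event.data)
            and allowed(event.source, self.sources)
            and allowed(event.destination, self.destinations)
            and allowed(event.channel, self.channels)
        )


def evaluate(event: Event, rules: List[Rule]) -> List[str]:
    """Collect the ACTIONs of every rule that the event triggers."""
    actions: List[str] = []
    for rule in rules:
        if rule.matches(event):
            actions.extend(rule.actions)
    return actions
```

In this sketch the `is_sensitive` predicate is where the content inspection techniques discussed later in the talk would plug in.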

Page 13: DLP Systems: Models, Architecture and Algorithms


Concepts, Models and Architecture

• DLP for data-at-rest:

Page 14: DLP Systems: Models, Architecture and Algorithms

Concepts, Models and Architecture

• DLP model for data-at-rest:

– If DATA resides at SOURCE, the system takes ACTIONs.

– DATA specifies what the sensitive data (which has potential for leakage) is.

– SOURCE can be an endpoint, a storage server, or a group of them.

– ACTION is the action that the DLP system needs to take when confidential data is identified at rest.

Page 15: DLP Systems: Models, Architecture and Algorithms


Concepts, Models and Architecture

• Components of a typical DLP system:

– DLP Management Console

– DLP Endpoint Agent

– DLP Network Gateway

– Data Discovery Agent (or Appliance)

Page 16: DLP Systems: Models, Architecture and Algorithms

Concepts, Models and Architecture

• Typical DLP system architecture:

Page 17: DLP Systems: Models, Architecture and Algorithms

Agenda

• What is Data Loss Prevention (DLP) ?

• Concepts, Models, Architecture

• Content Inspection Problems

• Practical Algorithms for DLP

• Summary

• References

• Q&A

Page 18: DLP Systems: Models, Architecture and Algorithms

Content Inspection Problems

• Two fundamental problems for a DLP system:

• They are a pair of problems that always come together:

• One defines the sensitive data; the other determines data sensitivity based on what has been defined.

Page 19: DLP Systems: Models, Architecture and Algorithms


Content Inspection Problems

• Four typical approaches for <defining, determining> sensitive data in a DLP system:


1. Document fingerprinting

2. Database record fingerprinting

3. Multiple Keyword matching

4. Regular expression matching

Page 20: DLP Systems: Models, Architecture and Algorithms


Content Inspection Problems

• Document fingerprinting:

• A technique for identifying modified versions of known documents

• Problem definition (Model 1):

– Let S = {T1, T2, …, Tn} be a set of known texts.

– Given a query text T, one needs to determine whether there exists at least one document t ∈ S such that T and t share a significant amount of common textual content. When multiple documents are returned, they are ranked by how much content they share with T.

Page 21: DLP Systems: Models, Architecture and Algorithms

Content Inspection Problems

• An alternative model (Model 2):

– Let S = {T1, T2, …, Tn} be a set of known texts.

– Given a query text T and a threshold X%, one needs to determine whether there exists at least one text t ∈ S such that SIM(T,t) ≥ X%, where SIM() is a function that measures the similarity between two texts.

• Multiple matching documents are ranked by their similarity percentages.

Page 22: DLP Systems: Models, Architecture and Algorithms

Content Inspection Problems

• Database record fingerprinting:

– A technique for identifying sensitive data records within a text.

– Also known as Exact Match in the DLP field.

• Use case:

– Several personal data records of the form <SSN, Phone#, address> are included in a text. We want to extract all such records from the text to determine the sensitivity of the file.

Page 23: DLP Systems: Models, Architecture and Algorithms

Content Inspection Problems

An example: a text contains a few data records:

Hhhhhdds ghghg 178-76-6754 ggkjkfddfdkkkk879-45-6785kjkjjk 43 Atword Street, Pittsburgh, PA 15260 kllkll 412-876-6789 kjkjjkj 76 Parkview Ave, Sunnyvale, CA 94086 hhsjskkdhjhjhj 408-780-8876hjhjkjkjjj 159-87-8965 hjhjhjhjmnnmnxcbls w243 54y45 wefddewdddw3n nn xxxxxxxxxx

SSN | Phone # | Address
178-76-6754 | 412-876-6789 | 43 Atword Street, Pittsburgh, PA 15260
159-87-8965 | 408-780-8876 | 76 Parkview Ave, Sunnyvale, CA 94086
…… | …… | ……

Page 24: DLP Systems: Models, Architecture and Algorithms


Content Inspection Problems

• Problem definition (Model 3):

– Let S = {R1, R2, …, Rn} be a set of known data records from the same table.

– Given any text T, one needs to extract all records or sub-records from T, where the record cells may appear in any order and at arbitrary positions within the text.
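Model 3 can be illustrated with a deliberately naive sketch. It assumes plain substring search for each cell rather than the fingerprint-based technique the slides refer to, and the function name `extract_records` and the `min_cells` parameter are hypothetical.

```python
from collections import defaultdict


def extract_records(text, records, min_cells=2):
    """Return {record_index: [cells found in text]} for every known record
    that has at least `min_cells` of its cells somewhere in the text.

    Cells may occur in any order and at any position; each cell is located
    with a plain substring test, which costs one scan of the text per cell.
    """
    found = defaultdict(list)
    for index, record in enumerate(records):
        for cell in record:
            if cell in text:
                found[index].append(cell)
    return {index: cells for index, cells in found.items() if len(cells) >= min_cells}
```

A record with only some cells present is a sub-record in the sense of Model 3. The per-cell scan shown here does not scale to large tables, which is the motivation for fingerprinting the cells instead.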

Page 25: DLP Systems: Models, Architecture and Algorithms

Content Inspection Problems

• Problem definition for keyword match:

– Let S = {K1, K2, …, Kn} be a dictionary of keywords.

– Given any text T, one needs to identify all keyword occurrences in T.

• Problem definition for RegEx match:

– Let S = {P1, P2, …, Pm} be a set of RegEx patterns.

– Given any text T, one needs to identify all pattern instances in T.

• Easy problems?

– Not at all! For large n and m, one will have performance issues.

– That is the problem of scalability.

– Scalable algorithms must be provided.
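As a small illustration of the RegEx match problem, the sketch below scans a text with a set of patterns using Python's standard `re` module. The two patterns (US-style SSN and phone number) and the function name `find_pattern_instances` are assumptions made for this example only.

```python
import re

# Illustrative patterns only. The lookarounds keep a match from starting or
# ending in the middle of a longer digit run.
PATTERNS = {
    "ssn": re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)"),
    "phone": re.compile(r"(?<!\d)\d{3}-\d{3}-\d{4}(?!\d)"),
}


def find_pattern_instances(text, patterns=PATTERNS):
    """Return all (start_offset, pattern_name, matched_text) tuples, sorted by offset."""
    hits = []
    for name, regex in patterns.items():
        for match in regex.finditer(text):
            hits.append((match.start(), name, match.group()))
    return sorted(hits)
```

Note that this loop scans the text once per pattern, so its cost grows linearly with m. That is exactly the scalability concern raised on the slide.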

Page 26: DLP Systems: Models, Architecture and Algorithms


Agenda

• What is Data Loss Prevention (DLP) ?

• Concepts, Models, Architecture

• Content Inspection Problems

• Practical Algorithms for DLP

• Summary

• References

• Q&A

Page 27: DLP Systems: Models, Architecture and Algorithms

Practical Algorithms for DLP

• We investigate some algorithms for two problems:

1. Document fingerprinting

2. Multiple keyword matching

Assumption: without loss of generality, a text T is a sequence of UTF-8 characters.

Page 28: DLP Systems: Models, Architecture and Algorithms


Document Fingerprinting Algorithms

• Let's investigate algorithmic solutions for Model 2 (document fingerprinting).

• Analysis for a solution:

1. We need to construct the function SIM(T,t). For example:

– SIM(T,t) = |T ∩ t| / Min(|T|,|t|), based on common substrings.

2. An obvious challenge:

– If n is large, say on the scale of millions, we cannot compute SIM(T, Tk) one by one to find the t that satisfies SIM(T,t) ≥ X%.

– We need an approach that can identify possible candidates quickly.

3. General search engines like Google use keywords to index and identify documents. Should we? The answer is NO: there are too many keywords, and keywords are language dependent.

4. So, which features can we use for indexing and searching?

– One answer is document fingerprints.
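The example SIM(T,t) from item 1 can be sketched as follows. The slide does not say how |T ∩ t| is computed, so this sketch assumes it is the total length of the common blocks found by `difflib.SequenceMatcher`, ignoring blocks shorter than a hypothetical `min_block` characters. The brute-force `candidates` helper shows the one-by-one scan that item 2 warns against.

```python
from difflib import SequenceMatcher


def sim(query, known, min_block=4):
    """Approximate SIM(T,t) = |T ∩ t| / Min(|T|,|t|) using common substrings.

    |T ∩ t| is taken to be the summed length of matching blocks of at least
    `min_block` characters (an assumption of this sketch).
    """
    if not query or not known:
        return 0.0
    matcher = SequenceMatcher(None, query, known, autojunk=False)
    common = sum(
        block.size for block in matcher.get_matching_blocks() if block.size >= min_block
    )
    return common / min(len(query), len(known))


def candidates(query, known_texts, threshold):
    """Brute force: score every known text, keep those with SIM >= threshold,
    ranked from most to least similar."""
    scored = [(sim(query, text), text) for text in known_texts]
    return sorted([pair for pair in scored if pair[0] >= threshold], reverse=True)
```

Scoring every known text this way is fine for a handful of documents but not for millions, which is why a fingerprint index is needed to narrow down the candidates first.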

Page 29: DLP Systems: Models, Architecture and Algorithms

Document Fingerprinting Algorithms

• What are document fingerprints?

– A fingerprint is a hash value.

– One text has multiple fingerprints.

– Uniqueness: fingerprints are unique to the text; two unrelated texts do not share any fingerprints.

– Robustness: fingerprints can survive moderate textual changes.

Page 30: DLP Systems: Models, Architecture and Algorithms


Document Fingerprinting Algorithms

• How to extract fingerprints from a text?

– Anchoring point:

• A point in the text that can endure moderate changes.

• Its neighborhood (of fixed size) is unique to the text.

– We select a few anchoring points to generate fingerprints:

• Generate hash values from their neighborhoods.

• These hash values are the fingerprints.

• Samples of anchoring points and their neighborhoods:

ThereareabundantliteraturesonhowtogeneratedifferencebetweentwofilesBasicallytherearetwofundamentalapproachestoattackthisgenericproblemLCSmodelwhereLCSstandsforlargestcommonsubsequenceCalculatethelargestcommonsubsequenceoftwostringFindasequenceofeditoperationsbasedontheLCSsothatonecanapplytheeditoperationstothereferencefiletoconstructthetargetfileBlockmovemodel

Page 31: DLP Systems: Models, Architecture and Algorithms


Document Fingerprinting Algorithms

• Conclusion : we have a solution that consists of two algorithms and one search technology:

– An algorithm for computing SIM(T,t)

– An algorithm for fingerprint generator FPGEN(T)

– Fingerprint search engine

Page 32: DLP Systems: Models, Architecture and Algorithms

Document Fingerprinting Algorithms

• Fingerprint generation algorithm 1:

– INPUT: String T

• Select the top M candidate characters based on a score function. For each character c:

– Character frequency: n

– Character positions in the text T: P(1), …, P(n)

– $\mathrm{SCORE}(c) = \sqrt{n} \cdot [P(n) - P(1)] / \sqrt{D}$

» where $D = [P(2)-P(1)]^2 + [P(3)-P(2)]^2 + \dots + [P(n)-P(n-1)]^2$

• For each selected character c:

– Create a hash of the neighborhood at each occurrence

– Sort these hashes

– Select the top N hashes

– These N hashes are fingerprints

– OUTPUT: M*N fingerprints

Note 1: M and N are pre-defined.

Note 2: The two keys of this algorithm are (a) the score function and (b) sorting the hashes.
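The steps above can be sketched in a few lines. Several details are assumptions of this sketch and not taken from the slides: the neighborhood is `radius` characters on each side of an occurrence, the hash is CRC-32 from `zlib`, "top N" is read as the N smallest hash values after sorting, and characters that occur fewer than twice get a score of zero.

```python
import math
import zlib
from collections import defaultdict


def score(positions):
    """SCORE(c) = sqrt(n) * [P(n) - P(1)] / sqrt(D) for one character's positions."""
    n = len(positions)
    if n < 2:
        return 0.0  # D would be zero; treated as unusable in this sketch
    d = sum((b - a) ** 2 for a, b in zip(positions, positions[1:]))
    return math.sqrt(n) * (positions[-1] - positions[0]) / math.sqrt(d)


def fpgen(text, m=3, n_hashes=4, radius=8):
    """Return up to m * n_hashes fingerprints for `text`."""
    positions = defaultdict(list)
    for index, ch in enumerate(text):
        positions[ch].append(index)

    # Top M characters by score; ties are broken by the character itself
    # so that the result is deterministic.
    top_chars = sorted(positions, key=lambda c: (-score(positions[c]), c))[:m]

    fingerprints = []
    for ch in top_chars:
        # Hash the neighborhood around every occurrence, then sort the hashes.
        hashes = sorted(
            {
                zlib.crc32(text[max(0, p - radius): p + radius + 1].encode("utf-8"))
                for p in positions[ch]
            }
        )
        fingerprints.extend(hashes[:n_hashes])  # keep N hashes per character
    return fingerprints
```

Sorting the hashes before truncating makes the selection independent of where the occurrences sit in the text, which is one way to read the remark that sorting is a key of the algorithm.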

Page 33: DLP Systems: Models, Architecture and Algorithms


Document Fingerprinting Algorithms

• About the score function:

– Why $\sqrt{n}$?

• It measures the frequency of the given character.

• The larger the value, the more stable the character is.

– Why $[P(n) - P(1)] / \sqrt{D}$?

• It measures the distribution of the given character.

• The larger the value, the more evenly distributed the character is, and the more stable it is.

• WHY? Think about a constrained optimization problem:

– $\min f(X_1, X_2, \dots, X_m) = X_1^2 + X_2^2 + \dots + X_m^2$

» subject to

» $X_1 + X_2 + \dots + X_m = c$ AND

» $X_k \ge 0,\ k = 1, 2, \dots, m$

Note: The solution of the optimization problem is $X_k = c/m$, $k = 1, 2, \dots, m$. Here the $X_k$ play the role of the gaps $P(k+1) - P(k)$, so D is smallest when the occurrences are evenly spaced.
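The note can be checked with the Cauchy-Schwarz inequality. The following is a sketch under the assumption that $X_k$ stands for the gap $P(k+1) - P(k)$ between consecutive occurrences, so that $m = n - 1$ and the constant is $c = P(n) - P(1)$; this mapping is our reading and is not spelled out on the slide.

```latex
\begin{aligned}
c^2 = \Big(\sum_{k=1}^{m} X_k\Big)^2 &\le m \sum_{k=1}^{m} X_k^2 = m\,D
  \quad\Longrightarrow\quad D \ge \frac{c^2}{m},
  \text{ with equality iff } X_k = \frac{c}{m},\\
\mathrm{SCORE} = \sqrt{n}\,\frac{P(n)-P(1)}{\sqrt{D}} &\le \sqrt{n}\,\frac{c}{c/\sqrt{m}}
  = \sqrt{n\,(n-1)} .
\end{aligned}
```

So, for a fixed frequency n, the score reaches its maximum exactly when the occurrences of the character are evenly spaced.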

Page 34: DLP Systems: Models, Architecture and Algorithms


Document Fingerprinting Algorithms

There are alternative algorithms for constructing a fingerprint generation function.

We recently constructed algorithm 2:

– A novel approach based on a rolling hash function H(x).

– It selects anchoring points with a first filter, H(x) = 0 mod p.

– It further selects anchoring points with a heuristic second filter.

– It also employs an asymmetric architecture for fingerprint matching.

Note 1: The anchoring points are better distributed across the text.

Note 2: The two keys of this algorithm are (a) the rolling hash and (b) the asymmetric use of two filters.
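Only the first filter is specified on the slide, so the sketch below illustrates that step alone: a polynomial rolling hash over a sliding window, keeping the window start offsets where H(x) = 0 mod p. The window size, base, modulus and default p are assumptions of this sketch; the heuristic second filter and the asymmetric matching are not shown because the slides do not describe them.

```python
def anchor_points(text, window=8, p=16, base=257, modulus=(1 << 61) - 1):
    """Return byte offsets i where the rolling hash of data[i:i+window] is 0 mod p."""
    data = text.encode("utf-8")
    if len(data) < window:
        return []

    high = pow(base, window - 1, modulus)  # weight of the byte leaving the window

    # Hash of the first window, computed by Horner's rule.
    h = 0
    for byte in data[:window]:
        h = (h * base + byte) % modulus

    anchors = [0] if h % p == 0 else []
    for i in range(1, len(data) - window + 1):
        # Slide the window by one byte: drop data[i-1], append data[i+window-1].
        h = ((h - data[i - 1] * high) * base + data[i + window - 1]) % modulus
        if h % p == 0:
            anchors.append(i)
    return anchors
```

Because the decision at each offset depends only on the bytes inside the window, an edit elsewhere in the text shifts the anchors without changing which neighborhoods are selected. On average about one offset in p passes the filter.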

Page 35: DLP Systems: Models, Architecture and Algorithms


Multiple Keyword Match

Essentially, it is a multi-pattern string matching problem.

Problem definition:

– Let S = {P1, P2, …, Pk} be a set of short strings used as patterns.

– Given any string T, one needs to identify all pattern occurrences in T.

Page 36: DLP Systems: Models, Architecture and Algorithms


Multiple Keyword Match

Existing string matching algorithms:

Algorithm | Type
Naïve string match | One pattern
Knuth–Morris–Pratt | One pattern
Boyer-Moore | One pattern
Boyer-Moore-Horspool | One pattern
Boyer-Moore-Horspool-Raita | One pattern
Rabin-Karp | Multi-pattern
Aho-Corasick | Multi-pattern
Wu-Manber | Multi-pattern

Page 37: DLP Systems: Models, Architecture and Algorithms


Multiple Keyword Match

Boyer-Moore-Horspool (BMH) Algorithm

Key elements of the algorithm:

– Character comparison is made from right to left, starting from the end of the pattern.

– Ending character heuristics

• Suppose we are pointing at character R[i] and trying to compare it with the ending character of P, P[m].

• Bad character

– If R[i] ≠ P[m] and R[i] is not included in P's alphabet, then it is safe for the pointer to skip m positions, arriving at R[i+m].

– If R[i] ≠ P[m], R[i] is included in P's alphabet, and R[i]'s last occurrence within P has distance q from the end of P, then it is safe for the pointer to skip q positions, arriving at R[i+q].

• Good character

– If R[i] = P[m], P is not matched, and R[i] has no other occurrences within P, then it is safe for the pointer to skip m positions, arriving at R[i+m].

– If R[i] = P[m], P is not matched, and R[i]'s last occurrence other than P[m] has distance q from the end of P, then it is safe for the pointer to skip q positions, arriving at R[i+q].

• Matched instance

– If R[i] = P[m] and P is matched, then save the instance.

– It is almost safe to move the pointer to skip m positions, arriving at R[i+m].
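The heuristics above all reduce to one shift table indexed by the text character aligned with the end of the pattern. Below is a minimal sketch of the standard Horspool formulation; the function name `bmh_search` is ours, and after a match it shifts by the table value rather than by m, so that overlapping occurrences are not missed.

```python
def bmh_search(text, pattern):
    """Return the start offsets of all occurrences of `pattern` in `text`."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []

    # shift[c] = distance q from the last occurrence of c in pattern[:-1]
    # to the end of the pattern; characters not in the table shift by m.
    shift = {ch: m - 1 - index for index, ch in enumerate(pattern[:-1])}

    matches = []
    i = m - 1  # index in text aligned with the last pattern character
    while i < n:
        # Compare right to left, starting from the end of the pattern.
        j = 0
        while j < m and text[i - j] == pattern[m - 1 - j]:
            j += 1
        if j == m:
            matches.append(i - m + 1)
        i += shift.get(text[i], m)
    return matches
```

For example, `bmh_search("abracadabra", "abra")` finds the occurrences at offsets 0 and 7.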

Page 38: DLP Systems: Models, Architecture and Algorithms


Multiple Keyword Match

• Rabin-Karp algorithm: hash-based string matching

• Rabin-Karp hash function H(S):

– For a given string $S = x_1 x_2 \dots x_m$ of length m, a hash function can be constructed as:

• $H(S) = (x_1 b^{m-1} + x_2 b^{m-2} + \dots + x_{m-1} b + x_m) \bmod q$

• where b is a base number (usually we take b = 256) and q is a big prime number.

– For pattern P, $H(P) = (p_1 b^{m-1} + p_2 b^{m-2} + \dots + p_{m-1} b + p_m) \bmod q$

– If we denote $R_k = R[k, k+m-1]$, we can derive $H(R_{k+1})$ from $H(R_k)$ at relatively small cost:

– $H(R_{k+1}) = ([H(R_k) - r_k b^{m-1}]\, b + r_{k+m}) \bmod q$

– This is an iterative formula, which is a common practice for algorithm optimization.

Page 39: DLP Systems: Models, Architecture and Algorithms


Multiple Keyword Match

• Rabin-Karp hash function:

– The quantity $b^{m-1} \bmod q$ can be pre-calculated to save CPU time.

– For each iteration, we only need 5 arithmetic operations.

• It can be further reduced to 4.

• Consider the number $r_k b^{m-1}$.

– Horner's rule

• $H(S) = ((\dots((x_1 b + x_2) b + x_3) b + \dots + x_{m-1}) b + x_m) \bmod q$

• Yet another formula for performance tuning

Page 40: DLP Systems: Models, Architecture and Algorithms


Multiple Keyword Match

• Rabin-Karp algorithm for multiple patterns:

– Input:

• String R, multiple patterns $\{P_1, \dots, P_k\}$,

• n = Length(R), $m_j$ = Length($P_j$), q, b

– Procedure:

• Step 0:

– Let $m = \min_j m_j$

– Calculate the number $b^{m-1} \bmod q$

– Calculate all $H(P_j[1,\dots,m])$ (j = 1, …, k) and $H(R_1)$ by Horner's rule

• Step 1: Let i = 1

• Step 2: If there exists j in {1, 2, …, k} such that $H(P_j[1,\dots,m]) = H(R_i)$ and $P_j = R[i, \dots, m_j + i - 1]$, it is a match; output the instance.

• Step 3: i = i + 1

• Step 4: If i > n − m + 1, stop

• Step 5: Calculate $H(R_i)$ from $H(R_{i-1})$ using the iterative formula.

• Step 6: Go to step 2

– Output: All matched instances
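The procedure can be sketched as follows. A few choices are assumptions of this sketch: the strings are processed as UTF-8 bytes (so offsets are byte offsets), the prime q = 1,000,000,007 is arbitrary, and the pattern prefixes are grouped in a dictionary keyed by hash so that step 2 is one lookup instead of a loop over all k patterns.

```python
from collections import defaultdict


def rabin_karp_multi(text, patterns, b=256, q=1_000_000_007):
    """Return (byte_offset, pattern) for every occurrence of any pattern in text."""
    data = text.encode("utf-8")
    pats = [p.encode("utf-8") for p in patterns if p]
    if not pats:
        return []
    m = min(len(p) for p in pats)  # Step 0: shortest pattern length
    n = len(data)
    if n < m:
        return []

    def horner(s):
        h = 0
        for x in s:
            h = (h * b + x) % q
        return h

    # Hash only the first m bytes of each pattern.
    by_hash = defaultdict(list)
    for p in pats:
        by_hash[horner(p[:m])].append(p)

    high = pow(b, m - 1, q)  # b^(m-1) mod q, pre-calculated once
    h = horner(data[:m])     # hash of the first window
    matches = []
    for i in range(n - m + 1):
        # Step 2: a hash hit is verified by comparing the full pattern.
        for p in by_hash.get(h, ()):
            if data.startswith(p, i):
                matches.append((i, p.decode("utf-8")))
        # Step 5: roll the hash to the next window if there is one.
        if i + m < n:
            h = ((h - data[i] * high) * b + data[i + m]) % q
    return matches
```

The full comparison after a hash hit is what makes the result exact: hash collisions only cost extra comparisons and never produce false matches.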

Page 41: DLP Systems: Models, Architecture and Algorithms


Multiple Keyword Match

A practical hybrid method:

– BMH or Rabin-Karp

– If k < Magic-number,

• Use BMH k times,

• Otherwise, use Rabin-Karp

– Magic-number = 100 is the value I have used in DLP products.

Rabin-Karp has its weakness:

• When Min({Length(Pi) | i = 1, 2, …, k}) is small, say, less than 4, we have trouble.

• We need to introduce efficient multiple pattern matching for short patterns.

Page 42: DLP Systems: Models, Architecture and Algorithms


Multiple Keyword Match

We have a solution complementary to the RK algorithm for handling multiple short patterns:

– This is the reverse-trie matching algorithm.

A reverse trie represents a set of keywords; it is especially good for CJK languages in UTF-8 encoding:

The keyword set: {abc, abcd, acd}

root
├── c ── b ── a          (abc)
└── d ── c ─┬─ b ── a    (abcd)
            └─ a         (acd)
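A reverse trie stores each keyword from its last character to its first, and matching walks backward from every end position in the text. The sketch below is one possible reading of the idea, not the patented implementation; it works on Python characters rather than raw UTF-8 bytes, and the function names and the empty-string end marker are assumptions.

```python
def build_reverse_trie(keywords):
    """Build a trie of nested dicts in which each keyword is inserted reversed."""
    root = {}
    for word in keywords:
        node = root
        for ch in reversed(word):
            node = node.setdefault(ch, {})
        node[""] = word  # end-of-keyword marker; "" never equals a text character
    return root


def reverse_trie_match(text, root):
    """Return (start_offset, keyword) for every keyword occurrence in text."""
    matches = []
    for end in range(len(text)):
        node = root
        i = end
        # Walk backward from position `end` while the trie has a matching edge.
        while i >= 0 and text[i] in node:
            node = node[text[i]]
            if "" in node:
                matches.append((i, node[""]))
            i -= 1
    return matches
```

The backward walk from each position is bounded by the longest keyword, so short keywords keep the work per position small.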

Page 43: DLP Systems: Models, Architecture and Algorithms


Agenda

• What is Data Loss Prevention (DLP) ?

• Concepts, Models, Architecture

• Content Inspection Problems

• Practical Algorithms for DLP

• Summary

• References

• Q&A

Page 44: DLP Systems: Models, Architecture and Algorithms

Summary

• What DLP is.

• DLP Security Model

• Architecture of a DLP System

• Four Content Inspection Problems

• Two Algorithms for DLP Content Inspection

– Document Fingerprinting

– Multi-keyword matching

Page 45: DLP Systems: Models, Architecture and Algorithms

References

• Liwei Ren et al., Document fingerprinting with asymmetric selection of anchor points, US patent 8359472

• Liwei Ren et al., Two tiered architecture of named entity recognition engine, US patent 8321434.

• Yingqiang Lin et al., Scalable document signature search engine, US patent 8266150

• Liwei Ren et al., Fingerprint based entity extraction, US patent 7950062

• Liwei Ren et al., Document match engine using asymmetric signature generation, US patent 7860853

• Liwei Ren et al., Match engine for querying relevant documents, US patent 7747642

• Liwei Ren et al., Matching engine with signature generation, US patent 7516130

Page 46: DLP Systems: Models, Architecture and Algorithms

Q&A

Any questions?

Page 47: DLP Systems: Models, Architecture and Algorithms

Thank You!


Innovation is not a part time job, and it is not even a full-time job. It’s a life style.