
INSTITUTO TECNOLOGICO Y DE ESTUDIOS SUPERIORES DE MONTERREY

A SOCIAL NETWORK BASED MODEL TO DETECT ANOMALIES ON DNS SERVERS

by

ROBERTO ALONSO RODRIGUEZ

Submitted to the Department of Computer Sciences in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

Advisor: PROF. DR. RAUL MONROY
Co-advisor: DR. LUIS ANGEL TREJO RODRIGUEZ

Dissertation committee:
DR. JOSE TORRES JIMENEZ
DR. CARLOS MEX PERERA
DR. MIGUEL GONZALEZ MENDOZA
PROF. DR. RAUL MONROY
DR. LUIS ANGEL TREJO RODRIGUEZ

Atizapan de Zaragoza, Edo. de Mex., January 19th, 2015


Acknowledgement

If I had to compare doing a Ph.D. with any sport, I would say it is like running a race. During the race, you need to keep looking ahead, save strength for the harder parts, not forget to take time to rest and, most importantly, keep the pace until you reach the finish line. Here I am at the finish line; hence, I would like to thank my coach, my advisor, Prof. Raul Monroy, whose guidance and training have allowed me to succeed in this long journey. There are not enough words or space to express my gratitude towards him, so I will restrict myself to saying: "Thank you, Raul, for everything". I would like to thank my second coach, my co-advisor, Dr. Luis Angel Trejo Rodríguez, for the numerous and fruitful talks about my thesis work; my deepest admiration to you.

I want to thank the members of the Network and Security Group at Tecnologico de Monterrey, Campus Estado de Mexico, especially my friends, the members of "El Cubo", Jorge Vazquez, Jose Benito Camina and Victor Ferman, with whom I had numerous discussions about my work. I also want to thank all the Ph.D. students of Tecnologico de Monterrey (CEM) who were part of this journey, especially my friends Alfredo Villagran and Mauricio Martínez, for the numerous coffees we had while we talked about our theses. I would like to acknowledge Dr. Miguel Gonzalez Mendoza, head of the Ph.D. program, for his valuable work at the university. I also want to thank my external reviewers, Dr. Jose Torres Jimenez from CINVESTAV and Dr. Carlos Mex Perera from Tecnologico de Monterrey (MTY).

Lastly, but not least, I would like to thank my family, starting with the love of my life, Karla Solis, for being the source of my happiness and for cheering me on to keep my pace; thank you for believing in me. I would also like to thank my brother and my sister, who contribute to my life significantly. I want to thank my dad Juan for always being my inspiration, and my mom Angelica for showing me how to achieve whatever I want in life; thank you for your support.

I want to thank CONACyT for the grant to conduct my studies towards a Ph.D. (Scholarship 45904).

This thesis is dedicated to my dad and to my mom.


Summary

The Domain Name System (DNS) is a critical service that provides translation for requests from Internet users, i.e., from URLs to IP addresses and vice versa. Regardless of the efforts to protect it, during the last decade the DNS service has been subject to several attacks; thus, a security mechanism that guarantees its availability is still required. In this thesis, we have built an anomaly-based Intrusion Detection System (IDS) following the hypothesis that DNS usage, i.e., the set of IPs requesting domain translation together with the associated URLs, gives rise to social structures that change during a DNS attack. Through our experimentation we have found that the number of social structures grows exponentially, so no known algorithm computes social structures in polynomial time, and thus a method that approximates a solution, i.e. a heuristic, is required. We support this observation by formally proving the NP-completeness of the associated decision problem of social structure computation. We have proposed a heuristic to compute social structures; its properties were evaluated by conducting several tests in which exact answers (complete and correct) were contrasted with heuristic solutions. This set of tests has allowed us to identify meaningful characteristics such that we can determine which social structures can, and cannot, be computed in polynomial time; namely, we have found the phase transition. The cumulative knowledge of this thesis has allowed us to develop a classifier to identify changes in the social structure of a DNS server. Indeed, we have detected such changes, which indicate an anomaly, i.e. a Distributed Denial of Service (DDoS) attack, on a DNS server. We have thoroughly validated this classifier, and the detection rate is encouraging. The outcomes of this thesis work will help computer security researchers develop a mechanism to mitigate attacks on the DNS service. Furthermore, anomaly detection on the DNS service could be useful to detect compromised computers that belong to a botnet. Finally, it motivates further studies of the social structure approach that may help other areas.


Contents

1 Introduction
  1.1 Contributions
  1.2 Organization

2 The Domain Name System: Definition and Related Work
  2.1 Introduction
  2.2 Name Systems: Domain Name System (DNS)
    2.2.1 Name Systems Characteristics
      2.2.1.1 Name Space
      2.2.1.2 Name Registry
      2.2.1.3 Name Resolution
    2.2.2 The DNS Protocol
      2.2.2.1 Domains and Authoritative Zones
      2.2.2.2 Resource Records
      2.2.2.3 Zone Transfer
      2.2.2.4 Root Servers
      2.2.2.5 DNS Resolution: Iterative and Recursive
  2.3 Security Issues with DNS
    2.3.1 Discussion
  2.4 Approaches for the Detection and Mitigation of DDoS Attacks on DNS servers
    2.4.1 Statistical Characterization
      2.4.1.1 Discussion
    2.4.2 Data Visualization
      2.4.2.1 Discussion
    2.4.3 Machine Learning
      2.4.3.1 Discussion
    2.4.4 DNS Deployment
      2.4.4.1 Discussion
  2.5 Conclusions

3 The Commonality Amongst the Objects upon which a Collection of Agents has Performed an Action
  3.1 Social Groups: Computing Commonality Amongst Users, Actions, and Objects
    3.1.1 Agents Executing Actions over Objects for a Period of Time
    3.1.2 Groups
    3.1.3 Queried Objects and Queried by Individuals
    3.1.4 "Real" World Examples of SGC
    3.1.5 Problem Statement
  3.2 SGC is NP-complete
    3.2.1 Longest Common Subsequence problem
    3.2.2 SGC NP-Completeness
  3.3 2-SGC is NP-complete
    3.3.1 2-SGC Problem Statement
    3.3.2 Hitting Set
    3.3.3 2-SGC NP-Completeness
  3.4 The Gram Matrix
    3.4.1 Characteristics from Q and the Gram Matrix of Q
  3.5 Conclusions

4 On the Experimental Tractability of the SGC Problem
  4.1 Introduction
  4.2 The Phase Transition Study
    4.2.1 Phase Transition Study Procedure
  4.3 Step 1 - On the Identification of the Order Parameter
    4.3.1 Outline of Order Parameter Selection
      4.3.1.1 EASY and HARD classes
    4.3.2 Characteristics under Study
    4.3.3 Construction of the Classifier
    4.3.4 Classification Trees and Classification Rates
    4.3.5 Evaluation of the Classifier Using ROC curves
      4.3.5.1 Evaluation of the Classifier Using F-Measure curves
    4.3.6 Discussion
  4.4 Step 2 - Instances under Consideration
  4.5 Step 3 - Algorithm to Solve SGC and Measure of Computational Expense
    4.5.1 Algorithm Applied to the SGC Instances
    4.5.2 Measure of Computational Expense
  4.6 Step 4 - The Phase Transition of the SGC Problem
    4.6.1 Phase Transition of the SGC Decision Problem
      4.6.1.1 Experimental Setting
      4.6.1.2 Results
      4.6.1.3 Discussion
    4.6.2 Phase Transition on the Optimality Version of SGC
      4.6.2.1 Experimental Setting
      4.6.2.2 Results
      4.6.2.3 Discussion
  4.7 Conclusions

5 On the Detection of Anomalies Using Social Structure Characteristics
  5.1 Introduction
  5.2 Characteristics under Consideration from DNS Traffic
    5.2.1 Hybrid Segmentation Method
      5.2.1.1 Characterising the Estimated Social Structure
    5.2.2 Features under Consideration for the Characteristic Vector
  5.3 Characteristics from our DNS Sample
    5.3.1 Identifying Populations
  5.4 Selection of the One-class Classifier
    5.4.1 The One-class Support Vector Machine (SVM)
      5.4.1.1 Representation Space
      5.4.1.2 Type of SVM - ν-SVM Classifier
      5.4.1.3 Parameter Values of the SVM
  5.5 Construction of the Classifier
    5.5.1 Outline of the Construction
  5.6 The Test Set
    5.6.1 Classification on a Synthetic Attack
    5.6.2 Classification on a Conducted Attack
    5.6.3 Classification on Abnormal Activity
    5.6.4 Classification on Monthly Traffic
  5.7 Experimental Results
    5.7.1 Classification on a Synthetic Attack
    5.7.2 Classification on a Conducted Attack
    5.7.3 Classification on Abnormal Activity
    5.7.4 Classification on Monthly Traffic
  5.8 Analysis of the Number of Alarms
    5.8.1 Number of Alarms Considering the Synthetic Attack
    5.8.2 Number of Alarms Considering the Conducted Attack
    5.8.3 Number of Alarms Considering the Abnormal Activity
    5.8.4 Number of Alarms Considering Monthly Activity
    5.8.5 Discussion
  5.9 Comparative Results
    5.9.1 Related Work under Study
    5.9.2 Discussion
      5.9.2.1 Time to Detect (TD)
      5.9.2.2 Test Set (TS)
      5.9.2.3 Representative Sample of Ordinary DNS traffic (RS)
      5.9.2.4 Precision and Recall (PR)
      5.9.2.5 Alarm Filtering (AF)
  5.10 Conclusions

6 Conclusions and Indications for Future Work


Figure Index

2.1 DNS Hierarchy illustration.

3.1 Incidence matrix Q, and associated connectivity graph; as is standard in the literature (resource usage), agents are denoted with circles, and objects with squares.
3.2 A partially social cell.
3.3 A social cell.
3.4 New agent chained to the component.
3.5 Boxed component, G_i, comprising agent a_j.
3.6 The 2-SGC instance that results from applying our reduction to the HS instance S = {s_1, s_2, s_3, s_4}, C = {C_1, C_2, C_3}, with C_1 = {s_1, s_2, s_3}, C_2 = {s_2, s_4}, and C_3 = {s_1, s_2, s_4}.
3.7 2-SGC made out of social components G_1 and G_4, taken from the graph shown in Fig. ??.
3.8 2-SGC made out of social components G_1 and G_3, taken from the graph shown in Fig. ??.
3.9 Gram matrix Q^w × Q^wT.
3.10 Gram matrix Q^wT × Q^w.

4.1 Resulting rules after applying the C4.5 classifier to the learning set with window size 250.
4.2 Proportion of HARD windows classified correctly with the rule.
4.3 Resulting ROC curves reported from the classification tree. Notice that both FPR and FNR range over the interval [0,1], and that, to better appreciate our results, we have plotted both axes using a 10^-x scale.
4.4 F-β measure. A major inflection point is at β = 1.
4.5 Mean cost of finding a solution with t = 2 and z = 2 for window sizes 50 to 200 in steps of 25.
4.6 Mean cost of finding a solution with t = 2 and z = 3 for window sizes 50 to 200 in steps of 25.
4.7 Mean cost of finding a solution with t = 2 and z = 4 for window sizes 50 to 200 in steps of 25.
4.8 Mean cost of finding a solution with t = 2 and z = 5 for window sizes 50 to 200 in steps of 25.
4.9 Mean cost of finding a solution with t = 2 and z = 6 for window sizes 50 to 200 in steps of 25.
4.10 90th percentile, 25th percentile and median cost of finding a group with size z = 4 and weight t = 2 for window sizes 50 to 200 in steps of 25.
4.11 90th percentile, 25th percentile and median cost of finding a group with size z = 5 and weight t = 2 for window sizes 50 to 200 in steps of 25.
4.12 90th percentile, 25th percentile and median cost of finding a group with size z = 6 and weight t = 2 for window sizes 50 to 200 in steps of 25.
4.13 Mean cost of finding a solution for the optimality version of SGC for windows of size 75 and 100.
4.14 Mean cost of finding a solution for the optimality version of SGC for windows of size 125 and 150.

5.1 Step 1. Example of applying the sorting method on a Q^w matrix. IP addresses are denoted by the symbol a_i while objects are denoted by the symbol d_j.
5.2 Step 2. Example of mapping a sorted Q^w matrix into an image.
5.3 Step 3. Example of splitting the image into 2 × 2 square cells. The dashed line represents the limit of the square cell. Notice that there are some empty square cells.
5.4 Step 4. Example of labelling 2 × 2 square cells. To better illustrate the result, each label is represented in a circle with the letter L.
5.5 Step 5. Example of merging 2 × 2 labelled square cells. To better illustrate the result from the method we have labelled each region.
5.6 Example of applying our split method to a region (right) given by the Hybrid Segmentation approach.
5.7 Example of the size and weight of the estimated group. To better illustrate the result we have greyed out the estimated groups except for one.
5.8 Distribution of the estimated social structure size in a set of 680,000 windows.
5.9 Distribution of the estimated social structure weight in a set of 680,000 windows.
5.10 Average number of packets in a minute. The plot shows 1 month of activity from the studied DNS server. The solid bold line represents the day with the highest average activity.
5.11 Box plot for the H_agt feature.
5.12 Box plot for the H_obj feature.
5.13 Average packets per second in a minute. The solid line stands for a typical day of DNS activity while the dashed line refers to abnormal activity.
5.14 ROC curves for both windows, 250 and 500. Notice that both FPR and TPR range over the interval [0,1], and that, to better appreciate our results, we have plotted both axes using a log scale.
5.15 Recall/Precision curves for both windows, 250 and 500.
5.16 F-β curve for both windows, 250 and 500.
5.17 ROC curves for both windows, 250 and 500.
5.18 Recall/Precision curves for both windows, 250 and 500.
5.19 F-β curve for the best classifier according to our F1-score curve.
5.20 Average number of packets in a minute. The solid line stands for a typical day of DNS activity while the dashed line refers to abnormal activity dated February 4th.


Table Index

4.1 False Positives and False Negatives from resulting trees.
4.2 F-β measure for different values of β.

5.1 Confusion matrix for the A (anomaly) class and N (normal).
5.2 Confusion matrix for the A (real attack) class and N (normal).
5.3 Characteristics satisfied by related works.


1. Introduction

The Domain Name System (DNS) is a distributed naming system for computers. Mainly, it is used to translate URLs to IP addresses, which is required for the localization of computer services or devices worldwide. DNS is hence a critical service and, not surprisingly, a common target of cybercrime, most prominently the so-called Distributed Denial of Service (DDoS) attack. Roughly, a DDoS attack is intended to disrupt the availability of the translation service by flooding a DNS server with thousands, if not millions, of network packets. During the last decade DNS has been subject to attacks that disrupted the service temporarily: for example, an attack in 2002 targeted several higher-order DNS servers, and five years later the same servers were attacked again. Because of its distributed nature, DNS can also be used as a vector of attack in the so-called amplification attacks, where an attacker only needs to compose a DNS request to generate a much larger answer that is sent to a victim in an attempt to overwhelm it. Several other attacks have been reported by organizations like CERT, DNS-OARC and ICANN, amongst others.

Despite the critical service offered by DNS, in the last decade little progress has been made towards its protection. Examples of such efforts include detecting attacks when some measured value falls outside allowed limits, or suggesting a DNS deployment in an attempt to mitigate the effect of a DDoS attack. However, after studying these efforts, and in light of recent attacks in 2012, 2013 and 2014, it is possible to conclude that DNS still requires a mechanism to enhance its security.

This thesis introduces a mechanism that is capable of detecting anomalies in DNS traffic. The approach relies on a simple observation: DNS queries, from IP addresses to URLs, form social groups; hence, anomalies in DNS traffic should result in drastic changes in the DNS social structure. For example, consider that the academic staff of a university visit the same websites, forming a group with common URLs, while students form another group by visiting other websites. During an attack, such groups will suffer changes, allowing a detector to spot the abnormal activity.
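To make the intuition concrete, here is a minimal sketch (with toy data; the thesis's precise group definition and estimation heuristic are developed in Chapters 3 and 5) that groups client IPs by the domains they query within a time window and reports client pairs sharing at least t common domains:

```python
from collections import defaultdict
from itertools import combinations

# (client IP, queried domain) pairs observed in one time window; toy data
log = [("10.0.0.1", "uni.edu"), ("10.0.0.1", "journal.org"),
       ("10.0.0.2", "uni.edu"), ("10.0.0.2", "journal.org"),
       ("10.0.0.3", "game.net")]

domains_by_ip = defaultdict(set)
for ip, domain in log:
    domains_by_ip[ip].add(domain)

t = 2  # minimum number of shared domains for two clients to be "social"
for (ip_a, doms_a), (ip_b, doms_b) in combinations(domains_by_ip.items(), 2):
    common = doms_a & doms_b
    if len(common) >= t:
        print(ip_a, ip_b, "share", sorted(common))
```

During a flood of random queries, such shared-domain relations change drastically, which is precisely what the detector looks for.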


Computing social structures, however, is a complex task, requiring an exponential amount of computational resources; indeed, this thesis proves that the associated decision problem is NP-complete. Roughly, the proof consists of a Karp reduction from the well-known Longest Common Subsequence (LCS) problem to the problem of computing the social structure. Hence, in the context of computer security, a quick detection of anomalies in the social structure is not feasible unless we estimate the structure.
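To see why exact computation scales badly, consider a naive decision procedure (an illustrative brute force, not the thesis's algorithm; z and t denote group size and weight, following the notation of Chapter 4's figures): it must inspect every z-subset of the n agents, of which there are C(n, z), exponentially many as z grows with n.

```python
from itertools import combinations

def has_social_group(objects_by_agent, z, t):
    """Naive decision: do some z agents share at least t common objects?
    Enumerates all C(n, z) agent subsets -- exponential in the worst case."""
    agents = list(objects_by_agent)
    for subset in combinations(agents, z):
        common = set.intersection(*(objects_by_agent[a] for a in subset))
        if len(common) >= t:
            return True
    return False

demo = {"a1": {"d1", "d2", "d3"}, "a2": {"d1", "d2"}, "a3": {"d4"}}
print(has_social_group(demo, z=2, t=2))  # True: a1 and a2 share d1 and d2
```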

Complementarily, this thesis studies the experimental tractability of computing social structures so as to investigate their real hardness. Roughly, this study, called a phase transition study, identifies a critical value at which the problem goes from easy-to-solve to hard-to-solve. Indeed, several well-known NP-complete problems exhibit this particular behaviour, and some follow more complex patterns, e.g., an easy-hard-easy pattern. The results of this study aim to identify the hardest problem instances. The results also set the basis for a fair baseline of comparison for any algorithm attempting to compute social structures.

Further, this thesis develops an anomaly-based detector which, given a time window of DNS usage, makes use of features that attempt to capture the DNS social structure, including a heuristic that estimates group composition. The detector has been successfully validated: it has been able to spot both DNS DDoS attacks and the activity of bots colluding with a master, possibly to agree on the initiation of an attack. The results are encouraging, given that it is possible to raise an alarm with a 93.9% correct detection rate. In general, the social structure approach exhibits a better detection rate and a low number of false alarms, and it has been tested with a more robust test set.
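As a rough illustration of this kind of detector (a sketch only: the features below are stand-ins, not the thesis's exact characteristic vector, although a one-class ν-SVM of the sort provided by scikit-learn is the classifier family named in Chapter 5):

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Pretend each row is a feature vector extracted from one window of ordinary
# DNS traffic, e.g. entropy-like features over agents and objects (toy data).
rng = np.random.default_rng(0)
normal_windows = rng.normal(loc=[5.0, 3.0], scale=0.3, size=(500, 2))

detector = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(normal_windows)

suspect_window = np.array([[9.0, 0.4]])   # a window far from the normal profile
print(detector.predict(suspect_window))   # -1 means anomaly, +1 means normal
```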

Lastly, this work provides the basis for systems that formulate a problem in terms of the social structures arising in a real-world process. One possible application of this thesis is in a context where a recommendation is given to a set of users as a function of their common preferences, the so-called recommender systems. As another application of this work, consider computing the semantic relatedness of text documents using wikification. Wikification is the process of relating words in an arbitrary text to concepts, structured as Wikipedia articles [68, 46, 45]. We might like to know the relation (whether weak or strong) amongst these articles, following the idea that an article should be supported by a group of concepts. We could use this relation as a criterion for an article-concept ranking. The semantic meaning of Wikipedia articles could be useful for numerous applications, including the semantic web. Works like [30] study the complexity of some aspects of the semantic web.

1.1 Contributions

This thesis makes three major contributions. First, it proves that any algorithm attempting to compute the so-called social structure exactly requires, in theory, an exponential amount of computational resources. Second, it shows that it is nonetheless possible, in some cases, to compute the social structure with a reasonable amount of computational resources. Third, it shows that the DNS social structure is a significant factor in detecting anomalies attempting to compromise the availability of a DNS server.


1.2 Organization

Chapter 2 describes the DNS service and discusses the advances attempting to protect it. Next, in Chapter 3, the social structure approach and its complexity are presented; this chapter also shows that formulating the problem of computing the social structure in terms of an adjacency matrix, together with its corresponding Gram matrix, allows us to extract information regarding the social structure. Chapter 4 presents the results of studying the problem of computing social structures in terms of its experimental complexity. Chapter 5 describes the approach for the construction of the anomaly-based detector; it also presents the results of applying the detector in a series of experiments aiming to test its robustness. Chapter 6 concludes and gives indications for future work.
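Since the Gram matrix formulation recurs in Chapters 3 through 5, a tiny numeric illustration may help (toy data; the thesis defines the windowed matrix and its properties formally in Chapter 3): for a 0/1 agent-by-object incidence matrix Q, the Gram matrix Q·Qᵀ counts, for every pair of agents, how many objects they have in common.

```python
import numpy as np

# Rows: agents (IP addresses); columns: objects (queried domains); toy data
Q = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])

G = Q @ Q.T   # G[i, j] = number of objects agents i and j both queried
print(G)      # diagonal: objects per agent; off-diagonal: commonality
```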


2. The Domain Name System: Definition and Related Work

The Domain Name System (DNS) service provides network users with translation from URLs to IP addresses. This is required for the localisation of computer services or devices worldwide. In this chapter, we show that, despite the efforts to protect this critical service, DNS is still a common target of cybercrime, most prominently the so-called Distributed Denial of Service (DDoS) attack.

2.1 Introduction

Only two years after the release of the first draft of the DNS service, researchers warned about several flaws in the DNS design [6]. Such flaws may allow an attacker to compromise the service or to use it as a vector of an attack. One of these flaws is related to the so-called Distributed Denial of Service (DDoS) attack. A DDoS attack aims to overwhelm a victim with thousands of network packets so as to degrade the performance of the network or its devices. Examples of these attacks are numerous, but the most significant are the attacks on the root servers in 2002 [64] and in 2007 [32]. After these attacks, some efforts were directed towards the security of DNS; for example, the DNS community has implemented load balancing mechanisms. After the severe DDoS attack on the DNS service in 2002, several works aiming to timely detect or mitigate DDoS attacks on DNS servers appeared. We have classified these works according to the employed technique, as follows: statistical characterization, data visualization, machine learning, and DNS deployment.

After a thorough evaluation, we have found that these works lack certain properties: for example, timely detection, coping with changing DNS behaviour, and being difficult to deceive. Consequently, a


security mechanism that guarantees DNS availability is still required. In this thesis, we have built an anomaly-based Intrusion Detection System (IDS) following the hypothesis that DNS usage gives rise to social structures that change during a DNS attack. The reasoning behind this proposition is that social structures allow us to investigate how a collection of agents, namely IP addresses, relate to one another on the basis of the web domains (URLs) they have commonly visited over a period of time. Consequently, during an attack such relations will change, possibly indicating a DDoS attack.

Chapter overview: First, we describe the operation of the DNS service (sec. 2.2). This will help us, in section 2.3, to depict the security issues with the DNS service, e.g. a DDoS attack. Then, we discuss the approaches attempting to protect the DNS service (sec. 2.4). We conclude this chapter by proposing the use of social structures to identify anomalies on the DNS server.

2.2 Name Systems: Domain Name System (DNS)

Name Systems were created to identify network devices (e.g. database servers, web servers, printers, etc.) easily with a human-friendly label. Network protocols, of course, use numerical labels, namely IP addresses, to communicate with network devices. A Name System therefore helps a network user translate a human-friendly label into an IP address and vice versa. This translation is also known as resolution.
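From an end host, both directions of this translation can be exercised with Python standard library calls (output depends on the local resolver; reverse lookups fail when no PTR record exists):

```python
import socket

ip = socket.gethostbyname("example.com")             # forward: name -> IP address
print(ip)

try:
    host, aliases, addrs = socket.gethostbyaddr(ip)  # reverse: IP -> name (PTR)
    print(host)
except socket.herror:
    print("no PTR record for", ip)
```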

2.2.1 Name Systems Characteristics

For a Name System to be considered fully functional, it must have the following features:

• Name Space

• Name Registry

• Resolution Process

2.2.1.1 Name Space

The name space specifies the syntax of the labels by providing rules such as the permitted characters and the length of a label. The name space also provides a logical hierarchy so as to organize all the labels in a network easily; this hierarchy is known as the Architecture. There are two types of architectures, plain and hierarchical, each of which has its own benefits. For example, in a plain architecture an authorized user may label a network device without any particular restrictions (e.g. printer, JohnPC, server1), while in a hierarchical architecture users label network devices according to a given relation (e.g. sales.JohnPC, sales.printer, sales.server1). Clearly, a plain architecture is easy to deploy but hard to maintain; in contrast, a hierarchical architecture is easier to maintain but harder to deploy.


2.2.1.2 Name Registry

The name registry function guarantees that each network device has a unique name, by using a storage mechanism. The organization given the responsibility of maintaining this coherence is named a Registry Authority. For example, AKKY Mexico is the Registry Authority responsible for the name registry in Mexico.

The Registry Authority maintains the names of the network devices by using one of the followingmechanisms:

1. Registry Tables.- The names and their corresponding IP addresses are stored in a file.

2. Propagation.- The names and their corresponding IP addresses are propagated upon request.

3. Distributed Database.- Some names and their corresponding IP addresses are assigned to an authority.

2.2.1.3 Name Resolution

Name resolution is the lookup process itself, and its form depends on the storage mechanism. For example, when asking for a translation in a distributed name database, several queries are propagated amongst several servers, which finally answer with an existence or non-existence flag for the name.

2.2.2 The DNS Protocol

In the early days of the Internet, the preferred Name System had a plain architecture, ARPAnet as its Registry Authority, and name resolution by means of consulting registry tables. Over the years this Name System became obsolete, in part because of the growth of the Internet.

In 1983 the first standard of a new Name System appeared: the Domain Name System (DNS). DNS has a hierarchical name space; its Registry Authority is formed by several organizations around the globe; and its name resolution is performed using a distributed database. Moreover, DNS proved to be scalable and easy to maintain, and its distributed nature prevents bottlenecks.

Finally, because of the hierarchical name space, the DNS service assigns names to network devices according to a given relation; these names are known as domains and are controlled by several Registry Authorities.

2.2.2.1 Domains and Authoritative Zones

Domains in the DNS service are constructed hierarchically. For instance, the Registry Authority Google controls name assignment under the domain google; consequently, web servers like mail.google or maps.google exist. Notice that, because of the hierarchical name space, domains like mail.google and mail.yahoo refer to different network devices, both named mail.

Every domain belongs to an upper domain until reaching the main DNS domain, the root node, denoted by the character (.). From this root node all other domains descend (Fig. 2.1).



Figure 2.1: DNS Hierarchy illustration.

Indeed, the domains directly below the root node are named Top Level Domains (TLDs) and typically represent countries, regions, organizations, etc. For instance, the .com TLD represents commercial websites.

For the sake of simplicity, the domains below the TLDs are named subdomains. For instance, google is a subdomain of the TLD .com.
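The hierarchy is easy to see programmatically: splitting a fully qualified name on dots and reading the labels right to left walks from the TLD down to the full name (a toy sketch):

```python
def hierarchy(fqdn):
    """Yield the chain of domains from the TLD down to the full name."""
    labels = fqdn.rstrip(".").split(".")
    for i in range(len(labels) - 1, -1, -1):
        yield ".".join(labels[i:])

print(list(hierarchy("maps.google.com.")))
# ['com', 'google.com', 'maps.google.com']
```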

Because of its complexity, the DNS hierarchy is maintained in parts by different organizations around the globe, each of which has a group of delegated domains to maintain. Such a set of delegated domains is also known as an Authoritative Zone. All the information regarding a zone is stored and managed by a server named the Authoritative server.

2.2.2.2 Resource Records

The Resource Records (RRs) are tuples containing the information about domains and their corresponding IP addresses. Each authoritative zone has its own RRs, stored in a master file so that they can be modified when required.

According to the information requested from an authoritative server, there are several RR types (a minimal representation is sketched after the list):

• A Type.- To resolve addresses over IPv4.

• AAAA Type.- To resolve addresses over IPv6.

• MX Type.- To resolve mail server addresses.

• PTR Type.- To resolve a reverse DNS resolution request.

• NS Type.- To set the addresses of zone authoritative servers.

• SOA Type.- To determine whether the authoritative server has new information regarding the zone.

• NX Type.- To indicate that the requested domain does not exist.
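A minimal in-memory representation of such records might look as follows (field names are illustrative and loosely follow the DNS wire format; this is not any particular server's data model):

```python
from dataclasses import dataclass

@dataclass
class ResourceRecord:
    name: str     # e.g. "www.example.com."
    rtype: str    # "A", "AAAA", "MX", "PTR", "NS", "SOA", ...
    rclass: str   # almost always "IN", the TCP/IP (Internet) class
    ttl: int      # seconds the record may be cached
    rdata: str    # type-dependent payload, e.g. an IPv4 address for A records

rr = ResourceRecord("www.example.com.", "A", "IN", 3600, "93.184.216.34")
print(rr)
```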


The RRs associate a class with the information sent. The class IN is by far the most common, as it refers to the TCP/IP protocol suite; the remaining classes are reserved by Registry Authorities.

2.2.2.3 Zone Transfer

Given that RRs change over time (e.g. a server may change its IP address), the primary authoritative server is the only server in the zone that stores and manages the RRs; propagating this information along the whole authoritative zone would otherwise increase DNS traffic in the network.

However, given the critical service that an authoritative server provides to its zone, redundancy mechanisms guarantee DNS service availability by means of backup servers. The RR information of the primary authoritative server is propagated according to a timer; such propagation is named a zone transfer. When the primary server is disabled for any reason, one of the backup servers acts as the primary and uses the most recent RRs to provide the translation service to the zone.
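With dnspython (assumed installed), requesting such a zone transfer (AXFR) can be sketched as follows; the server address and zone name are placeholders, and a properly configured server will refuse AXFR to anything but its authorized secondaries:

```python
import dns.query
import dns.zone

# 192.0.2.53 is a documentation address standing in for a primary server
zone = dns.zone.from_xfr(dns.query.xfr("192.0.2.53", "example.com"))
for name, node in zone.nodes.items():
    print(name, node.to_text(name))   # every RR a secondary would receive
```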

2.2.2.4 Root Servers

At the top of the DNS hierarchy we find the root node, which is formed by 13 logical servers, named from the letter A to the letter M and implemented by many physical servers around the globe, as this helps to create a widely distributed DNS service. Indeed, over 800 thousand requests are sent to the root node every second [12], so auxiliary mechanisms are necessary to ensure its availability. One of these mechanisms is load balancing, which allows the servers to decide the best possible way to resolve a DNS request. Another is the cache memory, which temporarily stores the most requested domains, making it unnecessary to search the DNS hierarchy for such domains during a given time period. Clearly, the cache must ensure that it actually contains fresh RR information, so the client can trust the answer. To address this issue, the following mechanisms were created (a small caching sketch follows the list):

• The first consists in tagging RRs to indicate that the information comes from the cache. When the client receives such an RR, it may ask the DNS server for a fresher one.

• The second consists in tagging RRs with a temporal label named Time To Live (TTL). The TTL lets a client know the freshness of the information. Moreover, when an RR becomes too old (a low value of the TTL parameter), the DNS server will ask its corresponding upstream DNS server for a new RR.
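A TTL-driven cache of the kind just described can be sketched in a few lines (illustrative only; real servers also handle negative caching, prefetching and per-record limits):

```python
import time

class DnsCache:
    """Minimal TTL-based cache: entries expire and force a fresh upstream lookup."""
    def __init__(self):
        self._store = {}                 # name -> (record, expiry timestamp)

    def put(self, name, record, ttl):
        self._store[name] = (record, time.time() + ttl)

    def get(self, name):
        entry = self._store.get(name)
        if entry is None:
            return None                  # miss: resolve upstream
        record, expires_at = entry
        if time.time() >= expires_at:
            del self._store[name]        # stale: evict and refresh upstream
            return None
        return record
```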

Finally, given the clear boost they provide to DNS server performance, these mechanisms (load balancing and cache memory) are present in every DNS server.

2.2.2.5 DNS Resolution: Iterative and Recursive

The DNS resolution process is performed by a resolver, whose function is to initiate and sequence the queries that lead to a resolution. A resolver may be a DNS server or an end-user computer, given that end-user operating systems ship with built-in DNS resolvers.


Typically, an end-user resolver forwards the DNS query to a recursive DNS server; this process is named a recursive resolution. The recursive DNS server then takes the role of the resolver and tries to resolve the query; if the answer is not in its cache memory, the recursive server starts an iterative resolution.

In the iterative resolution, the resolver asks authoritative DNS servers for the domain; each in turn replies with the answer, if it is known, or with a Resource Record (type NS) pointing to another DNS server that may have the answer. The recursive server then queries that new DNS server for the domain, and the process is repeated until an RR carrying an answer, or one indicating a non-existent domain, is sent to the initial resolver.
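The referral-chasing loop can be sketched with dnspython (assumed installed); this is a bare-bones illustration that follows glue A records only, omitting the caching, CNAME chasing and retries a real resolver needs:

```python
import dns.message
import dns.query
import dns.rdatatype

def iterative_resolve(qname, server="198.41.0.4"):   # a.root-servers.net
    while True:
        query = dns.message.make_query(qname, dns.rdatatype.A)
        response = dns.query.udp(query, server, timeout=3)
        if response.answer:                          # final answer reached
            return [rr.to_text() for rrset in response.answer for rr in rrset]
        for rrset in response.additional:            # follow the NS referral
            if rrset.rdtype == dns.rdatatype.A:
                server = rrset[0].address            # next server to ask
                break
        else:
            raise RuntimeError("referral without glue; resolving the NS name omitted")
```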

As expected, the cache memory impacts the performance of the global DNS service. Indeed, if the cache memory of every DNS server around the globe were corrupted, those servers would have to refresh their RRs through iterative resolutions.

Finally, notice that to force an iterative resolution across the whole DNS infrastructure, it is enough to ask for a random domain from a web browser. Because it is difficult to tell a legitimate DNS query apart, several random queries can easily overwhelm a DNS server, impacting the overall responsiveness of the DNS service.

2.3 Security Issues with DNS

Since the release of the first DNS Request For Comments (RFC) in 1983, researchers have expressed concerns about the role of this service in the growth and evolution of the Internet. Indeed, four years after the release of the first implementation of DNS, some researchers [6] argued that DNS is a critical infrastructure that must be protected, given that, by design, it has vulnerabilities that may allow an attacker to compromise it. Namely, they showed that RRs can be intercepted with a man-in-the-middle attack and that DNS replies can be altered to redirect a user to a subverted machine. Several years later, this vulnerability became known as the cache poisoning attack, which was officially reported and patched in 2008 [62]. Later, with the global deployment of a security enhancement of DNS, DNSSec [44], conducting such an attack became harder.

Still, after more than 30 years of DNS implementations around the globe, its security can still be compromised. Specifically, one weakness, related to the difficulty of differentiating legitimate URLs, may compromise its availability. Because it is easy to create random URLs and send them, formulated as questions, to a DNS server, a Distributed Denial of Service (DDoS) attack can be performed easily. Basically, a DDoS attack uses thousands or millions of network requests to make a machine or network resource unavailable to its users.

Indeed, October 2002 saw the first attempt to disable the Internet through a DDoS attack on the DNS root servers [64]. The 13 root servers were subjected to this kind of attack, using mostly ICMP, TCP SYN and UDP network packets. The anomalous event lasted two hours and disabled 9 of the 13 root servers. Although several parts of the world were unable to resolve addresses, the remaining root servers were capable of resolving Internet addresses, though with poor performance.


After the events of 2002, three mechanisms were implemented in the DNS infrastructure: a load balancing algorithm to distribute DNS queries amongst different root servers; improvements in the performance of the cache memory; and an expansion of the DNS root from 13 physical servers to hundreds of physical servers organized into 13 logical servers, named from the letter A to the letter M.

A second severe attack, in 2007 [32], lasted five hours and affected logical servers F, G, L and M. The attacking botnet consisted of 4,500 bots. Each bot created 3 random domains and sent them to its recursive DNS server; consequently, the botnet forced thousands of iterative resolutions amongst all root servers. To mitigate the effect of this attack, the DNS Registrars, in collaboration with the Internet Service Providers (ISPs), started to filter network packets larger than 300 bytes.

Although the previous two attacks are considered the most significant to have affected the DNS service, there are numerous reports of attempts to disable DNS servers through DDoS attacks. For example, in January 2009 (footnote 1) the DNS registrar Network Solutions was attacked, slowing down DNS queries and, consequently, access to thousands of websites. Later, in April 2009 (footnote 2), the DNS registrar that handles Amazon Web Services (AWS) was subjected to a DDoS attack, disrupting ongoing service at the Amazon headquarters, since VoIP, email and IM all depend on domains.

As another example, in August 2013 (footnote 3) the China Internet Network Information Center (CNNIC) reported a DDoS attack against its DNS servers affecting all websites of the .cn domain; this attack lasted over 10 hours. The increase in traffic volume was about 718% (48.25 Gbps), with an average rate of 32.4 million packets per second.

Finally, in the early months of 2013 another DDoS attack occurred (footnote 4), affecting DNS registrars in Canada and Florida and, consequently, slowing down the performance of websites.

Footnotes:
(1) http://goo.gl/auxzA5
(2) http://goo.gl/rq7PKQ
(3) http://goo.gl/VwA6t3
(4) http://goo.gl/FqACxg

2.3.1 Discussion

It has been more than 12 years since the first major DDoS attack on DNS, yet such attacks remain a current security issue. This is, in part, because they are easy to conduct and difficult to detect in time. Moreover, because of its characteristics, DNS is itself used to conduct attacks in which the victim is overwhelmed by DNS answers. Therefore, researchers are conducting several studies to prevent attacks on, protect, or optimize the DNS service.

2.4 Approaches for the Detection and Mitigation of DDoS Attacks on DNS servers

After the severe DDoS attack on the DNS service in 2002, several works aiming to timely detect or mitigate DDoS attacks on DNS servers appeared. We have classified these works into the following categories:

• Statistical characterization.- This category discusses works focused on characterizing DNS traffic. Typically, statistical characterization is used as a first approach to understanding normalcy in network traffic. In these works, researchers propose several characteristics to create a profile of a DNS server, such as traffic volume, the packet-per-second rate and the number of clients.

• Data visualization.- This category discusses works using graphical methods to identify anomalies in DNS traffic. Data visualization can show the status of the network in a human-friendly way. Also, if we combine these methods with others, e.g. machine learning, we can automate the decision of whether a DNS request is abnormal or not.

• Machine learning.- Machine learning methods aim to learn patterns of interest from DNS traffic. These patterns can later be readily applied to the identification of anomalies. We discuss the advances of these techniques in the context of DNS.

• DNS deployment.- The motivation behind this kind of work is the improvement of the performance of the DNS service. For example, a correct deployment of local DNS infrastructure can reduce the volume of traffic, enhancing the availability of a DNS server during a DDoS attack.

2.4.1 Statistical Characterization

Analysing the requirements of root server F. In [10] the authors conducted a statistical analysis of root server F. They showed that 65% of the DNS requests are somehow invalid, meaning that they are queries that should not appear at the root level. They speculated that these requests were generated and sent to the root server because low-level DNS servers were misconfigured. Indeed, the authors found over 146,783 different TLDs with names like .local, which correspond to local DNS lookups that should never reach the root servers. Through this observation, they concluded that 16.5% of the servers querying the F server ask several questions that should not reach the root level, while 37.5% of the servers perform at least one invalid query.

According to the authors, during their analysis they found a DDoS attack intended to compromise RR integrity. They supported this with the observation that several PTR RRs were generated by simply varying the octets of the IP addresses with a fixed pattern. With this process, the attacker obtained a mapping from IPs to their corresponding domains that can be used for malicious activity, e.g. phishing.

Finally, the authors proposed improving the performance of the negative cache (RRs indicating that an address is invalid) to reduce the number of queries that consume network resources of the root server.

Although the work is interesting, since it provides a statistical perspective on root server performance, it fails to provide solid evidence for the claim that an attack was found. Moreover, the criteria used to conclude that the PTR RRs came from an attacker, rather than from an improper configuration of other DNS servers, are unclear.


A Day at the Root of the Internet. The work in [12] analysed the traffic at the 13 root servers. It is a significant breakthrough towards understanding the DNS service, since it is the first study of its kind. Interesting facts from this study are that, on average, there are over 5 billion queries to the root servers in a day, amounting to a rate of over 15,100 packets per second.

The authors showed that 98% of the queries to the root servers are invalid because of improper configurations of low-level DNS servers, supporting the observation of [10]. Also, approximately 50% of the RRs are of type A, contrasting with 14% of type PTR. They also found strange behaviour in which 40% of the queries are repeated during a day, along with invalid TLDs.

They concluded by hypothesising on the difficulty of correctly configuring all DNS servers around the globe, given the distributed nature of the DNS service. Finally, they urged DNS managers to keep proper and up-to-date configurations so as to improve the performance of the DNS service.

The authors presented a breakthrough study of the Internet. Although much work remains to be done, they provide the traffic capture of the root servers, which is available upon request from the DNS-OARC organization.

Analysing the requirements of authoritative DNS servers. In [60] the authors studied several authoritative servers. They found that most DNS requests are resolved by only a couple of DNS servers. Moreover, they noticed that DNS resolvers belonging to ISPs generate several iterative resolutions trying to resolve invalid domains.

The authors failed to prove that those DNS requests are invalid, since the criteria were unclear. Still, the study is interesting, because they observe that the DNS load balancing mechanism is not working correctly; consequently, an attacker could exploit this misconfiguration to conduct a DDoS attack.

Characterizing DNS traffic to detect a DDoS attack. In [34] the authors characterised DNS traffic during an anomalous event, namely a DDoS attack. Indeed, they found that the attack was originated by a computer worm which had compromised several machines. This worm asked for a particular domain, possibly to establish communication with the attacker. The DNS requests originating from the infected machines started a DDoS attack against the local server. The authors noticed that the packet-per-second rate went from 500 to 4,500, suggesting the presence of a DDoS attack. Once the attack was identified, the DNS operators gave a non-existent reply to all clients asking for this particular domain. The authors concluded the paper by claiming that the best protection mechanism against DDoS attacks is to give negative answers to the clients.
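The volume-based detection described here amounts to flagging windows whose packet rate leaves the historical envelope; a minimal sketch follows (thresholds and window lengths are our own illustrative choices, not values from [34]):

```python
from collections import deque

def volume_alarms(pps_series, history=300, factor=5.0):
    """Yield indices where packets-per-second exceed factor x the rolling mean."""
    window = deque(maxlen=history)
    for i, pps in enumerate(pps_series):
        if len(window) == window.maxlen and pps > factor * (sum(window) / len(window)):
            yield i
        window.append(pps)

pps = [500] * 600 + [4500] * 60          # jump from 500 to 4,500 pps, as in [34]
print(list(volume_alarms(pps))[:3])      # [600, 601, 602]: attack onset flagged
```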

The authors ignore the fact that, by simply varying the domains, this protection mechanism becomes ineffective, given that it is not possible to differentiate legitimate users.

Recursive DNS server analysis. In [41] the authors characterize the DNS traffic of a recursive DNS server on a university campus. They report several statistical measures for this server, concluding that there are several bots in the local network.


This work fails to provide evidence of the presence of bots, since the authors claimed that an anomaly on the DNS server can be noticed merely because the traffic of one day differs from that of another day.

Resolvers Revealed: Characterizing DNS Resolvers and their Clients. In [58] the authors studied DNS resolvers by means of a statistical analysis. Given that most of the research in the DNS area is conducted on authoritative servers, the authors analysed low-order DNS servers, such as resolvers. The work also attempted to answer the question: is it possible to find patterns in the resolvers so as to relate resolvers with DNS software distributions and clients? Using statistical techniques over real-world traffic, they claimed that it is possible to differentiate DNS software distributions (e.g. DNS BIND). For example, one of the characteristics that differentiates BIND from other DNS software is that BIND sends some RRs over some time period.

Through their experimentation, they related DNS resolvers with some clients. Although the authors claimed that using this association will enhance DNS security, it is unclear how this is possible. Further, the criteria used to associate a resolver with a client are vague, because the authors did not thoroughly validate the heuristics behind the association. Still, it is interesting that the DNS community is starting to research resolvers instead of higher-order DNS servers.

Towards Passive DNS Software Fingerprinting. The authors attempted to identify the DNS software of several DNS servers using heuristic rules, following an approach similar to [15, 58]. On average, they achieved 92% accuracy in identifying DNS software, and 99% in identifying DNS BIND. These results suggest that every DNS software distribution follows patterns.

Comparing DNS Resolvers in the Wild. The authors of this work [1] studied DNS resolvers so as to understand the performance of DNS resolution. Given that much of the Internet traffic relies on the responsiveness of DNS, the authors were interested in studying the performance of the DNS resolvers of different ISPs. They studied DNS traffic from over 50 different ISPs, contrasting them with Google DNS and OpenDNS. Particularly, they studied the response time of the 5,000 most popular domains, of 2,000 less popular domains, and of embedded content from web pages.

Although their results show that Google DNS has a better performance than other DNS services, the authors considered that further investigation has to be done to better understand this peculiarity. They also noticed that, even when the same DNS queries are performed against several DNS servers, the answer (i.e. the IP address) could be different, possibly because of the geographical location of the DNS server. Finally, the authors have been planning to optimize DNS responses according to their findings. As a minor flaw of the work, we can mention that a thorough validation has to be conducted to better understand the performance of the Google DNS servers. Moreover, if DNS servers use load balancing with Round-Robin mechanisms, it is not clear how this work can help to optimize the overall performance of a DNS server: they would possibly need to change the load-balancing method globally, which seems difficult to accomplish.


On Modern DNS Behavior and Properties. In [11] the authors studied 14 months of real-world DNS traffic from a neighbourhood in the US. Basically, they found that 63% of the unique domains were requested only once during the 14-month period, that two-thirds of the DNS transactions were completed under 1 ms, and that 40% of the DNS responses went unused. The contribution could be considered narrow; however, it is one of the most recent studies on a large-scale network.

The Collateral Damage of Internet Censorship by DNS Injection. The authors of [3] studied the effects of using DNS injection to block access to unwanted websites. Particularly, they studied the effect of the Great Firewall of China (GFC) on the DNS servers of other countries. Roughly, the GFC prevents Chinese users from using services like Facebook or Twitter by answering Chinese DNS requests with a non-existent answer. Indeed, the authors found that some countries, like Korea and Chile, suffer a collateral damage from this DNS injection, preventing them from accessing websites. One of the main questions to be answered is: why does the injection on Chinese DNS servers have such an effect on the DNS servers of other countries? According to the authors' findings, this is because some DNS requests have to go through China, making the GFC answer the DNS server with a non-existent response. Moreover, such a negative answer is propagated to some root servers, namely the I root server. The work is interesting because it studies what the authors call a collateral effect on DNS servers, i.e. how a non-existent response can be easily propagated through the DNS hierarchy.

On Measuring the Client-Side DNS Infrastructure. This paper proposed a methodology to discover several types of DNS servers [57]. The authors wanted to identify open DNS servers (which can receive a query from any client), resolvers, forwarders, and hidden DNS servers (not accessible to an end-user). To conduct this study, they registered a domain and propagated its information through the DNS infrastructure by deploying an authoritative DNS server. Then, they started to ask for subdomains, to force an authoritative resolution process. As expected, asking for an unknown domain triggers an authoritative resolution process from resolvers to authoritative DNS servers. Finally, they classified DNS servers according to the received requests. Surprisingly, they reported twice as many open DNS servers as reported by the DNS community. Their results are encouraging in the sense that they can identify several types of DNS servers. Moreover, they reported that authoritative (root) TTL values depend on the client side (resolvers and forwarders) of the DNS infrastructure. Certainly, this is a typical DNS vulnerability that may easily compromise a DNS cache, unless the DNS operators use DNSSec.

An Empirical Reexamination of Global DNS Behavior. The authors of this work [21] studied a dataset of 26 billion DNS queries from over 600 DNS resolvers around the globe. They contrasted their results with previous, similar works on DNS servers. Not surprisingly, the authors found some differences between previous studies and their findings, maybe because of the evolving nature of the Internet, or because of the authors' perspective. For example, in 2002 the number of AAAA RRs was small compared to the number of AAAA RRs present in 2013. This study helped to refresh the


information regarding the nature of DNS. Finally, the authors proposed a method to detect anomalous domains, particularly those related to bot, spam, and phishing activity.

An Empirical Study of Orphan DNS Servers in the Internet. This work [35] studied how to identify DNS servers that may be used to conduct malicious activities like spamming or phishing. The authors referred to such servers as 'orphans' because they only use NS records, whereas a DNS server usually uses more than one RR type. Surprisingly, the authors found that some 'orphans' are used by registrars for maintenance activities; consequently, the use of such 'orphans' is needed. The contribution of this work seems narrow, because it is well known that much of the hoax and spam activity is conducted by means of the DNS service. However, the approach of analysing DNS servers using NS records seems novel.

Analysis of Flooding DoS Attacks Utilizing DNS Name Error Queries. The paper analysed the effects of using DNS error queries (NX RRs) in DoS attacks over local DNS servers [66]. First, the authors made some considerations, like the maximum number of RRs that can reach an authoritative server, the number of concurrent sessions per client on a DNS server, and the average query rate. Then, the paper analysed a DoS attack (2012) over the Chinese DNS infrastructure using these considerations. Not surprisingly, the authors noticed changes in the proposed parameters, concluding that a DoS attack using several non-existent domains had occurred. Through the proposed methodology, they identified attackers by simply analysing which clients ask for the non-existent domains. Basically, if the number of clients asking for domains decreases while the number of NX RRs increases, there is an attack in progress. To mitigate the effect of the attacks, the authors proposed the classic blacklisting technique over all the clients asking for non-existent domains.

Impact Evaluation of DDoS Attacks on DNS Cache Server Using Queuing Model. In this work [67], the authors observed the effects of DDoS attacks on DNS servers by modelling the DNS server as a queueing process. To model the DNS server, the authors assumed an authoritative resolution process involving several local resolvers, which in turn ask a couple of authoritative servers for resolution. The model also assumed a Poisson process during the resolution process. The authors tested their model by simulating DNS traffic. The results show that a large DDoS attack decreases significantly the performance of a local resolver. In contrast, a small DDoS attack (compared to the large one) has a negligible effect on the performance of a local resolver, in part because of the defence mechanisms of a DNS server. Although there is no breakthrough contribution in this work, the authors showed the strength of DNS servers against DDoS attacks.

2.4.1.1 Discussion

Most of the presented works focus on characterizing DNS traffic since, typically, that is the first approach to understanding normalcy in network traffic. In these works, researchers propose several


parameters, like volume, packet-per-second rate, number of clients, etc. Although some of these works (e.g. [10, 41, 34]) measure changes in their statistics to detect an attack, they fail to validate thoroughly that what they observe is a DDoS attack, rather than abnormal behaviour caused by misconfiguration or by a human-related event.

Finally, these works provide information and peculiarities about how the DNS is used. For example, most of the authors suspect or confirm that several DNS queries that reach authoritative servers are, somehow, invalid. However, the criteria used to classify such queries as invalid are not clear, while works like [12] indicate the existence of local or maintenance domains at the root level of the DNS hierarchy.

2.4.2 Data Visualization

Characterizing large DNS traces using graphs. In [16], the authors proposed a methodology for anomaly detection based on graphs. In order to construct the vertices of the graph, they used the source and destination IP addresses from a sample of DNS traffic. Then, an edge is drawn between vertex a and vertex b if IP-a queries IP-b. Based on the out-degree of a node, they can differentiate between authoritative and recursive servers. According to this research, any vertex having an out-degree of 100,000 is considered a zombie of a botnet.

One of the drawbacks of this research is that the authors did not consider zone transfers as a plausible explanation for the out-degree of a vertex. A zone transfer is a process initiated by a single query from a client over the UDP protocol; the information transfer is then carried out over the TCP protocol, which, obviously, increments the volume of traffic considerably.

Visualizing DNS traffic. The authors of [54] developed a piece of software for DNS traffic visualization. They construct several graphs according to some parameters. Through their experimentation, they found a DDoS attack that changes the visual structure of the graph.

This approach is impractical, since a network administrator must visually inspect the DNS traffic to detect an anomaly. The authors also recognize that automating the detection of an attack from the graphs is challenging. Finally, they use several types of DNS parameters, giving rise to dozens of graphs to be analysed at the same time, which makes it hard to detect anomalies.

Traffic Dispersion Graphs. The work in [33] extended the research on Traffic Dispersion Graphs (TDGs), which provide a way to visualise network traffic. The resulting graphs show the interaction between several network devices according to the protocol. For example, they show how the P2P protocol usually groups several vertices into a large set of nodes. Using the DNS protocol, the TDGs show that most of the DNS queries are sent to a particular IP address. This observation only proves that the clients are asking the same DNS resolver. The TDGs have several parameters that change during an anomalous network event.


The proposed method fails to prove that it is able to detect an ongoing attack. Indeed, the authors claim that their method can detect any network attack without supporting this statement experimentally. They also did not provide enough evidence to substantiate the detection of a DDoS attack.

2.4.2.1 Discussion

Using visualization methods for the detection of DDoS attacks is difficult, since extracting significant information from them is challenging. Indeed, the works presented in this section show that visualization methods are impractical, in part because some of them require human intervention.

Besides this drawback, visualization methods can show the status of the network in a human-friendly way. Also, if we merge these methods with machine learning, we can expect the computer to assist in deciding whether or not an anomaly is happening in the network.

2.4.3 Machine Learning

Context-aware clustering of DNS query traffic. In [50], the authors proposed a method to classify traffic given an RR type. Particularly, they proposed to classify DNS queries considering the following classes: intended, unwanted and blacklisted. With this approach, they managed to identify URL requests from end-users. They also noticed malformed URLs and invalid TLDs in the DNS traffic, supporting previous researchers' observations. Indeed, because of this type of traffic, the authors claim that they found a botnet in the network.

As a flaw of this work, we can mention that the criteria used to identify the botnet are vague, since they rely only on the packet-per-second rate. This is the same drawback as in [16], since the authors did not consider zone transfers, which increment significantly the traffic volume in a given time period.

Analysing Root DNS Traffic. In [39] the authors studied a root server so as to propose a methodology for the detection of improper DNS server configurations. They used algorithms like K-means, PCA and LDA. First, they characterised preprocessed DNS traffic and used PCA to reduce the dimensionality of the data. Second, they used K-means to form clusters from these data. Finally, they used LDA to observe some separation between the clusters. From the results, they identified 5 clusters which may characterize abnormal behaviour. For example, in one cluster the inter-arrival time was too large while the number of unique requests was low. Thus, the authors consider this cluster abnormal, given that a root server usually replies to a large number of unique URLs.

The experimentation fails to prove that the methodology does not depend on the analysed root server. Moreover, the evidence for classifying clusters as abnormal is not well presented, since it appears to be based on conjectures rather than on experimentation.


Modelling DNS Activities Based on Probabilistic Latent Semantic Analysis. In this work [72] the authors proposed a method for mining DNS data so as to find relationships between users and the visited domains, namely groups. In particular, they used the Probabilistic Latent Semantic Analysis (PLSA) technique, which computes the probability of belonging to a group based on the visited domains, to determine groups of interest. Later, using what the authors called a 'characteristic' user and domain, they try to give a semantic meaning to the groups. To test their method, they used a dataset of real-world traffic. Their results indicate about 15 different groups, each of which corresponds, according to the authors, to a different behaviour in the DNS traffic. For example, one group tends to collect users who visit CN domains. Although the approach is interesting, they only compute these groups to characterize DNS traffic, without giving hints on how to use them in other DNS applications. Finally, they suggest that their method can be readily applied to compute groups in other areas.

Tracking Anomalous Behaviors of Name Servers by Mining DNS traffic. The authors of this work [65] characterised real-world DNS traffic using a statistical approach. Particularly, they studied DNS traffic features such as the number of queries per client, the number of server responses, the number of queries to a particular server, and the number of responses to a particular client, in a given time period. Using such features, they proposed two types of anomalies, one related to the frequency of the queries and the other related to the frequency of the responses.

The results showed that there are several query types present in the analysed log which can be considered abnormal. The authors considered that further investigation is required to support their claims. The contribution of this work is cryptic, because they wanted to detect anomalies so as to detect several types of DNS attacks. Moreover, they claim that some patterns in the DNS traffic correspond to anomalous behaviour, without considering that this particular behaviour could be normal.

Towards classification of DNS erroneous queries. In this work [37], a local DNS server is characterised so as to classify common DNS error queries. Indeed, the authors considered as an error a negative response, such as an NX RR, to a DNS query. Through their methodology, they proposed three types of errors: NX errors, where a non-existent domain is returned; server errors, where the authoritative server did not want to answer the query; and refused errors, where a local DNS server did not want to answer the query. The authors found some usage patterns in the DNS traffic, basically because of human activity on the network.

The proposed DNS traffic classifier, which uses some heuristic rules, has a low false-positive rate. Finally, the classifier finds anomalous DNS queries related to spam activities.

2.4.3.1 Discussion

Machine learning methods allow us to identify network normalcy. Although some of the presented works did not support their findings about anomalies, they provide an idea of how DNS servers work. Moreover, because some of the machine learning methods require an interpretation, they have


not been adopted entirely. These techniques are promising, because works like [72] showed that it is possible to extract groups from the DNS traffic. Moreover, their findings support our hypothesis that considering the social structure can help to detect anomalies on a DNS server.

2.4.4 DNS Deployment

A correct implementation of DNS servers will enhance their security assurance. For example, [4] proposes to create three name systems with plain architectures, to be used as an interface between the end-user and a recursive DNS server. The volume of traffic would then be distributed amongst the architectures. However, according to the author, only one DNS registrar would handle the architecture, so a massive implementation is likely to be hard.

As another example, in [5, 48], the authors proposed to constantly update the TTL of the most used RRs so as to reduce the number of iterative processes initiated by a recursive DNS server. Since it is expected that an RR of a popular website (e.g. Amazon, Google, Yahoo, etc.) does not change over time, these websites will stay in the recursive server's cache memory. Although this idea is sound, it is difficult to see how the performance of the DNS server is improved. Indeed, since the number of 'popular' websites is small (they must follow a Zipf distribution), the contribution of this method to the DNS performance could be modest.

The authors of [71] studied the way in which a DNS resolver selects an authoritative DNS server to conduct an iterative process. Moreover, they studied this process under 3 different scenarios. With this, they aim to improve DNS packet routing and the performance of the translation service. The results show that there are particular cases where a resolver does not select the best DNS server; this is because a good selection depends on the configuration of the DNS servers. Given a poor selection of a DNS server, the resolver will suffer delays that directly affect the end-users' requests. Although it is difficult to see a breakthrough contribution, this study is novel in the sense that it provides an experimental perspective on the selection of authoritative servers. Finally, this work provides some hints on how to improve DNS server selection considering several measures.

Finally, in [9] the authors proposed a congestion-avoidance-based method for load balancing on DNS servers. Typically, an end-user web service request, like google.com, implies generating several DNS requests which redirect the web browser through different content servers (e.g. pictures, sound, etc.). In the DNS service, the selection is conducted by means of a Round-Robin policy. The authors claim that a disadvantage of using such a selection scheme is that the DNS server may overload the end-user web services. Thus, they propose to use a congestion-avoidance-based method so that the DNS server does not overload end-user web services. The authors did not provide enough evidence to support that their method reduces the load on web services. Moreover, they compute a probability based on the hypothesis that a server that has a 70% load will increase its activity. This argument is not properly supported with evidence, as it is not possible to predict the activity only from the current load of a server.


2.4.4.1 Discussion

The motivation behind this kind of work is the improvement of the performance of the DNS service. Moreover, these works suggest changes to the DNS servers, similar to the works of [12, 39, 10].

A correct deployment of the local DNS infrastructure can reduce significantly the volume of traffic reaching higher-level DNS servers. Also, a correct deployment will enhance the availability of a DNS server during a DDoS attack.

2.5 Conclusions

The Domain Name System is a hierarchical and distributed name system that provides network users with a translation service from domains to IPs and vice versa. It is a critical infrastructure of the Internet. Consequently, it is a common target of cybercrime; particularly, DNS, like many other network services, is weak against DDoS attacks.

There is concern about the availability of DNS given the continuous increase in this kind of attack.^5 However, few efforts have been made to protect the DNS server. One of these efforts consists in increasing the number of DNS servers; this is impractical, given that attackers can also increase their capacity to launch a DDoS attack.

In general, the related works fall into four categories: statistics, machine learning, visualization techniques, and particular configurations of DNS servers.

As we have shown in this chapter, most of the research about DNS is related to characterizing it or to improving its performance. Few works focus on DDoS attacks.

In this thesis, we have built an anomaly-based Intrusion Detection System (IDS), following the hypothesis that DNS usage gives rise to social structures that change during a DNS attack. The reasoning behind this proposition is that social structures allow us to investigate how a collection of agents, namely IP addresses, relate to one another on the basis of the web domains (URLs) they have commonly visited over a period of time. Consequently, during an attack such relations will change, possibly indicating a DDoS attack.

^5 For example, security laboratories like Incapsula have reported on the increase of DDoS attacks: http://www.incapsula.com/blog/massive-dns-ddos-flood.html


3. The Commonality Amongst the Objects upon which a Collection of Agents has Performed an Action

In this chapter we shall define the social structures that arise from the commonality amongst the objects upon which a collection of agents has performed an action.

Then, we shall argue that any algorithm for computing such social structures requires an exponential amount of computational resources; we call this problem Social Group Commonality (SGC). To support this observation, we provide a proof of its NP-completeness.

Our proof consists of a Karp reduction from the well-known Longest Common Subsequence (LCS) problem to SGC.

We will formulate SGC in terms of a matrix Q so as to prove that a special case of SGC, which we call 2-SGC, where the commonality amongst actions is limited to agent pairs, remains NP-complete. For proving the NP-completeness of 2-SGC, though, our reduction departs from the well-known Hitting Set problem.

Before concluding this chapter, we shall present characteristics of the Gram matrix that allow us to assert a non-constructive, existential condition of a group being, but not which agents and objects conform the group.

Results from this chapter are published in [2].


3.1 Social Groups: Computing Commonality Amongst Users, Actions, and Objects

In this section, we provide a model for the behaviour of a social group, based on the objects referred to by an action carried out by a collection of agents. We shall use this model to formalise both decision problems: SGC and 2-SGC.

3.1.1 Agents Executing Actions over Objects for a Period of Time

Given that the behaviour of an agent might change over time (e.g., people may lose friends, books, etc.), the kinds of agent interactions, together with the kinds of agent relations these interactions give rise to, remain valid only over some period of time. Accordingly, we capture relations of interest relative to a given time period, called a window. A window is given by a number of agent actions, each of which we henceforth call a query.

Let $W$ be the set of all windows, ranged over by $w_1, w_2, \ldots$, let $A$ be the set of all agents, ranged over by $a_1, a_2, \ldots$, and let $O$ be the set of all objects, ranged over by $o_1, o_2, \ldots$. We shall use $qry_w(a, o)$ to denote that agent $a$ has queried object $o$, over window $w$.

The set of active agents, with respect to a given window $w \in W$, is defined as follows:

$$agt(w) = \{x \in A \mid \exists y \in O,\ qry_w(x, y)\}$$

Likewise, the set of objects onto which the agent actions have been performed is given by:

$$obj(w) = \{y \in O \mid \exists x \in A,\ qry_w(x, y)\}$$
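For concreteness, the following minimal Python sketch (an illustration of ours, not part of the formal development; the agent and object labels are hypothetical) computes both sets from a window materialised as a list of (agent, object) query pairs:

```python
def agt(window):
    """Active agents: all agents that issued at least one query in w."""
    return {a for (a, _) in window}

def obj(window):
    """Queried objects: all objects referred to by at least one query in w."""
    return {o for (_, o) in window}

w = [("a1", "o1"), ("a1", "o2"), ("a2", "o2"), ("a3", "o1")]
print(sorted(agt(w)))  # ['a1', 'a2', 'a3']
print(sorted(obj(w)))  # ['o1', 'o2']
```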

3.1.2 Groups

We now define a structure, called a social group, which relates agents that have carried out a query over the same collection of objects for a given time window:

Definition 1 (Social Group). Let $w$ be a window, with agents $agt(w)$ and objects $obj(w)$. Then, the tuple:

$$\langle w,\ A \subseteq agt(w),\ O \subseteq obj(w) \rangle$$

written $g_w(A, O)$ for short, forms a social group of size $|O|$ and weight $|A|$, iff every agent in $A$ has queried all objects in $O$; in symbols:

$$\forall x \in A.\ \forall y \in O.\ qry_w(x, y)$$

Notice that, in particular, for a given group $g_w(A, O)$, $qry_w$ is the Cartesian product $A \times O$.
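Verifying that a candidate tuple is a social group thus amounts to $|A| \cdot |O|$ membership tests against $qry_w$, which is the polynomial-time check we shall appeal to later when proving NP-membership. A small Python sketch of this check (ours, with hypothetical labels) follows:

```python
def is_group(window, A, O):
    """Definition 1: g_w(A, O) is a social group iff every agent in A
    queried every object in O in w, i.e. qry_w contains A x O."""
    queries = set(window)
    return all((a, o) in queries for a in A for o in O)

w = [("c1", "b2"), ("c1", "b3"), ("c2", "b3"), ("c2", "b2"), ("c3", "b1")]
print(is_group(w, {"c1", "c2"}, {"b2", "b3"}))  # True: size 2, weight 2
print(is_group(w, {"c1", "c3"}, {"b1"}))        # False: c1 never queried b1
```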

Definition 2 (Size-/Weight-Maximal Group). Let $w \in W$ be a window, and let $G_w$ denote all the existing groups in $w$. Then, a group $g_w(A, O) \in G_w$ is called a size-maximal group of $G_w$ if there does not exist $g_w(A', O') \in G_w$ such that $|O| < |O'|$.


Similarly, a group $g_w(A, O) \in G_w$ is called a weight-maximal group of $G_w$ if there does not exist $g_w(A', O') \in G_w$ such that $|A| < |A'|$.

Clearly, we can build a poset out of $G_w$, using a lexicographical order, $\preceq$, which combines the two previous posets, namely size and weight, in that order.

Definition 3 ($\preceq$, $\prec$, Maximal Group). Define $\preceq$ and $\prec$ as follows:

• $\langle a, c \rangle \preceq \langle b, d \rangle$ iff $a < b$, or ($a = b$ and $c \leq d$), and

• $s \prec s'$ iff $s \preceq s'$ and $s \neq s'$.

Then, let $w \in W$ be a window, and let $G_w$ denote all the existing groups in $w$. A group $g_w(A, O) \in G_w$ is called a maximal group of $G_w$ if there does not exist $g_w(A', O') \in G_w$ such that $\langle |O|, |A| \rangle \prec \langle |O'|, |A'| \rangle$.

3.1.3 Queried Objects and Queried by Individuals

We now define symbols that collect information about individuals’ activities.

Definition 4 (Agent Cover, Object Attraction). Let $w \in W$ be a window, with agents $agt(w) = \{a_1, a_2, \ldots\} \subset A$, and objects $obj(w) = \{o_1, o_2, \ldots\} \subset O$. Then, the cover of an agent $a_i$, with respect to $w$, written $q_i(w)$, is a list, just as $w$, except that it contains all the objects queried by agent $a_i$, following $w$'s order of appearance:

$$q_i(w) = \langle o_j, o_k, \ldots \rangle \quad \text{whenever} \quad w = \langle \ldots, (a_i, o_j), \ldots, (a_i, o_k), \ldots \rangle$$

Likewise, the attraction of an object $o_j$, with respect to $w$, written $trk_j(w)$, is the list of all agents that have queried $o_j$:

$$trk_j(w) = \langle a_i, a_k, \ldots \rangle \quad \text{whenever} \quad w = \langle \ldots, (a_i, o_j), \ldots, (a_k, o_j), \ldots \rangle$$
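Both notions are directly computable from a window; the sketch below (ours, with hypothetical labels) extracts a cover and an attraction by a single pass over the query pairs:

```python
def cover(window, a_i):
    """q_i(w): objects queried by agent a_i, in w's order of appearance."""
    return [o for (a, o) in window if a == a_i]

def attraction(window, o_j):
    """trk_j(w): agents that queried object o_j, in w's order of appearance."""
    return [a for (a, o) in window if o == o_j]

w = [("a1", "o1"), ("a2", "o1"), ("a1", "o2"), ("a3", "o2")]
print(cover(w, "a1"))       # ['o1', 'o2']
print(attraction(w, "o2"))  # ['a1', 'a3']
```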

3.1.4 “Real” World Examples of SGC

Example 1. A team of market researchers is interested in identifying groups of clients of an online bookshop with common book interests. Since the set of client book purchases is rather huge, the team decides to segment the purchase record history using a sliding-window approach, therefore fixing the window size (for instance, the number of purchases) and the window step (for instance, the number of purchases the window is to be slid for the next group analysis). For the sake of simplicity, suppose


the research team has fixed the window size to 9 purchases, and that, currently, for some window $w$, it has got the following observation:

$$w = \langle (c_1, b_1), (c_1, b_2), (c_1, b_3), (c_2, b_4), (c_2, b_3), (c_2, b_2), (c_3, b_1), (c_3, b_4), (c_2, b_1) \rangle$$

where we use $c$ and $b$ to denote a client and a book, respectively. Then, clients and books form the sets $agt(w) = \{c_1, c_2, c_3\}$ and $obj(w) = \{b_1, b_2, b_3, b_4\}$. The client purchase record, as expressed by $w$, is then used to respectively compute the covers of agents $c_1$, $c_2$, and $c_3$: $q_1 = \langle b_1, b_2, b_3 \rangle$, $q_2 = \langle b_4, b_3, b_2, b_1 \rangle$, and $q_3 = \langle b_1, b_4 \rangle$.

Clients $c_1$ and $c_2$ form a group of size and weight two, since both have purchased books $b_2$ and $b_3$; in symbols: $g(\{c_1, c_2\}, \{b_2, b_3\})$. Notice that clients $c_2$ and $c_3$ also form a group of the same measures, this time, though, given by $g(\{c_2, c_3\}, \{b_1, b_4\})$, as they have in common the purchases $b_1$ and $b_4$. With this information, the market research team might issue a campaign, offering for sale, e.g., the books that clients do not have in common.

Example 2. Suppose now that we are interested in studying how a collection of concepts, $c_1, c_2, \ldots$, are referred to in a few Wikipedia articles, $a_1, a_2, \ldots$. Also suppose that, at some point, we have analyzed these articles and taken the following observation:

$$w = \langle (c_1, a_1), (c_1, a_2), (c_1, a_3), (c_2, a_4), (c_2, a_3), (c_2, a_2), (c_3, a_1), (c_3, a_4), (c_2, a_1) \rangle$$

Then, $agt = \{c_1, c_2, c_3\}$ and $obj = \{a_1, a_2, a_3, a_4\}$, with agent covers $q_1 = \langle a_1, a_2, a_3 \rangle$, $q_2 = \langle a_4, a_3, a_2, a_1 \rangle$, and $q_3 = \langle a_1, a_4 \rangle$. Notice, however, that in this case $q_i$ denotes the articles where concept $c_i$ appears.

Again, we find two groups. One, $g(\{c_1, c_2\}, \{a_1, a_2, a_3\})$, conveys that concepts $c_1$ and $c_2$ appear in three articles: $a_1$, $a_2$, and $a_3$; the other, $g(\{c_2, c_3\}, \{a_1, a_4\})$, that concepts $c_2$ and $c_3$ appear in articles $a_1$ and $a_4$.

Further examples of SGC are given by simply substituting agents, objects and the action in a particular scenario. For example, students and books correspond to agents and objects, respectively, while the action could be buying. As another example, we can consider users as agents, web pages as objects, and visiting a web page using a browser as the action.

3.1.5 Problem Statement

Take a window, $w \in W$, and two positive integers, $z$ and $t$. Then, the problem that asks for all groups $g_w$ in $w$, having size $z$ or less and weight $t$ or less, is, clearly, provably intractable. This is because it is easy to construct instances of the problem where exponentially many groups are smaller than or equal to the given bound; this way, no polynomial-time algorithm could possibly list them all. As pointed out by Garey and Johnson [23], this problem formulation might not be realistic, in that it involves


more information than one could hope to use. This remains true for our problem, unless we are trying to compute a maximal group, or to compare two or more populations in terms of their activity, as is the case of this thesis work.^1

Accordingly, we shall cast the calculation of social groups as a decision problem, having two possible solutions, namely 'yes' or 'no', in a way that makes it of practical interest. In addition, this casting is necessary, as we shall be constructing a Karp reduction from a well-known decision problem to ours when proving the NP-completeness of social group calculation.

The decision version of the social group calculation problem can be defined as shown below:

Definition 5 (Social Group Calculation, SGC).
INSTANCE: A window $w \in W$, a finite set $obj = \{o_1, o_2, o_3, \ldots, o_n\} \subset O$ of objects, a finite set $agt = \{a_1, a_2, a_3, \ldots, a_m\} \subset A$ of agents, a finite set $Qy = \{q_1, q_2, q_3, \ldots, q_m\}$ of agent covers, one for each agent, a positive integer $z > 0$, and a positive integer $t > 0$.
QUESTION: Is there a group of size less than or equal to $z$ and weight $t$?

Example 3. Let $w = \langle (a_1, o_1), (a_1, o_2), (a_1, o_3), (a_2, o_4), (a_2, o_3), (a_2, o_2), (a_3, o_1), (a_3, o_4) \rangle$, $agt = \{a_1, a_2, a_3\}$, $obj = \{o_1, o_2, o_3, o_4\}$, and $Qy = \{q_1, q_2, q_3\}$, with agent covers $q_1 = \langle o_1, o_2, o_3 \rangle$, $q_2 = \langle o_4, o_3, o_2 \rangle$, and $q_3 = \langle o_1, o_4 \rangle$. Together, they constitute a yes-instance of SGC, with $z = 2$ and $t = 2$, witnessed by the group $g(\{a_1, a_2\}, \{o_2, o_3\})$.
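To illustrate the brute force that Definition 5 implicitly demands, the following naive Python decision procedure (ours; not an algorithm proposed in this thesis) enumerates all $t$-subsets of agents and checks their common objects, taking exponential time in the worst case:

```python
from itertools import combinations

def sgc(window, z, t):
    """Naive SGC check: is there a group of size z and weight t in the
    window?  Enumerating all t-subsets of agents needs exponential time
    in general, as the NP-completeness result below suggests."""
    covers = {}
    for (a, o) in window:                  # agent covers, as sets
        covers.setdefault(a, set()).add(o)
    for A in combinations(sorted(covers), t):
        common = set.intersection(*(covers[a] for a in A))
        if len(common) >= z:               # any z common objects witness g(A, O)
            return True
    return False

# Example 3's instance: a yes-instance with z = 2 and t = 2.
w = [("a1", "o1"), ("a1", "o2"), ("a1", "o3"), ("a2", "o4"),
     ("a2", "o3"), ("a2", "o2"), ("a3", "o1"), ("a3", "o4")]
print(sgc(w, 2, 2))  # True, witnessed by g({a1, a2}, {o2, o3})
```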

3.2 SGC is NP-complete

The classic paper of Karp [23, 36] was a significant breakthrough towards the understanding of complex problems. Karp proved the NP-completeness of several well-known problems using a mapping procedure, later known as a Karp reduction. In the years that followed, several works on proving NP-completeness appeared; for example, over the last 12 years, works like [18, 70, 19, 56] have shown the continued validity of Karp's idea.

Karp showed that, for proving the NP-completeness of an unknown problem, $\pi'$, one may follow a five-step procedure. In the first step, we select a problem, $\pi$, that has been proven to be NP-complete (in principle, any NP-complete problem would do, but a careful selection makes it easier to find the proof of the third step, see below). Next, in the second step, we prove that $\pi'$ is in NP. Then, in the third step, we show how to transform $\pi$ to $\pi'$; this step is typically known as the reduction, and denoted by $\pi \leq_p \pi'$. In the fourth step, we show that the reduction can be carried out in polynomial time. Finally, in the fifth step, we prove that whenever there is an answer to $\pi$, then there is also an answer to $\pi'$, and vice versa.

As expected, we shall follow this proof procedure for establishing the main results of this chapter.

^1 Indeed, there are many other cases in which we might compare populations; for example, to suggest friends, based on common preferences, over a social network service.


3.2.1 Longest Common Subsequence problem

For our first reduction, we have chosen the Longest Common Subsequence (LCS) problem: LCS $\leq_p$ SGC. LCS is both NP-complete and well-known [42], for it arises in many contexts [7], such as bioinformatics [40] or file comparison (cf. the UNIX diff command). Let length($\cdot$) be a polymorphic function with the natural interpretation, returning the number of elements of its input argument; then LCS is defined as follows:

Definition 6 (Longest Common Subsequence).
INSTANCE: A finite alphabet $\Sigma$, a set $R$ of strings from $\Sigma^*$, and a positive integer $k$.
QUESTION: Is there a string $s' \in \Sigma^*$, with length($s'$) $\geq k$, such that $s'$ is a subsequence of each $s \in R$?

Example 4. Let $\Sigma = \{a, b, c, d\}$, $R = \{s_1, s_2\}$, with strings $s_1 = abcd$ and $s_2 = dadbcaa$,^2 and let $k = 3$. Together, they constitute a yes-instance of LCS, with $k = 3$, witnessed by $s' = abc$, the longest common subsequence of the strings $s_1$ and $s_2$.
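Checking a candidate LCS witness is easy; the short Python sketch below (ours) verifies Example 4's witness in time linear in each string:

```python
def is_subsequence(s, r):
    """True iff s is a subsequence of r (s's symbols occur in r, in order).
    Membership tests on the iterator consume it, enforcing the ordering."""
    it = iter(r)
    return all(ch in it for ch in s)

# Verifying Example 4's witness in polynomial time:
R = ["abcd", "dadbcaa"]
print(all(is_subsequence("abc", r) for r in R) and len("abc") >= 3)  # True
```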

3.2.2 SGC NP-Completeness

Theorem 1: SGC is NP-complete.

Proof: Having fixed $\pi$ to be LCS, we first prove that SGC is in NP. To see this, notice that any instance of SGC can be solved using a Non-Deterministic Turing Machine (NDTM), which, upon halting, provides a witness, $g(A, O)$. Verifying that $g(A, O)$ truly is a witness can certainly be carried out in polynomial time, as it consists of checking that every agent in the set $A$ queries all objects in the set $O$.

We now proceed to produce the reduction, LCS $\leq_p$ SGC. We use $I, J, \ldots$ to stand for indexing sets, and write $\widetilde{\Sigma}_I = \{\ell_i : \ell \in \Sigma, i \in I\}$ to denote the set of symbols in $\Sigma$, indexed by $I$. Let $s|_p$ denote the element at position $p \in \{1, \ldots, \text{length}(s)\}$ of $s$, either a list or a string. Now, let $\Sigma$ be a finite alphabet, $R$ a set of strings from $\Sigma^*$, and let $k$ be an integer, such that they all constitute an instance of the LCS problem. Our reduction maps the parameters of an LCS instance to an SGC instance, as follows:

1. For each alphabet symbol, $\ell_i \in \widetilde{\Sigma}_I$, create a unique object, $o_i \in \widetilde{obj}_I$, and then build the bijective function, $\mapsto_I$, which associates every symbol in $\Sigma$ with an object in $obj$: $\widetilde{\Sigma}_I \mapsto_I \widetilde{obj}_I$.

2. For each string, $s_j \in \widetilde{R}_J$, create a unique agent, $a_j \in \widetilde{agt}_J$, and then build the bijective function $\mapsto_J$, such that $\widetilde{R}_J \mapsto_J \widetilde{agt}_J$.

^2 Following standard notation, we use juxtaposition to denote the string constructor function.


3. For each string, $s_j \in \widetilde{R}_J$, build the associated cover of the agent $a_j$, $q_j$, such that length($s_j$) = length($q_j$), and such that, for all $p \in \{1, \ldots, \text{length}(s_j)\}$, $s_j|_p \mapsto_I q_j|_p$; in this way, we also build $Qy$.

4. For each agent cover, $q_j = \langle o_i, o_k, \ldots \rangle$, build the expected agent (sub-)window: $w_j = \langle (a_j, o_i), (a_j, o_k), \ldots \rangle$. Next, build $w$ simply by concatenating all agent windows.

5. Finally, set $z = k$ and $t = |R|$.
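The five mapping steps are mechanical; the following Python sketch (ours) carries them out on an LCS instance. The particular bijection chosen here (sorted symbol order) is an assumption of the illustration only; the proof merely requires the maps to be bijective.

```python
def lcs_to_sgc(sigma, R, k):
    """Sketch of the Karp reduction LCS <=_p SGC (steps 1-5 above)."""
    to_obj = {l: f"o{i}" for i, l in enumerate(sorted(sigma), 1)}   # step 1
    agents = [f"a{j}" for j in range(1, len(R) + 1)]                # step 2
    Qy = {a: [to_obj[ch] for ch in s] for a, s in zip(agents, R)}   # step 3
    w = [(a, o) for a in agents for o in Qy[a]]                     # step 4
    return w, Qy, k, len(R)                                         # step 5: z = k, t = |R|

# Example 5's instance below: Sigma = {a,..,e}, R = {abbdc, abdeb}, k = 3.
w, Qy, z, t = lcs_to_sgc({"a", "b", "c", "d", "e"}, ["abbdc", "abdeb"], 3)
print(Qy["a1"], z, t)  # ['o1', 'o2', 'o2', 'o4', 'o3'] 3 2
```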

Example 5. Consider an instance of LCS, where $\Sigma = \{a, b, c, d, e\}$, $R = \{s_1, s_2\}$, $s_1 = abbdc$, $s_2 = abdeb$, and $k = 3$. Following our Karp reduction, we generate the next instance of SGC:

1. $obj = \{o_1, o_2, o_3, o_4, o_5\}$;
2. $agt = \{a_1, a_2\}$;
3. $q_1 = \langle o_1, o_2, o_2, o_4, o_3 \rangle$, $q_2 = \langle o_1, o_2, o_4, o_5, o_2 \rangle$, and $Qy = \{q_1, q_2\}$;
4. $w = \langle (a_1, o_1), (a_1, o_2), (a_1, o_2), (a_1, o_4), (a_1, o_3), (a_2, o_1), (a_2, o_2), (a_2, o_4), (a_2, o_5), (a_2, o_2) \rangle$;
5. $z = 3$, and $t = 2$.

Our reduction can be carried out in polynomial time; indeed, it is clearly linear in the number of steps plus the number and length of each string.

Now, the only step left is to prove that, for any LCS instance, there exists a common subsequence with length($s'$) $\geq k$ if and only if there is also a group of size $z = k$ and weight $t = |R|$ in the generated SGC instance.

($\Rightarrow$) Take again an instance of LCS, with $\Sigma$, a finite alphabet, $R$, a set of strings from $\Sigma^*$, and $k$, a positive integer. Also, take this instance to be positive, with witness $s$. To transform $s$ into an SGC witness, $g$, carry out our reduction, and then proceed as follows:

1. Set $A = agt$, and $O = \{o \mid s|_p \mapsto_I o,\ p \in \{1, \ldots, \text{length}(s)\}\}$.

2. Finally, set $g(A, O)$, which stands for a group of size $|O|$ and weight $|A|$.

Example 6. Consider again Example 5. The LCS instance is actually positive: $s = abd$ is a witness longest common subsequence, with $k = 3$. Carrying out the previous procedure, we end up with an SGC instance which is also positive, witnessed by $g(\{a_1, a_2\}, \{o_1, o_2, o_4\})$, with size $z = 3$ and weight $t = 2$.

($\Leftarrow$) We first prove that finding a group in the generated SGC instance implies finding a common subsequence for the LCS instance. Take the generated instance of SGC, with $agt$, $obj$, $Qy$, $w$, $z$ and $t$. Then, to find a group, if any, proceed as follows:


1. Set $A = agt$. Find $O \subseteq obj$, such that $|O| = z$, and such that $g(A, O)$ forms a group of size $z = |O|$ and weight $t = |A|$. If there does not exist one such group, halt with failure (see below); otherwise convert $O$ into a list, $O'$, and continue.

2. Using the inverse of $\mapsto_I$,^3 transform $O'$ into the string $\widehat{O'}$.

3. Transform $\widehat{O'}$, imposing an ordering, so that it is a subsequence of every $\widehat{s}_j$ ($j \in J$).

4. Call the transformed string $s$; it actually is the witness of the LCS instance; that is, $s$ is a longest common subsequence of every string in $R$.

Example 7. Let us go back again to the generated SGC instance of Example 5, and assume we have now set $k = 2$. Then, using the previous procedure, we could have picked up the group $g(\{a_1, a_2\}, \{o_2, o_4\})$ of size $z = 2$ and weight $t = 2$. This, in turn, yields the LCS witness $s' = bd$ for $k = 2$.

Consider now the case where we have $A \subseteq agt$ and $O \subseteq obj$, such that $A$ and $O$ are not a group of size $|O|$ and weight $|A|$. Then, it only remains to prove that $O$ cannot be transformed into a common subsequence, for a positive answer to the LCS problem instance. To see this, notice that, since $A$ and $O$ are not a group, not all the elements of $O$ appear in the associated agent covers, and thus $O$ cannot be transformed into a string that is a subsequence of every string in $R$, out of which we have computed the agent covers.

Example 8. Consider Example 5, and let $A = \{a_1, a_2\}$ and $O = \{o_4, o_5\}$. Together, these sets are not a group of size $z = 2$ and weight $t = 2$. Transforming this SGC answer into an LCS witness gives us $s = de$, which can easily be seen not to be a valid common subsequence with length($s$) $\geq k = 2$.

^3 Recall that $\mapsto_I$ and $\mapsto_J$ are bijective, so it is guaranteed that they both have an inverse.


3.3 2-SGC is NP-complete

We now consider the scenario where, given an observation window, we would like to know whether every agent belongs to a group of a size and weight at least equal to two. For example, given a collection of Facebook friends, we might want to know if all of them have commonality over some particular product.

This special case of SGC, which we call 2-SGC, is still NP-complete. Our result follows from a Karp reduction from the well-known Hitting Set problem, and relies on a graphical representation of 2-SGC.

3.3.1 2-SGC Problem Statement

Clearly, given a window $w \in W$, the activity of agents, $agt(w)$, over a collection of objects, $obj(w)$, can be represented by means of a query matrix, $Q^w$, of size $|agt(w)| \times |obj(w)|$ (where $|S|$ denotes the cardinality of set $S$), such that $Q^w_{i,j} = m$ implies that agent $a_i$ has queried object $o_j$ $m$ times across $w$. A query matrix then gives rise to a graph, which we call a connectivity graph, $G = (V, E)$, such that $V = obj \cup agt$, and $E = \{(a_i, o_j) \mid Q^w_{i,j} = 1\}$ (see Fig. 3.1). We formalise 2-SGC as follows (as before, when understood from the context, we shall refrain from explicitly noting the window, $w \in W$, upon which observations are made):

Definition 7 (2-SGC).
INSTANCE: A connectivity graph $G = (V, E)$, with $V = obj \cup agt$ and $E = \{(a_i, o_j) \mid Q_{i,j} = 1\}$, for a given $Q$.
QUESTION: Is there a 2-SGC? I.e., does every agent, $a_i \in agt$, belong to a group, involving objects $o_j \in obj$, of a size and a weight at least equal to two?
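Under this formulation, building the connectivity graph from a 0/1 query matrix is immediate; a sketch (ours, reproducing the graph of Example 9 below) follows:

```python
def connectivity_graph(Q, agents, objects):
    """Definition 7's input: V = obj U agt, with an edge (a_i, o_j)
    whenever Q[i][j] == 1."""
    V = set(agents) | set(objects)
    E = {(a, o) for i, a in enumerate(agents)
                for j, o in enumerate(objects) if Q[i][j] == 1}
    return V, E

Q = [[1, 1, 1],   # a1 queried o1, o2, o3
     [1, 1, 0],   # a2 queried o1, o2
     [1, 0, 1]]   # a3 queried o1, o3
V, E = connectivity_graph(Q, ["a1", "a2", "a3"], ["o1", "o2", "o3"])
print(sorted(E))  # the edges of the graph in Fig. 3.1
```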

Example 9. Fig. 3.1 portrays an example connectivity graph, where $G = (V, E)$, with $V = obj \cup agt$, $obj = \{o_1, o_2, o_3\}$ and $agt = \{a_1, a_2, a_3\}$, and where $E = \{(a_1, o_1), (a_1, o_2), (a_1, o_3), (a_2, o_2), (a_2, o_1), (a_3, o_1), (a_3, o_3)\}$. It actually is a yes-instance of 2-SGC, witnessed by $g(\{a_1, a_3\}, \{o_1, o_3\})$ and $g(\{a_1, a_2\}, \{o_1, o_2\})$. Notice that, if we removed the dotted line, the instance would no longer be of type yes.

3.3.2 Hitting Set

For proving 2-SGC NP-Completeness, we have chosen Hitting Set (HS) [36, 49]:

Definition 8 (Hitting Set, HS).
INSTANCE: A finite set $S$, a collection $C$ of subsets of $S$, and a positive integer $k \leq |S|$.
QUESTION: Is there a hitting set? I.e., is there a subset $S' \subseteq S$ such that $S'$ contains at least one element of each subset in $C$, with $|S'| \leq k$?


Figure 3.1: Incidence matrix $Q$, and associated connectivity graph; as is standard in the literature (resource usage), agents are denoted with circles, and objects with squares.

Example 10. Take $S = \{s_1, s_2, s_3, s_4\}$, $C = \{C_1, C_2, C_3\}$, $C_1 = \{s_1, s_2, s_3\}$, $C_2 = \{s_2, s_4, s_3\}$, and $C_3 = \{s_1, s_2\}$. Together, they constitute a yes-instance of HS, with $k = 3$, witnessed by $S' = \{s_2, s_3, s_4\}$.

3.3.3 2-SGC NP-Completeness

Theorem 2: 2-SGC is NP-complete.

Proof: Clearly, an NDTM can be used to solve any instance of the 2-SGC problem, yielding a collection of groups. Verifying the 2-SGC witness amounts to verifying that each agent belongs to a group of a size and a weight at least equal to two. As argued in Section 3.2.2, group verification can be carried out in polynomial time. Thus, 2-SGC is in NP.

We shall now provide a Karp reduction: HS $\leq_p$ 2-SGC. Consider an instance of HS, with $S$, a finite set, $C$, a collection of subsets of $S$, and $k \leq |S|$, an integer. We shall index $S$ and $C$ using $I$ and $J$, respectively. Then, for each element $s_i \in \widetilde{S}_I$, we construct a social group as follows:

1. Add what we call a partially social cell, consisting of an agent querying two objects (see Fig. 3.2).

Figure 3.2: A partially social cell.

2. If there is a set $C_j$ containing $s_i$, complete the most recently added social cell, adding an agent labeled $a_j$ (see Fig. 3.3), thus forming a group of size and weight equal to two. Otherwise,


remove the partially social cell, skip the rest of these steps, going back to step 1 and continuing with the next $s_i \in \widetilde{S}_I$. Henceforth, we shall call a chain of social cells a social component.

Figure 3.3: A social cell.

3. Then, add a partially social cell to the current social component, in such a way that the newly added agent is connected with one of the resources (see Fig. 3.4).

Figure 3.4: New agent chained to the component.

4. If there is another $C_j$ containing $s_i$, go back to step 2; otherwise, box the social component, forming a subgraph, and label this graph $G_i$, using the same index as the current $s_i$ (see Fig. 3.5). Then, go back to step 1, continuing with the next $s_i \in \widetilde{S}_I$.

Notice that, in our reduction, the objects introduced for the 2-SGC instance are dummy. An example transformation of HS to 2-SGC is given below.


Figure 3.5: Boxed component, $G_i$, comprising agent $a_j$.

Example 11. Let $S = \{s_1, s_2, s_3, s_4\}$, $C = \{C_1, C_2, C_3\}$, with $C_1 = \{s_1, s_2, s_3\}$, $C_2 = \{s_2, s_4\}$, $C_3 = \{s_1, s_2, s_4\}$, and $k = 2$ be an instance of the HS problem. After applying the previous transformation, we get the connectivity graph shown in Fig. 3.6.

Notice that our reduction can be performed in polynomial time, since the construction of the graph can be done in time linear in the cardinality of $S$. So, the only step left is to prove that a hitting set with $|S'| \leq k$ exists if and only if 2-SGC holds in the output connectivity graph.

($\Rightarrow$) Take a yes-instance of HS, with $S$, a finite set, $C$, a collection of subsets of $S$, and $k \leq |S|$, an integer, such that $S'$ is the corresponding witness, with $|S'| \leq k$. Then, collect together in a single connectivity graph all the boxed components, labeled $G_i$, for each $i$ such that $s_i \in S'$: $G = \bigcup_{s_i \in S'} G_i$. Since, by construction, every agent $a_j$ in a boxed component $G_i$ is part of a social cell, it follows that every agent belongs to a group that is of a size and weight equal to two; therefore, we have yielded a yes 2-SGC instance.

($\Leftarrow$) Finally, to transform a generated 2-SGC answer into one of HS, simply select at most $k$ boxed components, such that they contain all the agents appearing in the 2-SGC instance. To transform the 2-SGC answer into an HS answer, construct the HS witness using the corresponding $s_i$, given by the label of each selected boxed component $G_i$.

Example 12. Consider again the HS instance of Example 11, together with the associated output connectivity graph (the 2-SGC instance yielded by our reduction, shown in Fig. 3.6). The set $S' = \{s_1, s_4\}$ is hitting, with $|S'| \leq k = 2$, since it contains at least one element of each $C_j \in C$. We map $S'$ to a 2-SGC witness, selecting for each $s_i \in S'$ the corresponding boxed component labeled $G_i$, here $G_1$ and $G_4$, and forming the corresponding connectivity graph $G = G_1 \cup G_4$ (see Fig. 3.7), in which clearly


Figure 3.6: The 2-SGC instance that results from applying our reduction to the HS instance $S = \{s_1, s_2, s_3, s_4\}$, $C = \{C_1, C_2, C_3\}$, with $C_1 = \{s_1, s_2, s_3\}$, $C_2 = \{s_2, s_4\}$, and $C_3 = \{s_1, s_2, s_4\}$.

every agent $a_i$ belongs to a group of size and weight equal to two.

By contrast, notice that if we selected an invalid witness from the output connectivity graph shown in Fig. 3.6, say $G = G_1 \cup G_3$ (see Fig. 3.8), we would produce $S' = \{s_1, s_3\}$, which is also invalid as a witness for the associated HS problem. Thus, there is a hitting set, with $|S'| \leq k$, if and only if there is a 2-SGC in the connectivity graph output by our reduction.

With this, we have completed the fifth step of the Karp reduction, and also our proof of 2-SGC NP-completeness.


Figure 3.7: 2-SGC made out of social components $G_1$ and $G_4$, taken from the graph shown in Fig. 3.6.

Figure 3.8: 2-SGC made out of social components $G_1$ and $G_3$, taken from the graph shown in Fig. 3.6.


3.4 The Gram Matrix

The Gram matrices of $Q^w$ and of its transpose, $Q^{w\top}$, provide valuable information about the social structure of a DNS window. In particular, $Q^w \times Q^{w\top}$ is a symmetric matrix whose lower (respectively, upper) triangular part contains all the distinct groups of weight equal to two. Put another way, $(Q^w \times Q^{w\top})_{i,j} = n$ implies that $w$ contains a 2-weight group of size equal to $n$, involving the participation of $a_i$ and $a_j$. The main diagonal of this matrix contains information regarding the activity of the agents; in symbols, $(Q^w \times Q^{w\top})_{i,i} = n$ implies that agent $a_i$ visited at least $n$ distinct objects. From the main diagonal it is possible to determine the number of actions performed by a most active agent in $w$, i.e. $\max(\{(Q^w \times Q^{w\top})_{i,i} : i \in \{1, \ldots, |agt(w)|\}\})$.
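These observations are straightforward to reproduce numerically. The following sketch (ours, assuming the numpy library is available) computes both Gram matrices for the query matrix of Example 13 below, and reads the activity figures off the diagonal:

```python
import numpy as np

# Example 13's query matrix Q^w (rows a1..a3, columns o1..o4):
Q = np.array([[1, 1, 1, 0],
              [0, 1, 1, 1],
              [1, 0, 0, 1]])

agent_gram = Q @ Q.T   # (i, j): number of objects a_i and a_j queried in common
object_gram = Q.T @ Q  # (i, j): number of agents that queried both o_i and o_j

print(np.diag(agent_gram))        # per-agent activity: [3 3 2]
print(np.diag(agent_gram).max())  # actions of a most active agent: 3
print(agent_gram[0, 1])           # 2: a 2-weight group of size 2 for {a1, a2}
```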

Example 13. Take a window $w$, such that $q_1(w) = \langle o_1, o_2, o_3 \rangle$, $q_2(w) = \langle o_4, o_3, o_2 \rangle$, and $q_3(w) = \langle o_1, o_4 \rangle$. Then, construct the associated query matrix, $Q^w$, and the corresponding Gram matrix $Q^w \times Q^{w\top}$. To better illustrate the result, we have drawn Figure 3.9. Notice that the activity of agent $a_1$ is 3, which clearly corresponds to the length of the cover $q_1$. Note also the maximal group $g(\{a_1, a_2\}, \{o_2, o_3\})$, with $z = 2$ and $t = 2$: clearly $(Q^w \times Q^{w\top})_{1,2} = 2$, showing that it is possible to determine a maximal group.

Figure 3.9: Gram matrix $Q^w \times Q^{w\top}$, for Example 13's query matrix $Q^w$ (rows $a_1, a_2, a_3$; columns $o_1, \ldots, o_4$):

$$Q^w = \begin{pmatrix} 1 & 1 & 1 & 0 \\ 0 & 1 & 1 & 1 \\ 1 & 0 & 0 & 1 \end{pmatrix}, \qquad Q^w \times Q^{w\top} = \begin{pmatrix} 3 & 2 & 1 \\ 2 & 3 & 1 \\ 1 & 1 & 2 \end{pmatrix}$$

Complementarily, $Q^{w\top} \times Q^w$ provides information about the popularity of objects in $w$. In particular, the lower (respectively, upper) triangular part of $Q^{w\top} \times Q^w$ contains all the distinct groups of size two. Put another way, $(Q^{w\top} \times Q^w)_{i,j} = n$ implies that $w$ contains a 2-size group of weight equal to $n$, involving the use of $o_i$ and $o_j$. Also, the main diagonal contains crucial information about object use; in symbols, $(Q^{w\top} \times Q^w)_{i,i} = n$ implies that object $o_i$ has been referred to by at least $n$ different queries along $w$. Lastly, $\max(\{(Q^{w\top} \times Q^w)_{i,i} : i \in \{1, \ldots, |obj(w)|\}\})$ is the number of actions issued to a most popular object in $w$.

Example 14. Consider Example 13, but this time modify the agent cover $q_1$ so that it also queries object $o_4$, i.e. $q_1 = \langle o_1, o_2, o_3, o_4 \rangle$, and generate the Gram matrix $Q^{w\top} \times Q^w$. From Figure 3.10 we can see that it is possible to determine a group with maximal weight, e.g. $g(\{a_1, a_2\}, \{o_2, o_3\})$. Also notice that object $o_4$ has been 'touched' 3 times. Moreover, notice that $\{o_1, o_4\}$ has formed a group of size $z = 2$ with weight $t = (Q^{w\top} \times Q^w)_{1,4} = 2$.


        o1  o2  o3  o4                   o1  o2  o3  o4
  a1  [  1   1   1   1 ]           o1  [  2   1   1   2 ]
  a2  [  0   1   1   1 ]           o2  [  1   2   2   2 ]
  a3  [  1   0   0   1 ]           o3  [  1   2   2   2 ]
           Q^w                     o4  [  2   2   2   3 ]
                                      (Q^w)^T × Q^w

Figure 3.10: Gram matrix (Q^w)^T × Q^w.

Both gram matrices provide information regarding the number of groups with fixed t = 2 and z = 2, for Q^w × (Q^w)^T and (Q^w)^T × Q^w respectively. Henceforth, we shall call non-trivial any group of a size and weight greater than or equal to two.

3.4.1 Characteristics from Q and the Gram Matrix of Q

Clearly, by computing Q and both gram matrices we can determine, in polynomial time, characteristics that capture the social structure of a window. Particularly, in this work we have considered the following characteristics:

• The number of queries issued from a most active agent in w. When the activity of a single agent increases, with respect to w, it is less likely to find a group. Conversely, it is more likely to find a group when the activity of a single agent is low with respect to the other agents.

• The number of queries issued to a most popular object in w. Likewise, it is less likely to find a group when, with respect to w, a large number of agents query for the same object.

• The size of a maximal group in w. This feature increases the search space that needs to be explored. For example, a maximal group with size z and weight t is formed by groups of size z − 1 and weight no more than t.

• The weight of a group with maximal weight in w. Likewise, it increases the search space.

• The number of groups with fixed weight t = 2, computed from Q^w × (Q^w)^T. It is more likely to find a bigger maximal group, compared to a possible smaller one, when finding a large number of 2-weight groups.

• The number of groups with fixed size z = 2, computed from (Q^w)^T × Q^w. Similarly, it is more likely to find a maximal group when this value increases.

Lastly, we must point out that neither Q^w × (Q^w)^T nor (Q^w)^T × Q^w yields the corresponding group witness necessary in an NP-complete problem; that is, they assert only a non-constructive, existential condition of the group's existence, but not which agents and objects conform the group.


3.5 Conclusions

We have shown that the SGC problem and its special case 2-SGC are NP-complete. The theorems presented in this chapter give insight into the difficulty of grouping agents on the basis of the actions performed on a set of objects.

Clearly, any algorithm capable of computing a group, with its corresponding witness, in SGC (Theorem 1), or of determining whether every agent belongs to some group in 2-SGC (Theorem 2), requires an exponential amount of computational resources, unless P = NP.

Moreover, if we look for a maximal group, SGC might as well belong to the NP-hard class, since the NDTM needs to compute all groups to verify the answer. Indeed, if we try to list all the groups in a window, we face a function that may belong to the #P class. Proving NP-hardness of SGC requires a Turing reduction, while a polynomial-counting reduction would prove that SGC is also in #P.

The results from this chapter motivate a thorough study of heuristic-based methods to approximate a solution to the SGC problem. For example, such a method may use the information provided by both gram matrices of Q^w and its transpose (Q^w)^T.


4. On the Experimental Tractability of the SGC Problem

We have shown that, theoretically, it is not feasible to compute the complete social structure in a window with bounded resources. However, several NP-complete problems exhibit a behaviour where an efficient algorithm can compute, for a number of problem instances, a solution with bounded resources. In this chapter, we shall show that SGC follows the same behaviour, given that several of its problem instances are easier to solve than expected.

The identification of such manageable instances is conducted by a four-step approach which aims to study patterns in two properties: one related to the computational expense, and the other to the solvability of the problem.

In general, our results show that we can determine where the really hard instances of SGC lie in terms of computational expense. Particularly, we shall show that these hard instances occur when the problem is solvable.

The outcomes of this chapter provide information about the instances of SGC for which we can compute the complete social structure, and those for which it is necessary to estimate it. Also, our results will benefit the development of algorithms attempting to solve SGC, given that they provide a fair baseline of comparison.


4.1 Introduction

It is well known that solving an NP-complete problem usually requires an exponential amount of computational resources. However, NP-complete problems exhibit a behaviour where an efficient algorithm can provide a solution with bounded resources for several instances, because for them the computational cost can be considered negligible. Indeed, the classic work of Cheeseman [13] showed that the well-known Travelling Salesman, Graph Colouring and Hamiltonian Path problems, amongst others, exhibit this behaviour, for which an algorithm can compute a solution, if any, with polynomial resources. Complementarily, Cheeseman also identified instances for which it is extremely hard to compute a complete solution with bounded resources. Later, this approach was applied to a number of NP-complete problems, e.g. [29, 43, 20], as it provides information regarding the real hardness of the problems. For example, it makes it easy to identify instances for which computing a solution requires a large amount of computational resources. Also, this information can be used in the development of other methods, e.g. heuristic based, attempting to solve a problem, given that it provides a fair baseline of comparison. For example, if we are to determine the efficiency of two algorithms, they can be tested with hundreds if not thousands of hard instances.

We have conducted the same study on the SGC problem because it allows us to identify both manageable and hard instances of our problem. Our results will thus benefit the design of efficient algorithms attempting to solve SGC, because they account for the real hardness of the problem. Lastly, given the computer security context in which our work originated, identifying such manageable and hard instances is critical to detecting anomalies in the social structure of a window, given that a timely response allows one to mitigate the effect of an attack.

Chapter overview: In what follows (section 4.2) we give an outline of our study, which is made out of four steps. In section 4.3, we present the first step of the study, aiming to capture the problem structure in terms of a parameter. In the next step, in section 4.4, we show the instances under consideration for this study. Then, in the third step, each instance is solved using a backtracking approach (section 4.5). In the last step, we report the results of our study and discuss them in section 4.6. Before concluding (section 4.7), we suggest how to use these results in the design of efficient algorithms attempting to solve SGC.

4.2 The Phase Transition Study

In order to conduct this study, several instances of an NP-complete problem have to be solved. Then, two types of properties are studied. The first one is solvability, which defines a problem as being solvable if an algorithm finds a solution for the problem, e.g. for the SAT problem, a set of values satisfying a boolean formula; a problem is considered unsolvable if an algorithm determines that there is no solution for the problem, e.g. for the Hamiltonian path problem, an associated instance with no circuit in it. The second property is hardness, which refers to the computational cost, e.g. in terms of time in seconds, of solving the problem instance.

Studying both properties usually gives rise to a phase transition, analogous to the phenomenon in Physics. In Physics, after exceeding some critical boundary, namely the phase transition, a


material will dramatically change its properties, e.g. from liquid to solid. After exceeding the phase transition, several NP-complete problems (e.g. the SAT problem [25]) exhibit the same kind of dramatic change in both the hardness and the solvability property. Studying these properties by means of the phase transition provides information regarding the real hardness of the problem. This information is valuable because it can later be used as a systematic basis for the selection of efficient algorithms to solve a problem [29].

4.2.1 Phase Transition Study Procedure

Roughly, conducting a phase transition study is a four-step approach:

• Step 1: a parameter, which succinctly captures the problem structure, is selected. Identifying such a parameter is not a trivial task, as many problems do not exhibit natural parameters useful for the phase transition study.

• Step 2: a number of instances of the problem under study are either randomly generated, or collected from a real process. Randomly generated instances are iteratively synthesized in terms of the order parameter selected before.

• Step 3: an algorithm that solves the problem is selected. Then, the algorithm is applied to each instance of the set built in the second step, and the computational expense is measured.

• Step 4: lastly, the computational expense measure is plotted against the parameter so as to study the hardness of the problem. The resulting graph is superimposed with one that captures the solvability of the problem in terms of a probability, also against the parameter.

In what follows we shall describe our method to satisfy this procedure.

4.3 Step 1 - On the Identification of the Order Parameter

Selecting a parameter which succinctly captures the problem structure is not a trivial task, given the influence that it has on the phase transition. As pointed out by Cheeseman [13], using a different parameter yields different results on the phase transition. Cheeseman, amongst others, refers to this parameter as an order parameter.

While several well-known NP-complete problems exhibit natural order parameters (e.g. the average connectivity of a graph [13]), other approaches rely on experimental evidence to identify a suitable parameter. For example, in [27] the authors identify an order parameter by means of an annealed theory; later, this parameter helped in the identification of the phase transition in the context of Number Partitioning.

In our phase transition study, we have used the experimental approach to identify the order parameter, because dozens if not hundreds of natural candidates can be considered (e.g. the number of agents, objects or actions, amongst others), and even after exploring them all, the phase transition may go unnoticed. So, we have studied several features of SGC that may capture its structure so as to suggest a suitable order parameter.


4.3.1 Outline of Order Parameter Selection

Our approach to select an order parameter consists in the construction of a classifier which separates SGC windows, i.e. collections of objects, agents and actions, according to the classes EASY and HARD, which relate both the hardness and the solvability property. In other words, after identifying instances from our DNS sample as EASY and HARD, we transform them into characteristic vectors, on which we apply a classifier, hypothesizing that the decision function separating both classes suggests an order parameter.

4.3.1.1 EASY and HARD classes

We have related the solvability and the hardness property by computing a maximal group in a window, following the hypothesis that the larger the maximal group in a window, the more power will be required to find it. Then, the underlying characteristics of a window can be studied so as to identify complex instances. For example, an instance with a maximal group of size z = 20 and weight t = 20 may have characteristics different from another instance with a maximal group of size z = 4 and weight t = 5. Due to computational resource constraints (clearly, we cannot always enumerate all the groups), we have, for now, imposed a limit on the maximum time we allow for computing a maximal group. Hence, we consider a window to be EASY if it can be solved (i.e. a maximal is found) within a given time bound, 20 seconds in our case, and HARD otherwise.

4.3.2 Characteristics under Study

In order to construct the classifier, we have transformed each EASY and HARD window into a characteristic vector made out of the gram matrix characteristics (see Chapter 3). Complementarily, we have considered in the characteristic vector other characteristics that can be computed immediately and that provide information about the hardness of a window. We have considered the following characteristics:

• |agt(w)|: the number of active agents in w; we have selected this feature since it increases, in principle, the number of combinations that need to be attempted to solve a problem instance.

• |obj(w)|: the number of queried objects in w; likewise, it increases the number of combinations that need to be attempted to solve a problem instance.

• H_agt(w): the entropy of agents over w. Take the sequence of queries in w, and form a probability distribution function as follows. For each agent a_i ∈ agt(w), compute w_{a_i}, the sequence that results from deleting every query from w other than those issued by a_i. Then, the probability that agent a_i has issued a query in w, denoted Pr(q_i^w), is given by:

Pr(q_i^w) = length(w_{a_i}) / length(w)

With this, we define agent entropy as:

H_agt(w) = − Σ_{k=1}^{length(w)} Pr(q_k^w) · log_2(Pr(q_k^w))


A small value of agent entropy indicates a large variability of the relative agent activity, making it more likely to find a small maximal group compared to the possible maximum one. By contrast, if this feature value is large, then the variability of the relative agent activity is small, making it less likely to find a maximal group.

• H_obj(w): the entropy of objects over w, which is defined likewise, except that, instead of w_{a_i}, we use w_{o_j}, the sequence that results from deleting every query from w other than those that refer to object o_j. A small value of object entropy indicates a large variability of the relative object queries, making it more likely to find a small maximal group compared to the possible maximum one. By contrast, if this feature value is large, the reverse holds.

• The length of a query sequence which is as w, except that it contains no two or more queries relating the same agent and object. It is more likely to find a group when this feature value is bigger, since the variability of the actions (i.e. agents along with distinct queried objects) also increases. In contrast, when this value is small, the reverse holds.

• comb(w): an approximation of the total number of object combinations in w when looking for groups, given by the 3 highest values of the lower triangular Q × Q^T matrix. This value increments the number of combinations that need to be attempted to solve an instance.

• The social degree of w, δ(w), defined as the ratio between the number of objects and the average agent activity in w: δ(w) = |obj(w)| / q̄_w, where:

q̄_w = ( Σ_{k=1}^{|agt(w)|} length(q_{a_k}) ) / |agt(w)|

When δ(w) → 1 it is less likely to find groups; by contrast, when δ(w) → 0 the reverse holds. (A small computational sketch of these features is given after this list.)
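The following is a minimal sketch of the entropy and social degree features, assuming the window is given as a list of (agent, object) query pairs (helper names are ours):

from collections import Counter
from math import log2

def window_features(w):
    """w: list of (agent, object) pairs, one per query, in arrival order."""
    n = len(w)
    agent_counts = Counter(a for a, _ in w)     # length(w_{a_i}) for each agent
    object_counts = Counter(o for _, o in w)    # length(w_{o_j}) for each object

    def entropy(counts):
        # H = -sum Pr(x) log2 Pr(x), with Pr(x) = count(x) / length(w).
        return -sum((c / n) * log2(c / n) for c in counts.values())

    h_agt = entropy(agent_counts)                        # H_agt(w)
    h_obj = entropy(object_counts)                       # H_obj(w)
    mean_activity = n / len(agent_counts)                # average agent activity
    social_degree = len(object_counts) / mean_activity   # delta(w)
    dedup_length = len(set(w))                           # duplicate-free query sequence length
    return h_agt, h_obj, social_degree, dedup_length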

4.3.3 Construction of the Classifier

After having transformed every EASY and HARD window into a characteristic vector, we have applied C4.5 [52], as implemented in Weka [69]. We have selected the C4.5 classifier because it provides rules which are both easy to interpret and easy to construct. Indeed, the construction relies on simple, well-known information gain heuristics. Moreover, each rule can be easily interpreted because, after the features reach some value, the classifier, we expect, will classify the window as EASY or HARD, providing information of interest related to the complexity of a window. For example, if our classifier provides a rule like "a window is HARD if the number of objects exceeds 64", we can claim that the number of objects in a window has some influence on the complexity of the window, making it probably hard to solve, and a possible order parameter could be a function of this feature.

Our study considers the construction of a classifier for several windows with different sizes, namely 100, 150, 200 and 250, since we aim to observe whether and, if so, how the number of actions affects the features involved in the order parameter. Thus, for each window size we have generated the corresponding training, validation and test sets following the golden rule in data mining about the size of the sets, namely 40%, 30% and 30%, respectively.


The construction of our classifier is as follows. First, we arbitrarily pick windows from the training and test sets so as to obtain 10% of the total windows from both sets. Second, we apply C4.5 to the remaining instances, obtaining a classification tree, which is then tested on the selected 10% of the instances. Third, we repeat this procedure 10 times for cross validation. Finally, we test the best classification tree on the validation set and report on the classification performance. In this section, we detail the results obtained throughout our experimentation.
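The thesis relies on Weka's C4.5; purely as an illustrative stand-in (scikit-learn ships CART, not C4.5, so the resulting trees will differ), the construction could be sketched as follows, where features.npy and labels.npy are hypothetical files holding the characteristic vectors and the EASY/HARD labels:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X = np.load("features.npy")   # one characteristic vector per window
y = np.load("labels.npy")     # 1 = HARD, 0 = EASY

# Hold 10% out, fit a tree with information-gain (entropy) splits, repeat 10 times.
tree = DecisionTreeClassifier(criterion="entropy")
scores = cross_val_score(tree, X, y, cv=10)
print("mean accuracy over 10 folds:", scores.mean())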

4.3.4 Classification Trees and Classification Rates

After applying C4.5, the resulting rules from the different classification trees, one for each window size, present a similar structure, in the sense that they include features like the size of a maximal group, the length of a query sequence, the number of groups with t = 2, the number of objects and the number of agents. Hence, for the sake of simplicity, we only show, in Figure 4.1, the resulting rules for the set of windows with size 250.

Figure 4.1: Resulting rules after applying the C4.5 classifier to the learning set with window size 250. (The tree branches on the size of a maximal group, the length of a query sequence, comb, weight, the number of objects and the number of agents, with leaves labelled EASY or HARD.)

We have noticed a predominant rule, i.e. a rule classifying over 80% of the HARD class, that appears in all the classification trees, e.g. composed of size of a maximal greater than three and number of objects greater than 57. We then applied this predominant rule to all our HARD instances so as to see how well this single rule classifies the HARD class. From Figure 4.2 we can notice that this predominant rule correctly classifies from 97% to 100% of the instances for windows of size 100 to 250, respectively. This suggests that HARD windows are strongly correlated with two features: the size of a maximal and the number of objects.
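In code, this predominant rule reduces to a single predicate (the thresholds are those reported above for the window-size-250 tree):

def predominant_rule(maximal_size, num_objects):
    # A window is flagged HARD when the size of a maximal group exceeds 3
    # and the number of queried objects exceeds 57.
    return maximal_size > 3 and num_objects > 57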


Figure 4.2: Proportion of HARD windows classified correctly with the rule, for learning sets with window sizes 100, 150, 200 and 250 (y-axis: hard instances classified, from 97% to 100%).

Learning set window size   False Positive Rate   False Negative Rate
100                        0.770                 0.259
150                        5.609                 0.199
200                        13.370                0.197
250                        17.5                  0.259

Table 4.1: False Positive and False Negative rates from the resulting trees.

In general, the classification rates for each learning set (Table 4.1) show a low false negative rate for any window size. By contrast, the false positive rate grows as the window size increases. The low number of EASY cases in larger windows (e.g. 200 and 250) leads to these results: incrementing the window size reduces the number of EASY instances, since it is more common to find HARD instances in larger windows.

4.3.5 Evaluation of the Classifier Using ROC curves

We shall now show the results of applying our C4.5-based classifier to the validation set. To illustrate our classification tree, we shall use so-called ROC curves. A ROC curve is parametric, generated by varying a threshold from 0% to 100%, and computing both the false positive rate and the false negative rate at each operating point. The false positive rate (FPR) is the rate at which the classifier falsely regards an SGC instance as HARD, while the false negative rate (FNR) is that at which the classification tree falsely regards a HARD window as EASY. In general, there is a trade-off between these rates. The lower and further left a ROC curve is, the better the classifier.

Resulting ROC curves (Fig. 4.3) show a good performance for every learning set, since the curve


Figure 4.3: Resulting ROC curves reported from the classification tree, for learning sets with window sizes 100, 150, 200 and 250. Notice that both FPR and FNR range over the interval [0,1], and that, to better appreciate our results, we have plotted both axes using a 10^−x scale.

is lower and further left. Furthermore, the overall false negative and false positive rates are low for the learning sets with window sizes 100, 150 and 200. Notice that there is a decrease in the performance of the classifier for window size 250, which shows that, ironically, EASY instances become hard to classify.

4.3.5.1 Evaluation of the Classifier Using F-Measure curves

To better analyse the performance of the classifier, in terms of the relative effect of weighting false negatives versus false positives, we have appealed to the so-called F-measure [51]. Roughly, the F-measure evaluates the effectiveness of an information retrieval mechanism by analysing the relation between two quantities: recall and precision. Recall is the proportion of relevant material that is retrieved, and precision is the proportion of retrieved material that is relevant. In the context of classification systems, recall and precision are defined in terms of three performance rates: True


         Learning set window size
β        100      150      200      250
0.5      0.979    0.9511   0.840    0.779
1.5      0.9745   0.9604   0.9054   0.8615
4        0.9722   0.9651   0.9425   0.9103

Table 4.2: F_β measure for different values of β.

Positive Rate (TPR), False Positive Rate and False Negative Rate. The F-measure can be parametrized in the following way:

F_β = ( (1 + β^2) · TPR ) / ( (1 + β^2) · TPR + β^2 · FNR + FPR )

where β accounts for the relative importance that can be simultaneously given to both recall and precision. If β → 0, precision is more important than recall; if β → ∞, the reverse holds.
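A direct transcription of this parametrization (a sketch; all rates are fractions in [0, 1]):

def f_measure(tpr, fnr, fpr, beta):
    """F_beta from the True Positive, False Negative and False Positive rates."""
    b2 = beta ** 2
    return ((1 + b2) * tpr) / ((1 + b2) * tpr + b2 * fnr + fpr)

# beta -> 0 favours precision, large beta favours recall; beta = 1 gives the F1 measure.
print(f_measure(tpr=0.95, fnr=0.05, fpr=0.02, beta=1.0))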

While the ROC curves give a good idea of the general performance of the classification of instances, the F-measure provides an indication of the effectiveness of the classification when false negatives and true positives are weighted. Table 4.2 shows the F-measure for different values of β for every learning set.

To better illustrate the effect of β, we show the F-measure in Fig. 4.4 for different values of β. Briefly, if we prefer recall over precision, i.e. β → ∞, the performances of the classifier are all similar. By contrast, when we prefer precision over recall, i.e. β → 0, the classifier performance is not affected by the value of β for the learning sets 100 and 150. These results reinforce the conclusions that we have drawn from the ROC curves previously. Notice that β = 1 is a significant inflection point, usually known as the F_1 measure.

Figure 4.4: F_β measure for the learning sets with window sizes 100, 150, 200 and 250. A major inflection point is at β = 1.


4.3.6 Discussion

We have shown that the classifier is able to recognize EASY and HARD windows with great performance. These observations were supported by the corresponding ROC and F-β curves.

The results from our classifier show that, when classifying windows as HARD or EASY, the features that capture the complexity of SGC are the size of a maximal group and the number of objects.

We have therefore selected as our order parameter the size of a maximal group in a window w, zmax, divided by the number of objects in w, |obj(w)|; in symbols, zmax/|obj(w)|. The rationale behind this order parameter is twofold: first, the features were suggested by our C4.5 classifier; and second, in principle, when zmax/|obj(w)| → 1 an instance will be harder. This is, in part, because any efficient algorithm attempting to solve it will explore a large search space, as all the objects in the window are involved in the maximal group. By contrast, we expect that a small value yields an easier instance.

■ With this, we have completed the first step in our phase transition study.

4.4 Step 2 - Instances under Consideration

In order to conduct a phase transition study, a number of instances of the problem under study are either randomly generated or collected from a real process. The phase transition study, indeed, requires considering hundreds if not thousands of instances, so that the phase transition phenomenon, if present, is easier to notice. Moreover, one usually is not interested in studying a few given problems, but rather a variety of problems likely to be encountered in the future [29]. In such cases, typically there is little data available to conduct the study, so instances of the problem are synthesized in terms of the order parameter. Alternatively, if enough data is available, we can collect instances from a real-world process so as to study the problem; this approach yields similar results, as shown in [27] in the context of Number Partitioning.

In this thesis work, we have collected a set of SGC instances by randomly sampling real traffic to a recursive DNS server. The rationale behind this decision is that the rate at which our DNS traffic is generated (2.4 million DNS queries per day) makes it infeasible to study in full, unless we use a distributed computing approach; an implementation of this type is out of the scope of this thesis work1. By contrast, working with a sample of the DNS traffic is more manageable, because we restrain ourselves to a portion of the entire traffic, with similar results [22].

Our procedure is as follows. First, we arbitrarily picked five DNS logs, each of which is a sequence of queries from agents to URLs. Then, from each log, we randomly sampled a subsequence of a given length, corresponding to an SGC instance with an associated window size. This was repeated as many times as needed to attain subsequences that amount to 40% of every log. Collecting together every SGC instance, we have

1 However, some progress on this implementation has been made in Adrian Avila's work [61].


formed a set which was used to construct the characteristic vectors, one for each sampled window. We shall refer to these transformed instances as a learning set. Next, the learning set is split 70%/30% into the so-called training and test sets. The remaining log bits were also transformed into characteristic vectors, but kept in a separate set for validation purposes; hence the name validation set. Lastly, we have restrained our study to windows of size 50 to 150 in steps of 25 queried sequences; however, our phase transition study can be readily applied to other window sizes.
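A sketch of this sampling procedure, assuming each log is held as a list of (agent, URL) queries (function and parameter names are ours):

import random

def sample_windows(log, window_size, fraction=0.40, seed=0):
    """Randomly sample contiguous subsequences of a DNS log as SGC windows."""
    rng = random.Random(seed)
    windows, covered = [], 0
    while covered < fraction * len(log):
        start = rng.randrange(len(log) - window_size + 1)
        windows.append(log[start:start + window_size])
        covered += window_size
    return windows

def split_learning_set(windows, train_fraction=0.70):
    """70%/30% split of the sampled windows into training and test sets."""
    cut = int(train_fraction * len(windows))
    return windows[:cut], windows[cut:]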

4.5 Step 3 - Algorithm to Solve SGC and Measure of Computational Expense

In this section we present the algorithm we used to solve an SGC instance in our study of the phase transition for SGC. Roughly, our algorithm takes as input a window that contains actions, agents and objects, and returns a group with size z and weight t, if any, along with the corresponding computational expense involved.

4.5.1 Algorithm Applied to the SGC Instances

In any attempt at solving an SGC instance we have applied a backtracking-based approach. Roughly, our algorithm explores all the possible combinations that may form a group, but also ignores those combinations that will not form a bigger group; hence the use of backtracking.

Example. Consider the following SGC instance:

1. obj = {o_1, o_2, o_3, o_4, o_5};
2. agt = {a_1, a_2};
3. q_1 = ⟨o_1, o_2, o_2, o_4, o_3⟩, q_2 = ⟨o_1, o_2, o_4, o_5, o_2⟩, and Qy = {q_1, q_2};
4. w = ⟨(a_1, o_1), (a_1, o_2), (a_1, o_2), (a_1, o_4), (a_1, o_3), (a_2, o_1), (a_2, o_2), (a_2, o_4), (a_2, o_5), (a_2, o_2)⟩;

Notice that the combination (o_2, o_3) of objects in the tuple {(a_1, a_2), (o_2, o_3)} does not form a group; clearly, nor will any larger combination containing it, e.g. (o_2, o_3, o_4). Hence, given that (o_2, o_3) cannot form a bigger group, it is pushed into a blacklist, implemented in our algorithm, which collects all the combinations that should be ignored when searching for bigger groups.

Our algorithm takes as input a window with its corresponding agents, objects and actions, and the size z and weight t of the group being looked for. An instance is considered solved when our algorithm returns a witness, a tuple g_w(A, O) with size z = |A| and weight t = |O|. The implementation of our algorithm is as follows, where the function COMMONAGENTSQRYING(C) returns the set of common agents in agt(w) querying for all the objects of C.


Algorithm 1 Backtracking-based approach to solve an SGC instance.
Input: A window w that contains actions from agents to objects, and the size z and weight t of the group being looked for.
Output: A group with size z and maximal weight t, with the corresponding witness.

1: r ← 2
2: blacklist ← null
3: while z ≥ r do
4:   C ← the r-combinations of obj(w)
5:   for all C in C do
6:     // C is a collection of objects.
7:     if not C in blacklist then
8:       agt ← COMMONAGENTSQRYING(C)
9:       // agt is a collection of agents.
10:      if |agt| ≥ 2 and z = r then
11:        return g(agt, C)
12:        // Return a group with objects in C and agents in agt.
13:      else
14:        if |agt| < 2 then
15:          blacklist ← blacklist + C
16:        end if
17:      end if
18:    end if
19:  end for
20:  r ← r + 1
21: end while
22: return noSolution

Notice in lines 10-11 of the algorithm that the implementation returns a group with the corresponding objects and agents; if no group is found, the algorithm returns a noSolution flag.2

2 An implementation in Perl of this algorithm is available at https://db.tt/vDG5vXh5.


Example. As an example of our algorithm solving an instance of SGC, consider that we want to find a group with size z = 2 and weight t = 2 in the following SGC instance:

1. obj = {o_1, o_2, o_3};
2. agt = {a_1, a_2};
3. q_1 = ⟨o_1, o_2, o_3⟩, q_2 = ⟨o_1, o_2⟩, and Qy = {q_1, q_2};
4. w = ⟨(a_1, o_1), (a_1, o_2), (a_1, o_2), (a_1, o_3), (a_2, o_1), (a_2, o_2), (a_2, o_2)⟩;

First, we give the window w to the algorithm. Second, we provide the input z = 2 and t = 2, the size and weight of the group we are looking for, respectively. Third, we compute the combinations, resulting in C = {(o_1, o_2), (o_1, o_3), (o_2, o_3)}. Fourth, we look for common agents querying for a particular C ∈ C, C = (o_1, o_2) in this example. A list of common agents is generated: ⟨(a_1, a_2)⟩. Fifth, this list of agents, along with the combination C, is considered a group. Sixth, given that r = z, the algorithm outputs the group g({a_1, a_2}, {o_1, o_2}).
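For concreteness, the following Python sketch mirrors Algorithm 1 on this instance (it is our transcription, not the thesis's Perl implementation; the blacklist test is implemented as "contains a blacklisted sub-combination", following the prose above):

from itertools import combinations

def solve_sgc(w, z, t):
    """Backtracking search for a group in window w, a list of (agent, object) pairs.

    As in Algorithm 1, the search is driven by z and the combination size r;
    t is the weight of the group being looked for.
    """
    objects = sorted({o for _, o in w})
    agents_of = {o: {a for a, q in w if q == o} for o in objects}
    blacklist = []                                      # dead object combinations
    r = 2
    while z >= r:                                       # line 3
        for C in combinations(objects, r):              # line 4
            if any(set(b) <= set(C) for b in blacklist):
                continue                                # line 7: skip blacklisted combinations
            common = set.intersection(*(agents_of[o] for o in C))   # line 8
            if len(common) >= 2 and z == r:             # lines 10-11
                return common, C                        # witness: agents and objects
            if len(common) < 2:                         # lines 14-15
                blacklist.append(C)
        r += 1                                          # line 20
    return None                                         # line 22: noSolution

w = [("a1", "o1"), ("a1", "o2"), ("a1", "o2"), ("a1", "o3"),
     ("a2", "o1"), ("a2", "o2"), ("a2", "o2")]
print(solve_sgc(w, z=2, t=2))   # ({'a1', 'a2'}, ('o1', 'o2')), as in the example above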

4.5.2 Measure of Computational Expense

A measure of computational expense is selected because the phase transition study aims to identify patterns of complexity in the difficulty of solving a hard problem. Naturally, determining whether an instance of a problem is difficult should be reported in terms of a measure of computational expense. Several works (e.g. [26, 24, 28]) use the number of explored combinations as a way to measure the computational expense, while it is also valid to use the so-called time-to-solve of an instance as a measure [53].

In this phase transition study we have considered the number of explored combinations as our measure of computational expense. The rationale behind this decision is that it allows us to generate an unbiased study, given that the hardware on which the experiment is conducted does not influence this measure. Indeed, each explored combination (see, for example, lines 10-11 of the algorithm) is counted as computational expense until the algorithm stops.

Lastly, we have solved thousands of SGC instances using several computers, in an attempt to explore more instances in a short period of time3.

■ With this, we have finished the third step of the phase transition procedure.

3 Our experimentation in this chapter lasted about 6 months of continuous calculation, using two computers: a Core i7 2 GHz machine with 4 GB of RAM, and a mainframe with two Xeon 3 GHz processors and 8 GB of RAM.


4.6 Step 4 - The Phase Transition of the SGC Problem

In this section, we report the results obtained through the phase transition plots of our study. The phase transition study was conducted on both the decision and the optimality version of SGC, using as order parameters the size of a maximal group divided by the number of objects, and the size of a maximal group, respectively4.

4.6.1 Phase Transition of the SGC Decision Problem

First, we present the results of applying the four-step approach of our phase transition study to the decision version of SGC, defined as shown below:

INSTANCE: A window w ∈ W, a finite set obj = {o_1, o_2, o_3, ..., o_n} ⊂ O of objects, a finite set agt = {a_1, a_2, a_3, ..., a_m} ⊂ A of agents, a finite set Qy = {q_1, q_2, q_3, ..., q_m} of agent covers, one for each agent, a positive integer z > 0 and a positive integer t > 0.

QUESTION: Is there a group with size lesser than or equal to z and weight t?

Identifying the phase transition in the decision version of SGC aims to determine whether the hardness and solvability properties exhibit patterns, as they do in many well-known NP-complete problems.

In order to observe such patterns, the computational expense, the number of explored combinations in our case, is plotted against the order parameter, zmax/|obj(w)|. Typically, this graph shows a region of hard instances, where the computational expense is large with respect to the easy instances, which are considered manageable. Also, this graph can exhibit a dramatic change in the hardness of solving the problem, going from an easy region to a hard region, the so-called easy-hard pattern. Interestingly, there are examples of NP-complete problems (e.g. [26] in the context of SAT) showing patterns like an easy-hard-easy pattern, and much more complex patterns like an easy-hard-easy-hard pattern, as shown in works like [53].

Also, it is possible to study the solvability property of a problem by plotting the probability of finding a solution (e.g., for the SAT problem, a set of values satisfying a boolean formula) against the order parameter. Here, if the instances are synthesized, the probability of an instance being solvable is given by the function that generates the random instances; if the instances were taken from a real process, it can be determined as the frequency of solvable instances divided by the total number of instances at a given value of the order parameter. After plotting this graph, it is possible to notice changes in the solvability property where, for example, a problem may go from a region with solution to a region where instances cannot be solved, after a dramatic change, namely the phase transition.

Both the plot of hardness and the plot of solvability are usually superimposed, so as to better illustrate the results for these properties.

4 I am indebted to Eduardo Aguirre from NetSec Tecnologico de Monterrey (CEM) for providing me with CPU cycles and for developing some scripts required for plotting the phase transition we found.


4.6.1.1 Experimental Setting

In this experimentation we have used SGC instances with different window sizes, from 50 to 150 in steps of 25, as presented in section 4.4. Then, as described in step 3, we have solved 1000 SGC instances using our algorithm, as depicted in section 4.5.1, and reported the computational cost.

We have, for now, restrained our study to group sizes z from 2 to 6 with weight t = 2. For the sake of simplicity, we shall refer to a group with size z and weight t simply as a group of size z.

Lastly, our results ignored the size of the window as a factor influencing the hardness and solvability of SGC, and thus our results were split according to the group size, for any window. Later in this section, we will show that this decision does not affect the study of the hardness and solvability of SGC, as we present results according to the window size for the optimality version, with similar results.

Attention is now turned to the resulting graphs.

4.6.1.2 Results

We first report the experiments on the SGC problem considering size z = 2. Figure 4.5 shows, with a solid line, the mean number of explored combinations needed to find a group of size z = 2 over 1000 SGC instances, while the dashed line indicates the probability of finding a solution for SGC with z = 2. Notice that zmax/|obj(w)| ranges from 0 to 1, but, to better illustrate the results, the order parameter was plotted on a logarithmic scale; in symbols, ẑ = log_2(zmax/|obj(w)|).
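The superimposed plot of Step 4 can then be sketched as a binning of the solved instances by ẑ (our own helper; each record is assumed to hold zmax, |obj(w)|, the explored combinations, and a solved flag):

import math
from collections import defaultdict

def phase_transition_curve(records, bin_width=0.5):
    """records: iterable of (zmax, num_objects, explored_combinations, solved)."""
    bins = defaultdict(list)
    for zmax, n_obj, cost, solved in records:
        if zmax == 0:
            continue                       # no group at all; log2 undefined
        z_hat = math.log2(zmax / n_obj)    # order parameter on a log scale
        bins[round(z_hat / bin_width) * bin_width].append((cost, solved))
    curve = []
    for z_hat in sorted(bins):
        pts = bins[z_hat]
        mean_cost = sum(c for c, _ in pts) / len(pts)         # hardness
        p_solvable = sum(1 for _, s in pts if s) / len(pts)   # solvability
        curve.append((z_hat, mean_cost, p_solvable))
    return curve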


The mean cost shows a hard-easy pattern: the easy region, i.e. the region with the lowest computational cost, appears as the number of objects and the size of a maximal approach the same value. Notice that if zmax > 0 then there must be a group of size z = 2; hence, the probability of finding a group of size z = 2 is 100%. Also, notice that the peak mean cost is 6.25 combinations, at ẑ = −6. Note also an inflection point at ẑ = −3, with an associated cost of 4.2 combinations, where the computational cost decreases drastically. There is another inflection point at −1.5 before the problem becomes manageable. This plot shows that the hardest instances occur when ẑ is between −6 and −5.5.

Figure 4.5: Mean cost of finding a solution with t = 2 and z = 2 for window sizes 50 to 200 in steps of 25 (solid line: mean number of explored combinations; dashed line: probability of finding a solution).


We now report our experimentation considering size z = 3 (Figure 4.6). The mean cost shows an easy-hard-easy pattern; moreover, the hard region is associated with solvable instances, i.e. the region where the probability of finding a solution is more than 50%. The maximum mean cost, of 202 combinations, is at ẑ = −1.5, which is a major inflection point in the graph. At ẑ = −4, there is an inflection point where the probability of finding a solution exceeds 80%. Lastly, instances without a solution occur when ẑ goes from −6 to −5.

Figure 4.6: Mean cost of finding a solution with t = 2 and z = 3 for window sizes 50 to 200 in steps of 25.


The results of our experimentation considering z = 4 are shown in Figure 4.7. The mean cost shows a clear easy-hard-easy-hard-easy pattern. The hardest instances are in the region with solution, but notice that, before reaching the maximum mean cost of 1300 combinations at ẑ = −1.5, there are several instances with mean cost between 100 and 800 combinations, in the region of ẑ from −3 to −2. Also, inflection points similar to those seen before for z = 3 are present at the order parameter values −1.5 and −1, but with an increase in the mean cost. Instances after ẑ = −1 are easier than expected.

Figure 4.7: Mean cost of finding a solution with t = 2 and z = 4 for window sizes 50 to 200 in steps of 25.


Notice that, for z = 5, Figure 4.8 shows similar inflection points in ẑ at −1.5 and −1, with mean costs of 6000 and 3120 combinations, respectively. Particularly, at ẑ = −3 the hard region is associated with the solvable region. Notice the easy-hard-easy-hard-easy pattern in the mean cost of solving SGC, as seen before.

Figure 4.8: Mean cost of finding a solution with t = 2 and z = 5 for window sizes 50 to 200 in steps of 25.


Our experimentation for z = 6, in Figure 4.9, shows an easy-hard-easy pattern in the mean cost of solving SGC. As seen before, the hard region is associated with the region with solution. Notice that, for ẑ between −6 and −4, instances are easier, with respect to the hardest instances, and have no solution. This graph shows the same inflection point at −1.5 for the order parameter, with cost 24000. Note also that after ẑ = −1.5 the computational cost decreases before starting to increase again. Instances with an order parameter value greater than −3.5 are solvable.

Figure 4.9: Mean cost of finding a solution with t = 2 and z = 6 for window sizes 50 to 200 in steps of 25.

We have also studied the search cost considering the percentiles 25%, 90% and 50%, the latter being the so-called median; the rationale behind studying these measures is that the mean cost can obscure the search cost of finding a group, hiding information that could be of interest, e.g. the cost of the hardest instances. Indeed, the 25% percentile gives insight into the easiest instances of SGC, while the 90% percentile provides information about the hardest instances. Lastly, the median, the 50% percentile, provides a middle point that separates easy and hard instances. In order to better illustrate the hardness of SGC, we have restrained these results to finding a group with size from 4 to 6.


In Figure 4.10 we can notice the same easy-hard-easy pattern and a significant increment in the hardest instances of SGC for z = 4. The vertical line indicates the point at which the probability of finding a solution is 50%; after this point instances are solvable, while before reaching it instances are more likely to be unsolvable. Notice that the hardest instances happen at the same point indicated by the mean cost, ẑ = −1.5, with a cost of 1400. Also, notice that even at the 25% percentile the hard instances, at ẑ = −1.5, have a cost of more than 1000 combinations. The median cost of solving SGC for z = 4 is approximately 1250, three times more than the results found considering only the mean. Notice that after the inflection point ẑ = −1.5 there is a decrease in the computational cost, until we reach a cost of 562 combinations at ẑ ≈ 0. These results suggest that finding a group with size approximately equal to the number of objects in a window is easier than expected. An explanation for this behaviour could be that there is only one large group in the window; the algorithm then easily discards the combinations that do not belong to this single large group.

Figure 4.10: Percentile 90%, 25% and median cost of finding a group with size z = 4 and weight t = 2 for window sizes 50 to 200 in steps of 25.


Our experimentation for z = 5, in Figure 4.11, shows the same easy-hard-easy pattern. Notice the inflection point at ẑ = −2 before a sudden increase in the computational cost. At ẑ = −1.5 we have the hardest instances, indicated by the 90% percentile with a cost of 7000 combinations. After ẑ = −1.5 the computational cost decreases until it reaches 2000 combinations at ẑ ≈ 0. Notice that the 25% percentile and the median show a similar behaviour, both in their inflection points and in the computational cost.

Figure 4.11: Percentile 90%, 25% and median cost of finding a group with size z = 5 and weight t = 2 for window sizes 50 to 200 in steps of 25.


The resulting plot for SGC with z = 6 shows, not surprisingly, the same easy-hard-easy pattern in its cost, with the same significant inflection points at ẑ = −2 and ẑ = −1.5. The hardest instances have a cost of 28000 combinations, considering the 90% percentile at ẑ = −1.5. However, notice that near ẑ = −1.5 there is a sudden decrease in the computational cost; the cost then increases and drastically decreases again, before reaching 5000 combinations at ẑ ≈ 0. Lastly, notice that, considering the 25% percentile, the hardest instances have a cost of 18000 explored combinations.

Figure 4.12: Percentile 90%, 25% and median cost of finding a group with size z = 6 and weight t = 2 for window sizes 50 to 200 in steps of 25.

4.6.1.3 Discussion

For a small z, namely z = 2 and z = 3, the computational cost of solving the hardest instances of SGC can be considered manageable. Moreover, such hard instances are solvable. By contrast, the easy instances have a low cost and are unsolvable. For z from 4 to 6, there is also an inflection point at the maximum mean cost, when ẑ is near −1.5. Particularly, ẑ = −1.5 corresponds to zmax/|obj(w)| ≈ 0.375, meaning that the hardest instances follow a 3-to-8 relation between the size of a maximal group and the number of objects in the window; in symbols, zmax/|obj(w)| = 3/8.


Considering the median and the 25% and 90% percentiles, we have noticed an (expected) increase in the computational cost of solving SGC with z from 4 to 6. Indeed, we have evidenced that the three plots exhibit the same easy-hard-easy pattern, and that they have the same inflection points, but with different computational costs.

Notice, also, that the hardest instances lie in the region with solution, and that these regions usually start with an inflection point.

As another observation, some well-known NP-complete problems exhibit their hardest instances in the region where there is a 50% probability of having a solution. By contrast, we showed that the hard instances of SGC are in the region with solution. This result is relevant when selecting an algorithm to efficiently solve SGC, given that hard instances are more likely to have a solution when ẑ → 0, while easy instances in several cases are more likely to be unsolvable when ẑ → −7.

After the median peak cost at ẑ = −1.5 there is a decrease in the computational cost of finding a group. The explanation is that, after this inflection point, the size of a maximal group approaches the number of objects in the window, and our algorithm easily discards the combinations that will not be used. This result shows that a group of size z = |w| is uncommon and easy to solve.

We have found that SGC follows complexity patterns similar to those of other NP-complete problems, where the problem goes from unsolvable to solvable instances with an associated computational cost. Moreover, given an instance of SGC, we can determine how hard it will be and whether it will have a solution. Roughly, an instance with ẑ between −4.5 and −1.3 is really hard but solvable when looking for small groups, while ẑ = −1 indicates the hardest instances for larger groups.


4.6.2 Phase Transition on the Optimality Version of SGC

Results from our previous experimentation on the decision version of SGC showed similar inflection points, in the same hard regions, for every tested size z. So, in order to determine whether this pattern is repeated for different group weights and sizes, we have investigated the optimality version of SGC, where a group of maximal size and weight has to be found.

4.6.2.1 Experimental Setting

In this experimentation we have considered the same collection of SGC instances as described in Step 2 of our phase transition procedure (see section 4.4) and used in the phase transition study of the decision version of SGC, which considers instances with window sizes from 50 to 150 in steps of 25.

Following an approach similar to that in [27], we have considered using the size of a maximal group as the order parameter of this study. The selection of this parameter is also supported by our results from the C4.5 classifier (see section 4.3), which suggest it as a good candidate; we shall refer to this parameter as zmax.

Next, we split the instances according to the window size in turn, and plotted the mean cost of finding a group with size z = zmax + d, where d ∈ {−zmax + 2, −zmax + 3, ..., 0} is defined as the distance from the optimal solution, i.e. from the point where a maximal is found, in a given instance of SGC. At d = 0, namely z = zmax, a maximal is found, while d = −zmax + 2, namely z = 2, means that we have found a group of size 2 which could be part of a maximal. Notice that we have ranged over these values because, with the gram matrix Q^w × (Q^w)^T, we have certainty about the size of a maximal, and thus that there is no bigger group in w.

Last, we modify our algorithm (see. section 4.5.1) so as to report on the computational cost,namely number of explored combinations, of finding a group with size z = zmax� d

4.6.2.2 Results

Figure 4.13, for windows of size 75 and 100, and Figure 4.14, for windows of size 125 and 150, show two types of behaviour. In the region with solution, there is an exponential growth in cost as the phase transition, i.e. the boundary between solvable and unsolvable instances indicated by the dashed line, is approached, because we are getting close to a group of maximal size. In the region without a solution, when d → α the computational cost goes to zero because, according to the gram matrix, there are no groups bigger than a maximal.

Note that the mean cost of finding a maximal group is 1397, 17874, 31293 and 51604 for the windows of size 75, 100, 125 and 150, respectively.

4.6.2.3 Discussion

Notice from the graphs that windows with sizes from 75 to 150 have an inflection point at d = −20, showing that there is a maximal group of size 20 or more. This suggests that, even in smaller windows, there are large groups in terms of size. Also, after d = −20 the computational cost of finding a bigger group increases, until it reaches d = −10.


Figure 4.13: Mean cost of finding a solution for the optimality version of SGC, for windows of size 75 and 100 (two panels; x-axis: d, y-axis: mean number of explored combinations).


Figure 4.14: Mean cost of finding a solution for the optimality version of SGC, for windows of size 125 and 150 (two panels; x-axis: d, y-axis: mean number of explored combinations).


Indeed, at d = −10 there is a drastic change in the computational cost before reaching the size of the maximal group at d = 0. This suggests that when d < −10 half of the search space can be explored in half of the time needed to explore all the combinations, while exceeding d = −10 means that it will be expensive to solve the optimality version of SGC.

4.7 Conclusions

We have provided evidence that SGC behaves similarly to other well-known NP-complete problems in the sense that there are a number of problem instances for which it is manageable to determine a solution. Moreover, SGC exhibits a critical value at which the solvability and the hardness of the problem change drastically, the so-called phase transition.

Our results showed that before we reach the phase transition, at the order parameter zmax/|obj(w)| = 0.0625, i.e. b_z = −4, there is a region of SGC for which it is possible to compute, with bounded resources, a solution which is negative, i.e. no group of size z is found. By contrast, when zmax/|obj(w)| exceeds 0.0625, i.e. −4 < b_z < 0, there is a solution for SGC, but with an increase in the computational cost.

These results provide a fair baseline of comparison for any algorithm attempting to solve SGC, since we can study the performance of an algorithm in terms of the computational cost of solving hard instances of SGC. As another application, our results can be incorporated into an algorithm which considers that it is manageable to explore half of the search space when looking for a maximal group. Then, given that the algorithm can find half of the solution, it may attempt to estimate the other half so as to determine the maximal group. This approach would reduce the error between a complete and an estimated solution.

Any method, e.g. heuristic-based, aiming to compute the social structure will improve its performance if it takes our results into account.


5. On the Detection of Anomalies Using Social Structure Characteristics

While the phase transition study gives certainty about the windows for which we can compute the complete social structure with bounded resources, a classifier is necessary if we are to detect anomalies in the social structure. In this chapter we present a novel approach to detect anomalies in the social structures arising in a DNS server.

In order to build our anomaly detector, we have constructed a characteristic vector made out of two types of social structure features: Gram matrix features, discussed in Chapter 4, and estimated features, which are discussed in this chapter.

With this characteristic vector we have constructed a one-class classifier. This classifier aims to learn patterns from ordinary conditions of the DNS traffic so as to identify, implicitly, everything that is not ordinary.

We shall show the robustness of our approach by considering four experiments: using a synthetic anomaly, a real anomaly, and two abnormal and independent DNS events.

In general, compared with other approaches, ours has a better detection rate and a low number of false alarms, and it has been evaluated with a systematic methodology.

This chapter concludes the collection of evidence aiming to validate our hypothesis, namely that changes in the DNS social structure contribute to the detection of anomalies on a DNS server.


5.1 Introduction

When dealing with problems where there is plenty of data about one class and little data, if any, from the other, a typical approach is to train a classifier for the majority class, namely to construct a one-class classifier. As pointed out in [17], this approach is useful to reduce class-imbalance related problems, and also for novelty detection as shown in [59], since we only need to identify a normal class, train a classifier to recognise it, and everything that is not predicted as normal is considered abnormal, under the assumption of perfect classification.

In this thesis work, we have developed a classifier which considers, as the majority class, ordinary conditions of DNS traffic.

The decision to use this approach is because collecting evidence of anomalies in a real-world process, like the DNS server under study, is much more difficult than gathering ordinary DNS traffic. This can be attributed to several factors. First, poor management when facing an attack on the DNS server, because anomalous traffic is not properly identified after the attack. Second, the impossibility of conducting a massive attack on a real-world DNS server, because it implies disrupting the service for hundreds if not thousands of users. Last, the dozens of ways to conduct an attack. For example, an attacker may use a dictionary of thousands of domains to ask for translation while pretending to be dozens of IP addresses. As another example, the attacker could use a single domain and pretend to be thousands of IP addresses, and so on. Because we do not have enough data about all the types of anomalies, and given the impossibility of conducting an attack, we have designed a classifier which considers only a single class, namely the normal class.

Chapter overview: In the following section, we shall present the characteristics aiming to capture the ordinary condition of the DNS server while being sensitive to changes under abnormal conditions. Next, in Section 5.4 we shall show the type of classifier considered in our approach. Section 5.6 presents an overview of our experimental setting and how we are going to test the robustness of our classifier. Then, the results of applying the classifier to the experiments are shown in Section 5.7. In Section 5.8 we present a novel alarm filtering approach intended to reduce the number of alarms by ignoring isolated windows. Our results will be contrasted with other works in terms of the characteristics that an Intrusion Detection System (IDS) on a DNS server should have (Section 5.9). Lastly, in Section 5.10 we give indications about future work and conclusions of the chapter.


5.2 Characteristics under Consideration from DNS Traffic

Our thesis work considers that it suffices to detect changes in the social structure so as to indicate an anomaly in the DNS traffic. Unfortunately, computing the complete social structure from a window is infeasible using polynomial resources, as reported theoretically in Chapter 3 and experimentally in Chapter 4. So, we need to estimate the social structure in the window by means of a heuristic-based approach.

In what follows we will describe our approach to the estimation of the social structure. Given that we want to estimate the social structure from a window, several clustering approaches quickly come to mind. However, formulating the problem such that a clustering algorithm extracts social structures of interest is difficult, in part because the interpretation is challenging. For example, if a k-means algorithm is applied, we should ask: what does a cluster mean? And is it really estimating the social structure?

5.2.1 Hybrid Segmentation Method

We have implemented an image segmentation method by considering the matrix Qw as an image of |agt(w)| × |obj(w)| pixels. In this matrix, a cell (ai, oj) is set to 1 if agent ai queried for object oj at least once; otherwise, the cell is set to 0. Then, our implementation, based on the Hybrid Segmentation Method [47], is applied to Qw. Basically, the idea is to form groups of pixels with commonalities; then, each group of pixels can be viewed as a social structure with given characteristics, i.e. the weight and size of the group.

The image segmentation approach we have used is quadratic in the height and width of the image. Moreover, it is a suitable approach to identify social structures of interest because, given the way it merges groups of pixels, it can be used to estimate larger social structures, with respect to their size and weight.

Our implementation of this approach is as follows:

1. Sort the matrix Qw according to activity. Since we consider binary activity, a row whose agent queries for most of the objects is considered more active than a row whose agent queries for few objects; likewise, a column whose object is queried by most of the agents is considered more active, and vice versa. This yields a matrix with most of the activity in the upper left corner (Fig. 5.1). The aim of this step is to improve the performance of the merge region method, described in step five, which groups pixels from an image.

2. Map the sorted adjacency matrix Qw into an image of |agt(w)| × |obj(w)| dimensions, such that a black color is given to the pixel p_{i,j} if IP ai queries for a URL dj; otherwise, a white color is given to the pixel (Fig. 5.2).

3. Split the image into 2 × 2 square cells (Fig. 5.3). Splitting the image corresponds to the original procedure [47]; however, we have used a 2 × 2 cell because it gives more information about groups of interest, since the most elemental group should be composed of a cell of this size. In general, using a large square cell means that we consider a large portion of the image as a region, while a small square cell implies that there are more regions in the image. Roughly,


Figure 5.1: Step 1. Example of applying the sorting method on a Qw matrix. IP addresses are denoted by the symbol ai, while objects are denoted by the symbol dj.

Figure 5.2: Step 2. Example of mapping a sorted Qw matrix into an image.

the importance of determining more regions is that it provides more information about social structures of interest, as we will clarify in the next step.


Figure 5.3: Step 3. Example of splitting the image into 2 × 2 square cells. The dashed line represents the limit of the square cell. Notice that there are some empty square cells.

4. Assign a label to square cells of interest (Fig. 5.4). The pixels of a square cell are labelled with a letter L if the cell has at least three black pixels. Notice that a square cell has 4 pixels, and thus if 75% of the square cell is black, all its pixels are labelled with a letter. Tightening this rule, namely asking for 100% black pixels, would reduce the number of estimated social structures; by contrast, loosening the rule would increase the number of social structures but with greater error, because it would consider more white pixels as part of a social structure.

Figure 5.4: Step 4. Example of labelling 2 × 2 square cells. To better illustrate the result, each label is represented as a circle with the letter L.


5. Merge together adjacent square cells that carry a letter (Fig. 5.5). This is done using an adjacency graph (4-adjacency); this step is named the merge region method.

Figure 5.5: Step 5. Example of merging 2 × 2 labelled square cells. To better illustrate the result of the method, we have labelled each region (R1, R2, R3).

The hybrid segmentation method ends up with a set of disjoint regions, each of which, according to the method, belongs to some object of a scene. In our implementation, however, we end up with a set of disjoint regions which is an estimation of the social structure in the window, because we consider each region as a social structure of interest. This observation is sound because, by definition, a social group is a set of IPs with common URLs; notice that each labelled square cell is, clearly, a group of size two and weight two.

However, the method merges all square cells in an attempt to find regions with commonalities. In our implementation, merging square cells allows us to identify larger groups, clearly, because not all groups are of size and weight two. The major drawback of merging the square cells is that the structure of smaller groups is somewhat hidden in the large regions, and thus we need a method to identify them.
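To make steps 3 to 5 concrete, the following is a minimal Python sketch of the cell labelling and region merging. It is our own rendering, not the thesis implementation, and it assumes Qw is available as a binary NumPy array.

    import numpy as np
    from scipy import ndimage  # connected components with 4-adjacency

    def label_and_merge(Q):
        # Q: binary matrix (1 = black pixel, i.e. the agent queried the object)
        h, w = Q.shape
        mask = np.zeros_like(Q, dtype=bool)
        # Steps 3-4: scan disjoint 2x2 cells; label those with >= 3 black pixels.
        for i in range(0, h - 1, 2):
            for j in range(0, w - 1, 2):
                if Q[i:i + 2, j:j + 2].sum() >= 3:
                    mask[i:i + 2, j:j + 2] = True
        # Step 5 (merge region method): join labelled cells via 4-adjacency.
        four_adjacency = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]])
        regions, n = ndimage.label(mask, structure=four_adjacency)
        return regions  # 0 = background; 1..n = estimated social structures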

We have developed a split method so as to identify the size and weight of the smaller estimated social structures within the regions of the image found by the hybrid segmentation method. Each region is split following the algorithm below:


Algorithm 2 Splitting method to enumerate the groups, with their size and weight, given a region.
Input: A region R that contains a list of black pixels.
Output: A collection of groups with their corresponding size and weight.

1: next ← 0
2: quit ← 0
3: x, y ← FINDLOWESTXY(R)
4: pixelsInGroup ← p_{x,y}
5: pixelsInGroup′ ← null
6: while quit = 0 do
7:     if next = 0 then
8:         if not p_{x,y+1} ∈ R then
9:             if not SAMEREGION(pixelsInGroup, pixelsInGroup′) = 1 then
10:                groupList ← groupList + {pixelsInGroup′}
11:                pixelsInGroup ← pixelsInGroup′
12:            end if
13:            next ← 1
14:            R ← DELETEPIXELS(pixelsInGroup)
15:            x, y ← FINDLOWESTXY(R)
16:            pixelsInGroup′ ← pixelsInGroup
17:            pixelsInGroup ← null
18:        else
19:            pixelsInGroup ← pixelsInGroup + p_{x,y}
20:            y ← y + 1
21:        end if
22:    else
23:        next ← 0
24:        x ← x + 1
25:    end if
26:    if R = null then
27:        quit ← 1
28:    end if
29: end while
30: return groupList

The function FINDLOWESTXY(R) returns the tuple (x, y), representing a pixel in the segmented image, which has the lowest value in the region's list of black pixels, considering first the value of x and then the value of y. For example, (0, 2) is lower than (1, 0). Then, considering this tuple, we analyse each group of pixels using the function SAMEREGION(P, P′), which determines whether the groups of pixels P and P′ belong to the same group. Once a group is found and no other pixels can be part of it, the function DELETEPIXELS(P) flags its pixels in the region so that we can ignore them, given that they are already part of some group.
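As a companion to Algorithm 2, here is a minimal Python sketch conveying the same intent under a simplifying assumption of ours: each group is taken to be a maximal axis-aligned rectangle of black pixels, reported as (size, weight) = (width, height). The traversal details of the thesis implementation may differ.

    import numpy as np

    def split_region(mask):
        # mask: boolean matrix of one region's black pixels
        mask = mask.copy().astype(bool)
        groups = []
        while mask.any():
            # Lowest (x, y) in row-major order, as in FINDLOWESTXY.
            x, y = np.argwhere(mask)[0]
            # Grow the rectangle to the right while pixels are black.
            w = 1
            while y + w < mask.shape[1] and mask[x, y + w]:
                w += 1
            # Grow downwards while the whole row segment is black.
            h = 1
            while x + h < mask.shape[0] and mask[x + h, y:y + w].all():
                h += 1
            mask[x:x + h, y:y + w] = False  # analogue of DELETEPIXELS
            groups.append((w, h))  # (size, weight)
        return groups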


After applying this algorithm, we end up with a list of groups, groupList, with the corresponding size z and weight t, which correspond to the width and height of each split region. In Figure 5.6 we illustrate the result of applying our algorithm to a region. The dashed lines indicate the places where our algorithm splits the region. In this example the algorithm has found five groups.

Figure 5.6: Example of applying our split method to a region (right) given by the Hybrid Segmentation approach.

Also, in Figure 5.7 we show how we compute the size and weight of a group, which correspond to the width and height of the split, respectively.

Figure 5.7: Example of the size and weight of an estimated group. To better illustrate the result, we have greyed out the estimated groups except for one.

From our implementation we end up with a list of estimated groups with their corresponding sizes and weights.

5.2.1.1 Characterising the Estimated Social Structure

Even though we have an estimation of the social structure in the window, we need to characterise it such that a learning algorithm, namely a classifier, is able to recognise patterns of interest. This step is particularly important given that each window has its own social structure, and thus we cannot create a characteristic vector considering, for example, each possible group as a feature.

We have conducted an analysis of the results of our image segmentation method so as to study the characteristics, e.g. in terms of size and weight, of the estimated social structure. This will give us


Figure 5.8: Distribution of the estimated social structure size in a set of 680000 windows (panels for window sizes 125, 200 and 250; x-axis: size; y-axis: frequency).

information about the most common sizes and weights, which will be used to construct the characteristic vector from which the classifier is built.

In order to conduct this analysis, we have applied the image segmentation method to thousands of windows (680000 in total) of sizes 125, 200 and 250. Then, we have analysed the size and weight of the groups found by our method using histograms, so as to observe their distribution across the studied windows.

From our analysis we have noticed that there is a large number of groups of size two in every tested window, while groups of size six are less common (Fig. 5.8). Moreover, the graph shows that groups with size greater than six are even less common, and thus a change in this number may suggest an anomaly. Notice in the three graphs that the sizes of the groups are always even; this is because the hybrid segmentation method creates square cells of 2 × 2 pixels. Also, notice that 80% of the groups have a size from two to six.

The distribution of the weight of the groups is more uniform, given that we have usually found groups of weight two and four (Fig. 5.9). Groups heavier than four are unusual. As with the size of a group, the weights come in even numbers because of our image segmentation method.

As a result of our analysis, we have concluded that 85% of the social structure, if any, in a


Figure 5.9: Distribution of the estimated social structure weight in a set of 680000 windows (panels for window sizes 125, 200 and 250; x-axis: weight; y-axis: frequency).


window has a size from two to six and a weight from two to four.

Then, we have constructed a vector of 12 columns, where each value corresponds to the frequency of groups of a given size and weight in a window. For example, the first column corresponds to the number of groups with size and weight two, while the last column is related to the number of groups with size six and weight four.

Lastly, we have added another column to the vector with the frequency of all the groups with a size greater than 6 and weight 2, so as to retain information that could be of interest.
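A minimal sketch of this binning, assuming the estimated groups arrive as (size, weight) pairs from the split method above; the exact enumeration of the (size, weight) bins is not spelled out here, so the bins below are a hypothetical choice for illustration only.

    from collections import Counter

    def group_frequency_vector(groups, bins):
        # Count how many estimated groups fall into each (size, weight) bin,
        # plus one catch-all column for groups with size > 6 and weight 2.
        counts = Counter(groups)
        vector = [counts.get(b, 0) for b in bins]
        vector.append(sum(n for (z, t), n in counts.items() if z > 6 and t == 2))
        return vector

    # Hypothetical bins covering sizes 2..6 and weights 2..4:
    bins = [(z, t) for z in (2, 4, 6) for t in (2, 4)]
    print(group_frequency_vector([(2, 2), (2, 2), (4, 2), (8, 2)], bins))
    # -> [2, 0, 1, 0, 0, 0, 1]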

5.2.2 Features under Consideration for the Characteristic Vector

Once the social structure of a window has been estimated, we have constructed the characteristic vector considering:

1. Agent entropy
2. Object entropy
3. Frequency of the top active agent
4. Frequency of the top popular object
5. Statistics from Qw × QwT and QwT × Qw
6. Weight of the weight-maximal group
7. Size of the size-maximal group
8. Non-trivial groups (t = 2)
9. Non-trivial groups (z = 2)
10. Social degree
11. Estimated number of groups with z = 2 and t = 2
12. Estimated number of groups with z = 2 and t = 4
13. Estimated number of groups with z = 2 and t = 6
14. Estimated number of groups with z = 4 and t = 2
15. Estimated number of groups with z = 4 and t = 6
16. Estimated number of groups with z > 4 and t > 2
17. Estimated number of groups with z ≥ 5 and t > 2

Recall that features 1 to 10 are defined in Chapter 4.

With this vector, we capture the social structure of a window, and a classifier will be applied to it so as to learn patterns of interest.
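As an illustration of the first two features, the entropies can be computed as the Shannon entropy of the per-window query distributions; this is our reading for illustration, the precise definitions being those of Chapter 4.

    import math
    from collections import Counter

    def shannon_entropy(counts):
        # Shannon entropy (bits) of a frequency distribution.
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    # queries: list of (agent_ip, object_url) pairs seen in a window
    def agent_entropy(queries):
        return shannon_entropy(Counter(ip for ip, _ in queries).values())

    def object_entropy(queries):
        return shannon_entropy(Counter(url for _, url in queries).values())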

5.3 Characteristics from our DNS Sample

We have studied our DNS sample in terms of the characteristic vector. The rationale behind this study is that we want to identify ordinary conditions of the DNS traffic, as this will improve the identification of abnormal traffic conditions.

We have analysed the average packets per second in a minute because a persistent attack, e.g. a DDoS attack, will eventually change this metric. So, if our DNS sample exhibits a constant average packets-per-second rate, we will assume that it is normal traffic.

Not surprisingly, we have found patterns in the DNS usage (Fig. 5.10). First, we have noticed that the days of the week follow similar patterns across a month. For example, Wednesdays have lower activity in the number of packets with respect to other days of the week, while Fridays are the days with the most activity. Second, an automatic process is conducted in the first hours of the day, where hundreds of inverse queries are resolved. Given that this pattern appears at the same time on each of the sampled days, we concluded that it is an ordinary DNS process. Third, the activity on the


Figure 5.10: Average number of packets in a minute. The plot shows one month of activity from the studied DNS server, Monday to Friday (x-axis: minute of the day; y-axis: average packets per second in a minute). The solid bold line represents the day with most of the average activity.

DNS server increases drastically at 9AM, reaching a peak of 50 packets per second at 12PM. After this peak in activity, DNS traffic starts to decrease until it reaches 4 packets per second after 9PM. Notice that we have ignored weekends; this is because DNS activity changes drastically during these days.

Considering the average packets per second in a minute, we conclude that our DNS sample is normal in the sense that it exhibits a regular DNS usage.

Moreover, given that the activity of the DNS server varies according to the hour of the day, we have decided to conduct a deeper analysis of this behaviour.

5.3.1 Identifying Populations

We have conducted an analysis of our DNS sample, hypothesising that the traffic follows patterns according to the hour of the day. That is, we may be facing different populations which, once identified, can be handled from another perspective.

In an attempt to identify the existence of more than one population, we have considered the time window between 9AM and 9PM. This decision is sound because the studied DNS traffic was generated in an academic context, where most students and faculty staff work from Monday to Friday, starting at 9AM and finishing activities by 9PM. Also, studying this time period may allow us to mitigate the effect of a DDoS attack during the time when users would be affected the most.

After fixing the time window, we now study two features: the agent entropy and the object entropy. We have noticed that the number of outliers is reduced for both features after selecting the proposed


Figure 5.11: Box plot for the H_agt feature. (a) 24-hour week day: number of outliers = 426. (b) 9AM to 9PM week day: number of outliers = 91.

Figure 5.12: Box plot for the H_obj feature. (a) 24-hour week day: number of outliers = 1171. (b) 9AM to 9PM week day: number of outliers = 483.

time window. For example, the agent entropy (Fig. 5.11) shows a 41% decrease in its outliers, while the number of outliers in the object entropy (Fig. 5.12) is reduced by 30%.

The reduction in the number of outliers happened for several features, but for the sake of simplicity we only show the results of our analysis for these two.

Since separating DNS populations shows a more stable behaviour in terms of the number of outliers, we have decided to restrict our work to the population from 9AM to 9PM. Lastly, we should point out that considering only one population does not get in the way of detecting an anomaly at any other hour, given that if we are to consider another population, we only have to follow the same methodology.


5.4 Selection of the One-class Classifier

Once we have our characteristic vectors, we have to select a classifier. We have considered using a one-class Support Vector Machine (SVM). This classifier aims to learn patterns from ordinary conditions of the DNS traffic for some specified ν between 0 and 1, hence the name ν-SVM.

5.4.1 The One-class Support Vector Machine (SVM)

An SVM is a supervised learning model used for classification and regression analysis. It is based on the construction of hyperplanes (i.e. a decision function) so as to separate instances in a given space as a function of a parameter p.

The SVM has proven to be useful and is widely used in several areas, including Intrusion Detection Systems (IDS) [38], medicine [31], and novelty detection [59], among others. However, using it for a classification task involves three challenges: selecting the representation space, selecting the type of SVM, and selecting suitable parameter values.

The decision to use ν-SVM is motivated by the diversity of DNS attacks, since they can be conducted in several ways. For example, one can exhaust the DNS server buffer by asking for the same domain hundreds if not thousands of times from a single IP, or exhaust it using several IP addresses. ν-SVM addresses this problem by training, to the best of our knowledge, only for a normal DNS traffic class.

5.4.1.1 Representation Space

One of the most widely used representation spaces is the Radial Basis Function (RBF) kernel. This kernel shows good performance and is considered the default kernel choice when there is not enough knowledge about the data [59]. In this work we chose the RBF kernel as our representation space.

The RBF kernel gives rise to the parameter γ, which controls the influence of a single instance on the kernel representation. Indeed, a small γ means that the computational cost of constructing the SVM is greater, because all instances have influence on the decision function of the SVM; by contrast, a large γ is easy to compute because each instance tends to be considered a support vector, giving rise to overfitting.

5.4.1.2 Type of SVM - ν-SVM Classifier

The ν-SVM classifier was selected as it aims to identify instances of the same population by training the SVM with a sample of that population [55]. It is used in ensembles of classifiers as a common strategy to address the class imbalance problem [17], and for novelty detection. In particular, we have to train it only for the normal class, and everything that does not belong to the class can be considered an anomaly.

Using ν-SVM, however, involves selecting the so-called slack parameter ν, which controls the number of support vectors. The greater the slack parameter, the wider the margin of the SVM,


because we allow a greater number of support vectors to stay within the margin, improving the generalisation of the model but with a greater error; by contrast, a small value of the slack parameter gives rise to a tighter margin, and an overfitting problem may occur.

5.4.1.3 Parameter Values of the SVM

Once the selection of the kernel and SVM type is made, the only step left is to adjust the parameters for the construction of the SVM. As with any other machine learning technique, the optimal parameter selection lies between generalisation and overfitting. In our one-class classifier, the two parameters to adjust are the parameter γ, related to the kernel, and the parameter ν, related to the number of support vectors of the one-class SVM.

To the best of our knowledge, there is no standard technique to select the best candidate parameters, but rather approaches based on trial and error. We shall have more to say about this in the next section.

5.5 Construction of the Classifier

In this section we shall present our experimental setting, which aims to show our methodology to construct the ν-SVM classifier. The construction of the classifier followed a five-step approach, while the evaluation appealed to the so-called ROC, F-β, and Recall/Precision curves.

5.5.1 Outline of the Construction

In our experimentation, we have used the training, test and validation sets from our DNS sample with window sizes 250 and 500.¹ The decision to consider these window sizes is motivated, in part, by the desire to give a timely alarm. In our DNS sample, the average number of packets per second is 23.7 and, under the assumption of perfect classification, an anomaly will be identified within the first 21 seconds using a window of size 500; by contrast, considering a window of size 250, an alarm should be raised in about 10 seconds.

However, we should point out that in the window sliding approach there is a trade-off in the size of the window. The larger a window is, the greater the chance of noticing the anomaly, at the price of increasing the time to detection, while a small window implies that we will only see a portion of the attack, which may go unnoticed by the classifier.

¹Refer to Chapter 3 for details about the construction of the training, test and validation sets.


Then, for each window size we have constructed a ν-SVM classifier by tuning the SVM parameters so as to identify the best classification rates. In particular, the construction of the classifier is as follows (a code sketch is given below):

• First, construct a search space by defining lower and upper limits for the parameters ν and γ. We have started at ν = 0.01, γ = 1 for the lower limit and ν = 0.8, γ = 64 for the upper limit, considering steps of 0.01 for ν and 2 for γ.

• Second, construct a classifier for each of the values in the search space and determine the best classifier in terms of the classification rate.

• Third, construct another search space, taking as the lower limit the parameters of this classifier and as the upper limit twice the value of the parameters of the best classifier.

• Finally, explore the new search space and determine the best classifier. Go back to the third step if the classification rate is low, e.g. in terms of missing alarms or false alarms; otherwise, consider the classifier as the best candidate.

If applied for each window size, this procedure ends up with two classifiers, each of which corresponds to a given window size, namely 250 and 500.
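The first two steps of this procedure can be sketched with scikit-learn's one-class SVM as follows. This is our own illustration, not the thesis code, and X_train, X_val, y_val are placeholder names for the training windows and a labelled validation set.

    import numpy as np
    from sklearn.svm import OneClassSVM

    def grid_search(X_train, X_val, y_val, nus, gammas):
        # Train one classifier per (nu, gamma) pair; keep the most accurate
        # on a labelled validation set (+1 = normal, -1 = anomalous).
        best, best_acc = None, -1.0
        for nu in nus:
            for gamma in gammas:
                clf = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X_train)
                acc = np.mean(clf.predict(X_val) == y_val)
                if acc > best_acc:
                    best, best_acc = clf, acc
        return best, best_acc

    # Initial search space as described above.
    nus = np.arange(0.01, 0.81, 0.01)   # nu in [0.01, 0.8], step 0.01
    gammas = np.arange(1, 65, 2)        # gamma in [1, 64], step 2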

After constructing the classifier, we have evaluated it considering the corresponding ROC, F-β, and Recall/Precision curves.

Lastly, in order to test its robustness, we have evaluated our classifier in a series of experiments, which are described below.

5.6 The Test Set

In order to validate the robustness of our classifier, we have conducted a series of experimental tests considering several scenarios. These tests aim to show that our classifier does not suffer from overfitting problems or class imbalance problems, and that it does not accept just any instance as part of the class.

5.6.1 Classification on a Synthetic Attack

Given that there is not enough data about anomalies, we have synthesised an attack considering two characteristics of a DDoS attack: persistence, because it may last for hours, and massive generation of traffic, because it may use hundreds if not thousands of attacker IPs. In this attack, we started by simulating three attacker IPs asking for dozens of domains, which were inserted into normal traffic; then, we increased the number of attacker IPs and the number of domains. The increase in the number of domains and attacker IPs was made considering that an attacker appears with 50% probability. This is done until we reach the upper limit, which is determined by the average number of different IPs and domains: 20 and 14,993 for attacker IPs and anomalous domains, respectively.

The rationale behind synthesising the attack in this way is that we expect that, during an attack, there should be a time when the attack traffic blends in with legitimate DNS activity before eventually disrupting the DNS service. By contrast, an attack that drastically increases (e.g. from dozens to thousands) the number of domains and IPs should be evident to any anomaly detector.
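A minimal sketch of this injection scheme, under our reading of the procedure; the IP ranges, domain names, ramp rates and random seed are illustrative only.

    import random

    def inject_synthetic_attack(windows, max_ips=20, max_domains=14993, p=0.5):
        # Gradually blend attacker queries into successive windows of normal
        # traffic; each candidate attacker IP appears with probability p.
        random.seed(0)
        n_ips, n_domains = 3, 24  # start small: 3 attacker IPs, dozens of domains
        for w in windows:         # each window: list of (ip, domain) pairs
            attackers = [f"10.0.0.{i}" for i in range(n_ips)]
            domains = [f"bad{j}.example.com" for j in range(n_domains)]
            for ip in attackers:
                if random.random() < p:  # attacker appears with 50% probability
                    w.append((ip, random.choice(domains)))
            # ramp up towards the observed daily averages
            n_ips = min(n_ips + 1, max_ips)
            n_domains = min(n_domains * 2, max_domains)
        return windows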


5.6.2 Classification on a Conducted Attack

We have tested our classifier in a more realistic scenario by conducting a DoS attack on the DNS server under study. The rationale behind conducting a DoS attack instead of a DDoS attack is that a DDoS attack is infeasible, as it implies disrupting the service for hundreds if not thousands of users.

Our implementation of the DoS attack is based on a DNS amplification attack tool.² Basically, it considers a single attacker computer which spoofs IP addresses so as to request resolution on behalf of the spoofed addresses. The answer to each query is then redirected to the spoofed address in an attempt to overwhelm the victim.

We have spoofed four IP addresses, each querying for three random domains. We have also considered that an attacker could use a dictionary of 14,993 domains so as to mimic ordinary DNS traffic, since that is the number of different domains queried in a day on the DNS server under study.

Our attack lasted 5 minutes. During the first minute of the attack, the average packets per second was 38. By contrast, at minute 2 of the attack we reached a peak of 69 packets per second on average. At the end of our attack, the average packets per second was 13.

Although the attack had a small impact on the DNS server, we want to investigate whether it can be identified by our classifier, as it could be part, we hypothesise, of an incoming DDoS attack, and thus we could mitigate its effect.

We have tested our classifier with our conducted attack.

5.6.3 Classification on Abnormal Activity

After constructing our classifier and testing it with a synthetic and a conducted attack, we have applied it to a day with abnormal DNS activity, dated March 28th, 2014. This was done so as to get an insight into a possible attack on the DNS server under study.

As a preliminary analysis of this abnormal day, we have investigated the average packets per second in a minute so as to compare it with an ordinary day (Fig. 5.13). In particular, we have contrasted this metric against the same day of the week.

Interestingly, we have found that the average number of packets increased drastically with respect to ordinary traffic. After analysing the activity of this day, we found that a couple of IP addresses were generating over 30% of the total packets in the day by requesting resolution of abnormal domains (e.g. uv.liebiao.800fy.com).


Also, during this time period, numerous DNS administrators reported similar domain patterns on their DNS servers.³ So, we have concluded that this particular day is a real anomaly to which our classifier will be applied.

²Available at: https://arcti.cc/python_dns_amplification.html
³For example, DNS-OARC in https://indico.dns-oarc.net/getFile.py/access?contribId=23&resId=0&materialId=slides&confId=19


Figure 5.13: Average packets per second in a minute (x-axis: minute of the day). The solid line stands for a typical day of DNS activity, while the dashed line refers to the abnormal activity.

5.6.4 Classification on Monthly Traffic

We have applied our classifier to a month of DNS traffic so as to gain insight into the attacks, if any, affecting the DNS server under study. In particular, we have considered this month without knowing first-hand whether an anomaly happened. So, if a window is classified as abnormal, we look into its characteristics in order to determine whether it is a false alarm.

This experiment produced encouraging results.

5.7 Experimental Results

In this section, we show the results of applying our classifier to the proposed experiments. To illustrate the performance of our classifiers, we have used ROC, F-β, and Recall/Precision curves. While the ROC curve is standard in the machine learning community, the Recall/Precision curve is used to notice class imbalance related problems, if any [17]; the higher and the further to the right the curve is, the better the classifier.
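For reference, these curves can be produced with standard tooling; below is a minimal sketch with scikit-learn, where y_true and scores are toy placeholders for the validation labels and the classifier's decision values.

    import numpy as np
    from sklearn.metrics import roc_curve, precision_recall_curve, fbeta_score

    # Toy placeholders: 1 = anomalous window, 0 = normal.
    y_true = np.array([1, 1, 0, 0, 1, 0, 0, 0])
    scores = np.array([0.9, 0.8, 0.3, 0.4, 0.7, 0.2, 0.6, 0.1])

    fpr, tpr, _ = roc_curve(y_true, scores)                       # ROC curve
    precision, recall, _ = precision_recall_curve(y_true, scores) # PR curve
    f2 = fbeta_score(y_true, scores > 0.5, beta=2)                # beta > 1 weights recall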

5.7.1 Classification on a Synthetic Attack

We report on the performance of applying the classifier to normal DNS traffic and the synthetic attack (see Section 5.6.1).


After tuning the SVM classifier, our methodology selects the parameters γ = 1.2 and ν = 0.01 for a window size of 250, and γ = 0.9 and ν = 0.01 for a window size of 500. The accuracy of the classifier is shown in Table 5.1.

Table 5.1: Confusion matrix for the A (anomaly) and N (normal) classes.

              w = 250                      w = 500
              Predicted A   Predicted N    Predicted A   Predicted N
  Actual A    100%          0%             100%          0%
  Actual N    5.08%         94.92%         6.07%         93.93%

In general, the classifier is able to identify the synthetic anomaly without errors, while the false alarm rate is 5.08% and 6.07% for window sizes of 250 and 500, respectively. In order to support these observations, we show the corresponding ROC curves (Fig. 5.14), which are high and far to the left for both window sizes.

Figure 5.14: ROC curves for both windows, 250 and 500. Notice that both FPR and TPR range over the interval [0,1] and that, to better appreciate our results, we have plotted both axes using a log scale.

The Recall/Precision curves are consistent with our previous observation, since the curve is high and far to the right for both window sizes, 500 and 250 (Fig. 5.15).

In order to generate the F-β curve, we have determined the classifier with the best F1 score (i.e. β = 1). Indeed, Figure 5.16 shows the performance of the classifier for each window, considering a score of 0.9.

This curve shows that when we weight recall over precision, the overall performance increases. By contrast, when we prefer precision over recall, i.e. β → 0, the performance decreases slightly.


Figure 5.15: Recall/Precision curves for both windows, 250 and 500.

Figure 5.16: F-β curve for both windows, 250 and 500.


Discussion: An explanation for the zero missing alarm rate (i.e. no false negatives) is related to how the synthetic attack was constructed. Indeed, we synthesised the attack starting from three attacker IPs and then increased the number of attackers; by contrast, a typical window in the DNS server under study contains on average 32 IPs for a window of size 500 and 21 IPs for a window of size 250. Consequently, the classifier is able to identify the attack because the changes in these features are evident.

However, we have found similar results (with a slight variation) in another experiment considering 32 and 21 attacker IPs on average for window sizes of 500 and 250, respectively. Still, our classifier was able to clearly identify abnormal windows, with a detection rate of 90%. Ordinary windows, by contrast, achieved a detection rate of 93%. These results also show that our classifier does not suffer from generalisation problems, since it does not accept just any window as part of the class.


5.7.2 Classification on a Conducted Attack

We have applied the classifier to the normal DNS traffic and the conducted attack (see Section 5.6.2). The evaluation of the classifier again considers the parameters γ = 1.2 and ν = 0.01 for a window of size 250, and γ = 0.9 and ν = 0.01 for a window of size 500. The accuracy is shown in Table 5.2.

Table 5.2: Confusion matrix for the A (real attack) and N (normal) classes.

              w = 250                      w = 500
              Predicted A   Predicted N    Predicted A   Predicted N
  Actual A    89.65%        10.35%         100%          0%
  Actual N    5.08%         94.92%         6.07%         93.93%

Notice that for the conducted attack, the performance of the classifier decreases for the abnormal class when considering a window of size 250; however, considering a window size of 500, the performance remains similar to the previous experiment.

Also, the performance of the classifier is similar to that for our simulated attack, given that the ROC curve (Fig. 5.17) is still high and far to the left.

Figure 5.17: ROC curves for both windows, 250 and 500 (AUC = 0.99733).

The Recall/Precision curve (Fig. 5.18), however, shows poor performance for the window of size 250, possibly indicating a class imbalance problem. By contrast, the results using a window of size 500 are encouraging, since the curve is far to the right.


Figure 5.18: Recall/Precision curves for both windows, 250 and 500.

The classifier was also evaluated with the F-β curve (Fig. 5.19) for the best classifier by F1 score: 0.55 and 0.78 for window sizes 250 and 500, respectively. In general, our classifier shows good performance when false negatives are weighted more heavily.

Discussion: Our classifier was able to identify a DoS attack with zero missing alarms for a window of size 500 and an encouraging detection rate of 89.65% using a window size of 250. This behaviour is explained by the fact that the larger the window, the more anomalous queries the classifier is able to notice, yielding perfect classification of the abnormal class.

5.7.3 Classification on Abnormal Activity

We have applied our classifier to the abnormal day discussed in Section 5.6.3. The results show that the proportion of windows predicted as abnormal in the DNS traffic was 84.25%.

We have looked into the raw log so as to identify the kind of anomaly present in the windows marked as abnormal. Not surprisingly, 90% of the windows classified as abnormal were querying non-existent domains (e.g. sgsfssd.800ffy.com), possibly in order to conduct a DNS amplification attack. This observation is supported by a report from DNS-OARC (see Section 5.6.3) describing this behaviour as happening constantly around the globe. For the remaining 10%, it is unclear whether they were intended to conduct abnormal activity.

With this information, it is possible to filter the DNS packets, at a local DNS server, intended to conduct an amplification attack before they reach their victim. Moreover, even if it is not possible to identify 100% of the attack, reducing the number of DNS packets by 84%, as in this case, is a significant way to mitigate the effect of a DDoS or amplification attack.


Figure 5.19: F-β curve for the best classifier according to our F1-score curve.

5.7.4 Classification on Monthly Traffic

We have tested our classifier using a month of DNS activity so as to gain insight into the traffic conditions. Our experimentation considers a window size of 500. The classifier identified 9% of the windows (4902) as abnormal.

In an attempt to understand our classifier's predictions, we have plotted the average number of packets per second in a minute for the day with, according to the classifier, most of the anomalies (86%), so as to compare it with a typical day of DNS activity.

From Figure 5.20 we can notice a similar average number of packets in a minute for the ordinary day and the day with abnormal activity. Moreover, notice a peak of activity of 66 packets per second on the ordinary day, while at the same hour the abnormal day showed 55 packets per second. Also note an increase in the number of packets in the early hours of the day, which could be considered abnormal activity.

These observations raise two questions: 1) why does the classifier predict windows as abnormal? and 2) what kind of anomaly, if any, happened during that day?

In order to answer both questions, we have looked into the raw log of the abnormal windows, in an attempt to understand the classifier's predictions. We found domains constructed in a human-friendly way (e.g. www.notengoip.com). This contrasts with the random-generation behaviour found in the abnormal day (see Section 5.6.3). Moreover, the number of windows marked as abnormal is small compared with the total amount of DNS traffic in a day.

After analysing the generated domains, we concluded that the windows labelled as abnormal arise from bot activity. This observation is supported by evidence in several security blogs and bulletins, like CERT⁴, which report these domains as part of a complex botnet called Mariposa.

⁴https://ics-cert.us-cert.gov/advisories/ICSA-10-090-01


Figure 5.20: Average number of packets in a minute (x-axis: minute of the day). The solid line stands for a typical day of DNS activity, while the dashed line refers to the abnormal activity dated February 04th.

This result suggests that our classifier is able to detect slight changes in the ordinary conditions of the DNS that are not necessarily related to a DDoS attack but to a bot, possibly trying to coordinate an attack.

5.8 Analysis of the Number of Alarms

Typically, an anomaly-based detector is evaluated in terms of the detection of anomalies; the larger the number of true alarms raised, the better the detector. However, if our detector raises an alarm for each abnormal window found, it could be inconvenient for a DNS administrator. This is because an alarm can be raised when facing an outlier in the DNS traffic, e.g. a sudden increase in the number of distinct IPs because of a human event. Moreover, a single abnormal event has a negligible effect on the DNS server, while a DDoS attack will overwhelm the DNS service.

To the best of our knowledge, few efforts have been directed towards understanding the nature of alarms, in the sense that some of them could be mere outliers in the DNS traffic. So, we have investigated the alarms raised by our classifier. In particular, we have conducted this study by analysing collections of five consecutive windows; if three or more windows are predicted by our classifier as abnormal, we label the collection as abnormal; otherwise, the collection is considered normal. We have conducted this study in steps of one window in all four of our scenarios, considering the best classifier with a window of size 500, γ = 0.9 and ν = 0.01 (see Section 5.7).
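A minimal sketch of this filtering rule, with the per-window predictions given as booleans (our naming; True means the window was predicted abnormal):

    def filter_alarms(preds, group=5, threshold=3):
        # Slide over per-window predictions in steps of one window and raise
        # an alarm only when >= threshold windows of each group of consecutive
        # windows are abnormal.
        return [sum(preds[i:i + group]) >= threshold
                for i in range(len(preds) - group + 1)]

    # Example: an isolated abnormal window raises no alarm; a burst does.
    print(filter_alarms([False, True, False, False, False, True, True, True, False]))
    # -> [False, False, False, True, True]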


5.8.1 Number of Alarms Considering the Synthetic Attack

For this test, our analysis of the collections of windows showed that all the collections were abnormal. This is because the classification rate for the abnormal class was 100%. Also, our analysis of the collections of windows shows a reduction of 77% in the number of false alarms. In order to assess the importance of these results, let us present the following example. First, according to our DNS server, on average there are 500 DNS packets every 21 seconds. Second, recall that we are considering a 12-hour day on a local DNS server. Third, consider that, using our results, we have 174 false alarms in the 12-hour day. Lastly, assume that our detector is the only IDS implemented. Then, without the analysis of collections of windows, we may end up with over 14.5 alarms every hour; by contrast, with the analysis of collections of windows, we end up with about 3.3 alarms every hour.
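For clarity, the arithmetic behind these figures (our own check of the numbers above):

    false_alarms = 174
    hours = 12
    print(false_alarms / hours)               # ~14.5 alarms per hour, unfiltered
    print(false_alarms * (1 - 0.77) / hours)  # ~3.3 alarms per hour, filtered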

We are aware that 3.3 alarms every hour remains inconvenient for a DNS administrator. However, our approach could be extended to consider more than five windows so as to reduce the number of false alarms further, at the cost of increasing the chance of missing a real attack.

5.8.2 Number of Alarms Considering the Conducted Attack

Our analysis of the abnormal class indicates that all the collections of windows are abnormal; this is because the classification rate was 100%. In particular, in this attack we have found that 50% of the collections marked as abnormal were in consecutive order, indicating a persistent behaviour, namely our attack. The remaining 50% is made up of an abnormal event that happened after we launched our attack, and traces of our attack. These results show that the number of collections of windows labelled as abnormal will increase during a DDoS attack. Moreover, if we observe consecutive collections labelled as abnormal, it is more likely to be related to a DDoS attack.

Considering the number of false alarms, we have noticed a reduction of 12%. For our DNS server, for example, this could mean a false alarm every hour and twelve minutes.

5.8.3 Number of Alarms Considering the Abnormal Activity

In this experiment the number of alarms increased by 3%. This is because we were facing a real anomaly that happened on our DNS server. Moreover, about 85% of the windows labelled as abnormal appeared in consecutive order, meaning that our DNS server was under attack almost all day.

5.8.4 Number of Alarms Considering Monthly Activity

After analysing this experiment, we have found that there are over 3400 collections of windows marked as abnormal in a single day, which in turn corresponds to 51% of the total DNS activity during that day. The reduction in the number of alarms is 20%.


5.8.5 Discussion

Except for the abnormal activity day, we have decreased the number of false alarms raised by our detector. Notice that consecutive collections of windows should be managed in a special way, because we may be facing a DDoS attack. By contrast, collections of windows that happen at different times should be managed as outliers of the DNS activity.

Certainly, we can analyse collections of windows considering more than five windows, or even analyse a set of collections, so as to give more importance to the alarms or to reduce the number of false alarms. For example, if a set of five consecutive collections is marked as abnormal, we may be facing a DDoS attack. If a DNS administrator notices this kind of behaviour, they could mitigate the effect of the attack before it impacts the DNS performance.

Lastly, notice that if this approach were implemented in a detector, a priority alarm could be raised 1.75 minutes after the attack starts. This is an encouraging result, since the time to detection is short enough to mitigate the effect of the attack.

5.9 Comparative Results

We have evaluated our approach so as to contrast it with related works in terms of five characteristics that an IDS on a DNS server should have. The following characteristics are under consideration:

• Time to detect (TD). The time needed to raise the first alarm during an attack. Theoretically, an IDS should raise an alarm fast enough to mitigate the effect of a DDoS attack on a DNS server. We will contrast the times provided by the authors with the time we need to raise an alarm.

• Test set (TS). This characteristic is satisfied if the method has coverage; that is, if the evaluation of the method considers one or more scenarios, either synthesised or taken from a real-world process. Also, the authors must show the reasoning behind testing their method on these scenarios.

• Representative sample of ordinary DNS traffic (RS). This characteristic considers that a work must report evidence about the construction of a representative sample of DNS traffic, either collected from a real process or generated using a network traffic generator. This sample should capture the ordinary conditions of the DNS server under study, e.g. in terms of packets per second.

• Precision and Recall (PR).- As in machine learning, a system with high precision has more predictions correctly classified as an attack, while the reverse holds in a system with low precision. Moreover, a system with high recall detects more attacks with respect to the total number of attacks. Clearly, this is also related to the number of missed alarms and false alarms (the standard definitions are recalled after Table 5.3).

• Alarm filtering (AF).- This characteristic prioritizes alarms by reducing the number of outliers that are detected by an IDS. For example, a single abnormal event has a negligible effect on the performance of a DNS server; by contrast, the reverse holds when the number of abnormal events increases.

Table 5.3: Characteristics satisfied by related works.

                                   TD   TS   RS   PR   AF
    Our approach                   ✓    ✓    ✓    ✓    ✓
    Van Dyke Parunak, et al. [63]  ✗    ~    ✗    ~    ✗
    Pin Ren, et al. [54]           ✗    ✓    ✗    ✗    ✓
    Bilge Leyla, et al. [8]        ✓    ✓    ✗    ✓    ✗
    Yuchi Xuebiao, et al. [72]     ✗    ✓    ✓    ✗    ✗
    Yao Wang, et al. [65]          ✗    ~    ✗    ✗    ✗
    Yuchi Xuebiao, et al. [73]     ✗    ✓    ✗    ✗    ✓
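For reference, the PR column relies on the standard definitions, in terms of true positives (TP), false positives (FP), and false negatives (FN):

    Precision = TP / (TP + FP),        Recall = TP / (TP + FN).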

5.9.1 Related Work under Study

In general, a characteristic will be considered as satisfied if the work under consideration presents evidence to support it. The following studies are under consideration so as to contrast them with ours:

• An Agent-Based Framework for Dynamical Understanding of DNS Events (DUDE) [63]

• Visualizing DNS Traffic [54]

• EXPOSURE: A Passive DNS Analysis Service to Detect and Report Malicious Domains [8]

• Modeling DNS Activities Based on Probabilistic Latent Semantic Analysis [72]

• Tracking Anomalous Behaviors of Name Servers by Mining DNS Traffic [65]

• A New Statistical Approach to DNS Traffic Anomaly Detection [73]

5.9.2 Discussion

For the sake of simplicity, we have constructed Table 5.3 so as to show an overview of the characteristics satisfied by the works under consideration. The symbol ✓ is given to a work that satisfies a characteristic; the symbol ✗ is assigned if it does not. The symbol ~ is given if the work partially satisfies the property. In what follows we discuss how each characteristic is satisfied by the related work and contrast our results.


5.9.2.1 Time to Detect (TD)

Bilge, et al. [8] satisfy this characteristic by analysing each DNS packet as soon as it is captured, so as to determine whether it is anomalous. Although their main goal is to detect abnormal domains, if applied as a detector of DDoS attacks, their method may raise a number of false alarms. The other authors did not provide evidence about the time to detect an abnormal event.

In our approach we provide evidence that, when facing a persistent anomaly, the time to detect is about 0.9 minutes and 1.8 minutes for a collection of five consecutive windows of size 250 and 500, respectively. This follows because, as depicted in Section 5.5.1, the average number of packets per second at the DNS server under study is 23.7.
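As a quick check, these figures follow directly from the window size w and the observed traffic rate: a collection of five windows holds 5·w packets, so the detection time is (5 × 250) / 23.7 ≈ 52.7 s ≈ 0.9 min, and (5 × 500) / 23.7 ≈ 105.5 s ≈ 1.8 min.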

5.9.2.2 Test Set (TS)

The works in [8, 73, 65, 72] tested their approaches considering synthetic traffic, but without supporting the decisions behind the construction of such traffic. By contrast, we have synthesized traffic based on a real anomaly, considering typical characteristics of a DDoS attack, such as the gradual increase in the volume of traffic and its persistence. Generating synthetic anomalies carefully is relevant because incorrect assumptions about anomalies could bias the detection mechanism. For example, inserting anomalies arbitrarily into the raw data may create an evident pattern that is straightforward to detect.

In order to support the robustness of our detector, we have also tested our classifier on an attack we conducted against a real-world process. None of the related works tested their classifiers on a conducted attack.

The works in [63, 73, 65, 72, 54] report their findings considering a real anomaly, but they do not provide evidence, e.g. in terms of its characteristics, of the so-called anomaly. By contrast, we have shown two types of anomalies. The first anomaly, reported by DNS OARC (see Sec. 5.6.3), refers to an abnormal DNS activity where a few IPs generate a great number of distinct DNS requests, probably in an attempt to conduct an amplification attack. The second anomaly is related to the Mariposa botnet (see Sec. 5.6.4), where a few IP addresses were generating human-friendly domains. We showed evidence of our anomalies supported by the DNS OARC and CERT reports.

5.9.2.3 Representative Sample of Ordinary DNS Traffic (RS)

Mostly, the works under consideration have limitations in showing that their collected DNS traffic follows a constant pattern in its characteristics, e.g. in terms of packets per second, so as to consider it as ordinary conditions. This step is important because it improves the accuracy of the detection technique and shows that the technique is not biased.

Our work shows that, using a simple metric, namely the average number of packets per second in a minute, our DNS sample exhibits a constant pattern that we call ordinary conditions.
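For concreteness, this metric can be computed from a list of packet arrival timestamps as follows; the sketch is ours, provided only as an illustration, and is not the tooling used in this thesis:

    from collections import Counter

    def avg_pps_per_minute(timestamps):
        """Average packets per second, minute by minute; `timestamps` are
        packet arrival times in seconds. Each one-minute bucket's packet
        count is divided by 60 to obtain its rate."""
        per_minute = Counter(int(t // 60) for t in timestamps)
        return {m: c / 60.0 for m, c in sorted(per_minute.items())}

    # A sample exhibits ordinary conditions when its per-minute rates stay
    # close to the long-run average (about 23.7 pkt/s for our server).
    print(avg_pps_per_minute([0.1, 0.5, 30.2, 61.0, 62.5, 63.1]))
    # -> {0: 0.05, 1: 0.05}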

5.9.2.4 Precision and Recall (PR)

[63] reports on these metrics but in an incomplete state, in the sense that the authors are not sure how to interpret traffic detected as anomalous. Their inserted synthetic anomaly has a poor performance, with a 56% detection rate. The work of [8] showed a 98% detection rate with a false positive rate of 1% on their synthetic data. By contrast, we have shown a 100% detection rate on synthetic traffic and over 93% on a conducted attack. We support this observation by presenting the corresponding ROC and F-β curves.

5.9.2.5 Alarm Filtering (AF)

[73] showed an approach to alarm filtering based on a threshold. In our approach, we have analysed collections of consecutive windows. After applying this technique we have also reduced by 20% the number of alarms that have a negligible effect on the DNS server, as shown in Section 5.8. Complementarily, we have shown that, following this approach, we can raise an alarm that is more likely to be related to a persistent anomaly, namely a DDoS attack.


5.10 Conclusions

We have shown that considering the DNS social structure contributes significantly to the detection of DDoS attacks.

In order to design our detector we have considered the DNS social structure and constructed a ν-SVM. Roughly, we have constructed the classifier in the following way: first, by proposing features that capture the social structure that arises in the DNS server; second, by collecting a DNS sample which has ordinary conditions, i.e. without attacks; third, by training our classifier with this sample; fourth, by proposing a test set, composed of four experiments, on which our detector is applied. Lastly, we have filtered the alarms raised by our detector.
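The following sketch shows the shape of this pipeline using the one-class ν-SVM from scikit-learn; the feature extraction step is elided, and the ν value, kernel, and random stand-in data are illustrative assumptions rather than the parameters used in this thesis:

    import numpy as np
    from sklearn.svm import OneClassSVM

    # Each row is the feature vector of one window, capturing its estimated
    # social structure; random data stands in for the real features here.
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(1000, 4))  # sample under ordinary conditions
    X_test = rng.normal(size=(10, 4))     # windows to classify

    # nu upper-bounds the fraction of training points treated as outliers.
    clf = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale")
    clf.fit(X_train)

    print(clf.predict(X_test))  # +1 = ordinary window, -1 = abnormal window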

We have presented a methodology to evaluate our classifier considering four scenarios: one related to a synthetic anomaly, another to a conducted attack, and the other two related to real anomalies found in the DNS traffic. Also, we have evaluated our classifier with the corresponding ROC, F-β, and Precision-Recall curves.

We have obtained encouraging results, given the detection rates of 89.6% and 100% for window sizes of 250 and 500, respectively.

To the best of our knowledge, few efforts have been made towards understanding the alarms raised by a detector in a DNS server. Indeed, we have shown that by analysing collections of windows it is possible to raise alarms that are more likely to be related to DDoS attacks, and to reduce the number of false alarms.

Our detector has been able to identify botnet activity. We therefore hypothesize that, considering the social structure of DNS, bot activity can be blocked, preventing bots from coordinating an attack, for example. The reasoning behind this claim is that a bot will coordinate attacks using URLs. Indeed, future work is concerned with validating this hypothesis by testing it against deployed botnets. We shall have more to say about this in the thesis conclusions.


6. Conclusions and Indications for Future Work

The DNS service is a distributed naming system for computers. Mainly, it is used to translate URLs to IP addresses. DNS is hence a critical service and a common target of DDoS attacks. Examples of DDoS attacks attempting to disrupt the DNS service include the so-called events of 2002 [64], an attack on the root servers in 2007 [32], and a more recent attack in 2014.¹ Some efforts have been made towards its protection. However, these efforts have either been focused on proposing mechanisms based on detecting values out of a given boundary, have been short-term studies, or have not reported their results properly, e.g., the time to detect an attack.

In this work we have evidenced that the DNS social structure contributes to the detection of anomalies attempting to disrupt the DNS service. Hence, we have validated our hypothesis, which can be described in terms of three major contributions: the study of the theoretical complexity of computing social structures, the study of the experimental tractability of computing social structures, and the development of an anomaly-based detector considering the DNS social structure. We discuss them below.

This thesis has contributed to the study of the theoretical complexity of the SGC problem. First, we have proved its NP-completeness by means of a Karp reduction; hence, unless P = NP, any algorithm attempting to solve SGC will require an exponential amount of computational resources in the worst case. Second, we have proved that even if we restrict the SGC problem to agent pairs, the problem remains NP-complete. Third, we have shown that it is possible to determine some characteristics of SGC in polynomial time by computing the Gram matrix. For example, we can determine the size of a maximal group, but not the witness required to solve the problem.

¹ http://goo.gl/bMkiwX


Hence, further research towards a theoretical understanding of SGC should focus on studying the Gram matrix, as it may provide more information regarding the social structure. Moreover, the Gram matrix might be useful to develop a much finer heuristic so as to approximate more accurately the complete social structure of a window. As an example, the Gram matrix provides insight into the groups with t = 2 and size k ≥ 2, the so-called 2-weight groups (see Chapter 3); this information might be used to estimate non-trivial groups with maximal weight, along with other measures, e.g. the 2-size groups.
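For concreteness, under one natural encoding (our assumption, not a transcription of the thesis's code), a window is a binary agent-by-object matrix A, with A[i, j] = 1 if IP i requested URL j; the Gram matrix G = A·Aᵀ then counts, for every pair of agents, the objects they have in common, so the 2-weight groups can be read off the off-diagonal entries:

    import numpy as np

    # Rows are agents (IPs), columns are objects (URLs);
    # A[i, j] = 1 iff agent i requested object j at least once.
    A = np.array([[1, 1, 0, 1],
                  [1, 1, 0, 0],
                  [0, 0, 1, 1]])

    G = A @ A.T  # G[i, k] = number of objects agents i and k have in common

    print(G)
    # [[3 2 1]
    #  [2 2 0]
    #  [1 0 2]]
    # e.g. agents 0 and 1 form a 2-weight group sharing 2 objects.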

As our second contribution, we have studied the computational cost of solving SGC by means of a phase transition study. Our results show a typical easy-hard-easy pattern in the cost of finding a group of size 3 to 6; solving the hardest instances involves an exploration of over 28,000 combinations. Our results can be used to design efficient algorithms attempting to solve SGC, given that they provide a fair baseline for algorithm comparison. For example, in order to test the performance of an algorithm, it is possible to select the hardest instances, which according to our results involve exploring over 28,000 combinations, and test whether the algorithm can solve those instances more efficiently.
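To illustrate what exploring a combination means here, the sketch below runs a brute-force search under our reading of SGC (find k agents whose request sets share at least t common objects) and counts the candidate groups it examines; it is only an illustration of the search space, not the algorithm evaluated in the phase transition study:

    from itertools import combinations

    def brute_force_sgc(requests, k, t):
        """Search for k agents sharing at least t common objects, counting
        the candidate groups explored along the way."""
        explored = 0
        for group in combinations(sorted(requests), k):
            explored += 1
            common = set.intersection(*(requests[a] for a in group))
            if len(common) >= t:
                return group, common, explored
        return None, set(), explored

    requests = {"a": {1, 2, 3}, "b": {2, 3}, "c": {3, 4}, "d": {9}}
    print(brute_force_sgc(requests, k=3, t=1))
    # -> (('a', 'b', 'c'), {3}, 1); hard instances explore thousands more.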

Our third contribution is an anomaly detector which considers the DNS social structure as a significant factor towards the detection of DDoS attacks. Our anomaly detector considers the estimated social structure of a window. This estimation has been computed using an image segmentation method. In general, our results have shown that it suffices to consider the social structure to detect an attack. Indeed, we are able to raise an alarm after observing three or more consecutive abnormal windows, with a 93.9% correct detection rate. Moreover, by analysing collections of windows, we have prioritized alarms, and thus a raised alarm is more likely to be related to a DDoS attack. Any Intrusion Detection System on a DNS server would improve its detection rate if it considered our results. Also, our results showed that it is possible to detect bot activity considering the DNS social structure. This is because a bot usually coordinates attacks using URLs. Hence, further research should focus on investigating the detection of bots considering the DNS social structure.

Our detector leaves room for improvement. We describe some insights about possible improvements to our classifier:

1) The classifier could be used to determine the source of an attack by analysing the social structures in an abnormal window. For example, we can analyse the number of requests from all the distinct IPs; if an IP exceeds some boundary, e.g., the average number of requests, then we can block its requests.

2) Continuing to study the Gram matrix and the phase transition of SGC does not get in the way of detecting anomalies in a DNS server. Indeed, new findings about SGC could be readily applied to a classifier aiming to detect changes in the DNS social structure. Certainly, a more accurate DNS social structure should be useful to increase the anomaly detection rate.

3) We are aware that the synthetic anomaly test has to be improved. This is because we still need to show that the synthetic anomaly is similar to a real attack. Hence, further research could focus on enhancing the synthetic test following an approach similar to [14].

4) Our social structure approach is by no means the ultimate solution for the detection and mitigation of DDoS attacks. Certainly, our approach does not capture the DNS request sequences, which may be useful to drop only abnormal DNS packets; instead, our approach may end up dropping the entire window. Future work could be focused on developing a more robust Intrusion Detection System following this idea: after an alarm is raised by our social structure approach, another prediction technique, e.g. a classifier-based one, could analyse the DNS packet sequences in an attempt to drop abnormal DNS requests.

5) Our classifier ignores the frequency of the DNS requests. Indeed, the rationale behind the construction of the Gram matrix is that a single request suffices to relate an IP to a URL, from which the so-called groups can be formed. However, in another scenario a single IP could ask for the same URL dozens if not hundreds of times, and that frequency is lost with our current approach. As another indication for future work, it could be possible to relate the frequency of the DNS requests to the Gram matrix so as to retain this information; considering the DNS request frequency should improve the detection rate of DDoS attacks (see the sketch below).
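As a sketch of the frequency-retaining variant suggested in item 5 (one possible design we are assuming, not an implemented result), letting A hold request counts instead of 0/1 indicators makes G = A·Aᵀ weight every shared object by how often both agents requested it:

    import numpy as np

    # Count matrix: A[i, j] = number of times IP i requested URL j.
    A = np.array([[120, 1, 0],
                  [115, 0, 0],
                  [  1, 3, 2]])

    # Weighted Gram matrix: a pair of IPs hammering the same URL now stands
    # out, instead of collapsing to a single shared-object count.
    G = A @ A.T
    print(G)
    # [[14401 13800   123]
    #  [13800 13225   115]
    #  [  123   115    14]]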

We hypothesize that SGC can significantly contribute to any system that formulates a problem in terms of a social structure. Particularly, it may help in a recommender system. The aim of a recommender system is to suggest items (e.g., places, books, movies) to agents, based on recommendations collected from a list of friends or on common interests with other agents. Then, the outcomes of the complexity study and the image segmentation method can be readily applied to improve a recommender system.

As another example, we would like to know all the connections (e.g., friends, colleagues, etc.) that a collection of users of a given social network (e.g., Facebook, LinkedIn, etc.) has in common. This could be used to suggest friends with a much finer criterion.

Also, forming customer classes can be formulated as SGC. For example, a sales or marketing staff would very much like to have all their customers grouped together in terms of the commonality of the goods or services they have recently requested. These groups could be used, for example, to elaborate sales offers, or to issue a marketing campaign based on a finer customer profile. Other example applications include computing the file system objects that are used simultaneously by a collection of system users, the books or music commonly bought by a set of individuals, and many more.


References

[1] Bernhard Ager, Wolfgang Mühlbauer, Georgios Smaragdakis, and Steve Uhlig. Comparing DNS resolvers in the wild. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, IMC '10, pages 15–21, New York, NY, USA, 2010. ACM.

[2] Roberto Alonso and Raul Monroy. On the NP-completeness of computing the commonality amongst the objects upon which a collection of agents has performed an action. Computación y Sistemas, 17(4):489–500, 2013.

[3] Anonymous. The collateral damage of internet censorship by DNS injection. SIGCOMM Comput. Commun. Rev., 42(3):21–27, June 2012.

[4] Hari Balakrishnan, Karthik Lakshminarayanan, Sylvia Ratnasamy, Scott Shenker, Ion Stoica, and Michael Walfish. A layered naming architecture for the internet. In SIGCOMM '04: Proceedings of the 2004 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pages 343–352, New York, NY, USA, 2004. ACM.

[5] Hitesh Ballani and Paul Francis. Mitigating DNS DoS attacks. In CCS '08: Proceedings of the 15th ACM Conference on Computer and Communications Security, pages 189–198, New York, NY, USA, 2008. ACM.

[6] S. M. Bellovin. A look back at security problems in the TCP/IP protocol suite. 1989.

[7] L. Bergroth, H. Hakonen, and T. Raita. A survey of longest common subsequence algorithms. In Proceedings of the Seventh International Symposium on String Processing and Information Retrieval, SPIRE '00, pages 39–, Washington, DC, USA, 2000. IEEE Computer Society.

[8] Leyla Bilge, Sevil Sen, Davide Balzarotti, Engin Kirda, and Christopher Kruegel. EXPOSURE: A passive DNS analysis service to detect and report malicious domains. ACM Trans. Inf. Syst. Secur., 16(4):14:1–14:28, April 2014.

[9] G. M. Borkar, M. A. Pund, and P. Jawade. Implementation of round robin policy in DNS for thresholding of distributed web server system. In Proceedings of the International Conference & Workshop on Emerging Trends in Technology, ICWET '11, pages 198–201, New York, NY, USA, 2011. ACM.

[10] Nevil Brownlee, kc Claffy, and Evi Nemeth. DNS measurements at a root server. In DNS Root/gTLD Performance Measurement, USENIX LISA Conference, 2001.

[11] T. Callahan, M. Allman, and M. Rabinovich. On modern DNS behavior and properties. ACM SIGCOMM Computer Communication Review, 43(3):8–15, 2013.

[12] Sebastian Castro, Duane Wessels, Marina Fomenkov, and Kimberly Claffy. A day at the root of the internet. SIGCOMM Comput. Commun. Rev., 38(5):41–46, 2008.


[13] Peter Cheeseman, Bob Kanefsky, and William M. Taylor. Where the really hard problems are. In John Mylopoulos and Raymond Reiter, editors, Proceedings of the 12th International Joint Conference on Artificial Intelligence, IJCAI, pages 331–337. Morgan Kaufmann, 1991.

[14] Ramkumar Chinchani, A. Muthukrishnan, M. Chandrasekaran, and S. Upadhyaya. RACOON: rapidly generating user command data for anomaly detection from customizable template. In Computer Security Applications Conference, 2004. 20th Annual, pages 189–202, Dec 2004.

[15] Ruetee Chitpranee and Kensuke Fukuda. Towards passive DNS software fingerprinting. In Proceedings of the 9th Asian Internet Engineering Conference, AINTEC '13, pages 9–16, New York, NY, USA, 2013. ACM.

[16] C. Cranor, E. Gansner, B. Krishnamurthy, and O. Spatscheck. Characterizing large DNS traces using graphs, 2001.

[17] Jesse Davis and Mark Goadrich. The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 233–240, New York, NY, USA, 2006. ACM.

[18] Josep Díaz, Olli Pottonen, Maria Serna, and Erik Jan van Leeuwen. On the complexity of metric dimension. In Proceedings of the 20th Annual European Conference on Algorithms, ESA '12, pages 419–430, Berlin, Heidelberg, 2012. Springer-Verlag.

[19] Martin Dyer and Catherine Greenhill. The complexity of counting graph homomorphisms. In Proceedings of the Ninth International Conference on Random Structures and Algorithms, pages 260–289, New York, NY, USA, 2000. John Wiley & Sons, Inc.

[20] Jeremy Frank, Ian P. Gent, and Toby Walsh. Asymptotic and finite size parameters for phase transitions: Hamiltonian circuit as a case study. Information Processing Letters, 65:241–245, March 1998.

[21] Hongyu Gao, Vinod Yegneswaran, Yan Chen, Phillip Porras, Shalini Ghosh, Jian Jiang, and Haixin Duan. An empirical reexamination of global DNS behavior. SIGCOMM Comput. Commun. Rev., 43(4):267–278, August 2013.

[22] K. A. Garcia, R. Monroy, L. A. Trejo, C. Mex-Perera, and E. Aguirre. Analyzing log files for postmortem intrusion detection. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 42(6):1690–1704, Nov 2012.

[23] Michael R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1990.

[24] Ian P. Gent, Ewan MacIntyre, Patrick Prosser, and Toby Walsh. Scaling effects in the CSP phase transition. In Ugo Montanari and Francesca Rossi, editors, Principles and Practice of Constraint Programming, CP '95, pages 70–87. Springer, 1995.


[25] Ian P. Gent and Toby Walsh. The hardest random SAT problems. In Bernhard Nebel and Leonie Dreschler-Fischer, editors, Proceedings of the 18th German Annual Conference on Artificial Intelligence, KI-94, pages 355–366. Springer, 1994.

[26] Ian P. Gent and Toby Walsh. The SAT phase transition. In Anthony G. Cohn, editor, Proceedings of the Eleventh European Conference on Artificial Intelligence, ECAI '94, pages 105–109. John Wiley & Sons, 1994.

[27] Ian P. Gent and Toby Walsh. Phase transitions and annealed theories: Number partitioning as a case study. In Wolfgang Wahlster, editor, Proceedings of the Twelfth European Conference on Artificial Intelligence, ECAI '96, pages 170–174. John Wiley & Sons, 1996.

[28] Ian P. Gent and Toby Walsh. The TSP phase transition. Artificial Intelligence, 88(1–2):349–358, 1996.

[29] Tad Hogg, Bernardo A. Huberman, and Colin P. Williams. Phase transitions and the search problem. Artificial Intelligence, 81(1–2):1–15, 1996. Frontiers in Problem Solving: Phase Transitions and Complexity.

[30] Herman J. ter Horst. Combining RDF and part of OWL with rules: Semantics, decidability, complexity. In Yolanda Gil, Enrico Motta, V. Richard Benjamins, and Mark A. Musen, editors, The Semantic Web – ISWC 2005, volume 3729 of Lecture Notes in Computer Science, pages 668–684. Springer Berlin Heidelberg, 2005.

[31] Yonggang Huang, Jun Zhang, Yongwang Zhao, and Dianfu Ma. Medical image retrieval with query-dependent feature fusion based on one-class SVM. In Computational Science and Engineering (CSE), 2010 IEEE 13th International Conference on, pages 176–183, Dec 2010.

[32] ICANN. Factsheet: Root server attack on 6 February 2007. 2007.

[33] Marios Iliofotou, Michalis Faloutsos, and Michael Mitzenmacher. Exploiting dynamicity in graph-based traffic analysis: techniques and applications. In CoNEXT '09: Proceedings of the 5th International Conference on Emerging Networking Experiments and Technologies, pages 241–252, New York, NY, USA, 2009. ACM.

[34] Keisuke Ishibashi, Tsuyoshi Toyono, Hirotaka Matsuoka, Katsuyasu Toyama, Masahiro Ishino, Chika Yoshimura, Takehiro Ozaki, Yuichi Sakamoto, and Ichiro Mizukoshi. Measurement of DNS traffic caused by DDoS attacks. In SAINT-W '05: Proceedings of the 2005 Symposium on Applications and the Internet Workshops, pages 118–121, Washington, DC, USA, 2005. IEEE Computer Society.

[35] Andrew J. Kalafut, Minaxi Gupta, Christopher A. Cole, Lei Chen, and Nathan E. Myers. An empirical study of orphan DNS servers in the internet. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, IMC '10, pages 308–314, New York, NY, USA, 2010. ACM.


[36] R. Karp. Reducibility among combinatorial problems. In R. Miller and J. Thatcher, editors, Complexity of Computer Computations, pages 85–103. Plenum Press, 1972.

[37] Yuta Kazato, Kensuke Fukuda, and Toshiharu Sugawara. Towards classification of DNS erroneous queries. In Proceedings of the 9th Asian Internet Engineering Conference, AINTEC '13, pages 25–32, New York, NY, USA, 2013. ACM.

[38] DongSeong Kim and JongSou Park. Network-based intrusion detection with support vector machines. In Hyun-Kook Kahng, editor, Information Networking, volume 2662 of Lecture Notes in Computer Science, pages 747–756. Springer Berlin Heidelberg, 2003.

[39] B. Kirkpatrick, S. Lacoste-Julien, and W. Xu. Analyzing root DNS traffic, 2004.

[40] Ziqian Liu, Bradley Huffaker, Marina Fomenkov, Nevil Brownlee, and Kimberly C. Claffy. Two days in the life of the DNS anycast root servers. In Steve Uhlig, Konstantina Papagiannaki, and Olivier Bonaventure, editors, PAM, volume 4427 of Lecture Notes in Computer Science, pages 125–134. Springer, 2007.

[41] D. A. Luduena et al. Statistical study of unusual DNS query traffic. In ISCIT '07: Proceedings of the International Symposium on Communications and Information Technologies, 2007.

[42] David Maier. The complexity of some problems on subsequences and supersequences. J. ACM, 25(2):322–336, April 1978.

[43] Dorothy L. Mammen and Tad Hogg. A new look at the easy-hard-easy pattern of combinatorial search difficulty. Journal of Artificial Intelligence Research, 7:47–66, 1997.

[44] D. Massey, E. Lewis, O. Gudmundsson, R. Mundy, and A. Mankin. Public key validation for the DNS security extensions. In DISCEX '01: DARPA Information Survivability Conference and Exposition II, Vol. I, Proceedings, pages 227–238. IEEE Computer Society, 2001.

[45] Rada Mihalcea and Andras Csomai. Wikify!: Linking documents to encyclopedic knowledge. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, CIKM '07, pages 233–242, New York, NY, USA, 2007. ACM.

[46] David Milne and Ian H. Witten. Learning to link with Wikipedia. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM '08, pages 509–518, New York, NY, USA, 2008. ACM.

[47] Rafael Murrieta-Cid and Raúl Monroy. A hybrid segmentation method applied to color images and 3D information. In Alexander Gelbukh and Carlos Alberto Reyes-García, editors, MICAI 2006: Advances in Artificial Intelligence, volume 4293 of Lecture Notes in Computer Science, pages 789–799. Springer Berlin Heidelberg, 2006.

[48] Vasileios Pappas, Dan Massey, and Lixia Zhang. Enhancing DNS resilience against denial of service attacks. In DSN '07: Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, pages 450–459, Washington, DC, USA, 2007. IEEE Computer Society.

[49] Vangelis T. Paschos. A survey of approximately optimal solutions to some covering and packing problems. ACM Comput. Surv., 29(2):171–209, June 1997.

[50] David Plonka and Paul Barford. Context-aware clustering of DNS query traffic. In IMC '08: Proceedings of the 8th ACM SIGCOMM Conference on Internet Measurement, pages 217–230, New York, NY, USA, 2008. ACM.

[51] David M. W. Powers. Evaluation: From precision, recall and F-factor to ROC, informedness, markedness & correlation. Technical Report SIE-07-001, School of Informatics and Engineering, Flinders University, 2007.

[52] John R. Quinlan. Improved use of continuous attributes in C4.5. Journal of Artificial Intelligence Research, 4(1):77–90, March 1996.

[53] Nelson Rangel-Valdez and Jose Torres-Jimenez. Phase transition in the bandwidth minimization problem. In Arturo Hernández-Aguirre, Raúl Monroy-Borja, and Carlos A. Reyes-García, editors, Proceedings of the 8th Mexican International Conference on Artificial Intelligence, MICAI '09, pages 372–383. Springer, 2009.

[54] Pin Ren, John Kristoff, and Bruce Gooch. Visualizing DNS traffic. In VizSEC '06: Proceedings of the 3rd International Workshop on Visualization for Computer Security, pages 23–30, New York, NY, USA, 2006. ACM.

[55] Bernhard Schölkopf, Alex J. Smola, Robert C. Williamson, and Peter L. Bartlett. New support vector algorithms. Neural Computation, 12(5):1207–1245, 2000.

[56] Michael Schmitt and Laura Martignon. On the complexity of learning lexicographic strategies. J. Mach. Learn. Res., 7:55–83, December 2006.

[57] Kyle Schomp, Tom Callahan, Michael Rabinovich, and Mark Allman. On measuring the client-side DNS infrastructure. In Proceedings of the 2013 Conference on Internet Measurement Conference, IMC '13, pages 77–90, New York, NY, USA, 2013. ACM.

[58] C. A. Shue and A. J. Kalafut. Resolvers revealed: Characterizing DNS resolvers and their clients. ACM Transactions on Internet Technology, 12(4):17, 2013.

[59] Alex J. Smola, Le Song, and Choon Hui Teo. Relative novelty detection. In Twelfth International Conference on Artificial Intelligence and Statistics, volume 5 of JMLR Workshop and Conference Proceedings, pages 536–543, 2009.

[60] Jeffrey Pang, James Hendricks, Roberto De Prisco, and Bruce Maggs. Availability, usage, and deployment characteristics of the domain name system. In IMC '04, 2004.


[61] Luis A. Trejo, Raul Monroy, Roberto Alonso, Adrian Avila, Mario Maqueo, Jorge Vazquez, and Erika Sanchez. Using cloud computing MapReduce operations to detect DDoS attacks on DNS servers. In Proceedings of the 4th Iberian Grid Infrastructure Conference 2010, IBERGRID '10. Netbiblo, 2010.

[62] US-CERT. Vulnerability Note VU#800113 (D. Kaminsky), 2008.

[63] H. Van Dyke Parunak, Alex Nickels, and Richard Frederiksen. An agent-based framework for dynamical understanding of DNS events (DUDE). In Proceedings of the 1st International Workshop on Agents and CyberSecurity, ACySE '14, pages 6:1–6:8, New York, NY, USA, 2014. ACM.

[64] P. Vixie, G. Sneeringer, and M. Schleifer. Events report of October 21, 2002. 2002.

[65] Yao Wang, Ming-zeng Hu, Bin Li, and Bo-ru Yan. Tracking anomalous behaviors of name servers by mining DNS traffic. In Proceedings of the 2006 International Conference on Frontiers of High Performance Computing and Networking, ISPA '06, pages 351–357, Berlin, Heidelberg, 2006. Springer-Verlag.

[66] Z. Wang. Analysis of flooding DoS attacks utilizing DNS name error queries. KSII Transactions on Internet and Information Systems, 6(10):2750–2763, 2012.

[67] Z. Wang and S. S. Tseng. Impact evaluation of DDoS attacks on DNS cache server using queuing model. KSII Transactions on Internet and Information Systems, 7(4):895–909, 2013.

[68] Ian H. Witten. Semantic document processing using Wikipedia as a knowledge base. In Proceedings of Focused Retrieval and Evaluation: 8th International Conference on Initiative for the Evaluation of XML Retrieval, INEX '09, pages 3–3, Berlin, Heidelberg, 2010. Springer-Verlag.

[69] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, second edition, 2005.

[70] Dahai Xu, Yang Chen, Yizhi Xiong, Chunming Qiao, and Xin He. On the complexity of and algorithms for finding the shortest path with a disjoint counterpart. IEEE/ACM Trans. Netw., 14(1):147–158, February 2006.

[71] Yingdi Yu, Duane Wessels, Matt Larson, and Lixia Zhang. Authority server selection in DNS caching resolvers. SIGCOMM Comput. Commun. Rev., 42(2):80–86, March 2012.

[72] Xuebiao Yuchi, Xiaodong Lee, Jian Jin, and Baoping Yan. Modeling DNS activities based on probabilistic latent semantic analysis. In Proceedings of the 6th International Conference on Advanced Data Mining and Applications – Volume Part II, ADMA '10, pages 290–301, Berlin, Heidelberg, 2010. Springer-Verlag.


[73] Xuebiao Yuchi, Xin Wang, Xiaodong Lee, and Baoping Yan. A new statistical approach to DNS traffic anomaly detection. In Proceedings of the 6th International Conference on Advanced Data Mining and Applications – Volume Part II, ADMA '10, pages 302–313, Berlin, Heidelberg, 2010. Springer-Verlag.