
The Holy Grail of “Systems for Machine Learning”
Teaming humans and machine learning for detecting cyber threats

Ignacio Arnaldo
PatternEx

San Jose, CA, USA
[email protected]

Kalyan Veeramachaneni
MIT LIDS

Cambridge, MA, 02139, USA

[email protected]

ABSTRACT
Although there is a large corpus of research focused on using machine learning to detect cyber threats, the solutions presented are rarely actually adopted in the real world. In this paper, we discuss the challenges that currently limit the adoption of machine learning in security operations, with a special focus on label acquisition, model deployment, and the integration of model findings into existing investigation workflows. Moreover, we posit that the conventional approach to the development of machine learning models, whereby researchers work offline on representative datasets to develop accurate models, is not valid for many cybersecurity use cases. Instead, a different approach is needed: to integrate the creation and maintenance of machine learning models into security operations themselves.

1. INTRODUCTION
Over the past decade, various disciplines, from aircraft monitoring and healthcare to IoT engineering and cybersecurity, have taken up machine learning (ML) systems that monitor large amounts of data in order to draw conclusions and/or suggest interventions. By adopting such systems, stakeholders aim to leapfrog traditional post-hoc investigations by humans and achieve more proactive and predictive analysis and prevention. In particular, recent years have yielded a large corpus of research focused on using machine learning to tackle a wide range of cybersecurity use cases [13]. In an ideal scenario, machine learning would enable us to quickly detect precursors to adverse events, enabling institutions to either prevent or reduce the dwell time of cyberattacks.

A few years ago, our team at PatternEx set out to develop and deliver a cybersecurity machine learning platform. This paper summarizes that experience and establishes the key challenges faced along the way. The most striking observation is that a common modus operandi in both research and commercial ventures involves identifying a “labeled dataset”, generating a complex machine learning model, and declaring victory when high accuracy is achieved. However, these approaches generally suffer from a lack of realism and representativeness [3] and are rarely adopted in the real world. Moreover, access to representative data is but one of the roadblocks in cybersecurity. In fact, given the breadth of the challenges, we predict that making machine learning work in this domain will be the holy grail of machine learning systems development:

Challenge #1: Where art thou “labels”? A fundamental assumption undergirding current efforts in this area is that “labeled examples” of previous attacks are readily available or could be easily acquired via an interactive system. This assumption is misguided, or academic at best. The acquisition of “labeled datasets” remains a key barrier to ML adoption, for several reasons: certain datasets pertaining to previous breaches cannot be made public due to political or privacy-related concerns, and the shifting nature of attacks makes it challenging to maintain up-to-date labeled datasets. We present different label acquisition strategies and their benefits, drawbacks, and challenges in Section 3.

Challenge #2: A giant leap from “AutoML” to scalable model deployments. The scale of model development required for ML-based cybersecurity solutions to be effective is unprecedented, and must be coupled with real-world operational settings and requirements. In a nutshell, supporting the ML-based detection of a wide range of threats introduces the need to simultaneously analyze a diversity of data sources and to learn and deploy a large number of disparate machine learning models (each possibly learned from minimal training examples catalogued on a daily basis). Meanwhile, real-time requirements demand a complex data processing infrastructure. The systems currently available for automating machine learning are dwarfed by the scale and specifics of these requirements, which we discuss further in Section 4.

Challenge #3: Real-time explanations. In real settings, complex, multi-stage attacks that are worth detecting usually require analyst investigations. This means that any findings made by ML models must be integrated into existing analyst workflows. In fact, the use of ML models for detection introduces challenges related to the interpretability and explainability of machine learning models. We discuss this further in Section 5.

Challenge #4: It’d take a village. The changing nature of the cyberthreat landscape introduces the need to create and maintain ML models on a continuous basis. This requirement involves new tasks and workflows for the cybersecurity operations workforce, which will need to be augmented with a new set of roles. We expand upon this observation in Section 6.

While the use of machine learning for cybersecurity has been piecemeal, the development of machine learning systems has followed a similar pattern. It seems the community is chipping away at one problem at a time: AutoML systems try to find best-fit models [10], a separate set of systems helps humans label datasets [25], another theme of work focuses on model development and deployment [22], and yet another set of systems focuses on explaining models. Perhaps the most compelling lesson of all for us has been that building a machine learning-driven cybersecurity platform requires all of the above, which raises the question: would this be the holy grail of machine learning systems development? We argue that it is.

2. RELATED WORK
There is a large corpus of research focused on tackling cybersecurity use cases with machine learning. In particular, there are three cybersecurity problems for which the availability of labeled datasets has resulted in a proliferation of machine learning applications: file analysis for malware detection [13], botnet detection [29], and malicious domain or URL detection [6, 20].

However, the cyberattack detection problem spans a wide range of malicious activities. As of this writing, MITRE’s ATT&CK initiative [30] lists 282 techniques used by adversaries at different phases of an attack. There are isolated works demonstrating the feasibility and benefits of adopting machine learning for the detection of some of these techniques. To illustrate this point, we provide examples of works that tackle different stages of an attack: browser exploits [14], web application attacks [16], TOR connections [18] and VPN traffic [8], lateral movement [31] and credential access [12], HTTP [7] and DNS tunneling [9, 23], encrypted malware flows [1], web-based command and control [33], and data exfiltration [2, 19]. While many of these approaches are promising and convey meaningful insights into designing detection strategies, they do not discuss how their methods fit into the type of broader detection strategy needed in real-world security operations. In fact, the modeling approaches reported in these works must be coupled with appropriate support in order to function in real life. In this paper, we discuss these requirements and their associated challenges in depth, with a special focus on model deployment and maintenance and the integration of model findings into existing operational workflows.

3. LABEL ACQUISITION SCENARIOS IN CYBERSECURITY

In this section, we take a deep dive into cybersecurity’s label acquisition challenge. To drive the discussion, we first classify common strategies for label acquisition in Figure 1. The figure shows a series of paths leading to the acquisition of labels (numbered from 1 to 6). In the following subsections, we describe the steps involved in each of these paths, along with their challenges, benefits, and drawbacks.

3.1 Real data versus lab
The first distinction involves whether the data originates in a real-world system or organization, or instead is the outcome of lab simulations. Lab simulations present several benefits: they are able to simulate a wide range of attacks, and result in the creation of high-quality (low-noise) data at a fast pace. This label acquisition strategy corresponds to path (1) in Figure 1. On the other hand, depending on the simulation strategy, it is possible that data generated in this way will suffer from a lack of generality, realism, and representativeness [3]. In the authors’ experience, this is particularly the case when the goal is to generate benign data, shown as path (2) in the figure, because it is challenging to replicate the great variety of legitimate activity and traffic patterns observed in real-world organizations. Therefore, relying on simulated data alone might prove insufficient to build detection models, which motivates label acquisition from real-world data.

3.2 Continuous monitoring
This refers to cases where the data has been monitored, analyzed, and investigated in a timely manner (in close-to-real time), as opposed to cases where the data has been analyzed and investigated a posteriori. Continuous monitoring occurs in most medium and large organizations, where a series of security solutions monitor the traffic and activity at the organization. Upon the detection of malicious activity, they either prevent further activities, or generate alerts that are then reviewed, investigated, and (if necessary) remediated by a security operations team.

3.3 Analyst confirmation
Alerts generated by continuous monitoring mechanisms are reviewed and investigated by security analysts. In most security operations, ticketing systems are used to monitor, track, and manage alert resolutions, and security analysts log the investigation outcomes upon closing the tickets that they are assigned. These logged outcomes correspond to labels that can be leveraged by AI security analytics (paths 3 and 4 in Figure 1). It is important to stress that labels are being captured in security operations today.
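As a minimal sketch of this labeling path, the snippet below maps closed-ticket resolutions to training labels. The field names and resolution values are illustrative assumptions, not a real ticketing-system schema.

```python
# Hypothetical closed-ticket records; field names and resolution values
# are invented for illustration, not taken from a real ticketing system.
tickets = [
    {"alert_id": 22, "entity": "10.0.0.1", "resolution": "confirmed_malicious"},
    {"alert_id": 49, "entity": "jane", "resolution": "false_positive"},
    {"alert_id": 57, "entity": "10.0.0.9", "resolution": "inconclusive"},
]

def to_label(resolution):
    """Map an analyst's ticket resolution to a training label (paths 3 and 4)."""
    if resolution == "confirmed_malicious":
        return 1   # positive label
    if resolution == "false_positive":
        return 0   # benign label
    return None    # not determined: excluded from the training data

# Only tickets with a determined outcome become labeled examples.
labeled = [(t["entity"], to_label(t["resolution"]))
           for t in tickets if to_label(t["resolution"]) is not None]
```

Tickets closed without a determination are simply dropped, mirroring the “Not determined” outcomes in Figure 1.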

To put AI security analytics opportunities in context, here is a high-level view of the different mechanisms used to generate alerts in today’s security operations:

3.3.1 Detection based on known malicious entities
This detection is performed based on indicators of compromise, such as signatures, file hashes, and IP and URL or domain blacklists (shown as “Repeat entity” in the figure).

3.3.2 Detection based on known malicious activity patterns
Instead of looking for known malicious entities, this detection approach leverages existing knowledge of malicious activity patterns for detection (shown as “Repeat pattern” in the figure). It is implemented predominantly as rules and (in isolated cases) as supervised machine learning models; each method presents its advantages and drawbacks. In a nutshell, rules are easy to interpret (more on this in Section 5) and do not require training data, whereas supervised models provide improved detection accuracy (if training data is available) at the cost of interpretability.
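To make the rules-versus-models tradeoff concrete, here is a hypothetical activity-pattern rule for repeated failed logins followed by a success. The threshold and time window are illustrative assumptions, not values from any real product.

```python
from datetime import datetime, timedelta

# Illustrative login events for one user; timestamps and the rule's
# threshold/window are assumptions made for this sketch.
events = [
    ("2018-08-14 03:27:19", "failed"),
    ("2018-08-14 03:27:49", "failed"),
    ("2018-08-14 03:34:42", "success"),
]

def brute_force_rule(events, max_failures=2, window=timedelta(minutes=10)):
    """Fire when at least `max_failures` failed logins precede a success within `window`."""
    parsed = [(datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), action)
              for ts, action in events]
    for i, (ts, action) in enumerate(parsed):
        if action != "success":
            continue
        recent_failures = [t for t, a in parsed[:i]
                           if a == "failed" and ts - t <= window]
        if len(recent_failures) >= max_failures:
            return True
    return False
```

The rule is trivially readable by an analyst and needs no training data; a supervised model covering the same behavior would need labeled examples but could generalize beyond the hand-picked threshold.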

3.3.3 Detection based on anomaly detection
This last approach uses anomaly detection to surface rare activity patterns. These methods rely on baselining strategies, whereby historic activity patterns are modeled and new activities are assigned a score that captures their deviation from the norm. The analysis can be performed on a population basis (by obtaining baselines for a group of entities), or on an individual basis (by obtaining an activity baseline for each entity). This approach can detect attacks where neither the entities nor the patterns are known a priori, given that the associated activity patterns are expected to be different from previously observed activity. On the other hand, in practice, these methods also generate a large number of false alerts, given that not all anomalies actually correspond to malicious activities. For instance, activity patterns may present anomalies due to misconfigurations, seasonality, or legitimate changes in the activities of the monitored entities, among other reasons.

Figure 1: Label acquisition paths in cybersecurity. The acquisition of labeled data can be the result of lab simulations (1 and 2), analyst investigations (3 and 4), or a posteriori threat hunting exercises (5 and 6).
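The baselining strategy described in Section 3.3.3 can be sketched as a simple per-entity deviation score. This is a toy illustration using standard deviations from a historical baseline, not the scoring method of any particular product.

```python
import statistics

def anomaly_score(history, value):
    """Deviation of `value` from an entity's historical baseline,
    measured in standard deviations (a simple z-score)."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(value - mean) / stdev

# Hypothetical daily connection counts for one monitored entity.
baseline = [100, 110, 95, 105, 90, 100]

spike = anomaly_score(baseline, 400)   # rare activity scores far from the norm
typical = anomaly_score(baseline, 100) # usual activity scores near zero
```

Note that a high score only marks the activity as rare; as the text observes, a seasonal spike or a misconfiguration would score just as high as an attack, which is why these alerts still require analyst triage.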

3.4 A posteriori analysis
As shown in Figure 1, there are two paths leading to a posteriori analysis, often referred to as threat hunting in the cybersecurity domain. The first path corresponds to cases where continuous monitoring is in place, i.e. activities that were monitored and analyzed, but were not alerted upon. The second case corresponds to scenarios where continuous monitoring is not taking place. The latter is currently a frequent situation: for instance, organizations today rarely perform security analytics on DNS traffic or cloud-based applications.

A posteriori investigations allow security researchers to perform in-depth investigations to understand whether unclassified activities correspond to attacks (path 5 in Figure 1) or to benign activities (path 6), with the benefit of not having to comply with deadlines imposed by service level agreements. The drawbacks of this approach are twofold. The first issue has to do with the timing of the investigations: it is hard to perform fully-fledged investigations after the fact because the state of the resources involved in the attack might have changed, and actors and attack infrastructure might not be active anymore. This often leads to a roadblock in these investigations, making it difficult if not impossible to assess the reach of the analyzed activities, and to determine whether they were malicious. The second limiting factor is operational: given that these one-off investigations require ad-hoc workflows, researchers might require additional tools and analysis infrastructure that are not readily available at their organizations, especially in cases where there is a lot of data to analyze.

4. THE MANIFOLD DEPLOYMENT LEAP
We introduce the challenges involved in building the processing infrastructure necessary to adopt a machine learning-based attack detection strategy for the cybersecurity domain, given the wide range of possible threats and data sources involved. Such infrastructure would need to ingest multiple data sources, analyze the activity of multiple entities, and leverage machine learning models to generate alerts.

4.1 A primer on machine learning pipelines
Machine learning pipelines involve a sequence of data transformations that have been devised by experts to score entities. Generally, we begin by extracting features from timestamped log data, and next obtain entity scores using these extracted features and machine learning models. The steps involved in ML pipelines are schematically represented in Figure 2 and detailed in the following.

Step 1: Ingestion of raw logs. Logs are files that register a time sequence of events associated with entities in a monitored system. Logs are generated in a variety of formats: JSON, comma-separated values, key-value pairs, and event logs (where event types are assigned unique identifiers and described with a varying number of fields). The first step entails parsing these logs to identify entities (e.g., IP addresses, users, domains, etc.), relationships between entities (one-to-many or many-to-many), timestamps, data types, and other relevant information about the relational structure of the data. The data is then stored either in a relational data model (database) or as-is. Either way, in this step, a relational model is designed based on prior knowledge of how the data is collected and what it means.
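As a hypothetical illustration of Step 1, the parser sketch below handles two of the formats mentioned (key-value pairs and JSON) and normalizes the timestamp. The sample lines and field names are invented for the example.

```python
import json
from datetime import datetime

def parse_kv(line):
    """Parse a key=value log line into a record with a typed timestamp."""
    record = dict(pair.split("=", 1) for pair in line.split())
    record["ts"] = datetime.strptime(record["ts"], "%Y-%m-%dT%H:%M:%S")
    return record

def parse_json(line):
    """Parse a JSON log line into the same normalized record shape."""
    record = json.loads(line)
    record["ts"] = datetime.strptime(record["ts"], "%Y-%m-%dT%H:%M:%S")
    return record

# Two sources emitting the same event in different formats converge on
# one record shape, which is what makes the later relational model possible.
kv = parse_kv("ts=2017-08-23T10:00:00 src_ip=10.0.0.1 bytes=100")
js = parse_json('{"ts": "2017-08-23T10:00:00", "src_ip": "10.0.0.1", "bytes": 100}')
```

In practice, as Section 4.2 discusses, one such parser is needed per data source and product version, which is where much of the ingestion cost lies.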

Step 2: Feature extraction. Once the data has been parsed into a relational structure, the next step consists of extracting, for each entity instance, a series of features that describe its activity. For example, for any given IP address, we can average the number of bytes sent and received per connection over a time interval of an hour. The output of this phase is generally an entity-feature matrix, in which each row corresponds to an instance of an entity and each column is a feature.

Figure 2: A sequence of data transformations performed by machine learning pipelines for cyberthreat detection. From left to right: a wide range of data sources are ingested and features are extracted to describe the activity of the entities observed in the data (source IPs, destination IPs, domains, etc.). Finally, a set of machine learning models score the activity of each entity instance and populate an alert dashboard.
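The entity-feature matrix of Step 2 can be sketched with toy connection records. The connection-count and average-bytes features follow the example in the text, while the record field names are assumptions.

```python
from collections import defaultdict

# Toy connection records for one time window; field names are illustrative.
connections = [
    {"src_ip": "10.0.0.1", "bytes_sent": 100, "bytes_received": 10},
    {"src_ip": "10.0.0.1", "bytes_sent": 300, "bytes_received": 30},
    {"src_ip": "10.0.0.2", "bytes_sent": 50,  "bytes_received": 5},
]

def entity_feature_matrix(connections):
    """Build one row per entity (source IP) with one column per feature."""
    grouped = defaultdict(list)
    for c in connections:
        grouped[c["src_ip"]].append(c)
    matrix = {}
    for ip, conns in grouped.items():
        matrix[ip] = {
            "num_conns": len(conns),
            "avg_bytes_sent": sum(c["bytes_sent"] for c in conns) / len(conns),
            "avg_bytes_received": sum(c["bytes_received"] for c in conns) / len(conns),
        }
    return matrix

matrix = entity_feature_matrix(connections)
```

Each row of `matrix` is what the models in Step 3 consume; in a real deployment the grouping would also be keyed by the time window.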

Step 3: Entity scoring. Each entity receives scores generated by machine learning models. The scores represent how rare the activity of the entity is, in the case of anomaly detection models, or how closely it resembles a malicious activity pattern, in the case of supervised models.

Step 4: Alert generation. In the final layer, the scores generated by machine learning models are aggregated into alerts in a process known as alert correlation [21]. These alerts populate the alert dashboard and are reviewed by security analysts.
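A minimal sketch of this aggregation step, assuming a simple entity-based grouping and an invented score threshold (real alert correlation logic, per [21], would also consider time and cross-entity relationships):

```python
from collections import defaultdict

# Hypothetical model outputs: (entity, model name, score in [0, 1]).
scores = [
    ("jane", "login_anomaly", 0.91),
    ("jane", "account_takeover_classifier", 0.88),
    ("10.0.0.1", "dns_tunnel_classifier", 0.12),
]

def correlate(scores, threshold=0.8):
    """Group above-threshold model scores into one alert per entity,
    ranked by the highest contributing score, as a dashboard might."""
    alerts = defaultdict(list)
    for entity, model, score in scores:
        if score >= threshold:
            alerts[entity].append((model, score))
    return sorted(alerts.items(), key=lambda kv: -max(s for _, s in kv[1]))

alerts = correlate(scores)
```

Here the two findings for the same user collapse into a single alert, while the low-scoring entity generates none.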

4.2 Variety of sources
The first challenge involved in incorporating machine learning into cyberattack detection is the variety of data sources in the cybersecurity space. In the authors’ experience, organizations commonly store and index around 50 data sources, either for forensics or for live analysis. Such variety makes it challenging to ingest and model the data: on one hand, parsing logic needs to be developed for each data source, and on the other hand, the information useful for threat modeling needs to be identified for each source, which remains a time-consuming process that can only be performed by skilled domain experts.

This challenge becomes more acute when the goal is to build generic cybersecurity platforms, because it is then necessary to account for different products and versions of security devices and applications. In fact, for the categories shown in Figure 2, such as firewalls, DNS, and EDR, there are multiple solutions that generate logs, with different information content and different formats. Moreover, log formats are often augmented over time as improvements are made to the solutions’ capabilities and new versions are released, which requires continuous maintenance of the ingestion logic.

4.3 Variety of threats
The biggest challenge for AI security analytics is arguably the variety of techniques used by attackers. As noted before, MITRE’s ATT&CK initiative [30], the most comprehensive resource to date, currently lists 282 techniques used by adversaries in different phases of an attack. Even this is an undercount, given that new techniques are continuously being implemented and used by adversaries.

As shown in Figure 2, analysis of the activity of common entities such as internal IPs, external IPs, domains, users, etc. can be leveraged in combination with machine learning models to identify malicious techniques. However, it is important to note that the activity traces necessary to identify a given technique can be scattered across data sources. Therefore, the analysis of cybersecurity data introduces a many-to-many mapping from data sources to entity activity, as well as a many-to-many mapping from entity to threat or technique.

These observations suggest the need for modular pipelines in which multiple entities are modeled, and where the activity of a given entity is extracted from multiple data sources and fed to one or more machine learning models. The design and implementation of such a manifold system represents another challenge, because most machine learning-ready, end-to-end analytics solutions aim at supporting a reduced number of use cases.

4.4 Close-to-real-time requirement
Due to regulation, policy, and/or service-level agreements, many organizations require incident response times to remain in the sub-hour range. This budget includes the threat detection step (which, in the case of AI security analytics, involves the steps described in Section 4.1), along with investigations and incident response. This means that threat detection needs to be performed in close-to-real-time; in the authors’ experience, detection times in the 5 to 15 minute range after the event are acceptable in operational scenarios.

Despite advances in big data computing, the large scale of the data remains a significant roadblock in the attempt to deliver close-to-real-time detections. While some data sources may be small in size, the daily volume of sources such as firewall, proxy, or DNS logs ranges from a few GBs for small and medium businesses to multiple TBs for large organizations. The ability to process data for machine learning model consumption at the multi-terabyte scale is achieved with distributed streaming computing frameworks which, in turn, introduce a series of maintenance challenges. (While these maintenance challenges can limit the adoption of AI security analytics, they are not unique to cybersecurity, and therefore we consider their analysis out of the scope of this paper.)

5. EXPLAINING ML-GENERATED ALERTS
In this section, we describe the process that security analysts currently follow while investigating threats, and the challenges associated with integrating ML-generated alerts.

Figure 3: Schematic representation of the three-layer interface used by security analysts. The alert summary view (left) displays all the alerts. The event view (center) shows the set of events that were aggregated in the same alert. The raw data view (right) shows the logs, packets, files, process information, etc. for each event included in the alert.

5.1 Existing analyst interface
Security analysts review alerts and take appropriate actions to remediate actual threats. Alerts are generated either by security devices or analytics tools, and are aggregated in a single area for digestion by the analysts. Figure 3 shows a schematic representation of a common interface used by security analysts to review and investigate alerts. As shown in the figure, the context of each alert is displayed in three different layers. In the first and highest-level layer, referred to in this paper as the alert summary view, a dashboard displays all the alerts that need to be reviewed. Generally, each alert is contextualized with summary information such as the detected malicious activity, a risk score, a timestamp, and the entities involved in the alert. The second layer, the event view, shows the set of events that were aggregated in the same alert. The aggregation or alert correlation logic groups related entities and removes duplicates. (Note that when a new event takes place, this logic determines whether an existing alert is updated or a new alert is created.) The third and lowest-level view shows the raw data (logs, packets, files, process information, etc.) that triggered the detection mechanism for each event included in the alert. This last view is referred to as the raw data view.

5.2 Incorporating ML-generated alerts
We now describe the challenges and strategies involved in generating the analyst views described above (alert summary view, event view, and raw data view) when alerts are triggered by detections from machine learning models.

5.2.1 Alert summary view:
When machine learning models are added to the existing set of detection strategies (signatures, blacklists, and rules), the generation of this view can remain for the most part unchanged. Given that ML models evaluate the activity of an entity over a predefined time window, it is straightforward to retrieve the entity information and the timestamp of any given detection.

In cases where supervised models are looking for repeating malicious activity patterns, the identified threat is given by the class predicted by the classifier. Scores are computed as a function of the class probabilities returned by the classifier that triggered the alert, among other parameters.
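One possible realization of such a score function, as a sketch only (the weighting scheme is an assumption, not the paper's formula): take the most probable class and weight its probability by a per-threat severity factor.

```python
def alert_score(class_probabilities, severity):
    """Name the threat via the most probable class, then weight the
    classifier's confidence by an (assumed) per-threat severity factor."""
    threat, confidence = max(class_probabilities.items(), key=lambda kv: kv[1])
    return threat, round(confidence * severity.get(threat, 1.0), 3)

# Illustrative classifier output and severity weights.
probs = {"benign": 0.05, "account_takeover": 0.80, "exfiltration": 0.15}
severity = {"account_takeover": 0.9, "exfiltration": 1.0}

threat, score = alert_score(probs, severity)
```

The key property for the summary view is that a supervised detection names its threat, which, as discussed next, anomaly detection scores do not.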

The case of anomaly detection models is arguably more complex. The score is generally a metric that indicates how much an activity deviates from the norm. However, given that high scores indicate rare activities, it is harder to determine the threat represented by an activity that is rarely seen. This limitation arguably introduces a challenge in the operationalization of anomaly detection-based alerts, given that analysts are not informed of the specific threats identified in the analysis.

5.2.2 Event view:
This view is the result of the alert correlation logic, i.e. the logic that groups events generated by the different detection mechanisms. Given that this grouping is often performed based on the entities involved in the events and the time of the activities [21], the addition of ML-generated events into the logic does not introduce additional complexities.

On the other hand, this view shows the mechanism that triggered the alert. For instance, Figure 3 shows the events aggregated into Alert 22, where Event 2 was triggered by a Login rule with id 114. One of the benefits of rule-based detection is that rules are interpretable, such that analysts understand the activity pattern identified by each rule. Achieving this with machine learning models is more complex. In the following, we describe a series of alternatives that might allow for a similar view with machine learning models. Note that we adopt the classification of approaches and the terminology introduced in the survey of methods for explaining supervised models contributed by Guidotti et al. [11].

1. Use of interpretable models: Also referred to as transparent box design, this is the option to learn interpretable models, such as rules, linear models, or decision trees. In this way, a human-readable representation of the models can be incorporated in place of the rules or signatures. In the case of decision trees and linear models, a second option consists of showing a summary of the feature importances instead of the full models.

2. Black-box explanation: There are two principal approaches to explaining black-box models (tree ensembles, support vector machines, and neural networks). The first consists of providing a summary of the importance of each of the features. For instance, this capability is discussed in the seminal paper introducing random forests [4], while works such as [35] are model-agnostic.


The second approach consists of showing, rather than the black-box model, an interpretable surrogate model (rules, decision trees, or linear models) that mimics the outputs of the black box. This approximation can be global, whereby a single surrogate model is used for all the examples in the data, or local, whereby each region of the data is explained with a different surrogate model. Either way, the user is still presented with an interpretable model. While initial works considered global approximation strategies, the lack of fidelity of the surrogate models for complex applications resulted in inconsistencies in the explanations. Recent works show improved results by focusing on local approximations [15, 32].

3. Outcome explanation: Similar to model explanations, there are two main methods for providing outcome explanations. The first relies on the analysis of the contribution of each feature to a given prediction. Note again that some specific models, such as random forests, incorporate this capability out of the box [4], while model-agnostic methods have been developed to provide this information [26, 27, 32] for any model that generates output probabilities. For cases in which a surrogate model strategy is adopted, outcome explanation is achieved by showing the specific rule, or the decision path in a decision tree, that triggered the prediction.
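For models that expose a scoring function, a model-agnostic outcome explanation in the spirit of [26, 27] can be sketched as a simple perturbation analysis: reset each feature to a baseline value and measure the score drop. The scoring function and baseline values below are hypothetical.

```python
def explain_outcome(score_fn, example: dict, baseline: dict) -> dict:
    """Per-feature contribution to one prediction: the drop in score
    observed when that feature alone is reset to its baseline value."""
    full = score_fn(example)
    contributions = {}
    for name in example:
        perturbed = dict(example, **{name: baseline[name]})
        contributions[name] = full - score_fn(perturbed)
    return contributions
```

For instance, with an illustrative score `0.6 * avg_entropy / 4 + 0.4 * digit_ratio`, an example with high entropy would see most of its score attributed to the entropy feature.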

One of the shortcomings of these approaches when applied to cybersecurity is that they use the features associated with machine learning analysis as a basis to explain the models or the model outcomes. However, features created by data scientists are not necessarily interpretable by security analysts. Therefore, additional care is needed to ensure the use of interpretable features to generate explanations, and it is paramount to complement each machine learning alert with its corresponding raw data.

5.2.3 Retrieving relevant raw data: To populate the third and lowest-level analyst view, it is necessary to retrieve the raw data corresponding to a machine learning alert. In cases where the alerted entity presents a reduced data footprint, this step is straightforward: it is enough to retrieve all the logs pertaining to the analyzed entity, and to rely on the analyst to triage the relevant information for the investigation.

However, in practice, the footprint of alerted entities is often large and manual triage is not efficient. For instance, the network activity of a given endpoint can be captured in thousands of log lines in firewall and DNS logs. In these common cases, in order to improve the efficiency of the investigation, it is necessary to reduce the set of logs displayed as context for the analyzed alert. A simple option consists of designing predefined views of the raw data, generally using a hard-coded time window prior to the alert and filtering logic to display a reduced set of selected logs, events (rows), and fields (columns). However, this approach can prove insufficient when the alert corresponds to an activity observed for the first time or to variations of existing attacks. In these cases, a hard-coded, predefined contextualization logic might fail to retrieve the raw data needed to investigate the alert.
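A predefined view of this kind can be sketched as follows, assuming logs are parsed into rows with `source_ip` and `timestamp` fields; the 12-hour window and the filtering predicate are placeholders for the hard-coded logic described above.

```python
from datetime import datetime, timedelta

def context_logs(logs: list, entity: str, alert_time: datetime,
                 window: timedelta = timedelta(hours=12),
                 predicate=lambda row: True) -> list:
    """Keep the entity's log rows inside a hard-coded window before the
    alert, optionally filtered (e.g. NXDOMAIN responses only)."""
    return [row for row in logs
            if row["source_ip"] == entity
            and alert_time - window <= row["timestamp"] <= alert_time
            and predicate(row)]
```

The weakness discussed above is visible here: the window and the predicate are fixed in advance, so rows relevant to a novel attack pattern may be filtered out.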

Name                             Example 1    Example 2
date                             9/7/2017     9/8/2017
source_ip                        10.0.0.4     10.0.0.5
Num. NX domains                  16           14
Avg. NX domain length            22.13        21.07
Avg. NX domain digit ratio       0.25         0.17
Avg. NX domain consonant ratio   0.56         0.61
Avg. NX domain vowel ratio       0.12         0.15
Avg. NX domain entropy           3.99         3.97

Example 1: 9/7/2017
timestamp   source_ip   domain                            response_code
1:56:55     10.0.0.4    1k1r8daphzygb.com                 NXDOMAIN
1:57:01     10.0.0.4    6xzxrlq0sv218o4.com               NXDOMAIN
1:57:04     10.0.0.4    d19pp1v-l-e1senafnh5pz.com        NXDOMAIN
7:59:51     10.0.0.4    7-hxc3u9urnjnf.com                NXDOMAIN
10:05:05    10.0.0.4    sdhtbtlf1q.com                    NXDOMAIN
10:05:12    10.0.0.4    xj316pmqvj.com                    NXDOMAIN
10:05:12    10.0.0.4    2syufo8vs0jprllr1a6lmd8q32.com    NXDOMAIN
10:07:44    10.0.0.4    td-k3nctm5-byh4knfd7u.com         NXDOMAIN
10:07:47    10.0.0.4    yb7561d7lw40yq35n-d3.com          NXDOMAIN
10:08:01    10.0.0.4    30x3l0c-8.com                     NXDOMAIN
10:08:30    10.0.0.4    sie-tsuw57l2m.com                 NXDOMAIN
10:08:30    10.0.0.4    svczpy14t3.com                    NXDOMAIN
10:08:33    10.0.0.4    3tmg1apkz4n6pz3074m.com           NXDOMAIN
10:10:04    10.0.0.4    v6dutthgl9aq1tezqric3llp2rm.com   NXDOMAIN
10:10:14    10.0.0.4    t7hoyj6088be546f96h7s.com         NXDOMAIN
10:17:15    10.0.0.4    q9eyeqxnudngw06urnp2f2t2kt.com    NXDOMAIN
12:16:40    10.0.0.4    nu8zm4v2es8lbypbhg-g9nt6kof.com   NXDOMAIN

Example 2: 9/8/2017
timestamp   source_ip   domain                            response_code
1:09:21     10.0.0.5    s6qlhjacym8klak6qct7wf.com        NXDOMAIN
1:12:07     10.0.0.5    mezm62dw2s6.com                   NXDOMAIN
2:15:26     10.0.0.5    bv6wyeyfinr5cp71wra2.com          NXDOMAIN
3:42:26     10.0.0.5    dbqfeyvvjs.com                    NXDOMAIN
9:09:13     10.0.0.5    lnmymb6kje7fs.com                 NXDOMAIN
9:09:13     10.0.0.5    b1ld1gh0wa08e3o6kjy.com           NXDOMAIN
9:09:16     10.0.0.5    foi4jctjjuvidphz92.com            NXDOMAIN
9:09:54     10.0.0.5    hp3k9p0u4qrlz8.com                NXDOMAIN
9:49:14     10.0.0.5    9l-qsckvr7tlxgwv38imph9bt.com     NXDOMAIN
9:51:11     10.0.0.5    8zmr-ksvlohihcvnx2.com            NXDOMAIN
9:51:11     10.0.0.5    rz1vxe75i2lq694.com               NXDOMAIN
9:51:36     10.0.0.5    qegxtcgpuyyyqupngj7z.com          NXDOMAIN
9:52:42     10.0.0.5    rw8i-ikw.com                      NXDOMAIN
10:00:31    10.0.0.5    g--l3q9czusfvu-sdp0yevpk2p.com    NXDOMAIN

Figure 4: Contextualization of DGA detections with the features used for machine learning analysis and with relevant raw logs.

Although they are more often used for text, images, and tabular data, certain methods in the explainable machine learning literature may apply here; in particular, those methods that retrieve the discriminant regions of the input data that determine the model outcome. These regions are referred to as saliency masks in the context of natural language processing and computer vision [17, 28, 34, 36], and are commonly retrieved via sensitivity analysis [5], whereby perturbations to the input data and their resulting model outcomes are jointly analyzed to determine the contribution of each input region to the model outcome.
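Adapted to activity logs, a leave-one-row-out sensitivity analysis might look like the following sketch, where the scoring function stands in for the deployed model; rows whose removal changes the score most are the ones worth surfacing to the analyst.

```python
def salient_rows(rows: list, score_fn) -> list:
    """Drop each log row in turn and measure the change in the model
    score; return the impactful rows, most impactful first."""
    full = score_fn(rows)
    impact = [(abs(full - score_fn(rows[:i] + rows[i + 1:])), i)
              for i in range(len(rows))]
    return [rows[i] for d, i in sorted(impact, reverse=True) if d > 0]
```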

In the authors' experience, retrieving relevant logs is important for providing actionable ML-generated alerts, because the features used for machine learning analysis are not always easy for security analysts to interpret. However, based on our review of related work, most papers in the machine learning literature focus on problems in which data takes the form of images, text, or tables, and do not consider the timestamped relational or transactional activity logs that are more relevant to cybersecurity use cases. While the underlying ideas behind these works represent promising approaches to retrieving relevant raw data in cybersecurity applications, the lack of domain-specific demonstrations arguably introduces a challenge for the integration of machine learning alerts in security operations.

To illustrate this challenge, we next explore the detection of Domain Generation Algorithms (DGAs), a well-documented command and control mechanism [24]. DGAs are designed by attackers to generate dynamic domains that are then used as meeting points for compromised machines and remote attackers. The key idea is that the remote attacker and the compromised machine share the generation logic, and will therefore generate the same set of domains, each on its own side. The remote attacker then registers one of the many domains generated by the algorithm, while the malware in the compromised machines attempts to connect to each and every one of the generated domains until a communication channel is established with the remote attacker.

When such an attack takes place, compromised machines try to connect to an elevated number of random-looking domains, as shown in Figure 4. In doing so, they generate DNS queries to resolve the domains, most of which fail because the domains do not exist (NXDOMAIN response code). In order to detect this type of attack with machine learning, we analyze the activity of internal IPs (source_ip in the figure) captured in DNS logs. In particular, we propose using the following features, which quantify the characteristics of the NX domains queried by each IP:

1. Number of NX domains queried by the IP

2. Average length of the NX domains queried by the IP

3. Average digit ratio of the NX domains

4. Average consonant ratio of the NX domains

5. Average vowel ratio of the NX domains

6. Average entropy of the NX domains
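The six features above can be sketched as follows. The preprocessing is an assumption on our part: values are averaged over the second-level domain labels (the TLD is stripped), and entropy is Shannon entropy in bits over the characters of each label.

```python
import math
from collections import Counter

def dga_features(nx_domains: list) -> dict:
    """Per-IP features over the set of NX domains queried by that IP."""
    labels = [d.split(".")[0] for d in nx_domains]  # strip the TLD

    def avg(f):
        return sum(map(f, labels)) / len(labels)

    def entropy(s):
        counts = Counter(s)
        return -sum(c / len(s) * math.log2(c / len(s)) for c in counts.values())

    return {
        "num_nx_domains": len(nx_domains),
        "avg_length": avg(len),
        "avg_digit_ratio": avg(lambda s: sum(c.isdigit() for c in s) / len(s)),
        "avg_consonant_ratio": avg(lambda s: sum(c.isalpha() and c not in "aeiou" for c in s) / len(s)),
        "avg_vowel_ratio": avg(lambda s: sum(c in "aeiou" for c in s) / len(s)),
        "avg_entropy": avg(entropy),
    }
```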

Figure 4 shows, for two examples, the values of the extracted features as well as the section of DNS logs used for the analysis. We argue that, in this case, the logs provide information that is more helpful and intuitive to analysts who are trying to determine the presence or absence of a DGA. As discussed above, the automated retrieval of the appropriate logs remains an open challenge for the machine learning and cybersecurity communities.

6. ROLES AND WORKFLOWS

In this section, we briefly describe current security operation workflows, along with the adjustments needed to incorporate and maintain machine learning-based detection models.

6.1 Existing roles and workflows

Security operation centers (SOCs) split their analysts into three tiers. The first tier comprises the most junior analysts, while the most skilled analysts are in tier three. First-tier analysts perform a first-pass screening and filtering of alerts in order to discard false positives and remediate common threats. For example, a common tier-1 workflow involves matching alerts against blacklists of known malicious entities, and adding rules to block further connections to the matched entities. In cases where first-tier analysts cannot conclude whether an event is malicious, they escalate the event for further investigation by a higher-level analyst. Tier-2 analysts perform investigations and remediations of escalated alerts, and can in turn pass on events to tier-3 analysts. Generally, only critical alerts requiring advanced investigation skills, such as reverse-engineering malware, are escalated to the last tier.

Apart from these layered investigation workflows, SOCs adopt continuous improvement strategies to maintain appropriate detection rates as the threat landscape evolves, managing the mechanisms (rules, threat intelligence sources, etc.) that generate alerts so that mechanisms with inadequate detection rates are either updated or switched off. Update procedures include subscription-based feeds, as well as in-house continuous improvement processes developed by the operations team. For example, an outdated rule that generates an elevated number of false positives can be edited or deleted. Similarly, new detection mechanisms are constantly added as information about new attacks becomes available.

6.2 Emerging roles and workflows in AI security analytics

As described in previous sections, existing mechanisms for generating alerts do not predominantly rely on machine learning models. Consequently, current SOC responsibilities focus on alert investigations and remediations, but do not include some of the key roles and workflows necessary to operationalize machine learning models. As shown in Figure 5, ML workflows introduce three distinct roles: security analysts, model developers, and machine learning architects.

6.2.1 AI security analytics workflows: As shown in Figure 5, AI security analytics operations interact with data at different levels. In a nutshell, the initial steps of big data analytics involve characterizing the activities of active entities across a variety of data sources. Next, the deployed models generate alerts that are integrated into the dashboard for analyst consumption. As such, analysts review ML-generated alerts in the same way they review alerts generated by other mechanisms such as rules, signatures, and blacklists. One of the key aspects of this process is that the investigation outcomes provided by analysts are stored together with the alert contexts and associated data. These annotated alerts are leveraged by machine learning model developers to create new models and improve existing ones. Finally, an AI security analytics architect decides on the model deployment strategy and identifies detection gaps. In the following, we describe in more detail the work carried out by the different AI security analytics actors.

6.2.2 Security analysts: The key responsibility of security analysts is still the investigation and remediation of alerts. Despite the challenges described in Section 5, the investigation and alert escalation workflows followed by security analysts are not highly impacted by the inclusion of ML-generated alerts.

6.2.3 Machine learning model developers: Model developers are charged with the creation and continuous improvement of detection models. New models are created to detect new attacks, expanding detection coverage, while improvements to existing models target a better trade-off between true positives (malicious findings) and false positives (false alerts). Continuous improvement workflows are needed to counter model degradation caused by an evolving threat landscape.
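One simple form of such a continuous-improvement workflow is to compute per-period alert precision from analyst verdicts and flag models whose precision has degraded. The 0.5 threshold and the verdict labels below are illustrative assumptions, not a prescribed policy.

```python
def degradation_flags(weekly_outcomes: list, min_precision: float = 0.5) -> list:
    """Given (period, analyst_verdicts) pairs for one model, flag the
    periods where alert precision falls below the review threshold."""
    flags = []
    for period, verdicts in weekly_outcomes:
        true_positives = sum(v == "malicious" for v in verdicts)
        precision = true_positives / len(verdicts) if verdicts else 0.0
        if precision < min_precision:
            flags.append((period, precision))
    return flags
```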

In the authors' experience, the shortage of skilled model developers for cybersecurity represents one of the key challenges in the adoption of AI security analytics. Working in this area requires both extensive knowledge of the cybersecurity domain, in order to understand attacks and how they manifest in data, and an understanding of the strengths and limitations of machine learning models. While frameworks such as [30] describe a large number of attack techniques, developers need to identify opportunities where machine learning models can improve the detection capabilities of existing detection mechanisms.

Figure 5: AI security analytics operations: Big data analytics characterize the activities observed from a variety of data sources. Next, the deployed machine learning models generate alerts that are integrated in the alert dashboard for analyst consumption. Processed data, in combination with investigation outcomes, are leveraged by model developers to create and improve models. Finally, an AI security analytics architect defines the deployment strategy.

6.2.4 AI security analytics architect: The responsibilities involved in this role are twofold. AI security analytics architects are in charge of devising the machine learning deployment strategy, deciding which models to deploy in order to obtain appropriate threat coverage and an acceptable cost (false positives) versus benefit (true positives) trade-off. They must also identify gaps in detection coverage due to a lack of models or degraded model performance. These findings are then translated into requirements for the development or improvement of models.

7. CONCLUSION

In this paper, we have explored in depth the challenges that currently limit the adoption of machine learning for threat detection in real-world security operations. The scope of these challenges is wide, and the solutions require not only technological developments, but also changes in the roles and processes involved in security operations. This paper has focused on the label acquisition challenge, the processing capabilities needed to address the many-to-many mappings of data sources to the wide range of threats that compose the cybersecurity domain, and the challenges involved in adding the alerts generated by machine learning models into existing investigation workflows. Finally, the paper discussed the operational changes in processes and roles needed to maintain a machine learning-based detection strategy over time.

8. REFERENCES

[1] B. Anderson and D. McGrew. Identifying Encrypted Malware Traffic with Contextual Flow Data. In Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security, AISec '16, pages 35–46, New York, NY, USA, 2016. ACM.

[2] A. Rashid, R. Ramdhany, M. Edwards, S. Mukisa Kibirige, M. Ali Babar, D. Hutchison, and R. Chitchyan. Detecting and Preventing Data Exfiltration. Technical report, Lancaster University, April 2014.

[3] E. B. Beigi, H. H. Jazi, N. Stakhanova, and A. A. Ghorbani. Towards effective feature selection in machine learning-based botnet detection approaches. In 2014 IEEE Conference on Communications and Network Security, pages 247–255, 2014.

[4] L. Breiman. Random forests. Machine Learning, 45(1):5–32, Oct. 2001.

[5] P. Cortez and M. J. Embrechts. Opening black box Data Mining models using Sensitivity Analysis. In 2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM), pages 341–348, April 2011.

[6] M. Darling, G. Heileman, G. Gressel, A. Ashok, and P. Poornachandran. A lexical approach for classifying malicious URLs. In 2015 International Conference on High Performance Computing & Simulation (HPCS), pages 195–202, July 2015.

[7] J. J. Davis and E. Foo. Automated Feature Engineering for HTTP Tunnel Detection. Computers & Security, 59(C):166–185, June 2016.

[8] G. Draper-Gil, A. H. Lashkari, M. S. I. Mamun, and A. A. Ghorbani. Characterization of Encrypted and VPN Traffic using Time-related Features. In ICISSP, 2016.

[9] G. Farnham and A. Atlasis. Detecting DNS Tunneling. SANS Institute InfoSec Reading Room, 2013.

[10] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter. Efficient and Robust Automated Machine Learning. In Advances in Neural Information Processing Systems 28, pages 2962–2970. Curran Associates, Inc., 2015.

[11] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi. A Survey of Methods for Explaining Black Box Models. ACM Computing Surveys, 51(5):93:1–93:42, Aug. 2018.

[12] A. Hagberg, N. Lemons, A. Kent, and J. Neil. Connected Components and Credential Hopping in Authentication Graphs. In 2014 Tenth International Conference on Signal-Image Technology and Internet-Based Systems, pages 416–423, Nov. 2014.

[13] H. Jiang, J. Nagra, and P. Ahammad. SoK: Applying Machine Learning in Security - A Survey. arXiv preprint arXiv:1611.03186, 2016.

[14] A. Kapravelos, Y. Shoshitaishvili, M. Cova, C. Kruegel, and G. Vigna. Revolver: An Automated Approach to the Detection of Evasive Web-based Malware. In 22nd USENIX Security Symposium (USENIX Security 13), pages 637–652, Washington, D.C., 2013.

[15] S. Krishnan and E. Wu. PALM: Machine Learning Explanations For Iterative Debugging. In Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics, pages 4:1–4:6, New York, NY, USA, 2017. ACM.

[16] H. Lampesberger, P. Winter, M. Zeilinger, and E. Hermann. An on-line learning statistical model to detect malicious web requests. In International Conference on Security and Privacy in Communication Systems, pages 19–38. Springer, 2011.

[17] W. Landecker, M. D. Thomure, L. M. A. Bettencourt, M. Mitchell, G. T. Kenyon, and S. P. Brumby. Interpreting individual classifications of hierarchical networks. In 2013 IEEE Symposium on Computational Intelligence and Data Mining, pages 32–38, April 2013.

[18] A. H. Lashkari, G. D. Gil, M. S. I. Mamun, and A. A. Ghorbani. Characterization of Tor Traffic using Time based Features. In Proceedings of the 3rd International Conference on Information Systems Security and Privacy (ICISSP), pages 253–262. INSTICC, SciTePress, 2017.

[19] Y. Liu, C. Corbett, K. Chiang, R. Archibald, B. Mukherjee, and D. Ghosal. SIDD: A Framework for Detecting Sensitive Data Exfiltration by an Insider Attack. In 2009 42nd Hawaii International Conference on System Sciences, pages 1–10, Jan. 2009.

[20] M. S. I. Mamun, M. A. Rathore, A. H. Lashkari, N. Stakhanova, and A. A. Ghorbani. Detecting Malicious URLs Using Lexical Analysis. In Network and System Security - 10th International Conference, NSS 2016, Taipei, Taiwan, September 28-30, 2016, Proceedings, pages 467–482, 2016.

[21] S. A. Mirheidari, S. Arshad, and R. Jalili. Alert correlation algorithms: A survey and taxonomy. In Cyberspace Safety and Security, pages 183–197, Cham, 2013. Springer International Publishing.

[22] MLflow. An open source platform for the machine learning lifecycle, 2019.

[23] A. Nadler, A. Aminov, and A. Shabtai. Detection of Malicious and Low Throughput Data Exfiltration Over the DNS Protocol. CoRR, abs/1709.08395, 2017.

[24] D. Plohmann, K. Yakdan, M. Klatt, J. Bader, and E. Gerhards-Padilla. A Comprehensive Measurement Study of Domain Generating Malware. In 25th USENIX Security Symposium (USENIX Security 16), pages 263–278, Austin, TX, 2016. USENIX Association.

[25] A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, and C. Ré. Snorkel: rapid training data creation with weak supervision. The VLDB Journal, July 2019.

[26] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pages 1135–1144, New York, NY, USA, 2016. ACM.

[27] M. Robnik-Šikonja and I. Kononenko. Explaining Classifications For Individual Instances. IEEE Transactions on Knowledge and Data Engineering, 20(5):589–600, May 2008.

[28] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2013.

[29] M. Stevanovic and J. M. Pedersen. On the use of machine learning for identifying botnet network traffic. Journal of Cyber Security and Mobility, 4(3):1–32, 2016.

[30] The MITRE Corporation. ATT&CK: Adversarial Tactics, Techniques & Common Knowledge, 2019.

[31] I. Ullah. Detecting Lateral Movement Attacks through SMB using Bro. Master's thesis, University of Twente, 2016.

[32] M. M. Vidovic, N. Görnitz, K.-R. Müller, and M. Kloft. Feature importance measure for non-linear learning algorithms. CoRR, abs/1611.07567, 2016.

[33] M. Warmer. Detection of web based command & control channels. Master's thesis, University of Twente, 2011.

[34] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning Deep Features for Discriminative Localization. In CVPR, 2016.

[35] A. Zien, N. Krämer, S. Sonnenburg, and G. Rätsch. The feature importance ranking measure. In Machine Learning and Knowledge Discovery in Databases, pages 694–709, Berlin, Heidelberg, 2009. Springer.

[36] L. M. Zintgraf, T. S. Cohen, T. Adel, and M. Welling. Visualizing deep neural network decisions: Prediction difference analysis. CoRR, abs/1702.04595, 2017.