Comparing Spark GraphXand Cray Graph Engine using large-Comparing Spark GraphXand Cray Graph Engine...

15
Comparing Spark GraphX and Cray Graph Engine using large- scale client data Eric Dull and Brian Sacash Deloitte Advisory, May 11, 2017

Transcript of Comparing Spark GraphXand Cray Graph Engine using large-Comparing Spark GraphXand Cray Graph Engine...

Page 1: Comparing Spark GraphXand Cray Graph Engine using large-Comparing Spark GraphXand Cray Graph Engine using large-scale client data Eric Dull and Brian Sacash Deloitte Advisory, May

Comparing Spark GraphX and Cray Graph Engine using large-scale client data

Eric Dull and Brian SacashDeloitte Advisory, May 11, 2017

Page 2: Comparing Spark GraphXand Cray Graph Engine using large-Comparing Spark GraphXand Cray Graph Engine using large-scale client data Eric Dull and Brian Sacash Deloitte Advisory, May

Deloitte Advisory 2Copyright © 2017 Deloitte Development LLC. All rights reservedCopyright © 2017 Deloitte Development LLC. All rights reserved

Topics

• Introductions

• Our collection and analysis environment

• Problem statement:

• Experimental design

• GraphX

• CGE

• Weaknesses and “gotchas”

• Next steps

Page 3: Comparing Spark GraphXand Cray Graph Engine using large-Comparing Spark GraphXand Cray Graph Engine using large-scale client data Eric Dull and Brian Sacash Deloitte Advisory, May

Deloitte Advisory 3Copyright © 2017 Deloitte Development LLC. All rights reserved

IntroductionsDeloitte Advisory’s Cyber Reconnaissance team

Eric DullSpecialist LeaderDeloitte & Touche LLP• Experience includes network analysis, applied

graph analysis, behavior-based anomaly detection

• Prior CUG papers:• Cyberthreat analytics using graph analysis,

CUG 2015

Deloitte Advisory’s Cyber Reconnaissance team uses a combination of big data tools, data science, graph analytics, and supercomputing to uncover potential threat vectors or ongoing attacks

Brian SacashSpecialist SeniorDeloitte & Touche LLP• Data scientist who focuses on software

development for analytic-based decision making• Experience employing natural language

processing, statistical analysis, and machine learning using big data technologies

Page 4: Comparing Spark GraphXand Cray Graph Engine using large-Comparing Spark GraphXand Cray Graph Engine using large-scale client data Eric Dull and Brian Sacash Deloitte Advisory, May

Deloitte Advisory 4Copyright © 2017 Deloitte Development LLC. All rights reservedCopyright © 2017 Deloitte Development LLC. All rights reserved

Deloitte’s collection and analysis environment

Collectors (BRO)

AnalystBuild a graph

Execute Algorithm

Validated FindingsResults

SPARKPARQUET

CGE

Page 5: Comparing Spark GraphXand Cray Graph Engine using large-Comparing Spark GraphXand Cray Graph Engine using large-scale client data Eric Dull and Brian Sacash Deloitte Advisory, May

Deloitte Advisory 5Copyright © 2017 Deloitte Development LLC. All rights reservedCopyright © 2017 Deloitte Development LLC. All rights reserved

Data size

February Data:

• ~83,110,000,000 connection records

• ~1,800,000 unique clients

• ~55,000,000 unique external IPs

Source: Deloitte February data pull

0

1E+10

2E+10

3E+10

4E+10

5E+10

6E+10

7E+10

8E+10

9E+10

Connection Logs Per Month

Page 6: Comparing Spark GraphXand Cray Graph Engine using large-Comparing Spark GraphXand Cray Graph Engine using large-scale client data Eric Dull and Brian Sacash Deloitte Advisory, May

Deloitte Advisory 6Copyright © 2017 Deloitte Development LLC. All rights reservedCopyright © 2017 Deloitte Development LLC. All rights reserved

Cray Urika-GX

Specifications

• 32 Blades

• 1000 cores

• 8 Terabytes of Ram

• 120 TB of Lustre

• 25 Blades available for Apache Spark

Additional details:

• Hosted in Deloitte’s Federal Technology Center in Suwanee, GA

• Used for multiple Spark work streams supporting multiple clients

What compute did we use for the experiments

Page 7: Comparing Spark GraphXand Cray Graph Engine using large-Comparing Spark GraphXand Cray Graph Engine using large-scale client data Eric Dull and Brian Sacash Deloitte Advisory, May

Deloitte Advisory 7Copyright © 2017 Deloitte Development LLC. All rights reservedCopyright © 2017 Deloitte Development LLC. All rights reserved

Motivation

Cyber Kill Chain:

• External Reconnaissance

• Infection

• Lurking

• Activity

Applicable Graph algorithms:

• Community of Interest Identification

• Betweenness Centrality

Connecting Cyber Kill Chain to Graph Algorithms

Reconnaissance Assembly of Malware

Malware Delivery Exploitation Malware

InstallationInternal NetworkReconnaissance

Obfuscation & Deception

Remote Control (of a Botnet)

Multiple Probes & Attempts Example Attack Model

Page 8: Comparing Spark GraphXand Cray Graph Engine using large-Comparing Spark GraphXand Cray Graph Engine using large-scale client data Eric Dull and Brian Sacash Deloitte Advisory, May

Deloitte Advisory 8Copyright © 2017 Deloitte Development LLC. All rights reservedCopyright © 2017 Deloitte Development LLC. All rights reserved

Model

Betweenness Centrality: Which graph node has the most paths go through it?

Or, “All roads lead to Rome”

Target Graph Topologies

Server Server Server

Page 9: Comparing Spark GraphXand Cray Graph Engine using large-Comparing Spark GraphXand Cray Graph Engine using large-scale client data Eric Dull and Brian Sacash Deloitte Advisory, May

Deloitte Advisory 9Copyright © 2017 Deloitte Development LLC. All rights reservedCopyright © 2017 Deloitte Development LLC. All rights reserved

Experiment description

Build a graph:

• Use network connection logs

• Focus on known behaviors

Run Betweenness Centrality:

• GraphX implementation

• Cray Graph Engine implementation

Validate Algorithm Results:

• Look for known scanners

• Analyst feedback

How did we execution on this vision?

Page 10: Comparing Spark GraphXand Cray Graph Engine using large-Comparing Spark GraphXand Cray Graph Engine using large-scale client data Eric Dull and Brian Sacash Deloitte Advisory, May

Deloitte Advisory 10Copyright © 2017 Deloitte Development LLC. All rights reservedCopyright © 2017 Deloitte Development LLC. All rights reserved

Graph Building

Approach:

• Focus on TCP/UDP ports targeted by attackers

• Focus on successful connections

• Bring in multiple days

Observations:

• Successful connections reduces connection volumes by ~ 47%

• Targeted TCP ports (20, 21, 22, 23, 123, 445, 3389)

• Days remained in flux

How did we build the graph?

0.E+00

1.E+06

2.E+06

3.E+06

4.E+06

5.E+06

6.E+06

0 5 10 15 20 25

# of distinct IP pairs

Number of days in February

Page 11: Comparing Spark GraphXand Cray Graph Engine using large-Comparing Spark GraphXand Cray Graph Engine using large-scale client data Eric Dull and Brian Sacash Deloitte Advisory, May

Deloitte Advisory 11Copyright © 2017 Deloitte Development LLC. All rights reservedCopyright © 2017 Deloitte Development LLC. All rights reserved

GraphX results

Algorithm:

• No out-of-the-box implementation

• Spent time getting available 3rd party implementation running

Observations:

• GraphX did not perform above small graphs

• Observed variation in execution times likely related to network latency

How did GraphX perform?

Vertices Edges 1 Node 4 Nodes 8 Nodes

5,419 11,726 54 seconds 50.1 seconds 71.8 seconds

42,687 125,564 N/A N/A N/A

Page 12: Comparing Spark GraphXand Cray Graph Engine using large-Comparing Spark GraphXand Cray Graph Engine using large-scale client data Eric Dull and Brian Sacash Deloitte Advisory, May

Deloitte Advisory 12Copyright © 2017 Deloitte Development LLC. All rights reservedCopyright © 2017 Deloitte Development LLC. All rights reserved

CGE results

Algorithm:

• Cray Graph Engine provides betweenness centrality callable through Sparql

• CGE implementation uses directed edges, and traditional betweennesscentrality is undirected

Observations:

• CGE ran, took longer than expected

• Performance did not scale well when given additional nodes

How did the CGE implementation perform?

Vertices Edges 1 Node 8 Nodes 16 Nodes 5,419 11,726 2.6 seconds 29.4 seconds 29.3 seconds15,359 52,042 77.7 seconds 653 seconds 718 seconds42,687 125,564 354 seconds 1643 seconds 1891 seconds66,955 369,720 938 seconds 4730 seconds 6600 seconds115,276 1,281,918 3205 seconds 15068 seconds N/A

Page 13: Comparing Spark GraphXand Cray Graph Engine using large-Comparing Spark GraphXand Cray Graph Engine using large-scale client data Eric Dull and Brian Sacash Deloitte Advisory, May

Deloitte Advisory 13Copyright © 2017 Deloitte Development LLC. All rights reservedCopyright © 2017 Deloitte Development LLC. All rights reserved

Observations, Weaknesses, and “gotchas”

• Analyst validation in progress

• Building meaningful graphs at scale is difficult

• CGE is easy to use, and some algorithms are in progress

• GraphX is hard to use

Page 14: Comparing Spark GraphXand Cray Graph Engine using large-Comparing Spark GraphXand Cray Graph Engine using large-scale client data Eric Dull and Brian Sacash Deloitte Advisory, May

Deloitte Advisory 14Copyright © 2017 Deloitte Development LLC. All rights reservedCopyright © 2017 Deloitte Development LLC. All rights reserved

Next steps

• Further analyst validation

• Additional algorithms / use cases

• Hybrid architectures and workflows

Page 15: Comparing Spark GraphXand Cray Graph Engine using large-Comparing Spark GraphXand Cray Graph Engine using large-scale client data Eric Dull and Brian Sacash Deloitte Advisory, May

This document contains general information only and Deloitte Advisory is not, by means of this document, rendering accounting, business, financial, investment, legal, tax, or other professional advice or services. This document is not a substitute for such professional advice or services, nor should it be used as a basis for any decision or action that may affect your business. Before making any decision or taking any action that may affect your business, you should consult a qualified professional advisor.

Deloitte Advisory shall not be responsible for any loss sustained by any person who relies on this document.

As used in this document, “Deloitte Advisory” means Deloitte & Touche LLP, which provides audit and enterprise risk services; Deloitte Financial Advisory Services LLP, which provides forensic, dispute, and other consulting services; and its affiliate, Deloitte Transactions and Business Analytics LLP, which provides a wide range of advisory and analytics services. Deloitte Transactions and Business Analytics LLP is not a certified public accounting firm. These entities are separate subsidiaries of Deloitte LLP. Please see www.deloitte.com/us/about for a detailed description of the legal structure of Deloitte LLP and its subsidiaries. Certain services may not be available to attest clients under the rules and regulations of public accounting.

Copyright © 2017 Deloitte Development LLC. All rights reserved.36 USC 220506