Hadoop User Group (HUG) Ireland - Eddie Baggott Presentation, April 2016

BAE SYSTEMS PROPRIETARY
Unpublished Work Copyright 2015 BAE Systems. All Rights Reserved. (See final slide for restrictions on use.)

BAE Systems
Apache Spark GraphX and GraphFrames
April 11th 2016
Eddie Baggott

Introduction

• Functional and Data Architect

• BAE Systems, Norkom

• Anti-fraud, AML, compliance, watch lists, cyber security

• Disclaimer: all opinions are my own

What are graph databases?

• Graph databases are databases that use graph structures for semantic queries, with nodes, edges and properties to represent and store data.

• Storing and showing networks
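A minimal sketch of the nodes, edges and properties model, in plain Python rather than a real graph database. The customer and account identifiers, the `OWNS` and `PAID` edge types and the `neighbours` helper are all hypothetical and only illustrate the idea.

```python
# Toy property graph: nodes and edges both carry a dictionary of properties.
# All identifiers and values here are hypothetical illustrations.
nodes = {
    "cust-1": {"label": "Customer", "name": "Eddie"},
    "acct-1": {"label": "Account", "currency": "EUR"},
    "acct-2": {"label": "Account", "currency": "USD"},
}

# Each edge is (source, target, properties).
edges = [
    ("cust-1", "acct-1", {"type": "OWNS"}),
    ("acct-1", "acct-2", {"type": "PAID", "amount": 10000}),
]


def neighbours(node_id, edge_type=None):
    """Return ids reachable over one outgoing edge, optionally filtered by edge type."""
    return [
        dst
        for src, dst, props in edges
        if src == node_id and (edge_type is None or props["type"] == edge_type)
    ]


print(neighbours("cust-1"))          # accounts owned by the customer
print(neighbours("acct-1", "PAID"))  # accounts paid from acct-1
```

A graph database stores this shape natively, so following an edge is a direct lookup rather than a join.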

What are they used for?

• Finding networks

• Analysing relationships
  • Want to see how customers and accounts are connected
  • See the transactions between them

• Credit card

• Compromised devices

• AML rings

• Insurance

• Unauthorized trading

• Social networks

• Uber – Lyft cancel wars

• Panama Papers

Customer behaviour

• Relationships

• Showing direction of payments, co-ownerships

• Use different types of lines and shapes to give extra meaning

• Width of lines can show bigger amounts

Panama Papers

• offshoreleaks.icij.org/nodes/262484

• Start the search with “mossack fonseca”

Panama Papers

• Spider out one level

Panama Papers

• Show more connections

Graph Databases: different approaches

• Graph databases
  • Neo4j, Titan, OrientDB
  • Can store and manage data
  • Traversal queries

• Processing engines
  • Spark, Giraph
  • GraphX
  • GraphFrames

• Can be complementary and used together, e.g. MazeRunner

• Elasticsearch Graph
  • New; uses search and term relevancy

Apache Spark

DataFrames

GraphFrames

GraphX

• GraphX is a graph computation engine built on top of Spark that enables users to interactively build, transform and reason about graph-structured data at scale. It comes complete with a library of common algorithms.

• Part of Spark, based on RDDs

• Basic metrics: number of vertices, number of edges, degrees

• Algorithms
  • PageRank
  • Connected Components
  • Triangle Counting
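To make the metrics and algorithms listed above concrete, here is a small plain-Python sketch on a toy undirected graph. It is not GraphX code and does not run on Spark; the edge list and the helper names (`connected_components`, `triangle_count`) are hypothetical, and it only shows what each quantity means.

```python
from collections import defaultdict

# Hypothetical undirected toy graph given as a list of edges.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("e", "f")]

# Build adjacency sets; the vertices are whatever appears in an edge.
adjacency = defaultdict(set)
for u, v in edges:
    adjacency[u].add(v)
    adjacency[v].add(u)

num_vertices = len(adjacency)
num_edges = len(edges)
degrees = {v: len(nbrs) for v, nbrs in adjacency.items()}


def connected_components(adj):
    """Label every vertex with the smallest vertex id in its component."""
    labels = {}
    for start in sorted(adj):
        if start in labels:
            continue
        stack = [start]
        while stack:
            v = stack.pop()
            if v not in labels:
                labels[v] = start
                stack.extend(n for n in adj[v] if n not in labels)
    return labels


def triangle_count(adj):
    """Count, for each vertex, the triangles that pass through it."""
    counts = {}
    for v, nbrs in adj.items():
        # A triangle through v is a pair of its neighbours that are linked to each other.
        counts[v] = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
    return counts


print(num_vertices, num_edges)
print(degrees)
print(connected_components(adjacency))
print(triangle_count(adjacency))
```

On a cluster the same questions are answered over partitioned data, which is where an engine like GraphX earns its keep.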

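GraphX Example

• In Big Data, “Hello World” is usually a word count of Wikipedia

• So let's graph Wikipedia instead

• Clean the data

• Make the vertex RDD

val vertices = articles.map(a => (pageHash(a.title), a.title))

• Make the edge RDD (srcVid and dstVid are the hashed titles of the linking page and the linked page)

val edges: RDD[Edge[Double]] = articles.flatMap { a => Edge(srcVid, dstVid, 1.0) }

• Make the graph

val graph = Graph(vertices, edges, "")

• Run PageRank on the Dublin pages of Wikipedia

// Keep only the vertices whose title mentions Dublin
val dublinGraph = graph.subgraph(vpred = (v, t) => t.toLowerCase contains "dublin")
val prDublin = dublinGraph.staticPageRank(5)
// Join the ranks back to the titles, then print the ten highest-ranked pages
val titleAndPrGraph = dublinGraph.outerJoinVertices(prDublin.vertices) { (v, title, rank) => (rank.getOrElse(0.0), title) }
titleAndPrGraph.vertices.top(10)(Ordering.by((e: (VertexId, (Double, String))) => e._2._1)).foreach(println)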
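The idea behind running PageRank for a fixed number of iterations can be sketched in plain Python. This is a textbook formulation with ranks normalised to sum to 1, not GraphX's own implementation. The `page_rank` function and the link structure below are hypothetical.

```python
def page_rank(links, damping=0.85, iterations=5):
    """Run a fixed number of PageRank iterations over {page: [linked pages]}."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Every page starts each round with its "random jump" share.
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly across the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / len(pages)
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link structure between a few Dublin-related pages.
links = {
    "Dublin": ["Dublin Airport", "River Liffey"],
    "Dublin Airport": ["Dublin"],
    "River Liffey": ["Dublin"],
    "Dublin Bus": ["Dublin"],
}

ranks = page_rank(links, iterations=5)
for title, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{title}: {score:.3f}")
```

A fixed iteration count trades accuracy for a predictable running time. On a graph this small the ranks still oscillate after five rounds, so a convergence tolerance would be the alternative.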

Spark GraphFrames

• GraphFrames support general graph processing, similar to Apache Spark’s GraphX library. However, GraphFrames are built on top of Spark DataFrames, resulting in some key advantages:

• Python, Java & Scala APIs: GraphFrames provide uniform APIs for all 3 languages. For the first time, all algorithms in GraphX are available from Python & Java.

• Powerful queries: GraphFrames allow users to phrase queries in the familiar, powerful APIs of Spark SQL and DataFrames.

• Saving & loading graphs: GraphFrames fully support DataFrame data sources, allowing writing and reading graphs using many formats like Parquet, JSON, and CSV.

• In GraphFrames, vertices and edges are represented as DataFrames, allowing us to store arbitrary data with each vertex and edge.

• http://spark-packages.org/package/graphframes/graphframes

Spark GraphFrames Example

Customer  ID
Eddie     1
Alan      2
Matt      3
Deirdre   4
Bob       5
Sue       6
John      7

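# Create vertices (customers) and edges (payments); GraphFrames needs an "id" column on vertices and "src"/"dst" columns on edges
from graphframes import GraphFrame
from pyspark.sql.functions import col
vertices = customers.select(col("Customer").alias("id"), col("ID").alias("customer_id")).distinct()
edges = payments.withColumn("src", col("Sender")).withColumn("dst", col("Receiver"))
graph = GraphFrame(vertices, edges)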

Sender  Receiver  Amount   Country
Eddie   Matt      10,000   UK
Eddie   Deirdre   15,000   Irl
Eddie   Bob       25,000   USA
Alan    Sue       32,000   USA
Alan    John      43,000   USA
Matt    Alan      50,000   Irl
Matt    Deirdre   60,000   Irl
Matt    Bob       120,000  USA

Spark GraphFrames Example

• Who sent more than 100k?

graph.edges.filter("amount > 100000").show()

Matt

• Who sent to more than 2 people?

graph.outDegrees.filter("outDegree > 2").show()

Eddie, Matt

• Who sent the most to Ireland?

graph.edges.filter("country = 'Irl'").groupBy("sender").sum("amount").show()

• Who are the most connected?

results = graph.pageRank(resetProbability=0.15, maxIter=10)
display(results.vertices)
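The questions on this slide can be checked against the example payments table without a cluster. The sketch below uses plain Python, not the GraphFrames API, with the rows copied from the table; the variable names are hypothetical. It counts distinct receivers per sender, which for this data matches the sender's out-degree.

```python
from collections import defaultdict

# Payments from the example table: (sender, receiver, amount, country).
payments = [
    ("Eddie", "Matt", 10_000, "UK"),
    ("Eddie", "Deirdre", 15_000, "Irl"),
    ("Eddie", "Bob", 25_000, "USA"),
    ("Alan", "Sue", 32_000, "USA"),
    ("Alan", "John", 43_000, "USA"),
    ("Matt", "Alan", 50_000, "Irl"),
    ("Matt", "Deirdre", 60_000, "Irl"),
    ("Matt", "Bob", 120_000, "USA"),
]

# Who sent more than 100k in a single payment? (a filter over the edges)
big_senders = sorted({s for s, _, amount, _ in payments if amount > 100_000})

# Who sent to more than 2 people? (distinct receivers per sender)
receivers = defaultdict(set)
for sender, receiver, _, _ in payments:
    receivers[sender].add(receiver)
busy_senders = sorted(s for s, r in receivers.items() if len(r) > 2)

# Who sent the most to Ireland? (filter the edges, then sum per sender)
sent_to_ireland = defaultdict(int)
for sender, _, amount, country in payments:
    if country == "Irl":
        sent_to_ireland[sender] += amount
top_ireland_sender = max(sent_to_ireland, key=sent_to_ireland.get)

print(big_senders)
print(busy_senders)
print(dict(sent_to_ireland), top_ireland_sender)
```

The first and third questions are about payments, so they filter the edges. The second is about how many payments leave a customer, which is an out-degree rather than an in-degree.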

Chord Diagram

• Another way to see who is sending money to whom
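Chord diagram tools (D3's chord layout, for example) generally take a square matrix of flows as input. As a sketch, assuming the example payments table from the earlier slide, the sender-by-receiver matrix could be built like this in plain Python. The variable names are hypothetical and no plotting library is involved.

```python
# Payments from the example table: (sender, receiver, amount).
payments = [
    ("Eddie", "Matt", 10_000),
    ("Eddie", "Deirdre", 15_000),
    ("Eddie", "Bob", 25_000),
    ("Alan", "Sue", 32_000),
    ("Alan", "John", 43_000),
    ("Matt", "Alan", 50_000),
    ("Matt", "Deirdre", 60_000),
    ("Matt", "Bob", 120_000),
]

# One row and one column per customer, in a fixed alphabetical order.
names = sorted({name for s, r, _ in payments for name in (s, r)})
index = {name: i for i, name in enumerate(names)}

# matrix[i][j] is the total amount sent from names[i] to names[j].
matrix = [[0] * len(names) for _ in names]
for sender, receiver, amount in payments:
    matrix[index[sender]][index[receiver]] += amount

print(names)
for name, row in zip(names, matrix):
    print(f"{name:8}", row)
```

Each row then becomes an arc on the circle, and each non-zero cell becomes a ribbon whose width reflects the amount.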

Elasticsearch: Graph

• www.elastic.co/products/graph

• Find connections based on relevance

Recap

• Graphs are good for finding networks and analysing relationships

• Different approaches

• Lots of visualization options

• Get the benefits of using Spark

• We’re hiring!
  • http://www.baesystems.com/en/cybersecurity/careers

• Any questions?

FREEDOM OF INFORMATION ACT

This document (<projectreference><documentnumber>) contains confidential and commercially sensitive material which is provided for the Authority’s internal use only and is not intended for general dissemination. The information contained herein pertains to bodies dealing with security, national security and/or defence matters that would be exempt under Sections 23, 24 and 26 of the Freedom of Information Act 2000 (FOIA). It also consists of information which describes our methodologies, processes and commercial arrangements all of which would be exempt from disclosure under Sections 41 and 43 of the Act. Should the Authority receive any request for disclosure of the information provided in this document, the Authority is requested to notify BAE Systems Applied Intelligence. BAE Systems Applied Intelligence shall provide every assistance to the Authority in complying with its obligations under the Act. BAE Systems Applied Intelligence’s point of contact for FOIA requests is: Chief Counsel Legal Department BAE Systems Applied Intelligence Surrey Research Park Guildford Gu2 7YP Telephone 01483 816082

BAE SYSTEMS Surrey Research Park Guildford Surrey GU2 7YP United Kingdom T: +44 (0)1483 816000 F: +44 (0)1483 816144 Copyright © 2015 BAE Systems. All Rights Reserved. BAE SYSTEMS, the BAE SYSTEMS Logo and the product names referenced herein are trademarks of BAE Systems plc. No part of this document may be copied, reproduced, adapted or redistributed in any form or by any means without the express prior written consent of BAE Systems Applied Intelligence. BAE Systems Applied Intelligence Limited registered in England and Wales Company No. 1337451 with its registered office at Surrey Research Park, Guildford, England, GU2 7YP.
