Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

Transcript
Page 1: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

Traversing our way through Apache Spark GraphFrames and GraphX

Mo Patel
Data Day Texas 2017

Page 2: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

A bit about me

• Currently Deep Learning Practice Director at Teradata
– Road Object Detection & Scene Labeling
– Visual Product Search
– Chatbots
• Previously
– Analytics @ Social Sharing Startup
– Analytics @ Intelligence Community
– Distributed Systems @ Satellite Operations Company
– Software Engineering @ Defense Communications Program
• Research Interests: Distributed Systems for Analytics
• Love snowboarding and outdoor sports in general, and working out to keep doing those things

mopatel

Page 3: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

What is this talk about?

• What are Graphs and what are some interesting things about Graphs?
• What are some Graph Analytics Examples?
• What are GraphFrames?
• What is GraphX?
• How can Graph Analytics help financial companies fight Synthetic Identity Fraud?

Page 4: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

What is a Graph?

Natural | Artificial

(Images: Wikipedia)

Page 6: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

Power of Graphs

• Good: Facebook, Twitter, WhatsApp… most popular social networks
• Bad: MySpace, Friendster, Orkut…

“Nobody goes there anymore. It's too crowded” – Yogi Berra

Page 7: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

Graph Databases cost money, Graph Analytics make money!

• Data Growth: Recall Metcalfe’s Law (n^2) and Reed’s Law (2^n)
• Memory Intensive
• Processing Intensive
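To make the data-growth point concrete, here is a minimal Scala sketch comparing the two growth laws: Metcalfe's Law scales on the order of n^2 and Reed's Law on the order of 2^n. The object and function names are illustrative and not part of the talk.

```scala
object NetworkGrowth {
  // Metcalfe's Law: value scales with possible pairwise links, on the order of n^2.
  def metcalfe(n: Int): BigInt = BigInt(n) * BigInt(n)

  // Reed's Law: value scales with possible subgroups, on the order of 2^n.
  def reed(n: Int): BigInt = BigInt(2).pow(n)

  def main(args: Array[String]): Unit = {
    // Print both quantities side by side for a few network sizes.
    for (n <- Seq(10, 20, 40)) {
      println(s"n=$n  n^2=${metcalfe(n)}  2^n=${reed(n)}")
    }
  }
}
```

Even at n = 40 the subgroup count is already in the trillions, which is one way to read the "memory intensive, processing intensive" bullets.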

Page 8: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

Graph Databases cost money, Graph Analytics make money!

• PageRank, EigenCentrality
• Modularity, Clustering Coefficient, Betweenness, Closeness
• Loopy Belief Propagation, SALSA

Page 9: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

Node Score in a Graph

• Use case: Find out how important an entity is in a graph
– Entity Fraud Detection
– Influencers
– Crime Bosses
• Methods: PageRank, EigenCentrality

PageRank: http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm (Implemented: Spark, Aster, iGraph)
EigenCentrality: http://www.stat.washington.edu/~pdhoff/courses/567/Notes/l6_centrality.pdf (Implemented: Spark, iGraph)

Page 10: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

Communities in a Graph

• Use case: Detect similar nodes
– Behavioral Segmentation
– Crime Rings
– Product Strength & Weakness
• Methods: Modularity, Clustering Coefficient, Betweenness, Closeness

Modularity: https://github.com/gephi/gephi/wiki/Modularity (Implemented: Aster, Gephi)
Clustering Coefficient, Betweenness, Closeness: http://www.stat.washington.edu/~pdhoff/courses/567/Notes/l6_centrality.pdf (Implemented: Spark, iGraph)
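As an illustration of one of the listed methods, the sketch below computes the local clustering coefficient in plain Scala: the number of links among a vertex's neighbors divided by the number of links that could exist among them. The toy edge list and all names are hypothetical, and this is not taken from any of the implementations cited above.

```scala
object ClusteringCoefficient {
  type Graph = Map[String, Set[String]]

  // Build an undirected adjacency map from an edge list.
  def undirected(edges: Seq[(String, String)]): Graph =
    edges
      .flatMap { case (a, b) => Seq(a -> b, b -> a) }
      .groupBy(_._1)
      .map { case (v, pairs) => v -> pairs.map(_._2).toSet }

  // Local clustering coefficient of v: existing links among v's neighbors
  // divided by the k * (k - 1) / 2 links possible among k neighbors.
  def local(g: Graph, v: String): Double = {
    val nbrs = g.getOrElse(v, Set.empty[String]).toSeq
    val k = nbrs.size
    if (k < 2) 0.0
    else {
      val links = nbrs.combinations(2).count { case Seq(a, b) => g(a).contains(b) }
      2.0 * links / (k * (k - 1))
    }
  }

  // Hypothetical toy graph: a triangle a-b-c plus a pendant vertex d attached to c.
  val toy: Graph = undirected(Seq("a" -> "b", "b" -> "c", "a" -> "c", "c" -> "d"))
}
```

In the toy graph, vertex a sits inside a closed triangle, while c also connects to an outsider, so c's coefficient is lower. Tightly knit groups such as crime rings would tend to show high values.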

Page 11: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

Growth in Graph

• Use case: Predict where the graph will grow or suggest new edges
– Event Prediction
– Product Recommendation
• Methods: Loopy Belief Propagation, Belief Networks, SALSA

Loopy Belief Propagation: https://people.csail.mit.edu/fisher/publications/papers/ihler05b.pdf (Implemented: Aster, Markovian)
SALSA: http://www9.org/w9cdrom/175/175.html (Implemented: Aster, Github PageRanking)

Page 12: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

GraphX

• Apache Spark Library for conducting Graph Analytics
• Graph Operations: num[Edges, Vertices], degrees, collectNeighbors
• Graph Analytics:
– PageRank
– Connected Components
– Triangle Count

http://spark.apache.org/graphx/
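The operations named on this slide can be sketched against a toy graph as follows. This assumes a local Spark installation with the GraphX module on the classpath. The four-vertex graph, the application name and the object name are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.graphx.{Edge, EdgeDirection, Graph}

object GraphXSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("graphx-sketch").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical toy graph: GraphX vertex ids are Longs; the attribute is a name.
    val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c"), (4L, "d")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1), Edge(3L, 4L, 1)))
    val graph = Graph(vertices, edges)

    // Graph operations: counts, degrees, neighbors.
    println(s"vertices=${graph.numVertices} edges=${graph.numEdges}")
    graph.degrees.collect().foreach(println)
    graph.collectNeighbors(EdgeDirection.Either).collect().foreach {
      case (id, nbrs) => println(s"$id -> ${nbrs.map(_._1).mkString(",")}")
    }

    // Graph analytics: PageRank (run until changes fall below tolerance 0.01),
    // connected components, and per-vertex triangle counts.
    graph.pageRank(0.01).vertices.collect().foreach(println)
    graph.connectedComponents().vertices.collect().foreach(println)
    graph.triangleCount().vertices.collect().foreach(println)

    spark.stop()
  }
}
```

Each analytics call returns a new graph whose vertex attribute holds the result (rank, component id or triangle count), so the results are read from `.vertices`.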

Page 13: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

Property Graph

Page 14: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

GraphFrame

• SQL-like context is very popular
• Lots of ways to work with Graphs: Cypher, SPARQL, Gremlin…
• Spark introduced DataFrame in February 2015
• Goal: Make it easy for DataFrame users to work with Graphs
• GraphFrame: GraphX & DataFrame Operations

https://graphframes.github.io/index.html

Page 15: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

GraphFrame

Vertices DataFrame

// Each vertex needs an "id" column.
val vertices = sqlContext.createDataFrame(List(
  ("a1", "Wine", "Beverage"),
  ("b2", "Beer", "Beverage"),
  ("c3", "Pretzel", "Snack"),
  ("d4", "Cheese", "Snack")
)).toDF("id", "name", "type")

Edges DataFrame

// GraphFrame expects the edge endpoints in columns named "src" and "dst".
val edges = sqlContext.createDataFrame(List(
  ("a1", "d4", 15455),
  ("b2", "c3", 4849),
  ("a1", "c3", 40),
  ("b2", "d4", 134)
)).toDF("src", "dst", "count")

GraphFrame

import org.graphframes.GraphFrame

val productsGraphFrame = GraphFrame(vertices, edges)

// The string literal must be quoted inside the SQL expression.
productsGraphFrame.vertices.filter("type = 'Snack'")

// Edge count, taken from the edges DataFrame.
productsGraphFrame.edges.count()

Page 16: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

What is Synthetic Identity Fraud?

http://security.frontline.online/article/2014/2/2379-Synthetic-Identity-Fraud

Page 17: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

Why has Synthetic Identity Fraud emerged as a big problem?

Verafin

Page 18: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

How are Synthetic IDs created?

Verafin


Page 19: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

How are Financial Companies exploited?

Verafin

Page 20: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

What is the impact of Synthetic Identity Fraud?

Verafin


Page 21: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

How can Graph Analytics help solve the Synthetic Identity problem?

Customer Address DataFrame

val customerAddresses = sqlContext.createDataFrame(List(
  ("a1", "123 Main Street", "123abc456efg"),
  ("b2", "345 High Street", "123abc456efg"),
  ("c3", "789 Park Ave", "123abc456efg")
)).toDF("id", "address", "customerid")

Add Fake Address

val fakeAddress = sqlContext.createDataFrame(List(
  ("d4", "999 Ocean Ave", "123abc456efg")
)).toDF("id", "address", "customerid")

// Append the fake address to the customer's known addresses.
val tempCustomerAddresses = customerAddresses.union(fakeAddress)

DataBricks Cloud Notebook: http://tiny.cc/ddtx17graphx

Page 22: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

How can Graph Analytics help solve the Synthetic Identity problem?

Master Address Connection Edges DataFrame

val masterAddressConnections = sqlContext.createDataFrame(List(
  ("b2", "a1"), ("e5", "c3"), ("c3", "b2"), ("a1", "c3"), ("e5", "d4") …
)).toDF("src", "dst")

// Keep edges whose destination is one of the customer's address ids.
// The edge columns are "src"/"dst" and hold address ids, so the join is on "id".
val toEdgeMatches = masterAddressConnections.join(customerAddresses, masterAddressConnections("dst") === customerAddresses("id")).select("src", "dst")

// Keep edges whose source is one of the customer's address ids.
val fromEdgeMatches = masterAddressConnections.join(customerAddresses, masterAddressConnections("src") === customerAddresses("id")).select("src", "dst")

val checkEdges = fromEdgeMatches.union(toEdgeMatches)

Detection GraphFrame

val detectionGraphFrame = GraphFrame(tempCustomerAddresses, checkEdges)

PageRank

// PageRank
val resultRanks = detectionGraphFrame.pageRank.resetProbability(0.15).tol(0.01).run()

// Personalized PageRank
val d4Ranks = detectionGraphFrame.pageRank.resetProbability(0.15).maxIter(10).sourceId("d4").run()

resultRanks.vertices.select("id", "pagerank").show()

DataBricks Cloud Notebook: http://tiny.cc/ddtx17graphx

Page 23: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

How do we decide if this address is fraudulent or not?

PageRank

id  pagerank
a1  0.9463535901944437
b2  0.9463535901944437
c3  0.9463535901944437
d4  0.15

Personalized PageRank

Source: a1
id  pagerank
a1  0.33343371928623045
c3  0.28341866139329586
b2  0.21580437563085933
d4  0.0

Source: b2
id  pagerank
b2  0.33343371928623045
a1  0.28341866139329586
c3  0.21580437563085933
d4  0.0

Source: c3
id  pagerank
c3  0.33343371928623045
b2  0.28341866139329586
a1  0.21580437563085933
d4  0.0

Source: d4
id  pagerank
d4  0.15
a1  0.0
b2  0.0
c3  0.0

DataBricks Cloud Notebook: http://tiny.cc/ddtx17graphx
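One way to read these scores is that a vertex nobody links to never rises above the reset probability. The plain-Scala sketch below uses an unnormalized PageRank update, rank(v) = reset + (1 - reset) × sum of rank(u) / outDegree(u) over in-neighbors u, with a fixed number of iterations. It is a simplification for illustration, not the GraphX or GraphFrames implementation. The edge list keeps only the slide's edges among a1, b2 and c3, on the assumption that d4 ends up with no incoming edges in the detection graph.

```scala
object PageRankSketch {
  // Unnormalized PageRank with a fixed iteration count.
  // Every edge endpoint must appear in `vertices`.
  def pageRank(vertices: Seq[String], edges: Seq[(String, String)],
               reset: Double = 0.15, iterations: Int = 50): Map[String, Double] = {
    val outDegree = edges.groupBy(_._1).map { case (src, es) => src -> es.size }
    var ranks = vertices.map(_ -> 1.0).toMap
    for (_ <- 1 to iterations) {
      // Sum the contributions arriving at each destination vertex.
      val incoming = edges
        .map { case (src, dst) => dst -> ranks(src) / outDegree(src) }
        .groupBy(_._1)
        .map { case (dst, cs) => dst -> cs.map(_._2).sum }
      // A vertex with no incoming edges keeps only the reset term.
      ranks = vertices.map(v => v -> (reset + (1 - reset) * incoming.getOrElse(v, 0.0))).toMap
    }
    ranks
  }

  // Assumed detection graph: the real addresses form a cycle; d4 is isolated.
  val ranks: Map[String, Double] =
    pageRank(Seq("a1", "b2", "c3", "d4"), Seq("b2" -> "a1", "c3" -> "b2", "a1" -> "c3"))
}
```

Under these assumptions the three connected addresses settle near 1.0 while d4 stays at the reset probability 0.15. The notebook's 0.946 for a1, b2 and c3 is presumably the same quantity stopped early by `tol(0.01)`. The signal is the gap: the fake address receives no rank from the customer's known address network.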

Page 24: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

Future Directions and Thoughts

• Focus on delivering value over tools and technologies
• Will we settle on a language for Graph Analytics?
• More algorithms in GraphX?
• Large scale Graph Analytics is still not scalable

Page 25: Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

Apache Spark GraphX: http://spark.apache.org/graphx/

Follow me on Twitter (@mopatel) for interesting Deep Learning and Analytics tweets