Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

Traversing our way through Apache Spark GraphFrames and GraphX

Mo Patel
Data Day Texas 2017

A bit about me

Currently: Deep Learning Practice Director at Teradata
  • Road Object Detection & Scene Labeling
  • Visual Product Search
  • Chatbots

Previously:
  • Analytics @ Social Sharing Startup
  • Analytics @ Intelligence Community
  • Distributed Systems @ Satellite Operations Company
  • Software Engineering @ Defense Communications Program

Research Interests: Distributed Systems for Analytics

Love snowboarding and outdoor sports in general, and working out to keep doing those things


What is this talk about?
  • What are Graphs and what are some interesting things about Graphs?
  • What are some Graph Analytics Examples?
  • What are GraphFrames?
  • What is GraphX?
  • How can Graph Analytics help financial companies fight Synthetic Identity Fraud?

What is a Graph?
Natural




Power of Graphs

Graphic Source: http://a16z.com/2016/03/07/all-about-network-effects/ slide 14

Power of Graphs

Good: Facebook, Twitter, WhatsApp (most popular social networks)

Bad: MySpace, Friendster, Orkut
"Nobody goes there anymore. It's too crowded." (Yogi Berra)

Data Growth: Recall Metcalfe's Law (n^2) and Reed's Law (2^n)
  • Memory Intensive
  • Processing Intensive
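To make the two growth rates concrete, here is a small sketch in plain Scala that tabulates n^2 against 2^n for a few network sizes. The object and function names are made up for this illustration, and the formulas are only the orders of growth quoted on the slide, not a model of any real network.

```scala
object NetworkGrowthSketch {
  // Metcalfe's Law as quoted on the slide: growth on the order of n^2.
  def metcalfe(n: Int): BigInt = BigInt(n).pow(2)

  // Reed's Law as quoted on the slide: growth on the order of 2^n.
  def reed(n: Int): BigInt = BigInt(2).pow(n)

  def main(args: Array[String]): Unit = {
    // Print both quantities side by side for a few network sizes.
    for (n <- Seq(10, 20, 40)) {
      println(s"n=$n  n^2=${metcalfe(n)}  2^n=${reed(n)}")
    }
  }
}
```

Even at a few dozen nodes the 2^n term dwarfs n^2, which is one way to read the "memory intensive, processing intensive" point.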

Graph Databases cost money, Graph Analytics make money!

  • PageRank, EigenCentrality
  • Modularity, Clustering Coefficient, Betweenness, Closeness
  • Loopy Belief Propagation, SALSA

Node Score in a Graph

Use case: Find out how important an entity is in a graph
  • Entity Fraud Detection
  • Influencers
  • Crime Bosses

Methods: PageRank, EigenCentrality
  • PageRank: http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm (Implemented: Spark, Aster, iGraph)
  • EigenCentrality: http://www.stat.washington.edu/~pdhoff/courses/567/Notes/l6_centrality.pdf (Implemented: Spark, iGraph)
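As a rough illustration of how a PageRank-style node score behaves, the sketch below runs a fixed number of update rounds in plain Scala. It is not the Spark implementation: the object name, the function signature, and the choice of an unnormalized update (every rank starts at 1.0, and each round sets rank = resetProb + (1 - resetProb) * incoming contributions) are assumptions made for this example.

```scala
object PageRankSketch {
  /** Illustrative PageRank over (src, dst) edges.
    * Assumes every edge endpoint is listed in `vertices`.
    */
  def pageRank(vertices: Seq[String],
               edges: Seq[(String, String)],
               resetProb: Double = 0.15,
               iterations: Int = 20): Map[String, Double] = {
    // Number of outgoing edges per source vertex.
    val outDegree: Map[String, Int] =
      edges.groupBy(_._1).map { case (src, es) => src -> es.size }

    // Every vertex starts with rank 1.0.
    var ranks: Map[String, Double] = vertices.map(v => v -> 1.0).toMap

    for (_ <- 1 to iterations) {
      // Each edge carries rank(src) / outDegree(src) to its destination.
      val incoming: Map[String, Double] = edges
        .map { case (src, dst) => dst -> ranks(src) / outDegree(src) }
        .groupBy(_._1)
        .map { case (dst, contribs) => dst -> contribs.map(_._2).sum }

      // Vertices with no incoming edges keep only the reset term.
      ranks = vertices.map { v =>
        v -> (resetProb + (1 - resetProb) * incoming.getOrElse(v, 0.0))
      }.toMap
    }
    ranks
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical toy graph: "a" and "b" both point at "c".
    val ranks = pageRank(Seq("a", "b", "c"), Seq("a" -> "c", "b" -> "c"))
    ranks.toSeq.sortBy(_._1).foreach { case (id, r) => println(s"$id $r") }
  }
}
```

In the toy graph the vertex that receives links ends up with a higher score than the vertices that only give them, which is the intuition behind using the score to find influential entities.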

Communities in a Graph

Use case: Detect similar nodes
  • Behavioral Segmentation
  • Crime Rings
  • Product Strength & Weakness

Methods: Modularity, Clustering Coefficient, Betweenness, Closeness
  • Modularity: https://github.com/gephi/gephi/wiki/Modularity (Implemented: Aster, Gephi)
  • Clustering Coefficient, Betweenness, Closeness: http://www.stat.washington.edu/~pdhoff/courses/567/Notes/l6_centrality.pdf (Implemented: Spark, iGraph)
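Of the methods listed, the local clustering coefficient is the simplest to show in a few lines: for a vertex, it is the fraction of pairs of its neighbors that are themselves connected. The sketch below is a plain Scala illustration with invented names and a made-up toy graph; it treats edges as undirected and is not taken from any of the libraries named above.

```scala
object ClusteringSketch {
  // Build an undirected adjacency map from (a, b) edge pairs, ignoring self-loops.
  def neighbors(edges: Seq[(String, String)]): Map[String, Set[String]] =
    edges
      .filter { case (a, b) => a != b }
      .flatMap { case (a, b) => Seq(a -> b, b -> a) }
      .groupBy(_._1)
      .map { case (v, pairs) => v -> pairs.map(_._2).toSet }

  // Fraction of a vertex's neighbor pairs that are directly connected to each other.
  def localClustering(v: String, adj: Map[String, Set[String]]): Double = {
    val ns = adj.getOrElse(v, Set.empty[String]).toSeq
    val k = ns.size
    if (k < 2) 0.0 // fewer than two neighbors: no pairs to check
    else {
      val linked = ns.combinations(2).count { case Seq(a, b) => adj(a).contains(b) }
      linked.toDouble / (k * (k - 1) / 2)
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical graph: a triangle a-b-c plus a pendant vertex d attached to a.
    val adj = neighbors(Seq("a" -> "b", "b" -> "c", "c" -> "a", "a" -> "d"))
    Seq("a", "b", "c", "d").foreach(v => println(s"$v ${localClustering(v, adj)}"))
  }
}
```

Vertices inside a tightly knit group score close to 1, while vertices that bridge otherwise unconnected neighbors score lower.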

Growth in Graph

Use case: Predict where the graph will grow, or suggest new edges
  • Event Prediction
  • Product Recommendation

Methods: Loopy Belief Propagation, Belief Networks, SALSA
  • Loopy Belief Propagation: https://people.csail.mit.edu/fisher/publications/papers/ihler05b.pdf (Implemented: Aster, Markovian)
  • SALSA: http://www9.org/w9cdrom/175/175.html (Implemented: Aster, Github PageRanking)

GraphX

Apache Spark library for conducting Graph Analytics

Graph Operations: num[Edges, Vertices], degrees, collectNeighbors

Graph Analytics:
  • PageRank
  • Connected Components
  • Triangle Count

http://spark.apache.org/graphx/
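To show what two of these operations compute, here is a plain Scala sketch of vertex degrees and connected components on a tiny edge list. It deliberately does not use the GraphX API: the object name, function names, and string vertex ids are assumptions for this illustration, and the component label is simply the smallest vertex id in each component.

```scala
object GraphOpsSketch {
  type Edge = (String, String)

  // Degree: how many edge endpoints touch each vertex (in plus out).
  def degrees(edges: Seq[Edge]): Map[String, Int] =
    edges
      .flatMap { case (src, dst) => Seq(src, dst) }
      .groupBy(identity)
      .map { case (v, hits) => v -> hits.size }

  // Connected components by label propagation: both ends of every edge
  // repeatedly adopt the smaller of their two labels until nothing changes.
  // Assumes every edge endpoint is listed in `vertices`.
  def connectedComponents(vertices: Seq[String], edges: Seq[Edge]): Map[String, String] = {
    var label: Map[String, String] = vertices.map(v => v -> v).toMap
    var changed = true
    while (changed) {
      changed = false
      for ((src, dst) <- edges) {
        val smallest = if (label(src) < label(dst)) label(src) else label(dst)
        if (label(src) != smallest || label(dst) != smallest) {
          label = label + (src -> smallest) + (dst -> smallest)
          changed = true
        }
      }
    }
    label
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical graph: three linked vertices and one isolated vertex.
    val vertices = Seq("a1", "b2", "c3", "d4")
    val edges = Seq("b2" -> "a1", "c3" -> "b2", "a1" -> "c3")
    println(degrees(edges))
    println(connectedComponents(vertices, edges))
  }
}
```

An isolated vertex never appears in the degree map and ends up alone in its own component, which is the kind of structural signal the fraud use case later in the talk relies on.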

Property Graph

GraphFrame

  • SQL-like context is very popular
  • Lots of ways to work with Graphs: Cypher, SPARQL, Gremlin...
  • Spark introduced DataFrame in February 2015
  • Goal: Make it easy for DataFrame users to work with Graphs
  • GraphFrame: GraphX & DataFrame Operations

https://graphframes.github.io/index.html

GraphFrame

Vertices DataFrame

    import org.graphframes.GraphFrame

    val vertices = sqlContext.createDataFrame(List(
      ("a1", "Wine", "Beverage"),
      ("b2", "Beer", "Beverage"),
      ("c3", "Pretzel", "Snack"),
      ("d4", "Cheese", "Snack")
    )).toDF("id", "name", "type")

Edges DataFrame

    // GraphFrame expects the edge endpoint columns to be named "src" and "dst".
    val edges = sqlContext.createDataFrame(List(
      ("a1", "d4", 15455),
      ("b2", "c3", 4849),
      ("a1", "c3", 40),
      ("b2", "d4", 134)
    )).toDF("src", "dst", "count")

GraphFrame

    val productsGraphFrame = GraphFrame(vertices, edges)

    // Keep only the snack vertices.
    productsGraphFrame.vertices.filter("type = 'Snack'")

    // Number of edges in the graph.
    productsGraphFrame.edges.count()

What is Synthetic Identity Fraud?


Why has Synthetic Identity Fraud emerged as a big problem?



How are Synthetic IDs created?


How are Financial Companies exploited?



What is the impact of Synthetic Identity Fraud?


How can Graph Analytics help solve the Synthetic Identity problem?

Customer Address DataFrame (vertices)

    val customerAddresses = sqlContext.createDataFrame(List(
      ("a1", "123 Main Street", "123abc456efg"),
      ("b2", "345 High Street", "123abc456efg"),
      ("c3", "789 Park Ave", "123abc456efg")
    )).toDF("id", "address", "customerid")

Add Fake Address

    val fakeAddress = sqlContext.createDataFrame(List(
      ("d4", "999 Ocean Ave", "123abc456efg")
    )).toDF("id", "address", "customerid")

    val tempCustomerAddresses = customerAddresses.union(fakeAddress)

DataBricks Cloud Notebook: http://tiny.cc/ddtx17graphx

How can Graph Analytics help solve the Synthetic Identity problem?

Master Address Connection Edges DataFrame

    val masterAddressConnections = sqlContext.createDataFrame(List(
      ("b2", "a1"), ("e5", "c3"), ("c3", "b2"), ("a1", "c3"), ("e5", "d4")
    )).toDF("src", "dst")

    // Connections whose destination is a known customer address id.
    val toEdgeMatches = masterAddressConnections
      .join(customerAddresses, masterAddressConnections("dst") === customerAddresses("id"))
      .select("src", "dst")

    // Connections whose source is a known customer address id.
    val fromEdgeMatches = masterAddressConnections
      .join(customerAddresses, masterAddressConnections("src") === customerAddresses("id"))
      .select("src", "dst")

    val checkEdges = fromEdgeMatches.union(toEdgeMatches)

Detection GraphFrame

    val detectionGraphFrame = GraphFrame(tempCustomerAddresses, checkEdges)

PageRank

    // PageRank
    val resultRanks = detectionGraphFrame.pageRank.resetProbability(0.15).tol(0.01).run()

    // Personalized PageRank
    val d4Ranks = detectionGraphFrame.pageRank.resetProbability(0.15).maxIter(10).sourceId("d4").run()

    resultRanks.vertices.select("id", "pagerank").show()

How do we decide if this address is fraud or not?

PageRank

    id    pagerank
    a1    0.9463535901944437
    b2    0.9463535901944437
    c3    0.9463535901944437
    d4    0.15

Personalized PageRank

Source: a1

    id    pagerank
    a1    0.33343371928623045
    c3    0.28341866139329586
    b2    0.21580437563085933
    d4    0.0

Source: b2

    id    pagerank
    b2    0.33343371928623045
    a1    0.28341866139329586
    c3    0.21580437563085933
    d4    0.0

Source: c3

    id    pagerank
    c3    0.33343371928623045
    b2    0.28341866139329586
    a1    0.21580437563085933
    d4    0.0

Source: d4

    id    pagerank
    d4    0.15
    a1    0.0
    b2    0.0
    c3    0.0
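One possible way to turn the Personalized PageRank values above into a decision is sketched below in plain Scala. The rule itself is an assumption for illustration, not something the talk states: an address is treated as suspicious when no other address assigns it a non-zero personalized score. The object and function names are hypothetical; the numbers are copied from the Personalized PageRank results listed above.

```scala
object FraudFlagSketch {
  // Personalized PageRank scores from the talk: source id -> (vertex id -> score).
  val personalized: Map[String, Map[String, Double]] = Map(
    "a1" -> Map("a1" -> 0.33343371928623045, "c3" -> 0.28341866139329586,
                "b2" -> 0.21580437563085933, "d4" -> 0.0),
    "b2" -> Map("b2" -> 0.33343371928623045, "a1" -> 0.28341866139329586,
                "c3" -> 0.21580437563085933, "d4" -> 0.0),
    "c3" -> Map("c3" -> 0.33343371928623045, "b2" -> 0.28341866139329586,
                "a1" -> 0.21580437563085933, "d4" -> 0.0),
    "d4" -> Map("d4" -> 0.15, "a1" -> 0.0, "b2" -> 0.0, "c3" -> 0.0)
  )

  // Hypothetical rule: true when every *other* source gives this id a score of zero.
  def unreachableFromOthers(id: String,
                            scores: Map[String, Map[String, Double]]): Boolean =
    scores.forall { case (source, ranks) =>
      source == id || ranks.getOrElse(id, 0.0) == 0.0
    }

  // All ids that satisfy the rule, in sorted order.
  def suspicious(scores: Map[String, Map[String, Double]]): Seq[String] =
    scores.keys.toSeq.sorted.filter(id => unreachableFromOthers(id, scores))

  def main(args: Array[String]): Unit =
    println(suspicious(personalized))
}
```

With these numbers only d4, the address added as the fake one, is flagged. A real system would likely need a threshold rather than an exact zero, and more signals than a single score.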

Future Directions and Thoughts
  • Focus on delivering value over tools and technologies
  • Will we settle on a language for Graph Analytics?
  • More algorithms in GraphX?
  • Large scale Graph Analytics is still not scalable

Apache Spark GraphX: http://spark.apache.org/graphx/

Follow me on Twitter (@mopatel) for interesting Deep Learning and Analytics tweets