Apache Spark GraphX & GraphFrame Synthetic ID Fraud Use Case

Date posted: 25-Jan-2017

Traversing our way through Apache Spark GraphFrames and GraphX
Mo Patel
Data Day Texas 2017

A bit about me

Currently: Deep Learning Practice Director at Teradata
  • Road Object Detection & Scene Labeling
  • Visual Product Search
  • Chatbots

Previously:
  • Analytics @ Social Sharing Startup
  • Analytics @ Intelligence Community
  • Distributed Systems @ Satellite Operations Company
  • Software Engineering @ Defense Communications Program

Research Interests: Distributed Systems for Analytics

Love snowboarding and outdoor sports in general, and working out to keep doing those things

mopatel

What is this talk about?
  • What are Graphs and what are some interesting things about Graphs?
  • What are some Graph Analytics Examples?
  • What are GraphFrames?
  • What is GraphX?
  • How can Graph Analytics help financial companies fight Synthetic Identity Fraud?

What is a Graph?

Natural

Artificial

Wikipedia

Power of Graphs

Graphic Source: http://a16z.com/2016/03/07/all-about-network-effects/ slide 14

Power of Graphs

Good: Facebook, Twitter, WhatsApp (most popular social networks)

Bad: MySpace, Friendster, Orkut
"Nobody goes there anymore. It's too crowded." (Yogi Berra)

Data Growth: Recall Metcalfe's Law (n^2) and Reed's Law (2^n)
  • Memory Intensive
  • Processing Intensive
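
To make the gap between the two growth laws concrete, here is a minimal sketch in plain Scala. The object and function names are illustrative and not from the talk; the formulas are the n^2 and 2^n orders of growth quoted on the slide.

```scala
object NetworkGrowth {
  // Metcalfe's Law: value grows with the number of possible pairwise links, on the order of n^2.
  def metcalfe(n: Int): BigInt = BigInt(n) * BigInt(n)

  // Reed's Law: value grows with the number of possible subgroups, on the order of 2^n.
  def reed(n: Int): BigInt = BigInt(2).pow(n)

  def main(args: Array[String]): Unit = {
    // BigInt is used because 2^n overflows Long once n reaches 63.
    for (n <- Seq(10, 20, 40, 80)) {
      println(s"n=$n  n^2=${metcalfe(n)}  2^n=${reed(n)}")
    }
  }
}
```

Even at a few dozen nodes the subgroup count dwarfs the pair count, which is one way to read the slide's "memory intensive, processing intensive" warning.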

Graph Databases cost money, Graph Analytics make money!

  • PageRank, EigenCentrality
  • Modularity, Clustering Coefficient, Betweenness, Closeness
  • Loopy Belief Propagation, SALSA

Node Score in a Graph

Use case: Find out how important an entity is in a graph
  • Entity Fraud Detection
  • Influencers
  • Crime Bosses

Methods: PageRank, EigenCentrality

PageRank: http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm (Implemented: Spark, Aster, iGraph)
EigenCentrality: http://www.stat.washington.edu/~pdhoff/courses/567/Notes/l6_centrality.pdf (Implemented: Spark, iGraph)
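
As a rough illustration of how a node score can be computed, the sketch below runs an iterative PageRank in plain Scala. It is not the Spark, Aster, or iGraph implementation; the object name, the parameter names, and the tiny example graph are assumptions made for illustration, and every edge endpoint is assumed to appear in the vertex list.

```scala
object PageRankSketch {
  // Iterative PageRank over a directed edge list; the ranks sum to 1.
  def pageRank(
      vertices: Seq[String],
      edges: Seq[(String, String)],
      resetProbability: Double = 0.15,
      iterations: Int = 50
  ): Map[String, Double] = {
    val n = vertices.size
    val outLinks: Map[String, Seq[String]] =
      edges.groupBy(_._1).map { case (src, es) => src -> es.map(_._2) }
    var ranks: Map[String, Double] = vertices.map(v => v -> 1.0 / n).toMap

    for (_ <- 1 to iterations) {
      // Rank held by vertices without out-links is spread evenly over all vertices.
      val danglingMass = vertices.filterNot(outLinks.contains).map(ranks).sum
      val received = scala.collection.mutable.Map[String, Double]().withDefaultValue(0.0)
      for ((src, dsts) <- outLinks; dst <- dsts) {
        received(dst) += ranks(src) / dsts.size
      }
      ranks = vertices.map { v =>
        v -> (resetProbability / n + (1 - resetProbability) * (received(v) + danglingMass / n))
      }.toMap
    }
    ranks
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical graph: a, b and d all point at c, and c points back at a.
    val ranks = pageRank(Seq("a", "b", "c", "d"), Seq("a" -> "c", "b" -> "c", "d" -> "c", "c" -> "a"))
    ranks.toSeq.sortBy(-_._2).foreach { case (v, r) => println(s"$v $r") }
  }
}
```

In the example graph, c collects links from three vertices and ends up with the highest score, which is the "important entity" signal the use cases above rely on.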

Communities in a Graph

Use case: Detect similar nodes
  • Behavioral Segmentation
  • Crime Rings
  • Product Strength & Weakness

Methods: Modularity, Clustering Coefficient, Betweenness, Closeness

Modularity: https://github.com/gephi/gephi/wiki/Modularity (Implemented: Aster, Gephi)
Clustering Coefficient, Betweenness, Closeness: http://www.stat.washington.edu/~pdhoff/courses/567/Notes/l6_centrality.pdf (Implemented: Spark, iGraph)
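
Of the community measures listed, the clustering coefficient is the simplest to show in a few lines. The following plain-Scala sketch computes the local clustering coefficient of a vertex in an undirected graph; the names and the example edges are hypothetical, and self-loops are assumed to be absent.

```scala
object ClusteringSketch {
  type Graph = Map[String, Set[String]]

  // Build an undirected adjacency map from an edge list.
  def undirected(edges: Seq[(String, String)]): Graph =
    edges
      .flatMap { case (a, b) => Seq(a -> b, b -> a) }
      .groupBy(_._1)
      .map { case (v, pairs) => v -> pairs.map(_._2).toSet }

  // Fraction of a vertex's neighbor pairs that are themselves connected.
  def localClustering(g: Graph, v: String): Double = {
    val neighbors = g.getOrElse(v, Set.empty[String]).toSeq
    val k = neighbors.size
    if (k < 2) 0.0
    else {
      val linkedPairs = neighbors.combinations(2).count {
        case Seq(a, b) => g(a).contains(b)
      }
      linkedPairs.toDouble / (k * (k - 1) / 2)
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical graph: a triangle a-b-c with one extra vertex d hanging off c.
    val g = undirected(Seq("a" -> "b", "b" -> "c", "a" -> "c", "c" -> "d"))
    Seq("a", "c", "d").foreach(v => println(s"$v ${localClustering(g, v)}"))
  }
}
```

A tightly knit group (such as a crime ring) shows up as vertices whose neighbors are mostly connected to each other, that is, coefficients close to 1.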

Growth in Graph

Use case: Predict where the graph will grow or suggest new edges
  • Event Prediction
  • Product Recommendation

Methods: Loopy Belief Propagation, Belief Networks, SALSA

Loopy Belief Propagation: https://people.csail.mit.edu/fisher/publications/papers/ihler05b.pdf (Implemented: Aster, Markovian)
SALSA: http://www9.org/w9cdrom/175/175.html (Implemented: Aster, Github PageRanking)

GraphX

Apache Spark Library for conducting Graph Analytics

Graph Operations: num[Edges, Vertices], degrees, collectNeighbors

Graph Analytics:
  • PageRank
  • Connected Components
  • Triangle Count

http://spark.apache.org/graphx/
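
To give a feel for what Connected Components produces, here is a plain-Scala sketch that does not use Spark. It labels each vertex with the smallest vertex id in its component, which is the labeling convention GraphX documents for connectedComponents; the propagation loop, the object name, and the example data are illustrative assumptions rather than GraphX's actual Pregel-based code.

```scala
object ConnectedComponentsSketch {
  // Label every vertex with the smallest vertex id reachable from it,
  // treating edges as undirected, by repeatedly propagating minimum labels.
  def connectedComponents(vertices: Seq[Long], edges: Seq[(Long, Long)]): Map[Long, Long] = {
    var labels: Map[Long, Long] = vertices.map(v => v -> v).toMap
    var changed = true
    while (changed) {
      changed = false
      for ((a, b) <- edges) {
        val low = math.min(labels(a), labels(b))
        if (labels(a) != low || labels(b) != low) {
          labels = labels + (a -> low) + (b -> low)
          changed = true
        }
      }
    }
    labels
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical graph: {1, 2, 3} and {4, 5} are connected groups, 6 is isolated.
    val labels = connectedComponents(Seq(1L, 2L, 3L, 4L, 5L, 6L), Seq((1L, 2L), (2L, 3L), (4L, 5L)))
    labels.toSeq.sortBy(_._1).foreach { case (v, c) => println(s"vertex $v -> component $c") }
  }
}
```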

Property Graph

GraphFrame
  • SQL-like context is very popular
  • Lots of ways to work with Graphs: Cypher, SPARQL, Gremlin, ...
  • Spark introduced DataFrame in February 2015
  • Goal: Make it easy for DataFrame users to work with Graphs
  • GraphFrame: GraphX & DataFrame Operations

https://graphframes.github.io/index.html

GraphFrame

Vertices DataFrame

import org.graphframes.GraphFrame

val vertices = sqlContext.createDataFrame(List(
  ("a1", "Wine", "Beverage"),
  ("b2", "Beer", "Beverage"),
  ("c3", "Pretzel", "Snack"),
  ("d4", "Cheese", "Snack")
)).toDF("id", "name", "type")

Edges DataFrame

// GraphFrame expects the edge endpoint columns to be named "src" and "dst"
val edges = sqlContext.createDataFrame(List(
  ("a1", "d4", 15455),
  ("b2", "c3", 4849),
  ("a1", "c3", 40),
  ("b2", "d4", 134)
)).toDF("src", "dst", "count")

GraphFrame

val productsGraphFrame = GraphFrame(vertices, edges)

// Vertices whose type is Snack
productsGraphFrame.vertices.filter("type = 'Snack'")

// Number of edges, counted on the edges DataFrame
productsGraphFrame.edges.count()

What is Synthetic Identity Fraud?

http://security.frontline.online/article/2014/2/2379-Synthetic-Identity-Fraud

Why has Synthetic Identity Fraud emerged as a big problem?

Verafin

How are Synthetic IDs created?

Verafin

How are Financial Companies exploited?

Verafin

What is the impact of Synthetic Identity Fraud?

Verafin

How can Graph Analytics help solve the Synthetic Identity problem?

Customer Address DataFrame

// Known customer addresses; these rows become the graph's vertices
val customerAddresses = sqlContext.createDataFrame(List(
  ("a1", "123 Main Street", "123abc456efg"),
  ("b2", "345 High Street", "123abc456efg"),
  ("c3", "789 Park Ave", "123abc456efg")
)).toDF("id", "address", "customerid")

Add Fake Address

val fakeAddress = sqlContext.createDataFrame(List(
  ("d4", "999 Ocean Ave", "123abc456efg")
)).toDF("id", "address", "customerid")

val tempCustomerAddresses = customerAddresses.union(fakeAddress)

DataBricks Cloud Notebook: http://tiny.cc/ddtx17graphx

How can Graph Analytics help solve the Synthetic Identity problem?

Master Address Connection Edges DataFrame

val masterAddressConnections = sqlContext.createDataFrame(List(
  ("b2", "a1"), ("e5", "c3"), ("c3", "b2"), ("a1", "c3"), ("e5", "d4")
)).toDF("src", "dst")

// Connections whose destination is a known customer address (matched on id)
val toEdgeMatches = masterAddressConnections
  .join(customerAddresses, masterAddressConnections("dst") === customerAddresses("id"))
  .select("src", "dst")

// Connections whose source is a known customer address (matched on id)
val fromEdgeMatches = masterAddressConnections
  .join(customerAddresses, masterAddressConnections("src") === customerAddresses("id"))
  .select("src", "dst")

val checkEdges = fromEdgeMatches.union(toEdgeMatches)

Detection GraphFrame

val detectionGraphFrame = GraphFrame(tempCustomerAddresses, checkEdges)

PageRank

// PageRank
val resultRanks = detectionGraphFrame.pageRank.resetProbability(0.15).tol(0.01).run()

// Personalized PageRank
val d4Ranks = detectionGraphFrame.pageRank.resetProbability(0.15).maxIter(10).sourceId("d4").run()

resultRanks.vertices.select("id", "pagerank").show()

How do we decide if this address is fraud or not?

PageRank

id   pagerank
a1   0.9463535901944437
b2   0.9463535901944437
c3   0.9463535901944437
d4   0.15

Personalized PageRank

Source a1:
id   pagerank
a1   0.33343371928623045
c3   0.28341866139329586
b2   0.21580437563085933
d4   0.0

Source b2:
id   pagerank
b2   0.33343371928623045
a1   0.28341866139329586
c3   0.21580437563085933
d4   0.0

Source c3:
id   pagerank
c3   0.33343371928623045
b2   0.28341866139329586
a1   0.21580437563085933
d4   0.0

Source d4:
id   pagerank
d4   0.15
a1   0.0
b2   0.0
c3   0.0
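
The pattern in these tables (d4 scores zero from every known address, and every known address scores zero from d4) can be reproduced with a small random-walk-with-restart sketch in plain Scala. This is an illustrative formulation, not the GraphFrames or GraphX code, so its numbers are not expected to match the tables above. The object and parameter names are hypothetical, and the edge list is an assumption: only the three connections among a1, b2 and c3 are used.

```scala
object PersonalizedPageRankSketch {
  // Random walk with restart: with probability resetProbability the walker
  // jumps back to `source`; otherwise it follows a random out-link.
  def personalizedPageRank(
      vertices: Seq[String],
      edges: Seq[(String, String)],
      source: String,
      resetProbability: Double = 0.15,
      iterations: Int = 100
  ): Map[String, Double] = {
    val outLinks = edges.groupBy(_._1).map { case (src, es) => src -> es.map(_._2) }
    var ranks: Map[String, Double] =
      vertices.map(v => v -> (if (v == source) 1.0 else 0.0)).toMap

    for (_ <- 1 to iterations) {
      val received = scala.collection.mutable.Map[String, Double]().withDefaultValue(0.0)
      for ((src, dsts) <- outLinks; dst <- dsts) {
        received(dst) += ranks(src) / dsts.size
      }
      // A walker stranded on a vertex without out-links also restarts at the source.
      val stranded = vertices.filterNot(outLinks.contains).map(ranks).sum
      ranks = vertices.map { v =>
        val restart =
          if (v == source) resetProbability + (1 - resetProbability) * stranded else 0.0
        v -> (restart + (1 - resetProbability) * received(v))
      }.toMap
    }
    ranks
  }

  def main(args: Array[String]): Unit = {
    val vertices = Seq("a1", "b2", "c3", "d4")
    // Assumed edges: the connections among the three known customer addresses.
    val edges = Seq("b2" -> "a1", "c3" -> "b2", "a1" -> "c3")
    for (source <- vertices) {
      println(s"source $source: ${personalizedPageRank(vertices, edges, source)}")
    }
  }
}
```

Because no walk that starts at a known address can ever reach d4, and no walk that starts at d4 can reach a known address, the fake address stands apart from the rest of the graph, which is the signal the slide uses to flag it.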

Future Directions and Thoughts
  • Focus on delivering value over tools and technologies
  • Will we settle on a language for Graph Analytics?
  • More algorithms in GraphX?
  • Large scale Graph Analytics is still not scalable

Apache Spark GraphX: http://spark.apache.org/graphx/

Follow me on Twitter (@mopatel) for interesting Deep Learning and Analytics tweets