Traversing our way through Apache Spark GraphFrames and GraphX
Mo Patel
Data Day Texas 2017
A bit about me
• Currently Deep Learning Practice Director at Teradata
  – Road Object Detection & Scene Labeling
  – Visual Product Search
  – Chatbots
• Previously
  – Analytics @ Social Sharing Startup
  – Analytics @ Intelligence Community
  – Distributed Systems @ Satellite Operations Company
  – Software Engineering @ Defense Communications Program
• Research Interests: Distributed Systems for Analytics
• Love snowboarding and outdoor sports in general, and working out to keep doing those things
mopatel
What is this talk about?
• What are Graphs and what are some interesting things about Graphs?
• What are some Graph Analytics Examples?
• What are GraphFrames?
• What is GraphX?
• How can Graph Analytics help financial companies fight Synthetic Identity Fraud?
What is a Graph?
(Figure: a natural graph and an artificial graph; image sources: Wikipedia)
Power of Graphs
Graphic Source: http://a16z.com/2016/03/07/all-about-network-effects/ slide 14
Power of Graphs
• Good: Facebook, Twitter, WhatsApp… most popular social networks
• Bad: MySpace, Friendster, Orkut…
  “Nobody goes there anymore. It's too crowded” – Yogi Berra
• Data Growth: Recall Metcalfe’s Law (n^2) and Reed’s Law (2^n)
• Memory Intensive
• Processing Intensive
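To make the data-growth point concrete, here is a minimal sketch comparing the two growth laws for a few network sizes. The object and method names are illustrative, not taken from the talk.

```scala
object GrowthLaws {
  // Metcalfe's law: value grows with the number of possible pairwise links, on the order of n^2.
  def metcalfe(n: Int): BigInt = BigInt(n).pow(2)

  // Reed's law: value grows with the number of possible subgroups, on the order of 2^n.
  def reed(n: Int): BigInt = BigInt(2).pow(n)

  def main(args: Array[String]): Unit =
    for (n <- Seq(10, 20, 40))
      println(s"n=$n  n^2=${metcalfe(n)}  2^n=${reed(n)}")
}
```

Even at n = 40 the exponential term is already in the trillions, which is one way to see why graph workloads become memory and processing intensive so quickly.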
Graph Databases cost money, Graph Analytics make money!
• PageRank, EigenCentrality
• Modularity, Clustering Coefficient, Betweenness, Closeness
• Loopy Belief Propagation, SALSA
Node Score in a Graph
• Use case: Find out how important an entity is in a graph
  – Entity Fraud Detection
  – Influencers
  – Crime Bosses
• Methods: PageRank, EigenCentrality
PageRank: http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm (Implemented: Spark, Aster, iGraph)
EigenCentrality: http://www.stat.washington.edu/~pdhoff/courses/567/Notes/l6_centrality.pdf (Implemented: Spark, iGraph)
Communities in a Graph
• Use case: Detect similar nodes
  – Behavioral Segmentation
  – Crime Rings
  – Product Strength & Weakness
• Methods: Modularity, Clustering Coefficient, Betweenness, Closeness
Modularity: https://github.com/gephi/gephi/wiki/Modularity (Implemented: Aster, Gephi)
Clustering Coefficient, Betweenness, Closeness: http://www.stat.washington.edu/~pdhoff/courses/567/Notes/l6_centrality.pdf (Implemented: Spark, iGraph)
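As an illustration of one of the community measures listed above, the following is a minimal plain-Scala sketch of the local clustering coefficient: the fraction of a vertex's neighbor pairs that are themselves connected. It assumes an undirected graph with no self-loops. The names are hypothetical, and this is not the Spark or iGraph implementation.

```scala
object ClusteringSketch {
  type Graph = Map[String, Set[String]] // undirected adjacency sets

  // Build symmetric adjacency sets from an undirected edge list.
  def fromEdges(edges: Seq[(String, String)]): Graph =
    edges.flatMap { case (a, b) => Seq(a -> b, b -> a) }
      .groupBy(_._1)
      .map { case (v, pairs) => v -> pairs.map(_._2).toSet }

  // Fraction of a vertex's neighbor pairs that are directly linked to each other.
  def localClustering(g: Graph, v: String): Double = {
    val nbrs = g.getOrElse(v, Set.empty[String]).toSeq
    val k = nbrs.size
    if (k < 2) 0.0 // fewer than two neighbors: no pairs to check
    else {
      val links = nbrs.combinations(2).count { case Seq(a, b) => g(a).contains(b) }
      links.toDouble / (k * (k - 1) / 2.0)
    }
  }

  // A triangle a-b-c with one pendant vertex d attached to a.
  val example: Graph = fromEdges(Seq("a" -> "b", "b" -> "c", "a" -> "c", "a" -> "d"))
}
```

In the example graph, b sits inside a fully connected neighborhood, while a's coefficient is diluted by the pendant vertex d. Tightly knit groups such as crime rings tend to show up as clusters of high-coefficient vertices.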
Growth in Graph
• Use case: Predict where the graph will grow, or suggest new edges
  – Event Prediction
  – Product Recommendation
• Methods: Loopy Belief Propagation, Belief Networks, SALSA
Loopy Belief Propagation: https://people.csail.mit.edu/fisher/publications/papers/ihler05b.pdf (Implemented: Aster, Markovian)
SALSA: http://www9.org/w9cdrom/175/175.html (Implemented: Aster, Github PageRanking)
GraphX
• Apache Spark library for conducting Graph Analytics
• Graph Operations: numEdges, numVertices, degrees, collectNeighbors
• Graph Analytics:
  – PageRank
  – Connected Components
  – Triangle Counting
http://spark.apache.org/graphx/
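To show what two of these operations compute, here is a plain-Scala sketch (no Spark required) of vertex degrees and connected components on a small edge list. It follows the GraphX convention of labeling each component with its lowest vertex id. The object and method names are illustrative and are not the GraphX API, which runs the same idea in a distributed fashion.

```scala
object GraphOpsSketch {
  type Edge = (Long, Long)

  // Degree of each vertex: the number of edge endpoints that touch it.
  def degrees(edges: Seq[Edge]): Map[Long, Int] =
    edges.flatMap { case (s, d) => Seq(s, d) }
      .groupBy(identity)
      .map { case (v, xs) => v -> xs.size }

  // Connected components: repeatedly push the smallest label seen across each
  // edge (treated as undirected) until no label changes.
  // Assumes every edge endpoint is present in `vertices`.
  def connectedComponents(vertices: Set[Long], edges: Seq[Edge]): Map[Long, Long] = {
    var label: Map[Long, Long] = vertices.map(v => v -> v).toMap
    var changed = true
    while (changed) {
      changed = false
      for ((s, d) <- edges) {
        val m = math.min(label(s), label(d))
        if (label(s) != m || label(d) != m) {
          label = label.updated(s, m).updated(d, m)
          changed = true
        }
      }
    }
    label
  }
}
```

For vertices 1 to 5 with edges (1,2), (2,3) and (4,5), the sketch yields two components labeled 1 and 4.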
Property Graph
GraphFrame
• SQL-like context is very popular
• Lots of ways to work with Graphs: Cypher, SPARQL, Gremlin…
• Spark introduced DataFrame in February 2015
• Goal: Make it easy for DataFrame users to work with Graphs
• GraphFrame: GraphX & DataFrame Operations
https://graphframes.github.io/index.html
GraphFrame
Vertices DataFrame
  import org.graphframes.GraphFrame

  val vertices = sqlContext.createDataFrame(List(
    ("a1", "Wine", "Beverage"), ("b2", "Beer", "Beverage"),
    ("c3", "Pretzel", "Snack"), ("d4", "Cheese", "Snack")
  )).toDF("id", "name", "type")
Edges DataFrame
  // GraphFrame expects the edge endpoint columns to be named "src" and "dst"
  val edges = sqlContext.createDataFrame(List(
    ("a1", "d4", 15455), ("b2", "c3", 4849), ("a1", "c3", 40), ("b2", "d4", 134)
  )).toDF("src", "dst", "count")
GraphFrame
  val productsGraphFrame = GraphFrame(vertices, edges)
  // Vertices are a regular DataFrame, so SQL-style filters apply
  productsGraphFrame.vertices.filter("type = 'Snack'")
  // Number of edges in the graph
  productsGraphFrame.edges.count()
What is Synthetic Identity Fraud?
http://security.frontline.online/article/2014/2/2379-Synthetic-Identity-Fraud
Why has Synthetic Identity Fraud emerged as a big problem?
(Figure; source: Verafin)
How are Synthetic IDs created?
(Figures; source: Verafin)
How are Financial Companies exploited?
(Figure; source: Verafin)
What is the impact of Synthetic Identity Fraud?
(Figures; source: Verafin)
How can Graph Analytics help solve the Synthetic Identity problem?
Customer Address DataFrame
  val customerAddresses = sqlContext.createDataFrame(List(
    ("a1", "123 Main Street", "123abc456efg"),
    ("b2", "345 High Street", "123abc456efg"),
    ("c3", "789 Park Ave", "123abc456efg")
  )).toDF("id", "address", "customerid")
Add Fake Address
  val fakeAddress = sqlContext.createDataFrame(List(
    ("d4", "999 Ocean Ave", "123abc456efg")
  )).toDF("id", "address", "customerid")

  // Candidate vertex set: the known addresses plus the address under test
  val tempCustomerAddresses = customerAddresses.union(fakeAddress)
DataBricks Cloud Notebook: http://tiny.cc/ddtx17graphx
How can Graph Analytics help solve the Synthetic Identity problem?
Master Address Connection Edges DataFrame
  val masterAddressConnections = sqlContext.createDataFrame(List(
    ("b2", "a1"), ("e5", "c3"), ("c3", "b2"), ("a1", "c3"), ("e5", "d4") …
  )).toDF("src", "dst")

  // The edge columns are "src" and "dst" and hold address ids, so match them against customerAddresses("id")
  // Edges whose destination is one of the customer's known addresses
  val toEdgeMatches = masterAddressConnections.join(customerAddresses, masterAddressConnections("dst") === customerAddresses("id")).select("src", "dst")
  // Edges whose source is one of the customer's known addresses
  val fromEdgeMatches = masterAddressConnections.join(customerAddresses, masterAddressConnections("src") === customerAddresses("id")).select("src", "dst")
  val checkEdges = fromEdgeMatches.union(toEdgeMatches)
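The effect of the two joins can be traced by hand with plain Scala collections. This sketch assumes the joins match edge endpoints against the ids of the customer's known addresses (a1, b2, c3), and it uses only the five edges spelled out on the slide; the elided ones are left out rather than guessed. The object name is illustrative.

```scala
object EdgeMatchSketch {
  // Ids of the customer's known addresses, as listed in customerAddresses.
  val customerIds: Set[String] = Set("a1", "b2", "c3")

  // Only the five master edges shown on the slide.
  val masterEdges: Seq[(String, String)] =
    Seq("b2" -> "a1", "e5" -> "c3", "c3" -> "b2", "a1" -> "c3", "e5" -> "d4")

  // Edges whose destination is a known customer address.
  val toEdgeMatches: Seq[(String, String)] =
    masterEdges.filter { case (_, dst) => customerIds(dst) }

  // Edges whose source is a known customer address.
  val fromEdgeMatches: Seq[(String, String)] =
    masterEdges.filter { case (src, _) => customerIds(src) }

  // Like DataFrame.union, ++ keeps duplicate rows.
  val checkEdges: Seq[(String, String)] = fromEdgeMatches ++ toEdgeMatches

  def main(args: Array[String]): Unit = checkEdges.foreach(println)
}
```

Under these assumptions the edge ("e5", "d4") matches neither filter, so the address under test, d4, ends up with no edges in the detection graph.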
Detection GraphFrame: PageRank
  val detectionGraphFrame = GraphFrame(tempCustomerAddresses, checkEdges)

  // PageRank, run until convergence within the given tolerance
  val resultRanks = detectionGraphFrame.pageRank.resetProbability(0.15).tol(0.01).run()
  // Personalized PageRank from source vertex d4, fixed number of iterations
  val d4Ranks = detectionGraphFrame.pageRank.resetProbability(0.15).maxIter(10).sourceId("d4").run()

  resultRanks.vertices.select("id", "pagerank").show()
How do we decide if this address is fraud or not?
PageRank
  id   pagerank
  a1   0.9463535901944437
  b2   0.9463535901944437
  c3   0.9463535901944437
  d4   0.15
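The slide leaves the decision rule implicit. One plausible reading is that d4 stands out because its score equals the reset probability passed to pageRank (0.15), meaning it received no rank from any neighbor. The sketch below encodes that reading as a hypothetical threshold rule; the rule, the tolerance and the names are assumptions for illustration, not the author's method.

```scala
object FlagSketch {
  // Reset probability used in the PageRank call on the earlier slide.
  val resetProbability: Double = 0.15

  // Scores copied from the PageRank table above.
  val ranks: Seq[(String, Double)] = Seq(
    "a1" -> 0.9463535901944437,
    "b2" -> 0.9463535901944437,
    "c3" -> 0.9463535901944437,
    "d4" -> 0.15
  )

  // Hypothetical rule: flag addresses whose score is no higher than the reset
  // probability, i.e. addresses that gained nothing from the address graph.
  def suspicious(scores: Seq[(String, Double)], eps: Double = 1e-9): Seq[String] =
    scores.collect { case (id, rank) if rank <= resetProbability + eps => id }

  def main(args: Array[String]): Unit =
    println(suspicious(ranks)) // prints List(d4)
}
```

In practice a fixed threshold would likely need tuning against known fraud cases; the point here is only that a disconnected address is easy to separate from well-connected ones.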
Personalized PageRank
Source: a1
  id   pagerank
  a1   0.3334337192862304
  c3   0.2834186613932958
  b2   0.2158043756308593
  d4   0.0
Source: b2
  id   pagerank
  b2   0.3334337192862304
  a1   0.2834186613932958
  c3   0.2158043756308593
  d4   0.0
Source: c3
  id   pagerank
  c3   0.3334337192862304
  b2   0.2834186613932958
  a1   0.2158043756308593
  d4   0.0
Source: d4
  id   pagerank
  d4   0.15
  a1   0.0
  b2   0.0
  c3   0.0
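To see why an isolated vertex such as d4 ends up at 0.15 in both the global and the personalized runs, here is a minimal plain-Scala sketch of PageRank by fixed iteration. It assumes the un-normalized update rank(v) = reset(v) + (1 - resetProb) * sum(rank(src) / outDegree(src)), with personalized runs starting all rank on the source and resetting only to it. It uses just the three cycle edges among a1, b2 and c3, so it is not expected to reproduce the exact values in the tables above, and it is not the GraphFrames implementation. All names are illustrative.

```scala
object PageRankSketch {
  // Assumes every edge endpoint appears in `vertices`.
  def pageRank(
      vertices: Seq[String],
      edges: Seq[(String, String)],
      resetProb: Double,
      iters: Int,
      source: Option[String] = None): Map[String, Double] = {
    val outDeg: Map[String, Int] = edges.groupBy(_._1).map { case (v, es) => v -> es.size }

    // Global runs reset to every vertex; personalized runs reset only to the source.
    def reset(v: String): Double = source match {
      case Some(s) => if (v == s) resetProb else 0.0
      case None    => resetProb
    }

    // Global runs start every vertex at 1.0; personalized runs start only the source at 1.0.
    var ranks: Map[String, Double] =
      vertices.map(v => v -> (if (source.forall(_ == v)) 1.0 else 0.0)).toMap

    for (_ <- 1 to iters) {
      // Rank flowing into each destination, split evenly over each source's out-edges.
      val incoming: Map[String, Double] = edges.groupBy(_._2).map { case (dst, es) =>
        dst -> es.map { case (src, _) => ranks(src) / outDeg(src) }.sum
      }
      ranks = vertices.map(v => v -> (reset(v) + (1 - resetProb) * incoming.getOrElse(v, 0.0))).toMap
    }
    ranks
  }

  val vertices: Seq[String] = Seq("a1", "b2", "c3", "d4")
  val edges: Seq[(String, String)] = Seq("b2" -> "a1", "a1" -> "c3", "c3" -> "b2")

  val global: Map[String, Double] = pageRank(vertices, edges, 0.15, 10)
  val fromD4: Map[String, Double] = pageRank(vertices, edges, 0.15, 10, Some("d4"))
}
```

With no inbound edges, d4 only ever receives the reset term, and with no outbound edges a walk personalized on d4 never reaches the real addresses. That matches the pattern in the d4 table: the fake address scores 0.15 and every real address scores 0.0.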
Future Directions and Thoughts
• Focus on delivering value over tools and technologies
• Will we settle on a language for Graph Analytics?
• More algorithms in GraphX?
• Large-scale Graph Analytics is still not scalable
Apache Spark GraphX: http://spark.apache.org/graphx/
Follow me on Twitter (@mopatel) for interesting Deep Learning and Analytics tweets