Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Hadoop Summit 2016, Dublin, April 13th
Thomas VIAL, OCTO Technology
GERMAINE, EDF R&D
Marie-Luce Picard, Benoit Grossin, Martin Soppé, Michel Lutz
Outline
1. CONTEXT AND PROBLEM DESCRIPTION
2. GRAPHS: KEY CONCEPTS AND TECHNICAL INSIGHTS
3. TWO CHALLENGERS: SPARK GRAPHX AND TITAN
4. QUERYING TIME-VARYING GRAPHS
5. SCALING UP ON BIG GRAPHS
6. AS A CONCLUSION
Brice Richard - Flickr
KC Tan Phoyography - Flickr
EDF: A GLOBAL LEADER IN ELECTRICITY
All electricity-related activities:
• Generation
• Transmission & Distribution
• Trading and Sales & Marketing
• Energy services
Key figures*
• €72.9 billion in sales
• 38.5 million customers
• 158,161 employees worldwide
• 84.7% of generation does not emit CO2
• Electricity generation: 623.5 TWh
• 2014 investments: €4.5 billion
*as of 2015
Description of the French electrical network
• 2,200 distribution substations
• 618,000 kilometers of medium voltage network
• 697,000 kilometers of low voltage network
• A great diversity of components: lines, transformers, switches, line breakers, meters…
• More than 100 million components
[Figure: the transmission network carries high voltage; the distribution network carries medium and low voltage]
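French electrical network: more than 100 million pieces of equipment to manage
[Figure: the network below each of the 2,200 distribution substations is a tree more than 200 nodes deep, holding about 50,000 nodes down to the meters; 100m+ elements overall]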
Some business cases to tackle
• BC1 – Get a global picture of all powered components, at any given time
• BC2 – Track the physical evolution of the network over time (new meters, reconfiguration of the network…)
• BC3 – Process the electrical energy balance for any subpart of the network, over any period of time
• BC4 – If a piece of equipment fails, figure out the best way to restore power supply as quickly as possible, given the state of the network at this specific time
How to model the problem?
We have to find a way to:
• define relationships between pieces of technical equipment
• associate some properties to each component
• keep track of events across time
• explore the network, with complex calculations at hand
… it’s all about graphs
A property graph as a data model
[Figure: a property graph. Vertices and edges each carry a label (for example belongsTo on an edge) and a set of properties stored as key:value pairs]
All components of the network are modeled as vertices:
• Distribution substations
• Switches
• Line breakers
• Transformers
• Loads
• Lines
• …
Electrical lines are also modeled as vertices!
Edges only represent symbolic, undirected links between components
Load curves (metering data) are not stored as properties in the graph structure, but in a separate store (HBase in this study)
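The modeling choices above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the vertex identifiers, and the `load_curve_key` property (a pointer into the external load-curve store) are hypothetical and do not come from the study.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    vid: str
    label: str  # e.g. "Substation", "Line", "Switch", "Load"
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Undirected property graph: edges are symbolic links, vertices hold the data."""

    def __init__(self):
        self.vertices = {}
        self.adjacency = {}

    def add_vertex(self, vid, label, **properties):
        self.vertices[vid] = Vertex(vid, label, dict(properties))
        self.adjacency.setdefault(vid, set())

    def add_edge(self, a, b):
        # Undirected link: store both directions.
        self.adjacency[a].add(b)
        self.adjacency[b].add(a)

    def neighbors(self, vid):
        return sorted(self.adjacency[vid])


def build_example():
    g = PropertyGraph()
    g.add_vertex("S", "Substation")
    g.add_vertex("L1", "Line", length_km=1.2)  # a line is a vertex, not an edge
    g.add_vertex("SW1", "Switch", default_state="closed")
    # Hypothetical key: the load curve itself lives in a separate store.
    g.add_vertex("LD1", "Load", load_curve_key="meter-0001")
    g.add_edge("S", "L1")
    g.add_edge("L1", "SW1")
    g.add_edge("SW1", "LD1")
    return g


graph = build_example()
```

Because lines are vertices, they can carry their own properties and events exactly like any other component.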
Graphs and temporal data: not so simple!
A single graph records every event that has ever occurred
All events are recorded as vertex properties, in the form of time intervals:
• Validity: tracks the equipment life cycle (actual period as a working unit)
• Failure: tracks equipment failures
• SwitchState: tracks network reconfiguration through switch state changes (open/closed)
Each time an event occurs, a time interval is updated or created
The concept: apply graph mutations at each new event, enriching the “maximal” graph
[Figure: the “maximal” graph (Src, Line, Load and Switch vertices) shown in two successive states. First state: every Line and Load vertex has Validity: ]-∞, +∞[ and Failure: []; the Switch has Validity: [t1, +∞[, Failure: [], SwitchState: [] and DefaultState: closed. Second state, after new events: one vertex has Validity: ]-∞, t2], another has Failure: [t3, +∞[, and the Switch has SwitchState: [t4, +∞[; the other vertices are unchanged]
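The event-driven mutations can be sketched as follows. This is a minimal sketch, not the study's implementation: it assumes half-open intervals `[start, end)` instead of the bracket conventions used on the slide, and it assumes that a SwitchState interval marks a period spent away from the default state. The helper names and the numeric values chosen for t1 to t4 are hypothetical.

```python
import math

INF = math.inf


def new_component(valid_from=-INF):
    """Vertex properties; each event type holds a list of [start, end) intervals."""
    return {"Validity": [[valid_from, INF]], "Failure": [], "SwitchState": []}


def open_interval(props, key, t):
    """An event starts at t: create an interval that is still open-ended."""
    props[key].append([t, INF])


def close_interval(props, key, t):
    """An event ends at t: update the last open-ended interval."""
    intervals = props[key]
    if not intervals or intervals[-1][1] != INF:
        raise ValueError(f"no open {key} interval to close")
    intervals[-1][1] = t


def contains(intervals, t):
    return any(start <= t < end for start, end in intervals)


def is_active(props, t):
    # Assumption: active = valid, not failed, and in its default switch state.
    return (contains(props["Validity"], t)
            and not contains(props["Failure"], t)
            and not contains(props["SwitchState"], t))


T1, T2, T3, T4 = 1, 2, 3, 4           # arbitrary instants for the example
switch = new_component(valid_from=T1)  # switch installed at t1
line = new_component()
load = new_component()
close_interval(line, "Validity", T2)       # line decommissioned at t2
open_interval(load, "Failure", T3)         # load fails at t3
open_interval(switch, "SwitchState", T4)   # switch leaves its default state at t4
```

Nothing is ever deleted: each event only adds or closes an interval, which is what lets a single graph answer questions about any past instant.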
Spark GraphX: global architecture
Offers a graph API built on top of Spark
Extends RDD to represent property graphs: VertexRDD and EdgeRDD
Implements a collection of graph algorithms (e.g., PageRank, Triangle Counting, Connected Components…) and also some fundamental operations on graphs (e.g., mapVertices, mapEdges, subgraph, collectNeighbors…)
Spark GraphX: Pregel
Implements a variant of Pregel, a very popular graph processing architecture published by Google in 2010: “Pregel: a system for large-scale graph processing - Malewicz et al.”
Based on BSP (Bulk Synchronous Parallel)
Vertex-centric: edges don't carry any computation
Runs as a sequence of iterations (supersteps)
For each superstep, every vertex can:
• read messages sent to it during the previous superstep
• execute a user-defined function and generate some messages
• send messages to other vertices (that will be received at the next superstep)
• vote to halt
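The superstep loop can be illustrated with a toy, single-process simulation. This is not the GraphX API: the `pregel` function below is a hypothetical stand-in that only borrows the `vprog`, `send_msg` and `merge_msg` roles, and the example (propagating the smallest vertex id through each connected component) is chosen purely for illustration.

```python
import math


def pregel(vertices, edges, initial_msg, vprog, send_msg, merge_msg, max_supersteps=50):
    """Toy BSP loop. vertices: dict id -> state; edges: list of (src, dst) pairs."""
    # First superstep: every vertex receives the initial message.
    state = {vid: vprog(vid, s, initial_msg) for vid, s in vertices.items()}
    active = set(state)  # vertices updated during the previous superstep
    for _ in range(max_supersteps):
        inbox = {}
        # Only vertices that were just updated get to send messages.
        for src, dst in edges:
            if src not in active:
                continue
            for target, msg in send_msg(src, state[src], dst, state[dst]):
                inbox[target] = merge_msg(inbox[target], msg) if target in inbox else msg
        if not inbox:  # no message in flight: every vertex has halted
            break
        # Barrier: states change only after all messages have been collected.
        active = set()
        for vid, msg in inbox.items():
            state[vid] = vprog(vid, state[vid], msg)
            active.add(vid)
    return state


def keep_min(vid, current, msg):
    return min(current, msg)


def send_min(src, src_state, dst, dst_state):
    # Send a message only when it would lower the neighbour's value.
    return [(dst, src_state)] if src_state < dst_state else []


links = [(1, 2), (2, 3), (4, 5)]
undirected = links + [(b, a) for a, b in links]
labels = pregel({v: v for v in range(1, 6)}, undirected, math.inf, keep_min, send_min, min)
```

After the run, `labels` maps each vertex to the smallest id in its component, and the loop stops as soon as a superstep produces no message.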
[Figure: messages flowing between vertices across supersteps n-1, n and n+1. At each superstep a vertex merges the messages received through sendMsg with mergeMsg, runs vprog, then either sends new messages to other vertices with sendMsg or halts]
TITAN: global architecture
• Optimized for storing and querying billions of vertices and edges over a cluster
• Supports thousands of concurrent users
• Can execute local queries (OLTP) or distributed queries across a cluster (OLAP)
TITAN: GREMLIN
Gremlin is a graph traversal scripting language and is part of TinkerPop, an open-source graph computing framework supported by many graph databases (e.g., Neo4j, OrientDB, Sparksee…)
• Gremlin OLTP: for local graph traversals, queries run in a single process
• Gremlin Hadoop (OLAP): for large graph analysis, queries are distributed across a cluster
• Gremlin Hadoop implements BSP-based vertex-centric computing
Blueprints is like a JDBC driver for graphs
[Figure: the Titan stack combines a storage backend, the graph database, the TinkerPop API and graph processing]
Traversing a time-varying graph
We have as inputs
• A period of observation [t1, t2]
• A particular vertex as a starting point
We know, for each vertex in the graph, when it was inactive
• For a piece of equipment: taken down, under maintenance, switched, …
We want to traverse the graph, following the paths as they vary according to the states of the vertices encountered
We devise a “Trickling” algorithm that can be easily implemented with either paradigm, Pregel or Gremlin
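One possible reading of the trickling idea, sketched in plain Python rather than Pregel or Gremlin: the powered intervals of a vertex are the intervals flowing in from a neighbor, intersected with the vertex's own activity. The function names, the representation of activity as sorted lists of disjoint `(start, end)` intervals, and the example network are all assumptions made for this illustration.

```python
from collections import deque


def intersect(a, b):
    """Intersection of two sorted lists of disjoint (start, end) intervals."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start, end = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """Union of two interval lists, merging overlapping or touching intervals."""
    out = []
    for start, end in sorted(a + b):
        if out and start <= out[-1][1]:
            out[-1] = (out[-1][0], max(out[-1][1], end))
        else:
            out.append((start, end))
    return out


def trickle(adjacency, activity, source, t1, t2):
    """Powered intervals of every vertex linked to `source` within [t1, t2]."""
    powered = {source: intersect([(t1, t2)], activity[source])}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for n in adjacency[v]:
            flow = intersect(powered[v], activity[n])  # incoming flow AND own activity
            merged = union(powered.get(n, []), flow)
            if merged != powered.get(n):  # revisit a vertex only if it gained something
                powered[n] = merged
                queue.append(n)
    return powered


def powered_fraction(intervals, t1, t2):
    return sum(end - start for start, end in intervals) / (t2 - t1)


adjacency = {"S": ["A"], "A": ["S", "B", "C"], "B": ["A"], "C": ["A"]}
activity = {
    "S": [(0, 100)],
    "A": [(0, 75)],
    "B": [(0, 30), (50, 100)],
    "C": [],
}
powered = trickle(adjacency, activity, "S", 0, 100)
```

With an observation period reduced to a single instant, the same traversal answers "what is reachable at time t?"; with a longer period, `powered_fraction` gives the share of time during which each vertex was powered.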
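Trickling down
What paths from source S can we follow between t1 and t2?
[Figure: timeline from t1 to t2 showing the observation period, the activity of each vertex and the resulting actual flows; at each vertex the incoming flow is combined with the vertex's own activity (&=)]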
BC1 – Get a picture of all powered components
What vertices are reachable from substation S at time t?
[Figure: trickling from S over the instant [t, t+ε]; according to the actual flow, each vertex ends up CONNECTED or DISCONNECTED]
BC2 – Track how long equipment is powered
For how long were vertices reachable from S actually powered between t1 and t2?
[Figure: for each vertex, the sum of its powered intervals as a share of [t1, t2]; in the example the values are 100%, 100%, 75%, 75%, 55% and 0%]
BC3 – Manage energy balance
How much energy flowed, in aggregate, from substation S to the leaves between t1 and t2?
[Figure: load curves (kW) of the leaves between t1 and t2; the branch totals are 1,132 MWh, 856 MWh and 0 MWh, so ∑ loads = 1,988 MWh at S]
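The aggregation step can be sketched as below, assuming the powered intervals of each leaf are already known and that load curves are fixed-step samples of average power in kW. The function names, the sampling convention and the numbers are made up for the illustration; they are not the study's data.

```python
def energy_kwh(load_curve, powered, step_hours):
    """Energy of one load, counted only while the load was powered.

    load_curve: list of (timestamp_hours, kW) samples on a fixed step.
    powered: list of (start, end) intervals in hours, end excluded.
    """
    return sum(
        kw * step_hours
        for t, kw in load_curve
        if any(start <= t < end for start, end in powered)
    )


def aggregated_energy_kwh(load_curves, powered_by_leaf, step_hours):
    """Per-leaf energy and their sum, i.e. what flowed out of the substation."""
    per_leaf = {
        leaf: energy_kwh(curve, powered_by_leaf.get(leaf, []), step_hours)
        for leaf, curve in load_curves.items()
    }
    return per_leaf, sum(per_leaf.values())


# Hourly samples over a 4-hour observation period (made-up values).
load_curves = {
    "A": [(0, 10), (1, 10), (2, 10), (3, 10)],
    "B": [(0, 5), (1, 5), (2, 5), (3, 5)],
    "C": [(0, 8), (1, 8), (2, 8), (3, 8)],
}
powered_by_leaf = {"A": [(0, 4)], "B": [(0, 2)], "C": []}
per_leaf, total = aggregated_energy_kwh(load_curves, powered_by_leaf, step_hours=1)
```

Here leaf A contributes 40 kWh, leaf B 10 kWh (it was powered for two hours only) and leaf C nothing, which mirrors the 0 MWh branch of the slide.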
Trickling with Gremlin
Trickling with GraphX
BC4 – Restore power supply after a failure
From what other substations S’ can vertices below S be reached?
Pattern matching
[Figure: the pattern links SUBSTATION S to another SUBSTATION S’ through SWITCH and LINE vertices, with many other components in between]
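As a rough stand-in for the pattern-matching traversal, the sketch below uses a plain breadth-first search that stops at the first substation met on each path. It is a simplification under stated assumptions: it looks at the structure only (it ignores switch states and time), and the function name, labels and example network are hypothetical.

```python
from collections import deque


def backup_substations(adjacency, labels, source):
    """Substations other than `source` that are structurally linked to it."""
    seen, found = {source}, set()
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for n in adjacency[v]:
            if n in seen:
                continue
            seen.add(n)
            if labels[n] == "SUBSTATION":
                found.add(n)  # record S' but do not explore what lies below it
            else:
                queue.append(n)
    return sorted(found)


labels = {
    "S": "SUBSTATION", "L1": "LINE", "SW1": "SWITCH", "L2": "LINE",
    "SW2": "SWITCH", "L3": "LINE", "S2": "SUBSTATION",
    "S3": "SUBSTATION", "L4": "LINE",
}
links = [("S", "L1"), ("L1", "SW1"), ("SW1", "L2"), ("L2", "SW2"),
         ("SW2", "L3"), ("L3", "S2"), ("S3", "L4")]
adjacency = {v: [] for v in labels}
for a, b in links:
    adjacency[a].append(b)
    adjacency[b].append(a)

candidates = backup_substations(adjacency, labels, "S")
```

A real query would additionally filter the paths on the state of each switch at the time of the failure, using the interval properties stored on the vertices.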
^[\d]+.*$!
// God created comments, and saw that it was good
Pattern matching with Gremlin
Titan at scale
A set of connected trees adding up to ~ 100 million vertices and ~ 100 million edges, loaded on a 10-node HBase 1.1 cluster
• The quasi-tree below each substation S contains roughly 50,000 vertices
• The Titan table is made of 3 regions that receive an equal share of the load
• The client machine is on the same network as the HBase servers
Unitary execution with Titan’s OLTP interface
Query                            Method             Approximate time
BC1 – Get powered components     Trickling          1 min
BC3 – Get aggregated load        Trickling          2 min
BC4 – Find backup substations    Pattern matching   1 min
Scaling across substations
The execution times above are for a single source substation S
To compute over several substations, we run the same query in parallel, each with a different source substation as input
[Figure: a client node running Titan + Gremlin sends many queries in parallel to the HBase cluster]
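The fan-out can be sketched with a thread pool on the client side. This is only an illustration of the principle: `fake_query` is a hypothetical placeholder for a real single-substation query, and the pool size is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def run_for_substations(query, substations, max_workers=8):
    """Run the same single-substation query once per substation, in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(query, substations))  # map preserves input order
    return dict(zip(substations, results))


def fake_query(substation):
    # Hypothetical stand-in for a query submitted to the graph database.
    return f"result for {substation}"


results = run_for_substations(fake_query, ["S1", "S2", "S3"])
```

Threads suit this pattern when the client mostly waits on the storage cluster; the number of workers then bounds how many queries hit the cluster at once.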
Scaling across substations – Results
What about Titan OLAP?
The execution times above are predictable but quickly become impractical for interactive querying
We may want to compute some KPIs in advance with OLAP backends, re-using the same Gremlin queries
• Titan 1.0.0 or 1.1.0-snapshot?
• TinkerPop 3.0.1 or 3.1.0?
• Titan + TinkerPop JARs or the other way around?
• Giraph or Spark as the backend?
Be patient! Support for Hadoop 2 OLAP is still limited. Bits are being moved around between the two projects, which are undergoing a big refactoring…
Would that it were so simple
Next step: GraphX
Spark GraphX is another natural candidate for OLAP workloads
We have yet to benchmark it against our big graph
But wait! GraphFrames for Spark is just being released by Databricks. What’s going on?
http://graphframes.github.io/
A turning point for graph analytics
A lot is happening these days
• Acquisition of Aurelius (ThinkAurelius), the company behind Titan, by DataStax (will they continue to support HBase?)
• Titan and TinkerPop cross-refactorings
• Improvements in the computation backends (Giraph, Spark)
• Better support for Hadoop 2 and YARN
• Introduction of GraphFrames by Databricks
Graph analytics frameworks are becoming a commodity, and this is a good thing!
Which framework to use?
A quick sketch of the typical usages of the frameworks:
Framework       Usages
Titan OLTP      Interactive querying, for a relatively small number of vertices
Titan OLAP      Batch computations, KPI (but wait for Hadoop 2 support if you’re using HBase)
Spark GraphX    Batch computations, KPI
GraphFrames     ? (it’s too early)
Be sure to test them thoroughly in your environment before making a final decision. Be prepared to go deep into Hadoop and Java/Scala internals!
Or maybe wait a little bit until the situation gets clearer?
Appendix
Appendix A: The ecosystem of graph-based solutions (Graph Databases)
Appendix B: The ecosystem of graph-based solutions (Graph Processing frameworks)
Appendix C: Graph traversal
Appendix D: Architecture with Spark GraphX
Appendix E: Architecture with Titan
Appendix A: The ecosystem of graph-based solutions
Graph Databases (OLTP)
Optimized for local graph exploration (traversal), with low latency
Optimized for handling multiple concurrent users
Data can be distributed across several machines
Queries themselves are not distributed: global graph analyses are inefficient
Appendix B: The ecosystem of graph-based solutions
Graph Processing Frameworks (OLAP)
Optimized for global graph processing (batch)
Queries and data are distributed across a cluster, and can handle very large graphs
Has a higher latency than OLTP solutions
Cannot handle a lot of concurrent users
Appendix C: Graph traversal
[Figure: the same tree, rooted at Src, traversed depth-first and breadth-first; the drawing also shows computation nodes and four jobs (Job 1 to Job 4), with the annotations “Hard to parallelize!”, “Low latency” and “High latency”]
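The two traversal orders can be compared on a small tree. The sketch below is a generic illustration, with hypothetical function names and vertex ids: the depth-first version visits one vertex at a time, while the breadth-first version produces whole levels of independent vertices, which is the kind of unit a BSP-style engine can spread over several machines.

```python
def depth_first(tree, root):
    """Visit order of an iterative depth-first traversal."""
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        # Push children in reverse so the leftmost child is visited first.
        stack.extend(reversed(tree.get(v, [])))
    return order


def breadth_first_levels(tree, root):
    """Vertices grouped by depth; each level can be processed as one batch."""
    levels, frontier = [], [root]
    while frontier:
        levels.append(frontier)
        frontier = [child for v in frontier for child in tree.get(v, [])]
    return levels


tree = {"Src": ["A", "B"], "A": ["C", "D"], "B": ["E"]}
dfs_order = depth_first(tree, "Src")
bfs_levels = breadth_first_levels(tree, "Src")
```

On this tree the depth-first order is Src, A, C, D, B, E, while the breadth-first traversal yields three levels: Src, then A and B, then C, D and E.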
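Appendix D: Architecture with GraphX
[Figure: inside Spark, the inputs are edges, vertices, vertex states and load curves. Intervals are added as properties to build the Edge RDD and Vertex RDD; GraphX processors then run each use case (use case 1, use case 2, use case 3, …) and a combine step produces the final result]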
Appendix E: Architecture with Titan
[Figure: inside a JVM, a library of Groovy scripts runs each use case (use case 1, use case 2, …) with transactional Gremlin on top of the Titan APIs. Edges, vertices and vertex states are loaded into Titan, with intervals added as properties; load curves are stored as time series and combined through the HBase API]