Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks

Hadoop Summit 2016, Dublin, April 13th

Thomas VIAL, OCTO Technology
GERMAINE, EDF R&D

Marie-Luce Picard, Benoit Grossin, Martin Soppé, Michel Lutz

Outline

1. CONTEXT AND PROBLEM DESCRIPTION
2. GRAPHS: KEY CONCEPTS AND TECHNICAL INSIGHTS
3. TWO CHALLENGERS: SPARK GRAPHX AND TITAN
4. QUERYING TIME-VARYING GRAPHS
5. SCALING UP ON BIG GRAPHS
6. AS A CONCLUSION

Brice Richard - Flickr

KC Tan Phoyography - Flickr


EDF: A GLOBAL LEADER IN ELECTRICITY

All electricity-related activities:
• Generation
• Transmission & Distribution
• Trading and Sales & Marketing
• Energy services

Key figures*
• €72.9 billion in sales
• 38.5 million customers
• 158,161 employees worldwide
• 84.7% of generation does not emit CO2

ELECTRICITY GENERATION: 623.5 TWh

2014 INVESTMENTS: €4.5 BILLION

*as of 2015

Description of the French electrical network

• 2,200 distribution substations
• 618,000 kilometers of medium voltage network
• 697,000 kilometers of low voltage network

A great diversity of components: lines, transformers, switches, line breakers, meters…

[Figure: the network by voltage level (high, medium, low voltage), split into transmission network and distribution network]

More than 100 million components

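French electrical network: more than 100 million pieces of equipment to manage

[Figure: the tree below one distribution substation, down to the meters: 200+ nodes in depth, ~50,000 nodes. Repeated × 2,200 substations, this gives 100 million+ elements.]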

Some business cases to tackle

BC1 Get a global picture of all powered components, at any given time

BC2 Track the physical evolution of the network over time (new meters, reconfiguration of the network…)

BC3 Compute the electrical energy balance for any subpart of the network, over any period of time

BC4 If a piece of equipment fails, figure out the best way to restore power supply as quickly as possible, given the state of the network at that specific time

How to model the problem?

We have to find a way to:
• define relationships between pieces of technical equipment
• associate some properties to each component
• keep track of events across time
• explore the network, with complex calculations at hand

… it’s all about graphs

A property graph as a data model

[Figure: a property graph, with callouts for Vertex, Edge, Label (e.g. belongsTo) and Properties (key1: value, key2: value)]

A property graph as a data model

All components of the network are modeled as vertices:
• Distribution substations
• Switches
• Line breakers
• Transformers
• Loads
• Lines
• …

Edges only represent symbolic, non-oriented links between components

Electrical lines are also modeled as vertices!

Load curves (metering data) are not stored as properties in the graph structure, but in separate storage (HBase in this study)
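
A minimal sketch of this data model in Python, to make the choices above concrete. The names (`Vertex`, `PropertyGraph`, `add_edge`, the `meter_id` property) are hypothetical and are not the Titan or GraphX API. Lines are vertices like any other component, and each edge is stored in both directions because links are non-oriented.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    """A network component: substation, switch, line, load..."""
    vid: str
    label: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Toy in-memory property graph with non-oriented edges (illustration only)."""

    def __init__(self):
        self.vertices = {}
        self.adjacency = {}

    def add_vertex(self, vid, label, **properties):
        self.vertices[vid] = Vertex(vid, label, dict(properties))
        self.adjacency.setdefault(vid, set())

    def add_edge(self, a, b):
        # Edges are symbolic and non-oriented: store both directions.
        self.adjacency[a].add(b)
        self.adjacency[b].add(a)

    def neighbors(self, vid):
        return sorted(self.adjacency[vid])


graph = PropertyGraph()
graph.add_vertex("S", "substation")
graph.add_vertex("L1", "line")  # an electrical line is a vertex, not an edge
graph.add_vertex("M1", "load", meter_id="hypothetical-001")
graph.add_edge("S", "L1")
graph.add_edge("L1", "M1")
```

Load curves are deliberately absent from `properties`: in the study they live in a separate store and are joined by identifier when needed.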

Graphs and temporal data: not so simple!

A unique graph to record every event that has ever occurred

All events are recorded as vertex properties, in the form of time intervals:
• Validity: tracks the equipment life cycle (actual period as a working unit)
• Failure: tracks equipment failures
• SwitchState: tracks network reconfiguration through switch state changes (open/closed)

Each time an event occurs, a time interval is updated or created

The concept: apply graph mutations at each new event, enriching the “maximal” graph
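
The "update or create an interval" rule can be sketched as follows. This is an illustration under assumptions, not the authors' code: intervals are `[start, end]` lists with `math.inf` for an open bound, and the helper names `new_vertex`, `open_interval` and `close_interval` are hypothetical.

```python
import math

INF = math.inf


def new_vertex(valid_from=-INF):
    """Interval properties of a vertex in the 'maximal' graph."""
    return {"Validity": [[valid_from, INF]], "Failure": [], "SwitchState": []}


def open_interval(props, key, t):
    """An event starts at t: create a new interval [t, +inf[."""
    props[key].append([t, INF])


def close_interval(props, key, t):
    """An event ends at t: update the last interval, which must still be open."""
    if not props[key] or props[key][-1][1] != INF:
        raise ValueError(f"no open {key} interval to close")
    props[key][-1][1] = t


line = new_vertex()
open_interval(line, "Failure", 3)    # the line fails at t=3
close_interval(line, "Failure", 5)   # it is repaired at t=5
close_interval(line, "Validity", 9)  # it is decommissioned at t=9
```

Nothing is ever deleted: the vertex stays in the graph after t=9, and queries decide from the intervals whether it counts at a given time.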

Graphs and temporal data: not so simple! (example)

[Figure: a small network (Src, Line, Load, Switch) shown before and after events occurring at times t1 to t4. Vertex annotations shown in the figure:]
• Validity: ]-∞, +∞[ ; Failure: []
• Validity: [t1, +∞[ ; Failure: [] ; SwitchState: [] ; DefaultState: closed
• Validity: ]-∞, t2] ; Failure: []
• Validity: ]-∞, +∞[ ; Failure: [t3, +∞[
• Validity: [t1, +∞[ ; Failure: [] ; SwitchState: [t4, +∞[ ; DefaultState: closed

Spark GraphX: global architecture

Offers a graph API built on top of Spark

Extends RDD to represent property graphs: VertexRDD and EdgeRDD

Implements a collection of graph algorithms (e.g., PageRank, Triangle Counting, Connected Components…) and some fundamental graph operations (e.g., mapVertices, mapEdges, subgraph, collectNeighbors…)

Spark GraphX: Pregel

Implements a variant of Pregel, a very popular graph processing architecture developed at Google in 2010: “Pregel: a system for large-scale graph processing - Malewicz et al.”

Based on BSP (Bulk Synchronous Parallel)

Vertex-centric: edges don't carry any computation

Runs in sequence of iterations (supersteps)

For each superstep, every vertex can:

• read messages sent to it during the previous superstep
• execute a user-defined function and generate some messages
• send messages to other vertices (that will be received at the next superstep)
• vote to halt

Spark GraphX: Pregel

[Figure: vertices exchanging messages across superstep n-1, superstep n and superstep n+1. Callouts: sendMsg (to other vertices), mergeMsg (collect and merge incoming messages), vprog, haltVertex.]
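
To make the superstep cycle concrete, here is a single-process simulation in Python. It is a sketch under assumptions, not GraphX itself: the function `pregel` and its signature are hypothetical, edges are treated as directed, and only vertices that received a message in the previous superstep may send. The three callbacks borrow the names vprog, sendMsg and mergeMsg from the figure. The example computes hop counts from a source.

```python
import math


def pregel(vertices, edges, initial_msg, vprog, send_msg, merge_msg, max_supersteps=50):
    """Toy BSP loop. vertices: {id: state}; edges: list of (src, dst)."""
    # Superstep 0: every vertex receives the initial message.
    state = {v: vprog(v, s, initial_msg) for v, s in vertices.items()}
    active = set(state)
    for _ in range(max_supersteps):
        inbox = {}
        for src, dst in edges:
            if src not in active:
                continue  # halted vertices send nothing
            for target, msg in send_msg(src, state[src], dst, state[dst]):
                # Messages to the same vertex are combined pairwise.
                inbox[target] = merge_msg(inbox[target], msg) if target in inbox else msg
        if not inbox:
            break  # no message in flight: every vertex has halted
        # Barrier: messages are delivered together, at the next superstep.
        for v, msg in inbox.items():
            state[v] = vprog(v, state[v], msg)
        active = set(inbox)
    return state


def vprog(vid, dist, msg):
    return min(dist, msg)


def send_msg(src, src_dist, dst, dst_dist):
    # Only notify a neighbor when we can improve its distance.
    return [(dst, src_dist + 1)] if src_dist + 1 < dst_dist else []


vertices = {"S": 0, "A": math.inf, "B": math.inf, "C": math.inf}
edges = [("S", "A"), ("A", "B"), ("S", "B"), ("B", "C")]
hops = pregel(vertices, edges, math.inf, vprog, send_msg, min)
```

The loop ends as soon as a superstep produces no message, which plays the role of every vertex voting to halt.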

TITAN: global architecture

• Optimized for storing and querying billions of vertices and edges over a cluster
• Supports thousands of concurrent users
• Can execute local queries (OLTP) or distributed queries across a cluster (OLAP)

TITAN: GREMLIN

• Gremlin OLTP: for local graph traversal, queries run in a single process
• Gremlin Hadoop (OLAP): for large graph analysis, queries are distributed across a cluster
• Gremlin Hadoop implements BSP-based vertex-centric computing

Gremlin is a graph traversal scripting language and is part of TinkerPop, a widely supported open-source graph computing framework (e.g., Neo4j, OrientDB, Sparksee…)

Blueprints is like a JDBC driver for graphs

[Figure: the Titan stack: storage backend + graph database + TinkerPop API + graph processing]

Traversing a time-varying graph

We have as inputs

• A period of observation [t1, t2]

• A particular vertex as a starting point

We know, for each vertex in the graph, when it was inactive

• For a piece of equipment: taken down, under maintenance, switched, …

We want to traverse the graph, following the paths as they vary according to the states of the vertices encountered

We devise a “Trickling” algorithm that can be easily implemented with either paradigm, Pregel or Gremlin

Trickling down

What paths from source S can we follow between t1 and t2?

[Figure: for each vertex along the paths from S, VERTEX ACTIVITY & OBSERVATION = ACTUAL FLOWS, over the window t1 to t2]
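
The slides do not reproduce the Trickling code, so the following Python is only one plausible reading of the figure, under stated assumptions: each vertex carries a sorted list of disjoint `(start, end)` intervals during which it is active (the complement of its inactive periods), and the flow reaching a vertex is the flow of a neighbor intersected with its own activity. The function names `intersect`, `union` and `trickle` are hypothetical.

```python
def intersect(a, b):
    """Intersection of two sorted lists of disjoint (start, end) intervals."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start, end = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """Union of two interval lists, merged into sorted disjoint intervals."""
    out = []
    for start, end in sorted(a + b):
        if out and start <= out[-1][1]:
            out[-1] = (out[-1][0], max(out[-1][1], end))
        else:
            out.append((start, end))
    return out


def trickle(adjacency, activity, source, t1, t2):
    """Return {vertex: intervals during which flow from `source` reaches it}."""
    flows = {v: [] for v in adjacency}
    flows[source] = intersect(activity[source], [(t1, t2)])
    frontier = [source]
    while frontier:
        nxt = []
        for v in frontier:
            for n in adjacency[v]:
                # Flow reaches n only while v is fed AND n itself is active.
                merged = union(flows[n], intersect(flows[v], activity[n]))
                if merged != flows[n]:
                    flows[n] = merged
                    nxt.append(n)
        frontier = nxt
    return flows


adjacency = {"S": ["A", "C"], "A": ["S", "B"], "B": ["A"], "C": ["S"]}
activity = {"S": [(0, 10)], "A": [(0, 4), (6, 10)], "B": [(0, 10)], "C": []}
flows = trickle(adjacency, activity, "S", 0, 8)
```

A vertex is revisited only when its set of intervals grows, so the loop stops once no new flow can be added. The frontier-by-frontier structure also maps onto Pregel supersteps, with `union` in the role of the message merge.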

BC1 – Get a picture of all powered components

What vertices are reachable from substation S at time t?

[Figure: trickling from S over the window t to t+ε; based on the actual flow, vertices are labelled CONNECTED (five in the example) or DISCONNECTED (one)]

BC2 – Track how long equipment is powered

For how long were vertices reachable from S actually powered between t1 and t2?

[Figure: actual flows per vertex between t1 and t2, with the share of the window covered by each: ∑ intervals = 100%, 100%, 75%, 55%, 75%, 0%]
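
The "∑ intervals" percentage is the total length of a vertex's powered intervals divided by the length of the window. A small sketch, assuming disjoint `(start, end)` intervals and a hypothetical helper name `powered_share`; the sample values are made up.

```python
def powered_share(intervals, t1, t2):
    """Fraction of the window [t1, t2] covered by disjoint (start, end) intervals."""
    covered = sum(
        min(end, t2) - max(start, t1)
        for start, end in intervals
        if min(end, t2) > max(start, t1)  # ignore intervals outside the window
    )
    return covered / (t2 - t1)


share = powered_share([(0, 4), (6, 8)], 0, 8)  # 6 time units out of 8
never = powered_share([], 0, 8)
```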

BC3 – Manage energy balance

How much aggregated power flowed from substation S to the leaves between t1 and t2?

[Figure: load curves (kW) of the leaves between t1 and t2, summed over the leaves: 1,132 MWh + 856 MWh + 0 MWh, so ∑ loads = 1,988 MWh]
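
One way to read this aggregation: integrate each leaf's load curve over the periods when the leaf is actually powered, then sum over the leaves. The sketch below assumes fixed-step samples `(t, kW)`, each covering `[t, t + step_hours[`, and a hypothetical function name `energy_kwh`; the sample data is made up and unrelated to the figures above.

```python
def energy_kwh(load_curve, powered, step_hours):
    """Energy of one leaf, counting only samples that start inside a powered interval.

    load_curve: list of (t, kW) samples; powered: list of (start, end) intervals,
    with t, start and end expressed in hours (assumption).
    """
    total = 0.0
    for t, kw in load_curve:
        if any(start <= t < end for start, end in powered):
            total += kw * step_hours
    return total


curves = {
    "M1": [(0.0, 2.0), (0.5, 4.0), (1.0, 6.0), (1.5, 8.0)],
    "M2": [(0.0, 1.0), (0.5, 1.0), (1.0, 1.0), (1.5, 1.0)],
}
powered = {"M1": [(0.0, 1.0)], "M2": []}  # M2 was never reached by the flow

total = sum(energy_kwh(curves[m], powered[m], 0.5) for m in curves)
```

Here M1 contributes 2.0 × 0.5 + 4.0 × 0.5 = 3.0 kWh and M2 contributes nothing, mirroring the 0 MWh leaf in the figure.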

Trickling with Gremlin

Trickling with GraphX

BC4 – Restore power supply after a failure

From what other substations S’ can vertices below S be reached? Pattern matching

[Figure: a path pattern between substations S and S’. Callouts: SUBSTATION, LINE, SWITCH, (many things), SWITCH, SUBSTATION]
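
The Gremlin pattern itself is not reproduced in this transcript. As a rough stand-in, the question can be approximated by a breadth-first search from S that records every other substation it meets and never expands past one. This is an assumption-laden simplification: it ignores switch states and time, and the name `backup_substations` is hypothetical.

```python
from collections import deque


def backup_substations(adjacency, labels, source):
    """Other substations reachable from `source` without crossing a substation."""
    seen, found = {source}, set()
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for n in adjacency[v]:
            if n in seen:
                continue
            seen.add(n)
            if labels[n] == "substation":
                found.add(n)  # a candidate S'; do not expand beyond it
            else:
                queue.append(n)
    return sorted(found)


adjacency = {
    "S": ["L1"], "L1": ["S", "SW"], "SW": ["L1", "L2"], "L2": ["SW", "S2"],
    "S2": ["L2", "L3"], "L3": ["S2", "S3"], "S3": ["L3"],
}
labels = {
    "S": "substation", "S2": "substation", "S3": "substation",
    "L1": "line", "L2": "line", "L3": "line", "SW": "switch",
}
candidates = backup_substations(adjacency, labels, "S")
```

S3 is not returned because it can only be reached through S2. A fuller version would also check that the switches on the path can be closed at the time of the failure.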

^[\d]+.*$!

// God created comments, and saw that it was good

Pattern matching with Gremlin


Titan at scale

A set of connected trees adding up to ~100 million vertices and ~100 million edges, loaded on a 10-node HBase 1.1 cluster
• The quasi-tree below each substation S contains roughly 50,000 vertices
• The Titan table is made of 3 regions that are equally solicited
• The client machine is on the same network as the HBase servers

Unitary execution with Titan’s OLTP interface

Query                          | Method           | Approximate time
BC1 – Get powered components   | Trickling        | 1 min
BC3 – Get aggregated load      | Trickling        | 2 min
BC4 – Find backup substations  | Pattern matching | 1 min

Scaling across substations

The execution times above are for a single source substation S

To compute over several substations, we run the same query in parallel with a different input

[Figure: a client node running Titan + Gremlin issues many queries in parallel against the HBase cluster]
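
The fan-out described above, one query per source substation run concurrently from a single client, can be sketched with a thread pool. `run_query` is a hypothetical placeholder for the real Gremlin call, and the worker count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def run_query(substation_id):
    """Hypothetical stand-in for one OLTP query rooted at a substation."""
    return substation_id, f"result for {substation_id}"


def run_for_all(substation_ids, max_workers=8):
    # Assumption: each query mostly waits on the remote store, so threads suffice.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_query, substation_ids))


results = run_for_all(["S1", "S2", "S3"])
```

Since every query uses the same code with a different input, the results are independent and can simply be collected into a dictionary keyed by substation.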

Scaling across substations – Results

What about Titan OLAP?

The execution times above are predictable but quickly become impractical for interactive querying

We may want to compute some KPIs in advance with OLAP backends, re-using the same Gremlin queries

• Titan 1.0.0 or 1.1.0-snapshot?
• TinkerPop 3.0.1 or 3.1.0?
• Titan + TinkerPop JARs or the other way around?
• Giraph or Spark as the backend?

Be patient! Support for Hadoop 2 OLAP is still limited. Bits are being moved around between the two projects, which are undergoing a big refactoring…

Would that it were so simple

Next step: GraphX

Spark GraphX is another natural candidate for OLAP workloads

We have yet to benchmark it against our big graph

But wait! GraphFrames for Spark is just being released by Databricks. What’s going on?

http://graphframes.github.io/

A turning point for graph analytics

A lot is happening these days
• ThinkAurelius’s acquisition by DataStax (will they continue to support HBase?)
• Titan and TinkerPop cross-refactorings
• Improvements in the computation backends (Giraph, Spark)
• Better support for Hadoop 2 and YARN
• Introduction of GraphFrames by Databricks

Graph analytics frameworks are becoming a commodity, and this is a good thing!

Which framework to use?

A quick sketch of the typical usages of the frameworks:

Framework    | Usages
Titan OLTP   | Interactive querying, for a relatively small number of vertices
Titan OLAP   | Batch computations, KPIs (but wait for Hadoop 2 support if you’re using HBase)
Spark GraphX | Batch computations, KPIs
GraphFrames  | ? (it’s too early)

Be sure to test them thoroughly in your environment before making a final decision. Be prepared to go deep into Hadoop and Java/Scala internals!

Or maybe wait a little bit until the situation gets clearer?

Appendix

Appendix A: The ecosystem of graph-based solutions (Graph Databases)

Appendix B: The ecosystem of graph-based solutions (Graph Processing frameworks)

Appendix C: Graph traversal

Appendix D: Architecture with Spark GraphX

Appendix E: Architecture with Titan

Appendix A: The ecosystem of graph-based solutions

Graph Databases (OLTP)

Optimized for local graph exploration (traversal), with low latency

Optimized for handling multiple concurrent users

Data can be distributed across several machines

Queries themselves are not distributed: global graph analyses are inefficient

Appendix B: The ecosystem of graph-based solutions

Graph Processing Frameworks (OLAP)

Optimized for global graph processing (batch)

Queries and data are distributed across a cluster, and can handle very large graphs

Has a higher latency than OLTP solutions

Cannot handle a lot of concurrent users

Appendix C: Graph traversal

[Figure: the same tree, rooted at Src, traversed depth-first and breadth-first, with its vertices spread over computation nodes and the work split into Job 1 to Job 4. Callouts: "Hard to parallelize!", "Low latency", "High latency".]

Appendix D: Architecture with GraphX

[Figure: Spark pipeline. Inputs: Edges, Vertices, Vertex States, Load curves. Steps shown: Edge RDD and Vertex RDD, add intervals as properties, GraphX processors (Use case 1, Use case 2, Use case 3, …), combine, final result.]

Appendix E: Architecture with Titan

[Figure: a JVM hosting a Groovy scripts library (Use case 1, Use case 2, …) on top of transactional Gremlin and the Titan APIs. Inputs: Edges, Vertices, Vertex States (add intervals as properties), Load curves (store load curves as time series, combine load curves through the HBase API).]