Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks


Hadoop Summit 2016, Dublin, April 13th

Thomas VIAL, OCTO Technology - tvial@octo.com
Guillaume GERMAINE, EDF R&D - guillaume.germaine@edf.fr
Marie-Luce Picard, Benoit Grossin, Martin Sopp, Michel Lutz

Outline

  • CONTEXT AND PROBLEM DESCRIPTION
  • GRAPHS: KEY CONCEPTS AND TECHNICAL INSIGHTS
  • TWO CHALLENGERS: Spark GraphX AND Titan
  • Querying time-varying graphs
  • Scaling up on big graphs
  • As a conclusion

Photo credits: Brice Richard - Flickr; KC Tan Photography - Flickr

CONTEXT AND PROBLEM DESCRIPTION

EDF: A GLOBAL LEADER IN ELECTRICITY

All electricity-related activities:
  • Generation
  • Transmission & Distribution
  • Trading and Sales & Marketing
  • Energy services

Key figures (as of 2015):
  • €72.9 billion in sales
  • 38.5 million customers
  • 158,161 employees worldwide
  • 84.7% of generation does not emit CO2
  • Electricity generation: 623.5 TWh
  • 2014 investments: €4.5 billion

Description of the French electrical network

  • 2,200 distribution substations
  • 618,000 kilometers of medium voltage network
  • 697,000 kilometers of low voltage network
  • A great diversity of components: lines, transformers, switches, line breakers, meters
  • More than 100 million components

[Diagram labels: high voltage, medium voltage, low voltage; transmission network, distribution network]

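French electrical network: more than 100 million pieces of equipment to manage

[Diagram: the tree between one distribution substation and its meters is 200+ nodes deep and holds ~50,000 nodes; repeated ×2,200 substations, this adds up to 100 million+ elements]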

Some business cases to tackle

  • BC1: Get a global picture of all powered components, at any given time
  • BC2: Track the physical evolution of the network over time (new meters, reconfiguration of the network)
  • BC3: Compute the electrical energy balance for any subpart of the network, over any period of time
  • BC4: If a piece of equipment fails, figure out the best way to restore power supply as quickly as possible, given the state of the network at that specific time

How to model the problem?

We have to find a way to:
  • define relationships between pieces of technical equipment
  • associate some properties with each component
  • keep track of events across time
  • explore the network, with complex calculations at hand

It's all about graphs.

GRAPHS: KEY CONCEPTS AND TECHNICAL INSIGHTS

A property graph as a data model

[Diagram: a property graph. Each vertex and each edge carries properties as key:value pairs (key1:value, key2:value); an edge also carries a label, here belongsTo]

A property graph as a data model

All components of the network are modeled as vertices:
  • Distribution substations
  • Switches
  • Line breakers
  • Transformers
  • Loads
  • Lines

Edges only represent symbolic, undirected links between components. Electrical lines are also modeled as vertices!

Load curves (metering data) are not stored as properties in the graph structure, but in a separate store (HBase in this study).
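
As a minimal sketch of this data model (not the authors' code), the graph can be held in plain Python dictionaries. The vertex identifiers, the `kind` and `default_state` property names and the `PropertyGraph` class are assumptions made for illustration; the only points taken from the slides are that lines are vertices too and that edges are undirected links.

```python
from collections import defaultdict


class PropertyGraph:
    """Tiny in-memory property graph: vertices carry properties, edges are undirected."""

    def __init__(self):
        self.props = {}              # vertex id -> dict of properties
        self.adj = defaultdict(set)  # vertex id -> set of neighbour ids

    def add_vertex(self, vid, **props):
        self.props[vid] = dict(props)

    def add_edge(self, a, b):
        # Edges are symbolic and undirected, so both directions are stored.
        self.adj[a].add(b)
        self.adj[b].add(a)

    def neighbours(self, vid):
        return sorted(self.adj[vid])


# Hypothetical fragment: substation - line - switch - line - load.
g = PropertyGraph()
g.add_vertex("S1", kind="substation")
g.add_vertex("L1", kind="line")
g.add_vertex("SW1", kind="switch", default_state="closed")
g.add_vertex("L2", kind="line")
g.add_vertex("LOAD1", kind="load")
for a, b in [("S1", "L1"), ("L1", "SW1"), ("SW1", "L2"), ("L2", "LOAD1")]:
    g.add_edge(a, b)
```

Because a line is a vertex, it can carry its own properties (and, later, its own failure intervals) exactly like a switch or a load.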

Graphs and temporal data: not so simple!

  • A unique graph records every event that has ever occurred
  • All events are recorded as vertex properties, in the form of time intervals:
      - Validity: tracks the equipment's life cycle (actual period as a working unit)
      - Failure: tracks equipment failures
      - SwitchState: tracks network reconfiguration through switches' changes of state (open/close)
  • Each time an event occurs, a time interval is updated or created

The concept: apply graph mutations at each new event, enriching the "maximal graph".

Graphs and temporal data: not so simple!

The concept: apply graph mutations at each new event, enriching the "maximal graph".

[Diagram: a small network of Src, Line, Load and Switch vertices, annotated with interval properties as events occur:
  • components unaffected by any event: Validity: ]-∞, +∞[, Failure: []
  • switch with a validity starting at t1: Validity: [t1, +∞[, Failure: [], SwitchState: [], DefaultState: closed
  • component with a validity ending at t2: Validity: ]-∞, t2], Failure: []
  • component failing at t3: Validity: ]-∞, +∞[, Failure: [t3, +∞[
  • switch changing state at t4: Validity: [t1, +∞[, Failure: [], SwitchState: [t4, +∞[, DefaultState: closed]
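
The mutation idea can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: intervals are `(start, end)` tuples, `math.inf` stands for the unbounded ends, timestamps are plain numbers, and the helper names are hypothetical.

```python
import math

INF = math.inf


def new_vertex(valid_from=-INF):
    """Vertex properties as lists of (start, end) intervals; an empty list means 'never'."""
    return {"Validity": [(valid_from, INF)], "Failure": [], "SwitchState": []}


def open_interval(vertex, prop, t):
    """An event starts at t: append an interval that is still running."""
    vertex[prop].append((t, INF))


def close_interval(vertex, prop, t):
    """An event ends at t: close the most recent interval of that property."""
    start, _ = vertex[prop][-1]
    vertex[prop][-1] = (start, t)


line = new_vertex()                # Validity: ]-inf, +inf[
switch = new_vertex(valid_from=1)  # validity starts at t1 = 1
old_load = new_vertex()

close_interval(old_load, "Validity", 2)  # validity ends at t2 = 2
open_interval(line, "Failure", 3)        # failure starts at t3 = 3
open_interval(switch, "SwitchState", 4)  # switch leaves its default state at t4 = 4
```

Nothing is ever deleted: each event only appends or closes an interval, which is what lets a single graph answer questions about any past instant.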

TWO CHALLENGERS: Spark GraphX AND Titan

Spark GraphX: global architecture

  • Offers a graph API built on top of Spark
  • Extends RDD to represent property graphs: VertexRDD and EdgeRDD
  • Implements a collection of graph algorithms (e.g., PageRank, Triangle Counting, Connected Components) and some fundamental operations on graphs (e.g., mapVertices, mapEdges, subgraph, collectNeighbors)

Spark GraphX: Pregel

  • Implements a variant of Pregel, a very popular graph processing architecture developed at Google in 2010: "Pregel: a system for large-scale graph processing", Malewicz et al.
  • Based on BSP (Bulk Synchronous Parallel)
  • Vertex-centric: edges don't carry any computation
  • Runs as a sequence of iterations (supersteps)
  • At each superstep, every vertex can:
      - read messages sent to it during the previous superstep
      - execute a user-defined function and generate some messages
      - send messages to other vertices (to be received at the next superstep)
      - vote to halt

Spark GraphX: Pregel

[Diagram: vertices exchanging messages across supersteps n-1, n and n+1. sendMsg emits messages to other vertices, mergeMsg collects and merges the messages addressed to the same vertex, vprog processes the merged message at the vertex, and a vertex can halt]
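
To make the superstep loop concrete, here is a single-machine sketch in Python. It borrows the names `vprog`, `send_msg` and `merge_msg` from the diagram, but it is a simplified imitation and not the GraphX API: the `pregel` function, its signature and the rule that only edges leaving a just-updated vertex may send are assumptions of this sketch.

```python
def pregel(vertices, edges, initial_msg, vprog, send_msg, merge_msg, max_supersteps=50):
    """vertices: dict id -> state; edges: list of directed (src, dst) pairs."""
    # Superstep 0: every vertex receives the initial message.
    state = {v: vprog(v, s, initial_msg) for v, s in vertices.items()}
    active = set(state)
    for _ in range(max_supersteps):
        inbox = {}
        # Simplification: only edges leaving a vertex updated last superstep may send.
        for src, dst in edges:
            if src not in active:
                continue
            for target, msg in send_msg(src, state[src], dst, state[dst]):
                inbox[target] = merge_msg(inbox[target], msg) if target in inbox else msg
        if not inbox:  # no message in flight: every vertex has halted
            break
        # Barrier: merged messages are delivered together, then vprog runs.
        for v, msg in inbox.items():
            state[v] = vprog(v, state[v], msg)
        active = set(inbox)
    return state


# Usage: number of hops from S, on a hypothetical four-vertex graph.
INF = float("inf")
vertices = {"S": 0, "A": INF, "B": INF, "C": INF}
edges = [("S", "A"), ("A", "B"), ("S", "C"), ("C", "B")]


def vprog(vid, dist, msg):
    return min(dist, msg)


def send_msg(src, d_src, dst, d_dst):
    # Only send when the destination would actually improve.
    return [(dst, d_src + 1)] if d_src + 1 < d_dst else []


hops = pregel(vertices, edges, INF, vprog, send_msg, min)
```

The loop stops on its own once no vertex has anything left to send, which plays the role of every vertex voting to halt.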

TITAN: global architecture

  • Optimized for storing and querying billions of vertices and edges over a cluster
  • Supports thousands of concurrent users
  • Can execute local queries (OLTP) or distributed queries across a cluster (OLAP)

TITAN: GREMLIN

  • Gremlin OLTP: for local graph traversals; queries run in a single process
  • Gremlin Hadoop (OLAP): for large graph analysis; queries are distributed across a cluster
  • Gremlin Hadoop implements BSP-based vertex-centric computing
  • Gremlin is a graph traversal scripting language and is part of TinkerPop, a widely supported open-source graph computing framework (e.g., Neo4j, OrientDB, Sparksee)
  • Blueprints is like a JDBC driver for graphs

[Diagram labels: storage backend + graph database + TinkerPop API + graph processing]

Querying time-varying graphs

Traversing a time-varying graph

We have as inputs:
  • A period of observation [t1, t2]
  • A particular vertex as a starting point

We know, for each vertex in the graph, when it was inactive (for a piece of equipment: taken down, under maintenance, switched, ...).

We want to traverse the graph, following the paths as they vary according to the states of the vertices encountered.

We devise a "trickling" algorithm that can be easily implemented with either paradigm, Pregel or Gremlin.

Trickling down

What paths from source S can we follow between t1 and t2?

[Diagram: at each vertex, VERTEX ACTIVITY & OBSERVATION = ACTUAL FLOWS, repeated level by level down from the source over the window t1..t2]
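
The trickling idea can be sketched in a few lines of Python. This is one reading of the slide, under assumptions: each vertex stores the intervals during which it is active (the complement of the inactivity the slides mention), the network below the source is a tree given as a `children` mapping, and the names `intersect` and `trickle` are hypothetical.

```python
def intersect(xs, ys):
    """Intersection of two sorted lists of disjoint (start, end) intervals."""
    out, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        lo, hi = max(xs[i][0], ys[j][0]), min(xs[i][1], ys[j][1])
        if lo < hi:
            out.append((lo, hi))
        # Advance whichever interval finishes first.
        if xs[i][1] < ys[j][1]:
            i += 1
        else:
            j += 1
    return out


def trickle(children, activity, source, t1, t2):
    """Map each fed vertex to the intervals during which flow from `source` reaches it."""
    flows = {source: intersect([(t1, t2)], activity[source])}
    stack = [source]
    while stack:
        v = stack.pop()
        for child in children.get(v, []):
            # A child only gets what its parent received AND what it can carry.
            flows[child] = intersect(flows[v], activity[child])
            if flows[child]:  # nothing trickles below a dead branch
                stack.append(child)
    return flows


# Hypothetical chain: the switch is inactive between 40 and 60.
children = {"S": ["L1"], "L1": ["SW"], "SW": ["LOAD"]}
activity = {
    "S": [(0, 100)],
    "L1": [(0, 100)],
    "SW": [(0, 40), (60, 100)],
    "LOAD": [(0, 90)],
}
flows = trickle(children, activity, "S", 10, 80)
```

Vertices below a branch whose flow is empty never appear in the result, which keeps the traversal from visiting parts of the network that could not have been fed during the window.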

BC1: Get a picture of all powered components

What vertices are reachable from substation S at time t?

[Diagram: the observation window is narrowed to the instant t; the actual flow trickling down from S marks five vertices CONNECTED and one DISCONNECTED]

BC2: Track how long equipment is powered

For how long were vertices reachable from S actually powered between t1 and t2?

[Diagram: the actual flows over t1..t2 give, per vertex, intervals covering 100%, 100%, 75%, 55%, 75% and 0% of the window]
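
The percentages can be obtained by measuring how much of the observation window the flow intervals of a vertex cover. A minimal sketch, assuming disjoint `(start, end)` intervals and numeric timestamps; `powered_share` is a hypothetical helper name.

```python
def powered_share(intervals, t1, t2):
    """Fraction of the window [t1, t2] covered by disjoint (start, end) intervals."""
    covered = sum(
        min(end, t2) - max(start, t1)
        for start, end in intervals
        if min(end, t2) > max(start, t1)  # ignore intervals outside the window
    )
    return covered / (t2 - t1)


# A vertex fed during [0, 50] and [75, 100] over a window of 100 time units.
share = powered_share([(0, 50), (75, 100)], 0, 100)
```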

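BC3: Manage energy balance

How much aggregated power flowed from substation S to the leaves between t1 and t2?

[Diagram: load curves (kW) at the leaves, restricted to the intervals during which each leaf was actually fed between t1 and t2, give 1,132 MWh, 856 MWh and 0 MWh; loads = 1,988 MWh in total at S]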
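
One way to picture the aggregation: keep only the load-curve samples that fall inside the intervals during which a leaf was fed, then sum them. The sketch below assumes hourly samples in kW held in a Python dictionary that stands in for the separate load-curve store; the figures are invented for illustration and are unrelated to those on the slide.

```python
def energy_kwh(load_curve, intervals, step_h=1.0):
    """Sum the kW samples whose timestamp falls in a powered interval [start, end)."""
    return sum(
        kw * step_h
        for t, kw in load_curve
        if any(start <= t < end for start, end in intervals)
    )


# Hypothetical hourly curves as (hour, kW) samples, and powered intervals per leaf.
curves = {
    "LOAD1": [(0, 2.0), (1, 3.0), (2, 4.0), (3, 1.0)],
    "LOAD2": [(0, 1.0), (1, 1.0), (2, 1.0), (3, 1.0)],
}
powered = {"LOAD1": [(0, 4)], "LOAD2": [(0, 1), (3, 4)]}

per_leaf = {leaf: energy_kwh(curves[leaf], powered[leaf]) for leaf in curves}
total = sum(per_leaf.values())
```

The graph traversal supplies the intervals and the time-series store supplies the samples, so the two can be scaled independently.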

Trickling with Gremlin

Trickling with GraphX

BC4: Restore power supply after a failure

Pattern matching: from what other substations S' can vertices below S be reached?

[Diagram: a path between substations S and S', with labels SUBSTATION, LINE, SWITCH, SUBSTATION, SWITCH, (many things)]
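
As a stand-in for the pattern query, the sketch below runs a plain breadth-first search over the undirected graph and reports the first substation met along each path. It is not Gremlin and not the authors' query: the function name, the `kind` mapping and the choice to ignore switch states (since an open switch is exactly what could be closed to restore supply) are assumptions.

```python
from collections import deque


def alternative_substations(adj, kind, start):
    """Substations other than `start` reachable without crossing another substation."""
    seen, found = {start}, set()
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for n in adj.get(v, []):
            if n in seen:
                continue
            seen.add(n)
            if kind[n] == "substation":
                found.add(n)  # record it, but do not walk past it
            else:
                queue.append(n)
    return sorted(found)


# Hypothetical network: S and S2 meet at switch SW1; S3 lies behind S2.
adj = {
    "S": ["L1"], "L1": ["S", "SW1"], "SW1": ["L1", "L2"],
    "L2": ["SW1", "S2"], "S2": ["L2", "L3"], "L3": ["S2", "S3"], "S3": ["L3"],
}
kind = {
    "S": "substation", "S2": "substation", "S3": "substation",
    "L1": "line", "L2": "line", "L3": "line", "SW1": "switch",
}
backups = alternative_substations(adj, kind, "S")
```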

Pattern matching with Gremlin

^[\d]+.*$!

// God created comments, and saw that it was good

Scaling up on big graphs

Titan at scale

  • A set of connected trees adding up to ~100 million vertices and ~100 million edges, loaded on a 10-node HBase 1.1 cluster
  • The quasi-tree below each