Powering a Virtual Power Station with Big Data

Michael Bironneau

April 2016

[Figure: Installed capacity (GW) and generation (GW) by source: CCGT, coal, nuclear, wind, interconnectors, OCGT, pumped storage, solar, oil, biomass, hydro.]

[Figure: Total power (MW).]

Average upwards flex – 120%

Average downwards flex – 35%

Open Energi in the coming year:

• 25-40k messages processed per second
• Total size of data 500TB-800TB

Perspective: here’s what “big data” means to Boeing [1]:

• ~64k messages per second from each aircraft
• Total size of data over 100 petabytes

[1]: http://bit.ly/18kQlMn
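To put the two sets of figures side by side, here is a rough back-of-the-envelope calculation. It uses only the numbers quoted on the slides and assumes decimal units (1 PB = 1000 TB). The function names are made up for this sketch. Note that the Boeing message rate is per aircraft, while the Open Energi rate is a total.

```python
# Back-of-the-envelope comparison using only the figures quoted on the slides.
SECONDS_PER_DAY = 24 * 60 * 60
TB_PER_PB = 1000  # assumption: decimal units, 1 PB = 1000 TB


def messages_per_day(rate_per_second):
    """Messages accumulated over one day at a constant per-second rate."""
    return rate_per_second * SECONDS_PER_DAY


def size_ratio(boeing_pb, open_energi_tb):
    """How many times larger a size in PB is than a size in TB."""
    return boeing_pb * TB_PER_PB / open_energi_tb


if __name__ == "__main__":
    for rate in (25_000, 40_000):
        print(f"{rate:,} msg/s is {messages_per_day(rate):,} messages per day")
    for size_tb in (500, 800):
        print(f"100 PB is {size_ratio(100, size_tb):.0f}x {size_tb} TB")
```

At the quoted rates this gives roughly 2.2 to 3.5 billion messages per day, and Boeing's "over 100 petabytes" works out to at least 125 to 200 times the projected Open Energi data size.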

[Figure: Size of data (PB), Open Energi vs. Boeing.]

Our data is not huge at the moment…

…but after domestic demand-side response (or something else on that scale)

[Figure: Size of data (PB), Open Energi vs. Boeing.]

Why Hortonworks Data Platform

• Can scale quickly to respond to market demands
• Interoperability with existing code
• Fantastic data integration
• Knowledgeable technical support
• Security and data governance

Batch | Our HDP setup

Flume

Asset Data

National Electricity Data

Market data

Other “live” timeseries data

Hive Streaming

Hive

Other applications

Real-time | (Work ongoing)

Asset Data

ML models

HDFS, cache, Elasticsearch…

Update ML Models

Correlate Events

Enrich
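The slide names the stages of the real-time pipeline without showing how they might work. As a purely illustrative sketch, the "Enrich" and "Correlate Events" stages could look something like the following in plain Python. The event fields (`asset_id`, `ts`), the metadata lookup, and the time-window rule are all assumptions made for this example, not details from the talk.

```python
from collections import deque


def enrich(event, metadata):
    """Attach static asset metadata to a raw event (field names are hypothetical)."""
    return {**event, **metadata.get(event["asset_id"], {})}


def correlate(events, window_seconds):
    """Yield pairs of events from different assets that occur close together.

    Assumes events arrive ordered by their 'ts' timestamp (in seconds).
    """
    recent = deque()
    for event in events:
        # Drop events that have fallen out of the time window.
        while recent and event["ts"] - recent[0]["ts"] > window_seconds:
            recent.popleft()
        for earlier in recent:
            if earlier["asset_id"] != event["asset_id"]:
                yield earlier, event
        recent.append(event)
```

A real deployment would read from a stream and keep the metadata in one of the stores listed above (HDFS, cache, Elasticsearch); the in-memory dictionary and deque here only stand in for those.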

Apache Hive | Example

CREATE EXTERNAL TABLE semi_structured_stuff (...)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'semi/structured',
              'es.index.auto.create' = 'false');

SELECT something
FROM semi_structured_stuff
JOIN metadata m ON …
LEFT JOIN timeseries t ON …

Index semi-structured data (Elasticsearch)

Use Hive to integrate this with timeseries data and other metadata

Farm out complex analytics to Python

SELECT transform(something) USING 'insane_maths.py' AS (result)
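For readers unfamiliar with Hive's TRANSFORM mechanism: Hive streams each input row to the script's standard input as a tab-separated line and reads tab-separated result rows back from its standard output, with NULL written as `\N`. The sketch below is a hypothetical stand-in for `insane_maths.py`. The squaring step is only a placeholder for whatever analytics the real script performs, and the helper names are made up for this example.

```python
#!/usr/bin/env python
"""Hypothetical stand-in for insane_maths.py, usable with Hive TRANSFORM."""
import sys

HIVE_NULL = "\\N"  # how Hive serialises NULL when streaming rows to a script


def transform_value(field):
    """Placeholder analytics: square a numeric field, emit NULL otherwise."""
    if field == HIVE_NULL:
        return HIVE_NULL
    try:
        value = float(field)
    except ValueError:
        return HIVE_NULL
    return repr(value * value)


def transform_lines(lines):
    """Map each tab-separated input row to a single-column result row."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield transform_value(fields[0])


def main():
    for result in transform_lines(sys.stdin):
        sys.stdout.write(result + "\n")


if __name__ == "__main__":
    main()
```

In practice the script is usually shipped to the cluster with `ADD FILE` and invoked as `USING 'python insane_maths.py'`. Because it only reads stdin and writes stdout, existing Python code can be tested locally by piping a text file through it.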

Benefits

• Reduced storage cost compared to SAN + SQL Server
• Better utilisation of infrastructure thanks to YARN
• Pain-free integration of multiple data sources with external tables in Hive
• Scale up/down on demand
• Re-use existing Python code = low development overhead

Dynamic Demand

• Simulations
• Insights via web
• Machine learning
• Statistical analysis
• Event correlation
• Expert system
• Real-time aggregation
• Real-time web feed

Thanks for listening. Any questions?