Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

32
© Hitachi, Ltd. 2016. All rights reserved. Hitachi, Ltd. OSS Solution Center 2016/10/27 Yusuke Furuyama Yang Xie Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

Transcript of Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

Page 1: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

© Hitachi, Ltd. 2016. All rights reserved.

Hitachi, Ltd. OSS Solution Center2016/10/27

Yusuke FuruyamaYang Xie

Leveraging smart meter data for electric utilities:Comparison of Spark SQL

with Hive

Page 2: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

2© Hitachi, Ltd. 2016. All rights reserved.

1. Leveraging smart meter data[Sample use case for electric utilities]2. Performance evaluation of MapReduce and Spark 1.6 (using Hive and Spark SQL)

3. Additional evaluation with Spark 2.0

Contents

4. Summary

Page 3: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

3© Hitachi, Ltd. 2016. All rights reserved.

1. Leveraging smart meter data   [Sample use case for electric utilities]

Page 4: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

4© Hitachi, Ltd. 2016. All rights reserved.

1-1 Hitachi’s Social innovation business approach

Page 5: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

5© Hitachi, Ltd. 2016. All rights reserved.

1-2 Situation of electric utilities and their needs

• Electric utilities have to adapt to competitive free market• Request for price cut of power transmission fee from government

Liberalization of the retail Electric Power Market

Needs to cut the cost for transmission and distribution equipment• Transmission and distribution equipment was replaced periodically• Decide the timing of replace by the condition of equipment

Maintenance team needs: Obtain the load status of each equipment

Electricity rate

Power plant Transmission

LinesSubstation Distributio

n LinesHome

Transmission and Distribution Companies

Power transmission

feeElectric Power Generation Companies Electricity

retailers

Page 6: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

6© Hitachi, Ltd. 2016. All rights reserved.

1-3 Situation of electric utilities and their needs (future)

• Decreasing nuclear plant as a stable power supplier• Increasing renewable energy supply

Unstable power supply

Needs for high level Demand Response• Rates by time zone (current demand response)• Many and small renewable energy suppliers• Near real-time demand response for each distribution system

Planning team needs: Obtain near real-time load status of each equipment

Electricity rate

Power plant Transmission

LinesSubstation Distributio

n LinesHome

Transmission and Distribution Companies

Power transmission

feeElectric Power Generation Companies Electricity

retailers

Page 7: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

7© Hitachi, Ltd. 2016. All rights reserved.

1-4 Leveraging big data for electric utilities

Power Grid Transformer(500/Distribution Line)

Switch(20/Distribution Line)

Meter DataManagement System Data from smart meters(every

30min.)

Analyze the data from smart meters to grasp the load status of equipment

Meet the needs of electric utilitiesSubstation

(1,000) Distribution Lines(5/Substation)

Smart Meter( 4 /Trans) ・・・0000

0000

0000

0000

0000

0000

0000

0000

0000

0000

0000

0000

Data Analysis System

Total 10,000,0

00

Concern about performance needed for near real-time processing

Page 8: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

8© Hitachi, Ltd. 2016. All rights reserved.

Data processing platform (Hadoop, Spark)

Data Analysis System

Target of this session

Planner(Equipment/

Demand response)

Meter DataManagement System

1-5 System Component

Raw Data Processed Data

Data visualization tool (Pentaho )

Page 9: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

9© Hitachi, Ltd. 2016. All rights reserved.

2. Performance evaluation of MapReduce and Spark 1.6 (using Hive and Spark SQL)

Page 10: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

10© Hitachi, Ltd. 2016. All rights reserved.

2-1 Use Case for electric utilities

Points of analysisNeeds Point of analysis

Find the equipment that needs to be replaced

Find the equipment that has heavy workload

Estimate the timing for replacement Check the trend of load status

Select the proper capacity of new equipment

Extract the peak of the load

Needs of electric utilities (recap)

Needs・ Analyze the data from smart meters to grasp the load status of equipment・ Near real-time

Page 11: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

11© Hitachi, Ltd. 2016. All rights reserved.

2-2 Contents of performance evaluation

Point of evaluation for near real-time processing

Check if MapReduce and Spark can process the data in 30min.

Point of analysis Aggregate per

Find the equipment has heavy workload

Equipment

Check the trend of load status

Term

Extract the peak of the load

Time zone

Items for evaluation

・ Distribution system  ・ Substation ・ Switch・ Transformer    ・ Meter・ 1day   ・ 1month(30days)   ・1year(365days)・ Specific 30min of each day  ・ 24h

• Concern about performance for processing 10,000,000 meters• Data comes from smart meter every 30min (48/Day, spec of smart

meter)

Page 12: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

12© Hitachi, Ltd. 2016. All rights reserved.

2-3 Target of performance evaluation

Target of evaluation

Data Processing Platform(MapReduce, Spark 1.6)

Data Analysis System

End

Data visualization tool ( Pentaho )

Meter DataManagement System

Target

Aggregation batch(Hive / Spark SQL)

Start

Time from start of aggregation batch for meter data to end of the batch

Page 13: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

13© Hitachi, Ltd. 2016. All rights reserved.

2-4 Target of performance evaluation System

Configuration Spec

Master NodeCPU Core 2

Memory 8 GBCapacity of

disk80 GB

# of disk 1  Per slave

nodetotal

CPU Core 16 64 Memory 128 GB 512 GB

Capacity of disk

900 GB -

# of disk 6 24 Total capacity

of disks5.4 TB

( 5,400 GB )

21.6 TB( 21,600

GB )

10Gbps LAN10Gbps   SW1Gbps   LAN

disk disk ・・・

disk

4 Slave Nodes(Physical Machines)

1 Master Node  (Virtual Machine)

Page 14: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

14© Hitachi, Ltd. 2016. All rights reserved.

2-5 Dataset

Smart meter data

Term # of records Size ( CSV ) Size ( ORC File )

365days (1year) 3,650 million 2.475 TB 1.325 TB30days (1month) 300 million 0.205 TB 0.158 TB1day 10 million 0.007 TB 0.005 TB

0:00-0:30 power usage

0:30-1:00 power usage

・・・

23:30-0:00 power usage

Meter mgmt. info

meter10:00-0:30 power

usage0:30-1:00 power

usage・・・

23:30-0:00 power usage

Meter mgmt. info

meter2・・・

0:00-0:30 power usage

0:30-1:00 power usage ・・

23:30-0:00 power usage

Meter mgmt. info

meter10,000,000

Smart meter data /day

Data size

48 columns (every 30min)

10,000,000 records/day

Page 15: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

15© Hitachi, Ltd. 2016. All rights reserved.

2-6 Contents of performance evaluation (recap)

Items for evaluation

Target of evaluation

Point of analysis Aggregate per

Find the equipment has heavy workload

Equipment

Check the trend of load status

Term

Extract the peak of the load

Time zone

・ Distribution system  ・ Substation ・ Switch・ Transformer    ・ Meter・ 1day   ・ 1month(30days)   ・1year(365days)・ Specific 30min of each day  ・ 24h

Point of evaluation

Time from start of aggregation batch for meter data to end of the batch

Check if MapReduce and Spark can process the data in 30min.

Page 16: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

16© Hitachi, Ltd. 2016. All rights reserved.

Time from start of aggregation batch for meter data to end of the batch

2-6 Contents of performance evaluation (recap)

Point of evaluationCheck if MapReduce and Spark can process the data in 30min.

Items for evaluationPoint of analysis Aggregate

perFind the equipment has heavy workload

Equipment

Check the trend of load status

Term

Extract the peak of the load

Time zone

・ Distribution system  ・ Substation ・ Switch・ Transformer    ・ Meter・ 1day   ・ 1month(30days)   ・1year(365days)・ Specific 30min of each day  ・ 24h

+ File type・ Text (CSV)・ ORCFile (Column-based)

Target of evaluation

Page 17: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

17© Hitachi, Ltd. 2016. All rights reserved.

2-7 Comparison of txt with ORCFile (MapReduce)

• Couldn’t finish processing in 30min• Performance improvement by ORCFile

1day

30days

365days

0 sec 2000 sec 4000 sec 6000 sec

62 sec

224 sec

3286 sec

99 sec

338 sec

6962 sec

Hive + TXTHive + ORCFile

1day

30days

365days

0 sec 2000 sec 4000 sec

31 sec

69 sec

492 sec

32 sec

138 sec

3015 sec

Hive + TXTHive + ORCFile

30min 30min

Result (24h) Result (0.5h)

Page 18: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

18© Hitachi, Ltd. 2016. All rights reserved.

2-8 Comparison of txt with ORCFile (Spark 1.6)

1day

30days

365days

0 sec 400 sec 800 sec 1200 sec 1600 sec

28 sec

50 sec

993 sec

47 sec

154 sec

1408 sec

Spark SQL + TXTSpark SQL + ORCFile

• Could finish processing in 30min (1800s)• Performance improvement by ORCFile

1day

30days

365days

0 sec 400 sec 800 sec 1200 sec 1600 sec

26 sec

32 sec

169 sec

30 sec

111 sec

1263 sec

Spark SQL + TXTSpark SQL + ORCFile

Result (24h) Result (0.5h)

Page 19: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

19© Hitachi, Ltd. 2016. All rights reserved.

2-9 Review of the resultsWhy the processing was fast with ORCFile

Processing 0.5h data

・・・

・・・

Smart meter data

・・・

0 :00

23 :300 :

00

0 :30

0 :00

0 :000 :00

0 :00

0 :000 :00

0 :00

Use 1column

only

Processing 24h data

・・・

・・・

・・・

・・・

・・・

0 :30

23 :30 SUM/day

Use all 48

columns+

• Processing big data with ORCFile was more effective than processing small data

Results

• Processing 0.5h data with ORCFile was more effective than processing 24h data

Compression

Reading specific column

Features of ORC File

++

++

+++

++

SUM/daySUM/day

SUM/daySUM/day

Smart meter data0 :00

0 :300 :30

0 :300 :30

23 :3023 :30

23 :3023 :30

0 :000 :000 :00

0 :000 :00

0 :300 :30

0 :300 :30

23 :30

23 :3023 :30

23 :30

Page 20: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

20© Hitachi, Ltd. 2016. All rights reserved.

Distributionsystem

Perequipment

0 sec 500 sec 1000 sec

32 sec

104 sec

69 sec

556 sec

HiveSpark SQL

2-10 Comparison of MapReduce with Spark 1.6

Distributionsystem

Perequipment

0 sec 500 sec 1000 sec

50 sec

145 sec

224 sec

718 sec

HiveSpark SQL

Distributionsystem

Perequipment

0 sec 1000 sec 2000 sec 3000 sec 4000 sec

993 sec

1359 sec

3286 sec

4131 sec

HiveSpark SQL

・ Could finish processing the data from 10,000,000 meter in 30min using Spark 1.6!・ Spark’s good performance with “per equipment” processing

Result (30days / 24h) Result (30days / 0.5h)

Result (365days / 24h)

30min

80%down

78%down

81%down

54%down

Page 21: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

21© Hitachi, Ltd. 2016. All rights reserved.

Trans2,500,000

Switch100,000

2-11 Review of the results Why the processing per equipment was more effective than the

processing for entire distribution system when using spark?

Per equipment・・・

・・・

・・・

・・・

・・・

・・・

・・・

・・・

・・・

・・・

0 :000 :000 :00

0 :000 :00

0 :300 :300 :30

0 :300 :30

23 :3023 :3023 :30

23 :3023 :30

Trans1Trans2Trans3

Switch1Switch2

Line1 Substa.1

10005000

・・・

・・・

・・・

・・・

・・・

0 :000 :000 :00

0 :30

23 :3023 :3023 :30

・・・

0 :000 :00

0 :300 :30

23 :3023 :30

For entiredistributionsystem

JOIN

・ Less disk I/O than MapReduce・ Smaller data (including re-distributing data) than total memory of cluster

Smart meter data Transformer Switch Distributio

nLine

Substation

Distribution

System

Distribution

System

JOIN JOIN JOINJOIN

Smart meter data

0 :30

0 :30

Page 22: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

22© Hitachi, Ltd. 2016. All rights reserved.

3. Additional evaluation with Spark 2.0

Page 23: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

23© Hitachi, Ltd. 2016. All rights reserved.

3-1 Evaluation environment System

Configuration Spec

Master NodeCPU Core 16Memory 12 GB

Capacity of disk

900 GB

# of disk 8  Per slave

nodetotal

CPU Core 20 120Memory 384 GB 2304 GB

Capacity of disk

1200 GB -

# of disk 10 60 Total capacity

of disks12.0 TB

( 12,000 GB )

72.0 TB( 72,000

GB )

10 Gbps LAN10 Gbps   SW

6 Slave Nodes(Physical)

1 Master Node(Physical)  

disk disk ・・・

disk

Page 24: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

24© Hitachi, Ltd. 2016. All rights reserved.

3-2 Comparison of Spark 2.0 with Spark 1.6

PerEquipment

0 sec 100 sec 200 sec 300 sec 400 sec

316 sec

322 sec

Spark 1.6Spark 2.0

PerEquipment

0 sec 30 sec 60 sec 90 sec 120 sec

87 sec

102 sec

Spark 1.6Spark 2.0

PerEquipment

0 sec 20 sec 40 sec 60 sec 80 sec 100 sec

71 sec

89 sec

Spark 1.6Spark 2.0

・ Performance improvement roughly 20% (including disk I/O)・ More effective with small data

Result (365days / 24h) Result (30days / 24h)

Result (1day / 24h)

Time from start of aggregation batch for meter data to end of the batch

Page 25: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

25© Hitachi, Ltd. 2016. All rights reserved.

Demo

Page 26: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

26© Hitachi, Ltd. 2016. All rights reserved.

Data visualization tool (Pentaho )

Aggregated Data(29 days)

Demo: Data aggregation and visualization

② Execute aggregation batch (Spark SQL)

① Show 29 days data

Aggregated Data(30 days)

③ Show 30 days data

Raw Data (1day)

④ Outlier detected!!

Page 27: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

27© Hitachi, Ltd. 2016. All rights reserved.

4. Summary

Page 28: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

28© Hitachi, Ltd. 2016. All rights reserved.

4 Summary

Leveraging data from 10,000,000 smart meters for electric utilities- Built data analysis system- Concern about performance

Evaluate the performance of batch processing- Spark could process the data from 10,000,000 meters in 30min (4node)

Evaluate the performance of Spark 2.0- Performance improvement roughly 20% (compared to 1.6)

Data Processing Platform(MapReduce, Spark)

Data Analysis System

Data visualization tool ( Pentaho )

Aggregation batch(Hive / Spark SQL)

Meter DataManagement System

Page 29: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

© Hitachi, Ltd. 2016. All rights reserved.

Hitachi, Ltd. OSS Solution Center

Comparison of Spark SQL with HiveLeveraging smart meter data for electric utilities:

2016/10/27

Yusuke FuruyamaYang Xie

END

29

Page 30: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

30© Hitachi, Ltd. 2016. All rights reserved.

他社商品名、商標等の引用に関する表示

Hadoop 、 Hive および Spark は、 Apache Software Foundation の米国およびその他の国における登録商標または商標です。

その他記載の会社名、製品名などは、それぞれの会社の商標もしくは登録商標です。

Page 31: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive
Page 32: Leveraging smart meter data for electric utilities: Comparison of Spark SQL with Hive

32© Hitachi, Ltd. 2016. All rights reserved.

Appendix. Difficulty with large data to be shuffled

Attempted to aggregate raw (48 columns) meter data per equipment- Extremely slow (Spark 2.0) or Job failed (Spark 1.6)- Processing: Iteration of JOIN and GROUP BY+SUM- Huge data to be shuffled (spilled out from page cache)

1day 30days 365days0 sec

2000 sec4000 sec6000 sec8000 sec

10000 sec12000 sec14000 sec

76 sec 223 sec

13281 sec

85 sec 236 sec

3650 sec

Before addingAfter adding

Target data

Proc

essi

ng t

ime

Add HDFS disks as disks for shuffle- Performance Improved (365days)- Performance degraded (1day/30days)

・ Data for Spark (including temporary data) should be smaller than memory. ・ Had better to process as a trial to estimate

Adding disks for shuffle (Spark 2.0.0 )

Heavy load on a local disk (OS disk) by shuffle