Data Warehousing for Distributed R&DData Warehousing for ... · Data Warehousing for Distributed...

22
Data Warehousing for Distributed R&D Data Warehousing for Distributed R&D Data Warehousing for Distributed R&D Overview and Insights gained Data Warehousing for Distributed R&D Overview and Insights gained Arno Claassen Arno Claassen International Conference International Conference Stefan Gudenkauf Dr. Ulrike Steffens OFFIS Institute for Information Technology Stefan Gudenkauf Dr. Ulrike Steffens OFFIS Institute for Information Technology RAVE 2012 May 8-10, 2012 Bremerhaven, Germany RAVE 2012 May 8-10, 2012 Bremerhaven, Germany Gefördert auf Grund eines Beschlusses des Deutschen Bundestages Projektträger Koordination OFFIS Institute for Information Technology OFFIS Institute for Information Technology Bremerhaven, Germany Bremerhaven, Germany

Transcript of Data Warehousing for Distributed R&DData Warehousing for ... · Data Warehousing for Distributed...

Page 1: Data Warehousing for Distributed R&DData Warehousing for ... · Data Warehousing for Distributed R&DData Warehousing for Distributed R&D Overview and Insights gained Arno ClaassenArno

Data Warehousing for Distributed R&DData Warehousing for Distributed R&DData Warehousing for Distributed R&DOverview and Insights gainedData Warehousing for Distributed R&DOverview and Insights gained

Arno ClaassenArno Claassen International Conference International Conference Stefan GudenkaufDr. Ulrike SteffensOFFIS – Institute for Information Technology

Stefan GudenkaufDr. Ulrike SteffensOFFIS – Institute for Information Technology

RAVE 2012May 8-10, 2012Bremerhaven, Germany

RAVE 2012May 8-10, 2012Bremerhaven, Germany

Gefördert auf Grund eines Beschlusses des Deutschen Bundestages Projektträger Koordination

OFFIS Institute for Information TechnologyOFFIS Institute for Information Technology Bremerhaven, GermanyBremerhaven, Germany

Page 2: Data Warehousing for Distributed R&DData Warehousing for ... · Data Warehousing for Distributed R&DData Warehousing for Distributed R&D Overview and Insights gained Arno ClaassenArno

R&D I tit t f ICTR&D Institute for ICTAssociated institute of the Carl von Ossietzky University in OldenburgMore than 290 employees (~150 researchassistants)Established in 1991Established in 1991

i i iR&D Divisions

OFFIS – Escherweg 2 – 26121 Oldenburg – Germany

Page 3: Data Warehousing for Distributed R&DData Warehousing for ... · Data Warehousing for Distributed R&DData Warehousing for Distributed R&D Overview and Insights gained Arno ClaassenArno

Project Overview

Data Warehouse

Use in Transition

CurrentMeasures

InsightsGained

0 1 2 3 40 1 2 3 4

3© jeff_w_brooktree via flickr

Page 4: Data Warehousing for Distributed R&DData Warehousing for ... · Data Warehousing for Distributed R&DData Warehousing for Distributed R&D Overview and Insights gained Arno ClaassenArno

Project Overviewj

Data Warehouse (DWH)• Collection and harmonization of engergy

data• Data provisioning for research WEDA• Data provisioning for research

accompanying the alpha ventus offshore wind park

WEDASystem

– Wind turbine (WEA) optimization– Environmental impact analysis

• Data access policy enforcement• Data access policy enforcement

Storage-Relevant Energy Datag gy• Secondary data (calibrated sensor data)• Tertiary data (statistically aggregated data) Icons © Oxygen Team via iconarchive.com

4

• Metadata (data on sensors and tertiary data)

Page 5: Data Warehousing for Distributed R&DData Warehousing for ... · Data Warehousing for Distributed R&DData Warehousing for Distributed R&D Overview and Insights gained Arno ClaassenArno

Project Overview

Data Warehouse

Use in Transition

CurrentMeasures

InsightsGained

0 1 2 3 40 1 2 3 4

5© jeff_w_brooktree via flickr

Page 6: Data Warehousing for Distributed R&DData Warehousing for ... · Data Warehousing for Distributed R&DData Warehousing for Distributed R&D Overview and Insights gained Arno ClaassenArno

Architecture Overview

WEDA ETL

WEDASystem

• Central component for sensor data storageand retrieval

• Monitors data input delivery and quality

WEDAPortal

• Monitors data input delivery and quality• Guarantees technical data integrity WEDA

RUM

WEDA RUM• Role-based user access control

WEDAETL

WEDA Portal• Access portal for research partners

Data Store• Access portal for research partners• Data exploration and download

6

Page 7: Data Warehousing for Distributed R&DData Warehousing for ... · Data Warehousing for Distributed R&DData Warehousing for Distributed R&D Overview and Insights gained Arno ClaassenArno

Import Data VolumespSurvey „Storage Strategies forSensor Data in Wind Energy“

alpha ventus DWH Other (non-wind energy) Projects

20 to 250 sensors 1,500 sensors Data sizes comparableor even greater thanthose in alpha ventus5 min to 20 Hz approx. 75% of the sensors t ose a p a e tus

operating at 50 Hz

45,000 to 100,000 meteredvalues per day

4,860,000,000 metered valuesper day

~18 GByte per day (data valuesize approx. 4 Byte) and 6.6TByte per year

Up to 45,000 values file system-based storage or data base; over 45,000 only file system-based storage

Hybrid (see next slide) KIWI-concepts (Kill itwith Iron)e.g., [Yuen et al., 2007; Ghemawat et al 2003]based storage Ghemawat et al., 2003]

Volume reduction bycompression

Volume reduction bycompression

E l ti f St St t i d D i h i f l h t

7

Evaluation of Storage Strategies and Design choices for alpha ventusdocumented in [Beenken et al., 2009]

Page 8: Data Warehousing for Distributed R&DData Warehousing for ... · Data Warehousing for Distributed R&DData Warehousing for Distributed R&D Overview and Insights gained Arno ClaassenArno

( )

Hybrid Storage ConceptDWH System (WEDA ETL)

Extraction Staging (Load)Data BaseTertiary Data,

Primary DataSecondary Data

yReferences toSecondary Data

Workbench

Data Sources File ServerSecondary Data

Secondary Data

Monitoring

Transformation

Secondary DataStorage Format

Tertiary DataGeneration

DWH Manager

E t d d DWH R f M d l i d ith [B d Gü l

DWH Portal(WEDA Portal)

8

Extended DWH Reference Model in accordance with [Bauer and Günzel, 2004]

Page 9: Data Warehousing for Distributed R&DData Warehousing for ... · Data Warehousing for Distributed R&DData Warehousing for Distributed R&D Overview and Insights gained Arno ClaassenArno

Project Overview

Data Warehouse

Use in Transition

CurrentMeasures

InsightsGained

0 1 2 3 40 1 2 3 4

9© jeff_w_brooktree via flickr

Page 10: Data Warehousing for Distributed R&DData Warehousing for ... · Data Warehousing for Distributed R&DData Warehousing for Distributed R&D Overview and Insights gained Arno ClaassenArno

Use in Transition

Linear DWH growth

10

Page 11: Data Warehousing for Distributed R&DData Warehousing for ... · Data Warehousing for Distributed R&DData Warehousing for Distributed R&D Overview and Insights gained Arno ClaassenArno

Use in Transitionj i

Memory upgrade

RAVE projects terminate

Memory upgrade

Increased demand ford t t l i d f tidata sets over large periods of time

Test phase

11

Page 12: Data Warehousing for Distributed R&DData Warehousing for ... · Data Warehousing for Distributed R&DData Warehousing for Distributed R&D Overview and Insights gained Arno ClaassenArno

Use in Transition

Input and usageData delivery input

12

Page 13: Data Warehousing for Distributed R&DData Warehousing for ... · Data Warehousing for Distributed R&DData Warehousing for Distributed R&D Overview and Insights gained Arno ClaassenArno

Project Overview

Data Warehouse

Use in Transition

CurrentMeasures

InsightsGained

0 1 2 3 40 1 2 3 4

13© jeff_w_brooktree via flickr

Page 14: Data Warehousing for Distributed R&DData Warehousing for ... · Data Warehousing for Distributed R&DData Warehousing for Distributed R&D Overview and Insights gained Arno ClaassenArno

Current Measures

• Download cancellation and wait queue

• Complementary research documents– No restriction on file format– Upload is curated by OFFIS

• Time stamp calculation for download compilationcompilation– The frequency of the sensors

could not always be met during data recording before delivery todata recording before delivery to DWH

• Data compression for volume reductionreduction– Only data that is older than 1

yearO h fl d i

14

– On-the-fly decompression upon user queries

– ~ 20% compression rate

Page 15: Data Warehousing for Distributed R&DData Warehousing for ... · Data Warehousing for Distributed R&DData Warehousing for Distributed R&D Overview and Insights gained Arno ClaassenArno

Project Overview

Data Warehouse

Use in Transition

CurrentMeasures

InsightsGained

0 1 2 3 40 1 2 3 4

15© jeff_w_brooktree via flickr

Page 16: Data Warehousing for Distributed R&DData Warehousing for ... · Data Warehousing for Distributed R&DData Warehousing for Distributed R&D Overview and Insights gained Arno ClaassenArno

Insights Gainedg

Continuous Adaptations • Query optimization and table index updates (cascading change effects)• Changes to data format specification required changes to the portal

and repeated downloadsp• Increased demand for data sets over large periods of time required

memory upgrade

Local DWH test instance• Test with representative/real-life data after storage design phase

Tendency to local copies • If data is available, researcher partners will also want them on their , p

own systems

16

Page 17: Data Warehousing for Distributed R&DData Warehousing for ... · Data Warehousing for Distributed R&DData Warehousing for Distributed R&D Overview and Insights gained Arno ClaassenArno

Insights Gainedg

Recurring technical data quality issues• Delivered secondary input data to be stored in the DWH did not always

conform to the agreed specification• Communication overhead

Domain-specific data quality and additional information• If delivered secondary input data is corrupted it probably can be• If delivered secondary input data is corrupted, it probably can be

resolved automatically by batch processing– Batch job can be considered as a part of the sensor must also be

documented and archiveddocumented and archived – Unclear if processed data should replace the corrupted data or if it

should be stored separately– In general, experimental documentation and data history is highly

relevant for future research projects

17

Page 18: Data Warehousing for Distributed R&DData Warehousing for ... · Data Warehousing for Distributed R&DData Warehousing for Distributed R&D Overview and Insights gained Arno ClaassenArno

Outlook

• Progress information for big download compilations

• Thorough use of multithreadingf l d d d l d il ti– e.g., for uploads and download compilation

• Development of a performance laboratory based on the localDevelopment of a performance laboratory based on the local DWH test instance

• Re-evaluation of DWH-internal storage structures

• Server side data processing and analysis• Server-side data processing and analysis– e.g., standard interfaces for aggregated statistical data

18

Page 19: Data Warehousing for Distributed R&DData Warehousing for ... · Data Warehousing for Distributed R&DData Warehousing for Distributed R&D Overview and Insights gained Arno ClaassenArno

Questions?Questions?

[email protected]@[email protected]@offis.de

Page 20: Data Warehousing for Distributed R&DData Warehousing for ... · Data Warehousing for Distributed R&DData Warehousing for Distributed R&D Overview and Insights gained Arno ClaassenArno

ReferencesReferences[Bauer and Günzel, 2004] Bauer, A.; Günzel, H.: Data Warehouse Systeme, d k ldpunkt-Verlag, 2004

[Beenken et al., 2009] Beenken, P.; Schwassmann, S.; Albrecht, M.; Appelrath, H.-J.; Heisecke, S.: Speicherstrategien für die Sensordaten des Offshore-Windparks alpha ventus. Datenbank Spektrum, Heft 28, February 9, 2009

[Yuen et al., 2007] Yuen, C.; Hopkins Dreyer, L.; Krneta, P.: Performance [ , ] , ; p y , ; ,Sizing Report – Petabyte Data Warehouse. InfoSizing, Inc., August 20, 2007

[Ghemawat et al., 2003] Ghemawat, S.; Gobioff, H.; Leung, S.-T.: The[Ghemawat et al., 2003] Ghemawat, S.; Gobioff, H.; Leung, S. T.: The Google File System. 19th ACM Symposium on Operating Systems Principles, 2003

Page 21: Data Warehousing for Distributed R&DData Warehousing for ... · Data Warehousing for Distributed R&DData Warehousing for Distributed R&D Overview and Insights gained Arno ClaassenArno

Hybrid Storage Concepty g p

Storage Efficiency• Secondary data stored in file system• Tertiary data calculated from secondary data and stored in data base• Tertiary data references the corresponding secondary data in the fileTertiary data references the corresponding secondary data in the file

systemData Handling• Users can analyse tertiary data before secondary data must be considered• Users can analyse tertiary data before secondary data must be considered

and downloaded Export Optimization

d d b d l d d d l• Secondary data can be downloaded directly• Users can select subsets of secondary data (queries) according to various

parameters; server-side recompilation provides only the relevant dataLong-Term Technology-Independent Storage• Guaranteed by file system-based secondary data storage in plain CSV

format

21

Page 22: Data Warehousing for Distributed R&DData Warehousing for ... · Data Warehousing for Distributed R&DData Warehousing for Distributed R&D Overview and Insights gained Arno ClaassenArno

Import Data Volumesp

Some numbers• 1,500 sensors, approx. 75 % operating in 50 Hz• 4,320,000 data values per day and 50 Hz sensor

4 860 000 000 d t l d• 4,860,000,000 data values per day

22