20090701 Climate Data Staging

31
Folie 1 Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01 A Python Framework for Staging of Geo-referenced Data on the Collaborative Climate Community Grid (C3-Grid) Henning Bergmeyer (http://tr.im/bergmeyer) German Aerospace Center Simulation- and Software Technology (V.09-06-30) Most up-to-date version of these slides available at: http://tr.im/ep09_pymodest

description

This is the presentation version containing animations. Therefore it is not printable. I am going to post a printable version during the next days.

Transcript of 20090701 Climate Data Staging

Folie 1Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

A Python Framework for Staging ofGeo-referenced Data on the Collaborative Climate Community Grid (C3-Grid)

Henning Bergmeyer (http://tr.im/bergmeyer)

German Aerospace CenterSimulation- and Software Technology(V.09-06-30)Most up-to-date version of these slides available at: http://tr.im/ep09_pymodest

Folie 2Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

DLRGerman Aerospace Center

Research InstitutionSpace AgencyProject Management Agency

Folie 3Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

Koeln

Lampoldshausen

Stuttgart

Oberpfaffenhofen

Braunschweig

Goettingen

Berlin-

Bonn

Trauen

Hamburg

Neustrelitz

Weilheim

Bremen-

Locations and employees

6000 employees across 29 research institutes and facilities at

13 sites.

Offices in Brussels, Paris and Washington.

Folie 4Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

Talk Outline

Addressing the Heterogeneity Problem of Climate Data

PyModESt: Modular Extendable Stager in Python

Examples snippets from Data provider scripts for DLR data sets

initializeEnvironment

readStage Request

selectDataProcessor

chooseRequest Mode

handleCancel Request

handleStage Request

handleEstimation Request

handleExceptions

tidyWork Space

writeResponses

• authorize• retrieve Meta Data• adjust Constraints• estimate

• Stage Time• File Size

• find abandonedtemporary files

• (authorize)• prepare work space• retrieve meta data• adjust constraints• retrieve data• update meta data• transfer files to target

selectDataProcessor

• authorize• retrieve Meta Data• adjust Constraints• estimate

• Stage Time• File Size

• (authorize)• prepare work space• retrieve meta data• adjust constraints• retrieve data• update meta data• transfer files to target

initializeEnvironment

readStage Request

selectDataProcessor

chooseRequest Mode

handleCancel Request

handleStage Request

handleEstimation Request

handleExceptions

tidyWork Space

writeResponses

• authorize• retrieve Meta Data• adjust Constraints• estimate

• Stage Time• File Size

• find abandonedtemporary files

• (authorize)• prepare work space• retrieve meta data• adjust constraints• retrieve data• update meta data• transfer files to target

selectDataProcessor

• authorize• retrieve Meta Data• adjust Constraints• estimate

• Stage Time• File Size

• (authorize)• prepare work space• retrieve meta data• adjust constraints• retrieve data• update meta data• transfer files to target

PortalPortal

Workflow Management

Workflow Management

Data Management

Data Management

Data Information

System

Data Information

System

ArchiveArchive

Use

rs V

iew

Grid

Mid

dle

war

eR

esou

rces

PortalPortal

Workflow Management

Workflow Management

Data Management

Data Management

Data Information

System

Data Information

System

ArchiveArchive

Use

rs V

iew

Grid

Mid

dle

war

eR

esou

rces

Pre-Processing

Simulation Monitoring

Post-Processing

Visualization

End

Start

Stop Simulation

Optimum reached?

Yes

No

Yes

No

Problems?

Folie 5Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

The HeterogeneityProblem with Data inClimate Research

Distributed Data Archives throughout Germany

Large Data Quantities(Peta Bytes)Storage atData Sources(Sensors, Institutes)

Many Data FormatsDue to nature and purpose of dataHistoric reasons

Folie 6Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

Pre-Processing

Simulation Monitoring

Post-Processing

Visualization

End

Start

Stop Simulation

Optimum reached?

Yes

No

Yes

No

Problems?

Scientific Workflow and Use of DataWorkflows consist of dependent tasksTasks are carried out

manuallyas local Applicationsas Jobs on compute resources

Scheduler / Batch Systemsplan execution time of jobson required resources (time, space)

Tasks consume and produce dataStaging: job data retrieval and storage of resultaccess should be standardized

Folie 7Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

Transparent Grid Infrastructure for the GermanClimate Research Community

Uniform access to heterogeneous climate data stores

Publishing

Discovery

Citation

Download

Data Processing Tools

Data Visualization

Standard Tools / Workflows

User Interface: Web Portal

The Collaborative Climate Community Grid C3-Grid (2005-2009)

Folie 8Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

Working with Data on the Portal

Advanced Search Browse by Data Set

Folie 9Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

C3-Grid Layered Architecture (simplified)

PortalPortal

Workflow Management

Workflow Management

Data Management

Data Management

Data Information

System

Data Information

System

ArchiveArchive

Use

rs V

iew

Grid

Mid

dlew

are

Res

ourc

es

WDC Climate WDC MareWDC RSAT DWD DKRZ, PIK, GKSS, AWI, MPI-M

IFM-Geomar FU Berlin Uni Cologne

Storage SolutionsdCacheOGSA-DAIFlat Files…heterogeneous!

Abstraction: Data Request Service + Metadata

Scheduled Tasks:• Computation• Data Download

=> custom staging scripts

Folie 10Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

Stage Request Constraints: Data Selection

Data service receives selection constraints according to metadata

Data Set (Object ID)

Folie 11Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

Stage Request Constraints: Data Selection

Data service receives selection constraints according to metadata

Data Set (Object ID)

Variables as CF Names

(Climate and Forecast MD Convention)

log_surface_pressure mole_fraction_of_ozone_in_air

Folie 12Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

Stage Request Constraints: Data Selection

Data service receives selection constraints according to metadata

Data Set (Object ID)

Variables as CF Names

(Climate and Forecast MD Convention)

Regional Bounds (Longitude, Latitude)

Vertical Bounds andVertical Coordinate Reference System

Time Period

Data Set Specific Constraints

log_surface_pressure mole_fraction_of_ozone_in_air

Folie 13Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

Problem solved… Solved in Middleware: Abstraction of heterogeneous storage access

Standardized Metadata Format (ISO 19115/19139)

Common Set of Request Constraints over Grid Service

…another one brought to the surface Computer Science Exercises for Meteorologists

Adaption of storage access to the standard service interface

extend Java service or write external stager script

Implementing Web/Grid Service Features

Concurrency Issues

Error handling

CF Name Translation

Metadata update requires XML processing & Schema knowledge

Folie 14Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

PyModESt helps the data provider

Hide unnecessary complexity

e.g. XML processing

Give Recipes

Guided Template for Data Processors in Python

Make implementation straight forward (e.g. reduce interfaces)

1 Input Channel (Constraints as Python primitives)

1 Output Channel

Settings and Environment Access

Avoid common mistakes

Concurrency of Requests

Let Users stay in their domain

Let them use their own tools

Provide additional tools for common issues

e.g. data grid calculations

Folie 15Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

Staging Process Skeleton

initializeEnvironment

readStage Request

selectDataProcessor

chooseRequest Mode

handleCancel Request

handleStage Request

handleEstimation Request

handleExceptions

tidyWork Space

writeResponses

• authorize• retrieve Meta Data• adjust Constraints• estimate

• Stage Time• File Size

• find abandoned temporary files

• (authorize)• prepare work space• retrieve xml metadata• adjust constraints• retrieve data• update metadata• transfer files to target

selectDataProcessor

• authorize• retrieve Meta Data• adjust Constraints• estimate

• Stage Time• File Size

• (authorize)• prepare work space• retrieve xml metadata • adjust constraints• retrieve data• update metadata• transfer files to target

Necessary Implementation effort for DP when using PyModESt

Folie 16Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

Implement a Data Processor

Initialize data processor __init__(c3env, stage_request)

perform common steps for stage request and estimation

e.g. determine data mesh properties

c3env and stage_request reference in data processor context

do not touch base data

Hook 1: retrieveAndFilterDataFiles()

– Hook 2: updateMetaData(c3_metadata)

Folie 17Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

Hook 1: retrieveAndFilterDataFiles()

1. determine required base data files

Data provider knowledge about data store

2. download a base data file to a temporary filedld_datafile_name = self.c3env.reserveTempFilePath()fln, headers = urllib.urlretrieve( \

url=src_url, filename= dld_datafile_name)

3. extract constraint fulfilling data and append result file

Use specific tools or libraries

4. update result properties

remember meta data relevant attributes of the result

5. remove downloaded file to save temporary disk spaceos.remove(fln)self.c3env.releaseTempFilePath(dld_datafile_name )

6. continue with step 2 until finished

Folie 18Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

Metadata object is automatically initialized from metadata server

Just pass updated attributes to methods of the metadata object

md.removeQuicklook()

md.filterContentInfo()

md.setHorizontalBounds()

md.setVerticalExtent()

md.updateTimePeriod()

md.setObjectId()

md.addLineageProcessStep()

Hook 2: updateMetadata(md)

Folie 19Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

C3Thesaurus - CF Names Translation(based on Guido’s recipe for dict inversion)

Climate and Forecast Convention used in C3-Grid to name variables

Data providers internally often follow other conventions

historic reasons

data file formats (e.g. GRB only supports numeric indexes)

Translates variable names between representations in different scopes

c3t.translateFromC3(self, attributes, tgt_scope)

c3t.translateToC3(self, attributes, src_scope)

{ SCOPE1 : { CF1 : DP1, CF2 : DP2 }, SCOPE2 : … }

SCOPE_MAP = { "g2.de.dlr.wdc.ERS2.GOME.L3.VCD.MONTHLYMEAN.O3" : { "mean_ozone_VCD" : "mean", "standard_deviation_ozone_VCD" : "strd_dev” } }

Folie 20Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

Implement the Estimation Request(Scheduling Questions)

1. Verify request constraints

Is the request fulfillable?

2. Estimate result file size

How much storage space does the extracted data need?

3. Estimate staging time

When can I start a process that depends on the data?

4. Offer a contract to the Scheduling System

Do NOT process any base data files for this

Hook 3: estimateFileSize() returns long

Hook 4: estimateStageTime(stage_instant) returns timedelta

Folie 21Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

Hook 3: estimateFileSize()

Return a long value (Size in Bytes)

For regular Raster / Table Data simplymultiply and sum-up

Use “GaussianGridHelper” to calculatetable index ranges on Gaussian Grids

Handle regions overlapping thelongitude periodic edge

gauss_grid_hlp = RegularGaussianGridHelper( src_lat_min, src_lat_max, lat_delta, lat_len, src_lon_min, src_lon_max, lon_delta, lon_len )

lat_idx_min, lat_idx_max, lat_idx_len, lon_idx_min, lon_idx_max, lon_idx_len = gauss_grid_hlp.calculateRegionIndices( lat_min, lat_max, lon_min, lon_max )

Folie 22Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

Hook 4: estimateStageTime(stage_instant)

Given an instant for a Stage Request,when will the requested data be available?

Return a datetime.timedelta value

It is practically impossible to be precise in a complex dynamic system

theoretically statistics over past requests could be used

heuristically compare with sensor data of current server and network loads

Currently DLR implementations generously over-estimate with a constant: timedelta(seconds=60)

Folie 23Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

Associate Data Processors with Data Set

Modularity of Stagers

Data Processor: Hooks + CF Translation for new data sets

The starter script

Contains all configuration settings

Associates data sets with data processors

{ ObjectID : DataProcessor }

Specific DataProcessor sub-class is plugged into the skeleton implementation

PROCESSOR_TYPES = {"g2.de.dlr.wdc.CWF" : netcdf_extraction.IPA_NCDF_Processor,“g2.de.dlr.wdc.ERS… “: wdc_hdf_processing.WDCHDFProcessor }

PROCESSOR_TYPES[stageRequest.object_ids[0]].retrieveAndFilter…

Starter

Config

DA

DA

DA

DA

Folie 24Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

ERS2.GOME.L3.VCD.MONTHLYMEAN.O3 (95–05)(Example Data Set 1)

WDC-RSAT: World Data Center for Remote Sensing of the Atmosphere

File StructureBase Data: 1 file / monthFile format HDF4HTTP download

Retrieval and Processing using PyHDF librarycreate new HDF fileiterate over months covering requested time periodadjust data describing attributes

Folie 25Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

PyHDF: Copying Parts of a Table to Another Table

HDF Scientific Data Set

Each variable is in a 2D table (longitude, latitude)

Result file contains 3D tables (date, longitude, latitude)

NumPy (Numerical Python) integration enables easy table operations

Select a 2D part of a table

Insert it into a specific range in a 3D table

tgt_ds[time_idx, :lat_idx_len, :(lon_len-lon_idx_min)] = src_ds[lat_idx_min:(lat_idx_max+1), lon_idx_min:]

tgt_ds[time_idx, :lat_idx_len, (lon_idx_len-lon_idx_max-1):] = src_ds[lat_idx_min:(lat_idx_max+1), :(lon_idx_max+1)]

Folie 26Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

Chemical Weather Forecast Demo Data Set (2005)

Institute for Atmosphere Physics

File Structure

1 file / day with 8 time steps / day,

file format NetCDF

local file system

Retrieval and Processing using external command line tool CDO

iterate over files corresponding to time period

adjust data describing attributes by analyzing the result file

Folie 27Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

Reading Structured Output of a Command Line Tool

Climate Data Operators (CDO)

supports a huge set of specific operations for climate data

command line tool

getRegionalBoundsFromNetCDF(filename) returns (xsize, ysize, …)sproc = subprocess.Popen(“cdo griddes %s” % filename, \

stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)

output, err = sproc.communicate()

retcode = sproc.wait()

ncdf_props = str(output).split()

xsize = int(ncdf_props[ncdf_props.index(“xsize”) + 2])

This part provided as re-usable function callTool(cmd_ln) returns str

Seems trivial (GOOD!) but is incredibly useful

Saved a lot of implementation time

Folie 28Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

Benefits of Using Python for Impementation of the Stager Lib

EasinessComparably easy to understand and modify by non computer scientistsIntuitive configuration using dictionariesAdd new data sets by Copy and Customize DataProcessors (documented DataProcessor template is provided)

Python: Powerful LanguageAccess to all constraints as Python types (float, str, dict, ...) Python standard libraries (datetime, timedelta, math, …)

ExtensibilityNo compilation and re-deployment necessaryEasy integration of scriptable tools and libraries Rich set of useful libraries available (HDF, Numeric, iso8601, PyParsing…)Integration of Java and C/C++ Libraries with JPype, ctypes, BOOST

Folie 29Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

Conclusion

C3-Grid is a grid infrastructure for collaboration of climate researchers

Uniform ISO-annotated Data Access

Tools and complex workflows in Portal

Python can ease the life of both middleware developers and service implementers

PyModESt makes becoming a data provider easy

Modular development

DP can exploit the strengths of Python

Just implement 4 documented hooks for each data set kind

Provide CF Name translations (dict in dict)

Folie 30Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

Summary

The C3-Grid is a collaboration infrastructure for the climate science community. Main aim is transparency of the infrastructure to users by abstraction of heterogeneous data resources for easy data discovery and access and to allow execution of basic manipulation and analysis tasks as well as complex distributed workflows.

PyModESt is a Python framework for the comfortable modularized implementation of staging scripts for the C3-Grid that frees meteorological data providers from doing the work of system admins and vice versa.

Folie 31Climate Data Staging in C3-Grid > Henning Bergmeyer > 20090701_EuroPython_ClimateDataStaging_bergmeyer > 2009-07-01

References

C3-Grid: http://www.c3grid.deDLR Simulation and Software Technology: http://dlr.de/scDLR Intitute of Atmospheric Physics: http://www.dlr.de/pa/en/World Data Center for Remote Sensing of the Atmosphere: http://wdc.dlr.de/

Check-out for slideshow of beautiful current satellite images and animations: http://wdc.dlr.de/show/show.php [Sorry, it’s PHP ;-)]

PyHDF: http://pysclint.sourceforge.net/pyhdf/Numpy: http://numpy.scipy.org/Climate Data Operators: http://www.mpimet.mpg.de/fileadmin/software/cdo/JPype (Dynamically use Java in Python): http://jpype.sourceforge.net/ctypes (Integrate C/C++ libraries in Python): http://docs.python.org/library/ctypes.html

These slides on slideshare (always up-to-date):http://tr.im/ep09_pymodest