Tim Pugh-SPEDDEXES 2014

How OPeNDAP has transformed the way we do science plus snapshots of recent developments and BoM’s operational systems built on this technology. Tim Pugh, SPEDDEXES workshop, 17-21 March 2014


Page 1: Tim Pugh-SPEDDEXES 2014

How OPeNDAP has transformed the way we do science plus snapshots of recent developments and BoM’s operational systems built on this technology

Tim Pugh
SPEDDEXES workshop
17-21 March 2014

Page 2: Tim Pugh-SPEDDEXES 2014

Evolution

• Traditionally…
– Scientific research is conducted in a quiet room in isolation utilising unique data, scripts, and code
– Scientific collaboration is conducted at conferences with file sharing by FTP or HTTP bulk download

• Today
– Scientific research is being driven to shared research services and supported infrastructure
  • To relieve the scientist of laborious developments
  • To manage more complex machinery
  • To improve scientific integrity and collaboration
  • To work within managed and supported infrastructure
– Science is moving from file sharing to data sharing collaboration

Page 3: Tim Pugh-SPEDDEXES 2014

CAWCR Research Data Server

• Location: http://opendap.bom.gov.au:8080/thredds
• Unidata THREDDS Data Server v4.2.8
• http://www.unidata.ucar.edu/projects/THREDDS/tech/TDS.html
• The THREDDS Data Server (TDS) is a Java servlet contained in a single WAR file, which allows very easy installation into the Tomcat web server.

Page 4: Tim Pugh-SPEDDEXES 2014

OPeNDAP Now Is:

• An acronym
– “Open-source Project for a Network Data Access Protocol”
– Often a synonym for “DAP”

• A not-for-profit corp. developing/supporting
– “DAPx” - a web-services protocol for data access
  • Deployed by hundreds of data providers internationally
  • Employed in many analysis packages (MATLAB, e.g.)
  • Designated a “Community Standard” by NASA
– Server & client implementations* of DAP

*Note: there are other implementations

Page 5: Tim Pugh-SPEDDEXES 2014

BROAD VISION

1. A world in which a single data access protocol is used for the exchange of data between network-based applications regardless of discipline.

2. A layer above TCP/IP providing for syntactic and semantic consistency not available in existing protocols such as FTP.

Page 6: Tim Pugh-SPEDDEXES 2014

Fundamental Objective of OPeNDAP

• The fundamental objective of OPeNDAP and OPeNDAP Inc. is to facilitate internet access to scientific data
• This is done by:
  • Providing a protocol (DAP) to access data over the internet,
  • Hiding the format (and organization) in which the data are stored from the user, and
  • Providing subsetting (and other) capabilities for the data at the server
• OPeNDAP is based on a multi-tier architecture
• OPeNDAP software is open source

Page 7: Tim Pugh-SPEDDEXES 2014

OPeNDAP Data-Type Philosophy

• The OPeNDAP data model has few data types
– simplified programming / lowered risk of errors
• They are intentionally discipline-neutral
– better trans-domain utility & programmer uptake
• They nonetheless fill discipline-specific needs
– netCDF-like (good in contexts where, e.g., data might represent functions with 4- or 5-D domains)
– sequences & selections match DBMS sensibilities

Page 8: Tim Pugh-SPEDDEXES 2014

TDS Server

• TDS is THREDDS Data Server
– THREDDS is Thematic Real-time Environmental Distributed Data Services
– Middleware to bridge the gap between data providers and data users
– THREDDS Data Server (TDS), a web server that provides catalog, metadata, and data access services for scientific datasets.
– The TDS is open source, 100% Java, and runs inside the open source Tomcat Servlet container.

• Unidata’s Common Data Model
– merges the OPeNDAP, netCDF, and HDF5 data models to create a common API for scientific data
– implemented by the NetCDF Java library
– reads netCDF, OPeNDAP, HDF5, HDF4, GRIB 1 & 2, BUFR, NEXRAD 2 & 3, GEMPAK, MCIDAS, GINI, among others
– A pluggable framework allows other developers to add readers for their own specialized formats.
– provides standard APIs for geo-referencing coordinate systems, and specialized queries for scientific feature types like Grid, Point, and Radial datasets

Page 9: Tim Pugh-SPEDDEXES 2014

Some of the Technology in the TDS

1. THREDDS Dataset Inventory Catalogs provide virtual directories of available data and associated metadata.

2. The NetCDF-Java/CDM library reads NetCDF, OPeNDAP, and HDF5 datasets, as well as other binary formats such as GRIB and NEXRAD, presenting essentially an (extended) netCDF view of the data.

3. TDS can use the NetCDF Markup Language (NcML) to modify and create virtual aggregations of datasets.

4. An integrated server provides OPeNDAP access, including data subsetting.

5. An integrated server provides bulk file access through the HTTP protocol.

6. An integrated server provides data access through the Open Geospatial Consortium (OGC) Web Coverage Service (WCS) protocol, for any "gridded" dataset whose coordinate system information is complete.

7. An integrated server provides data access through the Open Geospatial Consortium (OGC) Web Map Service (WMS) protocol, for any "gridded" dataset whose coordinate system information is complete.

8. The integrated ncISO server provides automated metadata analysis and ISO metadata generation.

Page 10: Tim Pugh-SPEDDEXES 2014

THREDDS Catalog

• The goal is…
– to simplify the discovery and use of scientific data and to allow scientific publications and educational materials to reference scientific data.
– The initial focus was to allow data users to find datasets that are pertinent to their specific education and research needs, access the data, and use them without necessarily downloading the entire file to their local system.
– Catalogs are the heart of the data access services and are the core THREDDS concept. Catalogs consist of XML documents that describe on-line datasets.
– Catalogs can contain arbitrary metadata; however, we also defined a standard set of metadata to bridge to discovery centers
  • CF (Climate & Forecast) and Unidata Data Discovery metadata
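
To illustrate how a client might use such a catalog, here is a minimal sketch that parses a small catalog document with Python's standard library and builds an OPeNDAP URL for each dataset. The catalog content, server name, and the function name `opendap_urls` are hypothetical; the namespace and the `service` / `dataset` layout are assumed to follow the THREDDS InvCatalog 1.0 schema.

```python
import xml.etree.ElementTree as ET

# Namespace of THREDDS InvCatalog 1.0 documents (assumed schema version).
CATALOG_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"

# Hypothetical catalog, hand-written for illustration only.
SAMPLE_CATALOG = """
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
         name="Example catalog">
  <service name="dap" serviceType="OPeNDAP" base="/thredds/dodsC/"/>
  <dataset name="Example SST collection">
    <dataset name="sst_a.nc" urlPath="example/sst_a.nc"/>
    <dataset name="sst_b.nc" urlPath="example/sst_b.nc"/>
  </dataset>
</catalog>
"""

def opendap_urls(catalog_xml, server):
    """Return {dataset name: OPeNDAP URL} for every dataset with a urlPath."""
    root = ET.fromstring(catalog_xml)

    # Find the base path of the first service declared as OPeNDAP.
    base = None
    for service in root.iter("{%s}service" % CATALOG_NS):
        if service.get("serviceType", "").lower() == "opendap":
            base = service.get("base")
            break
    if base is None:
        raise ValueError("catalog declares no OPeNDAP service")

    # Only leaf datasets carry a urlPath; container datasets are skipped.
    urls = {}
    for dataset in root.iter("{%s}dataset" % CATALOG_NS):
        path = dataset.get("urlPath")
        if path:
            urls[dataset.get("name")] = server.rstrip("/") + base + path
    return urls

urls = opendap_urls(SAMPLE_CATALOG, "http://example.org:8080")
```

Each resulting URL could then be handed to any DAP-enabled client.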

Page 11: Tim Pugh-SPEDDEXES 2014

Spectrum of Use Cases

Application Data Representation
• OGC data model: domain specific; geospatial, 1-D, 2-D
• DAP2 data model: domain neutral; n-D, time series
• **DAP4 data model: domain neutral; new data types and data structures; streaming, compressed, chunked
• Common Data Model (CDM): domain specific
• Future data model: domain neutral??

Application Types
• Programmatic / Language API: FORTRAN, C/C++, Java, Python, NetCDF, Java NetCDF
• Programmatic / Tools: NetCDF, NCO, PyDAP; Custom Tools: OPeNDAP crawler, ocean_prep
• Interactive Data Viewer: IDV, Panoply, IDL, MATLAB, iPython (matplotlib), NCL, web browser (metadata)
• Interactive Analysis: MATLAB, IDL, iPython, NCL; Custom Application: Inundation Modeller
• Web Application: Live Access Server; IMOS Data Portal (WMS); Custom Java Servlet

Programming
• DAP2 Legacy Code: existing tools
• DAP2 New Code: new tools
• **DAP4 programming: legacy code support
• **DAP4 programming: new data model and protocols; streaming support
• **DAP4 programming: asynchronous access modes, server-side processing

Data Access Protocol
• Metadata Request: das, dds, ddx
• ASCII/Binary Data Request: simple data representation
• DAP Binary Object Request
• NcML Data Request: aggregation, virtual data sets
• **DAP4: server-side operations, async access mode, new data model, posting

Syntax
• Return data set info: file.nc.dds - readable; file.nc.ddx - XML; file.nc.asc - ASCII data return
• Select variables: file.nc.dods?var1,var2,var3
• Subset arrays: file.dods?var1[0:1:10]
• Return file translations: file.nc.netcdf - NetCDF file
• Server-side operations: file.nc?GEOLOC(); async access mode??

Clients
• Programmatic Access: Tsunami inundation modeller, NetCDF, NCO, PyDAP, PyNetCDF, MATLAB, IDL, …
• Interactive Access: Web browser - Catalog; MATLAB, IDL, Python, Panoply, …
• Data Library & Catalog Service: metadata harvesting; directory listings; remote THREDDS services
• Web Service: Java servlet, Java applet; Geospatial Information Service; OPeNDAP data service
• Analysis Service: Live Access Server

Service Capabilities
• DAP2 response: metadata, dods, ASCII / Binary
• **DAP4 Response: async access mode, server-side, streaming, …
• NcML: Aggregation service; Virtual Data Set Service; Remote Data Access
• Metadata Conversion and RDF: metadata definitions, translations (-> ISO), semantics, ontology; CF->ISO, CF->WMS, CF->WCS
• Layered Services: Catalogue service; WMS, WCS services; Authentication; Conformance checks; CF metadata check; ISO metadata check

**DAP4 features listed are my estimation and not the official specification
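
The DAP2 request syntax listed under "Syntax" can be assembled programmatically. The following is a minimal sketch, assuming the usual DAP2 convention of a response suffix followed by an optional constraint expression with `[start:stride:stop]` index ranges. The helper name `dap2_url` and the example dataset URL are hypothetical.

```python
def dap2_url(dataset_url, response, variables=None):
    """Build a DAP2 request URL.

    response:  suffix such as "das", "dds", "ddx", "asc" or "dods".
    variables: optional dict mapping a variable name to a list of
               (start, stride, stop) tuples, one per dimension.
               An empty list requests the whole variable.
    """
    url = "%s.%s" % (dataset_url, response)
    if variables:
        projections = []
        for name, ranges in variables.items():
            # Each dimension contributes one bracketed index range.
            hyperslab = "".join("[%d:%d:%d]" % r for r in ranges)
            projections.append(name + hyperslab)
        url += "?" + ",".join(projections)
    return url

BASE = "http://example.org/thredds/dodsC/file.nc"  # hypothetical dataset URL

metadata_url = dap2_url(BASE, "dds")
select_url = dap2_url(BASE, "dods", {"var1": [], "var2": [], "var3": []})
subset_url = dap2_url(BASE, "asc", {"var1": [(0, 1, 10)]})
```

Some HTTP clients may require the square brackets to be percent-encoded before the request is sent.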

Page 12: Tim Pugh-SPEDDEXES 2014

Use Case limitations

• Time to access data is dependent on the following factors:

• Hardware and network performance

• Selection of variables and dimensions

• Number of data requests to be issued

−Latency inherent in the data request

• Number of concurrent accesses to the server
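
These factors can be combined into a rough back-of-envelope estimate. The sketch below assumes a deliberately simple model that is not taken from the slides: requests are issued one after another, each pays a fixed latency, and the payload moves at a fixed bandwidth. All numbers are illustrative.

```python
def estimate_access_time(n_requests, bytes_per_request, latency_s, bandwidth_bytes_per_s):
    """Estimate total seconds for sequential requests under a simple model."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    # Each request costs its latency plus the time to move its payload.
    per_request = latency_s + bytes_per_request / bandwidth_bytes_per_s
    return n_requests * per_request

# The same 40 MB fetched as one request or as 1000 small requests,
# with an assumed 0.2 s latency and 10 MB/s bandwidth.
one_big = estimate_access_time(1, 40e6, 0.2, 10e6)
many_small = estimate_access_time(1000, 40e3, 0.2, 10e6)
```

Under these assumptions the single request takes a few seconds while the thousand small requests take a few minutes, which is why the number of requests issued matters as much as the volume of data.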

Page 13: Tim Pugh-SPEDDEXES 2014

DAP-enabled client tools/applications

OPeNDAP Clients (partial list) http://opendap.org/whatClients

1. Web browser – returns ASCII data

2. Pydap – a pure Python implementation of the DAP2 protocol

3. NetCDF – a set of software libraries and self-describing, machine-independent data formats with interfaces to Python, FORTRAN, C/C++, and Java

4. NCO – comprises a dozen standalone, command-line programs that take netCDF files as input

5. MATLAB – a high-level technical computing language and interactive environment for algorithm development, data visualization, data analysis, and numerical computation

6. Panoply – a cross-platform application which plots geo-gridded arrays from netCDF, HDF and GRIB datasets.

Page 14: Tim Pugh-SPEDDEXES 2014

Developments by Bureau and CSIRO

• Development of web portals for data access services and information systems in climate and environment
– Seasonal Climate Outlook Rebuild (Roald de Wit)
– Natural Resource Management (NRM) Climate Change Portal (Tim Erwin)
– eReef’s Marine Quality Dashboard and data services (Jonathon Hodge)
– National Environmental Information Infrastructure (NEII) (Andrew Woolf)
– CAWCR research data services (Duan Beckett)

• Establish Climate Data Publishing services at NCI
– NCI, CSIRO, Bureau of Meteorology, CoE CSS
– Earth System Grid (ESG)
– Climate and Weather Science Laboratory (CWSLab)

Page 15: Tim Pugh-SPEDDEXES 2014

SCO-R Project overview

Page 16: Tim Pugh-SPEDDEXES 2014

Project overview

• More interactivity and functionality needed
• Demand for POAMA multi-week forecast products
• Long term view of seamless transition between forecasts
• Building upon experiences / technologies from other BoM projects (e.g. MetEye and PASAP/PACCSAP)

Page 17: Tim Pugh-SPEDDEXES 2014

SCO-R architecture

MapCache

BOM.Map / BOM.App

Custom WMS Service (Python)

Page 18: Tim Pugh-SPEDDEXES 2014

Climate Futures

Climate Futures approach to the provision of regional climate projection information

CMAR/CLIMATE ADAPTATION FLAGSHIP

Tim Erwin
Acknowledgements: Penny Whetton, Kevin Hennessy, John Clarke, David Kent
28 October 2013

Page 19: Tim Pugh-SPEDDEXES 2014

Climate Data

• Processed from climate model data (CMIP3 and CMIP5)
• NetCDF file format
• 10 variables (temperature, rainfall, humidity...)
• 20 year seasonal averages (2030, 2035, ..., 2090)
• Base period (1950 – 2005) stored as monthly time span

• Catalogued in THREDDS server
– Allows DAP access

• Django
– THREDDS catalogues are parsed and stored: model, variable, DAP URL, layer name, time span

Page 20: Tim Pugh-SPEDDEXES 2014

Architecture

Page 21: Tim Pugh-SPEDDEXES 2014

Architecture

THREDDS

ZOO

Page 22: Tim Pugh-SPEDDEXES 2014

ZOO-Project (WPS Server)

• Consists of: Kernel, API, Service
• Works with Apache through a CGI file and a conf file
• Supports several common programming languages: C/C++, Fortran, Python, PHP, Perl, Java, JavaScript
• Used to create area averages of gridded data using a non-rectangular mask
  • Predefined mask
  • Polygon (GML, KML, GEOM)
• Not limited to geographic operations
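
The area-average operation can be illustrated independently of ZOO. The following is a minimal pure-Python sketch of averaging a gridded field under a non-rectangular mask. It is not the ZOO service code, and the optional cosine-of-latitude weighting is an assumption about how cell areas might be approximated on a regular latitude/longitude grid.

```python
import math

def masked_area_average(values, mask, lats=None):
    """Average the cells of a 2-D grid selected by a mask.

    values, mask: 2-D lists indexed [row][col]; truthy mask cells are included.
    lats: optional list of row latitudes in degrees; when given, each row
          is weighted by cos(latitude) as a proxy for cell area.
    """
    total = 0.0
    weight_sum = 0.0
    for i, (row, mask_row) in enumerate(zip(values, mask)):
        weight = math.cos(math.radians(lats[i])) if lats is not None else 1.0
        for value, selected in zip(row, mask_row):
            if selected:
                total += weight * value
                weight_sum += weight
    if weight_sum == 0:
        raise ValueError("mask selects no cells")
    return total / weight_sum

# Illustrative 2 x 3 grid with an L-shaped (non-rectangular) mask.
grid = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]
region = [[1, 0, 0],
          [1, 1, 0]]

plain_mean = masked_area_average(grid, region)
weighted_mean = masked_area_average(grid, region, lats=[0.0, 60.0])
```

A polygon supplied as GML or KML would first have to be rasterised onto the data grid to produce such a mask; that step is outside this sketch.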

Page 23: Tim Pugh-SPEDDEXES 2014

OPeNDAP Technology Developments

• DAP4 protocol and data model implementation (OPULS)
– OPULS (an OPeNDAP-Unidata collaboration)
– DAP4 (to supersede DAP2)
– Experimental extensions (Async access, UGRID subsets)

• DAP2 & DAP4 JSON response type
– Improve JavaScript client utilisation of DAP services

• ncWMS integration and WMS extensions
– contour map types
– THREDDS and Hyrax integration of ncWMS

• Programmatic Data Access for secure services
– RDSI DaSh project to support programmatic data access
– Integration within reX Identity and Authorisation Management

Page 24: Tim Pugh-SPEDDEXES 2014

DAP4 Experiments

• DAP4 provides more complete support for functions, including metadata responses (DAP2 does not provide this; a gap in the DAP2 specification)
– Experiments with Unstructured Grid (irregular mesh) subsetting
– Binning: returns a distribution (as a raster of boolean values on a user-specified grid) of data values satisfying some criteria
– Masking: accepts a raster of zero/nonzero values as a query argument, perhaps as a geospatial selection criterion

• OPeNDAP is running several experimental mini-projects within this context:
– Asynchronous access, data streaming, cloud computing and an expanded, function-based, server-side processing system

Page 25: Tim Pugh-SPEDDEXES 2014

thank you – have a great experience

Tim F. Pugh

HPC and CWSLab Project Lead

Melbourne, Victoria, Australia

Email: [email protected]

Office: +61 3 9669 4345

Page 26: Tim Pugh-SPEDDEXES 2014

Workshop Use-Cases

Application Data Representation
• DAP2 data model: domain neutral; n-D, time series

Application Types
• Programmatic / Language API: FORTRAN, C/C++, Java, Python, NetCDF, Java NetCDF, PyDAP
• Programmatic / Tools: NetCDF, NCO, PyDAP; Custom Tools: OPeNDAP crawler
• Interactive Data Viewer: Panoply, MATLAB, NCL, web browser

Programming
• DAP2 Legacy Code: existing tools
• DAP2 New Code: new tools

Data Access Protocol
• Metadata Request: das, dds, ddx
• ASCII/Binary Data Request: simple data representation
• DAP Binary Object Request
• NcML Data Request: aggregation

Syntax
• Return metadata info: file.nc.das - readable; file.nc.dds - readable; file.nc.ddx - XML metadata; file.nc.help - help info
• Select vars and return data: file.nc.asc?var1,var2,var3; file.nc.dods?var1,var2,var3
• Subset arrays, return data: file.asc?var1[0:1:10]; file.dods?var1[0:1:10]
• Return file translations: file.nc.netcdf - NetCDF file
• Server-side operations: file.nc?GEOLOC()

Clients
• Programmatic Access: NetCDF, NCO, PyDAP, PyNetCDF
• Interactive Access: Web browser - Catalog; Python, MATLAB, Panoply

Service Capabilities
• DAP2 response: THREDDS data service; Hyrax data service
• NcML: Aggregation service
• Layered Services: Catalog service; WMS

Page 27: Tim Pugh-SPEDDEXES 2014

Pydap client

>>> from pydap.client import open_url
>>> dataset = open_url('http://test.opendap.org/dap/data/nc/coads_climatology.nc')
>>> var = dataset['SST']
>>> var.shape
(12, 90, 180)
>>> var.type
<class 'pydap.model.Float32'>
>>> print(var[0,10:14,10:14])  # this will download data from the server
<class 'pydap.model.GridType'>
with data
[[ -1.26285708e+00 -9.99999979e+33 -9.99999979e+33 -9.99999979e+33]
 [ -7.69166648e-01 -7.79999971e-01 -6.75454497e-01 -5.95714271e-01]
 [ 1.28333330e-01 -5.00000156e-02 -6.36363626e-02 -1.41666666e-01]
 [ 6.38000011e-01 8.95384610e-01 7.21666634e-01 8.10000002e-01]]
and axes
366.0
[-69. -67. -65. -63.]
[ 41. 43. 45. 47.]

Page 28: Tim Pugh-SPEDDEXES 2014

NetCDF client

>>> import netCDF4
>>> url = 'http://test.opendap.org/dap/data/nc/coads_climatology.nc'
>>> dataset = netCDF4.Dataset(url)
>>> var = dataset.variables['SST']
>>> var.shape
(12, 90, 180)
>>> print(var[0,10:14,10:14])  # this will download data from the server
[[-1.26285707951 -- -- --]
 [-0.769166648388 -0.77999997139 -0.675454497337 -0.595714271069]
 [0.128333330154 -0.0500000156462 -0.0636363625526 -0.141666665673]
 [0.638000011444 0.895384609699 0.721666634083 0.810000002384]]
>>> print(var)
<type 'netCDF4.Variable'>
float32 SST('TIME', 'COADSY', 'COADSX')
…

Page 29: Tim Pugh-SPEDDEXES 2014

MATLAB and SNCtools

% ex_snctools_opendap.m
% Read from a remote OPeNDAP server with the same file
%
ncRef = 'http://opendap.bom.gov.au:8080/thredds/dodsC/gamssa_4deg/2011/20111106-ABOM-L4LRfnd-GLOB-v01-fv01.nc';

% Print the data set metadata, then wait for a key press
nc_dump( ncRef );
pause

% Fetch the SST field and its coordinate variables from the server
temp = nc_varget( ncRef, 'analysed_sst');
lon = nc_varget( ncRef, 'lon');
lat = nc_varget( ncRef, 'lat');

% imagesc expects the x axis (lon) first, then the y axis (lat)
imagesc(lon, lat, temp); axis xy

Page 30: Tim Pugh-SPEDDEXES 2014

MATLAB and NJTbx demo

% ex_njtbx.m
% Read from a remote OPeNDAP server with the same file
%
ncRef = 'http://opendap.bom.gov.au:8080/thredds/dodsC/gamssa_4deg/2011/20111106-ABOM-L4LRfnd-GLOB-v01-fv01.nc';

nj_info( ncRef )
pause

[temp, grid] = nj_grid_varget(ncRef,'analysed_sst');

imagesc(grid.lon, grid.lat, temp); axis xy; colorbar

Page 31: Tim Pugh-SPEDDEXES 2014

THREDDS services syntax

http://{server:port}/{contextPath}/{service}/...

{contextPath} = “thredds” (servlet default name)
{service} = “fileServer” | “dodsC” | “wms” | “wcs”

• Bulk File Transfer
  fileServer = HTTP Server (any file)
• Remote access, subsetting CDM files
  dodsC = OPeNDAP (any CDM file)
  wms = Web Map Service (grids)
  wcs = Web Coverage Service (grids)
  ncss = NetCDF Subset Service (grids)
  admin = Administration/debug interface

Note: each server can change the service name in the XML catalogue.

[Diagram labels: Tomcat/[Apache]; thredds; Catalogs; fileServer; dodsC; wms; wcs; ncss]

Page 32: Tim Pugh-SPEDDEXES 2014

Hyrax service syntax

http://{server:port}/{contextPath}/{service}/…
e.g. http://test.opendap.org/opendap/hyrax/...

{contextPath} = “opendap” (servlet default name)
{service} = “hyrax” | “admin” | “docs”
  hyrax = catalog interface
  admin = administration interface (v1.8+)
  docs = documentation (v1.8+)

Note: each server can change the service name within the server configuration file.

[Diagram labels: Tomcat/[Apache]; opendap; hyrax; admin; docs]

Page 33: Tim Pugh-SPEDDEXES 2014

Hyrax Data Service

• DAP2 and DAP3.x as the protocol develops
• Other dataset responses*
  • ASCII & NetCDF renderings of data (not limited to data natively stored in netCDF)
  • RDF
  • ISO 19115 and the conformance rubric (Hyrax 1.8)
• Other server responses**
  • THREDDS catalogs

[Diagram labels: Tomcat/[Apache]; Hyrax; DAP2; DAP3.x; RDF*; Catalogs**]

Note: Hyrax and TDS are not mutually exclusive; sites can install both with little extra effort.

Page 34: Tim Pugh-SPEDDEXES 2014

Data Discovery and Access

• Data discovery services
  • NASA’s Global Change Master Directory
    − http://gcmd.nasa.gov
  • IMOS eMII portal
    − http://imosmest.aodn.org.au/geonetwork/srv/en/main.home
    − Help --> http://emii1.its.utas.edu.au/drupal/?q=node/25
  • TERN AusCover portal
    − http://data.auscover.org.au/
  • My Ocean portal
    − http://www.myocean.eu/web/24-catalogue.php
  • TPAC Digital Library
    − http://dl.tpac.org.au

• Data access services
  • Unidata’s THREDDS Data Service
    − http://www.unidata.ucar.edu/projects/THREDDS/
  • OPeNDAP’s Hyrax Data Service
    − http://opendap.org/download/hyrax.html
  • NOAA’s ERDDAP Data Service
    − http://coastwatch.pfeg.noaa.gov/erddap

Page 35: Tim Pugh-SPEDDEXES 2014

Some of the Technology in Hyrax

1. THREDDS Dataset Inventory Catalogs provide virtual directories of available data and associated metadata.

2. Supports many formats and data stores: netCDF3, netCDF4, HDF4, HDF5, FreeForm, SQL databases

3. Uses a plug-in based architecture and includes tools to write custom handlers

4. NetCDF Markup Language (NcML) to modify and create virtual aggregations of datasets.

5. OPeNDAP access with subsetting data access method.

6. bulk file access through the HTTP protocol.

7. ncISO server provides automated metadata analysis and ISO metadata generation.

8. RDF output - Metadata as triples; used with web-based reasoning systems

9. Code that has passed a formal security audit

10. A true multi-system architecture that can fit in a variety of enterprise settings

11. An administrator’s interface

Page 36: Tim Pugh-SPEDDEXES 2014

DAP Responses

• DAP2 defines three response types:

• DAS: A text document that contains data set attributes

• DDS: A text document that contains data set variable types and names

• DODS: A quasi-multipart MIME document that contains the DDS and associated binary values for a data request

• DAP3.x defines two additional response types:

• DDX: An XML document that combines both variable type and name information along with attributes

• DataDDX: A multipart MIME document that combines a DDX with the associated binary values for a data request

TDS and Hyrax both support DAP2; Hyrax includes support for DAP3, TDS has support for the DDX
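
To make the DDS response concrete, here is a minimal sketch that parses a small DDS-style document and extracts each array variable's type, name, and dimension sizes. The sample text is hand-written for illustration rather than captured from a server, the function name is hypothetical, and the regular expressions only handle simple array declarations, not Grids, Structures, or Sequences.

```python
import re

# Hypothetical DDS text, written by hand for illustration only.
SAMPLE_DDS = """Dataset {
    Float32 lat[lat = 720];
    Float32 lon[lon = 1440];
    Int16 analysed_sst[time = 1][lat = 720][lon = 1440];
} example.nc;
"""

# One array declaration: "<type> <name>[<dim> = <size>]...;"
DECLARATION = re.compile(r"^\s*(\w+)\s+(\w+)((?:\[\s*\w+\s*=\s*\d+\s*\])+)\s*;", re.M)
DIMENSION = re.compile(r"\[\s*(\w+)\s*=\s*(\d+)\s*\]")

def parse_dds_arrays(dds_text):
    """Return {variable name: {"type": ..., "dims": [(dim name, size), ...]}}."""
    variables = {}
    for dtype, name, dims in DECLARATION.findall(dds_text):
        variables[name] = {
            "type": dtype,
            "dims": [(dim, int(size)) for dim, size in DIMENSION.findall(dims)],
        }
    return variables

arrays = parse_dds_arrays(SAMPLE_DDS)
```

A client could use the dimension sizes found this way to choose valid index ranges before issuing a data request.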

Page 37: Tim Pugh-SPEDDEXES 2014

Some Definitions

DAP = Data Access Protocol
– Model used to describe the data; request syntax and semantics; and response syntax and semantics.
– The data structure returned to the user

OPeNDAP
– The software that forms the service; numerous implementations (Hyrax (reference), THREDDS, …); core libraries for client applications and services.

THREDDS / Hyrax
– A service framework (portal) that contains the OPeNDAP service

Page 38: Tim Pugh-SPEDDEXES 2014

Decipher the URL

• http://opendap.bom.gov.au:8080/thredds/dodsC/gamssa_4deg/2011/20111106-ABOM-L4LRfnd-GLOB-v01-fv01.nc.ascii?lon[0:1:1439]

• Given the OPeNDAP data request above, decipher the URL.
− Request protocol? http
− Host name:port? //opendap.bom.gov.au:8080/
− ContextPath? thredds/
− Service? dodsC/
− Unique path to data set? gamssa_4deg/2011/
− Data reference? 20111106-ABOM-L4LRfnd-GLOB-v01-fv01.nc
− Return type? ascii
− Return variables? ?lon
− Return variable index range? [0:1:1439] --> [start:stride:stop]
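
The same decomposition can be done in code. The sketch below splits the request URL into its parts using only the Python standard library. It assumes the `{contextPath}/{service}/...` path layout used by THREDDS and a constraint that names a single variable with bracketed index ranges; the function name `decipher_dap_url` is hypothetical.

```python
import re
from urllib.parse import urlsplit

def decipher_dap_url(url):
    """Split a THREDDS-style OPeNDAP request URL into named parts."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")

    # The last path segment is "<data reference>.<return type>".
    data_reference, _, return_type = segments[-1].rpartition(".")

    # The query is the constraint expression: "<variable>[start:stride:stop]...".
    match = re.match(r"(\w+)((?:\[[^\]]*\])*)$", parts.query)
    variable = match.group(1) if match else None
    ranges = re.findall(r"\[([^\]]*)\]", match.group(2)) if match else []

    return {
        "protocol": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "context_path": segments[0],
        "service": segments[1],
        "dataset_path": "/".join(segments[2:-1]),
        "data_reference": data_reference,
        "return_type": return_type,
        "variable": variable,
        "index_ranges": [tuple(int(n) for n in r.split(":")) for r in ranges],
    }

example = decipher_dap_url(
    "http://opendap.bom.gov.au:8080/thredds/dodsC/gamssa_4deg/2011/"
    "20111106-ABOM-L4LRfnd-GLOB-v01-fv01.nc.ascii?lon[0:1:1439]"
)
```

Requests that select several variables or call server-side functions would need a fuller constraint-expression parser than this single regular expression.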

Page 39: Tim Pugh-SPEDDEXES 2014

NcML: NetCDF Markup Language

NcML can provide two basic features:
• Augmenting/modifying data sets with new
  • Attributes
  • Values
• Combining two or more data sets (i.e., files) in an aggregation

Three kinds of aggregation are supported:
• Tile files
• Join files along an existing axis
• Join files along a new axis

While very powerful, these aggregations are not applicable to every data set made up of multiple files
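
As a concrete illustration of joining files along an existing axis, the sketch below generates a small NcML aggregation document with Python's standard library. The file names, the `time` dimension, and the function name are hypothetical; the element and attribute names (`aggregation`, `type="joinExisting"`, `dimName`, `location`) are assumed to follow the NcML 2.2 schema.

```python
import xml.etree.ElementTree as ET

# Namespace of NcML 2.2 documents (assumed schema version).
NCML_NS = "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"

def join_existing_ncml(locations, dim_name):
    """Return NcML text that joins the given files along an existing dimension."""
    # Serialise with NcML as the default namespace (no prefix on element names).
    ET.register_namespace("", NCML_NS)
    root = ET.Element("{%s}netcdf" % NCML_NS)
    aggregation = ET.SubElement(
        root,
        "{%s}aggregation" % NCML_NS,
        {"dimName": dim_name, "type": "joinExisting"},
    )
    # One nested <netcdf> element per member file, in join order.
    for location in locations:
        ET.SubElement(aggregation, "{%s}netcdf" % NCML_NS, {"location": location})
    return ET.tostring(root, encoding="unicode")

# Hypothetical member files.
ncml_text = join_existing_ncml(["sst_2011.nc", "sst_2012.nc"], "time")
```

Served through TDS or Hyrax, a document like this would present the member files to clients as one virtual data set.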

Page 40: Tim Pugh-SPEDDEXES 2014

DAP4 Summary

• DAP (DAP2 and DAP4) is based on datasets built of variables that share the characteristics of programming languages

• Constraints are used to subset data on the server
• DAP4 is a REST API
• DAP4 specifies ‘modern’ web services
– While DAP2 was a data model only, DAP4 includes specification of the web services

• DAP4 provides more complete support for functions