Tim Pugh-SPEDDEXES 2014
How OPeNDAP has transformed the way we do science plus snapshots of recent developments and BoM’s operational systems built on this technology
Tim Pugh
SPEDDEXES workshop
17-21 March 2014
Evolution
• Traditionally…
– Scientific research is conducted in a quiet room in isolation, utilising unique data, scripts, and code
– Scientific collaboration is conducted at conferences, with file sharing by FTP or HTTP bulk download
• Today
– Scientific research is being driven to shared research services and supported infrastructure
  • To relieve the scientist of laborious developments
  • To manage more complex machinery
  • To improve scientific integrity and collaboration
  • To work within managed and supported infrastructure
– Science is moving from file sharing to data sharing collaboration
CAWCR Research Data Server
• Location: http://opendap.bom.gov.au:8080/thredds
• Unidata THREDDS Data Server v4.2.8
• http://www.unidata.ucar.edu/projects/THREDDS/tech/TDS.html
• The THREDDS Data Server (TDS) is a Java servlet contained in a single WAR file, which allows very easy installation into the Tomcat web server.
OPeNDAP Now Is:
• An acronym
– “Open-source Project for a Network Data Access Protocol”
– Often a synonym for “DAP”
• A not-for-profit corp. developing/supporting
– “DAPx” - a web-services protocol for data access
  • Deployed by hundreds of data providers internationally
  • Employed in many analysis packages (e.g., MATLAB)
  • Designated a “Community Standard” by NASA
– Server & client implementations* of DAP
*Note: there are other implementations
BROAD VISION
1. A world in which a single data access protocol is used for the exchange of data between network-based applications regardless of discipline.
2. A layer above TCP/IP providing for syntactic and semantic consistency not available in existing protocols such as FTP.
Fundamental Objective of OPeNDAP
• The fundamental objective of OPeNDAP and OPeNDAP Inc. is to facilitate internet access to scientific data
• This is done by:
  • Providing a protocol (DAP) to access data over the internet,
  • Hiding the format (and organization) in which the data are stored from the user, and
  • Providing subsetting (and other) capabilities for the data at the server
• OPeNDAP is based on a multi-tier architecture
• OPeNDAP software is open source
OPeNDAP Data-Type Philosophy
• The OPeNDAP data model has few data types
– simplified programming / lowered risk of errors
• They are intentionally discipline-neutral
– better trans-domain utility & programmer uptake
• They nonetheless fill discipline-specific needs
– netCDF-like (good in contexts where, e.g., data might represent functions with 4- or 5-D domains)
– sequences & selections match DBMS sensibilities
TDS Server
• TDS is THREDDS Data Server
– THREDDS is Thematic Real-time Environmental Distributed Data Services
– Middleware to bridge the gap between data providers and data users
– THREDDS Data Server (TDS), a web server that provides catalog, metadata, and data access services for scientific datasets.
– The TDS is open source, 100% Java, and runs inside the open source Tomcat Servlet container.
• Unidata’s Common Data Model
– merges the OPeNDAP, netCDF, and HDF5 data models to create a common API for scientific data
– implemented by the NetCDF Java library
– reads netCDF, OPeNDAP, HDF5, HDF4, GRIB 1 & 2, BUFR, NEXRAD 2 & 3, GEMPAK, MCIDAS, GINI, among others
– A pluggable framework allows other developers to add readers for their own specialized formats.
– provides standard APIs for geo-referencing coordinate systems, and specialized queries for scientific feature types like Grid, Point, and Radial datasets
Some of the Technology in the TDS
1. THREDDS Dataset Inventory Catalogs provide virtual directories of available data and
associated metadata.
2. The NetCDF-Java/CDM library reads NetCDF, OPeNDAP, and HDF5 datasets, as well as other
binary formats such as GRIB and NEXRAD, presenting essentially an (extended) netCDF view of the data.
3. TDS can use the NetCDF Markup Language (NcML) to modify and create virtual aggregations
of datasets.
4. An integrated server provides OPeNDAP access, a data access method that supports subsetting.
5. An integrated server provides bulk file access through the HTTP protocol.
6. An integrated server provides data access through the Open Geospatial Consortium (OGC) Web
Coverage Service (WCS) protocol, for any "gridded" dataset whose coordinate system
information is complete.
7. An integrated server provides data access through the Open Geospatial Consortium (OGC) Web Map
Service (WMS) protocol, for any "gridded" dataset whose coordinate system information is
complete.
8. The integrated ncISO server provides automated metadata analysis and ISO metadata
generation.
THREDDS Catalog
• The goal is…
– to simplify the discovery and use of scientific data, and to allow scientific publications and educational materials to reference scientific data.
– The initial focus was to allow data users to find datasets that are pertinent to their specific education and research needs, access the data, and use them without necessarily downloading the entire file to their local system.
– Catalogs are the heart of the data access services and the central THREDDS concept. Catalogs consist of XML documents that describe on-line datasets.
– Catalogs can contain arbitrary metadata; however, we also defined a standard set of metadata to bridge to discovery centers
• CF (Climate & Forecast) and Unidata Data Discovery metadata
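As a rough illustration of the catalog concept, the sketch below parses a small, made-up catalog document with Python's standard library and lists each dataset's name and access path. The sample XML, the dataset names, and the function name list_datasets are assumptions for illustration only; real catalogs carry far more metadata.

```python
import xml.etree.ElementTree as ET

# Namespace used by THREDDS InvCatalog 1.0 documents.
THREDDS_NS = "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"

# Hypothetical catalog: names and paths are invented for this example.
SAMPLE_CATALOG = """
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"
         name="Example catalog">
  <service name="dap" serviceType="OPENDAP" base="/thredds/dodsC/"/>
  <dataset name="Example SST collection">
    <dataset name="sst_20111106.nc" urlPath="example/2011/sst_20111106.nc"/>
    <dataset name="sst_20111107.nc" urlPath="example/2011/sst_20111107.nc"/>
  </dataset>
</catalog>
"""

def list_datasets(catalog_xml):
    """Return (name, urlPath) for every dataset element that has an access path."""
    root = ET.fromstring(catalog_xml.strip())
    found = []
    for node in root.iter("{%s}dataset" % THREDDS_NS):
        url_path = node.get("urlPath")
        if url_path is not None:  # container datasets have no urlPath
            found.append((node.get("name"), url_path))
    return found

datasets = list_datasets(SAMPLE_CATALOG)
```

A client would normally join each urlPath with the base of the chosen service to form the access URL.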
Spectrum of Use Cases
Application Data Representation
• OGC data model: domain specific; geospatial, 1-D, 2-D
• DAP2 data model: domain neutral; n-D, time series
• **DAP4 data model: domain neutral; new data types and data structures; streaming, compressed, chunked
• Common Data Model (CDM): domain specific
• Future data model: domain neutral??
Application Types
• Programmatic / Language API: FORTRAN, C/C++, Java, Python, NetCDF, Java NetCDF
• Programmatic / Tools: NetCDF, NCO, PyDAP; custom tools: OPeNDAP crawler, ocean_prep
• Interactive Data Viewer: IDV, Panoply, IDL, MATLAB, IPython (matplotlib), NCL, web browser (metadata)
• Interactive Analysis: MATLAB, IDL, IPython, NCL; custom application: Inundation Modeller
• Web Application: Live Access Server, IMOS Data Portal (WMS), custom Java servlet
Programming
• DAP2 legacy code: existing tools
• DAP2 new code: new tools
• **DAP4 programming: legacy code support
• **DAP4 programming: new data model and protocols, streaming support
• **DAP4 programming: asynchronous access modes, server-side processing
Data Access Protocol
• Metadata Request: das, dds, ddx
• ASCII/Binary Data Request: simple data representation
• DAP Binary Object Request
• NcML Data Request: aggregation, virtual data sets
• **DAP4: server-side operations, async access mode, new data model, posting
Syntax
• Return data set info: file.nc.dds (readable), file.nc.ddx (XML), file.nc.asc (ASCII data return)
• Select variables: file.nc.dods?var1,var2,var3
• Subset arrays: file.dods?var1[0:1:10]
• Return file translations: file.nc.netcdf (NetCDF file)
• Server-side operations: file.nc?GEOLOC()
• Async access mode??
Clients
• Programmatic Access: Tsunami inundation modeller, NetCDF, NCO, PyDAP, PyNetCDF, MATLAB, IDL, …
• Interactive Access: web browser (catalog), MATLAB, IDL, Python, Panoply, …
• Data Library & Catalog Service: metadata harvesting, directory listings, remote THREDDS services
• Web Service: Java servlet, Java applet, Geospatial Information Service, OPeNDAP data service
• Analysis Service: Live Access Server
Service Capabilities
• DAP2 response: metadata, dods, ASCII / Binary
• **DAP4 response: async access mode, server-side, streaming
• NcML: aggregation service, virtual data set service, remote data access
• Metadata Conversion and RDF: metadata definitions, translations (-> ISO), semantics, ontology; CF->ISO, CF->WMS, CF->WCS
• Layered Services: catalogue service, WMS and WCS services, authentication, conformance checks, CF metadata check, ISO metadata check
**The DAP4 features listed are my own estimate, not the official specification
Use Case limitations
• Time to access data is dependent on the following factors:
• Hardware and network performance
• Selection of variables and dimensions
• Number of data requests to be issued
−Latency inherent in the data request
• Number of concurrent accesses to the server
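To make the dependence on request count concrete, here is a back-of-envelope model. It assumes each request pays a fixed round-trip latency and that data moves at a constant bandwidth; the function name and all numbers are illustrative assumptions, not measurements of any server.

```python
def estimated_access_time(n_requests, latency_s, total_bytes, bandwidth_bytes_per_s):
    """Simple model: per-request latency plus transfer time for all bytes."""
    return n_requests * latency_s + total_bytes / bandwidth_bytes_per_s

# Same 10 MB of data, fetched as 1 large request or as 1000 small ones,
# over an assumed 10 MB/s link with 50 ms latency per request.
one_request = estimated_access_time(1, 0.05, 10_000_000, 10_000_000)
many_requests = estimated_access_time(1000, 0.05, 10_000_000, 10_000_000)
```

Under these assumptions the single request takes about 1 second and the thousand small requests about 51 seconds, which is why issuing fewer, larger subset requests usually pays off.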
DAP-enabled client tools/applications
OPeNDAP Clients (partial list) http://opendap.org/whatClients
1. Web browser – returns ASCII data
2. Pydap – a pure Python implementation of DAP2
3. NetCDF – a set of software libraries and self-describing, machine-independent data formats, with interfaces for Python, FORTRAN, C/C++, and Java
4. NCO – a dozen standalone, command-line programs that take netCDF files as input
5. MATLAB – a high-level technical computing language and interactive environment for algorithm development, data visualization, data analysis, and numerical computation
6. Panoply – a cross-platform application that plots geo-gridded arrays from netCDF, HDF, and GRIB datasets
Developments by Bureau and CSIRO
• Development of web portals for data access services and information systems in climate and environment
– Seasonal Climate Outlook Rebuild (Roald de Wit)
– Natural Resource Management (NRM) Climate Change Portal (Tim Erwin)
– eReef’s Marine Quality Dashboard and data services (Jonathon Hodge)
– National Environmental Information Infrastructure (NEII) (Andrew Woolf)
– CAWCR research data services (Duan Beckett)
• Establish Climate Data Publishing services at NCI
– NCI, CSIRO, Bureau of Meteorology, CoE CSS
– Earth System Grid (ESG)
– Climate and Weather Science Laboratory (CWSLab)
SCO-R Project overview
• More interactivity and functionality needed
• Demand for POAMA multi-week forecast products
• Long term view of seamless transition between forecasts
• Building upon experiences / technologies from other BoM projects (e.g. MetEye and PASAP/PACCSAP)
SCO-R architecture
MapCache
BOM.Map / BOM.App
Custom WMS Service (Python)
Climate Futures
Climate Futures approach to the provision of regional climate projection information
CMAR/CLIMATE ADAPTATION FLAGSHIP
Tim Erwin
Acknowledgements: Penny Whetton, Kevin Hennessy, John Clarke, David Kent
28 October 2013
Climate Data
• Processed from climate model data (CMIP3 and CMIP5)
• NetCDF file format
• 10 variables (temperature, rainfall, humidity...)
• 20-year seasonal averages (2030, 2035, ..., 2090)
• Base period (1950 – 2005) stored as monthly time span
• Catalogued in THREDDS server
– Allows DAP access
• Django
• THREDDS catalogues are parsed and stored
– model, variable, DAP URL, layer name, time span
Architecture
THREDDS
ZOO
ZOO-Project (WPS Server)
• Consists of: Kernel, API, Service
• Works with Apache through a CGI file and a conf file
• Supports several common programming languages: C/C++, Fortran, Python, PHP, Perl, Java, JavaScript
• Used to create area averages of gridded data using a non-rectangular mask
– Predefined mask
– Polygon (GML, KML, GEOM)
• Not limited to geographic operations
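The area-average operation mentioned above can be sketched in plain Python. This is not the ZOO-Project API; it only illustrates the computation such a service might perform, and the function name, the optional weights argument, and the sample grid are assumptions made for this example.

```python
def masked_area_average(values, mask, weights=None):
    """Average the grid cells where mask is non-zero.

    values, mask and weights are 2-D lists of identical shape; weights
    (for example cell areas) default to 1 for every cell.
    """
    total = 0.0
    weight_sum = 0.0
    for i, row in enumerate(values):
        for j, value in enumerate(row):
            if mask[i][j]:
                w = 1.0 if weights is None else weights[i][j]
                total += w * value
                weight_sum += w
    if weight_sum == 0.0:
        raise ValueError("mask selects no cells")
    return total / weight_sum

grid = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0]]
# Non-rectangular selection: an L-shaped region of three cells.
l_mask = [[1, 0, 0],
          [1, 1, 0]]
average = masked_area_average(grid, l_mask)
```

On a latitude-longitude grid, passing cell areas as weights would keep high-latitude cells from being over-represented.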
OPeNDAP Technology Developments
• DAP4 protocol and data model implementation (OPULS)
– OPULS (an OPeNDAP-Unidata collaboration)
– DAP4 (to supersede DAP2)
– Experimental extensions (async access, UGRID subsets)
• DAP2 & DAP4 JSON response type
– Improve JavaScript client utilisation of DAP services
• ncWMS integration and WMS extensions
– contour map types
– THREDDS and Hyrax integration of ncWMS
• Programmatic Data Access for secure services
– RDSI DaSh project to support programmatic data access
– Integration within reX Identity and Authorisation Management
DAP4 Experiments
• DAP4 provides more complete support for functions, including metadata responses (DAP2 does not provide this; it is a gap in the DAP2 specification)
– Experiments with Unstructured Grid (irregular mesh) subsetting
– Binning: returns a distribution (as a raster of boolean values on a user-specified grid) of data values satisfying some criteria
– Masking: accepts a raster of zero/nonzero values as a query argument, perhaps as a geospatial selection criterion
• OPeNDAP is running several experimental mini-projects within this context:
– Asynchronous access, data streaming, cloud computing, and an expanded, function-based, server-side processing system
thank you – have a great experience
Tim F. Pugh
HPC and CWSLab Project Lead
Melbourne, Victoria, Australia
Email: [email protected]
Office: +61 3 9669 4345
Workshop Use-Cases
Application Data Representation
• DAP2 data model: domain neutral; n-D, time series
Application Types
• Programmatic / Language API: FORTRAN, C/C++, Java, Python, NetCDF, Java NetCDF, PyDAP
• Programmatic / Tools: NetCDF, NCO, PyDAP; custom tools: OPeNDAP crawler
• Interactive Data Viewer: Panoply, MATLAB, NCL, web browser
Programming
• DAP2 legacy code: existing tools
• DAP2 new code: new tools
Data Access Protocol
• Metadata Request: das, dds, ddx
• ASCII/Binary Data Request: simple data representation
• DAP Binary Object Request
• NcML Data Request: aggregation
Syntax
• Return metadata info: file.nc.das (readable), file.nc.dds (readable), file.nc.ddx (XML metadata), file.nc.help (help info)
• Select variables and return data: file.nc.asc?var1,var2,var3 or file.nc.dods?var1,var2,var3
• Subset arrays and return data: file.asc?var1[0:1:10] or file.dods?var1[0:1:10]
• Return file translations: file.nc.netcdf (NetCDF file)
• Server-side operations: file.nc?GEOLOC()
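The request patterns in this syntax list can be assembled programmatically. The sketch below is a minimal helper, assuming the usual DAP2 convention of a response suffix followed by an optional constraint expression in which array subsets are written in square brackets, as in lon[0:1:1439]. The function name dap2_url and the example host are hypothetical.

```python
def dap2_url(dataset_url, response, projections=None):
    """Build a DAP2 request URL.

    response    -- suffix such as "dds", "das", "asc" or "dods"
    projections -- list of (variable, slices); each slice is a
                   (start, stride, stop) tuple and slices may be empty
    """
    url = "%s.%s" % (dataset_url, response)
    if projections:
        parts = []
        for name, slices in projections:
            hyperslab = "".join("[%d:%d:%d]" % s for s in slices)
            parts.append(name + hyperslab)
        url += "?" + ",".join(parts)
    return url

# Hypothetical dataset location used only for illustration.
base = "http://example.org/thredds/dodsC/file.nc"
metadata_url = dap2_url(base, "dds")
subset_url = dap2_url(base, "dods", [("var1", [(0, 1, 10)]), ("var2", [])])
```

Keeping the variable list and index ranges as data rather than hand-edited strings makes it easier to issue one well-sized request instead of many small ones.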
Clients
• Programmatic Access: NetCDF, NCO, PyDAP, PyNetCDF
• Interactive Access: web browser (catalog), Python, MATLAB, Panoply
Service Capabilities
• DAP2 response: THREDDS data service, Hyrax data service
• NcML: aggregation service
• Layered Services: catalog service, WMS
Pydap client
>>> from pydap.client import open_url
>>> dataset = open_url('http://test.opendap.org/dap/data/nc/coads_climatology.nc')
>>> var = dataset['SST']
>>> var.shape
(12, 90, 180)
>>> var.type
<class 'pydap.model.Float32'>
>>> print(var[0,10:14,10:14])  # this will download data from the server
<class 'pydap.model.GridType'>
with data
[[ -1.26285708e+00 -9.99999979e+33 -9.99999979e+33 -9.99999979e+33]
 [ -7.69166648e-01 -7.79999971e-01 -6.75454497e-01 -5.95714271e-01]
 [ 1.28333330e-01 -5.00000156e-02 -6.36363626e-02 -1.41666666e-01]
 [ 6.38000011e-01 8.95384610e-01 7.21666634e-01 8.10000002e-01]]
and axes
366.0
[-69. -67. -65. -63.]
[ 41. 43. 45. 47.]
NetCDF client
>>> import netCDF4
>>> url = 'http://test.opendap.org/dap/data/nc/coads_climatology.nc'
>>> dataset = netCDF4.Dataset(url)
>>> var = dataset.variables['SST']
>>> var.shape
(12, 90, 180)
>>> print(var[0,10:14,10:14])  # this will download data from the server
[[-1.26285707951 -- -- --]
 [-0.769166648388 -0.77999997139 -0.675454497337 -0.595714271069]
 [0.128333330154 -0.0500000156462 -0.0636363625526 -0.141666665673]
 [0.638000011444 0.895384609699 0.721666634083 0.810000002384]]
>>> print(var)
<type 'netCDF4.Variable'>
float32 SST('TIME', 'COADSY', 'COADSX')
…
MATLAB and SNCtools
% ex_snctools_opendap.m
% Read from a remote OPeNDAP server with the same file
%
ncRef = 'http://opendap.bom.gov.au:8080/thredds/dodsC/gamssa_4deg/2011/20111106-ABOM-L4LRfnd-GLOB-v01-fv01.nc';
nc_dump( ncRef );
pause
temp = nc_varget( ncRef, 'analysed_sst');
lon = nc_varget( ncRef, 'lon');
lat = nc_varget( ncRef, 'lat');
% x axis is longitude, y axis is latitude (matches the NJTbx demo)
imagesc(lon, lat, temp); axis xy
MATLAB and NJTbx demo
% ex_njtbx.m
% Read from a remote OPeNDAP server with the same file
%
ncRef = 'http://opendap.bom.gov.au:8080/thredds/dodsC/gamssa_4deg/2011/20111106-ABOM-L4LRfnd-GLOB-v01-fv01.nc';
nj_info( ncRef )
pause
[temp, grid] = nj_grid_varget(ncRef,'analysed_sst');
imagesc(grid.lon, grid.lat, temp); axis xy; colorbar
THREDDS services syntax
http://{server:port}/{contextPath}/{service}/...
{contextPath} = “thredds” (servlet default name)
{service} = “fileServer” | “dodsC” | “wms” | “wcs”
• Bulk file transfer
– fileServer = HTTP Server (any file)
• Remote access, subsetting CDM files
– dodsC = OPeNDAP (any CDM file)
– wms = Web Map Server (grids)
– wcs = Web Coverage Server (grids)
– ncss = NetCDF Subset Service (grids)
– admin = Administration/debug interface
Note: each server can change the service name in the XML catalogue.
[Diagram: Tomcat/[Apache] hosting the thredds servlet, which exposes Catalogs and the dodsC, fileServer, wms, wcs, and ncss services]
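The URL pattern for THREDDS services can be expressed as a one-line helper. This is only a sketch: the function name thredds_url is made up, and it assumes the same dataset path is exposed under both the dodsC and fileServer services, which depends on how a given catalogue is configured.

```python
def thredds_url(server, service, path, context_path="thredds"):
    """Compose http://{server:port}/{contextPath}/{service}/{path}."""
    return "http://%s/%s/%s/%s" % (server, context_path, service, path.lstrip("/"))

# Dataset path taken from the MATLAB examples in these slides.
data_path = "gamssa_4deg/2011/20111106-ABOM-L4LRfnd-GLOB-v01-fv01.nc"
opendap_url = thredds_url("opendap.bom.gov.au:8080", "dodsC", data_path)
# Assumed: bulk download of the same file through the fileServer service.
download_url = thredds_url("opendap.bom.gov.au:8080", "fileServer", data_path)
```

Only the service segment changes between the subsetting endpoint and the bulk-download endpoint.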
Hyrax service syntax
http://{server:port}/{contextPath}/{service}/…
e.g. http://test.opendap.org/opendap/hyrax/...
{contextPath} = “opendap” (servlet default name)
{service} = “hyrax” | “admin” | “docs”
– hyrax = catalog interface
– admin = administration interface (v1.8+)
– docs = documentation (v1.8+)
Note: each server can change the service name within the server configuration file.
[Diagram: Tomcat/[Apache] hosting the opendap servlet, which exposes the hyrax, admin, and docs services]
Hyrax Data Service
• DAP2 and DAP3.x as the protocol develops
• Other dataset responses*
– ASCII & NetCDF renderings of data (not limited to data natively stored in netCDF)
– RDF
– ISO 19115 and the conformance rubric (Hyrax 1.8)
• Other server responses**
– THREDDS catalogs
[Diagram: Tomcat/[Apache] hosting Hyrax, which serves DAP2, DAP3.x, RDF*, and Catalogs**]
Note: Hyrax and TDS are not mutually exclusive; sites can install both with little extra effort.
Data Discovery and Access
• Data discovery services
  • NASA’s Global Change Master Directory
    − http://gcmd.nasa.gov
  • IMOS eMII portal
    − http://imosmest.aodn.org.au/geonetwork/srv/en/main.home
    − Help --> http://emii1.its.utas.edu.au/drupal/?q=node/25
  • TERN AusCover portal
    − http://data.auscover.org.au/
  • My Ocean portal
    − http://www.myocean.eu/web/24-catalogue.php
  • TPAC Digital Library
    − http://dl.tpac.org.au
• Data access services
  • Unidata’s THREDDS Data Service
    − http://www.unidata.ucar.edu/projects/THREDDS/
  • OPeNDAP’s Hyrax Data Service
    − http://opendap.org/download/hyrax.html
  • NOAA’s ERDDAP Data Service
    − http://coastwatch.pfeg.noaa.gov/erddap
Some of the Technology in Hyrax
1. THREDDS Dataset Inventory Catalogs provide virtual directories of available data and associated metadata.
2. Supports many formats and data stores: netCDF3, netCDF4, HDF4, HDF5, FreeForm, SQL databases
3. Uses a plug-in based architecture and includes tools to write custom handlers
4. NetCDF Markup Language (NcML) to modify and create virtual aggregations of datasets.
5. OPeNDAP access, a data access method that supports subsetting.
6. Bulk file access through the HTTP protocol.
7. ncISO server provides automated metadata analysis and ISO metadata generation.
8. RDF output - Metadata as triples; used with web-based reasoning systems
9. Code that has passed a formal security audit
10. A true multi-system architecture that can fit in a variety of enterprise settings
11. An administrator’s interface
DAP Responses
• DAP2 defines three response types:
• DAS: A text document that contains data set attributes
• DDS: A text document that contains data set variable types and names
• DODS: A quasi-multipart MIME document that contains the DDS and associated binary values for a data request
• DAP3.x defines two additional response types:
• DDX: An XML document that combines both variable type and name information along with attributes
• DataDDX: A multipart MIME document that combines a DDX with the associated binary values for a data request
TDS and Hyrax both support DAP2. Hyrax also includes support for DAP3; TDS has support for the DDX response.
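To show what the DODS response looks like on the wire, the sketch below splits a hand-built response into its DDS text and binary part, then decodes a single Float32 array. It relies on my understanding of the DAP2 encoding (a "Data:" separator line, then XDR big-endian values with the array length sent twice), so treat it as an assumption to check against the specification. The function names and the sample bytes are invented for this example, and structured types such as Grids need more work than shown here.

```python
import struct

def split_dods(response):
    """Split a DAP2 data response into its DDS text and binary payload."""
    dds, _, payload = response.partition(b"\nData:\n")
    return dds.decode("ascii"), payload

def decode_float32_array(payload):
    """Decode one XDR-encoded Float32 array: length (sent twice), then values."""
    count, repeat = struct.unpack(">II", payload[:8])
    if count != repeat:
        raise ValueError("array length fields disagree")
    return list(struct.unpack(">%df" % count, payload[8:8 + 4 * count]))

# Synthetic response built by hand for this example (not fetched from a server).
sample = (b"Dataset {\n    Float32 lon[lon = 3];\n} example.nc;\nData:\n"
          + struct.pack(">II", 3, 3)
          + struct.pack(">3f", 0.5, 1.5, 2.5))
dds_text, binary = split_dods(sample)
values = decode_float32_array(binary)
```

Client libraries such as Pydap and the netCDF library handle this decoding internally, so application code rarely needs to touch the raw response.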
Some Definitions
• DAP = Data Access Protocol
– Model used to describe the data;
– Request syntax and semantics; and
– Response syntax and semantics.
– The data structure returned to the user
• OPeNDAP
– The software that forms the service;
– Numerous implementations (Hyrax (reference), THREDDS, …);
– Core/libraries for client applications and services.
• THREDDS / Hyrax
– A service framework (portal) that contains the OPeNDAP service
Decipher the URL
• http://opendap.bom.gov.au:8080/thredds/dodsC/gamssa_4deg/2011/20111106-ABOM-L4LRfnd-GLOB-v01-fv01.nc.ascii?lon[0:1:1439]
• Given the OPeNDAP data request above, decipher the URL.
− Request protocol? http
− Host name:port? //opendap.bom.gov.au:8080/
− ContextPath? thredds/
− Service? dodsC/
− Unique path to data set? gamssa_4deg/2011/
− Data reference? 20111106-ABOM-L4LRfnd-GLOB-v01-fv01.nc
− Return type? ascii
− Return variables? ?lon
− Return variable index range? [0:1:1439] --> [start:stride:end]
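The same breakdown can be done with Python's standard URL parser. This is a minimal sketch that assumes the THREDDS layout of contextPath, then service, then dataset path; the function name and dictionary keys are made up for this example.

```python
from urllib.parse import urlsplit

def decipher_dap_url(url):
    """Break a THREDDS OPeNDAP request URL into the parts named above."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    context_path, service = segments[0], segments[1]
    dataset_path = "/".join(segments[2:-1])
    # The last path segment is "<data reference>.<return type>".
    data_reference, _, return_type = segments[-1].rpartition(".")
    return {
        "protocol": parts.scheme,
        "host": parts.netloc,
        "context_path": context_path,
        "service": service,
        "dataset_path": dataset_path,
        "data_reference": data_reference,
        "return_type": return_type,
        "constraint": parts.query,
    }

info = decipher_dap_url(
    "http://opendap.bom.gov.au:8080/thredds/dodsC/gamssa_4deg/2011/"
    "20111106-ABOM-L4LRfnd-GLOB-v01-fv01.nc.ascii?lon[0:1:1439]"
)
```

The constraint entry holds everything after the question mark, here the variable name and its index range.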
NcML: NetCDF Markup Language
NcML can provide two basic features:
• Augmenting/modifying data sets with new
– Attributes
– Values
• Combining two or more data sets (i.e., files) in an aggregation
Three kinds of aggregation are supported:
• Tile files
• Join files along an existing axis
• Join files along a new axis
While very powerful, these aggregations are not applicable to every data set made up of multiple files
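As an illustration of the join aggregations, the sketch below generates a small NcML document with Python's standard library. The element and attribute names (netcdf, aggregation, dimName, type, location) and the namespace follow NcML 2.2 as I understand it; the file names and the helper name ncml_aggregation are hypothetical, and a real "joinNew" aggregation would also need to name the variables to aggregate.

```python
import xml.etree.ElementTree as ET

NCML_NS = "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"

def ncml_aggregation(files, dim_name, agg_type="joinExisting"):
    """Return NcML text aggregating `files` along dimension `dim_name`.

    agg_type is "joinExisting" (existing axis) or "joinNew" (new axis).
    """
    ET.register_namespace("", NCML_NS)  # serialize with a default namespace
    root = ET.Element("{%s}netcdf" % NCML_NS)
    agg = ET.SubElement(root, "{%s}aggregation" % NCML_NS,
                        {"dimName": dim_name, "type": agg_type})
    for location in files:
        ET.SubElement(agg, "{%s}netcdf" % NCML_NS, {"location": location})
    return ET.tostring(root, encoding="unicode")

# Hypothetical daily files joined along their existing "time" axis.
ncml_text = ncml_aggregation(["sst_20111106.nc", "sst_20111107.nc"], "time")
```

A server that understands NcML would then present the listed files as one virtual data set with a longer time axis.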
DAP4 Summary
• DAP (DAP2 and DAP4) is based on datasets built of variables that share the characteristics of variables in programming languages
• Constraints are used to subset data on the server
• DAP4 is a REST API
• DAP4 specifies ‘modern’ web services
– While DAP2 was a data model only, DAP4 includes specification of the web services
• DAP4 provides more complete support for functions