Kitchen Sinks, Plumbing and Virtual Observatories Peter Fox [email protected] June 4, 2010 – CSIRO...
Kitchen Sinks, Plumbing and Virtual Observatories
Peter Fox
June 4, 2010 – CSIRO Aspendale
Tetherless World Constellation 2
Introduction
• Systems compared to frameworks?
• The need, and shifting the burden
• Virtual Observatories
• Architectures of VOs and semantics
• In the lower layers of VOs
– Data access and transport
– Formats, formats, formats
– Sensor streams
• How do you/would you participate?
Frameworks vs. Systems
• Rough definitions
– Systems have very well-defined entry and exit points. A user tends to know when they are using one. Options for extensions are limited and usually require engineering.
– Frameworks have many entry and use points. A user often does not know when they are using one. Extension points are part of the design.
• Treat this as a working definition
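The working definition above can be sketched in a few lines of Python. This is only an illustration: `convert_system`, `ConverterFramework` and the plug-in names are hypothetical and do not come from any VO software. The "system" has one fixed entry and exit point, while the "framework" exposes a registration call as a designed-in extension point.

```python
from typing import Callable, Dict


def convert_system(text: str) -> str:
    """A 'system': one fixed entry point, one fixed behaviour.

    Supporting a new conversion means editing this function (engineering).
    """
    return text.upper()


class ConverterFramework:
    """A 'framework': extension points are part of the design."""

    def __init__(self) -> None:
        self._plugins: Dict[str, Callable[[str], str]] = {}

    def register(self, name: str, plugin: Callable[[str], str]) -> None:
        # Extension point: callers add behaviour without touching this class.
        self._plugins[name] = plugin

    def run(self, name: str, text: str) -> str:
        if name not in self._plugins:
            raise KeyError(f"no plug-in registered for {name!r}")
        return self._plugins[name](text)


framework = ConverterFramework()
framework.register("upper", str.upper)
framework.register("reverse", lambda s: s[::-1])
```

A user calling `framework.run("reverse", ...)` need not know which plug-in, or how many, sit behind the call, which is the sense in which a framework user "often does not know when they are using one".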
Diversity, Integration, Size, …
• Not just large (well organized, long-lived, well-funded) projects/ programs want to make their data available
• Data policies are emerging but are still highly variable (or non-existent)
– How does a user deal with this?
• Need to manage data to solve challenging scientific or societal problems without the continued need for a scientist to know every detail of complex data management systems
• Large-scale, scientific data repositories:
– Most data still created in a manner to simplify generation, not access or use
– Very diverse organization of data; files, directories, metadata, emails, etc.
– Source/origin management is driven by meta-mechanisms for integration, interoperability (but still need performance)
• Virtual Observatories
• Data Grids
• Increasing realization: need management for all forms of ‘data’, i.e. virtual data products are becoming the norm
Size matters: personal data management is as big a problem as source data management, or bigger
Shifting the Burden from the User to the Provider (with the help of VOs)
Terminology
• Workshop: A Virtual Observatory (VO) is a suite of software applications on a set of computers that allows users to uniformly find, access, and use resources (data, software, document, and image products and services using these) from a collection of distributed product repositories and service providers. A VO is a service that unites services and/or multiple repositories.
• VxOs - x is one discipline, domain, community, country
• NB: VO also refers to Virtual Organization
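The workshop definition (uniformly find resources across distributed repositories) can be sketched as a thin federation layer. Everything here is a hypothetical illustration: the class names, repository names and holdings are invented, and a real VO would call remote services rather than in-memory dictionaries.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Resource:
    repository: str
    name: str
    kind: str  # e.g. "data", "software", "document", "image"


class Repository:
    """Stand-in for one distributed product repository."""

    def __init__(self, name: str, holdings: Dict[str, str]) -> None:
        self.name = name
        self._holdings = holdings  # resource name -> kind

    def find(self, keyword: str) -> List[Resource]:
        return [
            Resource(self.name, item, kind)
            for item, kind in sorted(self._holdings.items())
            if keyword.lower() in item.lower()
        ]


class VirtualObservatory:
    """Unites several repositories behind one uniform find() call."""

    def __init__(self, repositories: List[Repository]) -> None:
        self._repositories = repositories

    def find(self, keyword: str) -> List[Resource]:
        results: List[Resource] = []
        for repo in self._repositories:
            results.extend(repo.find(keyword))
        return results


vo = VirtualObservatory([
    Repository("mission-site", {"coronal_images_2003": "image"}),
    Repository("pi-site", {"coronal_density_model": "software"}),
])
```

The user issues one query and receives resources from both hypothetical repositories, without needing to know where each one lives.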
What should a VO do?
• Make “standard” scientific research much more efficient.
– Even the principal investigator (PI) teams should want to use them.
– Must improve on existing services (mission and PI sites, etc.). VOs will not replace these, but will use them in new ways.
• Enable new, global problems to be solved.
– Rapidly gain integrated views from the solar origin to the terrestrial effects of an event.
– Find data related to any particular observation.
– (Ultimately) answer “higher-order” queries such as “Show me the data from cases where a large coronal mass ejection observed by the Solar and Heliospheric Observatory was also observed in situ.” (science-speak) or “What happens when the Sun disrupts the Earth’s environment?” (general public)
Virtual Observatories
• Conceptual examples:
• In-situ: Virtual measurements
– Related measurements
• Remote sensing: Virtual, integrative measurements
– Data integration
• Both usage patterns lead to additional data management challenges at the source and for users; now managing virtual ‘datasets’
Virtual Observatories
Make data and tools quickly and easily accessible to a wide audience.
Operationally, virtual observatories need to find the right balance of data/model holdings, portals and client software that researchers can use without effort or interference, as if all the materials were available on their local computer in their preferred language; i.e. the holdings should appear to be local and integrated
Likely to provide controlled vocabularies that may be used for interoperation in appropriate domains along with database interfaces for access and storage
Early days of VxOs
[Diagram: VO1, VO2 and VO3 over databases DB1, DB2, DB3, … DBn, with a “?” between them]
Federation
[Diagram: VO1, VO2, VO3 and VO4 over databases DB1, DB2, DB3, … DBn]
The Astronomy approach; data-types as a service
[Diagram: VO App1, VO App2 and VO App3 over databases DB1, DB2, DB3, … DBn, with a VO layer between them: VOTable, Simple Image Access Protocol, Simple Spectrum Access Protocol, Simple Time Access Protocol]
Limited interoperability
Lightweight semantics
Limited meaning, hard coded
Limited extensibility
Under review
OGC: {WFS, WCS, WMS} and SWE {SOS, SPS, SAS} use the same approach
Similarities to Astronomy
• Some disciplines have chosen a data format (some even use FITS)
• Common applications, community standards appearing
• Images, spectra (incl. multi-band), …
• More and more data is on-line, some (near) real-time
• Data flood - synoptic measurements, spatial/spectral resolution, number of instruments, cadence - all increasing (peta-byte to exa-byte is real), data mining and knowledge extraction are now real needs
• Don’t move (or replicate?) the data when possible
• Means for interoperation is being demanded - service-oriented architectures
• Some VOs even implementing IVOA standards (primarily heliophysics and space physics)
Differences with astronomy
• Data types (+station/point, irregular, multi-resolution, ragged arrays, swath, …)
• Data formats - many
• Lots of VOs
• Metadata conventions range from strict to non-existent
• Provenance, derivation and semantics being applied in (more) formal ways
• Geo-spatial dominates (cf helio-spatial), some standards but little/no enforcement - efforts at conventions/standards are at data model level
• New to the theme of integration and inter-disciplinary
• Number and complexity of projects, systems, frameworks - need to interoperate at many levels
• Social, political and mission forces are immense
Fox - APAC 2007, Driving e-research: Grids and Semantics
[Diagram: layered VO architecture with semantic mediation]
Components: VO Portal, Web Serv., VO API; databases DB1, DB2, DB3, … DBn
Layers: Semantic mediation layer - VSTO - low level; Semantic mediation layer - mid-upper-level; Education, clearinghouses, other services, disciplines, etc.
Labels: Metadata, schema, data; Query, access and use of data; Semantic query, hypothesis and inference; Semantic interoperability (each marked “Added value”)
Mediation Layer
• Ontology - capturing concepts of Parameters, Instruments, Date/Time, Data Product (and associated classes, properties) and Service Classes
• Maps queries to underlying data
• Generates access requests for metadata, data
• Allows queries, reasoning, analysis, new hypothesis generation, testing, explanation, etc.
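The mapping step of a mediation layer can be sketched as follows. This is a toy under stated assumptions: the `ONTOLOGY` table, the parameter and instrument names, and the dataset paths are all hypothetical, and a real mediation layer would use an ontology and a reasoner rather than a dictionary. It shows a query phrased in terms of Parameter, Instrument and Date/Time being turned into access requests against underlying holdings.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for an ontology: (parameter, instrument) -> dataset.
ONTOLOGY: Dict[Tuple[str, str], str] = {
    ("neutral_temperature", "fabry_perot"): "db1/fpi_temps",
    ("neutral_temperature", "incoherent_scatter_radar"): "db2/isr_temps",
    ("electron_density", "incoherent_scatter_radar"): "db2/isr_ne",
}


@dataclass(frozen=True)
class AccessRequest:
    dataset: str
    start: date
    end: date


def instruments_for(parameter: str) -> List[str]:
    """Offer only the instruments that can measure the chosen parameter."""
    return sorted(inst for (param, inst) in ONTOLOGY if param == parameter)


def mediate(parameter: str, start: date, end: date,
            instrument: Optional[str] = None) -> List[AccessRequest]:
    """Map an abstract query onto access requests for the underlying data."""
    if start > end:
        raise ValueError("start date must not be after end date")
    requests = []
    for (param, inst), dataset in sorted(ONTOLOGY.items()):
        if param == parameter and instrument in (None, inst):
            requests.append(AccessRequest(dataset, start, end))
    return requests
```

Because the choices offered come from the same table that generates the requests, a user can only compose combinations that some dataset can actually satisfy.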
Semantic Web Benefits
• Unified/abstracted query workflow: Parameters, Instruments, Date-Time
• Decreased input requirements for query: in one case reducing the number of selections from eight to three
• Generates only syntactically correct queries: which could not always be ensured in previous implementations without semantics
• Semantic query support: by using background ontologies and a reasoner, our application has the opportunity to expose only coherent queries (portal and services)
• Semantic integration: in the past users had to remember (and maintain codes) to account for numerous different ways to combine and plot the data, whereas now semantic mediation provides the level of sensible data integration required, exposed as smart web services
– understanding of coordinate systems, relationships, data synthesis, transformations
– returns independent variables and related parameters
• A broader range of potential users (PhD scientists, students, professional research associates and those from outside the fields)
Virtual Carbon Observatory
Environmental Assessment
Understand Communities Of Stakeholders
Multi-domain Knowledge Base
Vocabularies and Ontologies
• An underlying aspect of all VOs is the need to develop/agree on a common presentation of the (virtual) holdings, aka a catalog
• As discipline boundaries are crossed… (ecology)
• Vocabularies are increasingly important in this provision
• And, interestingly, there is a real push toward more explicit representations of semantics in the form of ontologies
• … and provision of vocabulary services*
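What a vocabulary service adds beyond the vocabulary itself can be sketched as a lookup that translates a term between community vocabularies through a shared concept. The class, the vocabulary names and the concept identifier below are hypothetical illustrations, not an existing service.

```python
from typing import Dict, Optional, Tuple


class VocabularyService:
    """Toy vocabulary service: terms in several vocabularies share a concept."""

    def __init__(self) -> None:
        self._concept_of: Dict[Tuple[str, str], str] = {}  # (vocab, term) -> concept
        self._term_in: Dict[Tuple[str, str], str] = {}      # (concept, vocab) -> term

    def add(self, concept: str, vocabulary: str, term: str) -> None:
        self._concept_of[(vocabulary, term)] = concept
        self._term_in[(concept, vocabulary)] = term

    def translate(self, term: str, source: str, target: str) -> Optional[str]:
        """Return the target-vocabulary term for the same concept, if any."""
        concept = self._concept_of.get((source, term))
        if concept is None:
            return None
        return self._term_in.get((concept, target))


service = VocabularyService()
service.add("concept:sst", "oceanography", "sea_surface_temperature")
service.add("concept:sst", "ecology", "surface water temperature")
```

A catalog that crosses discipline boundaries could then accept a query in one community's terms and search holdings described in another's.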
Let’s turn to plumbing
• Data formats are of resurgent interest but not so much for exchange
– For structural representation and efficiency
– For transparency and preservation
– However, a lot of end-users still care about formats immensely
• Data access and transport
• Implications of computing closer to the data
netCDF and similar
• Version 3 (classic) vs. version 4 (aka CDM)
• V4 - slow adoption to date (no specific reason)
• Conventions (e.g. units, CF-1) make it work
• Traditional focus on grids is now evolving as in-situ data and model comparisons are becoming common, i.e. unstructured data
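Why conventions "make it work" can be illustrated with a small metadata check. This sketch assumes the file's attributes have already been read into plain dictionaries (a real application would use a netCDF library for that); the attribute names `Conventions` and `units` follow CF usage, while the function name and sample metadata are hypothetical.

```python
from typing import Dict, List


def check_cf_like(global_attrs: Dict[str, str],
                  variables: Dict[str, Dict[str, str]]) -> List[str]:
    """Return a list of problems that would hinder automated use of the file."""
    problems: List[str] = []
    conventions = str(global_attrs.get("Conventions", ""))
    if not conventions.startswith("CF-"):
        problems.append("global attribute 'Conventions' does not name a CF version")
    for name, attrs in sorted(variables.items()):
        if "units" not in attrs:
            problems.append(f"variable {name!r} has no 'units' attribute")
    return problems


# Hypothetical metadata for a small gridded file.
sample_globals = {"Conventions": "CF-1.4"}
sample_variables = {
    "sst": {"units": "K", "standard_name": "sea_surface_temperature"},
    "flag": {},  # no units: a generic client cannot interpret this safely
}
print(check_cf_like(sample_globals, sample_variables))
```

A generic client can label axes or convert units only when such attributes are present and agreed upon, which is the work the conventions do on top of the file format.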
Discipline neutral access
• One such approach, since 1993, is the DAP – Data Access Protocol (NASA, NOAA standard)
• opendap.org (U.S. not-for-profit)
• OPeNDAP is the software
– Core, server (version 4 – Hyrax), client, services
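As a rough illustration of discipline-neutral access, a DAP2 request is just a URL: the dataset address, a suffix selecting the response (`.dds` for structure, `.das` for attributes, `.dods` for binary data, and `.ascii` on servers that offer it), and an optional constraint expression selecting variables and index ranges. The helper names and the example host below are hypothetical; only the URL shape is taken from DAP2 usage.

```python
RESPONSE_SUFFIXES = {"dds", "das", "dods", "ascii"}


def hyperslab(variable: str, *ranges: tuple) -> str:
    """Build a projection such as sst[0:1:0][10:2:20].

    Each range is a (start, stride, stop) triple of array indices.
    """
    return variable + "".join(f"[{a}:{s}:{b}]" for a, s, b in ranges)


def dap2_url(dataset_url: str, response: str, *projections: str) -> str:
    """Append the response suffix and an optional constraint expression."""
    if response not in RESPONSE_SUFFIXES:
        raise ValueError(f"unknown response type: {response}")
    url = f"{dataset_url}.{response}"
    if projections:
        url += "?" + ",".join(projections)
    return url


# Hypothetical dataset address, used only to show the URL shape.
base = "http://example.org/opendap/data/nc/sst.nc"
print(dap2_url(base, "dds"))
print(dap2_url(base, "ascii", hyperslab("sst", (0, 1, 0), (10, 2, 20))))
```

Nothing in the request depends on the science domain or on the file format held by the server, which is what makes the approach discipline neutral.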
OPeNDAP Hyrax Architecture
[Diagram: Client, OLFS, BES, Data]
OPeNDAP Lightweight Front end Server (OLFS)
• Receives requests and asks the BES to fill them
• Uses Java Servlets
• Does not directly ‘touch’ data
• Multi-protocol
Back End Server (BES)
• Reads data files, databases, etc., returns info
• May return DAP2 objects or other data
• Does not require web server
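The division of labour between the two servers can be sketched as two small classes. This is a toy, not Hyrax code: the class names, the in-memory data store and the two response types are hypothetical. The point it illustrates is that the front end only parses the request and delegates, while the back end is the only component that reads data.

```python
from typing import Dict, List


class BackEndServer:
    """Stand-in for the BES: the only component that reads data."""

    def __init__(self, store: Dict[str, List[float]]) -> None:
        self._store = store

    def handle(self, dataset: str, response: str) -> str:
        values = self._store[dataset]  # raises KeyError for unknown datasets
        if response == "ascii":
            return ", ".join(str(v) for v in values)
        if response == "info":
            return f"{dataset}: {len(values)} values"
        raise ValueError(f"unsupported response type: {response}")


class FrontEndServer:
    """Stand-in for the OLFS: receives requests and never touches data."""

    def __init__(self, backend: BackEndServer) -> None:
        self._backend = backend

    def get(self, path: str) -> str:
        # The last suffix selects the response, e.g. "demo.nc.ascii".
        dataset, dot, response = path.rpartition(".")
        if not dot:
            raise ValueError("request needs a response suffix")
        return self._backend.handle(dataset, response)


frontend = FrontEndServer(BackEndServer({"demo.nc": [1.0, 2.5, 4.0]}))
print(frontend.get("demo.nc.info"))
```

Because the front end holds no data logic, it could be replaced or supplemented by another protocol handler without changing the back end.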
OPeNDAP Lightweight Front end Server
[Diagram: request from client and response to client pass through the OLFS, which talks to the BES]
Labels: GridFTP DAP2; HTTP DAP2; ASCII output; HTML form; Info output; THREDDS; Request Formulation**; SOAP-DAP (HTTP); DAP2 (GridFTP, HTTP); RDF, OWL, JSON (HTTP); PML output
Hyrax/ Back-end Server
[Diagram: BES Framework]
Labels: Network Protocol and Process start/stop activities; Data Store Interfaces; PPT*; Initialization/Termination; DAP2 Access; NetCDF3, HDF4, RDF/SPARQL, …; Provenance; Commands**; BES Commands/XML Documents; Data; Data Catalogs
*PPT is built in (other protocols)
**Some commands are built in
Status of the Community OPeNDAP Server Software
• Hyrax 1.6 provides support for NcML-based aggregation
• Faster THREDDS implementation (but not full featured)
• Full security audit and static code analysis certification to comply with NOAA and NASA requirements
• DAP4 (which includes netCDF 4 support) is not available yet
• AND other things
Earth System Grid Center for Enabling Technologies: (ESG-CET)
Earth System Grid Center for Enabling Technologies
• Large data sets, numbers and sizes
– High performance
– Flexible architecture, both client and several types and numbers of servers
– Aggregation
– Server side operations
– Multiple transport protocol options
• Full ESG security support as well as loose federation
• Full function client access via API (netCDF/CDM)
To satisfy the new goals, the OPeNDAP services for ESG have been re-architected. We now use parts of the standard OPeNDAP framework Hyrax, focusing on high performance for the client side and extended flexibility.
Requirements leading to OPeNDAP-g
• Separation of the core Data Access Protocol (DAP) from the transport protocol (HTTP).
• High Performance Computing. The previous CGI based servers did not have the capacity required by ESG. Error and memory handling added.
• Security. Once OPeNDAP was independent of the transport protocol, adding security was possible by relying on the Globus gsiFTP system.
• Aggregation. OPeNDAP 3.0 did not operate on aggregated datasets. OPeNDAP-g does.
• Transport protocol independence and HPC were incorporated back into OPeNDAP leading to the current version. Security and aggregation initially were ESG only features.
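The first and third requirements can be sketched together: once the protocol layer talks to an abstract transport, a secure transport can be swapped in without touching the protocol code. All names below are hypothetical, the message format is invented for the demonstration, and the loopback transport simply echoes the request instead of using a network.

```python
from abc import ABC, abstractmethod


class Transport(ABC):
    """What the protocol layer needs from any transport."""

    secure: bool = False

    @abstractmethod
    def send(self, payload: bytes) -> bytes:
        ...


class LoopbackTransport(Transport):
    """Hypothetical in-memory transport that echoes what it was asked to carry."""

    def __init__(self, name: str, secure: bool) -> None:
        self.name = name
        self.secure = secure

    def send(self, payload: bytes) -> bytes:
        return f"via {self.name}: ".encode() + payload


class DapSession:
    """Protocol layer: builds requests, knows nothing about the wire."""

    def __init__(self, transport: Transport, require_secure: bool = False) -> None:
        if require_secure and not transport.secure:
            raise PermissionError("a secure transport is required")
        self._transport = transport

    def request(self, dataset: str, response: str = "dds") -> str:
        message = f"{dataset}.{response}".encode()  # toy message format
        return self._transport.send(message).decode()


plain = DapSession(LoopbackTransport("http", secure=False))
secured = DapSession(LoopbackTransport("gsiftp", secure=True), require_secure=True)
print(plain.request("sst.nc"))
print(secured.request("sst.nc", "das"))
```

The design choice mirrored here is that security becomes a property of the transport chosen at session creation, not something the data access protocol has to implement itself.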
The Remote NetCDF Invocation (RNI)
The client is the netCDF library. It has exactly the same API as the standard C netCDF library, but it can deal with local files or files reachable via HTTP, PPT or gridFTP. The third tier, the BES server, can be reached only via PPT. NetCDF services for all NetCDF calls are implemented as a BES module. The middle tier acts like a proxy between the RNI client and server and deals with security.
RNI Architecture
[Diagram labels: CLIENT; DATA; GridFTP; OPeNDAP BES; NetCDF Library; RNI Module; RNI Library; “connection acts like”]
Characteristics of the RNI as part of a data access system
• Full Support of standard OPeNDAP URLs. RNI is being developed with the integrated Unidata/OPeNDAP netCDF library (and CDM)
• Transparent access to both standard netCDF files and aggregated datasets via the NetCDF Markup Language (NcML).
• For remote containers, all write operations are disabled for security. That is, for HTTP/HTTPS, PPT and gridFTP/gsiFTP the RNI system is a read-only API.
• RNI utilizes Just in Time access. Caching is only for metadata. No pre-fetching of data.
• RNI transparently accesses secure (gsiFTP, HTTPS) or insecure (gridFTP, HTTP) remote data.
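Three of these characteristics (read-only remote access, metadata-only caching, and just-in-time data access with no pre-fetching) can be sketched in one small class. This is not the RNI API: the class name, the callback-based fetchers and the example address are hypothetical stand-ins chosen so the sketch runs without a server.

```python
from typing import Callable, Dict, List, Optional
from urllib.parse import urlparse

REMOTE_SCHEMES = {"http", "https", "ppt", "gridftp", "gsiftp"}


class RemoteAwareDataset:
    """One interface for local and remote data; remote containers are read-only."""

    def __init__(self, location: str,
                 fetch_metadata: Callable[[str], Dict],
                 fetch_data: Callable[[str, str, int, int], List]) -> None:
        self.location = location
        self.remote = urlparse(location).scheme.lower() in REMOTE_SCHEMES
        self._fetch_metadata = fetch_metadata
        self._fetch_data = fetch_data
        self._metadata: Optional[Dict] = None  # the only thing ever cached

    def metadata(self) -> Dict:
        if self._metadata is None:
            self._metadata = self._fetch_metadata(self.location)
        return self._metadata

    def read(self, variable: str, start: int, stop: int) -> List:
        # Just in time: every read goes to the source, nothing is pre-fetched.
        return self._fetch_data(self.location, variable, start, stop)

    def write(self, variable: str, values: List) -> None:
        if self.remote:
            raise PermissionError("remote containers are read-only")
        raise NotImplementedError("local writes are outside this sketch")
```

A caller would construct it with, for example, `RemoteAwareDataset("gsiftp://example.org/data/sst.nc", my_metadata_fetcher, my_data_fetcher)`, where both fetchers are whatever client functions the application already has.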
Other DAP client/ API library status
• OPeNDAP-Unidata project to fold ‘libnc-dap’ into the standard netCDF distribution, i.e. you get ‘DAP’ for free
• New C-API for DAP – ‘oc’ replaces ocapi and will be the basis for rewrites of the IDL and Matlab (and other) client interfaces
NOAA/IOOS
• DAP adopted by DMAC
• Gateway project for OPeNDAP
– Support for WCS/WFS as source and response type in Hyrax
– Implementation of AIS (Ancillary Information Service) for RDF return prototype
– Initial DAP ontology data model
Cloud
• Microsoft ported OPeNDAP Hyrax to their Azure cloud
– http://opendap.cloudapp.net/dap
– Web-client/form is at http://opendap.cloudapp.net/dap/data/nc/contents.html
• Work on Azure Drive (Xdrive) underway
• No decisions on future or other cloud environments
Security (authn/z)
• Developed with Bryan Lawrence (BADC/STFC) for federation of OPeNDAP security
• Specified in May 2009, implementations presented at EGU in 2010
• Will appear in ESG and community OPeNDAP releases
• AAF compatible?
Sensors
• Due to the increasing demand to process off the sensor:
– Sky surveys – volume
– Monitoring – for rapid response and decision support
– As part of a network, or on the internet, a web
• There is a corresponding increase in need to ingest/ publish data much earlier than has previously been needed
• Trend toward treating them as RT/NRT sensors
Directions for sensor and spatial standards (my view)
• Has grown out of a limited set of semantic constructs
– Geography, features, coverages, maps, streams
• Integration needs are driving different (good) developments, e.g. WCS 2 v WFS 2.
• Transparency requirements are going to drive very different approaches, e.g. encapsulation can be a barrier
• Refactoring of standards, much as is happening in astronomy, will be required
Who is developing? Your participation?
• VOs
– U.S. – NASA, NSF, NOAA are developing/funding
– EU – many, e.g. HELIO, SOTERIA
• DAP/OPeNDAP
– World-wide community, strong Australian contributions/use
• Sensors
– W3C recent – incubator for semantic sensor web – very, very important work
• Vocabulary servers (more than the vocabularies)
– Interest in community-based (or W3C) effort
Issues for Virtual Observatories - Geo
• Scaling to large numbers of data providers
• Security, policy enforcement
• Data quality
• Branding and attribution (where did this data come from and who gets the credit, is it the correct version, is this an authoritative source?)
• Provenance/derivation (propagating key information as it passes through a variety of services, copies of processing algorithms, …)
• Sustainability
Summary/ Discussion
• The VO paradigm is in widespread use in Earth and Space Sciences
– Successful implementations in production and use (some even have evaluations)
– New science is being enabled and performed
– There are active programs at the agency level
– Active communities; meeting, publishing, developing, implementing
• Data access and transport is an active field
• New attention to spatio-temporal standards and vocabularies in the context of services
• Substantial re-visiting of architectures due to the need to accommodate explicit semantics (esp. in regard to sensors)
Further Information
• http://tw.rpi.edu/
• http://www.opendap.org and http://docs.opendap.org
• Lots of others (ask me)
• Contact: