GeoDataspace: Simplifying Data Management Tasks with Globus
Simplifying Data Management Tasks with Globus
Tanu Malik, Ian Foster, Kyle Chard, Roselyne Tchoua, Joseph Baker, Mike Gurnis, Jonathan Goodall, Scott Peckham
GeoDataspace
Share and Reproduce
Alice wants to share her models and simulation output with Bob, and Bob wants to re-execute Alice’s application to validate her inputs and outputs.
Alice’s Options
1. Create a tar and gzip archive
2. Build a website with model code, parameters, and data
3. Create a virtual machine
Bob’s Frustration
1. I do not find the lib.so required for building the model.
2. How do I?
Lack of easy and efficient methods for sharing and reproducibility
[Figure axes: amount of pain Bob suffers; amount of pain Alice suffers]
GeoDataspace
• Goal: Sharing and reproducibility hand-in-hand
• Target users: Computational geoscientists
• Data and model integration
• Research Output is More Than "Just" a Research Paper
GeoDataspace CI Components
• The geounits
  • Units of scientific activity/research output
  • How to capture and track this activity
• Globus Catalog
  • A scalable, flexible catalog for annotations conforming to the open-world assumption
• Globus Publish and reproduce geounits
  • Share/Publish geounits for others
  • Replay geounits for analysis
geounits: package data, source code, and environment
geounit Client: Provenance is key
1. audit <program name>
2. PROV-compliant database
3. exec <program name> [activity]
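The three steps above (audit a program, keep what was observed in a PROV-compliant store, re-execute) can be pictured with a minimal sketch. It assumes the audit step has already produced the lists of files a program read and wrote. The helper name `record_run`, the file names, and the record layout are hypothetical; only the relation names `used` and `wasGeneratedBy` come from the W3C PROV vocabulary. This is not the geounit client's actual schema.

```python
import json


def record_run(program, inputs, outputs):
    """Build a PROV-style record for one audited program run.

    Hypothetical helper: the layout is illustrative only.
    """
    activity = f"run:{program}"
    return {
        "activity": {activity: {"program": program}},
        "entity": {f"file:{p}": {} for p in sorted(set(inputs) | set(outputs))},
        # PROV relation: the activity used each input file.
        "used": [{"activity": activity, "entity": f"file:{p}"} for p in inputs],
        # PROV relation: each output file was generated by the activity.
        "wasGeneratedBy": [
            {"entity": f"file:{p}", "activity": activity} for p in outputs
        ],
    }


# Made-up file names standing in for what an audit might observe.
record = record_run("model.exe", ["params.cfg", "lib.so"], ["output.dat"])
print(json.dumps(record, indent=2))
```

A record of this kind lists every file the run depended on, including shared libraries such as the `lib.so` Bob could not find.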
geounit Client: Features
• Based on the ptrace and okapi functionality of CDE (Code, Data, Environment)
• Data/code can be local or distributed
• Data/code files are not materialized in the package until it is ready to share; until then the package holds only their descriptions
• Specify granularity of auditing
• Partial replay
• Unpack into Docker or Vagrant
Globus Catalog: hosts geounits
• Dataset Management Model
  • Catalog: a hosted resource that enables the grouping of related datasets
  • Dataset: a virtual collection of (schemaless) metadata and distributed data elements, viz. files and provenance
  • Annotation: a piece of metadata that exists within the context of a dataset or data member
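The containment described above (a catalog groups datasets, a dataset holds metadata and data members, annotations attach to either) can be sketched with plain Python dataclasses. All class names, field names, and values below are hypothetical and chosen for illustration; they are not the Globus Catalog data model or client API.

```python
from dataclasses import dataclass, field


@dataclass
class Member:
    """A distributed data element, e.g. a file or a provenance record."""
    location: str
    annotations: dict = field(default_factory=dict)


@dataclass
class Dataset:
    """A virtual collection of schemaless metadata and data members."""
    name: str
    annotations: dict = field(default_factory=dict)
    members: list = field(default_factory=list)


@dataclass
class Catalog:
    """A hosted resource that groups related datasets."""
    name: str
    datasets: dict = field(default_factory=dict)

    def add(self, dataset):
        """Register a dataset under its name and return it."""
        self.datasets[dataset.name] = dataset
        return dataset


catalog = Catalog("geodataspace")
unit = catalog.add(Dataset("geounit-1"))
# Annotation in the context of the dataset.
unit.annotations["model"] = "GPlates"
# Annotation in the context of a data member (made-up location string).
unit.members.append(Member("endpoint#/data/plates.gpml", {"format": "GPML"}))
```

Because annotations are free-form name/value pairs, no schema has to be agreed on before a dataset is created.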
Globus Catalog
• Dataset Service
• Virtual views of data based on user-defined and/or automatically extracted metadata (annotations)
• Implemented as a service with web and REST interfaces
• Relies on Globus Nexus for user authentication and group management
• Client-side Tooling
• Dataset ingest
• Automatic creation of datasets and extraction of metadata from various common data formats and directory structures
• Globus endpoints
• Associate data (in files and directories) with one or more datasets
• Python Client library
• Integration with external services
• Transfer: Moving datasets from their storage endpoint(s) to a selected destination
• Faceted Browser Search
• Search based on provenance entities and activities
Globus Catalog: REST interface
Approach
• Hosted user-defined catalogs
• Based on annotation model <dataset/member, name, value>
• Association of data members
• Fine grained access control
• Flexible query language – Name:value, free text, facets,…
• Integrated with other services
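As a rough illustration of the annotation model and the name:value and facet queries listed above, the sketch below stores annotations as (subject, name, value) triples in memory. The function names `match` and `facet` and all annotation values are assumptions made for this example; the actual catalog query language is not shown on the slide.

```python
from collections import Counter

# Each annotation is a (subject, name, value) triple; the subject is a
# dataset or a data member. All values here are made up for illustration.
annotations = [
    ("geounit-1", "domain", "solid-earth"),
    ("geounit-1", "model", "GPlates"),
    ("geounit-2", "domain", "hydrology"),
    ("geounit-2", "model", "VIC"),
    ("geounit-3", "domain", "hydrology"),
]


def match(query):
    """Return subjects with an annotation matching a 'name:value' query."""
    name, _, value = query.partition(":")
    return sorted({s for s, n, v in annotations if n == name and v == value})


def facet(name):
    """Count how many annotations carry each value of the given name."""
    return Counter(v for _, n, v in annotations if n == name)


print(match("domain:hydrology"))
print(facet("domain"))
```

A facet count of this kind is what a faceted browser would show next to each value to narrow a search.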
/geodataspace
/geodataspace/annotation
/geodataspace/geounit
/geodataspace/geounit/annotation
/geodataspace/geounit/acl
/geodataspace/geounit/members
/geodataspace/geounit/members/annotation
/geodataspace/geounit/provenance
/geodataspace/geounit/version
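The resource paths listed above suggest a hierarchy in which a geounit sits under the `geodataspace` catalog and exposes sub-resources for annotations, access control, members, provenance, and versions. The sketch below only builds URLs for those paths. The host name is a placeholder, `geounit_url` is a hypothetical helper, and treating `geodataspace` and `geounit` as substitutable path segments is an assumption; no request is sent and no real client API is implied.

```python
from urllib.parse import quote

BASE_URL = "https://catalog.example.org"  # placeholder host, not a real service

# Sub-resources that appear under /geodataspace/geounit on the slide.
GEOUNIT_RESOURCES = {
    "annotation",
    "acl",
    "members",
    "members/annotation",
    "provenance",
    "version",
}


def geounit_url(resource=None, catalog="geodataspace", geounit="geounit"):
    """Return the URL of a geounit or of one of its sub-resources."""
    parts = [catalog, geounit]
    if resource is not None:
        if resource not in GEOUNIT_RESOURCES:
            raise ValueError(f"unknown resource: {resource}")
        parts.extend(resource.split("/"))
    # Percent-encode each segment so unusual names stay valid in a URL.
    return BASE_URL + "/" + "/".join(quote(p, safe="") for p in parts)


print(geounit_url())
print(geounit_url("provenance"))
print(geounit_url("members/annotation"))
```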
Publish and Re-execute geounits
• Still in the works
• Each geounit can be published through Globus Publish and re-executed through an analysis platform
Science Drivers
Solid Earth
Space Science
Hydrology
CSDMS
Solid Earth
• Allow reproducible, replayable geounits of GPlates
• GPlates
• Software package has several dependencies
• Create geounits of Kinematic Representation of Surface of Earth (3D and 4D models)
• GPlates software,
• GPML files (XML for plate tectonics) used in the model,
• Output files: GPML files in simple X/Y format, or visualization files (a global set of visualization output, and images as well)
• Integrating geounits in Python workflows
• Incorporate metadata from workflows and use geounit metadata to inform workflows
Hydrology
• Data processing steps for the VIC model
[Diagram labels: geounit 1, geounit 2, geounit 3, geounit 4]
Objective: Monitor changes in the data processing steps and compare them across the various runs
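One simple way to compare processing steps across runs, shown here only as a sketch, is to fingerprint each step's command and parameters and report the steps whose fingerprints differ. The step names, scripts, and parameters below are invented for illustration and are not the actual VIC preprocessing chain; `fingerprint` and `changed_steps` are hypothetical helpers.

```python
import hashlib


def fingerprint(step):
    """Hash a step's command and parameters so changes are easy to detect."""
    text = step["command"] + "|" + repr(sorted(step["params"].items()))
    return hashlib.sha256(text.encode()).hexdigest()


def changed_steps(run_a, run_b):
    """Return names of steps that differ, or exist in only one of the runs."""
    names = sorted(set(run_a) | set(run_b))
    return [
        n for n in names
        if n not in run_a
        or n not in run_b
        or fingerprint(run_a[n]) != fingerprint(run_b[n])
    ]


# Two made-up runs that differ only in the regridding resolution.
run1 = {
    "regrid": {"command": "regrid.py", "params": {"resolution": 0.5}},
    "forcing": {"command": "make_forcing.py", "params": {"years": 10}},
}
run2 = {
    "regrid": {"command": "regrid.py", "params": {"resolution": 0.25}},
    "forcing": {"command": "make_forcing.py", "params": {"years": 10}},
}
print(changed_steps(run1, run2))
```

If each processing step is captured as its own geounit, a comparison like this could run over the recorded provenance rather than hand-written dictionaries.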
Space Science
• Create geounits of SuperDARN data and its plotting products
• Publish them for validation
CSDMS
• How geounits should be coupled
• Metadata alignment issues
• If we create geounits of CSDMS models, how do we enable suitable search interfaces with the provenance metadata and CSDMS metadata?
Current Work
• Working with use cases to bootstrap geounits
• Populating geounits based on Python workflows and incorporating geounits in workflows
• Interfacing geounit Client with Globus Catalog
• Improving distributed search functionality
Track it!
• http://workspace.earthcube.org/geodataspace
• Software, Source Code, Science Use Cases, Reports, Presentations, News
Acknowledgements
• National Science Foundation
• EarthCube Community
• Globus team
• CI team