Aspects of Reproducibility in Earth Science

Aspects of Reproducibility in Earth Science – ongoing work Raul Palma Poznan Supercomputing and Networking Center, Poland Dagstuhl seminar: Reproducibility of Data-Oriented Experiments in e-Science January, 2016

Transcript of Aspects of Reproducibility in Earth Science

Page 1: Aspects of Reproducibility in Earth Science

Aspects of Reproducibility in Earth Science – ongoing work
Raul Palma
Poznan Supercomputing and Networking Center, Poland
Dagstuhl seminar: Reproducibility of Data-Oriented Experiments in e-Science
January, 2016

Page 2: Aspects of Reproducibility in Earth Science

Context

• Project ID: 674907
• Project Type: RIA
• Start Date: 01.10.2015
• Duration: 36 Months
• Website: TBC
• Maximum Grant Amount: 6,649,002 €
• Total funded effort in person/months: 663
• Coordinator: European Space Agency
• Contact Person: Mirko Albani (ESA)

Page 3: Aspects of Reproducibility in Earth Science

EVEREST Consortium

Page 4: Aspects of Reproducibility in Earth Science

Key objectives

Establish a VRE e-infrastructure for Earth Science, addressing the needs of different ES communities to facilitate their collaborative working and research:

• Discover, access, assess and process existing and new heterogeneous ES datasets and preserved knowledge held by distributed data centres
• Share data, models, algorithms, scientific results and their own experiences within a community or across communities
• Capture, annotate and store the workflows, processes and results from their research activities
• Ensure the long-term sustainability and preservation of data, models, workflows, tools and services developed by existing communities

Validate the VRE with four main Virtual Research Communities (VRCs):

• Sea Monitoring VRC
• Natural Hazards VRC (floods, geological, weather, wildfires)
• Land Monitoring VRC
• Supersites VRC (volcanoes and seismic)

Page 5: Aspects of Reproducibility in Earth Science

Key objectives

Define, implement and validate the Research Object (RO) concepts and technologies within the ES context as the means for sharing information and establishing more effective collaboration in the VRE

Page 6: Aspects of Reproducibility in Earth Science

Reproducibility aspects

Page 7: Aspects of Reproducibility in Earth Science

Earth Science Research and Information Lifecycle (high level story)

Page 8: Aspects of Reproducibility in Earth Science

Experimental Science (to compare)

Diagram labels:

• Background
• Hypothesis
• Assumptions
• Input data
• Method
• Experiment Results (data)
• Scientific Interpretation
• Publication
• Results (Data)
• Contribution to Science
• Communicate contribution to the community
• Contribution to Research Community
• Reuse (incremental)

Peer review: “Are these novel findings? Was the method sound?”

Reader: “I trust that this method is sound.”

Page 9: Aspects of Reproducibility in Earth Science

Supersite Science - ES VRC (more concrete story)

• Historical science mostly based on past observations, as opposed to experimental science
• Hypothesis testing is not normally the main activity
• Main activities of the VRC:
    • measure geophysical parameters in the natural environment
    • derive information on the effects of the phenomena and processes
    • model this information to generate space/time representations of geophysical phenomena
    • provide these representations to risk management stakeholders
    • use the information to develop theories or confirm hypotheses

Page 10: Aspects of Reproducibility in Earth Science

Supersite VRC operational scenario

With a Supersite agreement in place:

• In situ data providers (normally local monitoring agencies) provide open access to their data collections (with a data policy), including raw and processed data
• Space agencies acquire and distribute satellite EO data (personal licenses to sign)
• Authorized scientists should be able to access and display the data online, process them using community tools, validate the results, model the validated data, generate research products and build consensus on scientific information for end-users
• Authorized end-users (local) should be able to access the scientific information online and provide feedback
• The general public should be able to browse part of the data, the published results, and part of the scientific information provided to users (if the latter authorize disclosure)

Page 11: Aspects of Reproducibility in Earth Science

Research Objects in Supersite VRC

Current main use scenarios:

• Documentation/communication
• Reproducibility of scientific results

Page 12: Aspects of Reproducibility in Earth Science

Research Objects in Supersite VRC

Documentation/communication

• Document best practices (WFs, analysis methods, monitoring methods, etc.)
• Training purposes
• Provide long term preservation of scientific knowledge (how data are analyzed, how results are validated, etc.)
• Provide long term preservation of end-user stories (demonstrating scientist-end-user interactions)
• Public dissemination
• Provide good management of intellectual property, through licensing and PID/DOI, to allow fast work recognition
• Others tbd

Page 13: Aspects of Reproducibility in Earth Science

Research Objects in Supersite VRC

Reproducibility of scientific results

• Execute “standard” WFs for data analysis/modelling:
    • validating results
    • generate “standard” products (e.g. deformation maps) as mass products
    • training
• Testing algorithms and data, either modifying the WF to execute new analysis methods/models on the same dataset, or executing the original WF on different Supersites datasets
• Others tbd

Page 14: Aspects of Reproducibility in Earth Science

Some issues in reproducibility

• The VRC is not (yet) using formalized WFs. Their use, and the use of ROs, must be promoted through a simple, incremental approach.
• Data access may be tricky, since formats and metadata could depend on the Supersite.
• Some datasets (and most results) are not maintained by external sources and should be stored in the VRE (and exported as web services to the outside).
• WF reproducibility can be a problem, since WFs could use a mix of COTS and scientific SW, with licensing, HW compatibility, and computational resource issues. They do not use web processing services at present.
• WFs are rarely fully automated. Some may require considerable manual intervention. Others use a trial-and-error procedure: during repeated executions one could discard some data or choose different parameters.
• In general, some internal WF decisions may be based on expert judgment and should be documented.
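As an illustration of how expert-judgment decisions inside a partly manual WF might be documented, here is a minimal Python sketch. It is not part of the EVEREST tooling: the `Decision` and `RunLog` classes, the workflow name, and the example entries (`scene_012`, `coherence_threshold`) are all hypothetical and chosen only to show the idea of logging each discard or parameter change with its rationale.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json


@dataclass
class Decision:
    """One manual or expert-judgment choice made while running a workflow."""
    step: str       # workflow step the decision belongs to
    action: str     # what was done, e.g. "discard" or "set_parameter"
    target: str     # data item or parameter affected
    rationale: str  # free-text justification given by the scientist
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


@dataclass
class RunLog:
    """Ordered record of the decisions taken during one workflow execution."""
    workflow: str
    decisions: list = field(default_factory=list)

    def record(self, step, action, target, rationale):
        decision = Decision(step, action, target, rationale)
        self.decisions.append(decision)
        return decision

    def to_json(self):
        # asdict() recurses into the nested Decision dataclasses.
        return json.dumps(asdict(self), indent=2)


# Hypothetical trial-and-error session: one scene discarded, one parameter changed.
log = RunLog(workflow="insar-time-series")
log.record("image selection", "discard", "scene_012", "low coherence")
log.record("inversion", "set_parameter", "coherence_threshold=0.3",
           "previous trial gave noisy output")
```

A log of this kind could be serialized with `log.to_json()` and stored alongside the results, so that a later reader can see which data were dropped and why.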

Page 15: Aspects of Reproducibility in Earth Science

Research Object example

Page 16: Aspects of Reproducibility in Earth Science

RO example for the Supersite VRC

• Ground deformation mapping is a typical use case for this VRC.
• It may be carried out by different researchers on different volcanoes or even on the same volcano.
• It normally consists of two consecutive WFs:
    • the analysis of a multitemporal InSAR image dataset to calculate ground displacement time series
    • the validation of the results by comparison with other data or results.

RO for Volcano deformation mapping

Page 17: Aspects of Reproducibility in Earth Science

RO example for the Supersite VRC

• The main engine of the WF is the analysis SW (COTS): SarScape, which requires IDL.
• Other scientists may be more comfortable using other SW, or even using remote processing services (such as those provided by the GEP).
• Input data are normally accessed through remote web services: ESA Virtual Archive, Sentinel Hub, DLR Supersite portal, ASI Data Gateway.
• Validation data (GPS time series, previous deformation data, levelling data) are not always provided as a service.
• Output results must be placed in the VRC database, and exported as web services.
• They are subsequently used by other scientists during a consensus process to generate a final product for the end-users.

RO for Volcano deformation mapping
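To make the structure of such an RO more concrete, the following Python sketch models it as a simple aggregation of named resources, each with a role and an optional location. This is an illustrative assumption, not the RO model or the EVEREST implementation: the `Resource` and `ResearchObject` classes, the role names, and the `vre://` locations are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    role: str           # "workflow", "input", "validation" or "output"
    location: str = ""  # hypothetical locator; empty if not available as a service


@dataclass
class ResearchObject:
    title: str
    resources: list = field(default_factory=list)

    def add(self, name, role, location=""):
        self.resources.append(Resource(name, role, location))

    def by_role(self, role):
        return [r for r in self.resources if r.role == role]

    def unresolved(self):
        """Names of resources with no recorded location."""
        return [r.name for r in self.resources if not r.location]


ro = ResearchObject("Volcano deformation mapping")
# The two consecutive WFs of the use case.
ro.add("InSAR time series analysis", "workflow", "vre://workflows/insar-analysis")
ro.add("Validation of results", "workflow", "vre://workflows/validation")
# Input, validation and output resources (locations are made up).
ro.add("Multitemporal InSAR image dataset", "input", "vre://data/insar-stack")
ro.add("GPS time series", "validation")  # not provided as a service
ro.add("Ground displacement time series", "output", "vre://results/displacement")
```

Calling `ro.unresolved()` lists the resources that lack a location, which is one way to flag validation data that would have to be stored in the VRE before the RO can be re-executed by someone else.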