Reproducibility of Execution Environments in Computational Science Using Semantics and Clouds

Idafen Santana-Perez (a), Rafael Ferreira da Silva (b), Mats Rynge (b), Ewa Deelman (b), María S. Pérez-Hernández (a), Oscar Corcho (a)

(a) Ontology Engineering Group, Universidad Politécnica de Madrid, Madrid, Spain
(b) University of Southern California, Information Sciences Institute, Marina del Rey, CA, USA

    Abstract

In the past decades, one of the most common forms of addressing reproducibility in scientific workflow-based computational science has consisted of tracking the provenance of the produced and published results. Such provenance allows inspecting intermediate and final results, improves understanding, and permits replaying a workflow execution. Nevertheless, this approach does not provide any means for capturing and sharing the very valuable knowledge about the experimental equipment of a computational experiment, i.e., the execution environment in which the experiments are conducted. In this work, we propose a novel approach based on semantic vocabularies that describes the execution environment of scientific workflows, so as to conserve it. We define a process for documenting the workflow application and its related management system, as well as their dependencies. Then we apply this approach over three different real workflow applications running in three distinct scenarios, using public, private, and local Cloud platforms. In particular, we study one astronomy workflow and two life science workflows for genomic information analysis. Experimental results show that our approach can reproduce an equivalent execution environment of a predefined virtual machine image on all evaluated computing platforms.

    Keywords: Semantic Metadata, Scientific Workflow, Reproducibility, Life Sciences.

    1. Introduction

Reproducibility of results of scientific experiments is a cornerstone of the scientific method. Therefore, the scientific community has been encouraging researchers to publish their contributions in a verifiable and understandable way [1, 2]. In computational science, or in-silico science, reproducibility often requires that researchers make code and data publicly available so that the data can be analyzed in a similar manner as in the original work described in the publication. Code must be made available, and data must be accessible in a readable format [3].

Scientific workflows are a useful representation for managing the execution of large-scale computations. Many scientists now formulate their computational problems as scientific workflows running on distributed computing infrastructures such as campus Clusters, Clouds, and Grids [4]. Researchers in bioinformatics have embraced workflows for a whole range of analyses, including protein folding [5], DNA and RNA sequencing [6, 7], and disease-related research [8, 9], among others. The workflow representation not only facilitates the creation and management of the computation but also builds a foundation upon which results can be validated and shared.

Corresponding address: Ontology Engineering Group (OEG), Laboratorio de Inteligencia Artificial, Facultad de Informática, Universidad Politécnica de Madrid, Avda. Montepríncipe, s/n, Boadilla del Monte, 28660, Spain

Email addresses: isantana@fi.upm.es (Idafen Santana-Perez), rafsilva@isi.edu (Rafael Ferreira da Silva), rynge@isi.edu (Mats Rynge), deelman@isi.edu (Ewa Deelman), mperez@fi.upm.es (María S. Pérez-Hernández), ocorcho@fi.upm.es (Oscar Corcho)

Much research has focused on studying the reproducibility of scientific results in life sciences. Some studies clearly show the difficulties when trying to replicate experimental results in biology [10]. The Reproducibility Project: Cancer Biology [11] is an active project aiming to independently reproduce the experimental results of 50 high-impact cancer biology studies, evaluating the degree of reproducibility of those results and the main issues related to them.

Since workflows formally describe the sequence of computational and data management tasks, it is easy to trace the origin of the data produced. Many workflow systems capture provenance at runtime, which provides the lineage of data products and as such underpins scientific understanding and data reuse by providing the basis on which trust and understanding are built. A scientist would be able to look at the workflow and provenance data, retrace the steps, and arrive at the same data products. However, this information is not sufficient for achieving full reproducibility.

Reproducibility, replicability, and repeatability are often used as synonyms. Even though they pursue similar goals, there are several differences between them [12]. In this work we consider them as separate concepts. While replicability can be defined as a strict recreation of the original experiment, using the same method, data, and equipment, reproducibility implies that at least some changes have been introduced in the experiment, thus exposing different features. While it is the less restrictive term, reproducibility is a key concept in science, as it allows incremental research by modifying, improving, and repurposing the experimental methods and conditions.


In this work, we address the reproducibility of the execution environment for a scientific workflow, as we do not aim to obtain an exact incarnation of the original setup, but rather an environment that is able to support the required capabilities exposed by the former environment. In order to reproduce or replicate any digital artifact we need to properly handle its conservation. According to [13], achieving conservation requires guaranteeing that sufficient information exists to understand, evaluate, and build upon a prior work, such that a third party could replicate the results without any additional information from the author. Hence, we address workflow conservation in order to attain its reproducibility.

In [14], the authors explain the problems they faced when they tried to reproduce an experiment [15] for mapping all putative FDA and European drugs to protein receptors within the scope of a given proteome. For each identified problem, they enumerate a set of suggestions for addressing the related issues. In four out of the six cases, execution environment problems are mentioned.

Currently, most of the approaches in computational science conservation, in particular for scientific workflow executions, have been focused on data, code, and the workflow description, but not on the underlying infrastructure, which is composed of a set of computational resources (e.g., execution nodes, storage devices, and networking) and software components. We identify two approaches for conserving the environment of an experiment: 1) physical conservation, where the real object is conserved due to its relevance and the difficulty in obtaining a counterpart; and 2) logical conservation, where objects are described in such a way that an equivalent one can be obtained in a future experiment.

The computational environment is often conserved by using the physical approach, where computational resources are made available to scientists over a sustained period of time. As a result, scientists are able to reproduce their experiments in the same environment. However, such infrastructures demand huge maintenance efforts, and there is no guarantee that they will not change or suffer from a natural decay process [16]. Furthermore, the infrastructure may be subject to organization policies, which restrict its access to a select group of scientists, thereby limiting reproducibility to this restricted group. On the other hand, data, code, and workflow descriptions can be conserved by using a logical approach, which is not subject to natural decay processes.

Accordingly, we propose a logical-oriented approach to conserve computational environments, in which the capabilities of the resources (virtual machines (VMs)) are described. From this description, any scientist interested in reproducing an experiment will be able to reconstruct the former infrastructure (or an equivalent one) on any Cloud computing infrastructure (either private or public). One may argue that it would be easier to keep and share VM images (or containers) with the research community through a common repository; however, the high storage demand of VM images (in particular for Big Data applications) may be a challenging problem [17, 18].

Inspired by the aforementioned ideas presented in [13], we aim to define means for authors to share the relevant information about the execution environment of a given scientific workflow. We argue that by explicitly describing this knowledge we increase the degree of reproducibility of the environment and of the workflow.

Semantics have been proposed as a way of attaining curation and conservation of the digital assets related to scientific experiments (e.g., biomedical research [19]). Our approach uses semantically annotated workflow descriptions to generate lightweight scripts for an experiment management API that can reconstruct the required infrastructure. We propose to describe the resources involved in the execution of the experiment using a set of semantic vocabularies, and to use those descriptions to define the infrastructure specification. This specification can then be used to derive the set of instructions that can be executed to obtain a new equivalent infrastructure.
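As an informal illustration of this derivation step, the following minimal sketch parses a hypothetical semantic description of a compute resource and its software dependencies, and extracts the corresponding installation commands. The vocabulary terms (ex:ComputeResource, ex:requiresSoftware, ex:hasInstallCommand), the resource names, and the package commands are illustrative placeholders, not the actual vocabularies or scripts used in this work.

    # Illustrative sketch only: placeholder vocabulary and commands,
    # not the semantic models or tooling described in this paper.
    from rdflib import Graph

    DESCRIPTION = """
    @prefix ex: <http://example.org/environment#> .

    ex:WorkerNode a ex:ComputeResource ;
        ex:requiresSoftware ex:WorkflowEngine , ex:AnalysisTool .

    ex:WorkflowEngine ex:hasInstallCommand "yum install -y workflow-engine" .
    ex:AnalysisTool   ex:hasInstallCommand "yum install -y analysis-tool" .
    """

    graph = Graph()
    graph.parse(data=DESCRIPTION, format="turtle")

    # Select every install command attached to a software component
    # required by the described compute resource.
    QUERY = """
    PREFIX ex: <http://example.org/environment#>
    SELECT ?cmd WHERE {
        ?node a ex:ComputeResource ;
              ex:requiresSoftware ?sw .
        ?sw ex:hasInstallCommand ?cmd .
    }
    """

    # The derived list of instructions could then be handed to an experiment
    # management API and executed on a freshly provisioned VM.
    provisioning_script = [str(row.cmd) for row in graph.query(QUERY)]
    print("\n".join(provisioning_script))

In a full implementation, the generated instructions would be executed by the experiment management tool on the newly provisioned resources to recreate an equivalent execution environment.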

We conduct a practical experimentation process in which we describe a set of workflows and their environments using a set of semantic models. Then, we use an experiment management tool to reproduce a workflow execution on different Cloud platforms. In this work, we study three real and representative production workflows, which have been disseminated in several scientific