YesWorkflow: Yes, Scripts can be Workflows, Too!

18
YesWorkflow: Yes! Scripts can be Workflows, too! Bertram Ludäscher Graduate School of Library and Informa6on Science (GSLIS) Na6onal Center for Supercompu6ng Applica6ons (NCSA) Department of Computer Science (CS@UIUC) 24. – 29. Januar 2016, Dagstuhl Seminar 16041 Reproducibility of DataOriented Experiments in eScience

Transcript of YesWorkflow: Yes, Scripts can be Workflows, Too!

Page 1: YesWorkflow: Yes, Scripts can be Workflows, Too!

YesWorkflow:  Yes!  Scripts  can  be  Workflows,  too!  

Bertram  Ludäscher      

Graduate  School  of  Library  and  Informa6on  Science  (GSLIS)  Na6onal  Center  for  Supercompu6ng  Applica6ons  (NCSA)  

Department  of  Computer  Science  (CS@UIUC)  

24.  –  29.  Januar  2016,  Dagstuhl  Seminar  16041  Reproducibility  of  Data-­‐Oriented  Experiments  in  e-­‐Science  

Page 2: YesWorkflow: Yes, Scripts can be Workflows, Too!

YesWorkflow  ~  noWorkflow  •  “not  only  Workflow”  

–  AutomaPcally  capture  runPme/retrospec(ve  provenance  of  (Python)  scripts  

•  YesWorkflow  (“YW  1.0”)  –  Yes,  (some)  scripts  are  workflows,  too!  –  Expose  prospec(ve  provenance  (=  

workflow)  hidden  in  the  script  via  simple  user  annotaPon  

•  Combining  NW  +  YW  =>  more  provenance  mileage  (TaPP’15)  -­‐  Shared  vision:    –  focus  on  provenance  from/for  scripts    

•  “YW  2.0”  –  Focus  on  queries  and  views  that  combine  

many  sources  of  provenance  (recorded,  logged:  prov  engineer;  reconstructed  (prov  sleuth);  ...)    

=>  follow-­‐up  study  for  NW  +  YW  !    

B.  Ludäscher                                                                                Dagstuhl  Seminar  #16041:  Reproducibility  of  Data-­‐Oriented  Experiments  in  e-­‐Science   2  

Page 3: YesWorkflow: Yes, Scripts can be Workflows, Too!

Reproducibility:  (yesterday’s  discussion  cont’d)    -­‐  What  ques(ons  should  we  ask?    -­‐  What  queries  should  we  enable?  

•  Cui  bono?  (others,  publishers,  …  ?)  •  Provenance-­‐for-­‐Self                vs  Provenance-­‐for-­‐others  •  Reproducibility-­‐for-­‐Self    vs  Reproducibility-­‐for-­‐others  

•  For  key  terms,  e.g.,  Carole’s    –  …  rerun,  repeat,  replicate,  reproduce,  reuse,  …    

•  ..  ask  “what  informa(on/insight  do  I  gain  from  reproducing,  repea(ng,  replica(ng…  ?”    

•  What  is  fixed  and  what  does  the  study  vary?    =>  Research  ObjecPve,  Method/Algorithm,  ImplementaPon,  Plagorm/Environment,  Actors/People,  input  Data  (params,  raw  data)  

B.  Ludäscher                                                                                Dagstuhl  Seminar  #16041:  Reproducibility  of  Data-­‐Oriented  Experiments  in  e-­‐Science   3  

Page 4: YesWorkflow: Yes, Scripts can be Workflows, Too!

Proposal:  OMIPAD  =>  PRIMAD  ?  (yesterday’s  discussion  cont’d)  

•  OMIPAD    –   “Oh  my  Pad  …  ”  (or:  grandma’s  pad  in  German)  

•  PRIMAD  (Pronounced:  “primed”)  – ObjecPve  à  Research  ObjecPve  – Allows  to  say  things  like:  

“What  have  you  primed?”    (What  have  you  changed?  What  have  you  kept  the  same?  Which  X  have  you  changed  to  X’  ?)  

–  In  Spanish:  “primar”  •  primar [primando|primado] {vb} (also: imperar,

predominar, prevalecer, imponerse) •  to prevail [prevailed|prevailed] {vb}    B.  Ludäscher                                                                                Dagstuhl  Seminar  #16041:  Reproducibility  of  Data-­‐Oriented  Experiments  in  e-­‐Science   4  

Page 5: YesWorkflow: Yes, Scripts can be Workflows, Too!

Science  Example:  Paleoclimate  ReconstrucJon  

B.  Ludäscher                                                                                Dagstuhl  Seminar  #16041:  Reproducibility  of  Data-­‐Oriented  Experiments  in  e-­‐Science   5  

•  Kohler  &  Bocinsky:  study  rain-­‐fed  maize  of  Ancestral  Pueblo,  Anasazi    –  Four  Corners;  AD  600–1500  –  Climate  change  influenced  Mesa  Verde  MigraJons;  late  13th  century  AD.  –  Uses  network  of  tree-­‐ring  chronologies  to  reconstruct  a  spaJo-­‐temporal  

climate  field  at  a  fairly  high  resoluPon  (~800  m)  from  AD  1–2000    –  Algorithm  esPmates  joint  informaPon  in  tree-­‐rings  and  a  climate  signal  to  

idenPfy  “best”    tree-­‐ring  chronologies  for  reconstrucPng  climate  at  a  given  Pme  and  place.  

K.  Bocinsky,  T.  Kohler,  A  2000-­‐year  reconstrucPon  of  the  rain-­‐fed  maize  agricultural  niche  in  the  US  Southwest.  Nature  

Communica(ons.  doi:10.1038/ncomms6618    

… implemented as an R Script …

Page 6: YesWorkflow: Yes, Scripts can be Workflows, Too!

B.  Ludäscher                                                                                Dagstuhl  Seminar  #16041:  Reproducibility  of  Data-­‐Oriented  Experiments  in  e-­‐Science   6  

K.  Bocinsky,  T.  Kohler,  A  2000-­‐year  reconstrucPon  of  the  rain-­‐fed  maize  agricultural  niche  in  the  US  Southwest.  Nature  Communica(ons.  doi:10.1038/ncomms6618    

…  Paleoclimate    ReconstrucJon  …    

Map  showing  the  "selected"  trees  for  reconstruc6ng  precipita6on  at  four  sites  in  our  CAR  regression  approach  (Correla6on-­‐Adjusted  corRela6on).  

Reconstruc6ons  for  AD  1247    

Page 7: YesWorkflow: Yes, Scripts can be Workflows, Too!

YesWorkflow  =  Scripts  +  Comments  

•  Scripts  can  be  hard  to  digest,  communicate  •  Idea:    – Add  structured  comments  (cf.  JavaDoc)  =>    reveal  workflow  structure  and  dataflow  

 =>    obtain  some  scienPfic  workflow  benefits  •   …  ASAP  …    

B.  Ludäscher                                                                                Dagstuhl  Seminar  #16041:  Reproducibility  of  Data-­‐Oriented  Experiments  in  e-­‐Science   7  

Page 8: YesWorkflow: Yes, Scripts can be Workflows, Too!

User  Comments:  YW  @AnnotaJons  

B.  Ludäscher                                                                                Dagstuhl  Seminar  #16041:  Reproducibility  of  Data-­‐Oriented  Experiments  in  e-­‐Science   8  

@begin GO_Analysis @in hgCutoff @in … @out BP_Summl_file @out … @end GO_Analysis

...  

Page 9: YesWorkflow: Yes, Scripts can be Workflows, Too!

Get  3  views  for  the  price  of  1!  

B.  Ludäscher                                                                                Dagstuhl  Seminar  #16041:  Reproducibility  of  Data-­‐Oriented  Experiments  in  e-­‐Science   9  

Process  view  

Data  view  

Combined  view  

Page 10: YesWorkflow: Yes, Scripts can be Workflows, Too!

Paleoclimate  ReconstrucJon  …      

B.  Ludäscher                                                                                Dagstuhl  Seminar  #16041:  Reproducibility  of  Data-­‐Oriented  Experiments  in  e-­‐Science   10  

GetModernClimate

PRISM_annual_growing_season_precipitation

SubsetAllData

dendro_series_for_calibration

dendro_series_for_reconstruction CAR_Analysis_unique

cellwise_unique_selected_linear_models

CAR_Analysis_union

cellwise_union_selected_linear_models

CAR_Reconstruction_union

raster_brick_spatial_reconstruction raster_brick_spatial_reconstruction_errors

CAR_Reconstruction_union_output

ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif

master_data_directory prism_directory

tree_ring_datacalibration_years retrodiction_years•  …  explained  using  YesWorkflow  

Kyle  B.,  (computaPonal)  archeologist:    "It  took  me  about  20  minutes  to  comment.  Less  than  an  hour  to  learn  and  YW-­‐annotate,  all-­‐told."  

Page 11: YesWorkflow: Yes, Scripts can be Workflows, Too!

User  Comments:  YW  @AnnotaJons  

B.  Ludäscher                                                                                Dagstuhl  Seminar  #16041:  Reproducibility  of  Data-­‐Oriented  Experiments  in  e-­‐Science   11  

@begin GO_Analysis @in hgCutoff @in … @out BP_Summl_file @out … @end GO_Analysis

...  

Add  OMIPAD/PRIMAD  declara6ons!?  Develop/propose    a  “standard”  to  say:    What  have  you  primed?    @P  lagorm  @R  esearchObjecPve  @I  mplementaPon  @A  ctor  @M  ethod    @D  ataInputs  

Now  that  we  all  love  Janiform  PDbF,  how  about  adding  some  METADATA  right  there  into  the  PDbF  as  well?    

Page 12: YesWorkflow: Yes, Scripts can be Workflows, Too!

YW  Demo!  

B.  Ludäscher                                                                                Dagstuhl  Seminar  #16041:  Reproducibility  of  Data-­‐Oriented  Experiments  in  e-­‐Science   12  

Page 13: YesWorkflow: Yes, Scripts can be Workflows, Too!

YesWorkflow  Architecture:  KISS!  •  YW-­‐Extract  – …  structured  comments    

•  YW-­‐Model  – Program  Block,  Workflow  – Port  (data,  parameters)  – Channels  (dataflow)  

 •  YW-­‐Graph  –  ...  using  GraphViz/DOT  files  

•  YW-­‐Query,  YW-­‐Validate,  YW-­‐CLI    B.  Ludäscher                                                                                Dagstuhl  Seminar  #16041:  Reproducibility  of  Data-­‐Oriented  Experiments  in  e-­‐Science   13  

hsp://YesWorkflow.org    

Page 14: YesWorkflow: Yes, Scripts can be Workflows, Too!

Figure 4: Process workflow view of an A↵ymetrix analysis script (in R).

4 YesWorkflow Examples

In the following we show YesWorkflow views extracted from real-world scientific use cases.The scripts were annoted with YW tags by scientists and script authors, using a verymodest training and mark-up e↵ort.1 Due to lack of space, the actual MATLAB and R

scripts with their YW markup are not included here. However, they are all availablefrom the yw-idcc-15 repository on the YW GitHub site [Yes15].

4.1 Analysis of Gene Expression Microarray Data

Bioinformatics workflows commonly possess a pattern of large numbers of incoming pa-rameters and outputs at each stage of computation. In addition, analysis of even asingle bioinformatics dataset tends to yield a large number of di↵erent output files.Hence, bioinformatics pipelines are attractive candidates for workflow systems, whichcan capture this complexity [Bie12]. Figure 4 shows a YesWorkflow representation ofan R script performing a classic, complex bioinformatics task: analysis of A↵ymetrixgene expression microarray data. This R script was modeled on our previous work-flows developed in the Kepler environment [SMLB12]. The script analyzes experimentdesigns consisting of two conditions (e.g., microarrays from control-treated cells vs mi-croarrays from drug-treated cells) with multiple replicates in each condition. The R

script employs a set of standard BioConductor [GCB+04] packages mixed with customprogramming. The workflow consists of four fundamental tasks: normalization of dataacross microarray datasets (Normalize), selection of di↵erentially expressed genes (DEGs)between conditions (SelectDEGs), determination of gene ontology (GO) statistics for theresulting datasets (GO Analysis), and creation of a heatmap of the di↵erentially ex-pressed genes (MakeHeatmap). Each module produces outputs, and each module (asidefrom MakeHeatmap) requires external parameter inputs. Importantly, this graphical rep-resentation clearly indicates the dependence of each module on datasets and parameterinputs. This example demonstrates that YesWorkflow can provide informative visualiza-tions of bioinformatics workflows, especially workflows involving large numbers of inputsand outputs.

1For all of these scripts, learning the YW model and annotating the scripts was done in a few hours.

6

Gene  Expression  Microarray  Data  Analysis  

•  [Normalize]    –  NormalizaPon  of  data  across  microarray  datasets    

•  [SelectDEGs]    –  SelecPon  of  differenPally  expressed  genes  between  condiPons    

•  [GO  Analysis]    –  determinaPon  of  gene  ontology  staPsPcs  for  the  resulPng  datasets    

•  [MakeHeatmap]    –  creaPon  of  a  heatmap  of  the  differenPally  expressed  genes.    

B.  Ludäscher                                                                                Dagstuhl  Seminar  #16041:  Reproducibility  of  Data-­‐Oriented  Experiments  in  e-­‐Science   14  

Tyler  Kolisnik,  Mark  Bieda  

Page 15: YesWorkflow: Yes, Scripts can be Workflows, Too!

MulJ-­‐Scale  Synthesis  and  Terrestrial  Model  Intercomparison  Project  (MsTMIP)  

B.  Ludäscher                                                                                Dagstuhl  Seminar  #16041:  Reproducibility  of  Data-­‐Oriented  Experiments  in  e-­‐Science   15  

fetch_drought_variable

drought_variable_1

fetch_effect_variable

effect_variable_1

convert_effect_variable_units

effect_variable_2

create_land_water_mask

land_water_mask

init_data_variables

predrought_effect_variable_1 drought_value_variable_1 recovery_time_variable_1 drought_number_variable_1

define_droughts

sigma_dv_event month_dv_length

detrend_deseasonalize_effect_variable

effect_variable_3

calculate_data_variables

recovery_time_variable_2 drought_value_variable_2 predrought_effect_variable_2 drought_number_variable_2

export_recovery_time_figure

output_recovery_time_figure

export_drought_value_variable_figure

output_drought_value_variable_figure

export_predrought_effect_variable_figure

output_predrought_effect_variable_figure

export_drought_number_variable_figure

output_drought_number_figure

input_drough_variable

input_effect_variable

Christopher  Schwalm,  Yaxing  Wei  

Page 16: YesWorkflow: Yes, Scripts can be Workflows, Too!

Summary:  ScienJfic  Workflows  ScienJfic  Workflows  •  [+]  AutomaPon  •  [+]  Scalability  •  [+]  AbstracPon  •  [+]  Provenance  •  …  •  [+/0]  Easy  to  use    

–  [0]  learning  a  new  paradigm  •  [-­‐]  Teaching  resources    •  [-­‐]  Special  experPse  needed  for  deep  changes  

 e.g.  new  Java  actors,  shims,  …  

B.  Ludäscher                                                                                Dagstuhl  Seminar  #16041:  Reproducibility  of  Data-­‐Oriented  Experiments  in  e-­‐Science   16  

Page 17: YesWorkflow: Yes, Scripts can be Workflows, Too!

Scripts  +  @YW  à  Workflows  Scripts:  [+]  AutomaPon,  [0]  Scalability,    [-­‐]  AbstracPon,  [0/-­‐]  Provenance  

Now:  Scripts  +  YesWorkflow  AnnotaJons  •  [+]  AbstracJon  

–  explain  your  methods  to  mere  mortals  =>  encourage  (re-­‐)use  

•  [+]  Provenance:  –  noWorkflow  (retrospec(ve  provenance)  –  YesWorkflow  (prospec(ve  provenance)  

•  [+]  Language  independent  (R,  Matlab,  Python,  …)    •  [+]  Empower  tool  makers  (script  programmers):  give  them  …  

–  …  some  immediate  benefits  (workflow  views)  –  …  some  medium  term  improvements  (provenance  integraJon)  –  …  some  long  term  benefits:  think  about  your  methods  differently    =>  dataflow  programming  =>  [+]  Scalability    

B.  Ludäscher                                                                                Dagstuhl  Seminar  #16041:  Reproducibility  of  Data-­‐Oriented  Experiments  in  e-­‐Science   17  

Page 18: YesWorkflow: Yes, Scripts can be Workflows, Too!

References  

•  hsp://yesworkflow.org      •  T.  McPhillips,  S.  Bowers,  K.  Belhajjame,  B.  Ludäscher  (2015).  

RetrospecJve  Provenance  Without  a  RunJme  Provenance  Recorder.  7th  USENIX  Workshop  on  the  Theory  and  Prac6ce  of  Provenance  (TaPP'15).    

 •  T.  McPhillips,  T.  Song,  T.  Kolisnik,  S.  Aulenbach,  K.  Belhajjame,  R.K.  

Bocinsky,  Y.  Cao,  J.  Cheney,  F.  ChirigaP,  S.  Dey,  J.  Freire,  C.  Jones,  J.  Hanken,  K.W.  KinPgh,  T.A.  Kohler,  D.  Koop,  J.A.  Macklin,  P.  Missier,  M.  Schildhauer,  C.  Schwalm,  Y.  Wei,  M.  Bieda,  B.  Ludäscher  (2015).  YesWorkflow:  A  User-­‐Oriented,  Language-­‐Independent  Tool  for  Recovering  Workflow  InformaJon  from  Scripts.  Interna6onal  Journal  of  Digital  Cura6on  10,  298-­‐313.  

B.  Ludäscher                                                                                Dagstuhl  Seminar  #16041:  Reproducibility  of  Data-­‐Oriented  Experiments  in  e-­‐Science   18