YesWorkflow: Yes, Scripts can be Workflows, Too!
-
Upload
bertram-ludaescher -
Category
Technology
-
view
292 -
download
0
Transcript of YesWorkflow: Yes, Scripts can be Workflows, Too!
YesWorkflow: Yes! Scripts can be Workflows, too!
Bertram Ludäscher
Graduate School of Library and Informa6on Science (GSLIS) Na6onal Center for Supercompu6ng Applica6ons (NCSA)
Department of Computer Science (CS@UIUC)
24. – 29. Januar 2016, Dagstuhl Seminar 16041 Reproducibility of Data-‐Oriented Experiments in e-‐Science
YesWorkflow ~ noWorkflow • “not only Workflow”
– AutomaPcally capture runPme/retrospec(ve provenance of (Python) scripts
• YesWorkflow (“YW 1.0”) – Yes, (some) scripts are workflows, too! – Expose prospec(ve provenance (=
workflow) hidden in the script via simple user annotaPon
• Combining NW + YW => more provenance mileage (TaPP’15) -‐ Shared vision: – focus on provenance from/for scripts
• “YW 2.0” – Focus on queries and views that combine
many sources of provenance (recorded, logged: prov engineer; reconstructed (prov sleuth); ...)
=> follow-‐up study for NW + YW !
B. Ludäscher Dagstuhl Seminar #16041: Reproducibility of Data-‐Oriented Experiments in e-‐Science 2
Reproducibility: (yesterday’s discussion cont’d) -‐ What ques(ons should we ask? -‐ What queries should we enable?
• Cui bono? (others, publishers, … ?) • Provenance-‐for-‐Self vs Provenance-‐for-‐others • Reproducibility-‐for-‐Self vs Reproducibility-‐for-‐others
• For key terms, e.g., Carole’s – … rerun, repeat, replicate, reproduce, reuse, …
• .. ask “what informa(on/insight do I gain from reproducing, repea(ng, replica(ng… ?”
• What is fixed and what does the study vary? => Research ObjecPve, Method/Algorithm, ImplementaPon, Plagorm/Environment, Actors/People, input Data (params, raw data)
B. Ludäscher Dagstuhl Seminar #16041: Reproducibility of Data-‐Oriented Experiments in e-‐Science 3
Proposal: OMIPAD => PRIMAD ? (yesterday’s discussion cont’d)
• OMIPAD – “Oh my Pad … ” (or: grandma’s pad in German)
• PRIMAD (Pronounced: “primed”) – ObjecPve à Research ObjecPve – Allows to say things like:
“What have you primed?” (What have you changed? What have you kept the same? Which X have you changed to X’ ?)
– In Spanish: “primar” • primar [primando|primado] {vb} (also: imperar,
predominar, prevalecer, imponerse) • to prevail [prevailed|prevailed] {vb} B. Ludäscher Dagstuhl Seminar #16041: Reproducibility of Data-‐Oriented Experiments in e-‐Science 4
Science Example: Paleoclimate ReconstrucJon
B. Ludäscher Dagstuhl Seminar #16041: Reproducibility of Data-‐Oriented Experiments in e-‐Science 5
• Kohler & Bocinsky: study rain-‐fed maize of Ancestral Pueblo, Anasazi – Four Corners; AD 600–1500 – Climate change influenced Mesa Verde MigraJons; late 13th century AD. – Uses network of tree-‐ring chronologies to reconstruct a spaJo-‐temporal
climate field at a fairly high resoluPon (~800 m) from AD 1–2000 – Algorithm esPmates joint informaPon in tree-‐rings and a climate signal to
idenPfy “best” tree-‐ring chronologies for reconstrucPng climate at a given Pme and place.
K. Bocinsky, T. Kohler, A 2000-‐year reconstrucPon of the rain-‐fed maize agricultural niche in the US Southwest. Nature
Communica(ons. doi:10.1038/ncomms6618
… implemented as an R Script …
B. Ludäscher Dagstuhl Seminar #16041: Reproducibility of Data-‐Oriented Experiments in e-‐Science 6
K. Bocinsky, T. Kohler, A 2000-‐year reconstrucPon of the rain-‐fed maize agricultural niche in the US Southwest. Nature Communica(ons. doi:10.1038/ncomms6618
… Paleoclimate ReconstrucJon …
Map showing the "selected" trees for reconstruc6ng precipita6on at four sites in our CAR regression approach (Correla6on-‐Adjusted corRela6on).
Reconstruc6ons for AD 1247
YesWorkflow = Scripts + Comments
• Scripts can be hard to digest, communicate • Idea: – Add structured comments (cf. JavaDoc) => reveal workflow structure and dataflow
=> obtain some scienPfic workflow benefits • … ASAP …
B. Ludäscher Dagstuhl Seminar #16041: Reproducibility of Data-‐Oriented Experiments in e-‐Science 7
User Comments: YW @AnnotaJons
B. Ludäscher Dagstuhl Seminar #16041: Reproducibility of Data-‐Oriented Experiments in e-‐Science 8
@begin GO_Analysis @in hgCutoff @in … @out BP_Summl_file @out … @end GO_Analysis
...
Get 3 views for the price of 1!
B. Ludäscher Dagstuhl Seminar #16041: Reproducibility of Data-‐Oriented Experiments in e-‐Science 9
Process view
Data view
Combined view
Paleoclimate ReconstrucJon …
B. Ludäscher Dagstuhl Seminar #16041: Reproducibility of Data-‐Oriented Experiments in e-‐Science 10
GetModernClimate
PRISM_annual_growing_season_precipitation
SubsetAllData
dendro_series_for_calibration
dendro_series_for_reconstruction CAR_Analysis_unique
cellwise_unique_selected_linear_models
CAR_Analysis_union
cellwise_union_selected_linear_models
CAR_Reconstruction_union
raster_brick_spatial_reconstruction raster_brick_spatial_reconstruction_errors
CAR_Reconstruction_union_output
ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif
master_data_directory prism_directory
tree_ring_datacalibration_years retrodiction_years• … explained using YesWorkflow
Kyle B., (computaPonal) archeologist: "It took me about 20 minutes to comment. Less than an hour to learn and YW-‐annotate, all-‐told."
User Comments: YW @AnnotaJons
B. Ludäscher Dagstuhl Seminar #16041: Reproducibility of Data-‐Oriented Experiments in e-‐Science 11
@begin GO_Analysis @in hgCutoff @in … @out BP_Summl_file @out … @end GO_Analysis
...
Add OMIPAD/PRIMAD declara6ons!? Develop/propose a “standard” to say: What have you primed? @P lagorm @R esearchObjecPve @I mplementaPon @A ctor @M ethod @D ataInputs
Now that we all love Janiform PDbF, how about adding some METADATA right there into the PDbF as well?
YW Demo!
B. Ludäscher Dagstuhl Seminar #16041: Reproducibility of Data-‐Oriented Experiments in e-‐Science 12
YesWorkflow Architecture: KISS! • YW-‐Extract – … structured comments
• YW-‐Model – Program Block, Workflow – Port (data, parameters) – Channels (dataflow)
• YW-‐Graph – ... using GraphViz/DOT files
• YW-‐Query, YW-‐Validate, YW-‐CLI B. Ludäscher Dagstuhl Seminar #16041: Reproducibility of Data-‐Oriented Experiments in e-‐Science 13
hsp://YesWorkflow.org
Figure 4: Process workflow view of an A↵ymetrix analysis script (in R).
4 YesWorkflow Examples
In the following we show YesWorkflow views extracted from real-world scientific use cases.The scripts were annoted with YW tags by scientists and script authors, using a verymodest training and mark-up e↵ort.1 Due to lack of space, the actual MATLAB and R
scripts with their YW markup are not included here. However, they are all availablefrom the yw-idcc-15 repository on the YW GitHub site [Yes15].
4.1 Analysis of Gene Expression Microarray Data
Bioinformatics workflows commonly possess a pattern of large numbers of incoming pa-rameters and outputs at each stage of computation. In addition, analysis of even asingle bioinformatics dataset tends to yield a large number of di↵erent output files.Hence, bioinformatics pipelines are attractive candidates for workflow systems, whichcan capture this complexity [Bie12]. Figure 4 shows a YesWorkflow representation ofan R script performing a classic, complex bioinformatics task: analysis of A↵ymetrixgene expression microarray data. This R script was modeled on our previous work-flows developed in the Kepler environment [SMLB12]. The script analyzes experimentdesigns consisting of two conditions (e.g., microarrays from control-treated cells vs mi-croarrays from drug-treated cells) with multiple replicates in each condition. The R
script employs a set of standard BioConductor [GCB+04] packages mixed with customprogramming. The workflow consists of four fundamental tasks: normalization of dataacross microarray datasets (Normalize), selection of di↵erentially expressed genes (DEGs)between conditions (SelectDEGs), determination of gene ontology (GO) statistics for theresulting datasets (GO Analysis), and creation of a heatmap of the di↵erentially ex-pressed genes (MakeHeatmap). Each module produces outputs, and each module (asidefrom MakeHeatmap) requires external parameter inputs. Importantly, this graphical rep-resentation clearly indicates the dependence of each module on datasets and parameterinputs. This example demonstrates that YesWorkflow can provide informative visualiza-tions of bioinformatics workflows, especially workflows involving large numbers of inputsand outputs.
1For all of these scripts, learning the YW model and annotating the scripts was done in a few hours.
6
Gene Expression Microarray Data Analysis
• [Normalize] – NormalizaPon of data across microarray datasets
• [SelectDEGs] – SelecPon of differenPally expressed genes between condiPons
• [GO Analysis] – determinaPon of gene ontology staPsPcs for the resulPng datasets
• [MakeHeatmap] – creaPon of a heatmap of the differenPally expressed genes.
B. Ludäscher Dagstuhl Seminar #16041: Reproducibility of Data-‐Oriented Experiments in e-‐Science 14
Tyler Kolisnik, Mark Bieda
MulJ-‐Scale Synthesis and Terrestrial Model Intercomparison Project (MsTMIP)
B. Ludäscher Dagstuhl Seminar #16041: Reproducibility of Data-‐Oriented Experiments in e-‐Science 15
fetch_drought_variable
drought_variable_1
fetch_effect_variable
effect_variable_1
convert_effect_variable_units
effect_variable_2
create_land_water_mask
land_water_mask
init_data_variables
predrought_effect_variable_1 drought_value_variable_1 recovery_time_variable_1 drought_number_variable_1
define_droughts
sigma_dv_event month_dv_length
detrend_deseasonalize_effect_variable
effect_variable_3
calculate_data_variables
recovery_time_variable_2 drought_value_variable_2 predrought_effect_variable_2 drought_number_variable_2
export_recovery_time_figure
output_recovery_time_figure
export_drought_value_variable_figure
output_drought_value_variable_figure
export_predrought_effect_variable_figure
output_predrought_effect_variable_figure
export_drought_number_variable_figure
output_drought_number_figure
input_drough_variable
input_effect_variable
Christopher Schwalm, Yaxing Wei
Summary: ScienJfic Workflows ScienJfic Workflows • [+] AutomaPon • [+] Scalability • [+] AbstracPon • [+] Provenance • … • [+/0] Easy to use
– [0] learning a new paradigm • [-‐] Teaching resources • [-‐] Special experPse needed for deep changes
e.g. new Java actors, shims, …
B. Ludäscher Dagstuhl Seminar #16041: Reproducibility of Data-‐Oriented Experiments in e-‐Science 16
Scripts + @YW à Workflows Scripts: [+] AutomaPon, [0] Scalability, [-‐] AbstracPon, [0/-‐] Provenance
Now: Scripts + YesWorkflow AnnotaJons • [+] AbstracJon
– explain your methods to mere mortals => encourage (re-‐)use
• [+] Provenance: – noWorkflow (retrospec(ve provenance) – YesWorkflow (prospec(ve provenance)
• [+] Language independent (R, Matlab, Python, …) • [+] Empower tool makers (script programmers): give them …
– … some immediate benefits (workflow views) – … some medium term improvements (provenance integraJon) – … some long term benefits: think about your methods differently => dataflow programming => [+] Scalability
B. Ludäscher Dagstuhl Seminar #16041: Reproducibility of Data-‐Oriented Experiments in e-‐Science 17
References
• hsp://yesworkflow.org • T. McPhillips, S. Bowers, K. Belhajjame, B. Ludäscher (2015).
RetrospecJve Provenance Without a RunJme Provenance Recorder. 7th USENIX Workshop on the Theory and Prac6ce of Provenance (TaPP'15).
• T. McPhillips, T. Song, T. Kolisnik, S. Aulenbach, K. Belhajjame, R.K.
Bocinsky, Y. Cao, J. Cheney, F. ChirigaP, S. Dey, J. Freire, C. Jones, J. Hanken, K.W. KinPgh, T.A. Kohler, D. Koop, J.A. Macklin, P. Missier, M. Schildhauer, C. Schwalm, Y. Wei, M. Bieda, B. Ludäscher (2015). YesWorkflow: A User-‐Oriented, Language-‐Independent Tool for Recovering Workflow InformaJon from Scripts. Interna6onal Journal of Digital Cura6on 10, 298-‐313.
B. Ludäscher Dagstuhl Seminar #16041: Reproducibility of Data-‐Oriented Experiments in e-‐Science 18