Workflow management: motivation and vision
Ela Hunt, SyBIT
Plan
Overview of existing workflows
Gains to be achieved via workflows
Methodological assumptions: how to support and construct workflows with less effort and more effectively
Three areas of workflow use:
Deep sequencing
High content screening
Proteomics
Future: workflows combining those three methodologies, possibly including metabolomics, NMR, etc.
Deep sequencing
Management of reads (images) coming off the microscopy devices
Processing of images into sequence files
Alignment to a genome or genome assembly from short reads
Annotation with data from external sources
Candidate gene/drug target identification
Deep sequencing workflow status (Lausanne)
[Diagram. Numbered boxes: 1a. Web – sample metadata capture; 1b. Illumina sequencing; 2. File server; 3. Web-browse; 4. Submit analysis pipeline. Possible extensions: 6. DAS server; 7. MicrobeBrowser; 8. AssociationViewer. Other labels: Perl, sequence analysis, metadata, sequence data.]
Deep sequencing workflow status
Lausanne – alignment via Eland (Emmanuel Beaudoing, Sylvain Pradervand)
Basel – under construction (Manuel Kohler)
Zurich – FGCZ – under construction (Remy Bruggmann)
Proteomics workflows
MS spectra
Mapping to proteins (merging output from various analysis programs)
Annotation with additional data
ETHZ – Perl scripts and KNIME (Andreas Quandt)
Lausanne, Geneva, Basel (?)
ETHZ proteomics example (drawn in KNIME by Andreas Quandt)
Screening workflows
Microscopy, image transfer, compression
Matlab scripts (light intensity adjustment, feature recognition, etc.) that identify features and write feature counts to a DB or files
Statistics and chart generation, sometimes including a user interface showing images (also for training), using KNIME, R, Matlab, etc.
Screening workflows
Lausanne – Petr Strnad's workflows in KNIME, Matlab, MySQL
iBRAIN, developed by Berend Snijder – an end-to-end solution with a GUI (shell script, XML, XSLT, HTML)
ImageJ in S. Maerkl's lab in Lausanne, needing more automation and a DB
HCDC (PostgreSQL, Matlab, KNIME)
Lausanne workflow fragment
Loop for every plate…
Read available plates
…read cell data for the plate in the loop
Calculate the number of centrosomes for 7 different thresholds
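The loop in this fragment can be sketched outside KNIME as follows. This is only an illustration under stated assumptions: the data layout (each plate is a list of cells, each cell a list of candidate-spot intensities), the rule that a spot counts as a centrosome when its intensity reaches the threshold, the threshold values and the function name are all hypothetical and not taken from the Lausanne workflow.

```python
from typing import Dict, List

# Hypothetical layout: plate name -> cells -> candidate-spot intensities.
Plates = Dict[str, List[List[float]]]


def count_centrosomes(
    plates: Plates, thresholds: List[float]
) -> Dict[str, Dict[float, List[int]]]:
    """For every plate and threshold, count spots per cell at or above the threshold."""
    results: Dict[str, Dict[float, List[int]]] = {}
    for plate, cells in plates.items():  # loop for every plate
        per_threshold: Dict[float, List[int]] = {}
        for t in thresholds:  # one pass per threshold
            per_threshold[t] = [sum(1 for s in cell if s >= t) for cell in cells]
        results[plate] = per_threshold
    return results


if __name__ == "__main__":
    demo = {"plate_01": [[0.9, 0.8, 0.2], [0.5]]}
    seven = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]  # seven arbitrary example thresholds
    for threshold, counts in count_centrosomes(demo, seven)["plate_01"].items():
        print(threshold, counts)
```

Keeping the per-threshold counts side by side makes it straightforward to compare how sensitive the centrosome count is to the chosen threshold.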
iBRAIN overview
Purpose: plates, wells, images => compress images, classify cells into types, count cells of various types, graph
Submit project via drag-and-drop of a file
Monitor progress on cluster via HTML pages
Technology: bash, Matlab, cluster, XML, HTML, web pages generated from a bash script, paths and file names are embedded
iBRAIN use cases
OUR GOALS: addressing technical challenges
Maintainability (extensibility) of the entire workflow
Portability
Automation (end-to-end execution)
Cost savings via code base sharing
Various architectures (storage, clusters)
Multiple logins (security, ease of administration)
Privacy
Most of those can be solved via extending KNIME (next talk)
Extending KNIME: see workflows wiki page
What is KNIME?
A Java workflow management system
Integrates Python, R, Perl, Java snippets, JDBC
GUI – can be used by a bioinformatician
Also server and cluster products (Sun Grid Engine)
Used at several locations (below: P. Strnad's analysis at Lausanne)
KNIME Analysis (from P. Strnad)
GFP-Centrin expression threshold
50% of cells have 2 centrosomes
Usually exclude 10% of cells with low GFP-Centrin signal
[Plot: percentage of cells below threshold]
KNIME Analysis
[Plot axes: centrosome number, cell count]
Image Regions Viewer
Goals of KNIME extension
Maintainability (extensibility)
Portability
Automation (end-to-end execution)
Cost savings via code base sharing
Various architectures (storage, clusters)
Doing away with multiple logins or no logins (security, ease of administration, privacy)
Security
Security – one username/password per user, one login that carries out the whole workflow
Will include cluster/db logins
KNIME – needs the concepts of user/session, login, accounting of who did what
Allows for workflow tracking, scientific repeatability, accounting
Distributed data and computation
Data Mover as a KNIME node (expose input parameters, input and output; KNIME abstracts over these and calls them ports)
Usage of clusters (LSF and others, as needed) – probably involving the spawning of several Java workflows distributed over a cluster, also reporting of status as jobs are being processed
Language additions
Wrapping for Matlab
Improved wrapping of Perl
Better facilities for R embedding (viewports)
CP2 embedding
Sequence: Eland, MAQ, Bowtie, BWA
Proteomics: Mascot, X!Tandem, OMSSA, SpectraST
GUI additions
Job submission GUI
Job monitoring GUI (to show errors in a manner appropriate for a biological user)
Workflow sharing GUI (choose workflow, associate with data)
GUI embedding facility for Java GUIs (the current implementation is too fiddly)
Workflow portability
A reconfiguration tool, based on the XML workflow description format supported by KNIME, written in XPath or XQuery (with a GUI?):
select all data paths and change them
select all software paths and change them
select db/login/cluster user data, update
check the updated values by testing all new parameters, report
for two identical workflow instances, report the config differences
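A minimal sketch of two of these operations (changing all paths with a given prefix, and reporting configuration differences between two workflow instances) is shown below in Python rather than XQuery. The XML layout used here, flat `entry` elements with `key` and `value` attributes, is a simplified assumption for illustration; the real KNIME workflow files may use different element names and namespaces, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional, Tuple


def replace_prefix(root: ET.Element, old: str, new: str) -> int:
    """Rewrite every entry value that starts with `old`; return how many changed."""
    changed = 0
    for entry in root.iterfind(".//entry[@value]"):  # XPath subset of ElementTree
        value = entry.get("value", "")
        if value.startswith(old):
            entry.set("value", new + value[len(old):])
            changed += 1
    return changed


def settings(root: ET.Element) -> Dict[str, str]:
    """Collect key -> value pairs (a repeated key keeps its last value)."""
    return {e.get("key", ""): e.get("value", "") for e in root.iterfind(".//entry[@key]")}


def config_diff(
    a: ET.Element, b: ET.Element
) -> List[Tuple[str, Optional[str], Optional[str]]]:
    """Report (key, value in a, value in b) for every setting that differs."""
    sa, sb = settings(a), settings(b)
    keys = sorted(set(sa) | set(sb))
    return [(k, sa.get(k), sb.get(k)) for k in keys if sa.get(k) != sb.get(k)]


# Hypothetical, simplified workflow description used only for the demo.
SAMPLE = """<config>
  <entry key="data_path" value="/old/data/plates"/>
  <entry key="matlab_path" value="/opt/matlab/bin/matlab"/>
  <entry key="db_user" value="alice"/>
</config>"""

if __name__ == "__main__":
    original = ET.fromstring(SAMPLE)
    moved = ET.fromstring(SAMPLE)
    replace_prefix(moved, "/old/data", "/new/storage")
    for key, before, after in config_diff(original, moved):
        print(f"{key}: {before} -> {after}")
```

The validation step in the list (testing each new parameter) is left out; it would depend on the site's storage, database and cluster setup.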
Better workflow management
An open repository of workflow nodes, shared by all KNIME user groups (two parts – mature and beta)
Saving of graphing parameters, so that an entire workflow can be automated
Adding a workflow start node with iteration over directories
Data flow efficiency - data exchange between nodes – via hierarchical structures (XML?) and tables (for Perl?)
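The proposed start node that iterates over directories could behave roughly like the sketch below. It assumes one input directory per plate or run under a common root, which is a guess at the intended use; `iter_input_dirs` and `run_for_each` are hypothetical names, and the `workflow` argument stands in for whatever the rest of the KNIME workflow would do.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, Iterator


def iter_input_dirs(root: Path) -> Iterator[Path]:
    """Yield the immediate subdirectories of `root` in name order, skipping files."""
    yield from sorted(p for p in root.iterdir() if p.is_dir())


def run_for_each(root: Path, workflow: Callable[[Path], object]) -> Dict[str, object]:
    """Apply `workflow` once per input directory and collect results by directory name."""
    return {d.name: workflow(d) for d in iter_input_dirs(root)}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name in ("plate_02", "plate_01"):
            (root / name).mkdir()
        # Stand-in workflow: count the files found in each directory.
        print(run_for_each(root, lambda d: len(list(d.iterdir()))))
```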
Image handling
Image type improvements (this type is under development and may not be mature yet)
Image storage in openBIS (various levels of resolution, by well, plate, etc), with associated indexes, so that stats at various levels can be generated easily
openBIS/B-Fabric connectivity
Access to raw data from KNIME
Image indexing, so that KNIME can effectively query features
Analysis results storage
Dumping of workflow run parameters/outcomes to DB (maybe picking up a workflow from DB)
SQL handling
Better table merging (merging data from several tables, driven by a query definition), as this is currently cumbersome
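As an illustration of a merge driven by a query definition, the sketch below joins two tables with a single SQL statement using Python's built-in `sqlite3` module. The table names, column names and values are invented for the example and do not come from any of the workflows described here.

```python
import sqlite3
from typing import List, Tuple

# The "query definition": one statement that merges and summarises both tables.
MERGE_QUERY = """
SELECT w.treatment, COUNT(*) AS n_cells, AVG(c.centrosomes) AS mean_centrosomes
FROM cells AS c
JOIN wells AS w ON c.well_id = w.well_id
GROUP BY w.treatment
ORDER BY w.treatment
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with two small, hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE wells (well_id TEXT PRIMARY KEY, treatment TEXT);
        CREATE TABLE cells (cell_id INTEGER PRIMARY KEY, well_id TEXT, centrosomes INTEGER);
        INSERT INTO wells VALUES ('A1', 'control'), ('A2', 'siRNA');
        INSERT INTO cells VALUES (1, 'A1', 2), (2, 'A1', 2), (3, 'A2', 4);
        """
    )
    return conn


def merge_tables(conn: sqlite3.Connection, query: str) -> List[Tuple]:
    """Run the merge query and return all result rows."""
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in merge_tables(build_demo_db(), MERGE_QUERY):
        print(row)
```

Keeping the merge in one query definition means the same definition can be stored with the workflow and rerun unchanged.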
Summary
KNIME is used in Zurich and Lausanne, but does not provide end-to-end processing
A list of new requirements was gathered from workflow users
An outline grant proposal was submitted to KTI
Your input is needed!