2011-06-08 Taverna workflow system

41
S"an SoilandReyes & Robert Haines myGrid, School of Computer Science University of Manchester, UK ITER IM workshop Château de Cadarache, 20110608 http://taverna.org.uk/

description

Taverna workflow system - presented by Stian Soiland-Reyes at ITER Integrated Modelling workshop in Cadarache, France on 2011-06-08.

Transcript of 2011-06-08 Taverna workflow system

Page 1: 2011-06-08 Taverna workflow system

S"an  Soiland-­‐Reyes  &  Robert  Haines  myGrid,  School  of  Computer  Science  

University  of  Manchester,  UK  

ITER  IM  workshop  Château  de  Cadarache,  2011-­‐06-­‐08  

http://taverna.org.uk/  

Page 2: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

What  is  myGrid?    An  e-­‐Science  Collabora"on  Since  2001    Not  a  grid!    Numerous  partners  involved:  

  University  of  Manchester    University  of  Southampton    University  of  Oxford    EMBL-­‐EBI  

  Provides  sustainable  and  produc"on  quality  soTware    Supported  by  OMII-­‐UK,  EPSRC  and  BBSRC  

  Mixture  of  developers,  bioinforma"cians  and  researchers  

SoTware  |  Services  |  Content  |  Skills  |  Community  

Page 3: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Mo"va"on:  Bioinforma)cs    Challenge:  

  Large  amounts  of  data    Many  open  ques"ons    Numerous  freely  available  public  datasets  and  analysis  tools  

Page 4: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Huge  amounts  of  data  

100+  Genes  

QTL  regions  

Microarray  

1000+  Genes  

Next  Gen  Sequencing  

100,000+  Genes  

How  do  I  look  at  all  the  genes  systema)cally?  

Page 5: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Manual  approach    Search  using  public  web  sites  and  databases  

  Pubmed    Uniprot    EBI  BioMart  

  Copy  and  paste  to  web  tools  for  analysis    NCBI  Blast    EBI  InterPro  

  Further  processing  locally    R    Perl    Python  

Page 6: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Manual:  disadvantages  •  Scale  of  analysis  task  overwhelms  researchers  

–  lots  of  data  •  User  bias  and  premature  filtering  of  datasets  –  

cherry  picking  •  Hypothesis-­‐Driven  approach  to  data  analysis  •  Constant  changes  in  data  -­‐  problems  with  re-­‐

analysis  of  data  •  Implicit  methodologies  (hyper-­‐linking  through  

web  pages)  •  Error  prolifera)on  from  any  of  the  listed  issues  

–  notably  human  error  

Page 7: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Web  services  and  workflows   Web  services  

  Technology  and  standards  for  exposing  code  and  data  resources  that  can  be  programma)cally  consumed  by  a  remote  third  party  

  Descrip"on  on  how  to  interact  with  the  service,  parameters,  documenta"on  

 Workflows    General  technique  for  describing  and  execu"ng  a  process  

  Describe  what  you  want  to  do  running  which  services  

Page 8: 2011-06-08 Taverna workflow system

The Taverna Open Source Suite of Tools

Client User Interfaces GUI Workbench Workflow Repository

Service Catalogue

Third Party Tools

Programming and APIs

Web Portals

Activity and Service Plug-in Manager

Provenance Store

Workflow Server

Open Provenance

Model

Secure Service Access

Virtual Machine

Workflow Engine

Page 9: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Taverna  workflows    A  set  of  (local  and  remote)  

services  to  analyze  or  manage  data  

  Nested  workflows  are  also  services  

  Data-­‐links  connects  services    i.e.  output  from  service  A  is  input  to  

service  B  and  C    Describes  the  desired  dataflow  

instead  of  process  coordina"on    Automa"c  itera"ons    Can  customize  list  handling  and  

control  links    

Get_pathways

Workflow Inputs

Workflow Outputs

Workflow Inputs

Workflow Outputs

remove_uniprot_duplicates

merge_uniprot_ids

species

getcurrentdatabase

kegg_pathway_release

binfo

regex_2

split_for_duplicates

split_for_duplicate_pathways

remove_duplicate_kegg_genes

merge_genes_and_pathways_3

flatten_pathway_files

merged_pathways

merge_genes_and_pathways

merge_genes_and_pathways_2

merge_kegg_references

kegg_external_gene_reference

remove_pathway_duplicates

merge_pathway_desc

merge_pathway_list_1

merge_pathway_list_2

remove_duplicate_ids

merge_patwhay_ids

pathway_descriptions

merge_reports

report

merge_gene_desc

remove_nulls_3

gene_descriptions

gene_ids

REMOVE_NULLS_2

remove_entrez_duplicates

merge_entrez_genes

remove_pathway_nulls

remove_Nulls

concat_kegg_genes

split_gene_ids

remove_pathway_nulls_2

add_uniprot_to_string

gene_descriptions pathway_descriptions

add_ncbi_to_string

Kegg_gene_ids_2

pathway_ids

Kegg_gene_ids

genes_in_qtl

mmusculus_gene_ensembl

create_report

ensembl_database_releasegenes_pathways kegg_pathway_release

Merge_pathways

concat_ids

pathway_ids

regex

split_by_regex

lister

Merge_gene_pathways

pathway_genes

concat_gene_pathway_ids

get_pathways_by_genes1

chromosome_namestart_position end_position

Page 10: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

What  types  of  services  and  data?    WSDL/SOAP  web  services  

  Secured  invoca"on  with  HTTPS/SSL/WS-­‐Security    RESTful  web  services  

  Secured  invoca"on  with  HTTPS/Basic  Auth    Spreadsheet  import    Command  line  tools  (local,  SSH)    Inline  scripts  (Beanshell,  R)    Excel/CSV  spreadsheets    Java  APIs    Customiza"ons:  

  BioMart,  BioMoby  /  SADI    Soaplab    Grid  services  (EGEE  gLite,  caGrid,  PBS,  UNICORE)    …  your  tool  (Plugin  tutorial  in  wiki)  

Page 11: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Service  limita"ons   Web  service  crea"on  involves  wrapping  

exis"ng  tools  or  wri"ng  WS  code   Web  services  can  go  down  

   can  use  redundant  services  in  workflow     Service  monitoring  

  Transferring  data  up/down  to  WS  slow     Support  references  in  WS  interface  

  Execu"ng  command  line  tools  directly  requires  execu"on  access    Trickier  to  share  workflows,  require  either  SSH/grid  creden)als  or  installing  tools  locally  

Page 12: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Which  services?    Taverna  is  general,  can  connect  to  standard  

web  services  and  command  line  tools  for  any  domain  

  in  bioinforma"cs..    From  professional  third-­‐party  organisa"ons  providing  robust  &  open  data/analysis  services  

  ..to  under-­‐the-­‐desk  web  services  for  one  par"cular  purpose,  ran  by  PhD  students  

   hhp://biocatalogue.org/  -­‐  2000+  services  from  140+  providers  –  crowd  sourced  and  quality  monitored  

Page 13: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Page 14: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

BioCatalogue  integra"on    Search  services  from  

workbench    Add  services  to  workflow    View  service  descrip)ons  

and  up)me  status  from  within  workflow  

Page 15: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Taverna    workbench  

  Graphical  desktop  tool      No  server  installa"on  

required    Drag-­‐and-­‐drop  services  

into  diagram    Connect  services,  run,  

reconnect,  rerun    Integrates  diverse  set  

of  tools  

Page 16: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Page 17: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Page 18: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Page 19: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Sharing  workflows  

 myExperiment.org  allows  users  to  share,  find,  download  and  rate  workflows  

  “Facebook  for  the  scien"st”    4000+  members,  1400+  workflows   Open  source  code,  can  set  up  own  instance  

Page 20: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

myExperiment  integra"on  

  Search  and  browse  workflows    By  tags    Free  text  search    Own/group  workflows    Packs,  e.g.  “Examples”  

 Upload/share  workflows  

Page 21: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Taverna  workflow  features    Nested  workflows  

  Reuse  exis"ng  components    Implicit  itera"ons  

  With  customizable  list  handling    Pipelining  

  Process  par"al  itera"on  results  early    Parallelisa"on  

  Run  as  soon  as  data  is  available    Retries,  failover,  looping  

  For  stability  and  condi"onal  tes"ng    Plugin-­‐extensible  execu"on  control  

  Ideas:  caching,  error  detec"on,  dynamic  service  lookup  

Page 22: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Extensible  UI  and  engine  

  Plugins  can  provide  new  “perspec"ves”    e.g.:  BioCatalogue,  myExperiment  

  Provide  service-­‐specific  customiza"on    e.g.:  BioMart  interface  replicates  web  site  

  Adding  new  func"onality    New  service  types,  eg:  …    Execu"on  control  like  looping/branching    Design  helpers,  “Find  matching  service”  

Page 23: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Workflow  limita"ons    Ini"ally  designed  for  dataflows  

  Not  suitable  for  business  processes  like  “HR  procedure  for  hiring  new  staff”  ○  Long-­‐running  workflows  require  Taverna  Server  

  ..  But  suitable  for  coordina)ng  command  line  and  grid  execu"ons,  the  data  might  just  be  job  references  

  Execu"on  control  extensible,  eg:  ○  Looping,  Branching  ○  Dynamic  service  lookup  ○  Data  manipula"on,  Error  detec"on  

Page 24: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Data  and  provenance  handling    Data  references  passed  between  services  in  workflow  

  http,  file,  sftp,  gridftp,  etc  (extensible)  

  Data  downloaded/uploaded  or  references  translated  when  needed  

  Provenance  captured  for  workflow  runs    Trace  execu"on  steps,  view  intermediate  values  while  running    Export  as  Open  Provenance  Model  (OPM)  /  RDF    Proof  and  origin  of  produced  outputs    Extensible  annota)ons  

  Wf4Ever:  reproducible  research  objects    Workflow/data  as  a  scien"fic  publica"on    preserva"on    Need  to  capture  more  service  data  and  metadata  

Page 25: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Data  limita"ons    Running  Workbench  limited  by:  

  Local  disk  space  for  storing  data    Network  speeds  for  up/download    Firewall  access     Execute  wf  using  Taverna  Server  or  command  line  remotely  with  ssh/job  submission  

 No  standardized  WS  reference  mechanism    Agree  on  mechanism  within  WS  ‘family’  with  shared  disk  (eg.  deconstruct  local  path  from  HTTP  URI)  

Page 26: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Parameter  sweeps  

  Implicit  itera)ons  with  pipelining  provides  an  intui"ve  way  to  set  up  parameter  sweeps  

  Advanced  looping  and  extensible  execu)on  control  allows  itera"ve  &  recursive  reduc"ons/approxima"ons  

Page 27: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Taverna  command  line    Executes  from  a  

Windows/Linux/OSX  shells  

  Takes  a  predefined  workflow  with  files  as  inputs  and  outputs  

  Quick  way  to  “produc"onize”  a  workflow  

Page 28: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Taverna  Server    REST/SOAP  interface  to    

execute  workflows    Client  libraries  for  Ruby  and  Java    Two  demonstra"on  web  interfaces  

  Ruby    Java  Portlets  

 Upcoming:    Security  delega"on    AWS  image  

Page 29: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Taverna  portlet    Example  portlet  

interface    Executes  workflows  

using  Taverna  Server  

Page 30: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Page 31: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Ruby  web  interface    Example  customized  

web  interface    Uses  Ruby  gem  

t2-­‐server  

Page 32: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Grids  and  clusters  

  Taverna  have  been  integrated  with  several  leading  grid  and  middleware  infrastructures,  such  as:    PBS    caGrid/Globus    EGEE/gLite    NorduGrid’s  ARC    JSDL/GridSAM  

  Plans  for  SAGA  integra"on  

Page 33: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Taverna  on  the  cloud   Use-­‐case:  

  SNP  analysis  and  annota"on  of  genome  sequenced  from  breeds  of  cows  in  Africa  –  why  are    some  of  them  resistent  to  X?  

  Amazon  EC2  with  Taverna  Server  and  local  services  

  Ruby  on  Rails  web  interface    Runs  through  31  chromosomes  in  2  hours  using  10  instances  -­‐  $10  

Page 34: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Page 35: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Taverna  3  roadmap  

 OSGi  plugin  system   Workflow  language:  Scufl2  

  Compound  format;  embedding  metadata,  dependencies,  independent  API  for  crea"ng/inspec"ng  workflows  

  Components    Finding/sharing  command  line  tool  descrip"ons    Richer  way  of  finding  compa"ble  services  

Page 36: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Open  source,  open  development  

  Taverna  suite  of  tools  are  all  open  source,  free  to  use  and  customize  

  Large  user  community,  ac"ve  mailing  lists    Lead  developers:  myGrid  in  Manchester  UK    Contributors  from  across  the  world    PAL  programme   myGrid  provides  training,  tutorials  and  

documenta)on  

Page 37: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Who  uses  Taverna?  

  Bioinforma"cs:  EMBL-­‐EBI,  ONDEX    Astronomy:  HELIO,  AstroGrid,  SAMPO    Engineering:  NASA  Jet  Propulsion  Lab  (JPL)    Chemistry:  CDK,  CIC    Biodiversity:  BioVel    Preserva"on:  Wf4Ever,  SCAPE    BioMedicine/Cancer  research:  caGrid   Data/text  mining:  eLico,  AID  

Page 38: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Taverna  in  numbers  

  Taverna:    361  organisa"ons    48  countries    70,000+  downloads  ○  ~4000  source  

  myExperiment:      4000+  registered  users    56  countries    1400+  workflows    

  BioCatalogue:      2000+  services    150+  service  providers    500+  members    27  countries  

Page 39: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Acknowledgements  

Page 40: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

Page 41: 2011-06-08 Taverna workflow system

http://www.taverna.org.uk/  http://www.mygrid.org.uk/  

More  informa"on  

  hhp://www.mygrid.org.uk/  

  hhp://www.taverna.org.uk/  

  hhp://www.myexperiment.org/  

  hhp://www.biocatalogue.org/