ENABLING REPRODUCIBLE IN-SILICO DATA ANALISES WITH...

Post on 28-Aug-2020

6 views 0 download

Transcript of ENABLING REPRODUCIBLE IN-SILICO DATA ANALISES WITH...

ENABLING REPRODUCIBLE IN-SILICO DATA ANALISES

WITH NEXTFLOW Paolo Di Tommaso, CRG

Wellcome Trust Sanger Institute, 1 May 2018, Cambridge

WHO IS THIS CHAP? @PaoloDiTommasoResearch software engineerComparative Bioinformatics, Notredame LabCenter for Genomic Regulation (CRG)Author of Nextflow project

AGENDA• The challenges with computational workflows

• Nextflow main principles

• Handling parallelisation and portability

• Deployments scenarios

• Comparison with other tools

• Future plans

GENOMIC WORKFLOWS• Data analysis applications to extract information from

(large) genomic datasets

• Embarrassingly parallelisation, can spawn 100s-100k jobs over distributed cluster

• Mash-up of many different tools and scripts

• Complex dependency trees and configuration → very fragile ecosystem

Steinbiss et al., Companion parassite genome annotation pipeline, DOI: 10.1093/nar/gkw292

To reproduce the result of a typical computational biology paper

requires 280 hours. ≈1.7 months!

THE SAME APPLICATION DEPLOYED IN

DIFFERENT ENVIRONMENTSPRODUCES

DIFFERENT RESULTS (!)

Platform Amazon Linux Debian Linux Mac OSX

Number of chromosomes 36 36 36

Overall length (bp) 32,032,223 32,032,223 32,032,223

Number of genes 7,781 7,783 7,771

Gene density 236.64 236.64 236.32

Number of coding genes 7,580 7,580 7570

Average coding length (bp) 1,764 1,764 1,762

Number of genes with multiple CDS 113 113 111

Number of genes with known function 4,147 4,147 4,142

Number of t-RNAs 88 90 88

Comparison of the Companion pipeline annotation of Leishmania infantum genome executed across different platforms *

* Di Tommaso P, et al., Nextflow enables computational reproducibility, Nature Biotech, 2017

CHALLENGES• Reproducibility, replicate results over time

• Portability, run across different platforms

• Scalability ie. deploy big distributed workloads

• Usability, streamline execution and deployment of complex workloads ie. remove complexity instead of adding new one

• Consistency ie. track changes and revisions consistently for code, config files and binary dependencies

PUSH-THE-BUTTON PIPELINES

HOW?

• Fast prototyping ⇒ custom DSL that enables tasks composition, simplifies

most use cases + general purpose programming lang. for corner cases

• Easy parallelisation ⇒ declarative reactive programming model based on

dataflow paradigm, implicit portable parallelism

• Self-contained ⇒ functional approach, a task execution is idempotent ie.

cannot modify the state of other tasks + isolate dependencies with containers

• Portable deployments ⇒ executor abstraction layer + deployment

configuration from implementation logic

Orchestration& Parallelisation

Scalability& Portability

Deployment &Reproducibility

containers

Git GitHub

TASK EXAMPLE

bwa mem reference.fa sample.fq \ | samtools sort -o sample.bam

process align_sample {

input: file 'reference.fa' from genome_ch file 'sample.fq' from reads_ch

output: file 'sample.bam' into bam_ch

script: """ bwa mem reference.fa sample.fq \ | samtools sort -o sample.bam """

}

TASK EXAMPLE

bwa mem reference.fa sample.fq \ | samtools sort -o sample.bam

TASKS COMPOSITION

process index_sample {

input: file 'sample.bam' from bam_ch

output: file 'sample.bai' into bai_ch

script: """ samtools index sample.bam """

}

process align_sample {

input: file 'reference.fa' from genome_ch file 'sample.fq' from reads_ch

output: file 'sample.bam' into bam_ch

script: """ bwa mem reference.fa sample.fq \ | samtools sort -o sample.bam """

}

DATAFLOW • Declarative computational model for parallel

process executions

• Processes wait for data, when an input set is ready the process is executed

• They communicate by using dataflow variables i.e. async FIFO queues called channels

• Parallelisation and tasks dependencies are implicitly defined by process in/out declarations

HOW PARALLELISATION WORKSsamples_ch = Channel.fromPath('data/sample.fastq')

process FASTQC {

input: file reads from samples_ch

output: file 'fastqc_logs' into fastqc_ch """ mkdir fastqc_logs fastqc -o fastqc_logs -f fastq -q ${reads} """ }

samples_ch = Channel.fromPath('data/*.fastq')

process FASTQC {

input: file reads from samples_ch

output: file 'fastqc_logs' into fastqc_ch """ mkdir fastqc_logs fastqc -o fastqc_logs -f fastq -q ${reads} """ }

HOW PARALLELISATION WORKS

IMPLICIT PARALLELISM

clustalo

Channel.fromPath("data/*.fastq")

clustaloFASTQC

SUPPORTED PLATFORMS

DEPLOYMENT SCENARIOS

LOCAL EXECUTION

• Common development scenario

• Dependencies can be managed using a container runtime

• Parallelisations is managed spawning posix processes

• Can scale vertically using fat server / shared mem. machine

nextflow

OS

local storage

docker/singularity

laptop / workstation

CENTRALISED ORCHESTRATION

computer cluster• Nextflow orchestrates

workflow execution submitting jobs to a compute cluster eg. SLURM

• It can run in the head node or a compute node

• Requires a shared storage to exchange data between tasks

• Ideal for corse-grained parallelisms

NFS/Lustre

cluster node

cluster node

cluster node

cluster node

submit jobs

cluster node

nextflow

DISTRIBUTED ORCHESTRATION

login node

NFS/Lustre

job request

cluster node

cluster node

launcher wrapper

nextflow cluster

nextflow driver

nextflow worker

nextflow worker

nextflow worker

HPC cluster

• A single job request allocates the desired computes nodes

• Nextflow deploys its own embedded compute cluster

• The main instance orchestrate the workflow execution

• The worker instances execute workflow jobs (work stealing approach)

KUBERNETES

• Next generation native cloud clustering for containerised workloads

• There's the need of workflow orchestration

• Latest NF version includes a new command that streamline the workflow deployment to K8s

K8S DEPLOYMENT

PORTABILITY

PORTABILITY

process { executor = 'slurm' queue = 'my-queue' memory = '8 GB' cpus = 4 container = 'user/image' }

PORTABILITY

process { executor = 'awsbatch' queue = 'my-queue' memory = '8 GB' cpus = 4 container = 'user/image' }

CONFIGURATION DECOUPLING IS THE KEY TO

PORTABLE DEPLOYMENTS

DEMO!

A QUICK COMPARISON

GALAXY vs. NEXTFLOW• Command line oriented tool

• Can incorporate any tool w/o any extra adapter

• Fine control over tasks parallelisation

• Scalability 100⇒1M jobs

• One liner installer

• Suited for production workflows + experienced bioinformaticians

• Web based platform

• Built-in integration with many tools and dataset

• Little control over tasks parallelisation

• Scalability 10⇒1K jobs

• Complex installation and maintenance

• Suited for training + not experienced bioinformaticians

SNAKEMAKE vs. NEXTFLOW• Command line oriented tool

• Push model

• Can manage any data structure

• Compute DAG at runtime

• All major container runtimes

• Built-in support for clusters and cloud

• No (yet) support for sub-workflows

• Built-in support for Git/GitHub, etc., manage pipeline revisions

• Groovy/JVM based

• Command line oriented tool

• Pull model

• Rules defined using file name patterns

• Compute DAG ahead

• Built-in support for Singularity

• Custom scripts for cluster deployments

• Support for sub-workflows

• No support for source code management system

• Python based

CWL vs. NEXTFLOW

• Language + app. runtime

• DSL on top of a general purpose programming lang.

• Concise, fluent (at least try to be!)

• Community driven

• Single implementation, quick iterations

• Language specification

• Declarative meta-language (YAML/JSON)

• Verbose

• Committee driven

• Many vendors/implementations (and specification version)

CONTAINERISATION

CONTAINERISATION• Nextflow envisioned the use

of software containers to fix computational reproducibility

• Mar 2014 (ver 0.7), support for Docker

• Dec 2016 (ver 0.23), support for Singularity

Nextflow

job job job

SINGULARITY FEATURES

Kurtzer et al. Singularity: Scientific containers for mobility of compute. PLoS ONE 12(5): e0177459

BENCHMARK*

* Di Tommaso P, Palumbo E, Chatzou M, Prieto P, Heuer ML, Notredame C. (2015) The impact of Docker containers on the performance of genomic pipelines. PeerJ 3:e1273 https://dx.doi.org/10.7717/peerj.1273

container execution can have an impact on short running tasks ie. < 1min

SINGULARITY BENCHMARK

https://github.com/wresch/python_import_problem

Singularity image format speeds up Python execution having many imports from a shared file system !

WHEN USE CONTAINERS?

ALWAYS!

BEST PRACTICES• Helps to isolate dependencies from dev or local deployment

environment

• Provides a reproducibles sandbox for third party users

• Binary images preserve against software decay

• Make it transparent ie. always include the Dockefile

• Docker image format is de-facto standard, it can be executed by different runtime eg. Singularity, Shifter, uDocker, etc.

ERROR RECOVERY• Each task outputs are saved in a

separate directory

• This allows to safely record interrupted executions discarding

• Dramatically simplify debugging !

• Computing resources can be defined in a *dynamic* manner, so that a failing task can be automatically re-execute with more memory, longer timeout, etc.

EXECUTION REPORT

EXECUTION REPORT

EXECUTION TIMELINE

DAG VISUALISATION

EDITORS !

WHAT'S NEXT

IMPROVEMENTS

• Built-in support for Bioconda recipies

• Better meta-data and provenance handling

• Workflow composition aka sub-workflows

• More clouds support ie. Azure and GCP

APACHE SPARK

• Native support for Apache Spark clusters and execution model

• Allow hybrid Nextflow and Spark applications

• Mix the best of the two worlds, Nextflow for legacy tools/corse grain parallelisation and Spark for fine grain/distributed execution eg. GATK4

• Partecipate in Cloud Work Stream working group

• TES: Task Execution API (working prototype)

• WES: Workflow Execution API

• Enable interoperability with GA4GH complaint platforms eg. Cancer Genomics Cloud and Broad FireCloud

WHO IS USING NEXTFLOW?

• Community effort to collect production ready analysis pipelines built with Nextflow

• Initially supported by SciLifeLab, QBiC and A*Star Genome Institute Singapore

• https://nf-core.github.io Alexander

PeltzerPhil

EwelsAndreas

Wilm

CONCLUSION• Data analysis reproducibility is hard and it's often underestimated.

• Nextflow does not provide a magic solution but enables best-practices and provide support for community and industry standards.

• It strictly separates the application logic from the configuration and deployment logic, enabling self-contained workflows.

• Applications can be easily deployed across different environment in a reproducible manner with a single command.

• The functional/reactive model allows applications to scale to millions of jobs with ease.

ACKNOWLEDGMENT

Evan Floden

Emilio Palumbo

Cedric Notredame

Notredame Lab, CRG

http://nextflow.io