FabSim: Facilitating computational research through automation on large-scale and distributed...



FabSim: Facilitating computational research through automation on large-scale and distributed e-infrastructures


FabSim is a highly modifiable toolkit developed to simplify a range of computational tasks for researchers in diverse disciplines [FabSim]. FabSim is flexible, adaptable, and allows users to perform a wide range of tasks with ease. It also provides a systematic way to automate the use of resources, including HPC and distributed resources, and to make tasks easier to repeat by recording contextual information. See below for an overview of all the (modifiable) components of FabSim.

Automating computational research: Concepts and architecture

Installing and using FabSim

Complex multiscale applications enabled with FabSim

Most software is designed to be easy to use, yet many academics spend most of their time writing or modifying software.

FabSim is designed for these academics.

FabSim has enabled complex production simulations in diverse domains, including blood flow, composite materials, and protein-ligand binding affinities.

Derek Groen (a, b), Agastya Bhati (b), James Suter (b), James Hetherington (b, c), Stefan J. Zasada (b) and Peter V. Coveney (b)

(a) Department of Computer Science, Brunel University London. (b) Centre for Computational Science, University College London. (c) Research Software Development Group, University College London.

Calculating protein-ligand binding affinities (using multiscale modelling)

Multiscale modelling of blood flow in the brain

Multiscale modelling of clay-polymer nanocomposites

We have used FabSim extensively to automate multiscale simulations of clay-polymer nanocomposites. These simulations are unusually extensive, as we required a total of 1320 simulation runs, using three different petascale supercomputers, to obtain the results presented in our first publication [M1]. Most of these runs were performed using the LAMMPS molecular dynamics code, in coarse-grained mode.

We created two automated workflows, full_ibi_multi and full_pmf_multi, to iteratively obtain coarse-grained potential curves which resulted in properties that closely matched those obtained from all-atom molecular dynamics simulations.

We then applied these two “sub” workflows to parametrize the coarse-grained interactions between each pair of particles (see directly right). Using the full set of potentials we created a chemically specific CG model, and proceeded to run large production simulations of clay-polymer interactions.
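The workflow name full_ibi_multi and the figure caption refer to Iterative Boltzmann Inversion. As a minimal sketch of one such iteration, assuming the standard textbook update rule V_new(r) = V(r) + kB*T*ln(g(r)/g_target(r)) rather than FabSim's own implementation: the function name, the damping parameter, and the choice of kcal/(mol K) units below are all illustrative assumptions.

```python
import math

# Boltzmann constant in kcal/(mol K); the unit system is an assumption.
K_B = 0.0019872041


def ibi_update(potential, rdf_current, rdf_target, temperature, damping=1.0):
    """Apply one Iterative Boltzmann Inversion step on a tabulated potential.

    Each list holds one value per distance bin. Where the coarse-grained
    RDF exceeds the target, the potential is raised (made more repulsive);
    where it falls short, the potential is lowered.
    """
    updated = []
    for v, g, g_ref in zip(potential, rdf_current, rdf_target):
        if g > 0.0 and g_ref > 0.0:
            v = v + damping * K_B * temperature * math.log(g / g_ref)
        # Bins where either RDF is zero are left unchanged.
        updated.append(v)
    return updated
```

In an automated workflow, a step like this would alternate with a coarse-grained simulation that produces the next rdf_current, until the two distributions agree to within a chosen tolerance.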

FabSim is written in Python, and relies on the following dependencies:

● Fabric - www.fabric.org
● Paramiko - www.paramiko.org
● YAML - www.yaml.org

FabSim can be installed locally, in user space on Linux or Mac OS X without needing any root privileges. Using the local installation, FabSim can be used to access and use any remote resources without further installation work. FabSim uses SSH to access remote resources, and the latest version supports GSI-SSH as well.

We have used FabSim to enable computations on a range of remote petascale resources, including Cray resources such as ARCHER (EPCC) and Hermit (HLRS), IBM resources such as BlueJoule (STFC) and SuperMUC (LRZ), as well as newly established resources such as Eagle (PSNC) and Prometheus (Cyfronet).

Users modify FabSim, for example, by filling in templates for custom machine architectures and simulation approaches, and by defining computational activities as Python functions. They then run these functions using one-line Bash commands, overriding default parameters by adding arguments to the command.

FabSim is available at: http://www.github.com/UCL-CCS/FabSim.

Example FabSim function:

    @task
    def lammps(config, **args):
        update_environment(args)
        with_config(config)
        execute(put_configs, config)
        job(dict(script='lammps', cores=4, wall_time='0:15:0', memory='2G'), args)

This function submits a LAMMPS job to the remote queue. The job results will be stored with a name pattern as defined in the environment, e.g. h2o-abcd1234-legion-256. The config argument is required, and points to:

<FabSim path>/config_files/<config>
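The example function passes two dictionaries to job: a set of defaults and the user-supplied args. A minimal sketch of how such a layered override could work is given here, assuming that later dictionaries take precedence over earlier ones; merge_job_options is a hypothetical helper, not a FabSim function.

```python
def merge_job_options(*option_dictionaries):
    """Combine option dictionaries; later ones override earlier ones."""
    merged = {}
    for options in option_dictionaries:
        merged.update(options)
    return merged


# Defaults as written in the example task.
defaults = dict(script='lammps', cores=4, wall_time='0:15:0', memory='2G')

# Values given on the command line reach a Fabric 1.x task as strings.
overrides = dict(cores='12288', wall_time='12:00:00', memory='1G')

options = merge_job_options(defaults, overrides)
```

Keys that the user does not override, such as script here, keep their default value.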

The command can for example be used in the Bash terminal as follows:

fab archer lammps:clay,cores=12288,wall_time=12:00:00,memory=1G

Or simply (using default values) as:

fab archer lammps:clay
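The commands above use the task:argument,key=value form of the fab command line. The following is a simplified sketch of how such a task specification splits into a task name, positional arguments, and keyword arguments. It is an illustration only: parse_task_spec is a hypothetical name, and escaped commas or equals signs are not handled.

```python
def parse_task_spec(spec):
    """Split 'task:arg,key=value,...' into (name, args, kwargs)."""
    # Split on the first colon only, so values such as 12:00:00 survive.
    name, _, arg_string = spec.partition(':')
    args, kwargs = [], {}
    for item in filter(None, arg_string.split(',')):
        key, separator, value = item.partition('=')
        if separator:
            kwargs[key] = value
        else:
            args.append(item)
    return name, args, kwargs


name, args, kwargs = parse_task_spec(
    'lammps:clay,cores=12288,wall_time=12:00:00,memory=1G')
```

Here clay becomes the positional config argument, and the remaining pairs arrive in the task as keyword arguments.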

Results are then for example stored in:

<FabSim path>/results/<config>_<machine>_<cores>_<timestamp>

Example definition of a machine-specific configuration (in machines.yml):

    archer:
      max_job_name_chars: 15
      job_dispatch: "qsub"
      run_command: "aprun -n $cores"
      batch_header: pbs-archer
      no_ssh: true  # ARCHER doesn't allow outgoing ssh sessions.
      remote: "login.archer.ac.uk"
      home_path_template: "/home/$project/$project/$username"
      runtime_path_template: "/work/$project/$project/$username"
      modules: ["load lammps/lammps-28Jun14", "load namd"]
      temp_path_template: "$work_path/tmp"
      queue: "standard"
      python_build: "lib64/python2.6"
      pwd: "export PBS_O_WORKDIR=$(readlink -f $$PBS_O_WORKDIR)"
      corespernode: 24
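The $name placeholders and the $$ escape in this configuration follow the conventions of Python's string.Template. A minimal sketch of how such entries could be expanded is shown here, assuming that mechanism; the expand helper and the project and username values are hypothetical.

```python
from string import Template

# Three entries copied from the machine definition above.
machine = {
    "run_command": "aprun -n $cores",
    "home_path_template": "/home/$project/$project/$username",
    "pwd": "export PBS_O_WORKDIR=$(readlink -f $$PBS_O_WORKDIR)",
}

# Hypothetical values; real ones would come from the user's settings.
context = {"cores": 12288, "project": "e10", "username": "jdoe"}


def expand(template_text, values):
    """Fill in $name placeholders, leaving shell syntax such as $( intact."""
    return Template(template_text).safe_substitute(values)


expanded = {key: expand(text, context) for key, text in machine.items()}
```

Using safe_substitute rather than substitute matters for the pwd entry: the shell's $( is not a valid placeholder and is left untouched, while $$ collapses to a single $ so that the scheduler sees $PBS_O_WORKDIR.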

We use FabSim in conjunction with HemeLB to automate the installation of HemeLB on remote resources (see figure above), to automate the performance investigations of the code (see figure below, a), and to automatically construct and run ensemble multiscale simulations (b) [H1]. A specially adapted version of FabSim (FabHemeLB) is packaged with each HemeLB installation.

FabSim has also been used to perform automated calculations of protein-ligand binding affinities. Here we perform an ensemble of 25-50 molecular dynamics workflows, each of which requires a simulation to perform equilibration, a simulation to perform the main production run, and a processing step to perform MMPBSA or NMODE calculations on the resulting data [B1].
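The structure of such an ensemble can be sketched as a list of replica and stage pairs. This is an illustration under assumptions, not the workflow code used for [B1]: build_ensemble, the stage labels, and the configuration name are hypothetical.

```python
# The three steps named in the text, in execution order.
STAGES = ("equilibration", "production", "postprocessing")


def build_ensemble(config, replicas):
    """List every (replica, stage) job for one binding affinity calculation."""
    return [
        {"config": config, "replica": replica, "stage": stage}
        for replica in range(1, replicas + 1)
        for stage in STAGES
    ]


# A 25-member ensemble, the lower end of the range quoted above.
jobs = build_ensemble("ligand", 25)
```

Within one replica the stages depend on each other and must run in order, whereas different replicas are independent and can be submitted to the queue at the same time.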

Acknowledgements

We thank Dr Shunzhou Wan for his help in constructing the Binding Affinity Calculation section of this article, and Miguel Bernabeu, Rupert Nash, Sebastian Schmieschek, Mohamed Itani and Hywel Carver for their contributions to FabHemeLB. This work was funded in part by the EU FP7 MAPPER, CRESTA, P-medicine and VPH-SHARE projects (grant numbers 261507, 287703, 270089, 269978), by the EU H2020 ComPat project (grant no. 671564), by EPSRC via the 2020 Science Programme (EP/I017909/1), the Qatar National Research Fund (grant number 092601048), MRC Bioinformatics project (MR/L016311/1) and the UCL Provost. AB is funded by the INLAKS Foundation Scholarship and a UCL Overseas Research Studentship Award (2014-2017). Supercomputing time was provided by the Hartree Centre (Daresbury Laboratory) on BlueJoule and BlueWonder via the CGCLAY project, and on HECToR and ARCHER, the UK national supercomputing facility at the University of Edinburgh, via EPSRC through grants EP/F00521/1, EP/E045111/1, EP/I017763/1 and the UK Consortium on Mesoscopic Engineering Sciences (EP/L00030X/1).

References

[FabSim] D. Groen et al., ArXiv:1512.02194 (2016).
[M1] J. Suter et al., Chemically specific multiscale modeling of clay-polymer nanocomposites reveals intercalation dynamics, tactoid self-assembly and emergent materials properties, Adv. Mater. 27 (6) (2015) 966–984.
[H1] M. Itani et al., An automated multiscale ensemble simulation approach for vascular blood flow, JoCS 9 (2015) 150–155.
[B1] S. Wan et al., Rapid, precise, and reproducible prediction of Peptide-MHC binding affinities from molecular dynamics that correlate well with experiment, JCTC 11 (7) (2015) 3346–3356.

Above: diagrammatic overview of FabSim automated workflows (a & b), and an overview of the parameterization tasks required to coarse-grain a molecular system of clay sheets embedded in a sea of polymers (c).

Below: example result of several iterations of an Iterative Boltzmann Inversion, as performed automatically by FabSim.

Diagrammatic overview of a binding affinity calculation workflow. Reproduced from Wan et al. [B1].

For the full journal paper, see http://dx.doi.org/10.1016/j.cpc.2016.05.020.