The BioHPC Nucleus Cluster & Future Developments
Overview
Today we’ll talk about the BioHPC Nucleus HPC cluster – with some technical details for those interested!
How is it designed?
What hardware does it use?
How does this affect the work I need to run?
Future plans (2017 cluster upgrade and more!)
HPC Clusters
HPC clusters consist of 3 major components:
Compute Nodes
• Powerful servers that run your jobs
• Some also contain GPU cards
High-Speed Network
• Transfers data to/from compute nodes
• Carries communication for parallel code
High-Speed, High Capacity Storage
• Terabytes of storage for your research data
• 10s of GB per second bandwidth to feed nodes
Balance in Clusters
Some clusters need much more storage than compute: data-intensive tasks (e.g. Next Generation Sequencing).
Some clusters need very little storage, but a lot of compute: compute-intensive tasks (e.g. physical process modelling).
Some clusters don’t need very high performance networking: embarrassingly parallel tasks (no communication between tasks).
The best solution depends on the workload of the users.
Nucleus is a balanced, general-purpose cluster.
Slight bias toward storage – more storage than a typical HPC system of its size.
Compared to your PC
Combined, the cluster is between 1,000 and 8,000x faster/larger than a typical PC
Compute Nodes
• 8500 cores (~2,000x Desktop)
• 45TB RAM (~5,000x Desktop)
High-Speed Network
• 5.5Tbps Throughput (~5,000x Desktop)
High-Speed, High Capacity Storage
• >8PB Storage (~8000x Desktop)
• 90GB/s Throughput (~1000x Desktop)
(Typical desktop PC for comparison: 4 cores, 8GB RAM, 1TB HDD)
Compute Nodes
Nucleus has 196 compute nodes.
Based on standard servers, used by businesses:
Lots of CPU cores: 32, 48 or 56 logical per server (each physical core has 2 logical cores)
Lots of RAM: 128, 256, or 384GB per server
Differences from business servers:
Very little local storage – keep things on the central storage systems
High-speed Infiniband network – much faster than normal business networking
Compute Nodes - Types
Possible to buy much faster individual machines… but this is the sweet spot for price-performance of a cluster of machines.
We add nodes often, and buy newer, faster machines when they become available:
24 * 128GB nodes – oldest – Xeon E5 – 32 logical cores
78 * 256GB nodes – most nodes – Xeon E5 v3 – 48 logical cores
48 * 256GBv1 nodes – fastest CPU nodes – Xeon E5 v4 – 56 logical cores
2 * 384GB nodes – largest RAM – Xeon E5 /v2 – 32/40 logical cores
Newer nodes have more cores – can be much faster if your work can use the extra cores.
They also have newer numerical features – these can speed up linear algebra a lot.
How Does this Affect Me? – Cores and RAM
Most jobs that users run don’t use compute nodes fully.
56 cores is a lot to fill up – it might be slower to split a task into 56 parts, due to overhead.
Combine smaller jobs – run programs in parallel on fewer nodes (see the sketch below).
Watch out for RAM usage – 256GB / 56 cores is ~4.5GB per core.
You might need to run fewer than 56 tasks.
Herzeel et al., Performance Analysis of BWA Alignment, ExaScience Life Lab – http://www.exascience.com/wp-content/uploads/2013/12/Herzeel-BWAReport.pdf
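
A minimal sketch of combining smaller tasks on one node – assuming the cluster’s SLURM scheduler and an available GNU parallel module; the partition name and the process_sample.sh script are hypothetical:

    #!/bin/bash
    #SBATCH --job-name=packed_tasks
    #SBATCH --partition=256GB       # hypothetical partition name - check your cluster
    #SBATCH --nodes=1
    #SBATCH --time=04:00:00

    # 256GB / 56 logical cores is ~4.5GB per core. If each task needs ~8GB,
    # run only 32 at once so total RAM use stays under the node's 256GB.
    module load parallel            # assumes a GNU parallel module exists
    parallel --jobs 32 ./process_sample.sh {} ::: data/sample_*.fastq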
How Does this Affect Me? – CPU Types
Older nodes are often less busy – shorter waits.
If your code is not specifically optimized for new CPUs, the older nodes are often not much slower.
E.g. the newest 256GBv1 nodes are often only ~25% faster than the oldest 128GB nodes on code not specifically optimized for many cores and for CPU numerical improvements.
Running a test ChipSeq workflow (minimal 385MB test dataset):
32 cores, AVX, Xeon E5 (v1), 128GB – 255s
56 cores, AVX2, Xeon E5 v4, 256GBv1 – 194s
75% more cores, but only a 24% speedup
(astrocyte_example_chipseq workflow, run on a single node)
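
A one-line sketch of targeting those older nodes at submission time – assuming a SLURM scheduler whose partitions are named after the node types; check the real partition names on your system:

    # Queue on the older 128GB nodes, which often have shorter waits
    sbatch --partition=128GB jobscript.sh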
How Does this Affect Me? – CPU Types
Optimized numerical code will benefit from new CPUs – but you must compile it for specific machines.
To compile for a specific machine (fastest possible binaries) use:
GNU gcc: -march=native
Intel icc: -xHost
4096x4096 element matrix multiplication benchmark:
32 cores, AVX, Xeon E5 (v1), 128GB – 507ms
56 cores, AVX2, Xeon E5 v4, 256GBv1 – 168ms
75% more cores, 3x speedup
(MKL sgemm – mean time across 1000 replicate computations; Intel 2016 compiler, -xHost -O3 options for machine-specific optimization)
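
For example, using the flags above (matmul.c is a placeholder source file):

    # GNU gcc: generate the fastest code for the machine you compile on
    gcc -O3 -march=native matmul.c -o matmul

    # Intel icc: the same idea (the benchmark above used -xHost -O3)
    icc -O3 -xHost matmul.c -o matmul

One caution: a binary built with these flags on a new AVX2 node can fail with an illegal-instruction error on the older AVX-only nodes, so compile on the same node type you will run on.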
GPU Nodes
Nucleus has 20 GPU compute nodes.
Single or multiple GPUs:
GPU – NVIDIA Tesla K20 or K40
GPUv1 – Dual NVIDIA Tesla P100
Differences vs consumer GPUs:
Double precision arithmetic performance – can be important for high accuracy work
Reliability and stability
Relion & TensorFlow Benchmarking
On well-suited tasks, 2x P100 GPUs can be 20x faster than using 56 CPU cores.
The new dual-P100 nodes are much faster than the K40 nodes for GPU-compute-intensive software.
Relion CryoEM classification: dual P100 approx. >6x faster than a single K40
(speed-up on this small benchmark is limited by a CPU initialization step)
TensorFlow AlexNet benchmark: dual P100 approx. 4.3x faster than a single K40
If you are using heavy GPU compute, the new GPUv1 nodes should be preferred.
Make sure your application can use, and is set to use, 2 GPU cards! (See the sketch below.)
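
A small sketch of checking that both cards are present and exposing them to a program – CUDA_VISIBLE_DEVICES is standard CUDA behaviour; my_gpu_application is a placeholder, and some programs also need their own flag (e.g. Relion’s --gpu option):

    # List the GPUs on this node - a GPUv1 node should show two P100s
    nvidia-smi --list-gpus

    # Make both cards visible to a CUDA program, then run it
    export CUDA_VISIBLE_DEVICES=0,1
    ./my_gpu_application    # placeholder - check how your program selects GPUs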
K20 & K40 GPU Nodes Still Very Useful!
The older K20 and K40 nodes are still ideal for:
3D visualization – very good 3D rendering performance
Programs with limited GPU support (use only 1 GPU, or have little GPU-optimized code)
Please use them when they are appropriate, so the P100 nodes are available for heavy computation.
High Speed Network - Infiniband
We use a normal Ethernet network to manage the nodes:
Just like the network connected to your desktop – 1Gbps
>125us latency for messages
Your jobs on Nucleus use the high-speed Infiniband network:
56Gbps connection per node
2:1 blocking – each node is guaranteed at least 28Gbps
<0.7us latency for messages
Supports RDMA (Remote Direct Memory Access) – transfer data between the RAM of nodes, without using the CPU
How Does this Affect Me? - Infiniband
Nodes have 2 network addresses:
192.168.54.x - 1Gbps Ethernet
10.10.10.x - 56Gbps Infiniband
Storage traffic and MPI traffic are set up to use the fast Infiniband network.
Sometimes parallel programs (non-MPI) try to use the first network interface (1GbE).
You must tell them to use Infiniband, or things will be slow! (See the sketch below.)
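
A minimal sketch of pointing a non-MPI parallel program at the Infiniband interface – ib0 is a common interface name but may differ, and the --listen-address flag belongs to a hypothetical program:

    # Find this node's Infiniband IP (expect a 10.10.10.x address)
    IB_ADDR=$(ip -4 addr show ib0 | awk '/inet/ {print $2}' | cut -d/ -f1)
    echo "Infiniband address: ${IB_ADDR}"

    # Bind the program to the Infiniband address, not the default 1GbE one
    ./my_parallel_worker --listen-address "${IB_ADDR}"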
Storage Systems
We use 2 main high-performance storage systems, plus others for lower-speed tasks
They use large hard drives – to give a lot of capacity per $ for your data
A single hard drive in your desktop/laptop is slow
Lots of hard drives (100s) working together can be very quick!
Project Storage System
DDN SFA12K ExaScaler Lustre system
420 6TB drives in 40 disk pools
4 IO servers, 2 metadata servers
Redundancy in pools gives 1.7PB usable space
Each pool provides up to 1GB/s throughput
Total max throughput ~30GB/s
Connected to the cluster Infiniband network
Work Storage System
IBM Elastic Storage Server GL6 (Spectrum Scale/GPFS)
712 8TB drives, and 4 IO servers
Redundancy in pools gives 3.4PB usable space
Total max throughput ~20GB/s (limited by the network)
Located in Clements University Hospital – connected to the cluster Infiniband network with 4 pairs of fiber under Harry Hines Boulevard
Data & Metadata
Metadata is the information about a file or directory:
Name, dates, permissions, location of data
Data is what’s actually stored in your files.
On a single PC the data and metadata stay together on the disk.
On HPC storage we spread the data out over disk pools, so we can have fast parallel access for reading and writing.
Metadata has to be kept separately, and served to clients separately.
HPC filesystems have huge numbers of files – a lot of metadata to manage.
This is a difficult problem, as many clients may be using the same files at once.
Data & Metadata – How Does this Affect Me?
HPC storage systems read and write data very quickly, but handle metadata slowly.
Slow operations:
Creating or deleting many files and folders
Getting information about directories containing 1000s of files
When writing code and workflows, prefer large files instead of many small files:
Use image stacks, instead of 1000s of individual TIFF files
Use archives (tar, zip) to store small files you aren’t working with (see the sketch below)
Use node /local and /tmp space for large numbers of very small files
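
A minimal sketch of archiving a finished run’s small files (the paths are examples):

    # Pack thousands of small output files into one archive - a single
    # large file is far kinder to the metadata servers
    tar -czf /project/department/myuser/run01_outputs.tar.gz run01_outputs/
    rm -r run01_outputs/

    # Later, unpack onto fast node-local space before working on the files
    tar -xzf /project/department/myuser/run01_outputs.tar.gz -C /tmp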
File Striping
On HPC filesystems the data for a file is striped across the disk pools to achieve high speed.
/work does this for you automatically.
/project does not stripe by default – you need to stripe very large files to get the best speeds.
Recommended stripe counts:
1 – Default: any file that doesn’t fit the criteria below. Don’t stripe small files!
2 – Moderate-size files (2-10GB) read by 1-2 concurrent processes
4 – Moderate-size files (2-10GB) read by 3+ concurrent processes regularly, or large files (10GB+) read by 1-2 concurrent processes
8 – Large files (10GB+) read by 3+ concurrent processes regularly, or any very large files (200GB+, to balance storage target usage)
https://portal.biohpc.swmed.edu/content/guides/storage-cheat-cheet/
For example: lfs setstripe -c 4 /project/department/myuser/bigfiles
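
Expanding slightly on that command (one caveat, as is usual for Lustre: striping set on a directory applies to newly created files, so existing files must be copied in to pick up the new layout):

    # Stripe new files in this directory across 4 disk pools
    lfs setstripe -c 4 /project/department/myuser/bigfiles

    # Check the stripe settings of the directory, or of a file inside it
    lfs getstripe /project/department/myuser/bigfiles

    # Existing files keep their old layout - copy them in to re-stripe them
    cp bigdata.dat /project/department/myuser/bigfiles/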
Future Plans…
Nucleus is an excellent resource, built up over the past 4 years thanks to our contributing departments
BioHPC is focused on an exciting future with new ways to use Nucleus, to advance your research
Update to Red Hat EL 7
A newer version of Linux (we currently use EL 6):
Improved security, usability and compatibility with newer software
Popular software that will work (again):
Google Chrome
Visual Studio Code
Atom Editor
Full-featured Interactive Sessions
More graphical tools:
Modern desktop environment, web browsers, office suite, editors
OpenGL (3D) support on all compute nodes, not just GPU nodes:
Use simple 3D software in the webGUI on any machine
Containers - Singularity
Singularity will allow containers to be run on BioHPC:
Supports Docker containers
Supports containers using GPUs
Use software in a different environment, e.g. Ubuntu Linux
Direct access to 3045 tools from the biocontainers project
Integrate with Astrocyte/Nextflow for reproducible workflows
http://singularity.lbl.gov/
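
A minimal sketch of the kind of usage this enables (image and script names are examples; exact commands vary between Singularity versions):

    # Pull an image from Docker Hub as a Singularity container
    singularity pull docker://ubuntu:16.04

    # Run a program inside the container; --nv exposes the host GPUs
    singularity exec --nv ubuntu-16.04.img python train_model.py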
New Nodes for Low Memory Tasks
Approx. 300 nodes will soon be added to the cluster (from TACC Stampede):
32GB RAM, 32 logical cores – similar to existing 128GB nodes
Ideal for smaller-RAM, interactive jobs. Will improve immediate availability of sessions.
Xeon Phi (Knights Corner)
The nodes from Stampede have Xeon Phi (Knights Corner) coprocessors:
61 cores, 8GB RAM
3x faster than CPUs for numerical work
Run standard code, unlike GPUs
Can be used to speed up compute-intensive, highly parallel code
We will add a function to the portal to help launch code on the Xeon Phi MICs (a compile-and-run sketch follows).
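
A hedged sketch of native Knights Corner use with the Intel compiler – the -mmic flag builds coprocessor binaries; the source file and the mic0 hostname are examples, and details depend on the local setup:

    # Build a native Xeon Phi (Knights Corner) binary
    icc -O3 -mmic compute.c -o compute.mic

    # The coprocessor runs its own Linux - copy the binary over and run it
    scp compute.mic mic0:~/ && ssh mic0 ./compute.mic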
Deep Learning with NVIDIA DIGITS – Coming January
On the new Nucleus cluster, you can launch an NVIDIA DIGITS session from the BioHPC portal.
DIGITS provides an easy-to-use, web-browser interface to deep learning tools.
Easy to define models; create and execute multiple runs, using GPU computation.
Portal DIGITS, RStudio & Jupyter – Coming 2018
Coming soon!
Start & connect to dedicated Python, R, and DIGITS environments, directly from the BioHPC portal.
Distributed Computing, on Campus or in the Cloud - Planned
Astrocyte will become a gateway to use resources beyond Nucleus:
The Nucleus cluster
BioHPC Cloud workstations/thin clients (~500 cores)
3rd party cloud
Workflow Designer – Alpha version November
Choose tools to create a workflow in your web browser.
Run analyses, and share workflows with your lab or more widely.
Workflow Visualization & Interactivity - Planned
Downstream visualization of workflow results with interactive tools:
NGS visualization apps
Clinical / microscopy
It’s Your Cluster!
Nucleus was built with your department contributions.
BioHPC is here to help you do your research.
What works well? What do you need?
Let us know!
Biohpc-help@utsouthwestern.edu
Microsoft Teams: BioHPC General