The BioHPC Nucleus Cluster & Future Developments
Overview
Today we’ll talk about the BioHPC Nucleus HPC cluster – with some technical details for those interested!
How is it designed?
What hardware does it use?
How does this affect the work I need to run?
Future plans (2017 cluster upgrade and more!)
HPC Clusters
HPC clusters consist of 3 major components:
Compute Nodes
• Powerful servers that run your jobs
• Some also contain GPU cards
High-Speed Network
• Transfers data to/from compute nodes
• Carries communication for parallel code
High-Speed, High Capacity Storage
• Terabytes of storage for your research data
• 10s of GB per second bandwidth to feed nodes
Balance in Clusters
Some clusters need much more storage than compute: data-intensive tasks (e.g. Next Generation Sequencing).
Some clusters need very little storage, but a lot of compute: compute-intensive tasks (e.g. physical process modelling).
Some clusters don’t need very high performance networking: embarrassingly parallel tasks (no communication between tasks).
The best solution depends on the workload of the users.
Nucleus is a balanced, general-purpose cluster.
Slight bias toward storage – more storage than a typical HPC system of its size.
Compared to your PC
Combined, the cluster is between 1,000 and 8,000x faster/larger than a typical PC
Compute Nodes
• 8500 cores (~2,000x Desktop)
• 45TB RAM (~5,000x Desktop)
High-Speed Network
• 5.5Tbps Throughput (~5,000x Desktop)
High-Speed, High Capacity Storage
• >8PB Storage (~8000x Desktop)
• 90GB/s Throughput (~1000x Desktop)
(Typical desktop PC for comparison: 4 cores, 8GB RAM, 1TB HDD)
Compute Nodes
Nucleus has 196 compute nodes.
Based on standard servers, used by businesses:
Lots of CPU cores: 32, 48 or 56 logical per server (each physical core has 2 logical cores)
Lots of RAM: 128, 256, or 384GB per server
Differences from business servers:
Very little local storage – keep things on the central storage systems
High-speed Infiniband network – much faster than normal business networking
Compute Nodes - Types
Possible to buy much faster individual machines… but this is the sweet spot for price-performance of a cluster of machines.
We add nodes often, and buy newer, faster machines when they become available:
24 * 128GB nodes – oldest – Xeon E5 – 32 logical cores
78 * 256GB nodes – most nodes – Xeon E5 v3 – 48 logical cores
48 * 256GBv1 nodes – fastest CPU nodes – Xeon E5 v4 – 56 logical cores
2 * 384GB nodes – largest RAM – Xeon E5 /v2 – 32/40 logical cores
Newer nodes have more cores – can be much faster if your work can use the extra cores.
They also have newer numerical features – these can speed up linear algebra a lot.
How Does this Affect Me? – Cores and RAM
Most jobs that users run don’t use compute nodes fully.
56 cores is a lot to fill up – it might be slower to split a task into 56 parts, due to overhead.
Combine smaller jobs – run programs in parallel on fewer nodes (see the sketch below).
Watch out for RAM usage – 256GB / 56 cores is ~4.5GB per core.
You might need to run fewer than 56 tasks.
Herzeel et al., Performance Analysis of BWA Alignment, ExaScience Life Lab – http://www.exascience.com/wp-content/uploads/2013/12/Herzeel-BWAReport.pdf
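
A minimal sketch of combining smaller tasks on one node – assuming the cluster’s SLURM scheduler and an available GNU parallel module; the partition name and the process_sample.sh script are hypothetical:

    #!/bin/bash
    #SBATCH --job-name=packed_tasks
    #SBATCH --partition=256GB       # hypothetical partition name - check your cluster
    #SBATCH --nodes=1
    #SBATCH --time=04:00:00

    # 256GB / 56 logical cores is ~4.5GB per core. If each task needs ~8GB,
    # run only 32 at once so total RAM use stays under the node's 256GB.
    module load parallel            # assumes a GNU parallel module exists
    parallel --jobs 32 ./process_sample.sh {} ::: data/sample_*.fastq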
How Does this Affect Me? – CPU Types
Older nodes are often less busy – shorter waits.
If your code is not specifically optimized for new CPUs, the older nodes are often not much slower.
E.g. the newest 256GBv1 nodes are often only ~25% faster than the oldest 128GB nodes on code not specifically optimized for many cores and for CPU numerical improvements.
Running a test ChipSeq workflow (minimal 385MB test dataset):
32 cores, AVX, Xeon E5 (v1), 128GB – 255s
56 cores, AVX2, Xeon E5 v4, 256GBv1 – 194s
75% more cores, but only a 24% speedup
(astrocyte_example_chipseq workflow, run on a single node)
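
A one-line sketch of targeting those older nodes at submission time – assuming a SLURM scheduler whose partitions are named after the node types; check the real partition names on your system:

    # Queue on the older 128GB nodes, which often have shorter waits
    sbatch --partition=128GB jobscript.sh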
How Does this Affect Me? – CPU Types
Optimized numerical code will benefit from new CPUs – but you must compile it for specific machines.
To compile for a specific machine (fastest possible binaries) use:
GNU gcc: -march=native
Intel icc: -xHost
4096x4096 element matrix multiplication benchmark:
32 cores, AVX, Xeon E5 (v1), 128GB – 507ms
56 cores, AVX2, Xeon E5 v4, 256GBv1 – 168ms
75% more cores, 3x speedup
(MKL sgemm – mean time across 1000 replicate computations; Intel 2016 compiler, -xHost -O3 options for machine-specific optimization)
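
For example, using the flags above (matmul.c is a placeholder source file):

    # GNU gcc: generate the fastest code for the machine you compile on
    gcc -O3 -march=native matmul.c -o matmul

    # Intel icc: the same idea (the benchmark above used -xHost -O3)
    icc -O3 -xHost matmul.c -o matmul

One caution: a binary built with these flags on a new AVX2 node can fail with an illegal-instruction error on the older AVX-only nodes, so compile on the same node type you will run on.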
GPU Nodes
Nucleus has 20 GPU compute nodes.
Single or multiple GPUs:
GPU – NVIDIA Tesla K20 or K40
GPUv1 – Dual NVIDIA Tesla P100
Differences vs consumer GPUs:
Double precision arithmetic performance – can be important for high accuracy work
Reliability and stability
Relion & TensorFlow Benchmarking
On well-suited tasks, 2x P100 GPUs can be 20x faster than using 56 CPU cores.
The new dual-P100 nodes are much faster than the K40 nodes for GPU-compute-intensive software.
Relion CryoEM classification: dual P100 approx. >6x faster than a single K40
(speed-up on this small benchmark is limited by a CPU initialization step)
TensorFlow AlexNet benchmark: dual P100 approx. 4.3x faster than a single K40
If you are using heavy GPU compute, the new GPUv1 nodes should be preferred.
Make sure your application can use, and is set to use, 2 GPU cards! (See the sketch below.)
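
A small sketch of checking that both cards are present and exposing them to a program – CUDA_VISIBLE_DEVICES is standard CUDA behaviour; my_gpu_application is a placeholder, and some programs also need their own flag (e.g. Relion’s --gpu option):

    # List the GPUs on this node - a GPUv1 node should show two P100s
    nvidia-smi --list-gpus

    # Make both cards visible to a CUDA program, then run it
    export CUDA_VISIBLE_DEVICES=0,1
    ./my_gpu_application    # placeholder - check how your program selects GPUs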
K20 & K40 GPU Nodes Still Very Useful!
The older K20 and K40 nodes are still ideal for:
3D visualization – very good 3D rendering performance
Programs with limited GPU support (use only 1 GPU, or have little GPU-optimized code)
Please use them when they are appropriate, so the P100 nodes are available for heavy computation.
High Speed Network - Infiniband
We use a normal Ethernet network to manage the nodes:
Just like the network connected to your desktop – 1Gbps
>125us latency for messages
Your jobs on Nucleus use the high-speed Infiniband network:
56Gbps connection per node
2:1 blocking – each node is guaranteed at least 28Gbps
<0.7us latency for messages
Supports RDMA (Remote Direct Memory Access) – transfer data between the RAM of nodes, without using the CPU
How Does this Affect Me? - Infiniband
Nodes have 2 network addresses:
192.168.54.x - 1Gbps Ethernet
10.10.10.x - 56Gbps Infiniband
Storage traffic and MPI traffic are set up to use the fast Infiniband network.
Sometimes parallel programs (non-MPI) try to use the first network interface (1GbE).
You must tell them to use Infiniband, or things will be slow! (See the sketch below.)
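
A minimal sketch of pointing a non-MPI parallel program at the Infiniband interface – ib0 is a common interface name but may differ, and the --listen-address flag belongs to a hypothetical program:

    # Find this node's Infiniband IP (expect a 10.10.10.x address)
    IB_ADDR=$(ip -4 addr show ib0 | awk '/inet/ {print $2}' | cut -d/ -f1)
    echo "Infiniband address: ${IB_ADDR}"

    # Bind the program to the Infiniband address, not the default 1GbE one
    ./my_parallel_worker --listen-address "${IB_ADDR}"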
Storage Systems
We use 2 main high-performance storage systems, plus others for lower-speed tasks
They use large hard drives – to give a lot of capacity per $ for your data
A single hard drive in your desktop/laptop is slow
Lots of hard drives (100s) working together can be very quick!
Project Storage System
DDN SFA12K ExaScaler Lustre system
420 6TB drives in 40 disk pools
4 IO servers, 2 metadata servers
Redundancy in pools gives 1.7PB usable space
Each pool provides up to 1GB/s throughput
Total max throughput ~30GB/s
Connected to the cluster Infiniband network
Work Storage System
IBM Elastic Storage Server GL6 (Spectrum Scale/GPFS)
712 8TB drives, and 4 IO servers
Redundancy in pools gives 3.4PB usable space
Total max throughput ~20GB/s (limited by the network)
Located in Clements University Hospital – connected to the cluster Infiniband network with 4 pairs of fiber under Harry Hines Boulevard
Data & Metadata
Metadata is the information about a file or directory:
Name, dates, permissions, location of data
Data is what’s actually stored in your files.
On a single PC the data and metadata stay together on the disk.
On HPC storage we spread the data out over disk pools, so we can have fast parallel access for reading and writing.
Metadata has to be kept separately, and served to clients separately.
HPC filesystems have huge numbers of files – a lot of metadata to manage.
This is a difficult problem, as many clients may be using the same files at once.
Data & Metadata – How Does this Affect Me?
HPC storage systems read and write data very quickly, but handle metadata slowly.
Slow operations:
Creating or deleting many files and folders
Getting information about directories containing 1000s of files
When writing code and workflows, prefer large files instead of many small files:
Use image stacks, instead of 1000s of individual TIFF files
Use archives (tar, zip) to store small files you aren’t working with (see the sketch below)
Use node /local and /tmp space for large numbers of very small files
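
A minimal sketch of archiving a finished run’s small files (the paths are examples):

    # Pack thousands of small output files into one archive - a single
    # large file is far kinder to the metadata servers
    tar -czf /project/department/myuser/run01_outputs.tar.gz run01_outputs/
    rm -r run01_outputs/

    # Later, unpack onto fast node-local space before working on the files
    tar -xzf /project/department/myuser/run01_outputs.tar.gz -C /tmp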
File Striping
On HPC filesystems the data for a file is striped across the disk pools to achieve high speed.
/work does this for you automatically.
/project does not stripe by default – you need to stripe very large files to get the best speeds.
Recommended stripe counts:
1 – Default: any file that doesn’t fit the criteria below. Don’t stripe small files!
2 – Moderate-size files (2-10GB) read by 1-2 concurrent processes
4 – Moderate-size files (2-10GB) read by 3+ concurrent processes regularly, or large files (10GB+) read by 1-2 concurrent processes
8 – Large files (10GB+) read by 3+ concurrent processes regularly, or any very large files (200GB+, to balance storage target usage)
https://portal.biohpc.swmed.edu/content/guides/storage-cheat-cheet/
For example: lfs setstripe -c 4 /project/department/myuser/bigfiles
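
Expanding slightly on that command (one caveat, as is usual for Lustre: striping set on a directory applies to newly created files, so existing files must be copied in to pick up the new layout):

    # Stripe new files in this directory across 4 disk pools
    lfs setstripe -c 4 /project/department/myuser/bigfiles

    # Check the stripe settings of the directory, or of a file inside it
    lfs getstripe /project/department/myuser/bigfiles

    # Existing files keep their old layout - copy them in to re-stripe them
    cp bigdata.dat /project/department/myuser/bigfiles/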
Future Plans…
Nucleus is an excellent resource, built up over the past 4 years thanks to our contributing departments
BioHPC is focused on an exciting future with new ways to use Nucleus, to advance your research
Update to Red Hat EL 7
A newer version of Linux (we currently use EL 6):
Improved security, usability and compatibility with newer software
Popular software that will work (again):
Google Chrome
Visual Studio Code
Atom Editor
Full-featured Interactive Sessions
More graphical tools:
Modern desktop environment, web browsers, office suite, editors
OpenGL (3D) support on all compute nodes, not just GPU nodes:
Use simple 3D software in the webGUI on any machine
Containers - Singularity
Singularity will allow containers to be run on BioHPC:
Supports Docker containers
Supports containers using GPUs
Use software in a different environment, e.g. Ubuntu Linux
Direct access to 3045 tools from the biocontainers project
Integrate with Astrocyte/Nextflow for reproducible workflows
http://singularity.lbl.gov/
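
A minimal sketch of the kind of usage this enables (image and script names are examples; exact commands vary between Singularity versions):

    # Pull an image from Docker Hub as a Singularity container
    singularity pull docker://ubuntu:16.04

    # Run a program inside the container; --nv exposes the host GPUs
    singularity exec --nv ubuntu-16.04.img python train_model.py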
New Nodes for Low Memory Tasks
Approx. 300 nodes will soon be added to the cluster (from TACC Stampede):
32GB RAM, 32 logical cores – similar to existing 128GB nodes
Ideal for smaller-RAM, interactive jobs. Will improve immediate availability of sessions.
Xeon Phi (Knights Corner)
The nodes from Stampede have Xeon Phi (Knights Corner) coprocessors:
61 cores, 8GB RAM
3x faster than CPUs for numerical work
Run standard code, unlike GPUs
Can be used to speed up compute-intensive, highly parallel code
We will add a function to the portal to help launch code on the Xeon Phi MICs (a compile-and-run sketch follows).
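
A hedged sketch of native Knights Corner use with the Intel compiler – the -mmic flag builds coprocessor binaries; the source file and the mic0 hostname are examples, and details depend on the local setup:

    # Build a native Xeon Phi (Knights Corner) binary
    icc -O3 -mmic compute.c -o compute.mic

    # The coprocessor runs its own Linux - copy the binary over and run it
    scp compute.mic mic0:~/ && ssh mic0 ./compute.mic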
Deep Learning with NVIDIA DIGITS – Coming January
On the new Nucleus cluster, you can launch an NVIDIA DIGITS session from the BioHPC portal.
DIGITS provides an easy-to-use, web-browser interface to deep learning tools.
Easy to define models; create and execute multiple runs, using GPU computation.
Portal DIGITS, RStudio & Jupyter – Coming 2018
Coming soon!
Start & connect to dedicated Python, R, and DIGITS environments, directly from the BioHPC portal.
Distributed Computing, on Campus or in the Cloud - Planned
Astrocyte will become a gateway to use resources beyond Nucleus:
The Nucleus cluster
BioHPC Cloud workstations/thin clients (~500 cores)
3rd party cloud
Workflow Designer – Alpha version November
Choose tools to create a workflow in your web browser.
Run analyses, and share workflows with your lab or more widely.
Workflow Visualization & Interactivity - Planned
Downstream visualization of workflow results with interactive tools:
NGS visualization apps
Clinical / microscopy
It’s Your Cluster!
Nucleus was built with your department contributions.
BioHPC is here to help you do your research.
What works well? What do you need?
Let us know!
Biohpc-help@utsouthwestern.edu
Microsoft Teams: BioHPC General